Auto-parallelization Overview

The auto-parallelization feature of the Intel® compiler automatically translates serial portions of the input program into equivalent multithreaded code. Automatic parallelization determines which loops are good worksharing candidates, performs the dataflow analysis to verify correct parallel execution, and partitions the data for threaded code generation, as is otherwise needed when programming with OpenMP* directives. Applications that use OpenMP or auto-parallelization can gain performance from shared memory on multiprocessor and dual-core systems.

The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops that can safely and efficiently be executed in parallel.

This lets applications exploit the parallel architecture found in symmetric multiprocessor (SMP) systems.

Automatic parallelization frees developers from having to:

Deal with the details of finding loops that are good worksharing candidates
Perform the dataflow analysis to verify correct parallel execution
Partition the data for threaded code generation, as is needed in programming with OpenMP* directives
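
The dataflow analysis in this list can be illustrated with two small loops. The sketch below is illustrative only: the module and subroutine names are hypothetical, and whether a given compiler actually parallelizes the first loop also depends on its own profitability heuristics, which are not described here.

```fortran
module dependence_demo
  implicit none
contains

  ! Each iteration reads and writes only element i, so the iterations
  ! are independent and could safely run in any order or concurrently.
  subroutine independent_update(a, b, c)
    integer, intent(inout) :: a(:)
    integer, intent(in)    :: b(:), c(:)
    integer :: i
    do i = 1, size(a)
      a(i) = a(i) + b(i) * c(i)
    end do
  end subroutine independent_update

  ! Iteration i reads a(i-1), which iteration i-1 has just written.
  ! This loop-carried dependence means the iterations must run in
  ! order, so the loop cannot simply be split across threads.
  subroutine running_sum(a, b)
    integer, intent(inout) :: a(:)
    integer, intent(in)    :: b(:)
    integer :: i
    do i = 2, size(a)
      a(i) = a(i-1) + b(i)
    end do
  end subroutine running_sum

end module dependence_demo
```

The first subroutine has the same loop body as the serial example in this topic. The second produces a different result if its iterations are reordered, which is the kind of hazard the dataflow analysis has to rule out before generating threaded code.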

The parallel run-time support provides the same run-time features as found in OpenMP, such as handling the details of loop iteration modification, thread scheduling, and synchronization.

While OpenMP directives make it possible to transform serial applications into parallel applications quickly, a programmer must explicitly identify the specific portions of the application code that contain parallelism and add the appropriate compiler directives.

Auto-parallelization, which is triggered by the -parallel (Linux* and Mac OS*) or /Qparallel (Windows*) option, automatically identifies those loop structures that contain parallelism. During compilation, the compiler automatically attempts to deconstruct the code sequences into separate threads for parallel processing. No other effort by the programmer is needed.

Note

IA-64 architecture only: Specifying these options implies -opt-mem-bandwidth1 (Linux) or /Qopt-mem-bandwidth1 (Windows).

Serial code can be divided so that it can execute concurrently on multiple threads. For example, consider the following serial code.

Example 1: Original Serial Code

    subroutine ser(a, b, c)
      integer, dimension(100) :: a, b, c
      do i=1,100
        a(i) = a(i) + b(i) * c(i)
      enddo
    end subroutine ser

The following example illustrates one way the loop iteration space from the previous example might be divided to execute on two threads.

Example 2: Transformed Parallel Code

    subroutine par(a, b, c)
      integer, dimension(100) :: a, b, c
      ! Thread 1
      do i=1,50
        a(i) = a(i) + b(i) * c(i)
      enddo
      ! Thread 2
      do i=51,100
        a(i) = a(i) + b(i) * c(i)
      enddo
    end subroutine par
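
The two-thread split in Example 2 generalizes to any number of threads. The following is a minimal sketch of a static block partition, assuming each thread receives one contiguous range of iterations. The names `partition_demo`, `chunk_bounds`, and `par_chunks` are hypothetical, and the scheme actually used by the compiler's run-time support is not specified in this topic.

```fortran
module partition_demo
  implicit none
contains

  ! Compute the inclusive bounds [lo, hi] of the contiguous block of
  ! iterations 1..n assigned to thread tid (1-based) out of nthreads.
  ! When n is not a multiple of nthreads, the first mod(n, nthreads)
  ! threads each take one extra iteration.
  subroutine chunk_bounds(n, nthreads, tid, lo, hi)
    integer, intent(in)  :: n, nthreads, tid
    integer, intent(out) :: lo, hi
    integer :: base, extra
    base  = n / nthreads
    extra = mod(n, nthreads)
    lo = (tid - 1) * base + min(tid - 1, extra) + 1
    hi = lo + base - 1
    if (tid <= extra) hi = hi + 1
  end subroutine chunk_bounds

  ! Apply the loop body from the examples chunk by chunk. The chunks
  ! are visited one after another here; in threaded code each chunk
  ! would be executed by a different thread.
  subroutine par_chunks(a, b, c, nthreads)
    integer, intent(inout) :: a(:)
    integer, intent(in)    :: b(:), c(:)
    integer, intent(in)    :: nthreads
    integer :: tid, lo, hi, i
    do tid = 1, nthreads
      call chunk_bounds(size(a), nthreads, tid, lo, hi)
      do i = lo, hi
        a(i) = a(i) + b(i) * c(i)
      end do
    end do
  end subroutine par_chunks

end module partition_demo
```

With n = 100 and two threads, `chunk_bounds` yields the ranges 1 to 50 and 51 to 100, matching the split shown in Example 2.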

Auto-Vectorization and Parallelization

Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8 or up to 16 elements in one operation, depending on the data type. In some cases auto-parallelization and vectorization can be combined for better performance results. For example, in the code below, thread-level parallelism can be exploited in the outermost loop, while instruction-level parallelism can be exploited in the innermost loop.

Example

    DO I = 1, 100     ! Execute groups of iterations in different threads (TLP)
      DO J = 1, 32    ! Execute in SIMD style with multimedia extension (ILP)
        A(J,I) = A(J,I) + 1
      ENDDO
    ENDDO
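
A self-contained version of this loop nest is sketched below so that it can be compiled and inspected. The names `nested_demo` and `increment_all` are hypothetical, and the array is assumed to be of type REAL. Because Fortran stores arrays in column-major order, running J in the innermost loop accesses consecutive memory locations, which is the access pattern that SIMD execution favors.

```fortran
module nested_demo
  implicit none
contains

  ! Add 1 to every element of a two-dimensional array.
  subroutine increment_all(a)
    real, intent(inout) :: a(:, :)
    integer :: i, j
    ! Outer loop over columns: each column is independent of the
    ! others, so groups of columns could be given to different threads.
    do i = 1, size(a, 2)
      ! Inner loop walks down one column, touching consecutive
      ! memory locations: a candidate for SIMD execution.
      do j = 1, size(a, 1)
        a(j, i) = a(j, i) + 1.0
      end do
    end do
  end subroutine increment_all

end module nested_demo
```

Whether a compiler actually threads the outer loop and vectorizes the inner one depends on the options used and on its own analysis; the comments only mark where each kind of parallelism is available.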

Auto-vectorization can help improve performance of an application that runs on systems based on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors.

With the right choice of options, you can:

Increase the performance of your application with minimum effort
Use compiler features to develop multithreaded programs faster

Additionally, with the relatively small effort of adding OpenMP directives to existing code you can transform a sequential program into a parallel program. The following example shows OpenMP directives within the code.

Example

    !$OMP PARALLEL DO PRIVATE(NUM), SHARED(X,A,B,C)
    ! Defines a parallel region that implicitly contains
    ! a single DO directive; NUM is private to each thread
    DO I = 1, 1000
      NUM = FOO(B(I), C(I))
      X(I) = BAR(A(I), NUM)
      ! Assume FOO and BAR have no other effect
    ENDDO
    !$OMP END PARALLEL DO
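
The fragment above leaves FOO and BAR undefined. The sketch below fills them in with placeholder PURE functions so that the loop can be compiled and run. The function bodies, the module name `omp_demo`, and the REAL data type are assumptions made for illustration only.

```fortran
module omp_demo
  implicit none
contains

  ! Placeholder for FOO: a side-effect-free function of two values.
  pure function foo(b, c) result(r)
    real, intent(in) :: b, c
    real :: r
    r = b * c
  end function foo

  ! Placeholder for BAR: a side-effect-free function of two values.
  pure function bar(a, num) result(r)
    real, intent(in) :: a, num
    real :: r
    r = a + num
  end function bar

  subroutine compute(x, a, b, c)
    real, intent(out) :: x(:)
    real, intent(in)  :: a(:), b(:), c(:)
    real :: num
    integer :: i
    ! num is a per-iteration temporary, so each thread needs its own
    ! copy; the arrays are shared, and each iteration writes only x(i).
    !$OMP PARALLEL DO PRIVATE(num) SHARED(x, a, b, c)
    do i = 1, size(x)
      num  = foo(b(i), c(i))
      x(i) = bar(a(i), num)
    end do
    !$OMP END PARALLEL DO
  end subroutine compute

end module omp_demo
```

When the code is compiled without OpenMP support enabled, the directive lines are treated as ordinary comments and the loop runs serially with the same result.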

See examples of the auto-parallelization and auto-vectorization directives in the following topics.