The OpenMP* sample illustrates how to create and compile multi-threaded applications.
See Included Samples for other samples included with the compiler.
| Source | Locations |
|---|---|
| openmp_sample.c | |
This sample illustrates combining compiler options and OpenMP* pragmas to compile and run multi- and single-threaded executables. The sample generates a multi-threaded executable when you add the -openmp (Linux and Mac OS) or /Qopenmp (Windows) compiler option to the compilation command. Without that option, the same source code results in a single-threaded executable.
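The single-source behavior described above can be sketched as follows. This is not code from openmp_sample.c; the function name `thread_count` is hypothetical. It relies only on the standard `_OPENMP` macro, which an OpenMP-enabled compilation defines, and on `omp_get_num_threads()` from `omp.h`.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only available when OpenMP support is enabled */
#endif

/* Hypothetical helper: returns the number of threads a parallel
 * region would use, or 1 when the file is compiled without OpenMP. */
int thread_count(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Only the master thread writes the shared variable. */
        #pragma omp master
        n = omp_get_num_threads();
    }
#endif
    return n;
}
```

Calling `printf("We are using %d thread(s)\n", thread_count());` from `main` would report 1 when the OpenMP option is omitted, and the size of the thread team otherwise.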
In the multi-threaded implementation, each thread concurrently computes a sub-matrix of the product, with no need for OpenMP data or control synchronization.
In the sample, each element of the product matrix c[i][j] is computed from a unique row and column of the factor matrices, a[i][k] and b[k][j]. The algorithm uses OpenMP* to parallelize the outer-most loop, using the "i" row index.
Both the outer-most "i" loop and middle "k" loop are manually unrolled by 4. The inner-most "j" loop iterates one-by-one over the columns of the product and factor matrices.
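The loop structure described above might look like the following minimal sketch. It is not the code from openmp_sample.c: the function name `matmul` and the flat row-major array layout are assumptions made for illustration, and the manual unrolling by 4 is left out to keep the example short.

```c
/* Hypothetical kernel: c (m x n) += a (m x k) * b (k x n).
 * All matrices are stored as flat row-major arrays. */
void matmul(int m, int k, int n,
            const double *a, const double *b, double *c)
{
    /* Parallelize the outer-most "i" loop. Each thread owns a distinct
     * set of rows of c, so no synchronization is needed. */
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        for (int kk = 0; kk < k; kk++) {
            double aik = a[i * k + kk];
            /* The inner-most "j" loop walks the columns one by one. */
            for (int j = 0; j < n; j++) {
                c[i * n + j] += aik * b[kk * n + j];
            }
        }
    }
}
```

Compiled without an OpenMP option, the pragma is ignored and the same loop nest runs on a single thread.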
Multi-threaded sources often require a large stack. The commands below set the suggested stack size and compile the sample source. The Linux and Mac OS commands assume the bash shell.
| Platform | Commands |
|---|---|
| Linux | ulimit -s unlimited<br>icc -openmp -std=c99 openmp_sample.c |
| Mac OS | ulimit -s 64000<br>icc -openmp -std=c99 openmp_sample.c |
| Windows | icl /Qopenmp /Qstd=c99 openmp_sample.c /F256000000 |
The compiler generates status messages informing you which defined loops or regions were parallelized. The following examples illustrate typical messages. (The Windows status messages include the linker phase.)
| Platform | Status Messages |
|---|---|
| Linux | openmp_sample.c(109): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.<br>openmp_sample.c(116): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.<br>openmp_sample.c(96): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED. |
| Windows | openmp_sample.c(106): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.<br>openmp_sample.c(113): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.<br>openmp_sample.c(93): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.<br>Microsoft (R) Incremental Linker Version 8.00.50727.42<br>-out:openmp_sample.exe<br>-stack:256000000 |
Run the multi-threaded executable.
| Platform | Commands |
|---|---|
| Linux and Mac OS | ./a.out |
| Windows | openmp_sample |
The multi-threaded executable should generate results similar to the following.
| Sample Output |
|---|
| Using time() for wall clock time<br>Problem size: c(600,2400) = a(600,1200) * b(1200,2400)<br>Calculating product 5 time(s)<br>We are using 2 thread(s)<br>Finished calculations.<br>Matmul kernel wall clock time = 12.00 sec<br>Wall clock time/thread = 6.00 sec<br>MFlops = 1440.000000 |
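The reported figures are consistent with counting two floating-point operations (one multiply, one add) per inner-loop step. The sketch below reproduces that arithmetic; the formulas are inferred from the numbers in the sample output rather than taken from the sample source, and the function names are hypothetical.

```c
/* Hypothetical helper: MFlops for `reps` products of an (m x k)
 * matrix with a (k x n) matrix, assuming 2 flops per inner-loop step. */
double matmul_mflops(int m, int k, int n, int reps, double seconds)
{
    double flops = 2.0 * m * k * n * reps;
    return flops / seconds / 1.0e6;
}

/* Hypothetical helper: wall clock time divided by the thread count,
 * matching the "Wall clock time/thread" line of the output. */
double time_per_thread(double seconds, int threads)
{
    return seconds / threads;
}
```

With the problem size shown (m = 600, k = 1200, n = 2400, 5 repetitions), the operation count is 2 × 600 × 1200 × 2400 × 5 = 17,280,000,000, which over 12 seconds gives 1440 MFlops.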
Note the number of threads reported; at least two threads should have been used.
Linux and Mac OS (bash): If you receive an error message stating that the executable caused a segmentation fault, the most likely cause is the stack size. To verify the stack size setting, enter the following command: ulimit -s
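The same limit can also be inspected from inside a program. The following is a minimal sketch, assuming a POSIX system that provides `getrlimit` and `RLIMIT_STACK`; it is not part of the sample, and the function name `report_stack_limit` is hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>   /* POSIX: getrlimit, RLIMIT_STACK */

/* Hypothetical helper: prints the soft stack limit in kilobytes,
 * the unit that bash uses for `ulimit -s`.
 * Returns 0 on success and -1 if the limit cannot be read. */
int report_stack_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        printf("stack size: unlimited\n");
    } else {
        printf("stack size: %llu KB\n",
               (unsigned long long)(rl.rlim_cur / 1024));
    }
    return 0;
}
```

Calling `report_stack_limit()` at the start of `main` shows the limit the process actually inherited, which is useful when the shell session that set it has since been closed.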
Delete the executable created earlier, and enter the following compilation command. Notice the -openmp (Linux and Mac OS) or /Qopenmp (Windows) option is not included.
Linux and Mac OS: If you've closed the session since the last time you set the stack size, you must set the stack size again.
| Platform | Commands |
|---|---|
| Linux and Mac OS | icc -std=c99 openmp_sample.c |
| Windows | icl /Qstd=c99 openmp_sample.c /F256000000 |
Notice that the compiler does not generate messages about parallelized loops or regions, because OpenMP support is disabled. The compiler does, however, list status messages about the ignored OpenMP* pragmas.
Run the single-threaded executable.
| Platform | Commands |
|---|---|
| Linux and Mac OS | ./a.out |
| Windows | openmp_sample |
The executable should generate results similar to the following.
| Sample Output |
|---|
| Using time() for wall clock time<br>Problem size: c(600,2400) = a(600,1200) * b(1200,2400)<br>Calculating product 5 time(s)<br>We are using 1 thread(s)<br>Finished calculations.<br>Matmul kernel wall clock time = 23.00 sec<br>Wall clock time/thread = 23.00 sec<br>MFlops = 751.304348 |
Notice that only one thread was used and that the wall clock time rose from 12 to 23 seconds, nearly double that of the multi-threaded run.