OpenMP* Sample

The OpenMP* sample illustrates how to create and compile multi-threaded applications.

See Included Samples for other samples included with the compiler.

Sample files and locations

Source	Platform	Location
openmp_sample.c	Linux* and Mac OS*	<install-dir>/samples/openmp_samples/
openmp_sample.c	Windows*	<install-dir>\samples\openmp_samples\

Description

This sample illustrates combining compiler options and OpenMP* pragmas to compile and run multi- and single-threaded executables. The sample generates a multi-threaded executable when you add the -openmp (Linux and Mac OS) or /Qopenmp (Windows) compiler option to the compilation command. Without that option, the same source code compiles to a single-threaded executable.

In the multi-threaded implementation, each thread concurrently computes a sub-matrix of the matrix product, so no OpenMP data or control synchronization is needed.

In the sample, each element of the product matrix c[i][j] is computed from a unique row and column of the factor matrices, a[i][k] and b[k][j]. The algorithm uses OpenMP* to parallelize the outer-most loop, using the "i" row index.

Both the outer-most "i" loop and middle "k" loop are manually unrolled by 4. The inner-most "j" loop iterates one-by-one over the columns of the product and factor matrices.

Compile the sample as a multi-threaded application

Multi-threaded applications often require a large stack size. The commands listed below set the suggested stack size and compile the sample source. On Linux and Mac OS, the commands assume the bash shell.

Platform	Commands
Linux	ulimit -s unlimited
	icc -openmp -std=c99 openmp_sample.c
Mac OS	ulimit -s 64000
	icc -openmp -std=c99 openmp_sample.c
Windows	icl /Qopenmp /Qstd=c99 openmp_sample.c /F256000000

The compiler generates status messages informing you which defined loops or regions were parallelized. The following examples illustrate typical messages. (The Windows status messages include the linker phase.)

Platform	Status Messages
Linux	openmp_sample.c(109): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
	openmp_sample.c(116): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
	openmp_sample.c(96): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
Windows	openmp_sample.c(106): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
	openmp_sample.c(113): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
	openmp_sample.c(93): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
	Microsoft (R) Incremental Linker Version 8.00.50727.42
	Copyright (C) Microsoft Corporation. All rights reserved.
	-out:openmp_sample.exe
	-stack:256000000

Run the multi-threaded executable.

Platform	Commands
Linux and Mac OS	./a.out
Windows	openmp_sample

The multi-threaded executable should generate results similar to the following.

Sample Output

Using time() for wall clock time

Problem size: c(600,2400) = a(600,1200) * b(1200,2400)

Calculating product 5 time(s)

We are using 2 thread(s)

Finished calculations.

Matmul kernel wall clock time = 12.00 sec

Wall clock time/thread = 6.00 sec

MFlops = 1440.000000

Note the number of threads reported; at least two threads should have been used.

Linux and Mac OS (bash): If you receive an error message stating that the executable caused a segmentation fault, the most likely cause is the stack size. To verify the stack size setting, enter the following command: ulimit -s

Compile the sample as a single-threaded application

Delete the executable created earlier, and enter the following compilation command. Notice that the -openmp (Linux and Mac OS) or /Qopenmp (Windows) option is not included.

Caution

Linux and Mac OS: If you've closed the session since the last time you set the stack size, you must set the stack size again.

Platform	Commands
Linux and Mac OS	icc -std=c99 openmp_sample.c
Windows	icl /Qstd=c99 openmp_sample.c /F256000000

Because OpenMP support is disabled, the compiler does not generate messages about parallelized loops or regions. It does, however, list status messages about the ignored OpenMP* pragmas.

Run the single-threaded executable.

Platform	Commands
Linux and Mac OS	./a.out
Windows	openmp_sample

The executable should generate results similar to the following.

Sample Output

Using time() for wall clock time

Problem size: c(600,2400) = a(600,1200) * b(1200,2400)

Calculating product 5 time(s)

We are using 1 thread(s)

Finished calculations.

Matmul kernel wall clock time = 23.00 sec

Wall clock time/thread = 23.00 sec

MFlops = 751.304348

Notice that only one thread was used and that the run time increased significantly, from 12.00 sec to 23.00 sec in these sample runs.