This topic discusses how to use the OpenMP* library functions and environment variables, and offers some guidelines for enhancing performance with OpenMP*.
OpenMP* provides specific function calls and environment variables. See the following topics to refresh your memory about the primary functions and environment variables used in this topic:
To use the function calls, include the omp.h header file for C/C++ (or omp_lib.h for Fortran); both are installed in the INCLUDE directory during the compiler installation. Compile the application with the -openmp (Linux* and Mac OS*) or /Qopenmp (Windows*) option. No additional libraries are required for linking.
The following example, which demonstrates how to use the OpenMP* functions to print the alphabet, also illustrates several important concepts.
First, when using functions instead of pragmas, your code must be rewritten; any rewrite can mean extra debugging, testing, and maintenance effort.
Second, it becomes difficult to compile without OpenMP support.
Third, it is very easy to introduce bugs, as in the loop (below) that fails to print all the letters of the alphabet when the number of threads is not a multiple of 26.
Fourth and finally, you lose the ability to adjust loop scheduling without creating your own work-queue algorithm, which is a lot of extra effort. You are limited by your own scheduling, which is most likely static scheduling, as shown in the example.
Example

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);
    int i;
    #pragma omp parallel private(i)
    {
        // OMP_NUM_THREADS is not a multiple of 26,
        // which can be considered a bug in this code.
        int LettersPerThread = 26 / omp_get_num_threads();
        int ThisThreadNum = omp_get_thread_num();
        int StartLetter = 'a' + ThisThreadNum * LettersPerThread;
        int EndLetter = 'a' + ThisThreadNum * LettersPerThread + LettersPerThread;
        for (i = StartLetter; i < EndLetter; i++)
            printf("%c", i);
    }
    printf("\n");
    return 0;
}
```
Debugging threaded applications is a complex process, because debuggers change the runtime behavior, which can mask race conditions. Even print statements can mask issues, because they use synchronization and operating system functions. OpenMP* adds further complications: it inserts private variables, shared variables, and additional code that is impossible to examine and step through without a specialized debugger that supports OpenMP*. When using OpenMP*, your key debugging tool is the process of elimination.
Remember that most mistakes are race conditions, and most race conditions are caused by shared variables that really should have been declared private. Start by looking at the variables inside the parallel regions and make sure that the variables are declared private when necessary. Next, check functions called within parallel constructs. By default, variables declared on the stack are private, but the C/C++ keyword static gives a variable static storage duration, making it shared among OpenMP* threads.
The default(none) clause, shown below, can help you find those hard-to-spot variables. If you specify default(none), then every variable referenced in the construct must be declared with a data-sharing attribute clause.
Example

```c
#pragma omp parallel for default(none) private(x,y) shared(a,b)
```
Another common mistake is using uninitialized variables. Remember that private variables do not have initial values upon entering a parallel construct, and their final values are discarded on exit. Use the firstprivate clause when a private copy must start with the variable's prior value, and the lastprivate clause when the value from the last iteration must be copied back out; use them only when necessary, because both add extra overhead.
If you still can't find the bug, consider reducing the scope. Try a binary hunt: force parallel sections to run serially again with if(0) on the parallel construct, or comment out the pragma altogether. Another method is to force large chunks of a parallel region to be critical sections. Pick a region of the code that you think contains the bug and place it within a critical section. Try to find the section of code that suddenly works when it is within a critical section and fails when it is not. Now look at the variables, and see if the bug is apparent. If that still doesn't work, try setting the entire program to run in serial by setting the compiler-specific environment variable KMP_LIBRARY=serial.
If the code is still not working, compile it without the -openmp (Linux) or /Qopenmp (Windows) option to make sure the serial version works.
OpenMP* threaded application performance largely depends on the following:
The underlying performance of the single-threaded code.
CPU utilization, idle threads, and poor load balancing.
The percentage of the application that is executed in parallel.
The amount of synchronization and communication among the threads.
The overhead needed to create, manage, destroy, and synchronize the threads, which grows with the number of single-to-parallel and parallel-to-single (fork-join) transitions.
Performance limitations of shared resources such as memory, bus bandwidth, and CPU execution units.
Memory conflicts caused by shared memory or falsely shared memory.
Threaded code performance is affected by two things:
How well the single-threaded version runs.
How well you divide the work among multiple processors with the least amount of overhead.
Performance always begins with a properly constructed parallel algorithm or application. It should be obvious that parallelizing a bubble sort, even one written in hand-optimized assembly language, is not a good place to start. Keep scalability in mind; creating a program that runs well on two CPUs is not as useful as creating one that runs well on n CPUs. With OpenMP*, the number of threads is chosen at run time by the OpenMP* runtime (for example, through the OMP_NUM_THREADS environment variable), so programs that work well regardless of the number of threads are highly desirable. Producer/consumer architectures are rarely efficient, because they are designed specifically for two threads.
Once the algorithm is in place, make sure that the code runs efficiently on the targeted Intel® architecture; a single-threaded version can be a big help. Turn off the -openmp (Linux* and Mac OS*) or /Qopenmp (Windows*) option to generate a single-threaded version, and run the single-threaded version through the usual set of optimizations. See Worksharing Using OpenMP* for more information.
Once you have gotten the single-threaded performance, it is time to generate the multi-threaded version and start doing some analysis.
Start by looking at the amount of time spent in the idle loop of the operating system. The VTune™ Performance Analyzer is a great tool to help with the investigation. Idle time can indicate unbalanced loads, lots of blocked synchronization, and serial regions. Fix those issues, and then go back to the VTune™ Analyzer to look for excessive cache misses and memory issues like false-sharing. Solve these basic problems, and you will have a well-optimized parallel program that will run well on Hyper-Threading Technology as well as multiple physical CPUs.
Optimizations are really a combination of patience, experimentation, and practice. Make little test programs that mimic the way your application uses the computer's resources to get a feel for what things are faster than others. Be sure to try the different scheduling clauses for the parallel sections. If the overhead of a parallel region is large compared to the compute time, you may want to use an if clause to execute the section serially.