This topic discusses how the compiler generates multithreaded code using the Intermediate Language Scalar Optimizer.
Assume that the compiler is given the following while loop. In this example, and in all examples in this topic, line numbers are included so you can follow the discussion more easily.
Example:

```
1  void test1(LIST p)
2  {
3    #pragma intel omp parallel taskq shared(p)
4    {
5      while (p != NULL)
6      {
7        #pragma intel omp task captureprivate(p)
8        {
9          do_work1(p);
10       }
11       p = p->next;
12     }
13   }
14 }
```
The parallel taskq pragma specifies an environment for the while loop in which to enqueue the units of work specified by the enclosed task pragma. The control structure for the loop and the enqueuing are executed single-threaded, while the other threads in the team participate in dequeuing the work from the taskq queue and executing it. The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is being enqueued, which preserves the sequential semantics.
The compiler first generates an IL0 representation of the workqueuing code as shown below, where the while loop has been lowered into if and goto statements, and each workqueuing pragma has been converted into a pair of IL0 begin/end directives that define the boundaries of the construct.
Example:

```
1  void test1(p)
2  {
3    DIR_PARALLEL_TASKQ SHARED(p)
4    if (p != 0)
5    {
6  L1:
7      DIR_TASK CAPTUREPRIVATE(p)
8      do_work1(p)
9      DIR_END_TASK
10     p = p->next
11     if (p != 0)
12     {
13       goto L1
14     }
15   }
16   DIR_END_PARALLEL_TASKQ
17   return
18 }
```
Next, the OpenMP back-end creates new data structures to handle private and shared variables. Two struct types are defined for each taskq construct. The first one, shared_t, holds the pointers to its shared, firstprivate, lastprivate, and reduction variables.
The other struct, thunk_t, has a pointer to shared_t, in addition to fields that hold the private copies of variables listed in the taskq as private, firstprivate, lastprivate, captureprivate, or reduction. At compile-time, a pointer to each struct is created, while the actual objects they point to are instantiated at run-time by invoking workqueuing library routines.
For this example, the symbol table entries generated are as follows:
Example:

```
typedef struct shared_t
{
  ...                    // fields for internal use
  p_ptr;                 // pointer to shared variable p
};
auto struct shared_t *shareds;

typedef struct thunk_t
{
  ...                    // fields for internal use
  struct shared_t *shr;  // := shareds
  p;                     // captureprivate p
};
auto struct thunk_t *taskq_thunk;
```
In this example, the automatic pointers shareds and taskq_thunk are allocated outside of the threaded entry of taskq. In addition to the private variables, thunk_t holds enough information about the code and data of a taskq so that its execution, if suspended due to a full queue, can be resumed later.
The OpenMP back-end lowers the IL0 directives into multithreaded IL0 code that explicitly calls Intel OpenMP run-time library routines to enqueue and dequeue the tasks and to manage and synchronize the threads. As a result of the code transformations, three threaded entries, or T-entries, are inserted into the original function test1():

- test1_par_taskq() (lines 6 through 16) corresponds to the semantics of the parallel portion of the combined parallel taskq pragma.
- test1_taskq() (lines 18 through 50) corresponds to the taskq portion of the combined pragma.
- test1_task() (lines 36 through 40), nested within test1_taskq(), corresponds to the enclosed task construct.
Example:

```
1  void test1(p)
2  {
3    __kmpc_fork_call(test1_par_taskq, &p)
4    goto L3
5
6    T-entry test1_par_taskq(p_ptr)
7    {
         ...
16   }
17
18   T-entry test1_taskq(taskq_thunk)
19   {
         ...
36     T-entry test1_task(task_thunk)
37     {
           ...
40     }
       ...
50   }
51 L3:
52   return
53 }
```
A T-entry is similar to an ordinary function entry, with some subtle differences: it is nested inside the original function, it is invoked through the OpenMP run-time library rather than called directly, and it ends with a T-return instead of a normal return.
The Intel OpenMP library routine __kmpc_fork_call() invoked in line 3 creates a team of threads at run-time. All threads execute test1_par_taskq(), but only one thread proceeds to execute test1_taskq(), which is like a skeletal version of the while loop and whose main purpose is to enqueue, on every iteration, the work specified in test1_task(). All the other threads become worker threads that dequeue the tasks and execute them.
Let's look at these T-entries, starting with test1_par_taskq():
Example:

```
6  T-entry test1_par_taskq(p_ptr)
7  {
8    taskq_thunk = __kmpc_taskq(test1_taskq, 4, 4, &shareds)
9    shareds->p_ptr = p_ptr
10   if (taskq_thunk != 0)
11   {
12     test1_taskq(taskq_thunk)
13   }
14   __kmpc_end_taskq(taskq_thunk)
15   T-return
16 }
```
The parameter p_ptr (line 6) is a pointer to the shared variable p. All threads call the library routine __kmpc_taskq() in line 8 to instantiate shared_t (in shareds) and also to attempt to instantiate thunk_t (for taskq_thunk), but only one of the threads will succeed in that attempt and only it will proceed to the T-entry test1_taskq() (line 12). The other threads fail the test in line 10 and call __kmpc_end_taskq() in line 14 to become worker threads. The T-entry test1_taskq() is shown below:
Example:

```
18 T-entry test1_taskq(taskq_thunk)
19 {
20   if (taskq_thunk->status == 1)
21   {
22     goto L2
23   }
24   if (*(taskq_thunk->shr->p_ptr) != 0)
25   {
26 L1:
27     task_thunk = __kmpc_task_buffer(taskq_thunk, test1_task)
28     task_thunk->p = *(taskq_thunk->shr->p_ptr)
29     if (__kmpc_task(task_thunk) != 0)
30     {
31       __kmpc_taskq_task(taskq_thunk, 1)
32       T-return
33     }
       ...
41 L2:
42     *(taskq_thunk->shr->p_ptr) = (*(taskq_thunk->shr->p_ptr))->next
43     if (*(taskq_thunk->shr->p_ptr) != 0)
44     {
45       goto L1
46     }
47   }
48   __kmpc_end_taskq_task(taskq_thunk)
49   T-return
50 }
```
The main purpose of test1_taskq() is to enqueue each task as it runs through the loop's control structure. The test for the loop appears in lines 24 and 43, and its pointer update in line 42. Access to the shared p is accomplished through the expression *(taskq_thunk->shr->p_ptr).
Before enqueuing the task for an iteration, the library routine __kmpc_task_buffer() is called (line 27) to create task_thunk and initialize it based on taskq_thunk. The address of the T-entry test1_task() is also stored in task_thunk for the worker thread that dequeues it to run the task. To satisfy the captureprivate semantics, line 28 copies the value of the shared variable p into the task's private copy of p stored in its task_thunk.
The actual enqueuing of a task is done by the library routine __kmpc_task() in line 29. A return value of zero means that the queue is not full, allowing the next task to be enqueued. A non-zero return value indicates that the queue is becoming full; the execution of the taskq is suspended (the thread becomes a worker thread) and will be resumed later, potentially by a different thread.
For this purpose, the library routine __kmpc_taskq_task() in line 31 enqueues the taskq_thunk itself. Later, the worker thread that dequeues it will resume execution of the taskq at the location L2 in line 41. To accomplish this, a jump table value (integer 1 in the example) is passed in as an argument to __kmpc_taskq_task().
This value is stored in taskq_thunk and must uniquely identify this call site, because there may be several tasks enclosed in the same taskq. This value must also be non-zero, because zero is reserved for the first execution of the taskq.
Based on this value, the jump table (lines 20 through 23) determines whether the current execution of test1_taskq() is the first one or a continuation of a suspended run, and accordingly transfers execution to the correct location.
After a worker thread dequeues a task_thunk, it extracts the address of test1_task() and executes the task it contains. This T-entry is shown below:
Example:

```
36 T-entry test1_task(task_thunk)
37 {
38   do_work1(task_thunk->p)
39   T-return
40 }
```