# Loop Independence

Loop independence is important because loops that are independent can be parallelized. Independent loops can be parallelized in a number of ways, from the coarse-grained parallelism of OpenMP* to the fine-grained Instruction Level Parallelism (ILP) of vectorization and software pipelining.

A loop is considered independent when the computation of any one iteration can be done without using the results of any other iteration. In other words, if iteration 2 of a loop can be computed simultaneously with iteration 1, without using any result from iteration 1, then the loop is independent.

Occasionally, you can determine whether a loop is independent by comparing the output of the loop with the output of the same loop written with a decrementing index counter.

For example, the loop in loop_indep1 below might be independent if loop_indep2 generates the same result.

### Example

```c
#define MAX 1024

void loop_indep1(int a[MAX], int b[MAX])
{
    for (int j = 0; j < MAX; j++)
        a[j] = b[j];
}
```

```c
#define MAX 1024

void loop_indep2(int a[MAX], int b[MAX])
{
    for (int j = MAX - 1; j >= 0; j--)
        a[j] = b[j];
}
```
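As a sketch of this comparison, a small self-contained driver (with local copies of the two loops above, here named copy_fwd and copy_bwd for this standalone example) can run both iteration orders on the same input and check that the outputs match; identical results are consistent with, but do not prove, independence:

```c
#define MAX 1024

/* Forward and backward versions of the same loop body. */
static void copy_fwd(int a[MAX], const int b[MAX])
{
    for (int j = 0; j < MAX; j++)
        a[j] = b[j];
}

static void copy_bwd(int a[MAX], const int b[MAX])
{
    for (int j = MAX - 1; j >= 0; j--)
        a[j] = b[j];
}

/* Returns 1 if both iteration orders produce identical output,
   which is consistent with (though not a proof of) independence. */
int same_result(void)
{
    int b[MAX], a1[MAX], a2[MAX];
    for (int j = 0; j < MAX; j++)
        b[j] = 3 * j + 1;
    copy_fwd(a1, b);
    copy_bwd(a2, b);
    for (int j = 0; j < MAX; j++)
        if (a1[j] != a2[j])
            return 0;
    return 1;
}
```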

When loops are dependent, improving loop performance becomes much more difficult. Loops can be dependent in several general ways.

The following sections illustrate the different loop dependencies.

## Flow Dependency - Read After Write

Cross-iteration flow dependence is created when variables are written then read in different iterations, as shown in the following example:

### Example

```c
void flow_dep(double A[])
{
    for (int j = 1; j < 1024; j++)
        A[j] = A[j-1];
}
```

The above example is equivalent to the following lines for the first few iterations:

### Sample iterations

```c
A[1] = A[0];
A[2] = A[1];
```

Recurrence relations feed information forward from one iteration to the next.

### Example

```c
void time_stepping_loops(double a[], double b[])
{
    for (int j = 1; j < MAX; j++) {
        a[j] = a[j-1] + b[j];
    }
}
```

Most recurrences cannot be made fully parallel. Instead, look for a loop further out or further in to parallelize. You might be able to get more performance gains through unrolling.
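For instance, in the hypothetical row_prefix_sums sketch below, each row's computation is a serial recurrence along j, but the rows themselves do not depend on one another, so the outer loop over i is the natural place to apply coarse-grained parallelism:

```c
#define COLS 1024

/* Hypothetical example: each row carries its own recurrence, but the
   rows do not depend on one another. */
void row_prefix_sums(double a[][COLS], double b[][COLS], int rows)
{
    /* Outer loop: iterations are independent, so it is a candidate
       for coarse-grained parallelism (e.g. an OpenMP parallel for). */
    for (int i = 0; i < rows; i++) {
        /* Inner loop: flow dependence on a[i][j-1], so it stays serial. */
        for (int j = 1; j < COLS; j++) {
            a[i][j] = a[i][j-1] + b[i][j];
        }
    }
}
```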

## Anti Dependency - Write After Read

Cross-iteration anti-dependence is created when variables are read then written in different iterations, as shown in the following example:

### Example

```c
void anti_dep1(double A[])
{
    for (int j = 1; j < 1024; j++)
        A[j] = A[j+1];
}
```

The above example is equivalent to the following lines for the first few iterations:

### Sample iterations

```c
A[1] = A[2];
A[2] = A[3];
```
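One common way to remove an anti-dependence, sketched below under the assumption that extra storage is acceptable, is array renaming: copy the old values into a temporary first, so every read targets the copy while every write targets the original. Each of the two resulting loops is then independent (the helper name anti_dep_renamed is made up for this example, and the second loop is bounded to stay within the array):

```c
#define N 1024

/* Sketch: eliminating the write-after-read hazard of anti_dep1 by
   reading from a private copy of the array. */
void anti_dep_renamed(double A[N])
{
    double old[N];

    /* Independent loop: reads A, writes old. */
    for (int j = 0; j < N; j++)
        old[j] = A[j];

    /* Independent loop: reads old, writes A; no cross-iteration
       read of A remains. */
    for (int j = 1; j < N - 1; j++)
        A[j] = old[j+1];
}
```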

## Output Dependency - Write After Write

Cross-iteration output dependence is created when a variable is written in one iteration and rewritten in a different iteration. The following example illustrates this type of dependency:

### Example

```c
void output_dep(double A[], double B[], double C[])
{
    for (int j = 1; j < 1024; j++) {
        A[j] = B[j];
        A[j+1] = C[j];
    }
}
```

The above example is equivalent to the following lines for the first few iterations:

### Sample iterations

```c
A[1] = B[1];
A[2] = C[1];
A[2] = B[2];
A[3] = C[2];
```

## Reductions

The Intel® compiler can successfully vectorize or software pipeline (SWP) most loops containing reductions on simple math operators like multiplication (*), addition (+), subtraction (-), and division (/). Reductions collapse array data to scalar data by using associative operations:

### Example

```c
void reduction(double *sum, double c[])
{
    for (int j = 0; j < MAX; j++) {
        *sum = *sum + c[j];
    }
}
```

The compiler might occasionally misidentify a reduction and report flow, anti, or output dependencies, or loop-carried memory-dependency edges; in these cases it will not vectorize or SWP the loop. If you recognize that the construct is simply a reduction, you can direct the compiler with pragmas such as #pragma ivdep or #pragma swp to vectorize or SWP the loop.
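As a hedged sketch of that advice, the reduction above can be annotated with #pragma ivdep (the function name reduction_ivdep is made up for this example; compilers that do not recognize the pragma simply ignore it, so the numeric result is unchanged):

```c
#define MAX 1024

/* Sketch: the same reduction, with #pragma ivdep asserting to the
   compiler that assumed loop-carried dependencies can be ignored. */
void reduction_ivdep(double *sum, double c[])
{
#pragma ivdep
    for (int j = 0; j < MAX; j++) {
        *sum = *sum + c[j];
    }
}
```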