Vectorization Support

Each of these pragmas controls the vectorization of the loop that immediately follows it; the compiler does not apply the pragma to nested loops. Each nested loop needs its own preceding pragma statement. Place the pragma immediately before the loop control statement.

vector always Pragma

Syntax

#pragma vector always

The vector always pragma instructs the compiler to override any efficiency heuristic when deciding whether to vectorize. The compiler then vectorizes the loop even when it contains non-unit strides or heavily unaligned memory accesses.

Example: vector always Pragma

void vec_always(int *a, int *b, int m)

{

  #pragma vector always

  for(int i = 0; i <= m; i++)

    a[32*i] = b[99*i];

}

ivdep Pragma

Syntax

#pragma ivdep

The ivdep pragma instructs the compiler to ignore assumed vector dependencies. To ensure correct code, the compiler normally treats an assumed dependence as a proven dependence, which prevents vectorization. This pragma overrides that decision. Use it only when you know that the assumed loop dependences are safe to ignore.

The loop in this example will not vectorize without the ivdep pragma, since the value of k is not known; vectorization would be illegal if k<0.

Example

void ignore_vec_dep(int *a, int k, int c, int m)

{

  #pragma ivdep

  for (int i = 0; i < m; i++)

    a[i] = a[i + k] * c;

}

The pragma binds only the for loop contained in the current function. This includes a for loop contained in a subfunction called by the current function.
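To see why the k < 0 case matters, the following sketch compares the sequential loop with a simple model of a vectorized loop that loads a whole chunk of inputs before storing any results. The chunk width VLEN, the helper names, and the padding scheme are assumptions made for illustration; real vector code generated by a compiler will differ in detail.

```c
#include <assert.h>
#include <string.h>

#define VLEN 4  /* hypothetical vector width, in elements */

/* Reference semantics: one element at a time, in program order. */
void scalar_loop(int *a, int k, int c, int m)
{
  for (int i = 0; i < m; i++)
    a[i] = a[i + k] * c;
}

/* Model of a vectorized loop: each chunk loads all of its inputs
   before it stores any of its results. */
void chunked_loop(int *a, int k, int c, int m)
{
  int tmp[VLEN];
  for (int i = 0; i < m; i += VLEN) {
    int len = (m - i < VLEN) ? m - i : VLEN;
    for (int j = 0; j < len; j++)
      tmp[j] = a[i + j + k] * c;
    for (int j = 0; j < len; j++)
      a[i + j] = tmp[j];
  }
}

/* Returns 1 if both execution orders give the same array contents. */
int same_result(int k)
{
  enum { M = 8, PAD = 4 };
  int x[M + 2 * PAD], y[M + 2 * PAD];
  assert(k >= -PAD && k <= PAD);  /* keep a[i + k] inside the buffers */
  for (int i = 0; i < M + 2 * PAD; i++)
    x[i] = y[i] = i + 1;
  scalar_loop(x + PAD, k, 2, M);
  chunked_loop(y + PAD, k, 2, M);
  return memcmp(x, y, sizeof x) == 0;
}
```

With k = 1 each iteration reads an element that has not yet been overwritten, so both orders agree. With k = -1 the sequential loop reads the value written by the previous iteration, which the chunked model misses, so the results differ. This is the dependence that ivdep tells the compiler to assume away.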

vector Pragma

Syntax

#pragma vector {aligned | unaligned}

The vector pragma indicates that the loop should be vectorized, if it is legal to do so, ignoring normal heuristic decisions about profitability. When the aligned (or unaligned) qualifier is used with this pragma, the loop should be vectorized using aligned (or unaligned) operations. Specify only one qualifier: aligned or unaligned.

Caution

If you specify aligned as an argument, you must be sure that the loop can be vectorized with aligned instructions, that is, that the data accessed in the loop really is aligned. Otherwise, the compiler will generate incorrect code.

The loop in the example that follows uses the aligned qualifier to request vectorization with aligned instructions, because the arrays are declared in such a way that the compiler could not normally prove that doing so is safe.

Example

void vec_aligned(float *a, int m, int c)

{

  int i;

  // Assert that a is aligned so the compiler can use aligned instructions.

  #pragma vector aligned

  for (i = 0; i < m; i++)

    a[i] = a[i] * c;

  // Alignment unknown but compiler can still align.

  for (i = 0; i < 100; i++)

    a[i] = a[i] + 1.0f;

}
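Because the aligned qualifier is a promise made by the programmer, it can help to allocate the data so that the promise actually holds. The following is a minimal sketch under stated assumptions: it uses the C11 aligned_alloc function, a 16-byte alignment, and the hypothetical helper names alloc_aligned_floats, is_aligned16, and scale_aligned. The __INTEL_COMPILER guard is only there so that other compilers skip the pragma.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n floats on a 16-byte boundary (C11 aligned_alloc).
   The size is rounded up because aligned_alloc expects a multiple
   of the alignment. Release the memory with free(). */
float *alloc_aligned_floats(size_t n)
{
  size_t bytes = n * sizeof(float);
  bytes = (bytes + 15) & ~(size_t)15;
  return aligned_alloc(16, bytes);
}

/* Returns 1 if ptr lies on a 16-byte boundary. */
int is_aligned16(const void *ptr)
{
  return ((uintptr_t)ptr & 0x0f) == 0;
}

void scale_aligned(float *a, int m, float c)
{
  /* Check the promise that the pragma makes on our behalf. */
  assert(is_aligned16(a));
#ifdef __INTEL_COMPILER
  #pragma vector aligned
#endif
  for (int i = 0; i < m; i++)
    a[i] = a[i] * c;
}
```

A caller would obtain the array from alloc_aligned_floats, pass it to scale_aligned, and free it afterwards. The assert is a debugging aid only; it does not make an unaligned call safe in a release build.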

The compiler has several alignment strategies at its disposal when the alignment of data structures is not known at compile time. One simple strategy is shown here; several others are supported as well. If the alignment in the loop is unknown, the compiler generates a prelude loop that iterates until the most frequently occurring array reference reaches an aligned address.

Example: Alignment Strategies

float *a;

// Alignment unknown

for (i = 0; i < 100; i++)

{

   a[i] = a[i] + 1.0f;

}

// Dynamic loop peeling (pseudocode: the address held in a is treated as an integer)

p = a & 0x0f;

if (p != 0)

{

   p = (16 - p) / 4;

   for (i = 0; i < p; i++)

   {

      a[i] = a[i] + 1.0f;

   }

}

// Main loop: a[p] is now aligned (will be vectorized accordingly)

for (i = p; i < 100; i++)

{

   a[i] = a[i] + 1.0f;

}
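The peeling strategy above is written as pseudocode. A compilable sketch of the same idea follows, assuming 16-byte vectors and 4-byte float elements that are themselves at least 4-byte aligned. The function names peel_count and add_one_peeled are hypothetical, and a compiler performs this transformation internally rather than in source code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading elements to process one at a time so that
   a[peel_count] falls on a 16-byte boundary (capped at n). */
size_t peel_count(const float *a, size_t n)
{
  uintptr_t misalign = (uintptr_t)a & 0x0f;
  size_t p = 0;
  /* Assumption: a is aligned to the size of a float. */
  assert(misalign % sizeof(float) == 0);
  if (misalign != 0)
    p = (16 - misalign) / sizeof(float);
  return (p < n) ? p : n;
}

void add_one_peeled(float *a, size_t n)
{
  size_t p = peel_count(a, n);
  size_t i;
  /* Prelude loop: runs until a[i] reaches an aligned address. */
  for (i = 0; i < p; i++)
    a[i] = a[i] + 1.0f;
  /* Main loop: starts at an aligned address. */
  for (i = p; i < n; i++)
    a[i] = a[i] + 1.0f;
}
```

The prelude runs at most three iterations for float data, so its cost is small compared with a main loop that can use aligned vector operations.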

novector Pragma

Syntax

#pragma novector

The novector pragma specifies that the loop should never be vectorized, even if it is legal to do so. In the example that follows, suppose you know that the trip count (ub - lb) is too low to make vectorization worthwhile. You can use novector to tell the compiler not to vectorize, even if the loop is considered vectorizable.

Example: novector Pragma

void foo(int lb, int ub)

{

  #pragma novector

  for(int j=lb; j<ub; j++)

  {

     a[j]=a[j]+b[j];

  }

}

vector nontemporal Pragma (Windows*)

Syntax

#pragma vector nontemporal

The vector nontemporal pragma results in streaming stores on systems based on IA-32 architecture. An example loop (float type), together with the generated assembly, is shown in the example that follows. For large N, significant performance improvements result on Pentium 4 systems over a non-streaming implementation.

Example

#pragma vector nontemporal

for (i = 0; i < N; i++)

  a[i] = 1;

  .B1.2:

movntps XMMWORD PTR _a[eax], xmm0

movntps XMMWORD PTR _a[eax+16], xmm0

add eax, 32

cmp eax, 4096

jl .B1.2
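For comparison, the same kind of streaming store can be requested by hand with SSE intrinsics. The sketch below is an illustration under stated assumptions, not the code the compiler generates: the function name fill_ones_stream is hypothetical, the movntps instruction is reached through the _mm_stream_ps intrinsic from xmmintrin.h, and the code falls back to ordinary stores when SSE is unavailable or the array is not 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_STREAM 1
#endif

/* Set a[0..n-1] to 1.0f, using non-temporal stores where possible. */
void fill_ones_stream(float *a, size_t n)
{
  size_t i = 0;
  assert(a != NULL || n == 0);
#ifdef HAVE_SSE_STREAM
  /* _mm_stream_ps requires a 16-byte aligned address. */
  if (((uintptr_t)a & 0x0f) == 0) {
    __m128 ones = _mm_set1_ps(1.0f);
    for (; i + 4 <= n; i += 4)
      _mm_stream_ps(&a[i], ones);  /* maps to movntps */
    _mm_sfence();  /* order the streaming stores before later stores */
  }
#endif
  /* Remainder, or the whole array when streaming is not used. */
  for (; i < n; i++)
    a[i] = 1.0f;
}
```

Streaming stores bypass the cache, so they tend to pay off only when the array is large and is not read again soon after being written.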

 

Example: Dynamic Dependence Testing

float *p, *q;

for (i = L; i <= U; i++)

{

  p[i] = q[i];

}

...

pL = p + 4*L;

pH = p + 4*U;

qL = q + 4*L;

qH = q + 4*U;

if (pH < qL || pL > qH)

{

  // loop without data dependence

  for (i = L; i <= U; i++)

  {

     p[i] = q[i];

  }

} else {

  // loop with possible data dependence

  for (i = L; i <= U; i++)

  {

     p[i] = q[i];

  }

}
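The dependence test above works on byte addresses and is written as pseudocode. A compilable sketch of the same check follows, assuming float elements and L <= U. The function names ranges_disjoint and copy_range are hypothetical, and the addresses are compared as uintptr_t values because comparing pointers into unrelated objects is not defined in standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p[L..U] and q[L..U] cannot overlap in memory. */
int ranges_disjoint(const float *p, const float *q, long L, long U)
{
  assert(L <= U);
  uintptr_t pL = (uintptr_t)(p + L);
  uintptr_t pH = (uintptr_t)(p + U);
  uintptr_t qL = (uintptr_t)(q + L);
  uintptr_t qH = (uintptr_t)(q + U);
  return pH < qL || pL > qH;
}

void copy_range(float *p, const float *q, long L, long U)
{
  if (ranges_disjoint(p, q, L, U)) {
    /* No data dependence: this version is safe to vectorize. */
    for (long i = L; i <= U; i++)
      p[i] = q[i];
  } else {
    /* Possible overlap: keep the sequential order. */
    for (long i = L; i <= U; i++)
      p[i] = q[i];
  }
}
```

Both branches contain the same source loop. The point of the test is that the compiler can generate vector code for the first branch and scalar code for the second, choosing between them at run time.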