HPC - Universidad de Sevilla

Single-thread optimization and vectorization

Go and get the best compiler you can afford!

gcc benchmark

Source: gcc benchmark

Conclusions from another gcc-icc benchmark:

In this project we analyzed the impact of various compiler optimizations on program performance using two widely used state-of-the-art compiler suites, the GNU C Compiler and Intel’s C/C++ Compiler, on the PARSEC benchmark suite. The results indicate that Intel’s compiler with optimization (O2, O3) in general outperforms the GNU C compiler in almost all the parameters we observed. Speedup is the factor we primarily took into account. We observed that the overall speedup is best for icc-O2 for 1, 2, 4 and 8 threads. The overall speedup with icc-O2 is almost 3 times that of gcc without any optimization.

Own experience:

Results on cluster CICA (gcc 4.4.7)
====================================

mandelbrot1..3(img,4096,-1.2,-0.7,0.5,0.0);
Time1: 9.259019e+00 Flops: 1.292850e+10 GFlops/s: 1.396314e+00
Time2: 1.023380e+01
Time3: 1.964751e+00

Results on cluster CICA (gcc 4.8.2)
====================================

mandelbrot1..3(img,4096,-1.2,-0.7,0.5,0.0);
Time1: 5.804896e+00 Flops: 1.292845e+10 GFlops/s: 2.227163e+00
Time2: 6.173179e+00
Time3: 2.076520e+00

Learn about your CPUs

For this course:

General "recipe" for vectorization/optimization

Also check this recent article by James Reinders: AVX-512 Programming

  1. Extract relevant code (profile if necessary)
  2. Measure the timing of the starting point (baseline)
  3. Check if there is a good library (see Libraries) to use for the innermost part of your code. If this is the case you are probably done.
  4. If not: (START)
    1. Inline frequently called functions
    2. Beware of aliasing when using pointers
    3. Align your data
    4. Understand what is vectorizable (see examples and some tests)
    5. Read the vectorization report: compile with -ftree-vectorize -ftree-vectorizer-verbose=2. For gcc 5.2 use -fopt-info-vec-missed or -fopt-info-vec-optimized.
    6. If your compiler supports OpenMP 4, use #pragma omp simd
    7. Rewrite innermost loops if not vectorized
      1. Avoid if () by splitting loops, or convert if () into conditional expressions (?:)
    8. Recompile and check timing. Goto START. If no improvement:
      1. what does the compiler actually create? gcc -S ...
      2. Do I REALLY want to use intrinsics?
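
Several of the steps above (aliasing, alignment, #pragma omp simd, replacing if () by a conditional expression) can be sketched on one small kernel. This is only an illustration under stated assumptions: the kernel and the names scale_positive, alloc_aligned_floats and is_aligned_32 are hypothetical and not part of the course material, and the 32-byte alignment assumes AVX-sized registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical kernel: out[i] = s * in[i] where in[i] > 0, else 0. */

/* Step 2 (aliasing): restrict promises that in and out never overlap. */
void scale_positive(float *restrict out, const float *restrict in,
                    float s, size_t n)
{
    /* Step 6: OpenMP 4 hint. It has no effect unless the code is compiled
       with -fopenmp or -fopenmp-simd. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        /* Step 7.1: a conditional expression instead of an if () branch;
           compilers can usually map this to a vector select. */
        out[i] = (in[i] > 0.0f) ? s * in[i] : 0.0f;
    }
}

/* Step 3 (alignment): allocate n floats on a 32-byte boundary.
   C11 aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Returns NULL on failure. */
float *alloc_aligned_floats(size_t n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) / 32 * 32;
    return aligned_alloc(32, rounded);
}

/* Returns 1 if p sits on a 32-byte boundary, 0 otherwise. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

One way to try it: call the functions from a small main(), compile with something like gcc -std=c11 -O3 -fopenmp-simd -fopt-info-vec-optimized, and check whether the loop in scale_positive is reported as vectorized. Whether it is depends on the compiler version and target.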

Example 1 - Scale/add of a vector

Goto Example
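
The linked example is not reproduced here. As a rough idea of what a scale/add kernel and its baseline measurement might look like, here is a minimal sketch. The names scale_add, wall_seconds and scale_add_gflops are hypothetical, the timer assumes a POSIX system, and the flop count assumes one multiply and one add per element.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std=c11 */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Scale/add of a vector: y[i] = a * x[i] + y[i].
   restrict tells the compiler that x and y do not overlap. */
void scale_add(double a, const double *restrict x,
               double *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Monotonic wall-clock time in seconds, for timing the baseline. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* GFlops/s for one pass over n elements, counting 2 flops per element. */
double scale_add_gflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)n / seconds * 1e-9;
}
```

To time the kernel, read wall_seconds() before and after a call to scale_add on large arrays and pass the difference to scale_add_gflops. Repeating the measurement a few times reduces noise.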

Example 2 - Mandelbrot set

Goto Example

Alternatives

Example Cilk Plus

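B[0:7] = 5;        // section [start:length]: sets B[0] .. B[6] to 5
B[7:3] = 4;        // sets B[7] .. B[9] to 4
A[:] = B[:] + 5;   // element-wise operation over the whole arrays

sum = __sec_reduce_add (A[:]);   // built-in reduction: sum of all elements of A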

#pragma simd 
  for (i=0; i<n; i++)
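
Cilk Plus array notation needs a compiler that still ships it (GCC removed Cilk Plus in version 8). For comparison, the array-notation lines above can be written as plain C loops, with an OpenMP 4 simd reduction standing in for __sec_reduce_add. This is a sketch only: the array length N = 10 and the function name array_notation_equivalent are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 10 };  /* array length, assumed for illustration */

/* Plain-C rendering of the array-notation lines; returns the sum of A. */
int array_notation_equivalent(int A[N], int B[N])
{
    assert(A != NULL && B != NULL);

    for (size_t i = 0; i < 7; i++)      /* B[0:7] = 5;  start 0, length 7 */
        B[i] = 5;
    for (size_t i = 7; i < 7 + 3; i++)  /* B[7:3] = 4;  start 7, length 3 */
        B[i] = 4;
    for (size_t i = 0; i < N; i++)      /* A[:] = B[:] + 5; */
        A[i] = B[i] + 5;

    int sum = 0;
    /* sum = __sec_reduce_add(A[:]); the reduction clause tells the
       compiler that sum may be accumulated in vector lanes. */
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < N; i++)
        sum += A[i];
    return sum;
}
```

The pragma only takes effect when compiling with -fopenmp or -fopenmp-simd. Without it the loops are still valid C and remain candidates for auto-vectorization.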

Example Fortran

REAL, DIMENSION(10) :: A, B
B=A

FORALL(i = 1:n) a(i, i) = x(i)

Amdahl's law

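Vectorized code is data-parallel, so Amdahl's law applies:

T[n] = T[1] * (B + (1 - B)/n), where B is the fraction of serial code and n is the number of parallel units (for vectorized code, the number of vector lanes)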
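
To get a feeling for the numbers, the formula can be evaluated directly. The following is a minimal sketch; the function names amdahl_time and amdahl_speedup are hypothetical.

```c
#include <assert.h>

/* Run time on n parallel units: T[n] = T[1] * (B + (1 - B) / n),
   with B the serial fraction (0 <= B <= 1). */
double amdahl_time(double t1, double serial_fraction, int n)
{
    assert(n >= 1);
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    return t1 * (serial_fraction + (1.0 - serial_fraction) / n);
}

/* Speedup T[1] / T[n] for a given serial fraction and n parallel units. */
double amdahl_speedup(double serial_fraction, int n)
{
    return 1.0 / amdahl_time(1.0, serial_fraction, n);
}
```

For example, with B = 0.5 and n = 4 lanes a run of 10 s shrinks only to 6.25 s, a speedup of 1.6. Even with arbitrarily many lanes the speedup cannot exceed 1/B = 2.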


References