HPC - Universidad de Sevilla

NUMA aspects with OpenMP

Up to this point our examples have not made intensive use of memory. Most HPC computations, however, work on "big" objects in memory: to do the calculations we have to read the data from memory and write the results back to memory.

Many problems are limited by memory bandwidth, i.e. the machine spends more time reading and writing data from/to memory than doing the actual computations. The triad kernel below, for example, moves 24 bytes from or to memory for only two floating-point operations per iteration.

STREAM benchmark

See the STREAM web page.

Four functions are tested:

//COPY
for (j=0; j<STREAM_ARRAY_SIZE; j++)
  c[j] = a[j];

//SCALE
for (j=0; j<STREAM_ARRAY_SIZE; j++)
  b[j] = scalar*c[j];

//ADD
for (j=0; j<STREAM_ARRAY_SIZE; j++)
   c[j] = a[j]+b[j];

//TRIAD
for (j=0; j<STREAM_ARRAY_SIZE; j++)
   a[j] = b[j]+scalar*c[j];
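
STREAM derives the "Best Rate MB/s" values reported below from the number of bytes each kernel moves and the fastest of several timed repetitions. A minimal sketch of this bookkeeping for the triad kernel (an illustration, not the original STREAM source; compiled with something like gcc -O2 -fopenmp):

#include <stdio.h>
#include <omp.h>

#define STREAM_ARRAY_SIZE 20000000   // must be much larger than the caches
#define NTIMES 10

static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

int main(void)
{
  double scalar = 3.0, mintime = 1.0e30;
  long j;

  // parallel initialization (its NUMA consequences are discussed below)
  #pragma omp parallel for
  for (j=0; j<STREAM_ARRAY_SIZE; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

  for (int k=0; k<NTIMES; k++) {
    double t = omp_get_wtime();
    #pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
      a[j] = b[j] + scalar*c[j];               // TRIAD
    t = omp_get_wtime() - t;
    if (t < mintime) mintime = t;              // keep the best (minimum) time
  }

  // the triad touches three arrays of doubles per iteration: 24 bytes
  double bytes = 3.0 * sizeof(double) * (double)STREAM_ARRAY_SIZE;
  printf("Triad best rate: %.1f MB/s\n", 1.0e-6 * bytes / mintime);
  return 0;
}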

Results on scadm01 node:

export OMP_NUM_THREADS=1
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            7523.5     0.172573     0.170134     0.174811
Scale:           7547.7     0.173014     0.169588     0.174938
Add:             8363.8     0.232955     0.229562     0.235681
Triad:           8401.7     0.232630     0.228524     0.235479

Scaling behaviour (Triad function):

Threads   Best Rate MB/s
 2          16819.8   
 4          32447.8 
 8          44280.2 
12          45524.5 
24          43230.5

Why do 4 threads already deliver almost the full bandwidth? With 8 threads the bandwidth is practically saturated.

Now let's write our own version of the STREAM benchmark.

Own stream benchmark, version 1

See the file src/openmp/stream/stream1.c

  printf("initializing ..\n");
  for (i=0;i<NMAX;i++) {
    a[i]=1.0;
    b[i]=2.0;
    c[i]=0.0;
  }

  #pragma omp parallel
  {
    #pragma omp master
    {
      nt =  omp_get_num_threads();
      printf("running on %d threads, repeating %d times\n",nt,NREPEAT);
    }

    #pragma omp single
    secs=omp_get_wtime();

    #pragma omp for
    for (i=1;i<NMAX;i++) {
      c[i] = a[i] + scal * b[i];
    }

    #pragma omp single
    secs=omp_get_wtime()-secs;

  }

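To compare the measured time with the STREAM numbers above, secs can be converted into a bandwidth. A possible addition after the parallel region (a sketch; it assumes, as in the listing, that a, b and c are arrays of NMAX doubles):

  // the triad reads a[i] and b[i] and writes c[i]: 3 * 8 bytes per element
  double bytes = 3.0 * sizeof(double) * (double)NMAX;
  printf("time %.6f s, bandwidth %.1f MB/s\n", secs, 1.0e-6 * bytes / secs);
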
Exercise:

First touch placement

On Linux, a memory page is physically placed on the NUMA node of the thread that first writes ("touches") it, not when the memory is allocated. In version 1 all arrays are initialized by the master thread alone, so every page ends up on that thread's NUMA node: the other threads have to fetch their data across the inter-socket link, and the memory bandwidth of the second node is left unused.

[Figure: Numa1]

[Figure: Numa2]

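One way to see first touch in action is to ask the kernel on which NUMA node a page actually resides. A sketch using the Linux move_pages() call (query only, nothing is moved); it assumes the libnuma development headers are installed (numaif.h, link with -lnuma), and the serial initialization mimics version 1:

#include <stdio.h>
#include <unistd.h>   // sysconf()
#include <numaif.h>   // move_pages()

#define NMAX 20000000
static double a[NMAX];

int main(void)
{
  long i, pagesize = sysconf(_SC_PAGESIZE);

  // serial initialization: every page is first touched by this one thread
  for (i=0; i<NMAX; i++) a[i] = 1.0;

  // query the NUMA node of a few pages spread over the array
  for (i=0; i<NMAX; i += NMAX/8) {
    void *page = (void *)((unsigned long)&a[i] & ~(unsigned long)(pagesize-1));
    int status = -1;
    // with nodes == NULL, move_pages() only reports the node in 'status'
    if (move_pages(0 /* this process */, 1, &page, NULL, &status, 0) == 0)
      printf("a[%ld] lives on NUMA node %d\n", i, status);
  }
  return 0;
}

After a serial initialization all pages should report the same node; with the parallel initialization of version 2 below they should be spread over the nodes.
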
Own stream benchmark, version 2

  #pragma omp parallel
  {
    #pragma omp master
    {
      nt = omp_get_num_threads();
      printf("running on %d threads\n",nt);
      printf("initializing ..\n");
    }

    // parallel initialization: each thread first touches the elements it will
    // later work on, so the pages are distributed over the NUMA nodes
    #pragma omp for
    for (i=0;i<NMAX;i++) {
      a[i]=1.0;
      b[i]=2.0;
      c[i]=0.0;
    }

    #pragma omp single
    secs=omp_get_wtime();

    // triad kernel
    #pragma omp for
    for (i=0;i<NMAX;i++) {
      c[i] = a[i] + scal * b[i];
    }

    #pragma omp single
    secs=omp_get_wtime()-secs;

  }

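Note that the parallel initialization only pays off if the compute loop distributes its iterations over the threads in the same way as the initialization loop did. With the default loop schedule this is usually the case, but it is implementation-defined; writing schedule(static) on both loops makes it explicit. A sketch of the two relevant loops (not the actual source of version 2):

    #pragma omp for schedule(static)
    for (i=0;i<NMAX;i++) {          // each thread first touches its own chunk
      a[i]=1.0; b[i]=2.0; c[i]=0.0;
    }

    #pragma omp for schedule(static)
    for (i=0;i<NMAX;i++) {          // the same thread later works on the same chunk,
      c[i] = a[i] + scal * b[i];    // i.e. on memory that is local to its NUMA node
    }
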
Exercises:

What is the memory layout of our cluster nodes? Remember the command lstopo?

lstopo

Show the running threads of the program "stream" (does not work on Cygwin):

top        # press: f -> j,  u -> <user>,  H  (H shows the individual threads)

~/hpc/bin/show_threads stream

How can we control the "CPU affinity" of the threads? With gcc (libgomp) this is done via the environment variable GOMP_CPU_AFFINITY:

export GOMP_CPU_AFFINITY="0"                     # all threads on CPU 0
export GOMP_CPU_AFFINITY="0-7"                   # thread 0 on CPU 0, ..., thread 7 on CPU 7, then repeat
export GOMP_CPU_AFFINITY="0 6 12 18 2 8 14 20"   # explicit list of CPU numbers

Note: With the Intel compiler you use KMP_AFFINITY (see OpenMP Thread Affinity Control)
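
To check that an affinity setting really does what we expect, each thread can report the core it is currently running on. A small sketch using glibc's sched_getcpu() (not part of OpenMP itself; compile with gcc -fopenmp):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   // sched_getcpu()
#include <omp.h>

int main(void)
{
  #pragma omp parallel
  {
    // every thread reports the CPU it is running on right now
    printf("thread %d of %d runs on CPU %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}

Running it with different GOMP_CPU_AFFINITY settings shows where the threads end up.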