From Code to Performance

MPI and OpenMP Programming

Hybrid MPI + OpenMP: mapping & best practices

Why hybrid?

Mapping strategy

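Example run (Open MPI), assuming a dual-socket node with 24 cores per socket:

export OMP_NUM_THREADS=24
export OMP_PROC_BIND=close
export OMP_PLACES=cores
mpirun -np 2 --map-by socket --bind-to socket ./my_app

This runs 2 MPI ranks (one per socket), and each rank spawns 24 OpenMP threads. Each rank is bound to its socket, and OMP_PLACES=cores with OMP_PROC_BIND=close pins each thread to its own core within that socket. Binding a rank to a single core with --bind-to core would instead confine all 24 of its threads to that one core.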

Overlap communication with computation

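Use MPI_Isend/MPI_Irecv together with OpenMP parallel regions, and make sure the computation that overlaps with the transfer does not modify the send buffers or access the receive buffers until the matching MPI_Wait or MPI_Waitall has returned. In a threaded program, initialize MPI with MPI_Init_thread and request the level you need: MPI_THREAD_FUNNELED if only the main thread calls MPI, MPI_THREAD_MULTIPLE if any thread may call it concurrently.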

Hybrid pattern:

#pragma omp parallel
{
    // split threads into communication and compute roles, or let all threads compute
    // post nonblocking MPI communications (outside or inside the parallel region)
    // compute locally
    // wait for communications
}

Notes:

Algorithmic & data-level optimizations (practical recipes)

  1. Choose the right algorithms
    • For large inputs, algorithmic complexity dominates micro-optimizations. If an O(N^2) algorithm can be replaced by an O(N log N) or O(N) one, do that first.

  1. Data locality and cache blocking (tiling)
    • For matrix operations, reorganize the loops to operate on cache-sized tiles so that data is reused while it is still in cache, before it is evicted.

    Naive triple loop for matrix multiply:

    for(i) for(j) for(k) C[i][j] += A[i][k] * B[k][j];

    The cache-blocked version processes sub-blocks (tiles) of size T x T, where the tile size T is chosen so that the working set of the inner loops stays resident in cache.
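    A minimal sketch of such a cache-blocked multiply is shown here. The function name matmul_blocked, the flat row-major storage, and the tile size of 64 are assumptions made for illustration; the right tile size depends on the cache sizes of the target machine and should be found by measurement.

```c
#include <stddef.h>

#define TILE 64  /* assumed tile edge; tune it for the target cache sizes */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices, processed tile by tile.
 * C must be initialized by the caller (for example to zero). */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    /* Each thread owns a distinct band of rows of C, so there is no race. */
    #pragma omp parallel for schedule(static)
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile ends so n need not be a multiple of TILE. */
                size_t i_end = min_sz(ii + TILE, n);
                size_t k_end = min_sz(kk + TILE, n);
                size_t j_end = min_sz(jj + TILE, n);

                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        /* Unit-stride inner loop over j: vectorizable. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

    The i-k-j order inside a tile keeps the innermost accesses to B and C contiguous, which also helps the compiler vectorize the loop. For production code, a tuned BLAS dgemm will normally outperform a hand-written version like this one.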

    1. Vectorization
      • Use -O3 -march=native and #pragma omp simd where appropriate.
      • Ensure contiguous (unit-stride) memory access in the inner loop and avoid pointer aliasing (use restrict in C, or the compiler extension __restrict in C++).
    1. Reduce communication
      • Aggregate small messages into larger ones.
      • Use collectives optimized by the MPI implementation.
      • Reorganize computation to reduce frequency of global synchronizations.
    1. Overlap computation & communication
      • Post non-blocking sends/receives early, compute independent work, then complete.
    1. Avoid synchronization hot spots
      • Replace global barriers with per-node or per-team synchronization if possible.
      • Use reductions instead of critical sections where applicable.
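Two of the recipes above, vectorization with restrict and reductions in place of critical sections, can be sketched in a few lines. The function names axpy and dot are hypothetical and used only for illustration. With GCC or Clang the pragmas take effect when compiling with -fopenmp (or -fopenmp-simd for the simd directive alone); without those flags the pragmas are ignored and the code still runs correctly in serial.

```c
#include <stddef.h>

/* y[i] += a * x[i]. restrict promises the compiler that x and y do not
 * overlap, which removes the aliasing check that can block vectorization. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Dot product. Each thread accumulates a private partial sum that OpenMP
 * combines at the end, so no critical section or atomic is needed. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum) schedule(static)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because a parallel reduction changes the order of the additions, the floating-point result of dot can differ in the last bits from a serial run; comparisons against a reference should use a tolerance.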

Compiler and Runtime Optimizations

Compiler flags:

Runtime settings:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
mpirun -np 4 --map-by socket ./hybrid_matrix

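Use environment variables to control thread placement and binding. On NUMA systems, both the number of threads per rank and the binding strategy are crucial for performance: OMP_PLACES=cores gives each thread its own core, and OMP_PROC_BIND=close places a rank's threads on neighbouring cores.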

Advanced Optimization Tips

  1. Cache blocking: ensure your tiles fit in the L1/L2/L3 caches to reduce memory traffic.
  1. Loop fusion: combine multiple loops accessing the same data to increase temporal locality.
  1. Asynchronous MPI: overlap computation and communication to reduce idle time.
  1. Thread-private buffers: avoid contention in reductions or temporary arrays.
  1. Hybrid tuning: experiment with MPI ranks per node vs. threads per rank to balance memory use and bandwidth.
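The thread-private buffer tip can be illustrated with a histogram, where every thread fills its own local array and the arrays are merged once at the end. This is a sketch under assumptions: the function name histogram and the fixed bin count NBINS are hypothetical, and the keys are assumed to be already mapped to valid bin indices.

```c
#include <stddef.h>

#define NBINS 16  /* assumed number of bins for this example */

/* Count how many keys fall into each bin. Every keys[i] must lie in
 * [0, NBINS); hist must have room for NBINS entries. */
void histogram(size_t n, const int *keys, long *hist)
{
    for (int b = 0; b < NBINS; b++)
        hist[b] = 0;

    #pragma omp parallel
    {
        /* Declared inside the parallel region, so private to each thread:
         * the hot loop below needs no atomics or locks. */
        long local[NBINS] = {0};

        #pragma omp for nowait schedule(static)
        for (size_t i = 0; i < n; i++)
            local[keys[i]]++;

        /* One short merge per thread instead of one lock per element. */
        #pragma omp critical
        for (int b = 0; b < NBINS; b++)
            hist[b] += local[b];
    }
}
```

The critical section runs only once per thread, so its cost is negligible compared with synchronizing on every increment. For large bin counts, the merge itself can become the bottleneck and may need a different strategy.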

Summary

  1. Optimize locally first: cache blocking, vectorization, OpenMP threading.
  1. Scale across nodes: use MPI for distribution and non-blocking communication.
  1. Profile and analyze: identify hotspots, communication delays, and memory inefficiencies.
  1. Iterate: adjust loop tiling, thread mapping, and MPI rank counts for maximal efficiency.
  1. Validate correctness: ensure numerical results are consistent and race conditions are avoided.

End Result: An HPC program that scales efficiently across cores and nodes, uses memory hierarchies effectively, and minimizes communication overhead - the full spectrum of MPI + OpenMP optimizations.