MPI and OpenMP Programming
Hybrid MPI + OpenMP: mapping & best practices
Why hybrid?
- Reduce the number of MPI ranks to cut per-rank memory overhead and communication cost.
- Use OpenMP threads to exploit the many cores per socket while staying NUMA-aware.
Mapping strategy
- One MPI rank per socket (NUMA domain), with nthreads = cores_per_socket.
- Or one MPI rank per node with fewer threads, to avoid contention if memory bandwidth per thread is limited.
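Example run (Open MPI, assuming one dual-socket node with 24 cores per socket):
export OMP_NUM_THREADS=24
mpirun -np 2 --map-by socket --bind-to socket ./my_app
This runs 2 MPI ranks (one per socket), and each rank spawns 24 OpenMP threads confined to the cores of its socket. Binding each rank to a single core (--bind-to core) would instead confine all 24 threads of a rank to that one core, because the threads inherit the affinity mask of the rank.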
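The arithmetic behind this mapping can be sketched in a few lines of C. This is only an illustration: the struct `node_layout` and the functions `threads_per_rank` and `print_layout` are hypothetical names, and the node shape used in the note below (2 sockets with 24 cores each) is an assumption taken from the example run.

```c
#include <stdio.h>

/* Hypothetical description of one compute node. */
struct node_layout {
    int sockets_per_node;
    int cores_per_socket;
};

/* Threads each rank should run so that ranks_per_node ranks fill the node
   with one thread per core. Returns -1 if the inputs are not positive or
   the cores do not divide evenly among the ranks. */
int threads_per_rank(struct node_layout n, int ranks_per_node)
{
    if (n.sockets_per_node <= 0 || n.cores_per_socket <= 0 || ranks_per_node <= 0)
        return -1;
    int cores = n.sockets_per_node * n.cores_per_socket;
    if (cores % ranks_per_node != 0)
        return -1;
    return cores / ranks_per_node;
}

/* Print the OMP_NUM_THREADS value that matches a given rank count. */
void print_layout(struct node_layout n, int ranks_per_node)
{
    int t = threads_per_rank(n, ranks_per_node);
    if (t > 0)
        printf("ranks/node=%d -> OMP_NUM_THREADS=%d\n", ranks_per_node, t);
    else
        printf("ranks/node=%d does not divide the node evenly\n", ranks_per_node);
}
```

With a layout of `{2, 24}`, two ranks per node give 24 threads each (one rank per socket), and one rank per node gives 48 threads.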
Overlap communication with computation
Use MPI_Isend/MPI_Irecv around or inside OpenMP parallel regions, and make sure the computation that follows does not touch the send/recv buffers until the matching MPI_Wait or MPI_Waitall has completed.
Hybrid pattern:
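#pragma omp parallel
{
    // split threads into communication and compute roles, or let all threads compute
    // one thread posts the nonblocking MPI calls (this can also happen before the parallel region)
    // all threads compute on data that is not in flight
    // one thread waits for the communications; the others synchronize before using received data
}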
Notes:
- Some MPI libraries do not fully support the MPI_THREAD_MULTIPLE level; initialize with MPI_Init_thread and check the thread level that is actually provided.
Algorithmic & data-level optimizations (practical recipes)
- Choose the right algorithms
  Algorithmic complexity usually dominates micro-optimizations. If an O(N^2) algorithm can be replaced by an O(N log N) or O(N) one, do that first.
- Data locality and cache blocking (tiling)
  For matrix operations, reorganize loops to operate on cache-sized tiles so that data is reused in cache before eviction.
  Naive triple loop for matrix multiply:
  for (i = 0; i < N; i++) for (j = 0; j < N; j++)
      for (k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j];
  The cache-blocked version processes subblocks of a chosen tile size to keep the working set resident in cache.
- Vectorization
  - Use -O3 -march=native and #pragma omp simd where appropriate.
  - Ensure contiguous memory access in the inner loop and avoid pointer aliasing (use restrict in C, or the compiler's __restrict extension in C++).
- Reduce communication
  - Use collectives optimized by the MPI implementation.
  - Reorganize computation to reduce the frequency of global synchronizations.
  - Aggregate small messages into larger ones.
- Overlap computation & communication
  - Post non-blocking sends/receives early, compute independent work, then complete.
- Avoid synchronization hot spots
  - Use reductions instead of critical sections where applicable.
  - Replace global barriers with per-node or per-team synchronization if possible.
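The cache-blocking recipe above can be sketched as follows for square row-major matrices. The function name `matmul_tiled` and the tile-size parameter `bs` are illustrative, and the best value of `bs` depends on the cache sizes of the target machine, so it is left as a tuning parameter rather than fixed here.

```c
#include <stddef.h>

/* Smaller of two sizes. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, processed in bs x bs tiles.
   bs is a tuning parameter; a reasonable starting point is a value for
   which three tiles of doubles fit in the cache level being targeted. */
void matmul_tiled(size_t n, size_t bs,
                  const double *A, const double *B, double *C)
{
    if (bs == 0)
        bs = 1; /* guard against a zero step in the tile loops */

    /* Each iteration of the outer loop owns a distinct block of rows of C,
       so no two threads ever write the same element. */
    #pragma omp parallel for schedule(static)
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_size(ii + bs, n);
                size_t k_end = min_size(kk + bs, n);
                size_t j_end = min_size(jj + bs, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        /* Unit-stride inner loop over a row of the B tile. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The i-k-j order inside a tile keeps the innermost accesses to `B` and `C` contiguous, which also helps the compiler vectorize the inner loop.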
Compiler and Runtime Optimizations
Compiler flags:
-O3: aggressive optimization.
-fopenmp: OpenMP support.
-ffast-math: only if strict floating-point semantics can be relaxed.
-funroll-loops: loop unrolling (it does not enable vectorization by itself).
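As an illustration of code these flags are meant for, here are two small kernels written so the compiler can vectorize them. The function names `axpy` and `dot` are illustrative. With GCC or Clang, the `omp simd` pragmas take effect under `-fopenmp` (or `-fopenmp-simd`, which enables only the simd directives); without either flag they are ignored and the code still compiles.

```c
#include <stddef.h>

/* y[i] += a * x[i].
   `restrict` promises the compiler that x and y do not overlap, which
   removes the aliasing check that can otherwise block vectorization. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Dot product. The reduction clause lets each SIMD lane keep its own
   partial sum, so the loop-carried dependence on `sum` does not prevent
   vectorization. */
double dot(size_t n, const double *restrict x, const double *restrict y)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Both loops access memory with unit stride, which is the access pattern vector units handle best. Note that a vectorized reduction may add the terms in a different order than the scalar loop, so results can differ in the last bits.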
Runtime settings:
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
mpirun -np 4 --map-by socket ./hybrid_matrix
Use environment variables to control thread placement and binding for better performance. The number of threads and the binding strategy are crucial for NUMA performance.
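One quick way to confirm that the thread count set through the environment actually reached the program is to count the threads of a parallel region. The sketch below uses only pragmas, so it also builds without OpenMP (and then reports 1); the function names `team_size` and `check_team_size` are hypothetical.

```c
#include <stdio.h>

/* Number of threads an OpenMP parallel region actually runs with.
   Without OpenMP support the pragmas are ignored and the result is 1. */
int team_size(void)
{
    int n = 0;
    #pragma omp parallel
    {
        /* Every thread of the team increments the shared counter once. */
        #pragma omp atomic
        n++;
    }
    return n;
}

/* Report the observed team size; returns 1 if it matches `expected`. */
int check_team_size(int expected)
{
    int got = team_size();
    printf("expected %d threads, running with %d\n", expected, got);
    return got == expected;
}
```

Calling `check_team_size(8)` early in each rank would flag a run where OMP_NUM_THREADS=8 was not picked up. For binding, setting OMP_DISPLAY_ENV=true asks the OpenMP runtime to print its effective settings at startup.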
Advanced Optimization Tips
- Cache blocking: size your tiles to fit in the L1/L2/L3 caches to reduce memory traffic.
- Loop fusion: combine multiple loops accessing the same data to increase temporal locality.
- Asynchronous MPI: overlap computation and communication to reduce idle time.
- Thread-private buffers: avoid contention in reductions or temporary arrays.
- Hybrid tuning: experiment with MPI ranks per node vs. threads per rank to balance memory and bandwidth.
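Two of these tips, loop fusion and thread-private accumulation, can be shown on one small kernel. This is a sketch with illustrative function names: the first version makes two passes over the array and serializes every update through a critical section, while the second makes a single pass and uses an OpenMP reduction, which gives each thread a private partial sum.

```c
#include <stddef.h>

/* Unfused: one pass to scale, a second pass to sum, and a critical
   section that lets only one thread update `sum` at a time. */
double scale_then_sum_unfused(size_t n, double a, double *x)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        #pragma omp critical
        sum += x[i];
    }
    return sum;
}

/* Fused: x[i] is scaled and accumulated while it is still in a register.
   The reduction clause keeps a private partial sum per thread and combines
   the partial sums once at the end of the loop. */
double scale_and_sum_fused(size_t n, double a, double *x)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        x[i] *= a;
        sum += x[i];
    }
    return sum;
}
```

Both versions compute the same result up to floating-point summation order; the fused version reads the array once instead of twice and removes the lock from the inner loop.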
Summary
- Optimize locally first: cache blocking, vectorization, OpenMP threading.
- Scale across nodes: use MPI for distribution and non-blocking communication.
- Profile and analyze: identify hotspots, communication delays, memory inefficiencies.
- Iterate: adjust loop tiling, thread mapping, MPI ranks for maximal efficiency.
- Validate correctness: ensure numerical results are consistent and race conditions are avoided.
End Result: An HPC program that scales efficiently across cores and nodes, uses memory hierarchies effectively, and minimizes communication overhead - the full spectrum of MPI + OpenMP optimizations.