Learning High-Performance Computing with an Online Toolkit

MPI and OpenMP Programming

Hybrid MPI + OpenMP: mapping & best practices

Why hybrid?

Reduce the number of MPI ranks to cut per-rank memory overhead and communication cost.
Use OpenMP threads to exploit the many cores on each socket while staying NUMA-aware.

Mapping strategy

One MPI rank per socket (NUMA domain), with nthreads = cores_per_socket.
Or one MPI rank per node with fewer threads than cores, to avoid contention when memory bandwidth per thread is limited.

Example run (Open MPI):

export OMP_NUM_THREADS=24

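mpirun -np 2 --map-by socket --bind-to socket ./my_app

On a dual-socket node with 24 cores per socket, this runs 2 MPI ranks (one per socket), and each rank spawns 24 OpenMP threads. Binding each rank to its socket rather than to a single core leaves all 24 cores available to that rank's threads; setting OMP_PLACES=cores and OMP_PROC_BIND=close then pins each thread to its own core.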

Overlap communication with computation

Use MPI_Isend/MPI_Irecv inside OpenMP parallel regions, and make sure the computation that follows does not touch the send/recv buffers until the matching MPI_Wait or MPI_Waitall has completed.

Hybrid pattern:

#pragma omp parallel
{
    // split threads into communication and compute roles, or let all threads compute
    // post nonblocking MPI communications (outside or inside the parallel region)
    // compute locally
    // wait for communications
}

Notes:

Not every MPI library fully supports MPI_THREAD_MULTIPLE. Initialize with MPI_Init_thread and check the thread-support level it returns before calling MPI from more than one thread.

Algorithmic & data-level optimizations (practical recipes)

Choose the right algorithms

At large problem sizes, algorithmic complexity dominates micro-optimizations. If an O(N^2) algorithm can be replaced by an O(N log N) or O(N) one, do that first.
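As a small illustration of the point (the duplicate-detection task and both function names are hypothetical, chosen only for this sketch), compare a pairwise O(N^2) check with a sort-based one. The second version relies on qsort, which the C standard does not bind to a complexity but which is typically O(N log N) in practice.

```c
#include <stdlib.h>
#include <string.h>

/* O(N^2): compare every pair of elements. */
int has_duplicate_naive(const int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (a[i] == a[j])
                return 1;
    return 0;
}

/* Three-way integer comparison for qsort, written to avoid overflow. */
static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Typically O(N log N): sort a copy, after which equal values are adjacent.
   Returns 1 if a duplicate exists, 0 if not, -1 if the copy cannot be allocated. */
int has_duplicate_sorted(const int *a, size_t n)
{
    if (n < 2)
        return 0;
    int *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return -1;
    memcpy(tmp, a, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_int);
    int found = 0;
    for (size_t i = 1; i < n && !found; i++)
        found = (tmp[i] == tmp[i - 1]);
    free(tmp);
    return found;
}
```

For a million elements the first version performs on the order of 10^12 comparisons and the second on the order of 10^7, a gap that no amount of loop tuning on the first version is likely to close.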

Data locality and cache blocking (tiling)

For matrix operations, reorganize loops to operate on cache-sized tiles, so that data is reused while it is still in cache rather than after it has been evicted.

Naive triple loop for matrix multiply:

for(i) for(j) for(k) C[i][j] += A[i][k] * B[k][j];

The cache-blocked version processes T x T sub-blocks (T is the tile size, named T here to avoid a clash with matrix B) so that the working set stays resident in cache.
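A minimal sketch of both versions in C follows. The function names, the flat row-major layout, and the tile edge TILE = 32 are assumptions made for illustration; a good tile size depends on the cache sizes of the target machine and is best found by measurement.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune so the active tiles fit in cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C += A * B for row-major n x n matrices. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Blocked version: the same arithmetic, visited tile by tile. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = min_sz(ii + TILE, n);
                size_t k_end = min_sz(kk + TILE, n);
                size_t j_end = min_sz(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        /* Unit-stride inner loop over j. */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Both functions accumulate into C, so the caller zeroes C first when a plain product is wanted. Within a tile the i-k-j loop order keeps the innermost accesses to B and C contiguous, which also helps the compiler vectorize.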

Vectorization

Use -O3 -march=native and #pragma omp simd where appropriate.
Ensure contiguous memory access in the inner loop and avoid pointer aliasing (use restrict in C, or the __restrict compiler extension in C++).
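The two points above can be combined in a short C sketch. The function name axpy is a hypothetical choice for this example; restrict is standard C99, and the pragma takes effect when the code is compiled with -fopenmp or -fopenmp-simd (without either flag it is ignored and the loop is left to the auto-vectorizer).

```c
#include <stddef.h>

/* y[i] += a * x[i] over contiguous arrays.
   restrict promises the compiler that x and y never overlap, which removes
   the aliasing doubt that would otherwise block or complicate vectorization. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The promise made by restrict is the caller's responsibility: passing overlapping x and y is undefined behavior. With GCC, -fopt-info-vec reports which loops were vectorized.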

Reduce communication

Use collectives optimized by the MPI implementation.
Reorganize the computation to reduce the frequency of global synchronizations.

Overlap computation & communication

Post non-blocking sends/receives early, compute independent work, then complete.

Avoid synchronization hot spots

Use reductions instead of critical sections where applicable.
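To make the contrast concrete, here is a minimal C sketch of a dot product written both ways (the function names are hypothetical). Compile with -fopenmp to run it in parallel; without that flag the pragmas are ignored and both functions run serially.

```c
/* Preferred: each thread accumulates a private partial sum, and OpenMP
   combines the partial sums once at the end of the loop. */
double dot_reduction(const double *x, const double *y, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Correct but slow: every single update serializes on one lock, so the
   threads spend most of their time waiting for each other. */
double dot_critical(const double *x, const double *y, long n)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp critical
        sum += x[i] * y[i];
    }
    return sum;
}
```

Both versions add the terms in an order that depends on the thread schedule, so floating-point results can differ in the last bits between runs or thread counts.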

Compiler and Runtime Optimizations

Compiler flags:

-O3 for aggressive optimization.
-fopenmp for OpenMP support.
-ffast-math only if floating-point precision can be relaxed.
-funroll-loops to unroll loops. This is separate from vectorization, which GCC and Clang already enable at -O3.

Runtime settings:

export OMP_NUM_THREADS=8

export OMP_PROC_BIND=close

export OMP_PLACES=cores

mpirun -np 4 --map-by socket ./hybrid_matrix

Use environment variables to control thread placement and binding for better performance. The number of threads and the binding strategy are crucial for NUMA performance.

Advanced Optimization Tips

Cache blocking: ensure your tiles fit in the cache level you are targeting (L1, L2, or L3) to reduce memory traffic.

Loop fusion: combine multiple loops accessing the same data to increase temporal locality.

Asynchronous MPI: overlap computation and communication to reduce idle time.

Thread-private buffers: avoid contention in reductions or temporary arrays.

Hybrid tuning: experiment with MPI ranks per node vs. threads per rank to balance memory footprint and bandwidth.
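Two of these tips, loop fusion and thread-private buffers, lend themselves to a short C sketch. The function names, the bin count NBINS, and the histogram task are hypothetical and chosen only for illustration. Compile with -fopenmp for parallel execution; without it the pragmas are ignored and the code runs serially with the same results.

```c
#include <string.h>

#define NBINS 16  /* assumed bin count for this example */

/* Loop fusion: scaling and summing happen in one pass, so each x[i] is
   summed while it is still in a register instead of being reloaded by a
   second loop over the array. */
double scale_and_sum(double *x, long n, double factor)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        x[i] *= factor;
        sum += x[i];
    }
    return sum;
}

/* Thread-private buffers: each thread counts into its own local histogram
   and merges it into the shared one exactly once.
   Assumes every key is non-negative. */
void histogram(const int *keys, long n, long hist[NBINS])
{
    memset(hist, 0, NBINS * sizeof hist[0]);
    #pragma omp parallel
    {
        long local[NBINS] = {0};  /* private: declared inside the region */

        #pragma omp for nowait
        for (long i = 0; i < n; i++)
            local[keys[i] % NBINS]++;

        /* One short critical section per thread, not one per element. */
        #pragma omp critical
        for (int b = 0; b < NBINS; b++)
            hist[b] += local[b];
    }
}
```

Declaring the buffer inside the parallel region places it on each thread's own stack, which also avoids false sharing between threads that would occur with adjacent slots of one shared array.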

Summary

Optimize locally first: cache blocking, vectorization, OpenMP threading.

Scale across nodes: use MPI for distribution and non-blocking communication.

Profile and analyze: identify hotspots, communication delays, and memory inefficiencies.

Iterate: adjust loop tiling, thread mapping, and MPI rank counts for maximal efficiency.

Validate correctness: ensure numerical results are consistent and race conditions are avoided.

End Result: An HPC program that scales efficiently across cores and nodes, uses memory hierarchies effectively, and minimizes communication overhead - the full spectrum of MPI + OpenMP optimizations.