From Code to Performance

Optimization in Context: MPI and OpenMP

High-level parallel models: shared vs distributed

  • OpenMP (shared memory) - threads share the same address space. Use for parallelism inside a node.
  • MPI (distributed memory - proocesses have private memory; data moves by explicit messages. use for mutli-node scaling.
  • Hybrid (MPI+OpenMP) - MPI between nodes, OpenMP inside a node; often best for NUMA and reduction MPI ranks.

Design choice: if you have many cores per socket and few sockets per node, prefer fewer MPI ranks with more threads (one rank per NUMA domain or socket). If your code is memory heavy per thread, fewer threads per rank may be better.

Optimization in Context: MPI and OpenMP

  • MPI optimization focuses on reducing communication overhead and ensuring load is balanced across distributed nodes.
  • OpenMP optimization focuses on efficient use of threads, minimizing contention, and avoiding false sharing in shared memory.
  • Together, they allow HPC applications to run faster, cheaper, and more reliably at scale.

Dimensions of Optimization

To understand optimization better, let's break it into levels:

  1. Algorithmic Optimization
    • Choosing better algorithms has the largest impact.
    • Example: Fast Fourier Transform (FFT) reduces complexity from O(n²) to O(n log n).
  2. Code-Level Optimization
    • Writing efficient loops, minimizing function calls, and reusing memory buffers.
    • Example: Replacing nested loops with vectorized operations.
  3. Parallelization Optimization
    • Using MPI and OpenMP to exploit concurrency.
    • Ensuring even workload distribution and minimizing idle time.
  4. Memory Optimization
    • Reducing cache misses and improving locality of reference.
    • Aligning data with memory hierarchy (L1, L2, L3 caches).
  5. Communication Optimization
    • Reducing data transfer and grouping messages.
    • Overlapping communication with computation.
  6. Hardware-Specific Optimization
    • Tailoring software to GPUs, vectorization (AVX), and NUMA architectures.
    • Example: Using CUDA/OpenACC or CPU vector intrinsics.