Learning High-Performance Computing with an Online Toolkit

MPI & OpenMP Programming

MPI deep dive (practical patterns & performance)

Core MPI Concepts

MPI_Init, MPI_Finalize - start/end.
MPI_Comm_size, MPI_Comm_rank - rank and communicator size.
Point-to-point: MPI_Send, MPI_Recv (blocking) MPI_Isend, MPI_Irecv, MPI_Wait (none-blocking) - send/receive messages.
Collectives: MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Alltoall, MPI_Scatter, MPI_Gather.
Derived datatypes — let MPI describe non-contiguous memory layouts to avoid manual packing.
Communicators — partition processes (e.g., MPI_Comm_split, MPI_Cart_create).
RMA/One-sided: MPI_Put, MPI_Get, MPI_Win*for remote memory access (potentially lower overheard).

Blocking vs non-blocking

Blocking calls (e.g.,MPI_Send) can block the caller until the message is safe to reuse.
Non-blocking (MPI_Isend/MPI_Irecv) allow overlap of communication and computation:

post MPI_Isend and MPI_Irecv.

Perform useful computation that does not touch the communicated buffers.

call MPI_Wait to synchronize and ensure completion.

Non-blocking pattern example:

MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, comm, &reqs[0]);

MPI_Irecv(recvbuf, count, MPI_DOUBLE, src, tag, comm, &reqs[1]);

// do independent work here (compute on other data)

MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

Reduce/Allreduce tuning

Prefer MPI_Allreduce to many point-to-point reduces when you need global results — MPI implementations implement optimized tree or recursive doubling algorithms.
For custom reductions use MPI_Op_create with correct commutativity

Derived datatypes

MPI derived datatypes allow processes to communicate non-contiguous data directly from memory without manually rearranging the data into temporary buffers. Instead of copying elements into a packed array before sending, programmers can define a datatype that describes the memory layout of the data. MPI then interprets the layout and transfers the data correctly between processes. This approach is especially useful in scientific computing, matrix operations, and high-performance computing (HPC) applications where large multidimensional datasets are common.

For example, when working with matrices distributed across multiple processors, a process may need to send only a portion of a matrix, such as a row block, column block, or submatrix. Since matrix data is often stored contiguously by rows in memory, sending a column or smaller rectangular region normally requires manually copying the desired elements into a contiguous buffer. MPI derived datatypes eliminate this extra step by allowing the programmer to define patterns such as strided memory layouts using functions like MPI_Type_vector, MPI_Type_indexed, or MPI_Type_create_subarray.

Derived datatypes are also valuable when transferring strided arrays, where elements are separated by fixed intervals in memory. A common example is sending every nth element of an array during domain decomposition or stencil computations. Instead of looping through the array and manually packing elements into a communication buffer, MPI can directly interpret the stride pattern and transmit only the required data.

In addition, MPI datatypes support the communication of complex structures that combine multiple data types, such as integers, doubles, and character arrays within a C struct. Using MPI_Type_create_struct, programmers can describe the exact memory offsets of each field, enabling MPI to transfer the structure correctly across systems with different architectures or alignment rules. This simplifies code maintenance and improves portability.

Another major advantage of derived datatypes is performance optimization. Manual packing and unpacking introduce additional memory copies, which increase CPU overhead and memory bandwidth usage. Derived datatypes can reduce or eliminate these extra copies. Many modern MPI implementations optimize datatype communication internally and may use zero-copy transfer techniques, where data is transmitted directly from application memory to the network interface without intermediate buffers. This can significantly improve communication efficiency, especially for large datasets and distributed-memory supercomputing applications.

Overall, MPI derived datatypes improve both programmer productivity and application performance by simplifying communication of irregular data layouts, reducing unnecessary memory operations, and enabling efficient high-performance data transfers in parallel applications.

Communicators and topology

Remote Memory Access (RMA), also known as one-sided communication in MPI, enables a process to directly read from or write to the memory of another process without requiring the target process to explicitly participate in the communication operation at that moment. Unlike traditional two-sided communication using MPI_Send and MPI_Recv, RMA separates data movement from synchronization, which can reduce communication overhead and improve performance in certain parallel computing patterns.

RMA is particularly useful for applications that involve frequent small updates, shared data regions, distributed counters, ghost cell exchanges, or irregular communication patterns. In these cases, traditional message passing may introduce unnecessary synchronization delays because both the sender and receiver must coordinate every communication event. With RMA, a process can independently access remote memory, which can significantly reduce latency when implemented carefully.

The foundation of MPI RMA communication is the MPI window object, created using functions such as MPI_Win_create. A window exposes a region of memory so that other processes in the communicator can access it remotely. Once the window is established, processes can perform one-sided operations like:

MPI_Put — writes data directly into another process’s memory window.
MPI_Get — retrieves data from another process’s memory window.
MPI_Accumulate — performs atomic update operations such as sums or increments on remote data.

For example, in a distributed simulation, neighboring processes may frequently update boundary or ghost-cell data. Instead of sending many small messages back and forth, a process can use MPI_Put to directly place updated values into the neighboring process’s exposed memory region. This reduces messaging overhead and can improve scalability on large systems.

However, synchronization semantics are critical in RMA programming because MPI must ensure memory consistency and correctness. MPI provides both active and passive synchronization models, and the choice depends on the communication pattern used by the application.

In active target synchronization, both the origin and target processes explicitly participate in synchronization operations. Functions such as MPI_Win_fence, MPI_Win_post, MPI_Win_start, MPI_Win_complete, and MPI_Win_wait coordinate access to shared windows. This model works well for structured communication phases where all processes synchronize together.

Passive target synchronization allows one process to access another process’s memory without the target process actively participating at that moment. Functions like MPI_Win_lock and MPI_Win_unlock are used to establish access epochs. Passive synchronization is especially effective for dynamic or asynchronous communication patterns because it reduces global synchronization overhead and enables greater overlap between communication and computation.

Although RMA can improve performance, improper synchronization may lead to race conditions, inconsistent memory states, or reduced scalability. Excessive locking or fine-grained updates can also create contention between processes. Therefore, programmers must carefully design communication patterns to balance synchronization costs with the performance benefits of direct remote memory access.

Overall, MPI RMA provides a powerful mechanism for low-latency communication in distributed-memory systems. By using MPI windows, one-sided operations such as MPI_Put and MPI_Get, and appropriate synchronization strategies, applications can achieve more efficient communication patterns and better scalability in high-performance computing environments.

One-sided RMA

False sharing is a performance problem that occurs in shared-memory parallel programming when multiple threads modify different variables that happen to reside on the same CPU cache line. Even though the threads are not accessing the exact same variable, the processor treats the entire cache line as shared data. As a result, the cache coherence protocol repeatedly invalidates and updates the cache line between processor cores, causing unnecessary memory traffic and significantly reducing performance.

Modern processors organize memory into cache lines, typically 64 bytes in size. When a thread writes to any variable within a cache line, that entire line becomes invalid in the caches of other processor cores. If another thread then writes to a different variable located in the same cache line, the cache line must again be transferred and synchronized between cores. This repeated movement of cache lines creates contention and slows down parallel execution, especially in high-performance computing (HPC) applications with frequent updates.

False sharing commonly appears in parallel loops where each thread updates its own counter, accumulator, or temporary variable stored in adjacent memory locations. For example, if an array of thread-specific counters is stored contiguously in memory, multiple counters may occupy the same cache line. Although each thread accesses only its own element, the hardware still detects writes to the shared cache line and triggers expensive coherence operations.

One common solution is to pad per-thread data structures so that each thread’s variables occupy separate cache lines. Padding inserts unused memory between variables to prevent two threads from writing to the same cache line. Another approach is to use arrays indexed by thread ID while ensuring that each element is aligned to the cache-line boundary. In many systems, aligning data to 64-byte boundaries ensures that each thread writes to a distinct cache line, minimizing contention.

For example, programmers may define structures with additional padding fields or use compiler directives and alignment attributes such as alignas(64) in C++ or cache-aligned memory allocators. This guarantees that thread-local variables are physically separated in memory, reducing cache invalidation overhead.

Eliminating false sharing can dramatically improve scalability and throughput in multithreaded applications because processor cores spend less time synchronizing cache states and more time performing useful computation. This optimization is particularly important in OpenMP programs, numerical simulations, parallel reductions, and other performance-critical workloads where many threads frequently update shared data structures.

Overall, understanding false sharing is essential for writing efficient parallel software. By carefully organizing memory layouts, padding thread-local data, and aligning variables to cache-line boundaries, programmers can avoid unnecessary cache coherence traffic and achieve significantly better multicore performance.

OpenMP deep dive (practical patterns & performance)

Important directives & clauses

#pragma omp parallel - create a parallel region.
#pragma omp for (or parallel for) - worksharing loops.
schedule (static/dynamic/guided) - how iterations map to threads.
private(var), firstprivate(var), lastprivate(var), shared(var) - data scoping.
reduction (+:sum) - safe parallel reductions.
#pragma omp critical/atomic - synchronization (use atomic when possible for lower overhead).
#pragma omp simd - hint to vectorize loop body.
#pragma omp task / taskloop - asynchronous task; good for irregular parallelism.
collapse(n) - distributed nested loops

Thread affinity & environment variables

Common environment variables (export before running):

 export OMP_NUM_THREADS=16        # number of threads per process

 export OMP_PROC_BIND=close      # bind threads close to the master thread

 export OMP_PLACES=cores         # place threads on cores (or sockets)

Affinity reduces cross-socket traffic and improves cache locality.

Avoid false sharing

OpenMP Tasks

OpenMP tasks provide a flexible way to express parallelism for workloads that are irregular, recursive, or dynamically generated at runtime. Unlike traditional loop-based parallelism, where iterations are divided statically among threads, tasking allows independent units of work to be created and scheduled dynamically by the OpenMP runtime system. This improves load balancing and enables better utilization of processor cores for applications with uneven or unpredictable workloads.

In a typical OpenMP tasking model, a parallel region is first created using #pragma omp parallel, which launches a team of worker threads. Inside the parallel region, the #pragma omp single directive ensures that only one thread is responsible for generating tasks, while the remaining threads immediately begin executing available tasks from the shared task queue.

For example, in the following structure:

#pragma omp parallel
#pragma omp single
{
for (i=0; i#pragma omp task firstprivate(i)
do_work(i);
}
#pragma omp taskwait
}

the single thread creates one task for each loop iteration. Each task independently executes the do_work(i) function. The firstprivate(i) clause is important because it ensures that each task receives its own private copy of the loop variable i at task creation time. Without this clause, tasks could incorrectly share the same loop variable value, leading to race conditions and incorrect results.

The #pragma omp taskwait directive forces synchronization by ensuring that all previously created tasks complete before execution continues beyond that point. This is necessary when later computations depend on the results of the generated tasks.

Task-based parallelism is especially beneficial for recursive algorithms, graph traversals, adaptive mesh refinement, tree structures, and irregular scientific workloads where the amount of work per iteration may vary significantly. Since tasks are scheduled dynamically, idle threads can “steal” tasks from busy threads, improving load balancing and scalability across multicore systems.

Newer versions of OpenMP introduce the taskloop directive, which simplifies task creation for loops and often reduces runtime overhead compared to manually generating tasks inside a loop. Instead of explicitly creating one task per iteration, taskloop automatically partitions loop iterations into groups of tasks that are scheduled efficiently by the runtime system.

For example:

#pragma omp parallel
#pragma omp single
{
#pragma omp taskloop
for (i=0; i } do_work(i); } }



The #pragma omp taskloop construct reduces task-management overhead by grouping iterations together, which can improve performance for fine-grained workloads where creating too many small tasks would otherwise become expensive. It also simplifies code readability and reduces programmer effort.
Overall, OpenMP tasking enables flexible and scalable parallel execution for dynamic workloads. By using constructs such as task, taskwait, and the newer taskloop directive, developers can efficiently parallelize irregular computations while minimizing synchronization costs and improving processor utilization in high-performance computing applications.




    Vectorization
    
        Guide the complier with #pragma omp simd or __attribute__((aligned(64))).
        Ensure loop bodies are simple, no hidden dependencies, and data is aligned.
        Use complier reports (e.g., -ftree-vectorizer-verbose or Intel reports) to check vectorization.