From Code to Performance

Glossary of Terms

General Computing & Programming

Algorithm: A step-by-step procedure for solving a problem or performing a computation.

Application Programming Interface (API): A defined way for software programs to communicate with each other.

Array: A collection of data items stored at contiguous memory locations.

Cache: A small, high-speed memory used to temporarily store frequently accessed data to improve performance.

CPU (Central Processing Unit): The main processor that performs computation in a computer system.

Compiler: A program that translates source code into machine code executable by a computer.

Debugging: The process of finding and fixing errors in software code.

Latency: The time delay between initiating a request and receiving a response.

Throughput: The amount of work a system can perform in a given amount of time.

Git & GitHub Terminology

Branch: A parallel version of a repository that allows independent development without affecting the main codebase.

Clone: A local copy of a remote GitHub repository.

Commit: A snapshot of your project’s changes, including a message describing what was changed.

Conflict: A situation that arises when changes from different branches contradict each other and must be resolved manually.

Fork: A copy of someone else’s repository that you can modify freely under your account.

Git: A distributed version control system for tracking changes in source code.

GitHub: A web-based platform built on Git for hosting repositories, collaboration, and project management.

Merge: The process of combining changes from one branch into another.

Merge Conflict: Occurs when Git cannot automatically reconcile differences in code between two branches.

Pull: Fetches the latest changes from a remote repository and merges them into your current local branch.

Push: Uploads local commits to a remote repository (e.g., GitHub).

Pull Request (PR): A request to merge code changes from one branch or fork into another. It often includes peer review.

Repository: A project folder tracked by Git that contains code, files, and revision history.

Staging Area: The place (also called the index) where changes are collected and reviewed before being committed.

Version Control: A system for recording changes to files over time so that specific versions can be recalled later.

High-Performance Computing (HPC)

Cluster: A group of interconnected computers (nodes) that work together as a single system.

Core: An individual processing unit within a CPU.

High-Performance Computing (HPC): The use of supercomputers and clusters of computers to perform calculations far faster than a standard computer.

Node: A single computer in an HPC cluster, usually containing multiple CPUs (sockets).

Interconnect / Network Fabric: The communication system that links HPC nodes (e.g., InfiniBand, Omni-Path).

Job Scheduler: Software (like Slurm or PBS) that manages when and where jobs run across the cluster.

Parallel Computing: Dividing a computational problem into parts that can be executed simultaneously on multiple processors.

Scalability: The ability of a program to maintain efficiency as the number of processors increases.

Strong Scaling: Fix the total problem size and increase the number of processors to reduce runtime.

Weak Scaling: Increase both the problem size and number of processors so each processor’s workload stays constant.

Speedup (S): The ratio of serial runtime to parallel runtime. $S(P) = \frac{T(1)}{T(P)}$

Efficiency: Measures how effectively parallel resources are used. $E(P) = \frac{S(P)}{P}$
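
As a rough illustration of the speedup and efficiency definitions above, the sketch below applies them to a set of made-up timings. The function names and the timing values are hypothetical and only serve the example.

```python
def speedup(t_serial, t_parallel):
    """S(P) = T(1) / T(P)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_procs):
    """E(P) = S(P) / P."""
    return speedup(t_serial, t_parallel) / num_procs


# Hypothetical wall-clock timings in seconds, keyed by processor count P.
timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}

for p, t in timings.items():
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, p)
    print(f"P={p}: speedup={s:.2f}, efficiency={e:.2f}")
```

With these invented numbers, efficiency falls as P grows, which is the usual sign that overheads or serial work are starting to dominate.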

Load Balancing: Ensuring all processors have roughly equal work to avoid idle time.
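
One simple way to look at load balance, sketched here under the assumption that every work item costs the same, is to split the items as evenly as possible and report the ratio of the heaviest load to the mean load. The helper names `block_partition` and `imbalance` are illustrative and do not come from any library.

```python
def block_partition(num_items, num_workers):
    """Return the item count per worker; counts differ by at most one."""
    base, extra = divmod(num_items, num_workers)
    return [base + 1 if w < extra else base for w in range(num_workers)]


def imbalance(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


counts = block_partition(10, 4)  # 10 equal-cost items over 4 workers
print(counts, imbalance(counts))
```

An imbalance factor above 1.0 means the busiest worker sets the runtime while the others sit idle for part of it.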

Latency: The delay before data transfer begins following a communication request.

Bandwidth: The rate at which data can be transferred, typically measured in GB/s.
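
Latency and bandwidth are often combined in a simple first-order cost model: transfer time is roughly the latency plus the message size divided by the bandwidth. The sketch below assumes that model and uses hypothetical network numbers (1 microsecond latency, 10 GB/s bandwidth), not measurements of any real fabric.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send one message: latency + size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


LATENCY = 1e-6     # hypothetical: 1 microsecond
BANDWIDTH = 10e9   # hypothetical: 10 GB/s

for size in (8, 1_000_000, 1_000_000_000):
    t = transfer_time(size, LATENCY, BANDWIDTH)
    print(f"{size:>13} bytes: {t:.9f} s")
```

Under this model small messages are dominated by latency and large messages by bandwidth, which is why batching many small messages into one is a common optimization.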

Non-Uniform Memory Access (NUMA): Memory architecture where access time depends on which processor’s memory is used.

MPI Terminology

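MPI (Message Passing Interface): A standard for message-passing communication between processes on distributed-memory systems.

Rank: The unique integer ID of a process within a communicator.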

Communicator: A group of MPI processes that can communicate with one another (default: MPI_COMM_WORLD).

Collective Communication: MPI operations that involve all processes in a communicator (e.g., MPI_Bcast, MPI_Reduce, MPI_Allreduce).

Point-to-Point: Direct message exchange between two MPI processes (e.g., MPI_Send and MPI_Recv).

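Blocking: The function does not return until the operation is complete from the caller's point of view (for a send, until the send buffer is safe to reuse).

Non-Blocking: The function returns immediately, allowing computation to overlap with communication; completion is checked later (e.g., MPI_Wait, MPI_Test).

Remote Memory Access (RMA): One-sided communication in MPI that allows one process to read or write another process's memory directly (e.g., MPI_Put, MPI_Get) without the target posting a matching receive. Synchronization is still required (e.g., MPI_Win_fence).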

MPI Derived Datatypes: Custom data layouts defined for efficient transfer of structured or non-contiguous data.

MPI I/O: Parallel input/output routines for reading/writing large datasets collectively across multiple nodes.

OpenMP Terminology

OpenMP: An API for parallel programming on shared memory systems using compiler directives.

Thread: A lightweight execution unit that shares memory with other threads within a process.

Parallel Region: Code block that runs across multiple threads simultaneously (#pragma omp parallel).

Work-Sharing Construct: Divides tasks among threads (e.g., #pragma omp for, #pragma omp sections).

Reduction: Combines partial results from multiple threads into a single result safely.
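
The reduction pattern can be sketched outside OpenMP as well. The Python analogue below gives each worker a private partial sum and combines the partials at the end, which is the same idea an OpenMP `reduction(+:total)` clause expresses. It only illustrates the pattern: `parallel_sum` is a hypothetical helper, and Python threads will not speed up this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, num_threads=4):
    """Sum values via per-chunk partial sums that are combined at the end."""
    # Ceiling division, with a floor of 1 so an empty input is handled.
    chunk = max(1, (len(values) + num_threads - 1) // num_threads)
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each task writes only its own partial result, so no lock is needed.
        partials = list(pool.map(sum, chunks))
    # The final combine step runs on a single thread.
    return sum(partials)


total = parallel_sum(list(range(1, 101)))
print(total)
```

Because no two workers ever update the same accumulator, the result is safe without a critical section.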

Critical Section: A code block that only one thread can execute at a time, preventing data races.

False Sharing: Performance issue where threads on different cores write to distinct variables that sit on the same cache line, forcing that line to bounce between the cores' caches.

Thread Affinity: Binding threads to specific CPU cores to reduce thread migration and maximize cache reuse.

Private Variables: Each thread has its own copy.

Shared Variables: All threads access the same variable.

Optimization & Performance

Optimization: The process of improving a program’s performance by maximizing useful computation and minimizing bottlenecks (e.g., memory, communication, I/O).

Profiling: Analyzing a program’s runtime behavior to identify performance bottlenecks.

Roofline Model: A visual model that relates attainable performance (FLOP/s) to arithmetic intensity (FLOPs per byte moved) to show whether an application is compute-bound or memory-bound.

Arithmetic Intensity: Ratio of operations to data movement: $AI = \frac{\text{FLOPs}}{\text{Bytes moved}}$
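
As a worked example of arithmetic intensity, consider the update `y[i] = a * x[i] + y[i]` on 8-byte doubles. The sketch assumes no cache reuse and counts exactly three memory accesses per element (read x, read y, write y); real hardware may move more or less data than this simple count.

```python
def arithmetic_intensity(flops, bytes_moved):
    """AI = FLOPs / bytes moved."""
    return flops / bytes_moved


n = 1_000_000               # number of vector elements (arbitrary)
flops = 2 * n               # one multiply and one add per element
bytes_moved = 3 * 8 * n     # read x, read y, write y, 8 bytes each (assumed)

ai = arithmetic_intensity(flops, bytes_moved)
print(f"AI = {ai:.4f} FLOPs/byte")
```

A value this low (1/12 FLOP per byte) suggests the kernel is limited by memory traffic rather than by arithmetic on most machines.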

Vectorization: Performing multiple calculations simultaneously using vector registers (e.g., AVX instructions).

Cache Blocking: Reordering computations to reuse data in cache and minimize memory access.
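
The loop structure behind cache blocking can be sketched with a tiled matrix transpose. This is only an illustration of the loop reordering: pure Python on nested lists will not show the cache benefit that a compiled language would, and the function name and block size of 2 are arbitrary choices for the example.

```python
def transpose_blocked(matrix, block=2):
    """Transpose a rectangular list-of-lists matrix one tile at a time."""
    rows, cols = len(matrix), len(matrix[0])
    result = [[0] * rows for _ in range(cols)]
    for ii in range(0, rows, block):        # row origin of the current tile
        for jj in range(0, cols, block):    # column origin of the current tile
            # Finish the whole tile before moving on, so the data touched
            # stays within a small working set.
            for i in range(ii, min(ii + block, rows)):
                for j in range(jj, min(jj + block, cols)):
                    result[j][i] = matrix[i][j]
    return result


m = [[1, 2, 3],
     [4, 5, 6]]
print(transpose_blocked(m))
```

In a compiled implementation the block size would be tuned so that one tile of the input and one tile of the output fit in cache together.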

Memory Hierarchy: Organization of memory from fastest to slowest: Registers → L1 → L2 → L3 → RAM → Disk.

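Amdahl’s Law: Defines the theoretical speedup limit of a parallel system based on the sequential portion of code: $S = \frac{1}{(1 - P) + P/N}$, where $P$ is the fraction of the code that can run in parallel (so $1 - P$ is the sequential portion) and $N$ is the number of processors.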
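
To make the limit concrete, the sketch below evaluates Amdahl's formula for a code that is assumed to be 90% parallelizable. Here `parallel_fraction` plays the role of P and `num_procs` the role of N; the 0.9 figure is a hypothetical input.

```python
def amdahl_speedup(parallel_fraction, num_procs):
    """S = 1 / ((1 - P) + P / N)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / num_procs)


P = 0.9  # hypothetical: 90% of the runtime can be parallelized
for n in (1, 10, 100, 1000):
    print(f"N={n:>4}: S={amdahl_speedup(P, n):.2f}")

# As N grows, S approaches 1 / (1 - P), which is 10 for P = 0.9.
```

Even with 1000 processors the speedup stays below 10, because the 10% sequential portion never gets faster.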

Gustafson’s Law: Shows that as problem size grows, scaling efficiency can remain high even with more processors.
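
Gustafson's Law is commonly written as $S = (1 - P) + P \cdot N$, where $P$ is the parallel fraction of the scaled run and $N$ is the number of processors. That formula is the standard textbook form rather than something stated in this glossary, and the sketch below simply evaluates it for a hypothetical P of 0.9.

```python
def gustafson_speedup(parallel_fraction, num_procs):
    """Scaled speedup: S = (1 - P) + P * N."""
    return (1.0 - parallel_fraction) + parallel_fraction * num_procs


P = 0.9  # hypothetical parallel fraction of the scaled workload
for n in (1, 10, 100):
    print(f"N={n:>3}: scaled speedup={gustafson_speedup(P, n):.1f}")
```

Unlike Amdahl's fixed-size bound, the scaled speedup keeps growing with N because the problem size grows along with the processor count.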

Bottleneck: Any part of a system that limits overall performance, often the CPU, memory, or network bandwidth.

Tools & Frameworks

Perf: Linux profiling tool for measuring CPU cycles, cache misses, and branch mispredictions.

mpiP: Lightweight MPI profiler for analyzing communication patterns.

HPCToolkit: Advanced profiling and tracing tool for hybrid MPI + OpenMP programs.

Intel VTune: Commercial profiler that visualizes vectorization, cache, and thread performance.

Slurm: Popular job scheduler in HPC environments.

Valgrind: Memory debugging and leak detection tool.

ThreadSanitizer: Detects data races in multithreaded applications.

Key Formulas Summary

| Concept | Description | Formula |
| --- | --- | --- |
| Speedup (S) | How much faster parallel code runs | $S = T_1 / T_P$ |
| Efficiency (E) | Resource utilization | $E = S / P$ |
| Amdahl's Law | Limits of parallel speedup | $S = \frac{1}{(1 - P) + P/N}$ |
| Arithmetic Intensity (AI) | FLOPs per byte moved | $AI = \frac{\text{FLOPs}}{\text{Bytes}}$ |
| Roofline Bound | Theoretical performance cap | $P = \min(P_{\text{peak}}, BW \times AI)$ |
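
The roofline bound in the last row can be evaluated directly. The sketch below assumes a hypothetical machine with a peak of 1000 GFLOP/s and a memory bandwidth of 100 GB/s; neither number describes real hardware.

```python
def roofline_bound(peak_gflops, bandwidth_gbs, ai):
    """Attainable performance: min(P_peak, BW * AI), in GFLOP/s."""
    return min(peak_gflops, bandwidth_gbs * ai)


PEAK = 1000.0      # hypothetical peak compute, GFLOP/s
BANDWIDTH = 100.0  # hypothetical memory bandwidth, GB/s

for ai in (0.5, 10.0, 50.0):
    bound = roofline_bound(PEAK, BANDWIDTH, ai)
    regime = "memory-bound" if BANDWIDTH * ai < PEAK else "compute-bound"
    print(f"AI={ai:>5}: bound={bound:.1f} GFLOP/s ({regime})")
```

The crossover sits at AI = P_peak / BW, which is 10 FLOPs per byte for these numbers; kernels below that intensity are limited by memory bandwidth.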