Glossary of Terms
General Computing & Programming
Algorithm: A step-by-step procedure for solving a problem or performing a computation.
Application Programming Interface (API): A defined set of rules and interfaces that lets software programs communicate with each other.
Array: A collection of data items stored at contiguous memory locations.
Cache: A small, high-speed memory used to temporarily store frequently accessed data to improve performance.
CPU (Central Processing Unit): The main processor that executes instructions and performs computation in a computer system.
Compiler: A program that translates source code into machine code executable by a computer.
Debugging: The process of finding and fixing errors in software code.
Latency: The time delay between initiating a request and receiving a response.
Throughput: The amount of work a system can perform in a given amount of time.
Git & GitHub Terminology
Branch: A parallel version of a repository that allows independent development without affecting the main codebase.
Clone: A local copy of a remote GitHub repository.
Commit: A snapshot of your project’s changes, including a message describing what was changed.
Conflict: A situation that arises when changes from different branches contradict each other and must be resolved manually.
Fork: A copy of someone else’s repository that you can modify freely under your account.
Git: A distributed version control system for tracking changes in source code.
GitHub: A web-based platform built on Git for hosting repositories, collaboration, and project management.
Merge: The process of combining changes from one branch into another.
Merge Conflict: Occurs when Git cannot automatically reconcile differences in code between two branches.
Pull: Downloads the latest changes from a remote repository to your local one.
Push: Uploads local commits to a remote repository (e.g., GitHub).
Pull Request (PR): A request to merge code changes from one branch or fork into another. It often includes peer review.
Repository: A project folder tracked by Git that contains code, files, and revision history.
Staging Area: The place where changes are reviewed before being committed.
Version Control: A system for recording changes to files over time so that specific versions can be recalled later.
High-Performance Computing (HPC)
Cluster: A group of interconnected computers (nodes) that work together as a single system.
Core: An individual processing unit within a CPU.
High-Performance Computing (HPC): The use of supercomputers and clusters of computers to perform calculations far faster than a standard computer.
Node: A single computer in an HPC cluster, usually containing multiple CPUs (sockets).
Interconnect / Network Fabric: The communication system that links HPC nodes (e.g., InfiniBand, Omni-Path).
Job Scheduler: Software (like Slurm or PBS) that manages when and where jobs run across the cluster.
Parallel Computing: Dividing a computational problem into parts that can be executed simultaneously on multiple processors.
Scalability: The ability of a program to maintain efficiency as the number of processors increases.
Strong Scaling: Fix the total problem size and increase the number of processors to reduce runtime.
Weak Scaling: Increase both the problem size and number of processors so each processor’s workload stays constant.
Speedup (S): The ratio of serial runtime to parallel runtime. $S(P) = \frac{T(1)}{T(P)}$
Efficiency: Measures how effectively parallel resources are used. $E(P) = \frac{S(P)}{P}$
Load Balancing: Ensuring all processors have roughly equal work to avoid idle time.
Latency: The delay before data transfer begins following a communication request.
Bandwidth: The rate at which data can be transferred, typically measured in GB/s.
Non-Uniform Memory Access (NUMA): Memory architecture where access time depends on which processor’s memory is used.
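
The speedup and efficiency definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a benchmarking tool: the function names and the timing values are hypothetical, chosen only to show how the two metrics behave in a strong-scaling run.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S(P) = T(1) / T(P): serial runtime divided by parallel runtime."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """E(P) = S(P) / P: speedup divided by the number of processors."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup(t_serial, t_parallel) / processors


# Hypothetical strong-scaling timings in seconds (fixed total problem size).
timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}

for p, t in timings.items():
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, p)
    print(f"P={p:2d}  T={t:6.1f}s  S={s:5.2f}  E={e:5.2f}")
```

With these made-up numbers, 8 processors give a speedup of 5.0 and an efficiency of 0.625, which is the typical strong-scaling pattern: runtime keeps dropping while efficiency declines.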
MPI Terminology
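MPI (Message Passing Interface): A standard API for message-passing communication between processes in distributed-memory parallel programs.
Rank: The unique integer ID of a process within a communicator, numbered from 0.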
Communicator: A group of MPI processes that can communicate with one another (default: MPI_COMM_WORLD).
Collective Communication: MPI operations that involve all processes in a communicator (e.g., MPI_Bcast, MPI_Reduce, MPI_Allreduce).
Point-to-Point: Direct message exchange between two MPI processes (e.g., MPI_Send and MPI_Recv).
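Blocking: The function does not return until the operation is complete from the caller's side, meaning the message buffer is safe to reuse (e.g., MPI_Send, MPI_Recv).
Non-Blocking: The function returns immediately, allowing computation to overlap with communication (e.g., MPI_Isend, MPI_Irecv); completion must be checked later with MPI_Wait or MPI_Test.
Remote Memory Access (RMA): One-sided communication in MPI that lets a process read or write memory exposed by another process (e.g., MPI_Put, MPI_Get) without a matching send or receive call on the target; synchronization is still required (e.g., MPI_Win_fence).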
MPI Derived Datatypes: Custom data layouts defined for efficient transfer of structured or non-contiguous data.
MPI I/O: Parallel input/output routines for reading/writing large datasets collectively across multiple nodes.
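
To make the role of a rank more concrete, the sketch below shows one common way a process could use its rank and the communicator size to decide which slice of the data it owns. It is plain Python and calls no MPI library; `block_range` and the example sizes are hypothetical, and the final sum only mimics what a sum reduction such as MPI_Reduce would deliver to the root.

```python
def block_range(rank: int, size: int, n: int) -> range:
    """Return the indices owned by `rank` when n items are split over `size` ranks.

    The first (n % size) ranks receive one extra item, so the per-rank
    counts differ by at most one.
    """
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# Hypothetical run: 10 items over 4 ranks, simulated in a single process.
size, n = 4, 10
local_sums = [sum(block_range(r, size, n)) for r in range(size)]  # per-rank work
global_sum = sum(local_sums)  # stands in for a sum reduction onto the root rank

for r in range(size):
    print(f"rank {r}: indices {list(block_range(r, size, n))}")
print("global sum:", global_sum)
```

In a real MPI program each process would evaluate `block_range` only for its own rank and the partial sums would be combined by a collective call rather than a Python `sum`.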
OpenMP Terminology
OpenMP: An API for parallel programming on shared memory systems using compiler directives.
Thread: A lightweight execution unit that shares memory with other threads within a process.
Parallel Region: Code block that runs across multiple threads simultaneously (#pragma omp parallel).
Work-Sharing Construct: Divides tasks among threads (e.g., #pragma omp for, #pragma omp sections).
Reduction: Combines partial results from multiple threads into a single result safely.
Critical Section: A code block that only one thread can execute at a time, preventing data races.
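False Sharing: A performance issue where threads on different cores write to different variables that happen to share a cache line, forcing the line to bounce between caches even though no data is logically shared.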
Thread Affinity: Binding threads to specific CPU cores to minimize context switching and maximize cache reuse.
Private Variables: Each thread has its own copy.
Shared Variables: All threads access the same variable.
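
The reduction and private-variable ideas can be illustrated outside OpenMP as well. The sketch below uses Python's standard `concurrent.futures` module as a stand-in: each worker accumulates into its own local variable (the analogue of a private copy), and the partial results are combined once at the end. The function names are hypothetical, and this shows only the pattern; in CPython the global interpreter lock typically keeps CPU-bound threads from running truly simultaneously, so no speedup should be expected from it.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Accumulate into a local variable, so no state is shared between workers."""
    total = 0
    for x in chunk:
        total += x * x
    return total


def parallel_sum_of_squares(data, num_threads=4):
    """Split data into chunks, sum each chunk in a thread, then combine the results."""
    chunk_size = max(1, -(-len(data) // num_threads))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)  # the combine step of the reduction


result = parallel_sum_of_squares(list(range(1000)))
print(result)
```

Because every worker writes only to its own `total`, no critical section is needed; the single `sum(partials)` at the end plays the role that a `reduction(+:...)` clause plays in OpenMP.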
Optimization & Performance
Optimization: The process of improving a program’s performance by maximizing useful computation and minimizing bottlenecks (e.g., memory, communication, I/O).
Profiling: Analyzing a program’s runtime behavior to identify performance bottlenecks.
Roofline Model: A visual model that relates attainable performance (FLOP/s) to arithmetic intensity (FLOPs/byte) to show whether an application is compute-bound or memory-bound.
Arithmetic Intensity: Ratio of operations to data movement: $AI = \frac{\text{FLOPs}}{\text{Bytes moved}}$
Vectorization: Performing multiple calculations simultaneously using vector registers (e.g., AVX instructions).
Cache Blocking: Reordering computations to reuse data in cache and minimize memory access.
Memory Hierarchy: Organization of memory from fastest to slowest: Registers → L1 → L2 → L3 → RAM → Disk.
Amdahl’s Law: Defines the theoretical speedup limit of a parallel system based on the sequential portion of code: $S = \frac{1}{(1 - P) + P/N}$, where $P$ is the parallelizable fraction of the runtime and $N$ is the number of processors.
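Gustafson’s Law: Shows that when the problem size grows with the number of processors, the scaled speedup $S = (1 - P) + P \cdot N$ grows almost linearly in $N$, so efficiency can remain high ($P$ is the parallel fraction, $N$ the number of processors).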
Bottleneck: Any part of a system that limits overall performance — often CPU, memory, or network bandwidth.
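
Amdahl's Law is easier to appreciate with numbers. The following sketch evaluates the formula for a few processor counts; the function names and the 95% parallel fraction are assumptions chosen for illustration, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """S = 1 / ((1 - P) + P / N) for parallel fraction P and N processors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as N grows without limit: 1 / (1 - P)."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


# Hypothetical program in which 95% of the runtime can be parallelized.
p = 0.95
for n in (1, 8, 64, 1024):
    print(f"N={n:5d}  S={amdahl_speedup(p, n):6.2f}")
print(f"limit as N grows: {amdahl_limit(p):.1f}")
```

Even with 95% of the work parallelized, the speedup can never exceed roughly 20, because the remaining 5% of serial work dominates once enough processors are added.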
Tools & Frameworks
Perf: Linux profiling tool for measuring CPU cycles, cache misses, and branch mispredictions.
mpiP: Lightweight MPI profiler for analyzing communication patterns.
HPCToolkit: Advanced profiling and tracing tool for hybrid MPI + OpenMP programs.
Intel VTune: Commercial profiler that visualizes vectorization, cache, and thread performance.
Slurm: Popular job scheduler in HPC environments.
Valgrind: Memory debugging and leak detection tool.
ThreadSanitizer: Detects data races in multithreaded applications.
Key Formulas Summary
| Concept | Description | Formula |
|---|---|---|
| Speedup (S) | How much faster parallel code runs | $S(P) = T(1) / T(P)$ |
| Efficiency (E) | Resource utilization | $E(P) = S(P) / P$ |
| Amdahl's Law | Limits of parallel speedup | $S = \frac{1}{(1 - P) + P/N}$ |
| Arithmetic Intensity (AI) | FLOPs per byte moved | $AI = \frac{\text{FLOPs}}{\text{Bytes moved}}$ |
| Roofline Bound | Theoretical performance cap | $P = \min(P_{\text{peak}},\ BW \times AI)$ |
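
As a worked example of the arithmetic intensity and roofline formulas in the table, the sketch below classifies a kernel as memory-bound or compute-bound. The machine numbers (1000 GFLOP/s peak, 100 GB/s bandwidth) and the kernel's FLOP and byte counts are hypothetical, as are the function names.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """AI = FLOPs / bytes moved."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def roofline_bound(peak: float, bandwidth: float, ai: float) -> float:
    """Attainable performance: min(peak, bandwidth * AI)."""
    return min(peak, bandwidth * ai)


def limiting_resource(peak: float, bandwidth: float, ai: float) -> str:
    """Report which roof the kernel hits first."""
    return "memory" if bandwidth * ai < peak else "compute"


# Hypothetical machine: peak in GFLOP/s, bandwidth in GB/s.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Hypothetical kernel: 2e9 FLOPs performed while moving 8e9 bytes.
ai = arithmetic_intensity(2e9, 8e9)
bound = roofline_bound(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
print(f"AI = {ai} FLOPs/byte, bound = {bound} GFLOP/s, "
      f"{limiting_resource(PEAK_GFLOPS, BANDWIDTH_GBS, ai)}-bound")
```

On this assumed machine the ridge point sits at 10 FLOPs/byte (peak divided by bandwidth), so a kernel with an intensity of 0.25 is limited by memory bandwidth to 25 GFLOP/s, far below the compute peak.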