Learning High-Performance Computing with an Online Toolkit

Performance Analysis and Tools

Profiling tools

Intel VTune Profiler: CPU performance analysis, hotspot detection, threading analysis.
NVIDIA Nsight Systems: GPU performance analysis, kernel profiling, timeline visualization.
Perfsuite: Open-source performance analysis tools for HPC applications.

Debugging tools

GDB: GNU Debugger for C/C++ applications.
Valgrind: Memory debugging and profiling tool.
Intel Inspector: Memory and thread error detection.

Best practices for using tools

Start with profiling to identify hotspots before optimizing code.
Use debugging tools to ensure correctness before performance tuning.
Iteratively profile and optimize to track performance improvements.

Performance measurement & models

Baseline first

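Get a reproducible baseline before changing anything: fix the input, the build flags, and the thread and rank counts, and repeat each measurement several times.

Time both the critical kernels (microbenchmarks) and full application runs.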
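
As a rough illustration of timing a kernel, the sketch below repeats a call and keeps the fastest run. It assumes a POSIX system with clock_gettime; the names kernel_fn, best_time_seconds, and sample_kernel are hypothetical and not part of any toolkit.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime on POSIX systems */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature of the code being timed; ctx carries its inputs and outputs. */
typedef void (*kernel_fn)(void *ctx);

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run the kernel `repeats` times and return the fastest run in seconds.
   The minimum is less sensitive to OS noise than a single measurement. */
double best_time_seconds(kernel_fn kernel, void *ctx, int repeats)
{
    assert(kernel != NULL && repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        double t0 = now_seconds();
        kernel(ctx);
        double dt = now_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Hypothetical kernel for illustration: sums the integers 0..999999
   and stores the result through ctx so the work is not optimized away. */
void sample_kernel(void *ctx)
{
    volatile double *out = (volatile double *)ctx;
    double sum = 0.0;
    for (int i = 0; i < 1000000; ++i)
        sum += (double)i;
    *out = sum;
}
```

A call such as best_time_seconds(sample_kernel, &result, 5) would give one baseline number for that kernel; full-application timings still need to be recorded separately.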

Strong and weak scaling

Strong scaling: fix the total problem size and increase the number of processors P. Measure time T(P).
Weak scaling: fix the work per processor and increase P, so the total problem size grows with P. Measure time T(P).

Strong Scaling Efficiency

Efficiency formula:
E_strong(P) = T(1) / (P × T(P))

Example:

T(1) = 100 seconds
P = 8 processors
T(8) = 20 seconds

Compute:
P × T(P) = 8 × 20 = 160
Efficiency = 100 / 160 = 0.625 = 62.5%
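
The same calculation can be written as two small helpers. This is only a sketch; the function names speedup and strong_efficiency are chosen here for illustration.

```c
#include <assert.h>

/* Speedup S(P) = T(1) / T(P). */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Strong-scaling efficiency E_strong(P) = T(1) / (P * T(P)). */
double strong_efficiency(double t1, int p, double tp)
{
    assert(p > 0);
    return speedup(t1, tp) / (double)p;
}
```

With the numbers from the example, strong_efficiency(100.0, 8, 20.0) evaluates to 0.625.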

Weak Scaling Efficiency
Weak scaling keeps workload per processor constant.

Efficiency formula:
E_weak(P) = T(1) / T(P)

Ideal case: runtime stays constant as processors increase.
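
A minimal sketch of applying this formula to a series of measurements follows. The function name weak_efficiencies is hypothetical, and the sketch assumes times[0] holds the single-processor time T(1).

```c
#include <assert.h>
#include <stddef.h>

/* Fill eff[i] = T(1) / T(P_i) for a series of weak-scaling runs.
   Assumption: times[0] is the single-processor time T(1). */
void weak_efficiencies(const double *times, size_t count, double *eff)
{
    assert(times != NULL && eff != NULL && count > 0);
    for (size_t i = 0; i < count; ++i) {
        assert(times[i] > 0.0);
        eff[i] = times[0] / times[i];
    }
}
```

Values close to 1.0 mean the runtime stayed nearly constant as processors were added; values well below 1.0 usually point to communication or synchronization overhead growing with P.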

Roofline Model

The Roofline model connects performance (GFLOP/s) to arithmetic intensity.

Arithmetic Intensity = FLOPs / Bytes moved
Memory-bound: low intensity
Compute-bound: high intensity

Matrix Multiplication Example:

FLOPs ≈ 2N³
Memory ≈ 3N² elements (A, B, and C each moved once) = 24N² bytes for 8-byte doubles
Arithmetic Intensity ≈ 2N³ / 24N² = N / 12 FLOPs per byte

Larger N → higher intensity → better performance potential.
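
The usual statement of the Roofline bound is that attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below encodes that bound together with the matrix multiplication estimate above. The function names are hypothetical, and any peak or bandwidth figures passed in are the caller's assumptions about a machine, not measurements.

```c
#include <assert.h>

/* Best-case arithmetic intensity of an n x n matrix multiply:
   2n^3 FLOPs over 3n^2 elements of bytes_per_elem bytes each. */
double matmul_intensity(double n, double bytes_per_elem)
{
    assert(n > 0.0 && bytes_per_elem > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n * bytes_per_elem);
}

/* Roofline bound in GFLOP/s: min(peak, bandwidth * intensity).
   peak_gflops is in GFLOP/s, bandwidth_gbs in GB/s,
   intensity in FLOPs per byte. */
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double intensity)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && intensity >= 0.0);
    double memory_bound = bandwidth_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For a hypothetical machine with 1000 GFLOP/s peak and 100 GB/s bandwidth, an intensity of 2 FLOPs per byte is capped at 200 GFLOP/s (memory-bound), while an intensity of 100 reaches the 1000 GFLOP/s compute roof.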

Compiler Flags & Build Options

Best Practices

Production: -O3 -march=native -funroll-loops -fopenmp
Debugging: -g -O0
Reproducibility: avoid aggressive floating-point reordering (for example GCC/Clang -ffast-math or -Ofast), which can change numerical results between builds
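
To see why reordering matters for reproducibility, the sketch below adds the same three values in two orders. Floating-point addition is not associative, so a compiler that is allowed to reassociate may produce either result. The function names are illustrative only.

```c
/* Sum three values as (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Sum the same three values as a + (b + c). */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, and c = 1.0 under standard IEEE 754 double arithmetic, sum_left gives 1.0, whereas sum_right gives 0.0 because the 1.0 is absorbed when added to -1e20.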

Compile Examples

mpicc -O3 -march=native -o app app.c

mpicc -O3 -march=native -fopenmp -o hybrid_app hybrid.c

Runtime Environment

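OMP_NUM_THREADS: number of OpenMP threads per process
OMP_PROC_BIND: thread binding policy (for example close or spread)
OMP_PLACES: where threads may be placed (for example cores or sockets)
MKL_NUM_THREADS: thread count used by Intel MKL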

Profiling & Debugging Tools

perf: CPU hotspot profiling
gprof: function-level profiling
PAPI: hardware counters
VTune: advanced performance analysis
Valgrind: memory debugging

Matrix Multiplication Optimization

Naive Version

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            for (k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }

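Problem: poor cache usage. The inner loop walks B down a column, which in row-major storage is a stride of N elements, so each cache line fetched from B contributes only one useful value before it is evicted for large N.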

Tiled Version

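    /* Assumes C is zero-initialized and that min() is defined, e.g.
       #define min(a, b) ((a) < (b) ? (a) : (b)) */
    for (ii = 0; ii < N; ii += T)
        for (jj = 0; jj < N; jj += T)
            for (kk = 0; kk < N; kk += T)
                for (i = ii; i < min(ii + T, N); ++i)
                    for (k = kk; k < min(kk + T, N); ++k)
                        for (j = jj; j < min(jj + T, N); ++j)
                            C[i][j] += A[i][k] * B[k][j];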
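
A self-contained sketch of the two versions above follows, so the tiled result can be checked against the naive one. It assumes flat row-major arrays rather than 2D arrays, and the function names matmul_naive and matmul_tiled are chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two ints; standard C has no built-in min(). */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference: C = A * B for n x n row-major matrices. */
void matmul_naive(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            c[(size_t)i * n + j] = sum;
        }
}

/* Tiled: same product, computed tile by tile so that blocks of
   A, B, and C are reused while they are still in cache. */
void matmul_tiled(int n, int tile, const double *a, const double *b, double *c)
{
    assert(n >= 0 && tile > 0);
    /* The tiled loops accumulate into C, so it must start at zero. */
    for (size_t idx = 0; idx < (size_t)n * n; ++idx)
        c[idx] = 0.0;

    for (int ii = 0; ii < n; ii += tile)
        for (int jj = 0; jj < n; jj += tile)
            for (int kk = 0; kk < n; kk += tile)
                for (int i = ii; i < min_int(ii + tile, n); ++i)
                    for (int k = kk; k < min_int(kk + tile, n); ++k) {
                        double a_ik = a[(size_t)i * n + k];
                        for (int j = jj; j < min_int(jj + tile, n); ++j)
                            c[(size_t)i * n + j] += a_ik * b[(size_t)k * n + j];
                    }
}
```

The tile size is left as a parameter because a good value depends on the cache sizes of the target machine; it is usually found by timing a few candidates.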

OpenMP Parallelization

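    #pragma omp parallel for collapse(2)
    for (ii = 0; ii < N; ii += T)
        for (jj = 0; jj < N; jj += T) {
            /* kk, i, k, j loops as in the tiled version. Each (ii, jj)
               tile of C is written by exactly one thread, so no locking
               is needed. Declare kk, i, k, j inside this block (or list
               them in a private clause) so they are not shared. */
        }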

Vectorization

    #pragma omp simd
    for (j = jj; j < min(jj + T, N); ++j)
        C[i][j] += A[i][k] * B[k][j];
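
One way the two pragmas above might be combined in a single routine is sketched below. It assumes flat row-major arrays, and the function name matmul_tiled_omp is hypothetical. Loop indices are declared inside the loops so that each thread gets its own copies.

```c
#include <assert.h>
#include <stddef.h>

/* Tiled C = A * B for n x n row-major matrices.
   Threads split the (ii, jj) tiles of C; the innermost loop is
   marked for SIMD vectorization. */
void matmul_tiled_omp(int n, int tile, const double *a, const double *b, double *c)
{
    assert(n >= 0 && tile > 0);
    for (size_t idx = 0; idx < (size_t)n * n; ++idx)
        c[idx] = 0.0;

    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += tile)
        for (int jj = 0; jj < n; jj += tile) {
            /* Clamp tile edges for sizes not divisible by tile. */
            int i_end = ii + tile < n ? ii + tile : n;
            int j_end = jj + tile < n ? jj + tile : n;
            for (int kk = 0; kk < n; kk += tile) {
                int k_end = kk + tile < n ? kk + tile : n;
                for (int i = ii; i < i_end; ++i)
                    for (int k = kk; k < k_end; ++k) {
                        double a_ik = a[(size_t)i * n + k];
                        #pragma omp simd
                        for (int j = jj; j < j_end; ++j)
                            c[(size_t)i * n + j] += a_ik * b[(size_t)k * n + j];
                    }
            }
        }
}
```

Built with -fopenmp the tiles run in parallel; without it the pragmas are ignored and the routine runs serially with the same result. Because each thread owns whole tiles of C, no two threads write the same element.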

MPI Distribution

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

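Distribute rows of C across ranks:

    int rows_per_rank = N / size;   /* assumes N is divisible by size */
    int start_row = rank * rows_per_rank;

Each rank computes its own rows of C. Use MPI_Gather to assemble the row blocks on one rank; MPI_Reduce applies instead when every rank holds partial sums of the full C. Call MPI_Finalize() before the program exits.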
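
When N is not a multiple of the number of ranks, integer division leaves some rows unassigned. A common remedy, sketched below without any MPI calls so it can be tested on its own, gives the first N % size ranks one extra row. The type row_block and the function rows_for_rank are hypothetical names introduced for this sketch.

```c
#include <assert.h>

/* Half-open row range [start, start + count) owned by one rank. */
typedef struct {
    int start;
    int count;
} row_block;

/* Block distribution of n rows over `size` ranks.
   The first (n % size) ranks each receive one extra row. */
row_block rows_for_rank(int n, int size, int rank)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    int base = n / size;
    int extra = n % size;
    row_block rb;
    rb.count = base + (rank < extra ? 1 : 0);
    rb.start = rank * base + (rank < extra ? rank : extra);
    return rb;
}
```

With uneven row counts the blocks differ in size, so collecting them would need MPI_Gatherv with per-rank counts and displacements rather than MPI_Gather.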

Performance Metrics

Speedup: S(P) = T(1) / T(P)
Efficiency: E(P) = S(P) / P