Performance Analysis and Tools
Profiling tools
- Intel VTune Profiler: CPU performance analysis, hotspot detection, threading analysis.
- NVIDIA Nsight Systems: GPU performance analysis, kernel profiling, timeline visualization.
- PerfSuite: Open-source performance analysis tools for HPC applications.
Debugging tools
- GDB: GNU Debugger for C/C++ applications.
- Valgrind: Memory debugging and profiling tool.
- Intel Inspector: Memory and thread error detection.
Best practices for using tools
- Start with profiling to identify hotspots before optimizing code.
- Use debugging tools to ensure correctness before performance tuning.
- Iteratively profile and optimize to track performance improvements.
Performance measurement & models
Baseline first
- Get a reproducible baseline: fix the input, build flags, and process/thread counts, and repeat each measurement several times.
- Time critical kernels (microbenchmarks) and full runs.
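A baseline timer for a single kernel might look like the sketch below. It assumes a POSIX system with clock_gettime; the names wall_seconds, time_kernel, and sum_kernel are hypothetical and exist only for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std=c99/c11 */
#endif
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds from a monotonic clock (POSIX). */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Run kernel(arg) reps times and return the fastest run in seconds.
 * The minimum is less sensitive to OS noise than the mean. */
double time_kernel(void (*kernel)(void *), void *arg, int reps)
{
    assert(reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        double t0 = wall_seconds();
        kernel(arg);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Hypothetical kernel used only to exercise the timer. */
void sum_kernel(void *arg)
{
    volatile double *acc = arg;  /* volatile keeps the loop from being optimized away */
    for (int i = 0; i < 1000000; ++i)
        *acc += 1.0;
}
```

The same time_kernel call can wrap a full run, so microbenchmark and end-to-end numbers come from one clock.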
Strong and weak scaling
- Strong scaling: fix total problem size, increase processors. Measure time T(P).
- Weak scaling: fix work per processor, increase processors. Measure time T(P).
Strong Scaling Efficiency
Efficiency formula:
E_strong(P) = T(1) / (P × T(P))
Example:
- T(1) = 100 seconds
- P = 8 processors
- T(8) = 20 seconds
Compute:
P × T(P) = 8 × 20 = 160
Efficiency = 100 / 160 = 0.625 = 62.5%
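The arithmetic above can be wrapped in two small helpers. This is a minimal sketch; the function names are hypothetical, and timings are assumed to be in the same unit.

```c
#include <assert.h>

/* Speedup S(P) = T(1) / T(P). */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Strong-scaling efficiency: T(1) / (P * T(P)), i.e. speedup divided by P. */
double strong_efficiency(double t1, int p, double tp)
{
    assert(p > 0);
    return speedup(t1, tp) / (double)p;
}
```

With the numbers from the example, strong_efficiency(100.0, 8, 20.0) returns 0.625.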
Weak Scaling Efficiency
Weak scaling keeps workload per processor constant.
Efficiency formula:
E_weak(P) = T(1) / T(P)
Ideal case: runtime stays constant as processors increase.
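A sketch of applying the weak-scaling formula to a series of measurements follows. The function name and the array layout (t[0] holds T(1), later entries hold T(P) for growing P) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Weak-scaling efficiency for each measured point:
 * eff[i] = t[0] / t[i], where t[0] is the single-processor time T(1). */
void weak_efficiency(const double *t, size_t n, double *eff)
{
    assert(n > 0 && t[0] > 0.0);
    for (size_t i = 0; i < n; ++i) {
        assert(t[i] > 0.0);
        eff[i] = t[0] / t[i];
    }
}
```

A value of 1.0 means the runtime stayed flat; values below 1.0 show how much communication and contention cost as P grows.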
Roofline Model
The Roofline model connects performance (GFLOP/s) to arithmetic intensity.
- Arithmetic Intensity = FLOPs / Bytes moved
- Memory-bound: low intensity
- Compute-bound: high intensity
Matrix Multiplication Example:
FLOPs ≈ 2N³
Memory ≈ 3N² elements = 24N² bytes (8-byte doubles, each matrix moved once)
Arithmetic Intensity ≈ 2N³ / 24N² = N / 12 FLOP/byte
Larger N → higher intensity → better performance potential.
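The model can be turned into a quick estimate. The sketch below uses the standard Roofline bound, min(peak compute, intensity × memory bandwidth), which the text implies but does not spell out; the function names are hypothetical, and any peak or bandwidth figures passed in are assumptions about a given machine.

```c
#include <assert.h>

/* Arithmetic intensity (FLOP/byte) of an N x N matrix multiply under the
 * model above: 2*N^3 FLOPs, 3*N^2 elements moved, elem_bytes per element. */
double matmul_intensity(double n, double elem_bytes)
{
    assert(n > 0.0 && elem_bytes > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n * elem_bytes);
}

/* Roofline bound in GFLOP/s: the smaller of the compute peak and the
 * rate at which memory can feed the kernel at this intensity. */
double roofline_gflops(double peak_gflops, double bw_gbytes_s, double intensity)
{
    double mem_bound = intensity * bw_gbytes_s;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

A kernel is memory-bound while intensity × bandwidth is below the peak, and compute-bound once it exceeds it.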
Compiler Flags & Build Options
Best Practices
- Production: -O3 -march=native -funroll-loops -fopenmp
- Debugging: -g -O0
- Reproducibility: avoid aggressive floating-point reordering
Compile Examples
mpicc -O3 -march=native -o app app.c
mpicc -O3 -march=native -fopenmp -o hybrid_app hybrid.c
Runtime Environment
- OMP_NUM_THREADS: number of threads
- OMP_PROC_BIND: thread binding
- OMP_PLACES: placement strategy
- MKL_NUM_THREADS: math library threading
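Recording the settings a run actually saw helps keep baselines reproducible. The sketch below reads a thread-count variable with the standard getenv; the helper name env_positive_int is hypothetical, and list values such as "4,2" are deliberately treated as unparsed.

```c
#include <assert.h>
#include <stdlib.h>

/* Return the positive integer value of environment variable `name`,
 * or `fallback` if it is unset, empty, or not a plain positive integer. */
int env_positive_int(const char *name, int fallback)
{
    assert(name != NULL);
    const char *s = getenv(name);
    if (s == NULL || *s == '\0')
        return fallback;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v <= 0 || v > 1000000)
        return fallback;
    return (int)v;
}
```

For example, env_positive_int("OMP_NUM_THREADS", 1) can be logged next to each timing.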
Profiling & Debugging Tools
- perf: CPU hotspot profiling
- gprof: function-level profiling
- PAPI: hardware counters
- VTune: advanced performance analysis
- Valgrind: memory debugging
Matrix Multiplication Optimization
Naive Version
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    double sum = 0.0;
    for (k = 0; k < N; k++)
      sum += A[i][k] * B[k][j];
    C[i][j] = sum;
  }
Problem: poor cache usage. The inner loop reads B[k][j] down a column, so consecutive accesses are N elements apart in row-major storage.
Tiled Version
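/* C must be zero-initialized. T is the tile size. min(a, b) is assumed to be a user-defined helper. */
for (ii = 0; ii < N; ii += T)
  for (jj = 0; jj < N; jj += T)
    for (kk = 0; kk < N; kk += T)
      for (i = ii; i < min(ii+T, N); ++i)
        for (k = kk; k < min(kk+T, N); ++k)
          for (j = jj; j < min(jj+T, N); ++j)
            C[i][j] += A[i][k] * B[k][j];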
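A self-contained form of the tiled loop nest, with a naive reference for checking results, could look like this. It is a sketch under assumptions: matrices are stored as flat row-major arrays rather than A[i][j], and the names matmul_tiled, matmul_naive, and min_sz are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices, processed in tile x tile blocks.
 * The caller zeroes C beforehand to obtain C = A * B. */
void matmul_tiled(size_t n, size_t tile,
                  const double *A, const double *B, double *C)
{
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t jj = 0; jj < n; jj += tile)
            for (size_t kk = 0; kk < n; kk += tile) {
                /* Hoist the tile bounds so the inner loops have simple limits. */
                size_t i_end = min_sz(ii + tile, n);
                size_t j_end = min_sz(jj + tile, n);
                size_t k_end = min_sz(kk + tile, n);
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}

/* Naive i-j-k reference, used only to validate the tiled version. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

The tile size is worth tuning per machine; a common starting point is a value for which three tile × tile blocks of doubles fit in the targeted cache level.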
OpenMP Parallelization
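/* Each (ii, jj) tile of C is written by exactly one thread, so no synchronization is needed. */
#pragma omp parallel for collapse(2) private(kk, i, k, j)
for (ii = 0; ii < N; ii += T)
  for (jj = 0; jj < N; jj += T)
    /* kk, i, k, j loops as in the tiled version */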
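The pragma only covers the two outer tile loops, so a complete kernel is sketched here for reference. It assumes flat row-major arrays and a hypothetical function name; loop variables are declared inside the loops so that each thread gets its own copies.

```c
#include <assert.h>

/* C += A * B for row-major n x n matrices, parallel over (ii, jj) tiles.
 * Each thread owns whole tiles of C, so no two threads write the same
 * element. Without -fopenmp the pragma is ignored and the code runs serially. */
void matmul_tiled_omp(int n, int tile,
                      const double *A, const double *B, double *C)
{
    assert(n >= 0 && tile > 0);
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += tile)
        for (int jj = 0; jj < n; jj += tile)
            for (int kk = 0; kk < n; kk += tile) {
                int i_end = ii + tile < n ? ii + tile : n;
                int j_end = jj + tile < n ? jj + tile : n;
                int k_end = kk + tile < n ? kk + tile : n;
                for (int i = ii; i < i_end; ++i)
                    for (int k = kk; k < k_end; ++k) {
                        double a = A[i * n + k];
                        for (int j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The kk loop is left sequential on purpose: splitting it across threads would make several threads accumulate into the same C tile and would require a reduction.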
Vectorization
#pragma omp simd
for(j = jj; j < min(jj + T, N); ++j)
C[i][j] += A[i][k] * B[k][j];
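The innermost loop can be isolated as a row update, which makes the vectorization assumptions explicit. This is a sketch: row_update is a hypothetical name, and the restrict qualifiers are a promise by the caller that the two rows do not overlap.

```c
#include <assert.h>

/* c[j] += a * b[j] for j in [0, len): the innermost loop of the tiled
 * kernel for one fixed (i, k). restrict tells the compiler that b and c
 * do not alias, which removes one obstacle to vectorization. */
void row_update(int len, double a,
                const double *restrict b, double *restrict c)
{
    assert(len >= 0);
    #pragma omp simd
    for (int j = 0; j < len; ++j)
        c[j] += a * b[j];
}
```

Inside the tiled kernel the call would be row_update(j_end - jj, A[i][k], &B[k][jj], &C[i][jj]), where j_end stands for min(jj + T, N). GCC and Clang honor the pragma with -fopenmp or -fopenmp-simd.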
MPI Distribution
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
Distribute rows of C across ranks:
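int rows_per_rank = N / size;          /* assumes N is divisible by size */
int start_row = rank * rows_per_rank;  /* this rank computes rows start_row .. start_row + rows_per_rank - 1 */
Use MPI_Gather to collect the row blocks of C on one rank (MPI_Gatherv when N is not divisible by size). MPI_Reduce applies instead when ranks hold partial sums of the same entries, for example when the k loop is split across ranks.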
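When N is not a multiple of the number of ranks, the integer division above drops the leftover rows. One common remedy is a block distribution that hands the first N mod size ranks one extra row, sketched below without any MPI calls so it can be checked in isolation. The helper name row_range is hypothetical.

```c
#include <assert.h>

/* Block distribution of n rows over `size` ranks.
 * The first (n % size) ranks receive one extra row each.
 * On return, this rank owns rows start .. start + count - 1. */
void row_range(int n, int size, int rank, int *start, int *count)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    int base = n / size;
    int extra = n % size;
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}
```

Multiplying each rank's count and start by N gives the element counts and displacements that MPI_Gatherv expects on the root.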
Performance Metrics
- Speedup: S(P) = T(1) / T(P)
- Efficiency: E(P) = S(P) / P