Understanding CPU Caches: Why Your Code is Slow
You’ve optimized your algorithm from O(n²) to O(n log n), but it’s still slow. The culprit? CPU cache misses. Modern CPUs can execute billions of instructions per second, but memory access is the bottleneck. Understanding caches is the key to writing fast code.
The Memory Hierarchy
CPU Registers: ~1 cycle (a few KB of register file)
L1 Cache: ~4 cycles (32 KB per core)
L2 Cache: ~12 cycles (256 KB per core)
L3 Cache: ~40 cycles (8 MB shared)
Main RAM: ~200 cycles (16 GB)
SSD: ~100,000 cycles
Key Insight: Accessing RAM is 50x slower than L1 cache. Cache-friendly code can be 10-100x faster.
Cache Lines: The Fundamental Unit
CPUs don’t fetch individual bytes—they fetch cache lines (typically 64 bytes):
struct Point {
    int x, y; // 8 bytes total
};

Point points[1000];

// 1000 Points x 8 bytes = 8,000 bytes = 125 cache lines (8 Points per cache line)
for (int i = 0; i < 1000; i++) {
    points[i].x += 1;
}
Implication: Accessing points[0].x also loads points[1] through points[7] into cache for free.
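You can measure how much that "free" data matters with a minimal benchmark sketch (the 64-byte line size, the 16M-element buffer, and any timings are machine-dependent assumptions). Both passes sum exactly the same elements; the second visits them one int per cache line, so it pulls in roughly 16x as many lines:
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // 64 MB of ints: far larger than a typical L3, so data cannot stay cached between sweeps.
    const std::size_t N = 16 * 1024 * 1024;
    std::vector<int> data(N, 1);
    long long sum = 0;

    // Pass 1: sequential — every byte of each 64-byte line is used.
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++) sum += data[i];
    auto t1 = std::chrono::steady_clock::now();

    // Pass 2: stride of 16 ints (one int per 64-byte line), repeated for all 16
    // offsets, so the same N elements are summed — but each sweep drags in a
    // full cache line to use only 4 bytes of it.
    for (std::size_t off = 0; off < 16; off++)
        for (std::size_t i = off; i < N; i += 16) sum += data[i];
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("sequential: %lld ms, strided: %lld ms (checksum %lld)\n",
                (long long)ms(t0, t1), (long long)ms(t1, t2), sum);
}
On most machines the strided pass is noticeably slower even though it performs exactly the same number of additions.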
Cache Miss Types
1. Compulsory Miss (Cold Start)
First access to data—unavoidable:
int sum = 0;
for (int i = 0; i < n; i++) {
    sum += array[i]; // The first touch of each 64-byte line is a compulsory miss
}
2. Capacity Miss
Data doesn’t fit in cache:
// Working set far larger than L3 (100 MB vs. a typical 8 MB L3)
static int huge_array[25000000]; // 100 MB; static so it isn't a stack allocation
for (int i = 0; i < 25000000; i++) {
    huge_array[i] = i;
}
for (int i = 0; i < 25000000; i++) {
    huge_array[i] += 1; // Second pass misses again: nothing stayed resident
}
3. Conflict Miss
Multiple addresses map to the same cache set and evict each other, even though the cache as a whole isn't full. A 32 KB, 8-way L1 with 64-byte lines has 64 sets, so addresses a multiple of 4 KB apart land in the same set:
// Stride of 4096 ints = 16 KB, a multiple of 4 KB
for (int i = 0; i < n; i += 4096) {
    array[i] = 0; // Repeatedly maps to the same few sets, evicting earlier lines
}
Optimization 1: Spatial Locality
Bad: Array of Structs (AoS)
struct Vec3 { float x, y, z; };

struct Player {
    int id;
    char name[64];
    float x, y, z;
    // ... 200 bytes total
};

Player players[10000];
Vec3 velocity[10000];

// Update positions: only 12 of every ~200 bytes pulled into cache are used
for (int i = 0; i < 10000; i++) {
    players[i].x += velocity[i].x;
    players[i].y += velocity[i].y;
    players[i].z += velocity[i].z;
}
Good: Struct of Arrays (SoA)
struct Players {
    float x[10000];
    float y[10000];
    float z[10000];
};

Players players;
Players velocity; // velocities stored in the same SoA layout

// Sequential access through each array: every byte of every loaded line is used
for (int i = 0; i < 10000; i++) {
    players.x[i] += velocity.x[i];
    players.y[i] += velocity.y[i];
    players.z[i] += velocity.z[i];
}
Speedup: 5-10x faster due to better cache utilization.
Optimization 2: Temporal Locality
Reuse data while it’s still in cache:
// Bad: Two separate loops
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
}
for (int i = 0; i < n; i++) {
    d[i] = a[i] * 2; // a[i] may have been evicted by now
}

// Good: Fuse the loops
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    d[i] = a[i] * 2; // a[i] is still in cache (likely still in a register)
}
Optimization 3: Cache Blocking (Tiling)
For matrix operations:
// Bad: B is walked down a column (stride N), so nearly every B[k][j] misses
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// Good: Block/tile the computation so each tile stays resident in cache
const int BLOCK = 64;
for (int i = 0; i < N; i += BLOCK) {
    for (int j = 0; j < N; j += BLOCK) {
        for (int k = 0; k < N; k += BLOCK) {
            // Multiply one BLOCK x BLOCK submatrix pair
            for (int ii = i; ii < std::min(i + BLOCK, N); ii++) {
                for (int jj = j; jj < std::min(j + BLOCK, N); jj++) {
                    for (int kk = k; kk < std::min(k + BLOCK, N); kk++) {
                        C[ii][jj] += A[ii][kk] * B[kk][jj];
                    }
                }
            }
        }
    }
}
Speedup: 3-5x for large matrices. Choose BLOCK so a few BLOCK x BLOCK tiles fit in cache: with floats, a 64 x 64 tile is 16 KB, so three tiles fit comfortably in a 256 KB L2.
False Sharing: The Hidden Performance Killer
When multiple threads write to different variables on the same cache line:
// Bad: False sharing
struct Counter {
    int count1; // Thread 1 writes here
    int count2; // Thread 2 writes here
};              // Both counters share one 64-byte cache line!

Counter counter;

// Thread 1
for (int i = 0; i < 1000000; i++) {
    counter.count1++; // Invalidates the line in Thread 2's cache
}

// Thread 2
for (int i = 0; i < 1000000; i++) {
    counter.count2++; // Invalidates the line in Thread 1's cache
}
Good: Pad to separate cache lines
struct Counter {
    alignas(64) int count1; // starts on its own 64-byte cache line
    alignas(64) int count2; // pushed to the next 64-byte line
};
Speedup: 10-100x in multi-threaded code.
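Hard-coding 64 works on most current x86 parts; C++17 also provides std::hardware_destructive_interference_size in <new> as a portable hint (a sketch — note that some compilers only define it in recent versions):
#include <new>

struct Counter {
    alignas(std::hardware_destructive_interference_size) int count1;
    alignas(std::hardware_destructive_interference_size) int count2;
};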
Prefetching: Hint the CPU
// Manual prefetch: request data ~8 iterations before it is needed
// (for a plain sequential loop the hardware prefetcher usually does this already)
for (int i = 0; i < n; i++) {
    __builtin_prefetch(&array[i + 8]); // A hint only; it never faults, even past the end
    process(array[i]);
}
When to use: When access patterns are predictable but not sequential.
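A sketch of the kind of case where it tends to pay off: a gather through an index array, where the software knows the upcoming addresses but the hardware prefetcher cannot guess them. gather_sum and PREFETCH_DISTANCE are illustrative names, and the distance needs tuning per machine:
#include <cstddef>

// Sum values selected by an index array. The hardware prefetcher cannot
// predict values[indices[i]], but we can, because indices[] is known in advance.
long long gather_sum(const int* values, const int* indices, std::size_t n) {
    const std::size_t PREFETCH_DISTANCE = 16; // tunable assumption
    long long sum = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            // GCC/Clang builtin: a hint only, never faults
            __builtin_prefetch(&values[indices[i + PREFETCH_DISTANCE]]);
        }
        sum += values[indices[i]];
    }
    return sum;
}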
Measuring Cache Performance
Linux: perf
perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./myprogram
# Output:
# 1,234,567 cache-references
# 123,456 cache-misses # 10% miss rate
Intel VTune
vtune -collect memory-access ./myprogram
Valgrind Cachegrind
valgrind --tool=cachegrind ./myprogram
cg_annotate cachegrind.out.12345
Real-World Example: Binary Search
// Cache-unfriendly: each probe jumps to a distant, hard-to-predict index
int binary_search(int* arr, int n, int target) {
    int left = 0, right = n - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2; // avoids overflow of left + right
        if (arr[mid] == target) return mid;
        if (arr[mid] < target) left = mid + 1;
        else right = mid - 1;
    }
    return -1;
}
// Cache-friendly: Eytzinger layout (the sorted array rearranged into BFS order,
// so the first few levels of the implicit tree share a handful of cache lines)
int eytzinger_search(int* arr, int n, int target) {
    int i = 0;
    while (i < n) {
        if (arr[i] == target) return i;
        i = 2 * i + 1 + (arr[i] < target); // left child 2i+1, right child 2i+2
    }
    return -1;
}
Speedup: 2-3x for large arrays.
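The layout itself has to be built once from the sorted array. A common construction (a minimal sketch, assuming the input is already sorted; build_eytzinger is an illustrative name) is an in-order walk of the implicit tree:
// Fill eyt[0..n) in Eytzinger (BFS) order from sorted[0..n).
// k is the BFS index of the current node; idx walks through sorted[].
void build_eytzinger(const int* sorted, int* eyt, int n, int& idx, int k = 0) {
    if (k >= n) return;
    build_eytzinger(sorted, eyt, n, idx, 2 * k + 1); // left subtree first
    eyt[k] = sorted[idx++];                          // then this node
    build_eytzinger(sorted, eyt, n, idx, 2 * k + 2); // then right subtree
}

// Usage:
// int idx = 0;
// build_eytzinger(sorted, eyt, n, idx);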
Conclusion
Cache optimization is about:
- Spatial locality: Access nearby memory
- Temporal locality: Reuse recently accessed data
- Avoiding false sharing: Pad multi-threaded data structures
- Blocking: Process data in cache-sized chunks
Key Takeaways:
- RAM access is 50x slower than L1 cache
- Cache lines are 64 bytes—use them fully
- Struct of Arrays beats Array of Structs
- Profile with perf or VTune
What cache optimizations have you implemented? Share your results!