Understanding CPU Caches in Go

Reading part 11 of Deep Dive

When you’re building Go services that handle millions of operations per second, the hardware beneath your abstractions starts to matter. Specifically, the CPU cache hierarchy, and whether your data fits in it.

The Hardware Context: It’s Not Just “RAM”

Your server has 32GB or 64GB of RAM, but the CPU avoids touching it whenever possible. Instead, it works through a chain of caches:

L1 Cache (Instruction & Data): Tiny (~64KB data per core), near-instant access (1–4 cycles).
L2 Cache: Larger (1–4MB per core), still extremely fast, your true private workspace.
L3 / Last Level Cache (LLC): Shared across cores, much larger (32MB+), but noticeably slower.

On modern ARM server chips (Ampere Altra, AWS Graviton), L2 is your most valuable per-core resource. (Apple M-series silicon is laid out differently: its L2 is shared by a cluster of cores rather than private to each one.) If your Go service’s active working set fits in L2, it runs fast. Spill into L3 or RAM, and you hit a latency cliff.

The practical question isn’t “do you know what a cache is?” It’s: what does modern hardware actually penalize you for?

What Modern Hardware Gets Right (And What It Doesn’t)

I ran benchmarks on an Apple M4 to find out. The results changed some assumptions.

Sequential vs. Stride Access: The Prefetcher Wins

The classic cache benchmark is matrix traversal: iterating a 1000×1000 integer matrix by row (sequential, cache-friendly) vs. by column (jumping 8KB every step, cache-hostile). On older CPUs, column traversal is reliably ~10× slower.

On the M4:

BenchmarkMatrixRowTraversal     236,982 ns/op
BenchmarkMatrixColTraversal     239,203 ns/op

Effectively identical: the two differ by about 1%. The hardware prefetcher detected the stride pattern and fetched the next chunk before the CPU needed it, hiding the latency entirely.
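For reference, the two access patterns being compared look roughly like this. It is a sketch, not the benchmark code behind the numbers above: the flat `[]int` backing slice and the names `newMatrix`, `sumByRow` and `sumByCol` are my own assumptions.

```go
package main

const dim = 1000 // matches the 1000×1000 matrix described above

// newMatrix allocates a dim×dim matrix backed by one contiguous slice,
// so row r occupies elements [r*dim, (r+1)*dim).
func newMatrix() []int {
	m := make([]int, dim*dim)
	for i := range m {
		m[i] = i % 7
	}
	return m
}

// sumByRow walks memory in address order: consecutive reads are adjacent.
func sumByRow(m []int) int {
	total := 0
	for r := 0; r < dim; r++ {
		for c := 0; c < dim; c++ {
			total += m[r*dim+c]
		}
	}
	return total
}

// sumByCol visits the same elements, but each read lands dim elements
// (8000 bytes with 8-byte ints) past the previous one.
func sumByCol(m []int) int {
	total := 0
	for c := 0; c < dim; c++ {
		for r := 0; r < dim; r++ {
			total += m[r*dim+c]
		}
	}
	return total
}
```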

The same result holds for stride-walk benchmarks; even 64KB jumps don’t fool the prefetcher:

BenchmarkStrideWalk/Stride_64         0.30 ns/op
BenchmarkStrideWalk/Stride_65536      0.28 ns/op

Takeaway: Modern CPUs are remarkably good at hiding predictable access patterns. Simple, linear data structures (slices, arrays) are still best, but you probably don’t need to hand-tune memory layout unless profiling shows a real bottleneck. If you want to benchmark true cache misses, use pointer-chasing (random linked-list traversal), where the CPU cannot predict the next address until it loads the current one.
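Here is a minimal sketch of what a pointer-chasing workload can look like. It uses a slice of indices rather than a real linked list, which keeps the example short but preserves the key property: each load depends on the one before it. The names `buildCycle` and `chase` are hypothetical, and the shuffle is Sattolo’s algorithm, chosen so the chain is one big cycle rather than several small ones.

```go
package main

import "math/rand"

// buildCycle returns a slice in which next[i] is the index to visit after i.
// The indices form a single random cycle covering every slot (Sattolo's
// algorithm), so a walk cannot settle into a short, cache-resident loop.
func buildCycle(n int, seed int64) []int32 {
	next := make([]int32, n)
	for i := range next {
		next[i] = int32(i)
	}
	rng := rand.New(rand.NewSource(seed))
	for i := n - 1; i > 0; i-- {
		j := rng.Intn(i) // j < i is what guarantees a single cycle
		next[i], next[j] = next[j], next[i]
	}
	return next
}

// chase follows the chain for the given number of steps. Each load depends
// on the previous one, so the prefetcher has no address to run ahead to.
func chase(next []int32, steps int) int32 {
	idx := int32(0)
	for i := 0; i < steps; i++ {
		idx = next[idx]
	}
	return idx
}
```

In a benchmark you would build the table once, then time `chase(next, b.N)`. Comparing a table of a few thousand entries against one sized well past your LLC should expose the per-level latency that the stride benchmarks hide; the exact sizes worth trying depend on your chip.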

What Still Kills Performance: Shared Mutable State

This is where the latency cliff is real, and where Go developers get burned.

When multiple cores read and write the same memory location, even through atomic operations, they trigger the CPU’s cache coherence protocol (MESI or one of its variants). Every write on one core must invalidate the copies held by every other core. The more cores fighting over the same cache line, the worse it gets.

I measured this directly with adjacent atomic counters under concurrent load:

Scenario	Latency
Single core, no contention	~2.2 ns/op
8 cores, high contention	~28.4 ns/op

A roughly 13× slowdown, not from algorithmic complexity, but from cores arguing over a single cache line (64 bytes on most x86 and ARM server chips, 128 bytes on Apple silicon).

This is the real enemy in high-throughput Go services: a shared var RequestCount int64, or a single global mutex protecting a hot map.

Fix 1: Shard your counters

Don’t use a single shared counter. Map goroutines or connections to buckets:

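import "sync/atomic"

// ShardedCounter spreads increments across padded slots so that
// concurrent writers rarely touch the same cache line.
type ShardedCounter struct {
    counts [128]struct {
        n int64
        _ [56]byte // pad each slot to 64 bytes; use [120]byte for 128-byte lines (e.g. Apple silicon)
    }
}

// Inc adds 1 to the slot selected by shard.
func (c *ShardedCounter) Inc(shard int) {
    // The uint conversion keeps the index in range even if shard is negative.
    atomic.AddInt64(&c.counts[uint(shard)%128].n, 1)
}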

Each shard lives on its own cache line. Cores no longer fight.
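The counter above only shows the write side. A sketch of how it might be read and driven follows; `Sum`, `countWith` and the `numShards` constant are my additions for illustration, not part of the original snippet, and the one-shard-per-worker mapping is just the simplest choice.

```go
package main

import (
	"sync"
	"sync/atomic"
)

const numShards = 128 // assumed shard count, matching the example above

// ShardedCounter spreads increments across padded slots.
type ShardedCounter struct {
	counts [numShards]struct {
		n int64
		_ [56]byte // 64-byte stride; an assumption about the line size
	}
}

// Inc adds 1 to the slot selected by shard.
func (c *ShardedCounter) Inc(shard int) {
	atomic.AddInt64(&c.counts[uint(shard)%numShards].n, 1)
}

// Sum adds up every shard. It is not a point-in-time snapshot:
// increments racing with Sum may or may not be included.
func (c *ShardedCounter) Sum() int64 {
	var total int64
	for i := range c.counts {
		total += atomic.LoadInt64(&c.counts[i].n)
	}
	return total
}

// countWith starts the given number of workers, each incrementing its
// own shard perWorker times, and returns the final total.
func countWith(workers, perWorker int) int64 {
	var c ShardedCounter
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(shard int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Inc(shard)
			}
		}(w)
	}
	wg.Wait()
	return c.Sum()
}
```

The trade-off is that reads get more expensive: `Sum` touches all 128 lines. That is usually fine for a metric scraped every few seconds, and a poor fit for a value read on every request.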

Fix 2: Pad hot struct fields

If two fields in a struct are written by different goroutines, put padding between them so they land on separate cache lines:

type HotStruct struct {
    FieldA int64    // written by one goroutine
    _      [56]byte // pushes FieldB 64 bytes past FieldA; use [120]byte for 128-byte lines
    FieldB int64    // written by a different goroutine
}

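Without the padding, FieldA and FieldB share a line, so a write to either one invalidates that line in every other core’s cache, even though the two goroutines never touch the same field. This is known as false sharing.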
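If you want to check a layout rather than trust the arithmetic, `unsafe.Offsetof` reports where each field lands. The sketch below compares a padded and an unpadded version of the struct; the type names, the helper functions and the `lineSize` constant are assumptions made for the example.

```go
package main

import "unsafe"

const lineSize = 64 // assumed cache line size; 128 on Apple silicon

type unpadded struct {
	FieldA int64
	FieldB int64
}

type padded struct {
	FieldA int64
	_      [lineSize - 8]byte
	FieldB int64
}

// unpaddedGap returns the distance in bytes between the two fields.
func unpaddedGap() uintptr {
	return unsafe.Offsetof(unpadded{}.FieldB) - unsafe.Offsetof(unpadded{}.FieldA)
}

// paddedGap returns the same distance for the padded layout.
func paddedGap() uintptr {
	return unsafe.Offsetof(padded{}.FieldB) - unsafe.Offsetof(padded{}.FieldA)
}

// neverShareLine reports whether two fields that start gap bytes apart are
// guaranteed to sit on different cache lines, wherever the struct begins.
func neverShareLine(gap uintptr) bool {
	return gap >= lineSize
}
```

Here `unpaddedGap()` is 8 and `paddedGap()` is 64, so only the padded layout passes `neverShareLine`. If you would rather not hard-code the line size, `golang.org/x/sys/cpu` exports a `CacheLinePad` type sized for the target architecture.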

Fix 3: Prefer channels over shared memory

Go’s channel idiom transfers ownership, not just a value. By convention, the sending goroutine stops touching the data once it is sent, and the receiver takes over (the language does not enforce this, so it is a discipline you have to keep). This is exactly what cache coherence wants: one writer at a time. It’s why channel-based designs can outperform lock-heavy ones, even though every channel operation carries its own overhead.

Keep Your Working Set in L2

The order-of-magnitude contention penalty is dramatic, but the more insidious problem in real services is simply having a working set that’s too large.

If a request handler touches 10MB of data (a large in-process cache, a bloated context struct, a deep call chain with big allocations), that data spills out of L2 and into L3 or RAM on every request. You won’t see a single expensive operation; you’ll see uniformly elevated tail latency under load.

Practical rules:

Keep per-connection and per-request structs small. Prefer slices of IDs over slices of full objects when you don’t need all the fields.
Avoid large in-process caches for hot-path data. A 50MB LRU cache sounds helpful, but if it exceeds your LLC, most lookups that land on a cold entry pay RAM latency.
Use sync.Pool for short-lived allocations to reduce GC pressure and improve locality: reused objects tend to stay in cache.
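As an illustration of the last rule, here is a small sketch of a pooled scratch buffer. The `bufPool` variable and the `greeting` function are hypothetical; the `sync.Pool` and `bytes.Buffer` calls are standard library.

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable buffers so hot-path code does not allocate
// a fresh one on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// greeting builds a short response using a pooled buffer.
func greeting(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold a previous caller's bytes
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result stays valid after Put
}
```

Two things to keep in mind: the pool may drop its contents at any garbage collection, so it is a cache and not a guarantee, and any locality benefit is something to confirm with a benchmark rather than assume.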

Measuring It Yourself

To see cache effects in your own service, the most useful tools are:

go test -bench for microbenchmarks. Run with -cpu=1,4,8 to see how contention scales:

go test -bench=. -benchmem -cpu=1,4,8 .

Linux perf for hardware counters on production workloads:

perf stat -e cache-misses,cache-references,LLC-load-misses ./your-binary

As a rough rule of thumb, an LLC (Last Level Cache) miss rate above 5–10% is the clearest signal that your working set is exceeding the cache. That’s where to optimize first.
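To make the -cpu flag concrete, here is a sketch of a contention benchmark built on `b.RunParallel`, which runs the loop body on GOMAXPROCS goroutines. It is not the suite mentioned below: the benchmark names, the 64-slot array and the `nsPerOp` helper are assumptions for illustration. In a real project the two benchmark functions would live in a `_test.go` file.

```go
package main

import (
	"sync/atomic"
	"testing"
)

// paddedSlot keeps each counter 64 bytes apart (an assumed line size).
type paddedSlot struct {
	n atomic.Int64
	_ [56]byte
}

var (
	shared atomic.Int64   // one counter, contended by every goroutine
	slots  [64]paddedSlot // one padded counter per goroutine
	nextID atomic.Int64   // hands each goroutine a slot index
)

// BenchmarkShared has every goroutine hammer the same counter.
func BenchmarkShared(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			shared.Add(1)
		}
	})
}

// BenchmarkSharded gives each goroutine its own padded slot.
func BenchmarkSharded(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		slot := &slots[nextID.Add(1)%int64(len(slots))]
		for pb.Next() {
			slot.n.Add(1)
		}
	})
}

// nsPerOp runs a benchmark function outside "go test" and returns its
// mean time per operation, for quick comparisons in a scratch program.
func nsPerOp(f func(*testing.B)) float64 {
	testing.Init() // registers testing flags; no effect if already called
	r := testing.Benchmark(f)
	return float64(r.T.Nanoseconds()) / float64(r.N)
}
```

Run under go test with -cpu=1,4,8, the interesting number is how each benchmark’s ns/op changes as the core count rises, not the absolute value at any one setting.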

I’ve added a benchmark suite to experiment with these patterns yourself. Download cache_test.go

Summary

Modern hardware prefetchers have largely eliminated the penalties for predictable access patterns. The naive “row vs. column traversal” benchmarks from ten years ago often don’t apply anymore.

What does still matter, and significantly:

Shared mutable state across cores: cache coherence contention is a 10× problem, not a 2× problem. Shard counters, pad structs, prefer channels.
Working set size: if your per-request data footprint exceeds L2 (~1–4MB per core), you’re paying LLC or RAM latency on every operation. Keep hot structures lean.
Measure before optimizing: use perf to look for LLC misses. A high miss rate is the smoking gun; a low one means your bottleneck is somewhere else entirely.