When you’re building Go services that handle millions of operations per second, the hardware beneath your abstractions starts to matter: specifically, the CPU cache hierarchy and whether your data fits in it.
The Hardware Context: It’s Not Just “RAM”
Your server has 32GB or 64GB of RAM, but the CPU avoids touching it whenever possible. Instead, it works through a chain of caches:
- L1 Cache (Instruction & Data): Tiny (~64KB data per core), near-instant access (1–4 cycles).
- L2 Cache: Larger (1–4MB per core) and still extremely fast. This is your true private workspace.
- L3 / Last Level Cache (LLC): Shared across cores, much larger (32MB+), but noticeably slower.
On modern ARM server chips (Ampere Altra, AWS Graviton) and Apple M-series silicon, L2 is your most valuable per-core resource (on Apple M-series it is shared by the cores of a cluster rather than private to each core). If your Go service’s active working set fits in L2, it runs fast. Spill into L3 or RAM, and you hit a latency cliff.
The practical question isn’t “do you know what a cache is?” It’s: what does modern hardware actually penalize you for?
What Modern Hardware Gets Right (And What It Doesn’t)
I ran benchmarks on an Apple M4 to find out. The results changed some assumptions.
Sequential vs. Stride Access: The Prefetcher Wins
The classic cache benchmark is matrix traversal: iterating a 1000×1000 integer matrix by row (sequential, cache-friendly) vs. by column (jumping 8KB every step, cache-hostile). On older CPUs, column traversal is reliably ~10× slower.
On the M4:
```
BenchmarkMatrixRowTraversal 236,982 ns/op
BenchmarkMatrixColTraversal 239,203 ns/op
```

Effectively identical: the two differ by about 1%. The hardware prefetcher detected the stride pattern and pre-fetched the next chunk before the CPU needed it, hiding the latency entirely.
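The benchmark code itself isn’t shown here, so the following is only a minimal sketch of the two traversals. It assumes a flat `[]int64` backing array (which gives the 8KB row-to-row jump for n = 1000); the names `newMatrix`, `sumByRow`, and `sumByCol` are hypothetical.

```go
package main

// newMatrix returns an n×n matrix stored row by row in one flat slice,
// with every cell set to 1.
func newMatrix(n int) []int64 {
	data := make([]int64, n*n)
	for i := range data {
		data[i] = 1
	}
	return data
}

// sumByRow visits cells in memory order: consecutive reads are 8 bytes apart.
func sumByRow(data []int64, n int) int64 {
	var sum int64
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum += data[i*n+j]
		}
	}
	return sum
}

// sumByCol visits cells column by column: consecutive reads are n*8 bytes apart.
func sumByCol(data []int64, n int) int64 {
	var sum int64
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			sum += data[i*n+j]
		}
	}
	return sum
}
```

Both functions compute the same sum, so any timing difference between them comes from access order alone. To time them, wrap each in a `testing.B` benchmark and assign the result to a package-level variable so the compiler cannot discard the loop.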
This holds for stride-walk benchmarks too: even 64KB jumps don’t fool it.

```
BenchmarkStrideWalk/Stride_64 0.30 ns/op
BenchmarkStrideWalk/Stride_65536 0.28 ns/op
```

Takeaway: Modern CPUs are remarkably good at hiding predictable access patterns. Simple, linear data structures (slices, arrays) are still best, but you probably don’t need to hand-tune memory layout unless profiling shows a real bottleneck. If you want to benchmark true cache misses, use pointer-chasing (random linked-list traversal), where the CPU cannot predict the next address until it loads the current one.
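One way to set up a pointer-chasing walk is sketched below. It is an illustration under my own assumptions, not the benchmark behind the numbers above: the "linked list" is a slice of indices, and `buildChain` and `chase` are hypothetical names. Sattolo’s shuffle is used so that the permutation forms a single cycle and the walk cannot get stuck in a short loop that fits in L1.

```go
package main

import "math/rand"

// buildChain returns a slice where next[i] is the index to visit after i.
// Sattolo's shuffle yields a permutation consisting of one cycle of length n,
// so a walk from any start touches every slot before repeating.
func buildChain(n int, seed int64) []int {
	r := rand.New(rand.NewSource(seed))
	next := make([]int, n)
	for i := range next {
		next[i] = i
	}
	for i := n - 1; i > 0; i-- {
		j := r.Intn(i) // j is in [0, i), never i itself
		next[i], next[j] = next[j], next[i]
	}
	return next
}

// chase follows the chain for the given number of steps and returns the
// final index. Each load depends on the result of the previous one, so the
// next address is unknown until the current read completes.
func chase(next []int, start, steps int) int {
	idx := start
	for s := 0; s < steps; s++ {
		idx = next[idx]
	}
	return idx
}
```

Size the chain well beyond your L2 (for example, several million entries) if you want the walk to miss cache, and use the returned index so the loop is not optimized away.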
What Still Kills Performance: Shared Mutable State
This is where the latency cliff is real, and where Go developers get burned.
When multiple cores read and write the same memory location, even through atomic operations, they trigger the CPU’s cache coherence protocol (MESI). Every write on one core must invalidate copies on every other core. The more cores fighting over the same cache line, the worse it gets.
I measured this directly with adjacent atomic counters under concurrent load:
| Scenario | Latency |
|---|---|
| Single core, no contention | ~2.2 ns/op |
| 8 cores, high contention | ~28.4 ns/op |
A slowdown of more than 12×, not from algorithmic complexity but from cores arguing over a 64-byte chunk of memory.
This is the real enemy in high-throughput Go services, a shared var RequestCount int64 or a single global mutex protecting a hot map.
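The measurement code isn’t included here. As a rough sketch of how adjacent counters could be compared against padded ones, the following gives each goroutine its own counter and varies only how far apart the counters sit in memory. All names (`hammer`, `hammerAdjacent`, `hammerPadded`, `paddedCounter`) are hypothetical, and the 64-byte line size is an assumption that does not hold on every CPU.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// paddedCounter fills 64 bytes so neighbouring counters in a slice are
// 64 bytes apart (assumed line size; some CPUs use 128-byte lines).
type paddedCounter struct {
	n int64
	_ [56]byte
}

// hammer starts one goroutine per slot, each incrementing only its own
// counter, then returns the sum of all counters.
func hammer(goroutines, perGoroutine int, slot func(g int) *int64) int64 {
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(p *int64) {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				atomic.AddInt64(p, 1)
			}
		}(slot(g))
	}
	wg.Wait()
	var total int64
	for g := 0; g < goroutines; g++ {
		total += atomic.LoadInt64(slot(g))
	}
	return total
}

// hammerAdjacent packs the counters 8 bytes apart, so several share a line.
func hammerAdjacent(goroutines, perGoroutine int) int64 {
	counters := make([]int64, goroutines)
	return hammer(goroutines, perGoroutine, func(g int) *int64 { return &counters[g] })
}

// hammerPadded spaces the counters 64 bytes apart.
func hammerPadded(goroutines, perGoroutine int) int64 {
	counters := make([]paddedCounter, goroutines)
	return hammer(goroutines, perGoroutine, func(g int) *int64 { return &counters[g].n })
}
```

No goroutine ever touches another goroutine’s counter in either variant, so a timing gap between the two would come from line sharing rather than from logical contention.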
Fix 1: Shard your counters
Don’t use a single shared counter. Map goroutines or connections to buckets:
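```go
import "sync/atomic"

// ShardedCounter spreads increments across 128 independent slots.
type ShardedCounter struct {
	counts [128]struct {
		n int64
		_ [56]byte // pad to 64-byte cache line
	}
}

// Inc adds 1 to the slot chosen by shard, which must be non-negative.
func (c *ShardedCounter) Inc(shard int) {
	atomic.AddInt64(&c.counts[shard%128].n, 1)
}
```

Each shard lives on its own cache line. Cores no longer fight.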
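Sharding moves the cost to the read side: to get a total you have to visit every shard. A possible `Load` method is sketched below. It is my addition rather than part of the original type, and the type is repeated only so the block compiles on its own.

```go
package main

import "sync/atomic"

const numShards = 128

// ShardedCounter is repeated from above so this sketch is self-contained.
type ShardedCounter struct {
	counts [numShards]struct {
		n int64
		_ [56]byte // pad to 64 bytes (assumed cache line size)
	}
}

// Inc adds 1 to the slot chosen by shard, which must be non-negative.
func (c *ShardedCounter) Inc(shard int) {
	atomic.AddInt64(&c.counts[shard%numShards].n, 1)
}

// Load sums all shards. Each shard is read atomically, but the sum is not a
// single atomic snapshot: increments that land during the loop may or may
// not be included.
func (c *ShardedCounter) Load() int64 {
	var total int64
	for i := range c.counts {
		total += atomic.LoadInt64(&c.counts[i].n)
	}
	return total
}
```

This trade suits counters that are written constantly and read rarely, such as metrics scraped every few seconds.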
Fix 2: Pad hot struct fields
If two fields in a struct are written by different goroutines, put padding between them so they land on separate cache lines:
```go
type HotStruct struct {
	FieldA int64
	_      [56]byte // force onto separate cache line
	FieldB int64
}
```

Without the padding, FieldA and FieldB share a line. A write to one invalidates the other on every other core.
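If you want to confirm where the fields actually land, `unsafe.Offsetof` and `unsafe.Sizeof` report the layout the compiler chose. The helper names below are hypothetical.

```go
package main

import "unsafe"

// HotStruct is repeated from above so this sketch is self-contained.
type HotStruct struct {
	FieldA int64
	_      [56]byte
	FieldB int64
}

// fieldBOffset reports the byte offset of FieldB from the start of the struct.
func fieldBOffset() uintptr {
	var h HotStruct
	return unsafe.Offsetof(h.FieldB)
}

// structSize reports the total size of HotStruct in bytes.
func structSize() uintptr {
	return unsafe.Sizeof(HotStruct{})
}
```

FieldB sits 64 bytes after FieldA, so the two can never fall in the same 64-byte line wherever the struct is allocated. On a CPU with 128-byte lines they still can, and the padding would need to grow to 120 bytes.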
Fix 3: Prefer channels over shared memory
By convention, sending a value on a Go channel transfers ownership, not just the value: the sending goroutine stops using its reference and the receiver takes over. The language does not enforce this, but when you follow it you get exactly what cache coherence wants, one writer at a time. That is why idiomatic Go with channels can outperform lock-heavy designs, even though the channel itself has overhead.
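A minimal sketch of the single-owner pattern follows: one goroutine owns the counter and everyone else sends it increments. The names `startCounter` and `countWithOwner` are hypothetical, and whether this beats an atomic or sharded counter depends on your workload, since a channel send has its own synchronization cost.

```go
package main

import "sync"

// startCounter launches one goroutine that owns the running total. It
// returns the channel to send increments on and a channel that delivers the
// final total once the input channel is closed.
func startCounter(buffer int) (chan<- int64, <-chan int64) {
	in := make(chan int64, buffer)
	out := make(chan int64, 1)
	go func() {
		var total int64 // read and written only by this goroutine
		for delta := range in {
			total += delta
		}
		out <- total
	}()
	return in, out
}

// countWithOwner runs several producers that each send perProducer
// increments to the owner, then returns the owner's final total.
func countWithOwner(producers, perProducer int) int64 {
	in, out := startCounter(64)
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perProducer; i++ {
				in <- 1
			}
		}()
	}
	wg.Wait()
	close(in) // safe: every producer has finished sending
	return <-out
}
```

In a real service, producers would usually accumulate locally and send a batch (for example, a delta of 100) to amortize the per-send cost.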
Keep Your Working Set in L2
The 12× contention penalty is dramatic, but the more insidious problem in real services is simply having a working set that’s too large.
If a request handler touches 10MB of data (a large in-process cache, a bloated context struct, a deep call chain with big allocations), that data spills out of L2 and into L3 or RAM on every request. You won’t see a single expensive operation; you’ll see uniformly elevated tail latency under load.
Practical rules:
- Keep per-connection and per-request structs small. Prefer slices of IDs over slices of full objects when you don’t need all the fields.
- Avoid large in-process caches for hot-path data. A 50MB LRU cache sounds helpful, but if it exceeds your LLC, you’re paying RAM latency for every miss.
- Use `sync.Pool` for short-lived allocations to reduce GC pressure and improve locality; reused objects tend to stay in cache.
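As an illustration of the last rule, here is a minimal `sync.Pool` sketch that reuses a `bytes.Buffer` across calls. The `render` function and `bufPool` variable are hypothetical stand-ins for whatever short-lived scratch space your handler needs.

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable buffers. The runtime may drop pooled items at
// any garbage collection, so New must always be able to create a fresh one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render builds a short response in a pooled buffer and returns it as a string.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // clear whatever the previous user left behind
	defer bufPool.Put(buf) // return the buffer once the string has been copied out
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result stays valid after Put
}
```

The pool does not guarantee that you get back a recently used, cache-warm object; treat the locality benefit as likely rather than certain, and confirm it with a benchmark.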
Measuring It Yourself
To see cache effects in your own service, the most useful tools are:
go test -bench for microbenchmarks. Run with -cpu=1,4,8 to see how contention scales:
```
go test -bench=. -benchmem -cpu=1,4,8 .
```

Linux `perf` for hardware counters on production workloads:

```
perf stat -e cache-misses,cache-references,LLC-load-misses ./your-binary
```

High LLC (Last Level Cache) miss rates, especially above 5–10%, are the clearest signal that your working set is exceeding the cache. That’s where to optimize first.
I’ve added a benchmark suite to experiment with these patterns yourself. Download cache_test.go
Summary
Modern hardware prefetchers have largely eliminated the penalties for predictable access patterns. The naive “row vs. column traversal” benchmarks from ten years ago often don’t apply anymore.
What does still matter, and significantly:
- Shared mutable state across cores: cache coherence contention is a 10× problem, not a 2× problem. Shard counters, pad structs, prefer channels.
- Working set size: if your per-request data footprint exceeds L2 (~1–4MB per core), you’re paying LLC or RAM latency on every operation. Keep hot structures lean.
- Measure before optimizing: use `perf` to look for LLC misses. A high miss rate is the smoking gun; a low one means your bottleneck is somewhere else entirely.