Go GC Deep Dive: How to Reduce Latency and Allocation Pressure in Production

"Why is our service slow?" "I don't know, the heap is only 200MB". "But we're allocating... wait, how much?" "12 terabytes". "...in how long?" "30 seconds profile". That's when we realized: we weren't running a service. We were running a garbage factory that occasionally served API requests. The Go garbage collector was heroically trying to clean up our mess, and we were blaming it for not cleaning fast enough. This deep dive into GC internals, profiling tools, and production war stories will teach you how to stop fighting the garbage collector and start working with it.

WORDS: 3138 | CODE BLOCKS: 8 | EXT. LINKS: 11

In the world of high-throughput backend services, we often obsess over the usual suspects of performance: database indexing, network latency, and algorithmic complexity. But recently, while debugging our core gateway service (backend-gw), I encountered a bottleneck that defied standard logic.

The service was CPU-bound, yet active heap usage was surprisingly low (~200MB). P99 latency was spiking at random intervals, but database queries were returning in milliseconds.

The culprit was not the business logic. It was memory management. I was effectively running a Denial-of-Service attack on my own runtime.

This post is a detailed breakdown of the Go Garbage Collector (GC). I’ll explain how the GC actually works, dissect its specific phases (including the dreaded “Stop The World”), and show you how to use the Go Trace tool to identify when your application is losing the battle against allocation churn.

Part 1: Finding the Smoking Gun

My investigation began with a standard CPU profile. The service processes JSON documents from Couchbase and serves them via HTTP. I expected the CPU time to be dominated by business rules or I/O handling.

Instead, the profile (cpu_top_cumulative.txt) showed this:

text
1Showing nodes accounting for 29.17s, 72.56% of 40.20s total
2      flat  flat%   sum%        cum   cum%
3     0.02s  0.05%  ...     19.04s 47.36%  couchbaseDB.GetWithCAS
4         0     0%  ...     18.40s 45.77%  encoding/json.Unmarshal
5     0.64s  1.59%  ...     13.89s 34.55%  encoding/json.(*decodeState).object
6     ...
7     0.48s  1.19%  ...      1.94s  4.83%  runtime.scanobject
8     0.02s  0.05%  ...      1.81s  4.50%  runtime.gcDrain

I had two major bottlenecks consuming roughly equal CPU time:

  1. Database operations (47.36% cumulative) - Couchbase fetch and I/O
  2. JSON deserialization (45.77% cumulative) - Standard library unmarshalling

But crucially, functions like runtime.scanobject and runtime.gcDrain were consuming significant cycles. These are internal GC methods. While only 4.83% of CPU was directly in GC functions, this was a sign of a deeper problem.

The allocation profile (allocs_top_cumulative.txt) confirmed my suspicion. We weren’t leaking memory, we were “churning” it. Allocation profile shows cumulative allocations since last deployment (4 hours in this case).

text
1Showing nodes accounting for 11054705.39MB, 89.25% of 12386152.87MB total
2
32981421.22MB  24.07%  github.com/go-kit/log.With
4 216546.74MB   1.75%  encoding/json.Unmarshal

Understanding the Gap: Heap vs. Allocations

Here’s the critical insight that I had missed. The heap profile showed:

text
1Type: inuse_space
2Showing nodes accounting for 188.98MB, 92.83% of 203.58MB total

Let that sink in:

  • Active Heap: 203 MB (memory in use at snapshot time)
  • Total Allocations: 12,386 GB (cumulative garbage created since last deployment)
  • Churn Ratio: 62,000:1 (cumulative allocations / active heap)

We were generating twelve terabytes of cumulative garbage while maintaining only 200MB of live data. This meant that for every byte we needed, we were creating 62,000 bytes of garbage that the GC had to clean up.

This is allocation churn, and it’s the silent killer. The GC must scan and mark every reachable object during collection, and when you’re creating millions of temporary objects, the GC spends more time cleaning than the application spends doing actual work.

We were generating this garbage using standard library JSON parsers and an inefficient logging strategy. To understand why this crushed our CPU, I had to look under the hood of the Go Runtime.

Then I started reading about the Go GC. I had some familiarity with Python’s GC. So, I started reading countless articles, reading Go docs and reading the source code. After about 25 hours, I finally got some clarity on how the GC works and why the backend was so slow.

Part 2: The Architecture of the Go GC

The Go GC is a Non-Generational, Concurrent, Tri-Color Mark-and-Sweep collector.

That is a mouthful. Let’s break down the implications:

  1. Non-Generational: Unlike Java or Python, Go does not separate objects into “Young” and “Old” generations. It treats the heap as one singular space. This simplifies the runtime but means the GC must scan everything that is live during every cycle.
  2. Concurrent: The GC runs alongside your application code (except during brief Stop-The-World phases). It does not pause your app for the entire duration of the collection.
  3. Mark-and-Sweep: It traverses the object graph to find live objects (“Mark”) and reclaims the rest (“Sweep”).

The Object Graph

You must visualize your application’s memory as a directed graph.

  • Nodes: Objects (structs, strings, slices).
  • Edges: Pointers (references).

The GC starts at the Roots (global variables, stack frames of active goroutines) and follows the pointers. Anything it can reach is “Live”. Anything it cannot reach is “Garbage”.

Part 3: The Lifecycle of a GC Cycle (The 4 Phases)

A GC cycle is a carefully choreographed dance between your code (the “Mutator”) and the GC background workers. It consists of four specific phases.

Phase 1: Sweep Termination (Stop-The-World)

The cycle begins here. The GC must ensure that the previous cycle’s cleanup is 100% complete before starting a new one.

  • What happens: The runtime pauses all application goroutines (Stop-The-World, or STW). It finishes any remaining sweeping of memory spans.
  • Latency Impact: In modern Go versions (1.18+), this is typically sub-millisecond. However, if your heap is highly fragmented or massive, this pause can stretch, causing “jitter” in your p99 latency.

Phase 2: Concurrent Mark (The Heavy Lifter)

This is the longest phase, accounting for the bulk of the GC-related CPU usage I saw in my profiles. The world resumes, and the GC runs in the background (consuming ~25% of CPU capacity by default).

This phase uses the Tri-Color Invariant:

  1. White: Candidate for collection (not yet visited).
  2. Grey: Alive, but its children (pointers) have not been scanned.
  3. Black: Alive, and fully scanned.

The GC picks a Grey object, scans its memory for pointers (runtime.scanobject), marks the referenced objects Grey, and turns the original object Black.

The Bottleneck: The cost of this phase is determined by Pointer Density, not just object size.

text
1Cost = (BytesScanned * C_1) + (PointerCount * C_2)

In my profiling data, I saw runtime.scanobject consuming 4.83% of CPU. This is because encoding/json creates a graph full of interface{} and pointer wrappers. I was forcing the GC to traverse millions of tiny edges.

Phase 3: Mark Termination (Stop-The-World)

Once the Concurrent Mark is done, the GC must ensure no Grey objects remain.

  • What happens: A second Stop-The-World (STW) pause. The GC performs final housekeeping, flushes caches, and calculates the target heap size for the next cycle.
  • Latency Impact: This pause is generally longer than Sweep Termination. Its duration scales with the number of active goroutines and the complexity of the root set.

Phase 4: Concurrent Sweep

The STW pause ends. The GC now knows that any object still marked “White” is garbage.

  • What happens: The GC does not explicitly “delete” every object immediately. Instead, as your application requests new memory (via mallocgc), the runtime lazily reclaims the White space.
  • Resource Usage: This happens largely off the critical path, but high churn can cause fragmentation here.

Part 4: Analyzing the Trace – The “G Block” and Mark Assists

This is where the theory meets the reality of debugging. When you run go tool trace trace.out, you are presented with a timeline view.

What is a “G Block”?

In the trace tool, you will see rows labeled Proc 0, Proc 1, … (representing logical Processors/Cores). The colorful bars inside these rows are G Blocks (Goroutine execution blocks).

A “G Block” represents a specific Goroutine running on a specific Processor for a duration of time.

  • Green: User code running normally.
  • Blue: Networking/IO.
  • Yellow/Red: GC related delays or synchronization waiting.

The Latency Killer: Mark Assists

The Go GC has a “Pacing” algorithm. It aims to prevent the heap from growing out of control before the GC cycle finishes.

The Concept: Imagine your application is allocating memory (creating garbage) faster than the background GC workers can clean it up. The runtime realizes it is losing the race. To prevent an Out-Of-Memory (OOM) crash, it begins to “tax” your application.

The Mechanism: When a Goroutine attempts to allocate memory (runtime.mallocgc), the runtime checks the current GC credit. If the Goroutine is in “debt” (it has allocated too much relative to the scan progress), the runtime hijacks the Goroutine.

Instead of running your business logic, your Goroutine is forced to switch into Mark Assist mode. It must scan a portion of the heap (doing the GC’s job) to “pay off its debt” before it is allowed to perform the allocation.

Interpreting the Trace

If you see this in your trace, you have an allocation problem:

  1. Look at the “G” row (Goroutine analysis).
  2. Look for the state labels. You want to see “Running”.
  3. If you see large chunks labeled “Mark Assist” or time spent in “GCWaiting” states, your CPU is not serving users; it is fighting the heap.

In my specific case, my API handlers were spending significant time in Mark Assist because log.With and json.Unmarshal were flooding the heap with pointers.

Part 5: The Dual Problem - JSON and Logging

The JSON Unmarshalling Problem

While I can’t optimize the database (network and disk I/O are physical constraints), I can optimize how I process the data once it arrives. The standard library’s encoding/json package is general-purpose and safe, but it’s not optimized for high-throughput scenarios.

The issue is how it allocates memory:

  • Creates many small allocations for each field
  • Uses interface{} extensively, forcing heap allocations
  • Generates intermediate objects during parsing
  • Creates pointer-heavy object graphs

The Logging Problem (The Silent Memory Hog)

Looking deeper at the allocation profile:

text
12981421.22MB  24.07%  github.com/go-kit/log.With
2 464404.86MB   3.75%  backend-gw/routes/rctx.LoggerWithFn

The go-kit logger was allocating 2.9 TB of memory - nearly 24% of all allocations! This is because log.With() creates a new logger instance every time it’s called, and we are calling it in HTTP middleware and every function for every single request.

Every log line with context creates:

  • A new logger instance
  • Copies of all context key-value pairs
  • String conversions for values
  • Interface boxing for non-string types

When you’re handling thousands of requests per second, this adds up catastrophically.

Why This Compounds the GC Problem: These allocations don’t just cost memory, they create millions of objects that the GC must scan during the mark phase. Even though each logger instance is short-lived, the GC must still prove it’s garbage by scanning the entire object graph.

Part 6: The Fix and The Benchmarks

I replaced encoding/json with goccy/go-json, a drop-in replacement that’s heavily optimized for performance. The impact was visible in micro-benchmarks immediately.

Benchmark Setup

go
 1func BenchmarkUnmarshal_Std_UserSE(b *testing.B) {
 2    data := []byte(`{"userId":"..".,"sessions":[...],...}`) // Representative data
 3    b.ReportAllocs()
 4    b.ResetTimer()
 5    
 6    for i := 0; i < b.N; i++ {
 7        var result UserSE
 8        if err := json.Unmarshal(data, &result); err != nil {
 9            b.Fatal(err)
10        }
11    }
12}
13
14func BenchmarkUnmarshal_Goccy_UserSE(b *testing.B) {
15    data := []byte(`{"userId":"..".,"sessions":[...],...}`)
16    b.ReportAllocs()
17    b.ResetTimer()
18    
19    for i := 0; i < b.N; i++ {
20        var result UserSE
21        if err := goccyjson.Unmarshal(data, &result); err != nil {
22            b.Fatal(err)
23        }
24    }
25}

Benchmark Results: The “Pointer Chasing” Proof

I benchmarked the unmarshalling of several of our critical database document types on Apple M2:

Struct Type Library Time (ns/op) Allocs/op Bytes/op Speedup Alloc Reduction
UserSE encoding/json 127,135 589 33,625 - -
goccy/go-json 27,118 273 32,637 4.7x 53.7%
UserEE encoding/json 133,181 347 29,745 - -
goccy/go-json 25,736 154 36,283 5.2x 55.6%
User encoding/json 4,819 22 2,392 - -
goccy/go-json 1,095 9 2,545 4.4x 59.1%
Post encoding/json 6,685 18 1,184 - -
goccy/go-json 1,621 11 1,689 4.1x 38.9%
UserENR encoding/json 10,345 44 1,704 - -
goccy/go-json 2,286 16 2,533 4.5x 63.6%

The Analysis:

Across all document types, I saw:

  • Average speedup: 4.6x faster unmarshalling
  • Average allocation reduction: 54% fewer allocations
  • Bytes per operation: Roughly similar (slight increase in some cases)

The byte count (B/op) remained similar or slightly higher in some cases. This is the key insight. The GC does not care about the total bytes allocated as much as it cares about the number of objects (allocations) it must track.

When unmarshalling UserSE:

  • Standard lib: 589 separate allocations = 589 objects to scan
  • goccy/go-json: 273 allocations = 273 objects to scan

This change effectively cut the work of runtime.scanobject by more than half for these hot paths.

Why goccy/go-json Is Faster

The performance difference comes from:

  1. Reduced intermediate allocations - Directly writes to destination structs
  2. Optimized pointer handling - Minimizes pointer chains in the object graph
  3. Better memory layout - Fewer separate heap allocations
  4. SIMD optimizations - Uses CPU vector instructions where available

The slight increase in bytes/op for some structs is acceptable because it reduces allocation count, the metric that actually drives GC pressure.

Next Steps: Logging Optimization

My next phase of optimization is to replace the go-kit logger with a zero-allocation alternative like zerolog or zap, and to restructure the logging middleware to reduce the quantity and improve the quality of logs.

Based on my profiling data showing 2.9TB allocated by log.With(), I anticipate this will:

  • Eliminate ~600GB per hour of memory churn
  • Free up an additional 15-20% of CPU cycles
  • Further reduce GC pressure

Architectural takeaways

For Senior Engineers: Rethink “Clean Code”

We often prefer interfaces, wrappers, and layers of abstraction for “clean architecture”. In Go, every interface wrapper that escapes to the heap is a node in the graph.

  • The Trade-off: Be wary of libraries that accept interface{} for everything (like standard log or json).
  • Structure: Design structs to keep pointers contiguous. The GC scans memory faster when pointers are grouped at the start or end of a struct, rather than interleaved with scalars (int/bool), due to how bitmap marking works.

For Junior Engineers: Value vs. Pointer

  • Pointers are not free. Passing a pointer to a function might save a memory copy, but it might burden the GC if that pointer escapes to the heap.
  • Pre-allocate Slices. Use make([]T, 0, capacity) if you know the size. Appending to a slice causes resizing, which creates a new backing array and leaves the old one as garbage.

Red Flags: Is Your Application GC-Bound?

Watch for these warning signs in your profiles:

  1. Allocation/Heap Ratio > 1000:1 - You’re churning far more than you’re using
  2. runtime.scanobject > 5% of CPU - GC is spending significant time scanning
  3. Many small allocations - Objects < 1KB in hot paths create pointer-chasing
  4. interface{} in unmarshal targets - Forces heap allocation for every value
  5. Logger allocations in middleware - Creating context on every request

General Optimization Strategies

  1. Use sync.Pool for frequently allocated objects - Reuse instead of allocating
  2. Prefer value types over pointers - When objects are small (<100 bytes)
  3. Consider binary protocols - protobuf or msgpack have less allocation overhead
  4. Batch allocations - Pre-allocate slices to known capacity
  5. Profile before and after - Always measure; intuition fails at scale

Conclusion

The Go Garbage Collector is a marvel of engineering, prioritizing low latency over high throughput. However, it obeys the laws of physics. If you create millions of objects, the GC must touch millions of objects.

By analyzing my profiles and traces, I realized the service wasn’t slow, it was just distracted. The CPU was spending its time:

  • Fetching data from Couchbase (necessary I/O)
  • Deserializing JSON with inefficient allocations (optimizable)
  • Creating logger instances for every request (optimizable)
  • Running the GC to clean up the resulting garbage (consequence)

By addressing the two optimizable factors, switching to goccy/go-json and planning to fix the logging, I’m reclaiming significant CPU capacity. I reduced allocation counts by 50%+, which directly reduces the work that runtime.scanobject must perform during each GC cycle.

The lesson: In high-performance Go services, allocation count matters more than allocation size. The GC must scan the object graph, and every object is a node. Minimize the nodes, and you minimize the GC’s work.

Understanding your memory allocation patterns through profiling is not optional, it’s essential for building services that scale.

Reference Documentation:

Profiling Commands Used:

bash
 1# CPU profiling
 2go test -cpuprofile=cpu.prof -bench=. ./...
 3go tool pprof -top cpu.prof
 4
 5# Memory allocation profiling  
 6go test -memprofile=mem.prof -bench=. ./...
 7go tool pprof -alloc_space -top mem.prof
 8
 9# Heap profiling
10go tool pprof -inuse_space -top mem.prof
11
12# Generating trace
13go test -trace=trace.out -bench=. ./...
14go tool trace trace.out

Collect profiling data from production systems:

bash
 1# 1. Collect the raw data
 2{
 3    # 0. CPU Profile: Identifies exactly which line is creating the most garbage.
 4    curl -L -u pprofuser:pwd "http://backend-gw/debug/pprof/profile?seconds=30" -o cpu.pprof &
 5    # 3. Execution Trace: Captures a 10-second timeline of every event (GC, Network, Scheduler).
 6    curl -L -u pprofuser:pwd "http://backend-gw/debug/pprof/trace?seconds=15" -o trace.out &
 7
 8    sleep 2
 9    # 1. Heap Profile: Identifies exactly which line is creating the most garbage.
10    curl -L -u pprofuser:pwd "http://backend-gw/debug/pprof/heap" -o heap.pprof
11
12    # 2. Goroutine Profile: Shows what all concurrent workers are doing at this exact moment.
13    curl -L -u pprofuser:pwd "http://backend-gw/debug/pprof/goroutine" -o goroutine.pprof
14
15    # Block Profile: Shows where code is waiting for Channels or Network.
16    curl -L -u pprofuser:pwd "http://backend-gw/debug/pprof/block" -o block.pprof
17
18    # Mutex Profile: Shows which locks are causing contention.
19    curl -L -u pprofuser:pwd "http://backend-gw/debug/pprof/mutex" -o mutex.pprof
20
21    # Allocation profiles
22    curl -L -u pprofuser:pwd "http://backend-gw/debug/pprof/allocs" -o allocs.pprof
23
24    wait
25    echo "All 30s/15s profiles collected."
26}
27
28# 2. Convert the data into readable format (you can use web version as well)
29{
30    go tool pprof -top -cum cpu.pprof > cpu_top_cumulative.txt
31    go tool pprof -tree cpu.pprof >> cpu_tree.txt
32
33    go tool pprof -top -cum heap.pprof > heap_top_cumulative.txt
34    go tool pprof -tree heap.pprof >> heap_tree.txt
35
36    go tool pprof -top -cum allocs.pprof > allocs_top_cumulative.txt
37    go tool pprof -tree allocs.pprof >> allocs_tree.txt
38
39    go tool pprof -top -cum block.pprof > block_top_cumulative.txt
40    go tool pprof -tree block.pprof >> block_tree.txt
41
42    go tool pprof -top -cum mutex.pprof > mutex_top_cumulative.txt
43    go tool pprof -tree mutex.pprof >> mutex_tree.txt
44
45    go tool pprof -top goroutine.pprof >> goroutines.txt
46
47    go tool trace -pprof=sync trace.out > trace_sync.pprof
48    go tool pprof -top trace_sync.pprof > trace_summary.txt
49}

Then you can open these text files in any text editor to view the profiles. You can also upload these to an LLM to get insights. I used Claude Opus 4.5 for correlating these prifiles with my p90 and p99 metrics and keeping them open on the side to read and cross-reference. While you are reading the profiles you should also look at allocation patterns in your code and make notes of them to see if you can optimize them.

Note: If you use goccy/go-json, test your code extensively. It is a drop-in replacement for encoding/json as claimed, but it has some issues parsing multibyte characters in some languages like Hindi.

PS: Just deployed to prod, and p90 and p99 are looking much better.

Next: I will be testing the bytedance/sonic library and see how it’s JIT and SIMD features compare to goccy/go-json. And maybe I will try to build a simple transcoder to learn more about GC, SIMD, JIT and systems programming. I am liking this a lot.

Until next time.