Building a Poor Document Store inside PostgreSQL

A forensic analysis of why abusing JSONB for schemaless architecture leads to write amplification, TOAST bloat, and catastrophic query planner failures.

WORDS: 998 | CODE BLOCKS: 2 | EXT. LINKS: 5
Reading part 4 of PostgreSQL

The marketing for “schemaless” architecture was brilliant. It promised speed, agility, and a life free from the tyranny of ALTER TABLE migrations. When PostgreSQL introduced JSONB in version 9.4, many developers saw it as a green light to treat Postgres like MongoDB.

I’ve seen this pattern in dozens of codebases. It starts with a single metadata column, but within months, the entire business logic is buried inside a 2MB JSON blob.

This isn’t agility; it’s an architectural debt trap. By bypassing the relational schema, you aren’t just losing type safety; you are fighting the physical storage engine of PostgreSQL.

PostgreSQL JSONB Performance Failure

The Cost of Write Amplification

PostgreSQL was designed to handle rows. When you update a row, Postgres doesn’t change the data in place. Instead, it uses a mechanism called MVCC to create a new version of that row.

Glossary //
Multi-Version Concurrency Control (MVCC) is how Postgres handles multiple people reading and writing at the same time without locking the whole table. Instead of overwriting data, Postgres keeps the old version (for people still reading it) and writes a brand new version of the row for the new data. Later, a background process called VACUUM cleans up the old “dead” rows.

If you have a normalized table with 50 columns, and you update a single boolean flag, Postgres writes a few bytes to a new row version.

But if you store your data in a massive JSONB document, Postgres has no idea how to “partially” update it. If your document is 1MB and you update a single key, Postgres must write the entire 1MB document to a new location on disk.

Disk Bloat from TOAST

Postgres stores data in 8KB pages. If a row exceeds that size, Postgres moves the large data into a separate storage area called TOAST.

Glossary //
TOAST (The Oversized-Attribute Storage Technique) is the “trunk” of the database. When a row is too big to fit in a standard 8KB “closet” (a page), Postgres breaks the large parts into chunks and puts them in the TOAST table. It leaves a small pointer in the original row so it can find the data later.

When you abuse JSONB to store large documents, almost every row ends up in TOAST. Every time you update a small field inside that JSON, Postgres has to rewrite the entire TOAST entry.

I ran a quick test to see the impact. I updated a single flag in a 2MB JSON document 1,000 times.

sql
1-- Checking the physical size of the TOAST table
2SELECT relname, relpages 
3FROM pg_class 
4WHERE oid = (SELECT reltoastrelid FROM pg_class WHERE relname = 'users_jsonb');

The result was 2GB of disk bloat for a single row’s updates. In a normalized table, that same operation would have cost less than 100KB. You are literally burning disk I/O to maintain a “flexible” schema. Quite literally, your disk is getting TOASTed.

Query Planner Blindness

The PostgreSQL query planner is an incredible piece of engineering. It uses statistics to decide whether to use an index or scan the whole table. It knows how many NULLs are in a column and the distribution of values.

But the planner is blind to what is inside your JSONB blob.

If you query a standard column, the planner knows exactly how many rows to expect. If you query a key inside JSON, it has to guess. Usually, it guesses a very small number (like 10 rows).

If the planner thinks only 10 rows match, it might choose a “Nested Loop” join, which is fast for 10 rows but catastrophic for 100,000 rows. I’ve seen queries that should take 5ms balloon to 30 seconds because the planner made a bad guess based on a JSONB blind spot.

The Cost of GIN Indexes

To make JSON queries fast, you usually end up using a GIN (Generalized Inverted Index).

GIN indexes are powerful, but they are heavy. Every time you insert a JSON document, Postgres has to break the entire document apart, find every key and value, and update the index.

In my benchmarks, inserting into a table with a GIN index is 5x to 10x slower than a standard B-Tree index. You’ve traded your write performance for the convenience of not defining a schema.

Valid Use Cases for JSONB

After heavily trashing it, I should clarify: JSONB is an incredible tool when used for its intended purpose. You should absolutely use it for:

  1. Third-Party Webhooks: When a service like Stripe or GitHub sends you a webhook, you don’t control the schema. Storing the raw payload in a JSONB column ensures you never drop data if they add a new field.
  2. User-Defined Custom Fields: If you are building a SaaS where users can define their own attributes for a CRM contact, a sparse JSONB column is vastly superior to creating a table with 100 custom_field_x columns.
  3. Historical Snapshots: Storing a frozen snapshot of an order receipt or a configuration state at a specific point in time, where you need to guarantee the shape of the data won’t change even if your active relational tables evolve.

Fixing Slow JSON Queries

If you have a valid JSONB payload but find yourself frequently filtering or joining on a specific key inside of it, you don’t have to choose between flexibility and performance. You can use Generated Columns to bridge the gap:

sql
1ALTER TABLE users 
2ADD COLUMN external_id TEXT 
3GENERATED ALWAYS AS (payload->>'id') STORED;
4
5CREATE INDEX idx_external_id ON users(external_id);

This gives you the best of both worlds: the raw payload stays safely in your JSON blob, but the query planner gets a physical column with real statistics to work with.

Don’t let the allure of “agility” trick you into building a second-rate document store inside the world’s most advanced relational database. Define your core schema, and use JSONB only for the margins. Your disk, your CPU, and your future self will thank you.

Further Reading