ClickHouse data masking with regex

Ingesting PII is easy; cleaning it up is hard. Here is how to use ClickHouse's data masking policies and regex to redact sensitive logs in flight without killing your query performance.

WORDS: 734 | CODE BLOCKS: 3 | EXT. LINKS: 0

If you’re running a production observability stack, you’ve already leaked PII. An engineer forgot to redact an email in a log line, or a JWT token ended up in a stack trace.

In most databases, the only fix is to DELETE the data—killing your metrics along with the sensitive info. But ClickHouse has a more elegant approach: Data Masking Policies.

By defining a policy at the role level, you can use regex to swap sensitive patterns with [REDACTED] in real-time, before the query results even leave the server.


Glossary // How Regex Masking works

ClickHouse uses the re2 engine for regex. When a masking policy is applied, the query planner injects a regexpReplace function into the SELECT path.

This means your raw data remains on disk untouched (preserving its checksums and history), but the “View” layer ensures unauthorized users only see the masked version.


Defining the policy

The core of the “Clean Room” strategy is identifying the patterns that shouldn’t be there. Let’s look at a policy for redacting emails and IPv4 addresses using multiRegexpReplace.

sql

-- 1. Create the policy
CREATE MASKING POLICY email_masking ON logs.payload
USING (
    multiRegexpReplace(
        payload,
        ['[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', '\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}'],
        ['[EMAIL_REDACTED]', '[IP_REDACTED]']
    )
);

-- 2. Apply it to an analyst role
GRANT SELECT ON logs.payload MASKING POLICY email_masking TO analyst_role;
Important
Notice the use of multiRegexpReplace. It is significantly faster than chaining multiple regexpReplace calls because it traverses the string once for all patterns, rather than rescanning the entire string for every individual regex.

The Throughput Penalty: A Billion Row Tax

Regex is computationally expensive. When you apply a masking policy, you are effectively adding a “CPU tax” to every SELECT query. If your query is scanning a billion rows, the regex engine has to run a billion times.

In a high-throughput OLAP system like ClickHouse, where we expect sub-second aggregations, a “busy” regex can turn a 100ms query into a 10-second wait.

Here is how to optimize your patterns to keep your “beast” status.

1. Anchors are your friend

If you know the sensitive data always appears after a specific key in a JSON log, anchor your regex. Without anchors, the engine has to attempt a match at every single character position in the string.

text

# BAD: Scans every character for a potential email
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

# GOOD: Only looks for patterns following the "user_email" key
(?i)"user_email"\s*:\s*"([^"]+)"

By using the key as an anchor, the RE2 engine can quickly skip over irrelevant parts of the log line using fast substring matching (like Boyer-Moore) before even engaging the complex regex state machine.

2. Avoid back-tracking and “The Greediness Trap”

ClickHouse uses RE2, which is designed to prevent “catastrophic backtracking” (a common vulnerability in other engines like PCRE). However, RE2 still pays a price for ambiguity.

The most common performance killer is the .* operator when used inside a capture group.

text

# DANGEROUS: Highly ambiguous, forces the engine to explore many paths
ID:.*Email:.*

# FAST: Uses negated character classes to limit the search space
ID:[^,]+,Email:[^,\s]+

By being explicit about what cannot be in the match (e.g., [^,]+ means “anything that isn’t a comma”), you provide the engine with a clear “stop” signal, reducing the number of states the machine has to track.


Benchmarking the Tax

If you are deciding whether to mask at the database level or the application level, consider these numbers (improvised based on typical RE2 behavior):

Strategy Rows/sec Latency (p99)
Raw SELECT (No Masking) 500M 80ms
Anchored Regex Masking 120M 450ms
Unanchored/Naive Regex 15M 4200ms

The takeaway: A poorly written regex is 30x slower than an optimized one. If your masking policy is poorly designed, it doesn’t just protect data; it effectively DOSes your own database.


Closing thoughts

Data masking isn’t a replacement for proper sanitization at the edge. But in the real world, things leak. ClickHouse’s regex masking gives you a safety net that protects your data without requiring a full database wipe.

The goal of a senior engineer isn’t just to “make it work”, but to “make it work within the constraints of the hardware”.

Stay slow and steady. Secure your logs, but don’t kill your throughput.