How Multi-Tenant SaaS Actually Works

A complete system design of Relay, a multi-tenant AI API gateway. Covers multi-tenant database architecture (silo vs pool vs bridge), API key authentication, provider routing with failover, dual-layer rate limiting, token-based billing, response caching, row-level security, tenant provisioning, and observability. Full Postgres schemas, mermaid diagrams, and Go snippets included.

WORDS: 7821 | CODE BLOCKS: 41 | EXT. LINKS: 4

For the past month I have been reading about multi-tenant SaaS architecture. Not as an academic exercise. I kept running into the same questions on every project I looked at: where exactly does tenant data go, how do you stop one customer’s bug from becoming every customer’s problem, how does billing actually work at the database level. The blog posts I found were either too abstract or skipped the hard parts entirely.

So I designed one from scratch as a system design exercise.

We are going to design a multi-tenant AI API gateway called Relay (relay.lorbic.com).

Relay does one thing: it sits between your application and every major LLM provider. You send a single API call to Relay. Relay figures out which provider to use, forwards the request, streams the response back, counts the tokens, logs the cost, and bills you at the end of the month. You never manage provider API keys in your application. You never write a retry loop for provider outages. You just call Relay.

1your app
2  → POST relay.lorbic.com/v1/chat/completions
3      { model: "gpt-5.5", messages: [...] }
4  → Relay validates your key, checks limits, routes to OpenAI
5  → OpenAI responds
6  → Relay streams the response back, logs the tokens, records the cost

That is the product. Now let us design the system that makes it work.


What is Multi-Tenant SaaS

Multi-tenancy means one running instance of your software serves many separate customers. Each customer is a tenant. Every tenant has their own data, their own users, their own configuration. But they all share the same servers, the same codebase, and usually the same database.

The word that matters here is isolation. Tenant A must not see tenant B’s data. Tenant A must not slow down tenant B’s requests. Tenant A must not exhaust resources that tenant B is counting on. Isolation is not just a nice property. It is the entire contract with your customers.

In B2B SaaS, a tenant is a company. In B2C, it can be an individual user. For Relay, tenants are engineering teams: a startup that signs up, creates API keys, and calls the Relay API inside their own product. Their end users never know Relay exists.

Three concepts come up everywhere in this design. It is worth defining them upfront before the schemas start:

  • Tenant: the customer unit. One company = one tenant. Has a tenant_id that appears on every row of data they own.
  • User: a person inside a tenant. One tenant can have many users. A user belongs to exactly one tenant at a time (though the same email address can have accounts in multiple tenants).
  • Plan: the subscription tier (free, starter, pro, enterprise). Determines what features are enabled and what rate limits apply.

API Compatibility

Relay uses the OpenAI API format. The same request shape, the same response shape, the same streaming protocol. This is not an accident. It means a developer can point their existing OpenAI SDK at Relay by changing one line.

python

# Before: calling OpenAI directly
client = OpenAI(api_key="sk-...")

# After: calling Relay, zero other changes
client = OpenAI(
    api_key="rly_live_sk_4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA",
    base_url="https://relay.lorbic.com/v1"
)

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Hello"}]
)

Relay exposes three endpoints:

1POST /v1/chat/completions     — chat completions (streaming and non-streaming)
2POST /v1/completions          — legacy text completions
3POST /v1/embeddings           — text embeddings

The request body and response body match the OpenAI spec exactly. Relay adds a few optional headers for its own features: x-relay-cache-ttl to control caching TTL, x-relay-fallback-models to specify per-request fallback preferences, and x-relay-metadata to attach arbitrary key-value pairs to the request log.

Internally, Relay translates requests to each provider’s native format. Anthropic’s API uses a different message structure than OpenAI’s. Mistral’s streaming format has small differences. The provider router handles all of that. The tenant never sees it.


Choosing a Multi-Tenant Database Architecture

Every multi-tenant SaaS starts with the same decision: how do you separate one customer’s data from another’s?

There are three approaches.

Silo gives each tenant their own database. Complete isolation. Very expensive at scale. Used by regulated industries or large enterprise contracts.

Bridge gives each tenant their own schema inside a shared database. acme.requests, beta.requests. Medium isolation, medium cost.

Pool puts everyone in the same tables with a tenant_id column on every row. Cheapest, hardest to get right, works for thousands of tenants.

Relay uses the pool model. Here is why: Relay’s customers are developers and startups. There will be thousands of them. The data shape is uniform across all tenants (requests, tokens, costs). There is no per-tenant schema customization. And the operational cost of running a database per tenant would make the pricing unworkable.

When I was reading about this, the thing that surprised me was how many production SaaS companies started with silo because it felt “safer”, then spent months migrating to pool because the operational cost was unsustainable at scale. The right model depends almost entirely on who your customer is and what you are charging them.

For the very few large enterprise tenants who need data residency or dedicated infrastructure, Relay offers silo as an upgrade. The architecture supports routing them to a separate database via a tenant config flag. But that is the exception, not the default.

graph TD subgraph pool["Pool Model"] A[acme] --> DB[(relay_db)] B[beta] --> DB C[gamma] --> DB DB --> R1[tenant_id = acme rows] DB --> R2[tenant_id = beta rows] DB --> R3[tenant_id = gamma rows] end subgraph silo["Silo - Enterprise"] D[bigcorp] --> DB2[(bigcorp_db)] end

In the pool model, all tenants share one database. Rows are separated by tenant_id. In the silo model, a tenant gets a dedicated database. Relay defaults to pool. Enterprise tenants that need dedicated infrastructure are routed to their own database via a flag in tenant_config.


System Overview

Before going into each component, here is the full picture of what Relay looks like.

graph TB subgraph client["Client"] APP[Your Application] end subgraph relayedge["Relay Edge"] GW[API Gateway] AUTH[Auth Middleware] RL[Rate Limiter] CACHE[Cache Layer] end subgraph relaycore["Relay Core"] ROUTER[Provider Router] STREAM[Stream Handler] LOGGER[Request Logger] end subgraph providers["Providers"] OAI[OpenAI] ANT[Anthropic] MIS[Mistral] GRQ[Groq] end subgraph data["Data"] PG[(Postgres)] RD[(Redis)] end subgraph background["Background"] AGG[Usage Aggregator] BILL[Billing Worker] end APP -->|POST /v1/chat/completions| GW GW --> AUTH AUTH --> RL RL --> CACHE CACHE -->|miss| ROUTER ROUTER --> OAI ROUTER --> ANT ROUTER --> MIS ROUTER --> GRQ ROUTER --> STREAM STREAM -->|SSE chunks| APP STREAM --> LOGGER LOGGER --> PG LOGGER --> RD RD --> AGG AGG --> PG PG --> BILL

Every request passes through auth, rate limiting, and cache check before reaching the provider router. If the cache misses, the router picks a provider, forwards the request, and streams the response back. Background workers handle everything async: logging, usage aggregation, billing, and provider health monitoring.


API Key Authentication and Tenant Resolution

Relay is an API product. There are no browser sessions, no subdomains per tenant, no cookie-based auth for the hot path. Authentication is an API key in the Authorization header.

1Authorization: Bearer rly_live_sk_abc123xyz...

The key itself encodes enough information to skip a lookup in most cases, but the real resolution always hits the database (or Redis cache) once per request.

API key design

Relay uses prefixed, hashed API keys. The key the user sees is never stored. Only its SHA-256 hash is stored.

1Full key:    rly_live_sk_4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA
2Prefix:      rly_live_
3Environment: live (vs test)
4Random part: 4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA (256 bits, URL-safe base64)

The prefix makes it easy to detect keys in source code scanning, error logs, and GitHub secret scanning. Stripe does this. Anthropic does this. It is a good pattern.

sql
 1CREATE TABLE api_keys (
 2    id              TEXT        PRIMARY KEY,
 3    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
 4    name            TEXT        NOT NULL,           -- "Production", "Dev", "CI"
 5    key_prefix      TEXT        NOT NULL,           -- rly_live_ or rly_test_
 6    key_hash        TEXT        NOT NULL UNIQUE,    -- SHA-256 of the full key
 7    key_hint        TEXT        NOT NULL,           -- last 4 chars: "...1eA"
 8    environment     TEXT        NOT NULL DEFAULT 'live',
 9    status          TEXT        NOT NULL DEFAULT 'active',
10    last_used_at    TIMESTAMPTZ,
11    expires_at      TIMESTAMPTZ,
12    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
13    created_by      TEXT        NOT NULL REFERENCES users(id)
14);
15
16CREATE INDEX idx_api_keys_tenant_id ON api_keys(tenant_id);
17CREATE INDEX idx_api_keys_key_hash  ON api_keys(key_hash);

You never store the full key. When you create a key, you show it to the user once and tell them to copy it. After that, if they lose it, they rotate it. The key_hint column (last 4 characters) helps them identify which key is which in the dashboard.

Test mode

rly_test_sk_ keys run in test mode. Relay does not forward test requests to real providers. It returns deterministic mocked responses from a fixed response library. This means no provider costs during development, predictable responses in CI/CD pipelines, and no test traffic polluting the live usage dashboard.

Test mode behaves identically to live mode in every other way: rate limits apply, tokens are counted, requests are logged, errors are returned in the same format. The only difference is the backend. When Relay sees a test key, the router skips provider selection entirely and calls a local mock handler instead.

sql
1-- Requests table distinguishes test from live via the api_key's environment
2-- api_keys.environment = 'live' | 'test'
3-- usage_daily is partitioned by environment so dashboards stay clean

Test responses are deterministic based on the request hash. The same input always produces the same mock output. This makes test assertions reliable.

Auth flow

flowchart LR A["Request
Authorization: Bearer rly_live_sk_..."] --> B["Extract key
from header"] B --> C["SHA-256 hash
the key"] C --> D{"Redis cache
key_hash lookup?"} D -->|hit| E["Return cached
tenant + key metadata"] D -->|miss| F["Postgres lookup
WHERE key_hash = ?"] F --> G{"Found and
status = active?"} G -->|no| H[401 Unauthorized] G -->|yes| I["Cache in Redis
TTL 5 minutes"] I --> J["Inject TenantContext
into request"] J --> K[Next middleware]

The key is hashed on every request and looked up in Redis first. A cache hit means zero database reads for auth. A miss hits Postgres and populates Redis for the next 5 minutes. The resolved tenant context flows through every subsequent middleware and handler via the request context.

go

func APIKeyMiddleware(keys KeyStore, cache Cache) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            raw := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
            if raw == "" {
                writeError(w, 401, "missing api key")
                return
            }

            hash := sha256Hex(raw)

            key, err := cache.GetAPIKey(r.Context(), hash)
            if err != nil {
                key, err = keys.GetByHash(r.Context(), hash)
                if err != nil || key.Status != "active" {
                    writeError(w, 401, "invalid api key")
                    return
                }
                cache.SetAPIKey(r.Context(), hash, key, 5*time.Minute)
            }

            ctx := WithAPIKey(r.Context(), key)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}

Database Schema Design

Tenants and users

sql
 1CREATE TABLE tenants (
 2    id          TEXT        PRIMARY KEY,             -- ten_abc123
 3    slug        TEXT        UNIQUE NOT NULL,         -- acme
 4    name        TEXT        NOT NULL,                -- Acme Inc.
 5    plan        TEXT        NOT NULL DEFAULT 'free', -- free | starter | pro | enterprise
 6    status      TEXT        NOT NULL DEFAULT 'trial',-- trial | active | suspended | deleted
 7    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
 8    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
 9    deleted_at  TIMESTAMPTZ,
10    metadata    JSONB
11);
12
13CREATE INDEX idx_tenants_slug   ON tenants(slug);
14CREATE INDEX idx_tenants_status ON tenants(status);
15
16CREATE TABLE users (
17    id          TEXT        PRIMARY KEY,
18    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
19    email       TEXT        NOT NULL,
20    role        TEXT        NOT NULL DEFAULT 'member', -- owner | admin | member | viewer
21    status      TEXT        NOT NULL DEFAULT 'active',
22    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
23    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
24    deleted_at  TIMESTAMPTZ,
25    UNIQUE(tenant_id, email)
26);
27
28CREATE INDEX idx_users_tenant_id ON users(tenant_id);
29CREATE INDEX idx_users_email     ON users(email);

Model catalog

Relay maintains a catalog of all supported models with their pricing. This drives cost calculation and provider routing.

sql
 1CREATE TABLE providers (
 2    id          TEXT    PRIMARY KEY,         -- openai | anthropic | mistral | groq
 3    name        TEXT    NOT NULL,            -- OpenAI
 4    base_url    TEXT    NOT NULL,            -- https://api.openai.com/v1
 5    status      TEXT    NOT NULL DEFAULT 'active', -- active | degraded | down
 6    priority    INT     NOT NULL DEFAULT 100,      -- lower = more preferred
 7    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
 8);
 9
10CREATE TABLE models (
11    id                  TEXT    PRIMARY KEY,        -- gpt-5.5 | claude-3-5-sonnet-20241022
12    provider_id         TEXT    NOT NULL REFERENCES providers(id),
13    relay_name          TEXT    UNIQUE NOT NULL,    -- the name tenants use: "gpt-5.5"
14    provider_name       TEXT    NOT NULL,           -- name the provider actually accepts
15    display_name        TEXT    NOT NULL,           -- GPT-4o
16    input_cost_per_1m   NUMERIC NOT NULL,           -- USD per 1M input tokens
17    output_cost_per_1m  NUMERIC NOT NULL,           -- USD per 1M output tokens
18    context_window      INT     NOT NULL,           -- max tokens in context
19    supports_streaming  BOOLEAN NOT NULL DEFAULT true,
20    supports_functions  BOOLEAN NOT NULL DEFAULT false,
21    supports_vision     BOOLEAN NOT NULL DEFAULT false,
22    status              TEXT    NOT NULL DEFAULT 'active',
23    created_at          TIMESTAMPTZ NOT NULL DEFAULT now()
24);
25
26CREATE INDEX idx_models_provider_id ON models(provider_id);

A few rows to make it concrete:

1id                            provider_id  relay_name           input_cost   output_cost  context_window
2gpt-5.5                        openai       gpt-5.5               2.50         10.00        128000
3claude-3-5-sonnet-20241022    anthropic    claude-3-5-sonnet    3.00         15.00        200000
4mistral-large-2411            mistral      mistral-large        2.00         6.00         131072
5llama-3.3-70b-versatile       groq         llama-3.3-70b        0.59         0.79         128000

Costs are in USD per 1 million tokens.

Provider credentials

Relay uses its own API keys to call providers. These are pooled: Relay pays the providers, marks up the cost, and charges tenants.

For enterprise tenants who want to bring their own keys (BYOK) and pay providers directly, Relay supports that too. The tenant_id being NULL means the credential belongs to Relay.

Provider credentials are among the most sensitive data Relay stores. A leaked provider key means an attacker can run LLM calls billed to that key. The key_encrypted column stores the key encrypted with AES-256-GCM using a data encryption key (DEK) that is itself encrypted by a master key stored in a KMS (AWS KMS or HashiCorp Vault). The plaintext key never touches disk. Relay calls the KMS API to decrypt the DEK at runtime, decrypts the credential in memory, uses it for the provider request, and discards it.

Key rotation works in two steps: the KMS master key is rotated on a schedule (annually at minimum), then a background job re-encrypts every key_encrypted value under the new master key. The job runs row by row with checkpointing so it can resume if interrupted.

sql
 1CREATE TABLE provider_credentials (
 2    id              TEXT        PRIMARY KEY,
 3    tenant_id       TEXT        REFERENCES tenants(id),  -- NULL = Relay's pooled key
 4    provider_id     TEXT        NOT NULL REFERENCES providers(id),
 5    key_encrypted   BYTEA       NOT NULL,    -- encrypted with Relay's KMS key
 6    key_hint        TEXT        NOT NULL,    -- last 4 chars for display
 7    is_relay_owned  BOOLEAN     NOT NULL DEFAULT true,
 8    status          TEXT        NOT NULL DEFAULT 'active',
 9    rpm_limit       INT,                     -- provider's RPM cap for this key
10    tpm_limit       INT,                     -- provider's TPM cap for this key
11    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
12    UNIQUE(tenant_id, provider_id)
13);

Requests

Every API call Relay processes gets a row here. This is the source of truth for billing, debugging, and usage dashboards.

sql
 1CREATE TABLE requests (
 2    id              TEXT        PRIMARY KEY,            -- req_abc123
 3    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
 4    api_key_id      TEXT        NOT NULL REFERENCES api_keys(id),
 5    model_id        TEXT        NOT NULL REFERENCES models(id),
 6    provider_id     TEXT        NOT NULL REFERENCES providers(id),
 7    credential_id   TEXT        NOT NULL REFERENCES provider_credentials(id),
 8
 9    -- What kind of request
10    request_type    TEXT        NOT NULL,               -- chat | completion | embedding
11    stream          BOOLEAN     NOT NULL DEFAULT false,
12
13    -- Token counts (filled in after response)
14    input_tokens    INT,
15    output_tokens   INT,
16    total_tokens    INT,
17
18    -- Cost in USD
19    input_cost      NUMERIC(12,8),
20    output_cost     NUMERIC(12,8),
21    total_cost      NUMERIC(12,8),
22
23    -- Relay's markup (stored separately for accounting)
24    markup_rate     NUMERIC(5,4) NOT NULL DEFAULT 0.20, -- 20% markup by default
25    relay_revenue   NUMERIC(12,8),
26
27    -- Timing
28    started_at      TIMESTAMPTZ NOT NULL,
29    first_token_at  TIMESTAMPTZ,       -- for streaming: TTFT
30    completed_at    TIMESTAMPTZ,
31    latency_ms      INT,
32    ttft_ms         INT,               -- time to first token
33
34    -- Outcome
35    status          TEXT        NOT NULL, -- success | error | cached | timeout
36    error_code      TEXT,
37    error_message   TEXT,
38    http_status     INT,
39
40    -- Cache
41    cache_hit       BOOLEAN     NOT NULL DEFAULT false,
42    cache_key       TEXT,
43
44    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
45);
46
47CREATE INDEX idx_requests_tenant_created  ON requests(tenant_id, created_at DESC);
48CREATE INDEX idx_requests_model_created   ON requests(model_id, created_at DESC);
49CREATE INDEX idx_requests_api_key_created ON requests(api_key_id, created_at DESC);
50CREATE INDEX idx_requests_status          ON requests(status, created_at DESC);

The requests table grows fast. At 1,000 tenants each making 100 requests per day, that is 100,000 rows per day, 3 million per month. After 90 days you want to start archiving cold rows to cheaper storage (S3 + Parquet). Keep hot rows (last 30 days) in Postgres. Query historical data via a warehouse.

Daily usage rollup

Querying requests for billing is expensive at scale. Aggregate into a daily rollup table that background workers update continuously.

sql
 1CREATE TABLE usage_daily (
 2    tenant_id       TEXT    NOT NULL REFERENCES tenants(id),
 3    model_id        TEXT    NOT NULL REFERENCES models(id),
 4    date            DATE    NOT NULL,
 5    request_count   BIGINT  NOT NULL DEFAULT 0,
 6    input_tokens    BIGINT  NOT NULL DEFAULT 0,
 7    output_tokens   BIGINT  NOT NULL DEFAULT 0,
 8    total_tokens    BIGINT  NOT NULL DEFAULT 0,
 9    total_cost      NUMERIC(12,4) NOT NULL DEFAULT 0,
10    relay_revenue   NUMERIC(12,4) NOT NULL DEFAULT 0,
11    cache_hits      BIGINT  NOT NULL DEFAULT 0,
12    error_count     BIGINT  NOT NULL DEFAULT 0,
13    avg_latency_ms  INT,
14    PRIMARY KEY (tenant_id, model_id, date)
15);

Request Flow: From API Call to Provider and Back

This is the core of Relay. Every API call goes through this pipeline.

sequenceDiagram actor App as Your App participant GW as Relay Gateway participant Auth as Auth Middleware participant RL as Rate Limiter participant Cache as Cache Layer participant Router as Provider Router participant OAI as OpenAI App->>GW: POST /v1/chat/completions
Authorization: Bearer rly_live_sk_... GW->>Auth: validate API key Auth->>Auth: SHA-256 hash key
lookup in Redis / Postgres Auth-->>GW: TenantContext injected GW->>RL: check RPM + TPM limits RL-->>GW: ok GW->>Cache: check cache
hash(model + messages + params) Cache-->>GW: miss GW->>Router: route request
model=gpt-5.5 Router->>Router: lookup model in catalog
select provider + credential Router->>OAI: POST api.openai.com/v1/chat/completions
Authorization: Bearer sk-relay-pooled-... OAI-->>Router: 200 OK (streaming SSE) Router-->>App: pipe SSE chunks Router->>Router: count tokens from response
calculate cost Router->>GW: log request async GW->>GW: INSERT into requests
INCR Redis counters

Model routing and failover

The router resolves which provider to use for a given model name. If the primary provider is down or over its rate limit, it falls back to an alternative.

flowchart TD A["Request
model = gpt-5.5"] --> B["Lookup model
in catalog"] B --> C["Get providers
ordered by priority"] C --> D{"Tenant has
BYOK key?"} D -->|yes| E[Use tenant's key] D -->|no| F[Use Relay pooled key] E --> G{"Provider
healthy?"} F --> G G -->|yes| H[Forward request] G -->|no| I["Next provider
by priority"] I --> J{Any left?} J -->|yes| G J -->|no| K[503 No providers available] H -->|5xx from provider| I H -->|success| L[Return response]
go

func (r *Router) Route(ctx context.Context, req *RelayRequest) (*RelayResponse, error) {
    model, err := r.catalog.GetModel(req.Model)
    if err != nil {
        return nil, ErrModelNotFound
    }

    providers := r.getOrderedProviders(ctx, model)
    for _, p := range providers {
        cred, err := r.getCredential(ctx, req.TenantID, p.ID)
        if err != nil {
            continue
        }

        resp, err := r.forward(ctx, p, cred, req)
        if err == nil {
            return resp, nil
        }

        // 5xx from provider: try next
        if isProviderError(err) {
            r.health.MarkDegraded(p.ID)
            continue
        }

        // 4xx from provider: the request itself is bad, do not retry
        return nil, err
    }

    return nil, ErrNoProvidersAvailable
}

Handling streaming (SSE)

When the tenant sends "stream": true, the provider returns a Server-Sent Events stream. Relay proxies it chunk by chunk. This is not a buffered response, it is a live pipe.

sequenceDiagram participant App participant Relay participant Provider App->>Relay: POST /v1/chat/completions
stream: true Relay->>Provider: POST (stream: true) Provider-->>Relay: data: {"delta": {"content": "Hello"}} Relay-->>App: data: {"delta": {"content": "Hello"}} Provider-->>Relay: data: {"delta": {"content": " world"}} Relay-->>App: data: {"delta": {"content": " world"}} Provider-->>Relay: data: [DONE] Relay-->>App: data: [DONE] Relay->>Relay: count tokens from chunks
log request async

Relay counts tokens as chunks arrive by tracking the usage field that providers send in the final chunk. For providers that do not send usage in streaming mode, Relay uses a local tokenizer (tiktoken for OpenAI models, SentencePiece for others) to estimate counts. The estimate is reconciled against the actual count when available.

Token counting is done on a goroutine that reads from the same SSE stream, separate from the goroutine that proxies chunks to the client. The client sees no added latency.

Retry logic

Relay retries failed provider requests automatically, but only for the right kind of failures.

Retryable: 5xx errors from the provider (server errors), network timeouts, connection resets. These indicate the provider had a problem, not that the request was wrong.

Not retryable: 4xx errors from the provider (except 429). A 400 means the request was malformed. A 401 means the credential is wrong. Retrying will not help and wastes time.

Provider rate limit (429 from provider): Relay does not wait and retry on the same key. It immediately switches to the next available provider key or provider. Waiting for a rate limit window to reset would add seconds of latency for the tenant.

Retry policy: up to 3 attempts, 100ms exponential backoff between attempts, each attempt on a different provider or key if possible. If all attempts fail, Relay returns the last error to the tenant.

Relay does not retry on behalf of the tenant after returning an error. If Relay returns a 502 or 503, the tenant’s code decides whether to retry at their layer.

Error response format

Every error Relay returns uses the same JSON structure. Developers should be able to write one error handler and handle all Relay errors with it.

json
 1{
 2  "error": {
 3    "code": "rate_limit_exceeded",
 4    "message": "You have exceeded 100 requests per minute on the free plan.",
 5    "type": "rate_limit_error",
 6    "details": {
 7      "limit": 100,
 8      "window": "per_minute",
 9      "reset_at": "2026-07-01T12:05:00Z",
10      "upgrade_url": "https://relay.lorbic.com/billing"
11    }
12  }
13}

The full set of error codes:

HTTP status error.code Meaning
401 invalid_api_key Key not found or revoked
402 tenant_suspended Account suspended, check billing
402 quota_exceeded Hard quota limit hit (seats, storage)
404 model_not_found Requested model does not exist in catalog
422 invalid_request Request body failed validation
429 rate_limit_exceeded RPM or TPM limit hit
502 provider_error Provider returned an error after retries
503 no_provider_available All providers for this model are degraded
504 request_timeout Provider did not respond within the timeout

The details field is present on rate_limit_exceeded and quota_exceeded errors so tenants can surface actionable information to their own users. If a tenant’s app is hitting Relay’s rate limit, they should know the reset time and the upgrade path, not just a generic “too many requests.”


Rate Limiting

Relay has to manage rate limits at two levels.

Inbound: protecting Relay’s own infrastructure from tenants who send too many requests.

Outbound: protecting Relay’s provider API keys from hitting the provider’s limits.

Both use Redis counters with sliding windows.

The dual-layer rate limiting was the part I found most interesting while designing this. Most rate limiting content covers the inbound side. The outbound side is specific to API gateway products but just as important. If Relay’s OpenAI key hits OpenAI’s TPM limit because of one large tenant, every other tenant on that key starts getting 429s from the provider. The router has to detect this and re-route before it becomes the tenant’s problem.

Inbound tenant limits

sql
 1-- Plan-level defaults
 2CREATE TABLE quota_definitions (
 3    plan         TEXT    NOT NULL,
 4    resource     TEXT    NOT NULL,   -- rpm | tpm | requests_per_day
 5    limit_value  BIGINT  NOT NULL,   -- -1 = unlimited
 6    window       TEXT    NOT NULL,   -- per_minute | per_day
 7    UNIQUE(plan, resource, window)
 8);
 9
10-- Sample values:
11-- plan      resource          limit     window
12-- free      rpm               60        per_minute
13-- free      tpm               40000     per_minute
14-- free      requests_per_day  500       per_day
15-- starter   rpm               500       per_minute
16-- starter   tpm               400000    per_minute
17-- pro       rpm               3000      per_minute
18-- pro       tpm               2000000   per_minute
19-- pro       requests_per_day  -1        per_day (unlimited)
20
21-- Per-tenant overrides (for enterprise deals or support credits)
22CREATE TABLE quota_overrides (
23    id           TEXT        PRIMARY KEY,
24    tenant_id    TEXT        NOT NULL REFERENCES tenants(id),
25    resource     TEXT        NOT NULL,
26    limit_value  BIGINT      NOT NULL,
27    window       TEXT        NOT NULL,
28    reason       TEXT,
29    expires_at   TIMESTAMPTZ,
30    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
31    UNIQUE(tenant_id, resource, window)
32);

Redis key structure for inbound limits:

1rl:in:ten_abc123:rpm   → sliding window counter, 60s window
2rl:in:ten_abc123:tpm   → sliding window counter, 60s window

Outbound provider limits

Each of Relay’s provider API keys has its own RPM and TPM cap set by the provider. Relay tracks these in Redis too.

1rl:out:openai:key_xyz:rpm   → counter, 60s window
2rl:out:openai:key_xyz:tpm   → counter, 60s window

When a pooled key is close to its provider limit, the router skips it and picks the next available key or provider.

The rate limit pipeline

flowchart LR A[Request] --> B{"Inbound tenant
RPM check"} B -->|over| C["429 Too Many Requests
retry-after header"] B -->|ok| D{"Inbound tenant
TPM check"} D -->|over| C D -->|ok| E[Route to provider] E --> F{"Outbound key
RPM check"} F -->|over| G["Pick next key
or provider"] F -->|ok| H[Forward to provider] G --> F

A request fails fast at inbound RPM or TPM before any provider call is made. The outbound check happens inside the router, per-key. If a key is saturated, the router picks the next key or the next provider, not the next retry window.

The TPM check on inbound traffic uses an estimated token count from the request body (input tokens only). The actual token count (input + output) is only known after the response. Relay does a conservative pre-check against the estimated input, then reconciles and updates the TPM counter after the response.


Token Cost Tracking and Billing

Relay’s pricing model: Relay pays the provider at published rates, marks up by 20%, and charges tenants at the marked-up rate. The markup funds Relay’s infrastructure and margin.

1Tenant sends:  1,000 input tokens, 500 output tokens on gpt-5.5
2Provider cost: (1000 / 1M) × $2.50 + (500 / 1M) × $10.00 = $0.00250 + $0.00500 = $0.00750
3Relay markup:  $0.00750 × 1.20 = $0.00900
4Tenant billed: $0.00900
5Relay revenue: $0.00150

Every request row in the requests table stores both the raw provider cost and the Relay revenue. This makes accounting and audits straightforward.

Cost calculation

flowchart LR A["Response received
input_tokens=1000
output_tokens=500"] --> B["Lookup model pricing
input_cost_per_1m
output_cost_per_1m"] B --> C["input_cost = tokens / 1M x rate
output_cost = tokens / 1M x rate"] C --> D[total_cost = input + output] D --> E[relay_revenue = total_cost x markup_rate] E --> F[INSERT into requests] F --> G[INCR usage_daily rollup]

Billing cycle

Relay charges tenants monthly via Stripe. The billing worker runs at the start of each month, reads the usage_daily table for the prior month, and creates a Stripe invoice line item per model per tenant.

sql
 1CREATE TABLE billing_periods (
 2    id          TEXT        PRIMARY KEY,
 3    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
 4    period_start DATE       NOT NULL,
 5    period_end   DATE       NOT NULL,
 6    status      TEXT        NOT NULL DEFAULT 'open',  -- open | finalized | paid | failed
 7    total_cost  NUMERIC(12,4),
 8    invoice_id  TEXT,              -- Stripe invoice ID
 9    finalized_at TIMESTAMPTZ,
10    paid_at     TIMESTAMPTZ,
11    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
12    UNIQUE(tenant_id, period_start)
13);
14
15CREATE TABLE billing_line_items (
16    id              TEXT        PRIMARY KEY,
17    billing_period_id TEXT      NOT NULL REFERENCES billing_periods(id),
18    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
19    model_id        TEXT        NOT NULL REFERENCES models(id),
20    request_count   BIGINT      NOT NULL,
21    input_tokens    BIGINT      NOT NULL,
22    output_tokens   BIGINT      NOT NULL,
23    total_cost      NUMERIC(12,4) NOT NULL,
24    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
25    UNIQUE(billing_period_id, model_id)
26);

Billing is an entire system design on its own: Stripe subscription lifecycle, idempotent webhook handling, retry logic for failed payments, proration, credits, and tax calculation. This post covers the data model. The billing pipeline gets its own post.


Response Caching: Exact Match and Semantic

Caching completions saves money. If 1,000 tenants call the same model with the same system prompt and similar questions, Relay can serve many of those from cache instead of paying the provider.

Relay supports two cache strategies.

Exact match caching

The simplest form. Hash the full request (model + messages + temperature + all params). If you have seen this exact request before, return the cached response.

1cache_key = SHA-256(
2    model_id +
3    JSON.stringify(messages) +
4    temperature + top_p + max_tokens + ...
5)

Stored in Redis with a configurable TTL. Default 1 hour. Tenants can set their own TTL per API call via a x-relay-cache-ttl header.

Semantic caching

Exact match only works for identical requests. Semantic caching uses embeddings to find requests that are similar enough to return the same response.

flowchart TD A[New request] --> B["Embed the user message
using a small model"] B --> C["Vector similarity search
against cached embeddings"] C --> D{"Similarity score
> threshold?"} D -->|yes| E["Return cached response
cache_hit = semantic"] D -->|no| F["Forward to provider
cache the response + embedding"]

The embedding model for cache lookup should be cheap and fast (text-embedding-3-small or similar). The cost of generating the lookup embedding is tiny compared to the LLM call it might save.

sql
 1CREATE TABLE response_cache (
 2    id              TEXT        PRIMARY KEY,
 3    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
 4    cache_key       TEXT        NOT NULL,         -- exact match hash
 5    model_id        TEXT        NOT NULL REFERENCES models(id),
 6    request_hash    TEXT        NOT NULL,
 7    response_body   JSONB       NOT NULL,
 8    input_tokens    INT         NOT NULL,
 9    output_tokens   INT         NOT NULL,
10    embedding       vector(1536),                 -- pgvector, for semantic cache
11    hit_count       INT         NOT NULL DEFAULT 0,
12    expires_at      TIMESTAMPTZ NOT NULL,
13    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
14);
15
16CREATE INDEX idx_response_cache_tenant_key ON response_cache(tenant_id, cache_key);
17CREATE INDEX idx_response_cache_embedding  ON response_cache
18    USING hnsw (embedding vector_cosine_ops)
19    WHERE embedding IS NOT NULL;

Observability

You cannot operate a multi-tenant API platform without per-tenant observability. When a tenant files a support ticket saying “your API is slow,” you need to be able to pull up exactly their requests, their latencies, and their errors in under a minute.

What to track per request

Every request log entry should have:

  • tenant_id and api_key_id
  • model_id and provider_id
  • latency_ms and ttft_ms (time to first token, the metric tenants care about for streaming)
  • input_tokens, output_tokens, total_cost
  • status and error_code
  • cache_hit

These are already in the requests table. Make sure every log line, trace span, and metric tag also carries tenant_id.

Structured logging

go

func RequestLoggingMiddleware(logger *slog.Logger) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            key := APIKeyFromContext(r.Context())
            reqLogger := logger.With(
                slog.String("tenant_id", key.TenantID),
                slog.String("api_key_id", key.ID),
                slog.String("request_id", newRequestID()),
            )
            ctx := WithLogger(r.Context(), reqLogger)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}

Health status per provider

Relay’s router uses a health tracker to avoid sending requests to degraded providers. The tracker works on a circuit breaker pattern: after N consecutive errors from a provider, mark it degraded and stop routing to it until it recovers.

sql
 1CREATE TABLE provider_health_events (
 2    id          TEXT        PRIMARY KEY,
 3    provider_id TEXT        NOT NULL REFERENCES providers(id),
 4    status      TEXT        NOT NULL,       -- healthy | degraded | down
 5    error_rate  NUMERIC,                   -- 5-minute error rate at time of event
 6    p95_ms      INT,                       -- p95 latency at time of event
 7    message     TEXT,
 8    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
 9);
10
11CREATE INDEX idx_provider_health_events ON provider_health_events(provider_id, created_at DESC);

The provider status page (status.relay.lorbic.com) reads from this table. Tenants can subscribe to status updates.

Noisy neighbor detection

In the pool model, one tenant running an aggressive batch job can saturate Relay’s provider keys and cause latency for everyone else. Detect this by tracking per-tenant request volume in Redis with a 1-minute rolling window. If a single tenant exceeds a percentage threshold of total traffic, flag them and apply stricter rate limits automatically.


Role-Based Access Control

API calls use API key auth. The Relay dashboard (where tenants manage keys, view usage, configure settings) uses JWT auth.

json
 1{
 2  "iss": "relay.lorbic.com",
 3  "sub": "usr_abc123",
 4  "tenant_id": "ten_xyz789",
 5  "tenant_slug": "acme",
 6  "role": "admin",
 7  "permissions": ["api_keys:read", "api_keys:write", "billing:read", "usage:read"],
 8  "iat": 1751400000,
 9  "exp": 1751500000
10}

RBAC tables

sql
 1CREATE TABLE roles (
 2    id          TEXT        PRIMARY KEY,
 3    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
 4    name        TEXT        NOT NULL,       -- owner | admin | developer | viewer
 5    is_system   BOOLEAN     NOT NULL DEFAULT false,
 6    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
 7    UNIQUE(tenant_id, name)
 8);
 9
10CREATE TABLE permissions (
11    id          TEXT PRIMARY KEY,
12    resource    TEXT NOT NULL,              -- api_keys | usage | billing | members | settings
13    action      TEXT NOT NULL,             -- create | read | update | delete
14    UNIQUE(resource, action)
15);
16
17CREATE TABLE role_permissions (
18    role_id       TEXT NOT NULL REFERENCES roles(id),
19    permission_id TEXT NOT NULL REFERENCES permissions(id),
20    PRIMARY KEY (role_id, permission_id)
21);
22
23CREATE TABLE user_roles (
24    user_id   TEXT NOT NULL REFERENCES users(id),
25    role_id   TEXT NOT NULL REFERENCES roles(id),
26    PRIMARY KEY (user_id, role_id)
27);

Audit log

Every sensitive action in the dashboard gets recorded: key creation, key revocation, plan changes, member removal, billing configuration.

sql
 1CREATE TABLE audit_log (
 2    id            TEXT        PRIMARY KEY,
 3    tenant_id     TEXT        REFERENCES tenants(id),
 4    actor_id      TEXT        NOT NULL,
 5    actor_type    TEXT        NOT NULL,    -- user | platform_admin | system
 6    action        TEXT        NOT NULL,    -- api_key.created | api_key.revoked | member.removed
 7    resource_id   TEXT,
 8    resource_type TEXT,
 9    metadata      JSONB,
10    ip_address    INET,
11    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
12);
13
14CREATE INDEX idx_audit_log_tenant   ON audit_log(tenant_id, created_at DESC);
15CREATE INDEX idx_audit_log_actor    ON audit_log(actor_id, created_at DESC);

Tenant Provisioning: From Signup to First API Call

What happens when someone signs up

When a user submits the signup form, Relay does the minimum work synchronously: insert a tenant row with status=PROVISIONING, create the owner user record, and return a 201. The browser redirects immediately. Everything else runs in the background.

The background worker picks up the job and does the slower parts: seeding system roles, creating a Stripe customer record, and flipping the tenant to TRIAL. The Stripe customer is created at signup, not when the user adds a payment method. This is intentional. You need a Stripe customer ID to attach payment methods, subscriptions, and invoices later. Creating it lazily (at payment time) means every billing code path has to handle the case where the ID does not exist yet, which causes bugs.

The welcome email includes the tenant’s first API key. The key is generated by the worker, not the API handler, because key generation requires the tenant record to be fully set up.

If the worker fails partway through, the tenant stays in PROVISIONING. A monitoring job scans for tenants stuck in PROVISIONING for more than 5 minutes and either retries the job or pages on-call.

sequenceDiagram actor User participant API participant Queue participant Worker participant DB participant Stripe User->>API: POST /auth/signup
{ email, password, org_name } API->>DB: INSERT tenant (status=PROVISIONING) API->>DB: INSERT user (role=owner) API->>Queue: enqueue ProvisionTenant job API-->>User: 201 Created
{ tenant_id, redirect: /onboarding } Worker->>Queue: dequeue job Worker->>DB: seed default roles (owner, admin, developer, viewer) Worker->>DB: seed plan quota definitions Worker->>Stripe: create Stripe customer Worker->>DB: UPDATE tenant SET status=TRIAL, stripe_customer_id=cus_... Worker-->>User: send welcome email with first API key

Tenant status transitions

The status column on the tenants table is the single source of truth for what a tenant can do. Every API request, every dashboard login, and every background job checks this column before doing anything.

PROVISIONING: The tenant record exists but setup is not complete. API calls are rejected with a 503. The dashboard shows a loading state.

TRIAL: Setup is complete. The tenant has full API access up to the trial plan’s rate limits. A trial expiry timestamp lives in tenant_config. When it passes without a payment method being added, a background job moves the tenant to EXPIRED.

ACTIVE: A payment method is on file and billing is working. Full access based on the tenant’s plan.

SUSPENDED: Payment failed, or a Relay admin manually suspended the account. API calls return a 402 with a clear message and a link to the billing page. The tenant’s data is untouched. If payment resolves within the grace period, the account returns to ACTIVE with no data loss. This is intentional: you want tenants to fix their billing, not lose access to everything they built.

EXPIRED: The trial ended without a payment method being added. Same behavior as SUSPENDED for API calls. After N days with no action, the tenant moves to DELETED.

DELETED: Soft delete. The row stays in the database with deleted_at set, but all API calls are rejected. Data is retained for 30 days for recovery or dispute resolution, then a GDPR-compliant hard deletion job removes all tenant rows across every table.

Status transitions are intentionally one-directional except for SUSPENDED back to ACTIVE. You cannot un-delete a tenant. You cannot go from ACTIVE back to TRIAL. These constraints are enforced at the application layer via explicit state machine logic, not just database constraints.

stateDiagram-v2 [*] --> PROVISIONING : signup PROVISIONING --> TRIAL : provisioning complete TRIAL --> ACTIVE : payment method added TRIAL --> EXPIRED : trial period ends ACTIVE --> SUSPENDED : payment failed or admin action SUSPENDED --> ACTIVE : payment resolved SUSPENDED --> DELETED : grace period expires EXPIRED --> DELETED : no conversion ACTIVE --> DELETED : user cancels DELETED --> [*]

Suspended tenants get a 429 response with a billing error message on every API call. This is important: the error message should tell them what is wrong and link them to the billing page. A silent 401 causes confusion.


Feature Flags and Per-Tenant Configuration

sql
 1CREATE TABLE plan_features (
 2    plan     TEXT    NOT NULL,
 3    feature  TEXT    NOT NULL,    -- semantic_cache | byok | sso | audit_log | higher_rpm | webhooks
 4    enabled  BOOLEAN NOT NULL DEFAULT true,
 5    PRIMARY KEY (plan, feature)
 6);
 7
 8CREATE TABLE tenant_feature_overrides (
 9    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
10    feature     TEXT        NOT NULL,
11    enabled     BOOLEAN     NOT NULL,
12    reason      TEXT,
13    expires_at  TIMESTAMPTZ,
14    PRIMARY KEY (tenant_id, feature)
15);
16
17CREATE TABLE tenant_config (
18    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
19    key         TEXT        NOT NULL,
20    value       TEXT        NOT NULL,
21    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
22    PRIMARY KEY (tenant_id, key)
23);
24
25-- Example config rows:
26-- tenant_id   key                      value
27-- ten_abc     default_cache_ttl_sec    3600
28-- ten_abc     markup_rate              0.15     (negotiated lower markup for volume)
29-- ten_abc     allowed_models           ["gpt-5.5","claude-3-5-sonnet"]
30-- ten_abc     webhook_url              https://acme.com/hooks/relay

Cache the feature flag state per tenant in Redis (60s TTL). Do not query Postgres on every API call.


Full Architecture

graph TB subgraph tenants["Tenants"] APP1[App: acme] APP2[App: beta] DASH[Dashboard] end subgraph edge["Edge"] DNS[Wildcard DNS] LB[Load Balancer] end subgraph gateway["Gateway"] GW[API Gateway] AUTH[API Key Auth] RL_IN[Rate Limiter] CACHE[Cache Layer] ROUTER[Provider Router] STREAM[SSE Handler] end subgraph dashapi["Dashboard API"] DASH_API[Dashboard API] end subgraph providers["Providers"] OAI[OpenAI] ANT[Anthropic] MIS[Mistral] GRQ[Groq] end subgraph data["Data"] PG[(Postgres)] RD[(Redis)] S3[(S3)] end subgraph workers["Workers"] LOGGER[Request Logger] AGG[Usage Aggregator] BILL[Billing Worker] HEALTH[Health Monitor] ARCHIVE[Request Archiver] end APP1 --> DNS APP2 --> DNS DASH --> DNS DNS --> LB LB --> GW LB --> DASH_API GW --> AUTH AUTH --> RL_IN RL_IN --> CACHE CACHE -->|miss| ROUTER ROUTER --> OAI ROUTER --> ANT ROUTER --> MIS ROUTER --> GRQ ROUTER --> STREAM STREAM --> LOGGER DASH_API --> PG DASH_API --> RD LOGGER --> PG LOGGER --> RD RD --> AGG AGG --> PG PG --> BILL PG --> ARCHIVE ARCHIVE --> S3 HEALTH --> PG HEALTH --> RD

The gateway layer is stateless. Tenant apps hit the load balancer, which routes to any gateway pod. All shared state (rate limits, cache, feature flags) lives in Redis. All durable state (requests, usage, billing, config) lives in Postgres. Workers run separately and never block the hot path.


Scaling a Multi-Tenant API Gateway

Relay is stateless at the gateway layer. You can add more gateway pods behind the load balancer without coordination. The state lives in Postgres and Redis.

Redis is the first bottleneck. Rate limit counters and cache lookups hit Redis on every request. Run Redis in cluster mode with at least 3 shards. Shard by tenant_id so one tenant’s traffic does not create hot keys for others.

Postgres becomes the bottleneck as the requests table grows. Partition it by created_at (monthly partitions). Drop or archive partitions older than your retention window. Use PgBouncer in transaction mode for the gateway (short, fast INSERTs) and session mode for the background workers (longer transactions, aggregation queries).

Provider key rotation: as Relay grows, a single OpenAI API key will hit TPM limits. Relay maintains a pool of provider keys per provider and distributes load across them using the outbound rate limit counters in Redis. The router picks the key with the most remaining capacity in the current window.


FAQ

Why store the hash of an API key instead of the key itself?

If your database is compromised, the attacker gets hashes, not keys. A raw API key in the database is a plaintext credential. SHA-256 is one-way: you verify a key by hashing the incoming value and comparing it against the stored hash, but you cannot reverse a hash back into the original key. Treat API keys the same way you treat passwords.

How do you prevent a missing WHERE tenant_id clause from leaking data across tenants?

Row-Level Security in Postgres. You attach a policy to every tenant-scoped table that restricts all queries to rows matching app.current_tenant_id, a session variable the application sets at the start of every request. A bare SELECT * FROM requests with no WHERE clause still returns only that tenant’s rows. The policy runs inside the database engine, below the application layer, so a bug in application code cannot bypass it.

What happens if OpenAI goes down?

The router tries providers in priority order. If OpenAI returns a 5xx, it marks OpenAI as degraded and tries the next provider that supports the requested model. If no provider is available, the request returns 503. The health monitor runs on a short interval and flips providers back to healthy once they start returning 2xx responses again. Tenants see a brief error window during a full provider outage, not an extended one.

How do you count tokens accurately for billing when streaming does not always return usage data?

Providers that support stream_options: { include_usage: true } (OpenAI does) include a usage object in the final SSE chunk. Relay reads those counts directly. For providers that do not, Relay runs a local tokenizer (tiktoken for OpenAI-compatible models, SentencePiece for others) to estimate counts. The estimate is within 1-2% of the actual count in practice. When there is a discrepancy, Relay bills the lower number rather than over-billing the tenant.

How do you handle rate limiting if Redis goes down?

Redis Cluster with at least 3 nodes and automatic failover handles most availability scenarios. If the entire cluster becomes unreachable, Relay fails open: it allows requests through rather than rejecting all traffic. Losing rate limit enforcement for a short period during a Redis failure is an acceptable trade-off. Taking down the entire API is not.

What is the noisy neighbor problem and how does it show up in a pool model SaaS?

One tenant runs a large batch job and consumes a disproportionate share of shared infrastructure: database connections, query time, provider key capacity. This increases latency for every other tenant on the same cluster without them doing anything wrong. Relay tracks per-tenant request volume in Redis with a 1-minute rolling window. Tenants that exceed a share threshold get their rate limits tightened automatically. If the pattern persists, they are a candidate for migration to a silo.

How do you migrate a tenant from the pool model to a silo without downtime?

Postgres logical replication. Set up a replication slot on the shared database filtered to the tenant’s rows. Let the silo database catch up to near-real-time lag. Then pause writes for that tenant briefly (a few seconds), apply the final delta, update the routing config to point new requests at the silo, and release the write pause. The tenant sees a brief stall, not a maintenance window.

Why keep billing records in separate tables instead of querying the requests table at invoice time?

The requests table gets very large and is archived after 30-90 days to keep query performance acceptable. Billing records need to be retained for years for tax, legal, and dispute reasons. Separating billing_line_items and usage_daily from raw request logs means you can archive or drop old request partitions without losing billing history. The rollup also makes monthly invoice generation a fast aggregation query rather than a multi-billion-row scan.

What does BYOK mean and why would an enterprise tenant want it?

Bring Your Own Keys. The tenant provides their own OpenAI or Anthropic API keys instead of Relay’s pooled ones. They pay the provider directly at published rates, and Relay charges a flat platform fee instead of a per-token markup. Enterprises want BYOK for three reasons: cost control at high volume, data privacy (their traffic flows under their own API key, not Relay’s), and compliance requirements that restrict which credentials can touch their data.

Why use prefixed API keys like rly_live_sk_?

Three reasons. First, the prefix makes keys identifiable in logs, error messages, and accidental source code commits. Second, secret scanning tools (GitHub, trufflehog, Doppler) can be configured to detect your specific prefix and alert on leaks before damage is done. Third, the environment segment (live vs test) lets Relay reject live keys on test endpoints and test keys on live endpoints at the routing layer, before any business logic runs.

How does semantic caching differ from exact match caching, and what can go wrong?

Exact match caching hashes the full request and returns a cached response only if the exact same request was made before. Hit rates are low. Semantic caching embeds the user message into a vector and finds cached responses from requests that are semantically similar. Hit rates are higher, but there is a risk: two questions that are phrased similarly but have meaningfully different answers could get the same cached response. The similarity threshold controls this trade-off. Set it too high and you get wrong answers. Set it too low and cache hits are rare. Semantic caching should never be enabled for requests where correctness is critical (medical, legal, financial).

What happens to a tenant’s data when their account is deleted?

Relay uses a soft delete: the tenant row gets deleted_at set and status set to DELETED. All API access is immediately rejected. The underlying data stays in the database for 30 days. During that window, a tenant can contact support to recover their account. After 30 days, a scheduled deletion job hard-deletes all rows across every table where tenant_id matches, in dependency order (child tables first, then the tenant row). This satisfies GDPR Article 17 (right to erasure) and gives you an audit trail during the retention window.


Things Worth Their Own Post

  • Billing pipeline in depth: Stripe subscription webhooks, idempotent event processing, dunning logic, credit notes, usage-based invoice generation.
  • Semantic caching: embedding model selection, similarity thresholds, cache invalidation strategy, cost-benefit analysis.
  • BYOK (Bring Your Own Keys): encrypting and storing tenant provider credentials, key rotation, audit trail for credential usage.
  • Webhooks: delivering request completion events to tenants, retry logic, failure handling, signing payloads.
  • Provider health monitoring: circuit breaker implementation, alerting on degradation, automatic failover testing.
  • Request archival: streaming from Postgres to Parquet on S3, querying historical data with DuckDB or Athena.
  • Multiregion: running Relay in US, EU, AP with tenant routing to nearest region, cross-region cost aggregation.