# How Multi-Tenant SaaS Actually Works

> **Source:** [https://lorbic.com/how-multi-tenant-saas-works/](https://lorbic.com/how-multi-tenant-saas-works/)
> **Author:** [Vikash Patel](https://vikashpatel.net)
> **Published:** July 01, 2026
> **Reading Time:** 37 min
> 
> *This is the raw Markdown source of the article from the [Lorbic Technical Journal](https://lorbic.com/).*

---


For the past month I have been reading about multi-tenant SaaS architecture. Not as an academic exercise. I kept running into the same questions on every project I looked at: where exactly does tenant data go, how do you stop one customer's bug from becoming every customer's problem, how does billing actually work at the database level. The blog posts I found were either too abstract or skipped the hard parts entirely.

So I designed one from scratch as a system design exercise.

We are going to design a multi-tenant AI API gateway called **Relay** (`relay.lorbic.com`).

Relay does one thing: it sits between your application and every major LLM provider. You send a single API call to Relay. Relay figures out which provider to use, forwards the request, streams the response back, counts the tokens, logs the cost, and bills you at the end of the month. You never manage provider API keys in your application. You never write a retry loop for provider outages. You just call Relay.

```
your app
  → POST relay.lorbic.com/v1/chat/completions
      { model: "gpt-5.5", messages: [...] }
  → Relay validates your key, checks limits, routes to OpenAI
  → OpenAI responds
  → Relay streams the response back, logs the tokens, records the cost
```

That is the product. Now let us design the system that makes it work.

---

## What is Multi-Tenant SaaS

Multi-tenancy means one running instance of your software serves many separate customers. Each customer is a **tenant**. Every tenant has their own data, their own users, their own configuration. But they all share the same servers, the same codebase, and usually the same database.

The word that matters here is **isolation**. Tenant A must not see tenant B's data. Tenant A must not slow down tenant B's requests. Tenant A must not exhaust resources that tenant B is counting on. Isolation is not just a nice property. It is the entire contract with your customers.

In B2B SaaS, a tenant is a company. In B2C, it can be an individual user. For Relay, tenants are engineering teams: a startup that signs up, creates API keys, and calls the Relay API inside their own product. Their end users never know Relay exists.

Three concepts come up everywhere in this design. It is worth defining them upfront before the schemas start:

- **Tenant**: the customer unit. One company = one tenant. Has a `tenant_id` that appears on every row of data they own.
- **User**: a person inside a tenant. One tenant can have many users. A user belongs to exactly one tenant at a time (though the same email address can have accounts in multiple tenants).
- **Plan**: the subscription tier (free, starter, pro, enterprise). Determines what features are enabled and what rate limits apply.

---

## API Compatibility

Relay uses the OpenAI API format. The same request shape, the same response shape, the same streaming protocol. This is not an accident. It means a developer can point their existing OpenAI SDK at Relay by changing one line.

{{< code lang="python" >}}
# Before: calling OpenAI directly
client = OpenAI(api_key="sk-...")

# After: calling Relay, zero other changes
client = OpenAI(
    api_key="rly_live_sk_4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA",
    base_url="https://relay.lorbic.com/v1"
)

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Hello"}]
)
{{< /code >}}

Relay exposes three endpoints:

```
POST /v1/chat/completions     — chat completions (streaming and non-streaming)
POST /v1/completions          — legacy text completions
POST /v1/embeddings           — text embeddings
```

The request body and response body match the OpenAI spec exactly. Relay adds a few optional headers for its own features: `x-relay-cache-ttl` to control caching TTL, `x-relay-fallback-models` to specify per-request fallback preferences, and `x-relay-metadata` to attach arbitrary key-value pairs to the request log.

Internally, Relay translates requests to each provider's native format. Anthropic's API uses a different message structure than OpenAI's. Mistral's streaming format has small differences. The provider router handles all of that. The tenant never sees it.

---

## Choosing a Multi-Tenant Database Architecture

Every multi-tenant SaaS starts with the same decision: how do you separate one customer's data from another's?

There are three approaches.

**Silo** gives each tenant their own database. Complete isolation. Very expensive at scale. Used by regulated industries or large enterprise contracts.

**Bridge** gives each tenant their own schema inside a shared database. `acme.requests`, `beta.requests`. Medium isolation, medium cost.

**Pool** puts everyone in the same tables with a `tenant_id` column on every row. Cheapest, hardest to get right, works for thousands of tenants.

Relay uses the pool model. Here is why: Relay's customers are developers and startups. There will be thousands of them. The data shape is uniform across all tenants (requests, tokens, costs). There is no per-tenant schema customization. And the operational cost of running a database per tenant would make the pricing unworkable.

When I was reading about this, the thing that surprised me was how many production SaaS companies started with silo because it felt "safer", then spent months migrating to pool because the operational cost was unsustainable at scale. The right model depends almost entirely on who your customer is and what you are charging them.

For the very few large enterprise tenants who need data residency or dedicated infrastructure, Relay offers silo as an upgrade. The architecture supports routing them to a separate database via a tenant config flag. But that is the exception, not the default.

```mermaid
graph TD
    subgraph pool["Pool Model"]
        A[acme] --> DB[(relay_db)]
        B[beta] --> DB
        C[gamma] --> DB
        DB --> R1[tenant_id = acme rows]
        DB --> R2[tenant_id = beta rows]
        DB --> R3[tenant_id = gamma rows]
    end

    subgraph silo["Silo - Enterprise"]
        D[bigcorp] --> DB2[(bigcorp_db)]
    end
```

In the pool model, all tenants share one database. Rows are separated by `tenant_id`. In the silo model, a tenant gets a dedicated database. Relay defaults to pool. Enterprise tenants that need dedicated infrastructure are routed to their own database via a flag in `tenant_config`.

---

## System Overview

Before going into each component, here is the full picture of what Relay looks like.

```mermaid
graph TB
    subgraph client["Client"]
        APP[Your Application]
    end

    subgraph relayedge["Relay Edge"]
        GW[API Gateway]
        AUTH[Auth Middleware]
        RL[Rate Limiter]
        CACHE[Cache Layer]
    end

    subgraph relaycore["Relay Core"]
        ROUTER[Provider Router]
        STREAM[Stream Handler]
        LOGGER[Request Logger]
    end

    subgraph providers["Providers"]
        OAI[OpenAI]
        ANT[Anthropic]
        MIS[Mistral]
        GRQ[Groq]
    end

    subgraph data["Data"]
        PG[(Postgres)]
        RD[(Redis)]
    end

    subgraph background["Background"]
        AGG[Usage Aggregator]
        BILL[Billing Worker]
    end

    APP -->|POST /v1/chat/completions| GW
    GW --> AUTH
    AUTH --> RL
    RL --> CACHE
    CACHE -->|miss| ROUTER
    ROUTER --> OAI
    ROUTER --> ANT
    ROUTER --> MIS
    ROUTER --> GRQ
    ROUTER --> STREAM
    STREAM -->|SSE chunks| APP
    STREAM --> LOGGER
    LOGGER --> PG
    LOGGER --> RD
    RD --> AGG
    AGG --> PG
    PG --> BILL
```

Every request passes through auth, rate limiting, and cache check before reaching the provider router. If the cache misses, the router picks a provider, forwards the request, and streams the response back. Background workers handle everything async: logging, usage aggregation, billing, and provider health monitoring.

---

## API Key Authentication and Tenant Resolution

Relay is an API product. There are no browser sessions, no subdomains per tenant, no cookie-based auth for the hot path. Authentication is an API key in the `Authorization` header.

```
Authorization: Bearer rly_live_sk_abc123xyz...
```

The key itself encodes enough information to skip a lookup in most cases, but the real resolution always hits the database (or Redis cache) once per request.

### API key design

Relay uses prefixed, hashed API keys. The key the user sees is never stored. Only its SHA-256 hash is stored.

```
Full key:    rly_live_sk_4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA
Prefix:      rly_live_
Environment: live (vs test)
Random part: 4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA (256 bits, URL-safe base64)
```

The prefix makes it easy to detect keys in source code scanning, error logs, and GitHub secret scanning. Stripe does this. Anthropic does this. It is a good pattern.

```sql
CREATE TABLE api_keys (
    id              TEXT        PRIMARY KEY,
    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
    name            TEXT        NOT NULL,           -- "Production", "Dev", "CI"
    key_prefix      TEXT        NOT NULL,           -- rly_live_ or rly_test_
    key_hash        TEXT        NOT NULL UNIQUE,    -- SHA-256 of the full key
    key_hint        TEXT        NOT NULL,           -- last 4 chars: "...1eA"
    environment     TEXT        NOT NULL DEFAULT 'live',
    status          TEXT        NOT NULL DEFAULT 'active',
    last_used_at    TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_by      TEXT        NOT NULL REFERENCES users(id)
);

CREATE INDEX idx_api_keys_tenant_id ON api_keys(tenant_id);
CREATE INDEX idx_api_keys_key_hash  ON api_keys(key_hash);
```

You never store the full key. When you create a key, you show it to the user once and tell them to copy it. After that, if they lose it, they rotate it. The `key_hint` column (last 4 characters) helps them identify which key is which in the dashboard.

### Test mode

`rly_test_sk_` keys run in test mode. Relay does not forward test requests to real providers. It returns deterministic mocked responses from a fixed response library. This means no provider costs during development, predictable responses in CI/CD pipelines, and no test traffic polluting the live usage dashboard.

Test mode behaves identically to live mode in every other way: rate limits apply, tokens are counted, requests are logged, errors are returned in the same format. The only difference is the backend. When Relay sees a test key, the router skips provider selection entirely and calls a local mock handler instead.

```sql
-- Requests table distinguishes test from live via the api_key's environment
-- api_keys.environment = 'live' | 'test'
-- usage_daily is partitioned by environment so dashboards stay clean
```

Test responses are deterministic based on the request hash. The same input always produces the same mock output. This makes test assertions reliable.

### Auth flow

```mermaid
flowchart LR
    A["Request<br/>Authorization: Bearer rly_live_sk_..."] --> B["Extract key<br/>from header"]
    B --> C["SHA-256 hash<br/>the key"]
    C --> D{"Redis cache<br/>key_hash lookup?"}
    D -->|hit| E["Return cached<br/>tenant + key metadata"]
    D -->|miss| F["Postgres lookup<br/>WHERE key_hash = ?"]
    F --> G{"Found and<br/>status = active?"}
    G -->|no| H[401 Unauthorized]
    G -->|yes| I["Cache in Redis<br/>TTL 5 minutes"]
    I --> J["Inject TenantContext<br/>into request"]
    J --> K[Next middleware]
```

The key is hashed on every request and looked up in Redis first. A cache hit means zero database reads for auth. A miss hits Postgres and populates Redis for the next 5 minutes. The resolved tenant context flows through every subsequent middleware and handler via the request context.

{{< code lang="go" >}}
func APIKeyMiddleware(keys KeyStore, cache Cache) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            raw := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
            if raw == "" {
                writeError(w, 401, "missing api key")
                return
            }

            hash := sha256Hex(raw)

            key, err := cache.GetAPIKey(r.Context(), hash)
            if err != nil {
                key, err = keys.GetByHash(r.Context(), hash)
                if err != nil || key.Status != "active" {
                    writeError(w, 401, "invalid api key")
                    return
                }
                cache.SetAPIKey(r.Context(), hash, key, 5*time.Minute)
            }

            ctx := WithAPIKey(r.Context(), key)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}
{{< /code >}}

---

## Database Schema Design

### Tenants and users

```sql
CREATE TABLE tenants (
    id          TEXT        PRIMARY KEY,             -- ten_abc123
    slug        TEXT        UNIQUE NOT NULL,         -- acme
    name        TEXT        NOT NULL,                -- Acme Inc.
    plan        TEXT        NOT NULL DEFAULT 'free', -- free | starter | pro | enterprise
    status      TEXT        NOT NULL DEFAULT 'trial',-- trial | active | suspended | deleted
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    deleted_at  TIMESTAMPTZ,
    metadata    JSONB
);

CREATE INDEX idx_tenants_slug   ON tenants(slug);
CREATE INDEX idx_tenants_status ON tenants(status);

CREATE TABLE users (
    id          TEXT        PRIMARY KEY,
    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
    email       TEXT        NOT NULL,
    role        TEXT        NOT NULL DEFAULT 'member', -- owner | admin | member | viewer
    status      TEXT        NOT NULL DEFAULT 'active',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    deleted_at  TIMESTAMPTZ,
    UNIQUE(tenant_id, email)
);

CREATE INDEX idx_users_tenant_id ON users(tenant_id);
CREATE INDEX idx_users_email     ON users(email);
```

### Model catalog

Relay maintains a catalog of all supported models with their pricing. This drives cost calculation and provider routing.

```sql
CREATE TABLE providers (
    id          TEXT    PRIMARY KEY,         -- openai | anthropic | mistral | groq
    name        TEXT    NOT NULL,            -- OpenAI
    base_url    TEXT    NOT NULL,            -- https://api.openai.com/v1
    status      TEXT    NOT NULL DEFAULT 'active', -- active | degraded | down
    priority    INT     NOT NULL DEFAULT 100,      -- lower = more preferred
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE models (
    id                  TEXT    PRIMARY KEY,        -- gpt-5.5 | claude-3-5-sonnet-20241022
    provider_id         TEXT    NOT NULL REFERENCES providers(id),
    relay_name          TEXT    UNIQUE NOT NULL,    -- the name tenants use: "gpt-5.5"
    provider_name       TEXT    NOT NULL,           -- name the provider actually accepts
    display_name        TEXT    NOT NULL,           -- GPT-4o
    input_cost_per_1m   NUMERIC NOT NULL,           -- USD per 1M input tokens
    output_cost_per_1m  NUMERIC NOT NULL,           -- USD per 1M output tokens
    context_window      INT     NOT NULL,           -- max tokens in context
    supports_streaming  BOOLEAN NOT NULL DEFAULT true,
    supports_functions  BOOLEAN NOT NULL DEFAULT false,
    supports_vision     BOOLEAN NOT NULL DEFAULT false,
    status              TEXT    NOT NULL DEFAULT 'active',
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_models_provider_id ON models(provider_id);
```

A few rows to make it concrete:

```
id                            provider_id  relay_name           input_cost   output_cost  context_window
gpt-5.5                        openai       gpt-5.5               2.50         10.00        128000
claude-3-5-sonnet-20241022    anthropic    claude-3-5-sonnet    3.00         15.00        200000
mistral-large-2411            mistral      mistral-large        2.00         6.00         131072
llama-3.3-70b-versatile       groq         llama-3.3-70b        0.59         0.79         128000
```

Costs are in USD per 1 million tokens.

### Provider credentials

Relay uses its own API keys to call providers. These are pooled: Relay pays the providers, marks up the cost, and charges tenants.

For enterprise tenants who want to bring their own keys (BYOK) and pay providers directly, Relay supports that too. The `tenant_id` being NULL means the credential belongs to Relay.

Provider credentials are among the most sensitive data Relay stores. A leaked provider key means an attacker can run LLM calls billed to that key. The `key_encrypted` column stores the key encrypted with AES-256-GCM using a data encryption key (DEK) that is itself encrypted by a master key stored in a KMS (AWS KMS or HashiCorp Vault). The plaintext key never touches disk. Relay calls the KMS API to decrypt the DEK at runtime, decrypts the credential in memory, uses it for the provider request, and discards it.

Key rotation works in two steps: the KMS master key is rotated on a schedule (annually at minimum), then a background job re-encrypts every `key_encrypted` value under the new master key. The job runs row by row with checkpointing so it can resume if interrupted.

```sql
CREATE TABLE provider_credentials (
    id              TEXT        PRIMARY KEY,
    tenant_id       TEXT        REFERENCES tenants(id),  -- NULL = Relay's pooled key
    provider_id     TEXT        NOT NULL REFERENCES providers(id),
    key_encrypted   BYTEA       NOT NULL,    -- encrypted with Relay's KMS key
    key_hint        TEXT        NOT NULL,    -- last 4 chars for display
    is_relay_owned  BOOLEAN     NOT NULL DEFAULT true,
    status          TEXT        NOT NULL DEFAULT 'active',
    rpm_limit       INT,                     -- provider's RPM cap for this key
    tpm_limit       INT,                     -- provider's TPM cap for this key
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(tenant_id, provider_id)
);
```

### Requests

Every API call Relay processes gets a row here. This is the source of truth for billing, debugging, and usage dashboards.

```sql
CREATE TABLE requests (
    id              TEXT        PRIMARY KEY,            -- req_abc123
    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
    api_key_id      TEXT        NOT NULL REFERENCES api_keys(id),
    model_id        TEXT        NOT NULL REFERENCES models(id),
    provider_id     TEXT        NOT NULL REFERENCES providers(id),
    credential_id   TEXT        NOT NULL REFERENCES provider_credentials(id),

    -- What kind of request
    request_type    TEXT        NOT NULL,               -- chat | completion | embedding
    stream          BOOLEAN     NOT NULL DEFAULT false,

    -- Token counts (filled in after response)
    input_tokens    INT,
    output_tokens   INT,
    total_tokens    INT,

    -- Cost in USD
    input_cost      NUMERIC(12,8),
    output_cost     NUMERIC(12,8),
    total_cost      NUMERIC(12,8),

    -- Relay's markup (stored separately for accounting)
    markup_rate     NUMERIC(5,4) NOT NULL DEFAULT 0.20, -- 20% markup by default
    relay_revenue   NUMERIC(12,8),

    -- Timing
    started_at      TIMESTAMPTZ NOT NULL,
    first_token_at  TIMESTAMPTZ,       -- for streaming: TTFT
    completed_at    TIMESTAMPTZ,
    latency_ms      INT,
    ttft_ms         INT,               -- time to first token

    -- Outcome
    status          TEXT        NOT NULL, -- success | error | cached | timeout
    error_code      TEXT,
    error_message   TEXT,
    http_status     INT,

    -- Cache
    cache_hit       BOOLEAN     NOT NULL DEFAULT false,
    cache_key       TEXT,

    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_requests_tenant_created  ON requests(tenant_id, created_at DESC);
CREATE INDEX idx_requests_model_created   ON requests(model_id, created_at DESC);
CREATE INDEX idx_requests_api_key_created ON requests(api_key_id, created_at DESC);
CREATE INDEX idx_requests_status          ON requests(status, created_at DESC);
```

The `requests` table grows fast. At 1,000 tenants each making 100 requests per day, that is 100,000 rows per day, 3 million per month. After 90 days you want to start archiving cold rows to cheaper storage (S3 + Parquet). Keep hot rows (last 30 days) in Postgres. Query historical data via a warehouse.

### Daily usage rollup

Querying `requests` for billing is expensive at scale. Aggregate into a daily rollup table that background workers update continuously.

```sql
CREATE TABLE usage_daily (
    tenant_id       TEXT    NOT NULL REFERENCES tenants(id),
    model_id        TEXT    NOT NULL REFERENCES models(id),
    date            DATE    NOT NULL,
    request_count   BIGINT  NOT NULL DEFAULT 0,
    input_tokens    BIGINT  NOT NULL DEFAULT 0,
    output_tokens   BIGINT  NOT NULL DEFAULT 0,
    total_tokens    BIGINT  NOT NULL DEFAULT 0,
    total_cost      NUMERIC(12,4) NOT NULL DEFAULT 0,
    relay_revenue   NUMERIC(12,4) NOT NULL DEFAULT 0,
    cache_hits      BIGINT  NOT NULL DEFAULT 0,
    error_count     BIGINT  NOT NULL DEFAULT 0,
    avg_latency_ms  INT,
    PRIMARY KEY (tenant_id, model_id, date)
);
```

---

## Request Flow: From API Call to Provider and Back

This is the core of Relay. Every API call goes through this pipeline.

```mermaid
sequenceDiagram
    actor App as Your App
    participant GW as Relay Gateway
    participant Auth as Auth Middleware
    participant RL as Rate Limiter
    participant Cache as Cache Layer
    participant Router as Provider Router
    participant OAI as OpenAI

    App->>GW: POST /v1/chat/completions<br/>Authorization: Bearer rly_live_sk_...
    GW->>Auth: validate API key
    Auth->>Auth: SHA-256 hash key<br/>lookup in Redis / Postgres
    Auth-->>GW: TenantContext injected

    GW->>RL: check RPM + TPM limits
    RL-->>GW: ok

    GW->>Cache: check cache<br/>hash(model + messages + params)
    Cache-->>GW: miss

    GW->>Router: route request<br/>model=gpt-5.5
    Router->>Router: lookup model in catalog<br/>select provider + credential
    Router->>OAI: POST api.openai.com/v1/chat/completions<br/>Authorization: Bearer sk-relay-pooled-...

    OAI-->>Router: 200 OK (streaming SSE)
    Router-->>App: pipe SSE chunks

    Router->>Router: count tokens from response<br/>calculate cost

    Router->>GW: log request async
    GW->>GW: INSERT into requests<br/>INCR Redis counters
```

### Model routing and failover

The router resolves which provider to use for a given model name. If the primary provider is down or over its rate limit, it falls back to an alternative.

```mermaid
flowchart TD
    A["Request<br/>model = gpt-5.5"] --> B["Lookup model<br/>in catalog"]
    B --> C["Get providers<br/>ordered by priority"]
    C --> D{"Tenant has<br/>BYOK key?"}
    D -->|yes| E[Use tenant's key]
    D -->|no| F[Use Relay pooled key]
    E --> G{"Provider<br/>healthy?"}
    F --> G
    G -->|yes| H[Forward request]
    G -->|no| I["Next provider<br/>by priority"]
    I --> J{Any left?}
    J -->|yes| G
    J -->|no| K[503 No providers available]
    H -->|5xx from provider| I
    H -->|success| L[Return response]
```

{{< code lang="go" >}}
func (r *Router) Route(ctx context.Context, req *RelayRequest) (*RelayResponse, error) {
    model, err := r.catalog.GetModel(req.Model)
    if err != nil {
        return nil, ErrModelNotFound
    }

    providers := r.getOrderedProviders(ctx, model)
    for _, p := range providers {
        cred, err := r.getCredential(ctx, req.TenantID, p.ID)
        if err != nil {
            continue
        }

        resp, err := r.forward(ctx, p, cred, req)
        if err == nil {
            return resp, nil
        }

        // 5xx from provider: try next
        if isProviderError(err) {
            r.health.MarkDegraded(p.ID)
            continue
        }

        // 4xx from provider: the request itself is bad, do not retry
        return nil, err
    }

    return nil, ErrNoProvidersAvailable
}
{{< /code >}}

### Handling streaming (SSE)

When the tenant sends `"stream": true`, the provider returns a Server-Sent Events stream. Relay proxies it chunk by chunk. This is not a buffered response, it is a live pipe.

```mermaid
sequenceDiagram
    participant App
    participant Relay
    participant Provider

    App->>Relay: POST /v1/chat/completions<br/>stream: true
    Relay->>Provider: POST (stream: true)
    Provider-->>Relay: data: {"delta": {"content": "Hello"}}
    Relay-->>App: data: {"delta": {"content": "Hello"}}
    Provider-->>Relay: data: {"delta": {"content": " world"}}
    Relay-->>App: data: {"delta": {"content": " world"}}
    Provider-->>Relay: data: [DONE]
    Relay-->>App: data: [DONE]
    Relay->>Relay: count tokens from chunks<br/>log request async
```

Relay counts tokens as chunks arrive by tracking the `usage` field that providers send in the final chunk. For providers that do not send usage in streaming mode, Relay uses a local tokenizer (tiktoken for OpenAI models, SentencePiece for others) to estimate counts. The estimate is reconciled against the actual count when available.

Token counting is done on a goroutine that reads from the same SSE stream, separate from the goroutine that proxies chunks to the client. The client sees no added latency.

### Retry logic

Relay retries failed provider requests automatically, but only for the right kind of failures.

**Retryable**: 5xx errors from the provider (server errors), network timeouts, connection resets. These indicate the provider had a problem, not that the request was wrong.

**Not retryable**: 4xx errors from the provider (except 429). A 400 means the request was malformed. A 401 means the credential is wrong. Retrying will not help and wastes time.

**Provider rate limit (429 from provider)**: Relay does not wait and retry on the same key. It immediately switches to the next available provider key or provider. Waiting for a rate limit window to reset would add seconds of latency for the tenant.

Retry policy: up to 3 attempts, 100ms exponential backoff between attempts, each attempt on a different provider or key if possible. If all attempts fail, Relay returns the last error to the tenant.

Relay does not retry on behalf of the tenant after returning an error. If Relay returns a 502 or 503, the tenant's code decides whether to retry at their layer.

### Error response format

Every error Relay returns uses the same JSON structure. Developers should be able to write one error handler and handle all Relay errors with it.

```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded 100 requests per minute on the free plan.",
    "type": "rate_limit_error",
    "details": {
      "limit": 100,
      "window": "per_minute",
      "reset_at": "2026-07-01T12:05:00Z",
      "upgrade_url": "https://relay.lorbic.com/billing"
    }
  }
}
```

The full set of error codes:

| HTTP status | error.code | Meaning |
|---|---|---|
| 401 | `invalid_api_key` | Key not found or revoked |
| 402 | `tenant_suspended` | Account suspended, check billing |
| 402 | `quota_exceeded` | Hard quota limit hit (seats, storage) |
| 404 | `model_not_found` | Requested model does not exist in catalog |
| 422 | `invalid_request` | Request body failed validation |
| 429 | `rate_limit_exceeded` | RPM or TPM limit hit |
| 502 | `provider_error` | Provider returned an error after retries |
| 503 | `no_provider_available` | All providers for this model are degraded |
| 504 | `request_timeout` | Provider did not respond within the timeout |

The `details` field is present on `rate_limit_exceeded` and `quota_exceeded` errors so tenants can surface actionable information to their own users. If a tenant's app is hitting Relay's rate limit, they should know the reset time and the upgrade path, not just a generic "too many requests."

---

## Rate Limiting

Relay has to manage rate limits at two levels.

**Inbound**: protecting Relay's own infrastructure from tenants who send too many requests.

**Outbound**: protecting Relay's provider API keys from hitting the provider's limits.

Both use Redis counters with sliding windows.

The dual-layer rate limiting was the part I found most interesting while designing this. Most rate limiting content covers the inbound side. The outbound side is specific to API gateway products but just as important. If Relay's OpenAI key hits OpenAI's TPM limit because of one large tenant, every other tenant on that key starts getting 429s from the provider. The router has to detect this and re-route before it becomes the tenant's problem.

### Inbound tenant limits

```sql
-- Plan-level defaults
CREATE TABLE quota_definitions (
    plan         TEXT    NOT NULL,
    resource     TEXT    NOT NULL,   -- rpm | tpm | requests_per_day
    limit_value  BIGINT  NOT NULL,   -- -1 = unlimited
    window       TEXT    NOT NULL,   -- per_minute | per_day
    UNIQUE(plan, resource, window)
);

-- Sample values:
-- plan      resource          limit     window
-- free      rpm               60        per_minute
-- free      tpm               40000     per_minute
-- free      requests_per_day  500       per_day
-- starter   rpm               500       per_minute
-- starter   tpm               400000    per_minute
-- pro       rpm               3000      per_minute
-- pro       tpm               2000000   per_minute
-- pro       requests_per_day  -1        per_day (unlimited)

-- Per-tenant overrides (for enterprise deals or support credits)
CREATE TABLE quota_overrides (
    id           TEXT        PRIMARY KEY,
    tenant_id    TEXT        NOT NULL REFERENCES tenants(id),
    resource     TEXT        NOT NULL,
    limit_value  BIGINT      NOT NULL,
    window       TEXT        NOT NULL,
    reason       TEXT,
    expires_at   TIMESTAMPTZ,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(tenant_id, resource, window)
);
```

Redis key structure for inbound limits:

```
rl:in:ten_abc123:rpm   → sliding window counter, 60s window
rl:in:ten_abc123:tpm   → sliding window counter, 60s window
```

### Outbound provider limits

Each of Relay's provider API keys has its own RPM and TPM cap set by the provider. Relay tracks these in Redis too.

```
rl:out:openai:key_xyz:rpm   → counter, 60s window
rl:out:openai:key_xyz:tpm   → counter, 60s window
```

When a pooled key is close to its provider limit, the router skips it and picks the next available key or provider.

### The rate limit pipeline

```mermaid
flowchart LR
    A[Request] --> B{"Inbound tenant<br/>RPM check"}
    B -->|over| C["429 Too Many Requests<br/>retry-after header"]
    B -->|ok| D{"Inbound tenant<br/>TPM check"}
    D -->|over| C
    D -->|ok| E[Route to provider]
    E --> F{"Outbound key<br/>RPM check"}
    F -->|over| G["Pick next key<br/>or provider"]
    F -->|ok| H[Forward to provider]
    G --> F
```

A request fails fast at inbound RPM or TPM before any provider call is made. The outbound check happens inside the router, per-key. If a key is saturated, the router picks the next key or the next provider, not the next retry window.

The TPM check on inbound traffic uses an estimated token count from the request body (input tokens only). The actual token count (input + output) is only known after the response. Relay does a conservative pre-check against the estimated input, then reconciles and updates the TPM counter after the response.

---

## Token Cost Tracking and Billing

Relay's pricing model: Relay pays the provider at published rates, marks up by 20%, and charges tenants at the marked-up rate. The markup funds Relay's infrastructure and margin.

```
Tenant sends:  1,000 input tokens, 500 output tokens on gpt-5.5
Provider cost: (1000 / 1M) × $2.50 + (500 / 1M) × $10.00 = $0.00250 + $0.00500 = $0.00750
Relay markup:  $0.00750 × 1.20 = $0.00900
Tenant billed: $0.00900
Relay revenue: $0.00150
```

Every request row in the `requests` table stores both the raw provider cost and the Relay revenue. This makes accounting and audits straightforward.

### Cost calculation

```mermaid
flowchart LR
    A["Response received<br/>input_tokens=1000<br/>output_tokens=500"] --> B["Lookup model pricing<br/>input_cost_per_1m<br/>output_cost_per_1m"]
    B --> C["input_cost = tokens / 1M x rate<br/>output_cost = tokens / 1M x rate"]
    C --> D[total_cost = input + output]
    D --> E[relay_revenue = total_cost x markup_rate]
    E --> F[INSERT into requests]
    F --> G[INCR usage_daily rollup]
```

### Billing cycle

Relay charges tenants monthly via Stripe. The billing worker runs at the start of each month, reads the `usage_daily` table for the prior month, and creates a Stripe invoice line item per model per tenant.

```sql
CREATE TABLE billing_periods (
    id          TEXT        PRIMARY KEY,
    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
    period_start DATE       NOT NULL,
    period_end   DATE       NOT NULL,
    status      TEXT        NOT NULL DEFAULT 'open',  -- open | finalized | paid | failed
    total_cost  NUMERIC(12,4),
    invoice_id  TEXT,              -- Stripe invoice ID
    finalized_at TIMESTAMPTZ,
    paid_at     TIMESTAMPTZ,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(tenant_id, period_start)
);

CREATE TABLE billing_line_items (
    id              TEXT        PRIMARY KEY,
    billing_period_id TEXT      NOT NULL REFERENCES billing_periods(id),
    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
    model_id        TEXT        NOT NULL REFERENCES models(id),
    request_count   BIGINT      NOT NULL,
    input_tokens    BIGINT      NOT NULL,
    output_tokens   BIGINT      NOT NULL,
    total_cost      NUMERIC(12,4) NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(billing_period_id, model_id)
);
```

> Billing is an entire system design on its own: Stripe subscription lifecycle, idempotent webhook handling, retry logic for failed payments, proration, credits, and tax calculation. This post covers the data model. The billing pipeline gets its own post.

---

## Response Caching: Exact Match and Semantic

Caching completions saves money. If 1,000 tenants call the same model with the same system prompt and similar questions, Relay can serve many of those from cache instead of paying the provider.

Relay supports two cache strategies.

### Exact match caching

The simplest form. Hash the full request (model + messages + temperature + all params). If you have seen this exact request before, return the cached response.

```
cache_key = SHA-256(
    model_id +
    JSON.stringify(messages) +
    temperature + top_p + max_tokens + ...
)
```

Stored in Redis with a configurable TTL. Default 1 hour. Tenants can set their own TTL per API call via a `x-relay-cache-ttl` header.

### Semantic caching

Exact match only works for identical requests. Semantic caching uses embeddings to find requests that are similar enough to return the same response.

```mermaid
flowchart TD
    A[New request] --> B["Embed the user message<br/>using a small model"]
    B --> C["Vector similarity search<br/>against cached embeddings"]
    C --> D{"Similarity score<br/>> threshold?"}
    D -->|yes| E["Return cached response<br/>cache_hit = semantic"]
    D -->|no| F["Forward to provider<br/>cache the response + embedding"]
```

The embedding model for cache lookup should be cheap and fast (text-embedding-3-small or similar). The cost of generating the lookup embedding is tiny compared to the LLM call it might save.

```sql
CREATE TABLE response_cache (
    id              TEXT        PRIMARY KEY,
    tenant_id       TEXT        NOT NULL REFERENCES tenants(id),
    cache_key       TEXT        NOT NULL,         -- exact match hash
    model_id        TEXT        NOT NULL REFERENCES models(id),
    request_hash    TEXT        NOT NULL,
    response_body   JSONB       NOT NULL,
    input_tokens    INT         NOT NULL,
    output_tokens   INT         NOT NULL,
    embedding       vector(1536),                 -- pgvector, for semantic cache
    hit_count       INT         NOT NULL DEFAULT 0,
    expires_at      TIMESTAMPTZ NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_response_cache_tenant_key ON response_cache(tenant_id, cache_key);
CREATE INDEX idx_response_cache_embedding  ON response_cache
    USING hnsw (embedding vector_cosine_ops)
    WHERE embedding IS NOT NULL;
```

---

## Observability

You cannot operate a multi-tenant API platform without per-tenant observability. When a tenant files a support ticket saying "your API is slow," you need to be able to pull up exactly their requests, their latencies, and their errors in under a minute.

### What to track per request

Every request log entry should have:

- `tenant_id` and `api_key_id`
- `model_id` and `provider_id`
- `latency_ms` and `ttft_ms` (time to first token, the metric tenants care about for streaming)
- `input_tokens`, `output_tokens`, `total_cost`
- `status` and `error_code`
- `cache_hit`

These are already in the `requests` table. Make sure every log line, trace span, and metric tag also carries `tenant_id`.

### Structured logging

{{< code lang="go" >}}
func RequestLoggingMiddleware(logger *slog.Logger) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            key := APIKeyFromContext(r.Context())
            reqLogger := logger.With(
                slog.String("tenant_id", key.TenantID),
                slog.String("api_key_id", key.ID),
                slog.String("request_id", newRequestID()),
            )
            ctx := WithLogger(r.Context(), reqLogger)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}
{{< /code >}}

### Health status per provider

Relay's router uses a health tracker to avoid sending requests to degraded providers. The tracker works on a circuit breaker pattern: after N consecutive errors from a provider, mark it degraded and stop routing to it until it recovers.

```sql
CREATE TABLE provider_health_events (
    id          TEXT        PRIMARY KEY,
    provider_id TEXT        NOT NULL REFERENCES providers(id),
    status      TEXT        NOT NULL,       -- healthy | degraded | down
    error_rate  NUMERIC,                   -- 5-minute error rate at time of event
    p95_ms      INT,                       -- p95 latency at time of event
    message     TEXT,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_provider_health_events ON provider_health_events(provider_id, created_at DESC);
```

The provider status page (status.relay.lorbic.com) reads from this table. Tenants can subscribe to status updates.

### Noisy neighbor detection

In the pool model, one tenant running an aggressive batch job can saturate Relay's provider keys and cause latency for everyone else. Detect this by tracking per-tenant request volume in Redis with a 1-minute rolling window. If a single tenant exceeds a percentage threshold of total traffic, flag them and apply stricter rate limits automatically.

---

## Role-Based Access Control

API calls use API key auth. The Relay dashboard (where tenants manage keys, view usage, configure settings) uses JWT auth.

```json
{
  "iss": "relay.lorbic.com",
  "sub": "usr_abc123",
  "tenant_id": "ten_xyz789",
  "tenant_slug": "acme",
  "role": "admin",
  "permissions": ["api_keys:read", "api_keys:write", "billing:read", "usage:read"],
  "iat": 1751400000,
  "exp": 1751500000
}
```

### RBAC tables

```sql
CREATE TABLE roles (
    id          TEXT        PRIMARY KEY,
    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
    name        TEXT        NOT NULL,       -- owner | admin | developer | viewer
    is_system   BOOLEAN     NOT NULL DEFAULT false,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(tenant_id, name)
);

CREATE TABLE permissions (
    id          TEXT PRIMARY KEY,
    resource    TEXT NOT NULL,              -- api_keys | usage | billing | members | settings
    action      TEXT NOT NULL,             -- create | read | update | delete
    UNIQUE(resource, action)
);

CREATE TABLE role_permissions (
    role_id       TEXT NOT NULL REFERENCES roles(id),
    permission_id TEXT NOT NULL REFERENCES permissions(id),
    PRIMARY KEY (role_id, permission_id)
);

CREATE TABLE user_roles (
    user_id   TEXT NOT NULL REFERENCES users(id),
    role_id   TEXT NOT NULL REFERENCES roles(id),
    PRIMARY KEY (user_id, role_id)
);
```

### Audit log

Every sensitive action in the dashboard gets recorded: key creation, key revocation, plan changes, member removal, billing configuration.

```sql
CREATE TABLE audit_log (
    id            TEXT        PRIMARY KEY,
    tenant_id     TEXT        REFERENCES tenants(id),
    actor_id      TEXT        NOT NULL,
    actor_type    TEXT        NOT NULL,    -- user | platform_admin | system
    action        TEXT        NOT NULL,    -- api_key.created | api_key.revoked | member.removed
    resource_id   TEXT,
    resource_type TEXT,
    metadata      JSONB,
    ip_address    INET,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_audit_log_tenant   ON audit_log(tenant_id, created_at DESC);
CREATE INDEX idx_audit_log_actor    ON audit_log(actor_id, created_at DESC);
```

---

## Tenant Provisioning: From Signup to First API Call

### What happens when someone signs up

When a user submits the signup form, Relay does the minimum work synchronously: insert a tenant row with `status=PROVISIONING`, create the owner user record, and return a 201. The browser redirects immediately. Everything else runs in the background.

The background worker picks up the job and does the slower parts: seeding system roles, creating a Stripe customer record, and flipping the tenant to `TRIAL`. The Stripe customer is created at signup, not when the user adds a payment method. This is intentional. You need a Stripe customer ID to attach payment methods, subscriptions, and invoices later. Creating it lazily (at payment time) means every billing code path has to handle the case where the ID does not exist yet, which causes bugs.

The welcome email includes the tenant's first API key. The key is generated by the worker, not the API handler, because key generation requires the tenant record to be fully set up.

If the worker fails partway through, the tenant stays in `PROVISIONING`. A monitoring job scans for tenants stuck in `PROVISIONING` for more than 5 minutes and either retries the job or pages on-call.

```mermaid
sequenceDiagram
    actor User
    participant API
    participant Queue
    participant Worker
    participant DB
    participant Stripe

    User->>API: POST /auth/signup<br/>{ email, password, org_name }
    API->>DB: INSERT tenant (status=PROVISIONING)
    API->>DB: INSERT user (role=owner)
    API->>Queue: enqueue ProvisionTenant job
    API-->>User: 201 Created<br/>{ tenant_id, redirect: /onboarding }

    Worker->>Queue: dequeue job
    Worker->>DB: seed default roles (owner, admin, developer, viewer)
    Worker->>DB: seed plan quota definitions
    Worker->>Stripe: create Stripe customer
    Worker->>DB: UPDATE tenant SET status=TRIAL, stripe_customer_id=cus_...
    Worker-->>User: send welcome email with first API key
```

### Tenant status transitions

The `status` column on the `tenants` table is the single source of truth for what a tenant can do. Every API request, every dashboard login, and every background job checks this column before doing anything.

**PROVISIONING**: The tenant record exists but setup is not complete. API calls are rejected with a 503. The dashboard shows a loading state.

**TRIAL**: Setup is complete. The tenant has full API access up to the trial plan's rate limits. A trial expiry timestamp lives in `tenant_config`. When it passes without a payment method being added, a background job moves the tenant to `EXPIRED`.

**ACTIVE**: A payment method is on file and billing is working. Full access based on the tenant's plan.

**SUSPENDED**: Payment failed, or a Relay admin manually suspended the account. API calls return a 402 with a clear message and a link to the billing page. The tenant's data is untouched. If payment resolves within the grace period, the account returns to `ACTIVE` with no data loss. This is intentional: you want tenants to fix their billing, not lose access to everything they built.

**EXPIRED**: The trial ended without a payment method being added. Same behavior as `SUSPENDED` for API calls. After N days with no action, the tenant moves to `DELETED`.

**DELETED**: Soft delete. The row stays in the database with `deleted_at` set, but all API calls are rejected. Data is retained for 30 days for recovery or dispute resolution, then a GDPR-compliant hard deletion job removes all tenant rows across every table.

Status transitions are intentionally one-directional except for `SUSPENDED` back to `ACTIVE`. You cannot un-delete a tenant. You cannot go from `ACTIVE` back to `TRIAL`. These constraints are enforced at the application layer via explicit state machine logic, not just database constraints.

```mermaid
stateDiagram-v2
    [*] --> PROVISIONING : signup
    PROVISIONING --> TRIAL : provisioning complete
    TRIAL --> ACTIVE : payment method added
    TRIAL --> EXPIRED : trial period ends
    ACTIVE --> SUSPENDED : payment failed or admin action
    SUSPENDED --> ACTIVE : payment resolved
    SUSPENDED --> DELETED : grace period expires
    EXPIRED --> DELETED : no conversion
    ACTIVE --> DELETED : user cancels
    DELETED --> [*]
```

Suspended tenants get a 429 response with a billing error message on every API call. This is important: the error message should tell them what is wrong and link them to the billing page. A silent 401 causes confusion.

---

## Feature Flags and Per-Tenant Configuration

```sql
CREATE TABLE plan_features (
    plan     TEXT    NOT NULL,
    feature  TEXT    NOT NULL,    -- semantic_cache | byok | sso | audit_log | higher_rpm | webhooks
    enabled  BOOLEAN NOT NULL DEFAULT true,
    PRIMARY KEY (plan, feature)
);

CREATE TABLE tenant_feature_overrides (
    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
    feature     TEXT        NOT NULL,
    enabled     BOOLEAN     NOT NULL,
    reason      TEXT,
    expires_at  TIMESTAMPTZ,
    PRIMARY KEY (tenant_id, feature)
);

CREATE TABLE tenant_config (
    tenant_id   TEXT        NOT NULL REFERENCES tenants(id),
    key         TEXT        NOT NULL,
    value       TEXT        NOT NULL,
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, key)
);

-- Example config rows:
-- tenant_id   key                      value
-- ten_abc     default_cache_ttl_sec    3600
-- ten_abc     markup_rate              0.15     (negotiated lower markup for volume)
-- ten_abc     allowed_models           ["gpt-5.5","claude-3-5-sonnet"]
-- ten_abc     webhook_url              https://acme.com/hooks/relay
```

Cache the feature flag state per tenant in Redis (60s TTL). Do not query Postgres on every API call.

---

## Full Architecture

```mermaid
graph TB
    subgraph tenants["Tenants"]
        APP1[App: acme]
        APP2[App: beta]
        DASH[Dashboard]
    end

    subgraph edge["Edge"]
        DNS[Wildcard DNS]
        LB[Load Balancer]
    end

    subgraph gateway["Gateway"]
        GW[API Gateway]
        AUTH[API Key Auth]
        RL_IN[Rate Limiter]
        CACHE[Cache Layer]
        ROUTER[Provider Router]
        STREAM[SSE Handler]
    end

    subgraph dashapi["Dashboard API"]
        DASH_API[Dashboard API]
    end

    subgraph providers["Providers"]
        OAI[OpenAI]
        ANT[Anthropic]
        MIS[Mistral]
        GRQ[Groq]
    end

    subgraph data["Data"]
        PG[(Postgres)]
        RD[(Redis)]
        S3[(S3)]
    end

    subgraph workers["Workers"]
        LOGGER[Request Logger]
        AGG[Usage Aggregator]
        BILL[Billing Worker]
        HEALTH[Health Monitor]
        ARCHIVE[Request Archiver]
    end

    APP1 --> DNS
    APP2 --> DNS
    DASH --> DNS
    DNS --> LB
    LB --> GW
    LB --> DASH_API

    GW --> AUTH
    AUTH --> RL_IN
    RL_IN --> CACHE
    CACHE -->|miss| ROUTER
    ROUTER --> OAI
    ROUTER --> ANT
    ROUTER --> MIS
    ROUTER --> GRQ
    ROUTER --> STREAM
    STREAM --> LOGGER

    DASH_API --> PG
    DASH_API --> RD

    LOGGER --> PG
    LOGGER --> RD

    RD --> AGG
    AGG --> PG
    PG --> BILL
    PG --> ARCHIVE
    ARCHIVE --> S3
    HEALTH --> PG
    HEALTH --> RD
```

The gateway layer is stateless. Tenant apps hit the load balancer, which routes to any gateway pod. All shared state (rate limits, cache, feature flags) lives in Redis. All durable state (requests, usage, billing, config) lives in Postgres. Workers run separately and never block the hot path.

---

## Scaling a Multi-Tenant API Gateway

Relay is stateless at the gateway layer. You can add more gateway pods behind the load balancer without coordination. The state lives in Postgres and Redis.

**Redis** is the first bottleneck. Rate limit counters and cache lookups hit Redis on every request. Run Redis in cluster mode with at least 3 shards. Shard by `tenant_id` so one tenant's traffic does not create hot keys for others.

**Postgres** becomes the bottleneck as the `requests` table grows. Partition it by `created_at` (monthly partitions). Drop or archive partitions older than your retention window. Use PgBouncer in transaction mode for the gateway (short, fast INSERTs) and session mode for the background workers (longer transactions, aggregation queries).

**Provider key rotation**: as Relay grows, a single OpenAI API key will hit TPM limits. Relay maintains a pool of provider keys per provider and distributes load across them using the outbound rate limit counters in Redis. The router picks the key with the most remaining capacity in the current window.

---

## FAQ

**Why store the hash of an API key instead of the key itself?**

If your database is compromised, the attacker gets hashes, not keys. A raw API key in the database is a plaintext credential. SHA-256 is one-way: you verify a key by hashing the incoming value and comparing it against the stored hash, but you cannot reverse a hash back into the original key. Treat API keys the same way you treat passwords.

**How do you prevent a missing `WHERE tenant_id` clause from leaking data across tenants?**

Row-Level Security in Postgres. You attach a policy to every tenant-scoped table that restricts all queries to rows matching `app.current_tenant_id`, a session variable the application sets at the start of every request. A bare `SELECT * FROM requests` with no WHERE clause still returns only that tenant's rows. The policy runs inside the database engine, below the application layer, so a bug in application code cannot bypass it.

**What happens if OpenAI goes down?**

The router tries providers in priority order. If OpenAI returns a 5xx, it marks OpenAI as degraded and tries the next provider that supports the requested model. If no provider is available, the request returns 503. The health monitor runs on a short interval and flips providers back to healthy once they start returning 2xx responses again. Tenants see a brief error window during a full provider outage, not an extended one.

**How do you count tokens accurately for billing when streaming does not always return usage data?**

Providers that support `stream_options: { include_usage: true }` (OpenAI does) include a `usage` object in the final SSE chunk. Relay reads those counts directly. For providers that do not, Relay runs a local tokenizer (tiktoken for OpenAI-compatible models, SentencePiece for others) to estimate counts. The estimate is within 1-2% of the actual count in practice. When there is a discrepancy, Relay bills the lower number rather than over-billing the tenant.

**How do you handle rate limiting if Redis goes down?**

Redis Cluster with at least 3 nodes and automatic failover handles most availability scenarios. If the entire cluster becomes unreachable, Relay fails open: it allows requests through rather than rejecting all traffic. Losing rate limit enforcement for a short period during a Redis failure is an acceptable trade-off. Taking down the entire API is not.

**What is the noisy neighbor problem and how does it show up in a pool model SaaS?**

One tenant runs a large batch job and consumes a disproportionate share of shared infrastructure: database connections, query time, provider key capacity. This increases latency for every other tenant on the same cluster without them doing anything wrong. Relay tracks per-tenant request volume in Redis with a 1-minute rolling window. Tenants that exceed a share threshold get their rate limits tightened automatically. If the pattern persists, they are a candidate for migration to a silo.

**How do you migrate a tenant from the pool model to a silo without downtime?**

Postgres logical replication. Set up a replication slot on the shared database filtered to the tenant's rows. Let the silo database catch up to near-real-time lag. Then pause writes for that tenant briefly (a few seconds), apply the final delta, update the routing config to point new requests at the silo, and release the write pause. The tenant sees a brief stall, not a maintenance window.

**Why keep billing records in separate tables instead of querying the `requests` table at invoice time?**

The `requests` table gets very large and is archived after 30-90 days to keep query performance acceptable. Billing records need to be retained for years for tax, legal, and dispute reasons. Separating `billing_line_items` and `usage_daily` from raw request logs means you can archive or drop old request partitions without losing billing history. The rollup also makes monthly invoice generation a fast aggregation query rather than a multi-billion-row scan.

**What does BYOK mean and why would an enterprise tenant want it?**

Bring Your Own Keys. The tenant provides their own OpenAI or Anthropic API keys instead of Relay's pooled ones. They pay the provider directly at published rates, and Relay charges a flat platform fee instead of a per-token markup. Enterprises want BYOK for three reasons: cost control at high volume, data privacy (their traffic flows under their own API key, not Relay's), and compliance requirements that restrict which credentials can touch their data.

**Why use prefixed API keys like `rly_live_sk_`?**

Three reasons. First, the prefix makes keys identifiable in logs, error messages, and accidental source code commits. Second, secret scanning tools (GitHub, trufflehog, Doppler) can be configured to detect your specific prefix and alert on leaks before damage is done. Third, the environment segment (`live` vs `test`) lets Relay reject live keys on test endpoints and test keys on live endpoints at the routing layer, before any business logic runs.

**How does semantic caching differ from exact match caching, and what can go wrong?**

Exact match caching hashes the full request and returns a cached response only if the exact same request was made before. Hit rates are low. Semantic caching embeds the user message into a vector and finds cached responses from requests that are semantically similar. Hit rates are higher, but there is a risk: two questions that are phrased similarly but have meaningfully different answers could get the same cached response. The similarity threshold controls this trade-off. Set it too high and you get wrong answers. Set it too low and cache hits are rare. Semantic caching should never be enabled for requests where correctness is critical (medical, legal, financial).

**What happens to a tenant's data when their account is deleted?**

Relay uses a soft delete: the tenant row gets `deleted_at` set and `status` set to `DELETED`. All API access is immediately rejected. The underlying data stays in the database for 30 days. During that window, a tenant can contact support to recover their account. After 30 days, a scheduled deletion job hard-deletes all rows across every table where `tenant_id` matches, in dependency order (child tables first, then the tenant row). This satisfies GDPR Article 17 (right to erasure) and gives you an audit trail during the retention window.

---

## Things Worth Their Own Post

- **Billing pipeline in depth**: Stripe subscription webhooks, idempotent event processing, dunning logic, credit notes, usage-based invoice generation.
- **Semantic caching**: embedding model selection, similarity thresholds, cache invalidation strategy, cost-benefit analysis.
- **BYOK (Bring Your Own Keys)**: encrypting and storing tenant provider credentials, key rotation, audit trail for credential usage.
- **Webhooks**: delivering request completion events to tenants, retry logic, failure handling, signing payloads.
- **Provider health monitoring**: circuit breaker implementation, alerting on degradation, automatic failover testing.
- **Request archival**: streaming from Postgres to Parquet on S3, querying historical data with DuckDB or Athena.
- **Multiregion**: running Relay in US, EU, AP with tenant routing to nearest region, cross-region cost aggregation.

