For the past month I have been reading about multi-tenant SaaS architecture. Not as an academic exercise. I kept running into the same questions on every project I looked at: where exactly does tenant data go, how do you stop one customer’s bug from becoming every customer’s problem, how does billing actually work at the database level. The blog posts I found were either too abstract or skipped the hard parts entirely.
So I designed one from scratch as a system design exercise.
We are going to design a multi-tenant AI API gateway called Relay (relay.lorbic.com).
Relay does one thing: it sits between your application and every major LLM provider. You send a single API call to Relay. Relay figures out which provider to use, forwards the request, streams the response back, counts the tokens, logs the cost, and bills you at the end of the month. You never manage provider API keys in your application. You never write a retry loop for provider outages. You just call Relay.
1your app
2 → POST relay.lorbic.com/v1/chat/completions
3 { model: "gpt-5.5", messages: [...] }
4 → Relay validates your key, checks limits, routes to OpenAI
5 → OpenAI responds
6 → Relay streams the response back, logs the tokens, records the costThat is the product. Now let us design the system that makes it work.
What is Multi-Tenant SaaS
Multi-tenancy means one running instance of your software serves many separate customers. Each customer is a tenant. Every tenant has their own data, their own users, their own configuration. But they all share the same servers, the same codebase, and usually the same database.
The word that matters here is isolation. Tenant A must not see tenant B’s data. Tenant A must not slow down tenant B’s requests. Tenant A must not exhaust resources that tenant B is counting on. Isolation is not just a nice property. It is the entire contract with your customers.
In B2B SaaS, a tenant is a company. In B2C, it can be an individual user. For Relay, tenants are engineering teams: a startup that signs up, creates API keys, and calls the Relay API inside their own product. Their end users never know Relay exists.
Three concepts come up everywhere in this design. It is worth defining them upfront before the schemas start:
- Tenant: the customer unit. One company = one tenant. Has a
tenant_idthat appears on every row of data they own. - User: a person inside a tenant. One tenant can have many users. A user belongs to exactly one tenant at a time (though the same email address can have accounts in multiple tenants).
- Plan: the subscription tier (free, starter, pro, enterprise). Determines what features are enabled and what rate limits apply.
API Compatibility
Relay uses the OpenAI API format. The same request shape, the same response shape, the same streaming protocol. This is not an accident. It means a developer can point their existing OpenAI SDK at Relay by changing one line.
# Before: calling OpenAI directly
client = OpenAI(api_key="sk-...")
# After: calling Relay, zero other changes
client = OpenAI(
api_key="rly_live_sk_4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA",
base_url="https://relay.lorbic.com/v1"
)
response = client.chat.completions.create(
model="gpt-5.5",
messages=[{"role": "user", "content": "Hello"}]
)
Relay exposes three endpoints:
1POST /v1/chat/completions — chat completions (streaming and non-streaming)
2POST /v1/completions — legacy text completions
3POST /v1/embeddings — text embeddingsThe request body and response body match the OpenAI spec exactly. Relay adds a few optional headers for its own features: x-relay-cache-ttl to control caching TTL, x-relay-fallback-models to specify per-request fallback preferences, and x-relay-metadata to attach arbitrary key-value pairs to the request log.
Internally, Relay translates requests to each provider’s native format. Anthropic’s API uses a different message structure than OpenAI’s. Mistral’s streaming format has small differences. The provider router handles all of that. The tenant never sees it.
Choosing a Multi-Tenant Database Architecture
Every multi-tenant SaaS starts with the same decision: how do you separate one customer’s data from another’s?
There are three approaches.
Silo gives each tenant their own database. Complete isolation. Very expensive at scale. Used by regulated industries or large enterprise contracts.
Bridge gives each tenant their own schema inside a shared database. acme.requests, beta.requests. Medium isolation, medium cost.
Pool puts everyone in the same tables with a tenant_id column on every row. Cheapest, hardest to get right, works for thousands of tenants.
Relay uses the pool model. Here is why: Relay’s customers are developers and startups. There will be thousands of them. The data shape is uniform across all tenants (requests, tokens, costs). There is no per-tenant schema customization. And the operational cost of running a database per tenant would make the pricing unworkable.
When I was reading about this, the thing that surprised me was how many production SaaS companies started with silo because it felt “safer”, then spent months migrating to pool because the operational cost was unsustainable at scale. The right model depends almost entirely on who your customer is and what you are charging them.
For the very few large enterprise tenants who need data residency or dedicated infrastructure, Relay offers silo as an upgrade. The architecture supports routing them to a separate database via a tenant config flag. But that is the exception, not the default.
In the pool model, all tenants share one database. Rows are separated by tenant_id. In the silo model, a tenant gets a dedicated database. Relay defaults to pool. Enterprise tenants that need dedicated infrastructure are routed to their own database via a flag in tenant_config.
System Overview
Before going into each component, here is the full picture of what Relay looks like.
Every request passes through auth, rate limiting, and cache check before reaching the provider router. If the cache misses, the router picks a provider, forwards the request, and streams the response back. Background workers handle everything async: logging, usage aggregation, billing, and provider health monitoring.
API Key Authentication and Tenant Resolution
Relay is an API product. There are no browser sessions, no subdomains per tenant, no cookie-based auth for the hot path. Authentication is an API key in the Authorization header.
1Authorization: Bearer rly_live_sk_abc123xyz...The key itself encodes enough information to skip a lookup in most cases, but the real resolution always hits the database (or Redis cache) once per request.
API key design
Relay uses prefixed, hashed API keys. The key the user sees is never stored. Only its SHA-256 hash is stored.
1Full key: rly_live_sk_4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA
2Prefix: rly_live_
3Environment: live (vs test)
4Random part: 4xKp9mN2qR8vL3wB7tJ6cD0hF5sG1eA (256 bits, URL-safe base64)The prefix makes it easy to detect keys in source code scanning, error logs, and GitHub secret scanning. Stripe does this. Anthropic does this. It is a good pattern.
1CREATE TABLE api_keys (
2 id TEXT PRIMARY KEY,
3 tenant_id TEXT NOT NULL REFERENCES tenants(id),
4 name TEXT NOT NULL, -- "Production", "Dev", "CI"
5 key_prefix TEXT NOT NULL, -- rly_live_ or rly_test_
6 key_hash TEXT NOT NULL UNIQUE, -- SHA-256 of the full key
7 key_hint TEXT NOT NULL, -- last 4 chars: "...1eA"
8 environment TEXT NOT NULL DEFAULT 'live',
9 status TEXT NOT NULL DEFAULT 'active',
10 last_used_at TIMESTAMPTZ,
11 expires_at TIMESTAMPTZ,
12 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
13 created_by TEXT NOT NULL REFERENCES users(id)
14);
15
16CREATE INDEX idx_api_keys_tenant_id ON api_keys(tenant_id);
17CREATE INDEX idx_api_keys_key_hash ON api_keys(key_hash);You never store the full key. When you create a key, you show it to the user once and tell them to copy it. After that, if they lose it, they rotate it. The key_hint column (last 4 characters) helps them identify which key is which in the dashboard.
Test mode
rly_test_sk_ keys run in test mode. Relay does not forward test requests to real providers. It returns deterministic mocked responses from a fixed response library. This means no provider costs during development, predictable responses in CI/CD pipelines, and no test traffic polluting the live usage dashboard.
Test mode behaves identically to live mode in every other way: rate limits apply, tokens are counted, requests are logged, errors are returned in the same format. The only difference is the backend. When Relay sees a test key, the router skips provider selection entirely and calls a local mock handler instead.
1-- Requests table distinguishes test from live via the api_key's environment
2-- api_keys.environment = 'live' | 'test'
3-- usage_daily is partitioned by environment so dashboards stay cleanTest responses are deterministic based on the request hash. The same input always produces the same mock output. This makes test assertions reliable.
Auth flow
Authorization: Bearer rly_live_sk_..."] --> B["Extract key
from header"] B --> C["SHA-256 hash
the key"] C --> D{"Redis cache
key_hash lookup?"} D -->|hit| E["Return cached
tenant + key metadata"] D -->|miss| F["Postgres lookup
WHERE key_hash = ?"] F --> G{"Found and
status = active?"} G -->|no| H[401 Unauthorized] G -->|yes| I["Cache in Redis
TTL 5 minutes"] I --> J["Inject TenantContext
into request"] J --> K[Next middleware]
The key is hashed on every request and looked up in Redis first. A cache hit means zero database reads for auth. A miss hits Postgres and populates Redis for the next 5 minutes. The resolved tenant context flows through every subsequent middleware and handler via the request context.
func APIKeyMiddleware(keys KeyStore, cache Cache) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
raw := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
if raw == "" {
writeError(w, 401, "missing api key")
return
}
hash := sha256Hex(raw)
key, err := cache.GetAPIKey(r.Context(), hash)
if err != nil {
key, err = keys.GetByHash(r.Context(), hash)
if err != nil || key.Status != "active" {
writeError(w, 401, "invalid api key")
return
}
cache.SetAPIKey(r.Context(), hash, key, 5*time.Minute)
}
ctx := WithAPIKey(r.Context(), key)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
}
Database Schema Design
Tenants and users
1CREATE TABLE tenants (
2 id TEXT PRIMARY KEY, -- ten_abc123
3 slug TEXT UNIQUE NOT NULL, -- acme
4 name TEXT NOT NULL, -- Acme Inc.
5 plan TEXT NOT NULL DEFAULT 'free', -- free | starter | pro | enterprise
6 status TEXT NOT NULL DEFAULT 'trial',-- trial | active | suspended | deleted
7 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
8 updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
9 deleted_at TIMESTAMPTZ,
10 metadata JSONB
11);
12
13CREATE INDEX idx_tenants_slug ON tenants(slug);
14CREATE INDEX idx_tenants_status ON tenants(status);
15
16CREATE TABLE users (
17 id TEXT PRIMARY KEY,
18 tenant_id TEXT NOT NULL REFERENCES tenants(id),
19 email TEXT NOT NULL,
20 role TEXT NOT NULL DEFAULT 'member', -- owner | admin | member | viewer
21 status TEXT NOT NULL DEFAULT 'active',
22 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
23 updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
24 deleted_at TIMESTAMPTZ,
25 UNIQUE(tenant_id, email)
26);
27
28CREATE INDEX idx_users_tenant_id ON users(tenant_id);
29CREATE INDEX idx_users_email ON users(email);Model catalog
Relay maintains a catalog of all supported models with their pricing. This drives cost calculation and provider routing.
1CREATE TABLE providers (
2 id TEXT PRIMARY KEY, -- openai | anthropic | mistral | groq
3 name TEXT NOT NULL, -- OpenAI
4 base_url TEXT NOT NULL, -- https://api.openai.com/v1
5 status TEXT NOT NULL DEFAULT 'active', -- active | degraded | down
6 priority INT NOT NULL DEFAULT 100, -- lower = more preferred
7 created_at TIMESTAMPTZ NOT NULL DEFAULT now()
8);
9
10CREATE TABLE models (
11 id TEXT PRIMARY KEY, -- gpt-5.5 | claude-3-5-sonnet-20241022
12 provider_id TEXT NOT NULL REFERENCES providers(id),
13 relay_name TEXT UNIQUE NOT NULL, -- the name tenants use: "gpt-5.5"
14 provider_name TEXT NOT NULL, -- name the provider actually accepts
15 display_name TEXT NOT NULL, -- GPT-4o
16 input_cost_per_1m NUMERIC NOT NULL, -- USD per 1M input tokens
17 output_cost_per_1m NUMERIC NOT NULL, -- USD per 1M output tokens
18 context_window INT NOT NULL, -- max tokens in context
19 supports_streaming BOOLEAN NOT NULL DEFAULT true,
20 supports_functions BOOLEAN NOT NULL DEFAULT false,
21 supports_vision BOOLEAN NOT NULL DEFAULT false,
22 status TEXT NOT NULL DEFAULT 'active',
23 created_at TIMESTAMPTZ NOT NULL DEFAULT now()
24);
25
26CREATE INDEX idx_models_provider_id ON models(provider_id);A few rows to make it concrete:
1id provider_id relay_name input_cost output_cost context_window
2gpt-5.5 openai gpt-5.5 2.50 10.00 128000
3claude-3-5-sonnet-20241022 anthropic claude-3-5-sonnet 3.00 15.00 200000
4mistral-large-2411 mistral mistral-large 2.00 6.00 131072
5llama-3.3-70b-versatile groq llama-3.3-70b 0.59 0.79 128000Costs are in USD per 1 million tokens.
Provider credentials
Relay uses its own API keys to call providers. These are pooled: Relay pays the providers, marks up the cost, and charges tenants.
For enterprise tenants who want to bring their own keys (BYOK) and pay providers directly, Relay supports that too. The tenant_id being NULL means the credential belongs to Relay.
Provider credentials are among the most sensitive data Relay stores. A leaked provider key means an attacker can run LLM calls billed to that key. The key_encrypted column stores the key encrypted with AES-256-GCM using a data encryption key (DEK) that is itself encrypted by a master key stored in a KMS (AWS KMS or HashiCorp Vault). The plaintext key never touches disk. Relay calls the KMS API to decrypt the DEK at runtime, decrypts the credential in memory, uses it for the provider request, and discards it.
Key rotation works in two steps: the KMS master key is rotated on a schedule (annually at minimum), then a background job re-encrypts every key_encrypted value under the new master key. The job runs row by row with checkpointing so it can resume if interrupted.
1CREATE TABLE provider_credentials (
2 id TEXT PRIMARY KEY,
3 tenant_id TEXT REFERENCES tenants(id), -- NULL = Relay's pooled key
4 provider_id TEXT NOT NULL REFERENCES providers(id),
5 key_encrypted BYTEA NOT NULL, -- encrypted with Relay's KMS key
6 key_hint TEXT NOT NULL, -- last 4 chars for display
7 is_relay_owned BOOLEAN NOT NULL DEFAULT true,
8 status TEXT NOT NULL DEFAULT 'active',
9 rpm_limit INT, -- provider's RPM cap for this key
10 tpm_limit INT, -- provider's TPM cap for this key
11 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
12 UNIQUE(tenant_id, provider_id)
13);Requests
Every API call Relay processes gets a row here. This is the source of truth for billing, debugging, and usage dashboards.
1CREATE TABLE requests (
2 id TEXT PRIMARY KEY, -- req_abc123
3 tenant_id TEXT NOT NULL REFERENCES tenants(id),
4 api_key_id TEXT NOT NULL REFERENCES api_keys(id),
5 model_id TEXT NOT NULL REFERENCES models(id),
6 provider_id TEXT NOT NULL REFERENCES providers(id),
7 credential_id TEXT NOT NULL REFERENCES provider_credentials(id),
8
9 -- What kind of request
10 request_type TEXT NOT NULL, -- chat | completion | embedding
11 stream BOOLEAN NOT NULL DEFAULT false,
12
13 -- Token counts (filled in after response)
14 input_tokens INT,
15 output_tokens INT,
16 total_tokens INT,
17
18 -- Cost in USD
19 input_cost NUMERIC(12,8),
20 output_cost NUMERIC(12,8),
21 total_cost NUMERIC(12,8),
22
23 -- Relay's markup (stored separately for accounting)
24 markup_rate NUMERIC(5,4) NOT NULL DEFAULT 0.20, -- 20% markup by default
25 relay_revenue NUMERIC(12,8),
26
27 -- Timing
28 started_at TIMESTAMPTZ NOT NULL,
29 first_token_at TIMESTAMPTZ, -- for streaming: TTFT
30 completed_at TIMESTAMPTZ,
31 latency_ms INT,
32 ttft_ms INT, -- time to first token
33
34 -- Outcome
35 status TEXT NOT NULL, -- success | error | cached | timeout
36 error_code TEXT,
37 error_message TEXT,
38 http_status INT,
39
40 -- Cache
41 cache_hit BOOLEAN NOT NULL DEFAULT false,
42 cache_key TEXT,
43
44 created_at TIMESTAMPTZ NOT NULL DEFAULT now()
45);
46
47CREATE INDEX idx_requests_tenant_created ON requests(tenant_id, created_at DESC);
48CREATE INDEX idx_requests_model_created ON requests(model_id, created_at DESC);
49CREATE INDEX idx_requests_api_key_created ON requests(api_key_id, created_at DESC);
50CREATE INDEX idx_requests_status ON requests(status, created_at DESC);The requests table grows fast. At 1,000 tenants each making 100 requests per day, that is 100,000 rows per day, 3 million per month. After 90 days you want to start archiving cold rows to cheaper storage (S3 + Parquet). Keep hot rows (last 30 days) in Postgres. Query historical data via a warehouse.
Daily usage rollup
Querying requests for billing is expensive at scale. Aggregate into a daily rollup table that background workers update continuously.
1CREATE TABLE usage_daily (
2 tenant_id TEXT NOT NULL REFERENCES tenants(id),
3 model_id TEXT NOT NULL REFERENCES models(id),
4 date DATE NOT NULL,
5 request_count BIGINT NOT NULL DEFAULT 0,
6 input_tokens BIGINT NOT NULL DEFAULT 0,
7 output_tokens BIGINT NOT NULL DEFAULT 0,
8 total_tokens BIGINT NOT NULL DEFAULT 0,
9 total_cost NUMERIC(12,4) NOT NULL DEFAULT 0,
10 relay_revenue NUMERIC(12,4) NOT NULL DEFAULT 0,
11 cache_hits BIGINT NOT NULL DEFAULT 0,
12 error_count BIGINT NOT NULL DEFAULT 0,
13 avg_latency_ms INT,
14 PRIMARY KEY (tenant_id, model_id, date)
15);Request Flow: From API Call to Provider and Back
This is the core of Relay. Every API call goes through this pipeline.
Authorization: Bearer rly_live_sk_... GW->>Auth: validate API key Auth->>Auth: SHA-256 hash key
lookup in Redis / Postgres Auth-->>GW: TenantContext injected GW->>RL: check RPM + TPM limits RL-->>GW: ok GW->>Cache: check cache
hash(model + messages + params) Cache-->>GW: miss GW->>Router: route request
model=gpt-5.5 Router->>Router: lookup model in catalog
select provider + credential Router->>OAI: POST api.openai.com/v1/chat/completions
Authorization: Bearer sk-relay-pooled-... OAI-->>Router: 200 OK (streaming SSE) Router-->>App: pipe SSE chunks Router->>Router: count tokens from response
calculate cost Router->>GW: log request async GW->>GW: INSERT into requests
INCR Redis counters
Model routing and failover
The router resolves which provider to use for a given model name. If the primary provider is down or over its rate limit, it falls back to an alternative.
model = gpt-5.5"] --> B["Lookup model
in catalog"] B --> C["Get providers
ordered by priority"] C --> D{"Tenant has
BYOK key?"} D -->|yes| E[Use tenant's key] D -->|no| F[Use Relay pooled key] E --> G{"Provider
healthy?"} F --> G G -->|yes| H[Forward request] G -->|no| I["Next provider
by priority"] I --> J{Any left?} J -->|yes| G J -->|no| K[503 No providers available] H -->|5xx from provider| I H -->|success| L[Return response]
func (r *Router) Route(ctx context.Context, req *RelayRequest) (*RelayResponse, error) {
model, err := r.catalog.GetModel(req.Model)
if err != nil {
return nil, ErrModelNotFound
}
providers := r.getOrderedProviders(ctx, model)
for _, p := range providers {
cred, err := r.getCredential(ctx, req.TenantID, p.ID)
if err != nil {
continue
}
resp, err := r.forward(ctx, p, cred, req)
if err == nil {
return resp, nil
}
// 5xx from provider: try next
if isProviderError(err) {
r.health.MarkDegraded(p.ID)
continue
}
// 4xx from provider: the request itself is bad, do not retry
return nil, err
}
return nil, ErrNoProvidersAvailable
}
Handling streaming (SSE)
When the tenant sends "stream": true, the provider returns a Server-Sent Events stream. Relay proxies it chunk by chunk. This is not a buffered response, it is a live pipe.
stream: true Relay->>Provider: POST (stream: true) Provider-->>Relay: data: {"delta": {"content": "Hello"}} Relay-->>App: data: {"delta": {"content": "Hello"}} Provider-->>Relay: data: {"delta": {"content": " world"}} Relay-->>App: data: {"delta": {"content": " world"}} Provider-->>Relay: data: [DONE] Relay-->>App: data: [DONE] Relay->>Relay: count tokens from chunks
log request async
Relay counts tokens as chunks arrive by tracking the usage field that providers send in the final chunk. For providers that do not send usage in streaming mode, Relay uses a local tokenizer (tiktoken for OpenAI models, SentencePiece for others) to estimate counts. The estimate is reconciled against the actual count when available.
Token counting is done on a goroutine that reads from the same SSE stream, separate from the goroutine that proxies chunks to the client. The client sees no added latency.
Retry logic
Relay retries failed provider requests automatically, but only for the right kind of failures.
Retryable: 5xx errors from the provider (server errors), network timeouts, connection resets. These indicate the provider had a problem, not that the request was wrong.
Not retryable: 4xx errors from the provider (except 429). A 400 means the request was malformed. A 401 means the credential is wrong. Retrying will not help and wastes time.
Provider rate limit (429 from provider): Relay does not wait and retry on the same key. It immediately switches to the next available provider key or provider. Waiting for a rate limit window to reset would add seconds of latency for the tenant.
Retry policy: up to 3 attempts, 100ms exponential backoff between attempts, each attempt on a different provider or key if possible. If all attempts fail, Relay returns the last error to the tenant.
Relay does not retry on behalf of the tenant after returning an error. If Relay returns a 502 or 503, the tenant’s code decides whether to retry at their layer.
Error response format
Every error Relay returns uses the same JSON structure. Developers should be able to write one error handler and handle all Relay errors with it.
1{
2 "error": {
3 "code": "rate_limit_exceeded",
4 "message": "You have exceeded 100 requests per minute on the free plan.",
5 "type": "rate_limit_error",
6 "details": {
7 "limit": 100,
8 "window": "per_minute",
9 "reset_at": "2026-07-01T12:05:00Z",
10 "upgrade_url": "https://relay.lorbic.com/billing"
11 }
12 }
13}The full set of error codes:
| HTTP status | error.code | Meaning |
|---|---|---|
| 401 | invalid_api_key |
Key not found or revoked |
| 402 | tenant_suspended |
Account suspended, check billing |
| 402 | quota_exceeded |
Hard quota limit hit (seats, storage) |
| 404 | model_not_found |
Requested model does not exist in catalog |
| 422 | invalid_request |
Request body failed validation |
| 429 | rate_limit_exceeded |
RPM or TPM limit hit |
| 502 | provider_error |
Provider returned an error after retries |
| 503 | no_provider_available |
All providers for this model are degraded |
| 504 | request_timeout |
Provider did not respond within the timeout |
The details field is present on rate_limit_exceeded and quota_exceeded errors so tenants can surface actionable information to their own users. If a tenant’s app is hitting Relay’s rate limit, they should know the reset time and the upgrade path, not just a generic “too many requests.”
Rate Limiting
Relay has to manage rate limits at two levels.
Inbound: protecting Relay’s own infrastructure from tenants who send too many requests.
Outbound: protecting Relay’s provider API keys from hitting the provider’s limits.
Both use Redis counters with sliding windows.
The dual-layer rate limiting was the part I found most interesting while designing this. Most rate limiting content covers the inbound side. The outbound side is specific to API gateway products but just as important. If Relay’s OpenAI key hits OpenAI’s TPM limit because of one large tenant, every other tenant on that key starts getting 429s from the provider. The router has to detect this and re-route before it becomes the tenant’s problem.
Inbound tenant limits
1-- Plan-level defaults
2CREATE TABLE quota_definitions (
3 plan TEXT NOT NULL,
4 resource TEXT NOT NULL, -- rpm | tpm | requests_per_day
5 limit_value BIGINT NOT NULL, -- -1 = unlimited
6 window TEXT NOT NULL, -- per_minute | per_day
7 UNIQUE(plan, resource, window)
8);
9
10-- Sample values:
11-- plan resource limit window
12-- free rpm 60 per_minute
13-- free tpm 40000 per_minute
14-- free requests_per_day 500 per_day
15-- starter rpm 500 per_minute
16-- starter tpm 400000 per_minute
17-- pro rpm 3000 per_minute
18-- pro tpm 2000000 per_minute
19-- pro requests_per_day -1 per_day (unlimited)
20
21-- Per-tenant overrides (for enterprise deals or support credits)
22CREATE TABLE quota_overrides (
23 id TEXT PRIMARY KEY,
24 tenant_id TEXT NOT NULL REFERENCES tenants(id),
25 resource TEXT NOT NULL,
26 limit_value BIGINT NOT NULL,
27 window TEXT NOT NULL,
28 reason TEXT,
29 expires_at TIMESTAMPTZ,
30 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
31 UNIQUE(tenant_id, resource, window)
32);Redis key structure for inbound limits:
1rl:in:ten_abc123:rpm → sliding window counter, 60s window
2rl:in:ten_abc123:tpm → sliding window counter, 60s windowOutbound provider limits
Each of Relay’s provider API keys has its own RPM and TPM cap set by the provider. Relay tracks these in Redis too.
1rl:out:openai:key_xyz:rpm → counter, 60s window
2rl:out:openai:key_xyz:tpm → counter, 60s windowWhen a pooled key is close to its provider limit, the router skips it and picks the next available key or provider.
The rate limit pipeline
RPM check"} B -->|over| C["429 Too Many Requests
retry-after header"] B -->|ok| D{"Inbound tenant
TPM check"} D -->|over| C D -->|ok| E[Route to provider] E --> F{"Outbound key
RPM check"} F -->|over| G["Pick next key
or provider"] F -->|ok| H[Forward to provider] G --> F
A request fails fast at inbound RPM or TPM before any provider call is made. The outbound check happens inside the router, per-key. If a key is saturated, the router picks the next key or the next provider, not the next retry window.
The TPM check on inbound traffic uses an estimated token count from the request body (input tokens only). The actual token count (input + output) is only known after the response. Relay does a conservative pre-check against the estimated input, then reconciles and updates the TPM counter after the response.
Token Cost Tracking and Billing
Relay’s pricing model: Relay pays the provider at published rates, marks up by 20%, and charges tenants at the marked-up rate. The markup funds Relay’s infrastructure and margin.
1Tenant sends: 1,000 input tokens, 500 output tokens on gpt-5.5
2Provider cost: (1000 / 1M) × $2.50 + (500 / 1M) × $10.00 = $0.00250 + $0.00500 = $0.00750
3Relay markup: $0.00750 × 1.20 = $0.00900
4Tenant billed: $0.00900
5Relay revenue: $0.00150Every request row in the requests table stores both the raw provider cost and the Relay revenue. This makes accounting and audits straightforward.
Cost calculation
input_tokens=1000
output_tokens=500"] --> B["Lookup model pricing
input_cost_per_1m
output_cost_per_1m"] B --> C["input_cost = tokens / 1M x rate
output_cost = tokens / 1M x rate"] C --> D[total_cost = input + output] D --> E[relay_revenue = total_cost x markup_rate] E --> F[INSERT into requests] F --> G[INCR usage_daily rollup]
Billing cycle
Relay charges tenants monthly via Stripe. The billing worker runs at the start of each month, reads the usage_daily table for the prior month, and creates a Stripe invoice line item per model per tenant.
1CREATE TABLE billing_periods (
2 id TEXT PRIMARY KEY,
3 tenant_id TEXT NOT NULL REFERENCES tenants(id),
4 period_start DATE NOT NULL,
5 period_end DATE NOT NULL,
6 status TEXT NOT NULL DEFAULT 'open', -- open | finalized | paid | failed
7 total_cost NUMERIC(12,4),
8 invoice_id TEXT, -- Stripe invoice ID
9 finalized_at TIMESTAMPTZ,
10 paid_at TIMESTAMPTZ,
11 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
12 UNIQUE(tenant_id, period_start)
13);
14
15CREATE TABLE billing_line_items (
16 id TEXT PRIMARY KEY,
17 billing_period_id TEXT NOT NULL REFERENCES billing_periods(id),
18 tenant_id TEXT NOT NULL REFERENCES tenants(id),
19 model_id TEXT NOT NULL REFERENCES models(id),
20 request_count BIGINT NOT NULL,
21 input_tokens BIGINT NOT NULL,
22 output_tokens BIGINT NOT NULL,
23 total_cost NUMERIC(12,4) NOT NULL,
24 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
25 UNIQUE(billing_period_id, model_id)
26);Billing is an entire system design on its own: Stripe subscription lifecycle, idempotent webhook handling, retry logic for failed payments, proration, credits, and tax calculation. This post covers the data model. The billing pipeline gets its own post.
Response Caching: Exact Match and Semantic
Caching completions saves money. If 1,000 tenants call the same model with the same system prompt and similar questions, Relay can serve many of those from cache instead of paying the provider.
Relay supports two cache strategies.
Exact match caching
The simplest form. Hash the full request (model + messages + temperature + all params). If you have seen this exact request before, return the cached response.
1cache_key = SHA-256(
2 model_id +
3 JSON.stringify(messages) +
4 temperature + top_p + max_tokens + ...
5)Stored in Redis with a configurable TTL. Default 1 hour. Tenants can set their own TTL per API call via a x-relay-cache-ttl header.
Semantic caching
Exact match only works for identical requests. Semantic caching uses embeddings to find requests that are similar enough to return the same response.
using a small model"] B --> C["Vector similarity search
against cached embeddings"] C --> D{"Similarity score
> threshold?"} D -->|yes| E["Return cached response
cache_hit = semantic"] D -->|no| F["Forward to provider
cache the response + embedding"]
The embedding model for cache lookup should be cheap and fast (text-embedding-3-small or similar). The cost of generating the lookup embedding is tiny compared to the LLM call it might save.
1CREATE TABLE response_cache (
2 id TEXT PRIMARY KEY,
3 tenant_id TEXT NOT NULL REFERENCES tenants(id),
4 cache_key TEXT NOT NULL, -- exact match hash
5 model_id TEXT NOT NULL REFERENCES models(id),
6 request_hash TEXT NOT NULL,
7 response_body JSONB NOT NULL,
8 input_tokens INT NOT NULL,
9 output_tokens INT NOT NULL,
10 embedding vector(1536), -- pgvector, for semantic cache
11 hit_count INT NOT NULL DEFAULT 0,
12 expires_at TIMESTAMPTZ NOT NULL,
13 created_at TIMESTAMPTZ NOT NULL DEFAULT now()
14);
15
16CREATE INDEX idx_response_cache_tenant_key ON response_cache(tenant_id, cache_key);
17CREATE INDEX idx_response_cache_embedding ON response_cache
18 USING hnsw (embedding vector_cosine_ops)
19 WHERE embedding IS NOT NULL;Observability
You cannot operate a multi-tenant API platform without per-tenant observability. When a tenant files a support ticket saying “your API is slow,” you need to be able to pull up exactly their requests, their latencies, and their errors in under a minute.
What to track per request
Every request log entry should have:
tenant_idandapi_key_idmodel_idandprovider_idlatency_msandttft_ms(time to first token, the metric tenants care about for streaming)input_tokens,output_tokens,total_coststatusanderror_codecache_hit
These are already in the requests table. Make sure every log line, trace span, and metric tag also carries tenant_id.
Structured logging
func RequestLoggingMiddleware(logger *slog.Logger) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
key := APIKeyFromContext(r.Context())
reqLogger := logger.With(
slog.String("tenant_id", key.TenantID),
slog.String("api_key_id", key.ID),
slog.String("request_id", newRequestID()),
)
ctx := WithLogger(r.Context(), reqLogger)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
}
Health status per provider
Relay’s router uses a health tracker to avoid sending requests to degraded providers. The tracker works on a circuit breaker pattern: after N consecutive errors from a provider, mark it degraded and stop routing to it until it recovers.
1CREATE TABLE provider_health_events (
2 id TEXT PRIMARY KEY,
3 provider_id TEXT NOT NULL REFERENCES providers(id),
4 status TEXT NOT NULL, -- healthy | degraded | down
5 error_rate NUMERIC, -- 5-minute error rate at time of event
6 p95_ms INT, -- p95 latency at time of event
7 message TEXT,
8 created_at TIMESTAMPTZ NOT NULL DEFAULT now()
9);
10
11CREATE INDEX idx_provider_health_events ON provider_health_events(provider_id, created_at DESC);The provider status page (status.relay.lorbic.com) reads from this table. Tenants can subscribe to status updates.
Noisy neighbor detection
In the pool model, one tenant running an aggressive batch job can saturate Relay’s provider keys and cause latency for everyone else. Detect this by tracking per-tenant request volume in Redis with a 1-minute rolling window. If a single tenant exceeds a percentage threshold of total traffic, flag them and apply stricter rate limits automatically.
Role-Based Access Control
API calls use API key auth. The Relay dashboard (where tenants manage keys, view usage, configure settings) uses JWT auth.
1{
2 "iss": "relay.lorbic.com",
3 "sub": "usr_abc123",
4 "tenant_id": "ten_xyz789",
5 "tenant_slug": "acme",
6 "role": "admin",
7 "permissions": ["api_keys:read", "api_keys:write", "billing:read", "usage:read"],
8 "iat": 1751400000,
9 "exp": 1751500000
10}RBAC tables
1CREATE TABLE roles (
2 id TEXT PRIMARY KEY,
3 tenant_id TEXT NOT NULL REFERENCES tenants(id),
4 name TEXT NOT NULL, -- owner | admin | developer | viewer
5 is_system BOOLEAN NOT NULL DEFAULT false,
6 created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
7 UNIQUE(tenant_id, name)
8);
9
10CREATE TABLE permissions (
11 id TEXT PRIMARY KEY,
12 resource TEXT NOT NULL, -- api_keys | usage | billing | members | settings
13 action TEXT NOT NULL, -- create | read | update | delete
14 UNIQUE(resource, action)
15);
16
17CREATE TABLE role_permissions (
18 role_id TEXT NOT NULL REFERENCES roles(id),
19 permission_id TEXT NOT NULL REFERENCES permissions(id),
20 PRIMARY KEY (role_id, permission_id)
21);
22
23CREATE TABLE user_roles (
24 user_id TEXT NOT NULL REFERENCES users(id),
25 role_id TEXT NOT NULL REFERENCES roles(id),
26 PRIMARY KEY (user_id, role_id)
27);Audit log
Every sensitive action in the dashboard gets recorded: key creation, key revocation, plan changes, member removal, billing configuration.
1CREATE TABLE audit_log (
2 id TEXT PRIMARY KEY,
3 tenant_id TEXT REFERENCES tenants(id),
4 actor_id TEXT NOT NULL,
5 actor_type TEXT NOT NULL, -- user | platform_admin | system
6 action TEXT NOT NULL, -- api_key.created | api_key.revoked | member.removed
7 resource_id TEXT,
8 resource_type TEXT,
9 metadata JSONB,
10 ip_address INET,
11 created_at TIMESTAMPTZ NOT NULL DEFAULT now()
12);
13
14CREATE INDEX idx_audit_log_tenant ON audit_log(tenant_id, created_at DESC);
15CREATE INDEX idx_audit_log_actor ON audit_log(actor_id, created_at DESC);Tenant Provisioning: From Signup to First API Call
What happens when someone signs up
When a user submits the signup form, Relay does the minimum work synchronously: insert a tenant row with status=PROVISIONING, create the owner user record, and return a 201. The browser redirects immediately. Everything else runs in the background.
The background worker picks up the job and does the slower parts: seeding system roles, creating a Stripe customer record, and flipping the tenant to TRIAL. The Stripe customer is created at signup, not when the user adds a payment method. This is intentional. You need a Stripe customer ID to attach payment methods, subscriptions, and invoices later. Creating it lazily (at payment time) means every billing code path has to handle the case where the ID does not exist yet, which causes bugs.
The welcome email includes the tenant’s first API key. The key is generated by the worker, not the API handler, because key generation requires the tenant record to be fully set up.
If the worker fails partway through, the tenant stays in PROVISIONING. A monitoring job scans for tenants stuck in PROVISIONING for more than 5 minutes and either retries the job or pages on-call.
{ email, password, org_name } API->>DB: INSERT tenant (status=PROVISIONING) API->>DB: INSERT user (role=owner) API->>Queue: enqueue ProvisionTenant job API-->>User: 201 Created
{ tenant_id, redirect: /onboarding } Worker->>Queue: dequeue job Worker->>DB: seed default roles (owner, admin, developer, viewer) Worker->>DB: seed plan quota definitions Worker->>Stripe: create Stripe customer Worker->>DB: UPDATE tenant SET status=TRIAL, stripe_customer_id=cus_... Worker-->>User: send welcome email with first API key
Tenant status transitions
The status column on the tenants table is the single source of truth for what a tenant can do. Every API request, every dashboard login, and every background job checks this column before doing anything.
PROVISIONING: The tenant record exists but setup is not complete. API calls are rejected with a 503. The dashboard shows a loading state.
TRIAL: Setup is complete. The tenant has full API access up to the trial plan’s rate limits. A trial expiry timestamp lives in tenant_config. When it passes without a payment method being added, a background job moves the tenant to EXPIRED.
ACTIVE: A payment method is on file and billing is working. Full access based on the tenant’s plan.
SUSPENDED: Payment failed, or a Relay admin manually suspended the account. API calls return a 402 with a clear message and a link to the billing page. The tenant’s data is untouched. If payment resolves within the grace period, the account returns to ACTIVE with no data loss. This is intentional: you want tenants to fix their billing, not lose access to everything they built.
EXPIRED: The trial ended without a payment method being added. Same behavior as SUSPENDED for API calls. After N days with no action, the tenant moves to DELETED.
DELETED: Soft delete. The row stays in the database with deleted_at set, but all API calls are rejected. Data is retained for 30 days for recovery or dispute resolution, then a GDPR-compliant hard deletion job removes all tenant rows across every table.
Status transitions are intentionally one-directional except for SUSPENDED back to ACTIVE. You cannot un-delete a tenant. You cannot go from ACTIVE back to TRIAL. These constraints are enforced at the application layer via explicit state machine logic, not just database constraints.
Suspended tenants get a 429 response with a billing error message on every API call. This is important: the error message should tell them what is wrong and link them to the billing page. A silent 401 causes confusion.
Feature Flags and Per-Tenant Configuration
1CREATE TABLE plan_features (
2 plan TEXT NOT NULL,
3 feature TEXT NOT NULL, -- semantic_cache | byok | sso | audit_log | higher_rpm | webhooks
4 enabled BOOLEAN NOT NULL DEFAULT true,
5 PRIMARY KEY (plan, feature)
6);
7
8CREATE TABLE tenant_feature_overrides (
9 tenant_id TEXT NOT NULL REFERENCES tenants(id),
10 feature TEXT NOT NULL,
11 enabled BOOLEAN NOT NULL,
12 reason TEXT,
13 expires_at TIMESTAMPTZ,
14 PRIMARY KEY (tenant_id, feature)
15);
16
17CREATE TABLE tenant_config (
18 tenant_id TEXT NOT NULL REFERENCES tenants(id),
19 key TEXT NOT NULL,
20 value TEXT NOT NULL,
21 updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
22 PRIMARY KEY (tenant_id, key)
23);
24
25-- Example config rows:
26-- tenant_id key value
27-- ten_abc default_cache_ttl_sec 3600
28-- ten_abc markup_rate 0.15 (negotiated lower markup for volume)
29-- ten_abc allowed_models ["gpt-5.5","claude-3-5-sonnet"]
30-- ten_abc webhook_url https://acme.com/hooks/relayCache the feature flag state per tenant in Redis (60s TTL). Do not query Postgres on every API call.
Full Architecture
The gateway layer is stateless. Tenant apps hit the load balancer, which routes to any gateway pod. All shared state (rate limits, cache, feature flags) lives in Redis. All durable state (requests, usage, billing, config) lives in Postgres. Workers run separately and never block the hot path.
Scaling a Multi-Tenant API Gateway
Relay is stateless at the gateway layer. You can add more gateway pods behind the load balancer without coordination. The state lives in Postgres and Redis.
Redis is the first bottleneck. Rate limit counters and cache lookups hit Redis on every request. Run Redis in cluster mode with at least 3 shards. Shard by tenant_id so one tenant’s traffic does not create hot keys for others.
Postgres becomes the bottleneck as the requests table grows. Partition it by created_at (monthly partitions). Drop or archive partitions older than your retention window. Use PgBouncer in transaction mode for the gateway (short, fast INSERTs) and session mode for the background workers (longer transactions, aggregation queries).
Provider key rotation: as Relay grows, a single OpenAI API key will hit TPM limits. Relay maintains a pool of provider keys per provider and distributes load across them using the outbound rate limit counters in Redis. The router picks the key with the most remaining capacity in the current window.
FAQ
Why store the hash of an API key instead of the key itself?
If your database is compromised, the attacker gets hashes, not keys. A raw API key in the database is a plaintext credential. SHA-256 is one-way: you verify a key by hashing the incoming value and comparing it against the stored hash, but you cannot reverse a hash back into the original key. Treat API keys the same way you treat passwords.
How do you prevent a missing WHERE tenant_id clause from leaking data across tenants?
Row-Level Security in Postgres. You attach a policy to every tenant-scoped table that restricts all queries to rows matching app.current_tenant_id, a session variable the application sets at the start of every request. A bare SELECT * FROM requests with no WHERE clause still returns only that tenant’s rows. The policy runs inside the database engine, below the application layer, so a bug in application code cannot bypass it.
What happens if OpenAI goes down?
The router tries providers in priority order. If OpenAI returns a 5xx, it marks OpenAI as degraded and tries the next provider that supports the requested model. If no provider is available, the request returns 503. The health monitor runs on a short interval and flips providers back to healthy once they start returning 2xx responses again. Tenants see a brief error window during a full provider outage, not an extended one.
How do you count tokens accurately for billing when streaming does not always return usage data?
Providers that support stream_options: { include_usage: true } (OpenAI does) include a usage object in the final SSE chunk. Relay reads those counts directly. For providers that do not, Relay runs a local tokenizer (tiktoken for OpenAI-compatible models, SentencePiece for others) to estimate counts. The estimate is within 1-2% of the actual count in practice. When there is a discrepancy, Relay bills the lower number rather than over-billing the tenant.
How do you handle rate limiting if Redis goes down?
Redis Cluster with at least 3 nodes and automatic failover handles most availability scenarios. If the entire cluster becomes unreachable, Relay fails open: it allows requests through rather than rejecting all traffic. Losing rate limit enforcement for a short period during a Redis failure is an acceptable trade-off. Taking down the entire API is not.
What is the noisy neighbor problem and how does it show up in a pool model SaaS?
One tenant runs a large batch job and consumes a disproportionate share of shared infrastructure: database connections, query time, provider key capacity. This increases latency for every other tenant on the same cluster without them doing anything wrong. Relay tracks per-tenant request volume in Redis with a 1-minute rolling window. Tenants that exceed a share threshold get their rate limits tightened automatically. If the pattern persists, they are a candidate for migration to a silo.
How do you migrate a tenant from the pool model to a silo without downtime?
Postgres logical replication. Set up a replication slot on the shared database filtered to the tenant’s rows. Let the silo database catch up to near-real-time lag. Then pause writes for that tenant briefly (a few seconds), apply the final delta, update the routing config to point new requests at the silo, and release the write pause. The tenant sees a brief stall, not a maintenance window.
Why keep billing records in separate tables instead of querying the requests table at invoice time?
The requests table gets very large and is archived after 30-90 days to keep query performance acceptable. Billing records need to be retained for years for tax, legal, and dispute reasons. Separating billing_line_items and usage_daily from raw request logs means you can archive or drop old request partitions without losing billing history. The rollup also makes monthly invoice generation a fast aggregation query rather than a multi-billion-row scan.
What does BYOK mean and why would an enterprise tenant want it?
Bring Your Own Keys. The tenant provides their own OpenAI or Anthropic API keys instead of Relay’s pooled ones. They pay the provider directly at published rates, and Relay charges a flat platform fee instead of a per-token markup. Enterprises want BYOK for three reasons: cost control at high volume, data privacy (their traffic flows under their own API key, not Relay’s), and compliance requirements that restrict which credentials can touch their data.
Why use prefixed API keys like rly_live_sk_?
Three reasons. First, the prefix makes keys identifiable in logs, error messages, and accidental source code commits. Second, secret scanning tools (GitHub, trufflehog, Doppler) can be configured to detect your specific prefix and alert on leaks before damage is done. Third, the environment segment (live vs test) lets Relay reject live keys on test endpoints and test keys on live endpoints at the routing layer, before any business logic runs.
How does semantic caching differ from exact match caching, and what can go wrong?
Exact match caching hashes the full request and returns a cached response only if the exact same request was made before. Hit rates are low. Semantic caching embeds the user message into a vector and finds cached responses from requests that are semantically similar. Hit rates are higher, but there is a risk: two questions that are phrased similarly but have meaningfully different answers could get the same cached response. The similarity threshold controls this trade-off. Set it too high and you get wrong answers. Set it too low and cache hits are rare. Semantic caching should never be enabled for requests where correctness is critical (medical, legal, financial).
What happens to a tenant’s data when their account is deleted?
Relay uses a soft delete: the tenant row gets deleted_at set and status set to DELETED. All API access is immediately rejected. The underlying data stays in the database for 30 days. During that window, a tenant can contact support to recover their account. After 30 days, a scheduled deletion job hard-deletes all rows across every table where tenant_id matches, in dependency order (child tables first, then the tenant row). This satisfies GDPR Article 17 (right to erasure) and gives you an audit trail during the retention window.
Things Worth Their Own Post
- Billing pipeline in depth: Stripe subscription webhooks, idempotent event processing, dunning logic, credit notes, usage-based invoice generation.
- Semantic caching: embedding model selection, similarity thresholds, cache invalidation strategy, cost-benefit analysis.
- BYOK (Bring Your Own Keys): encrypting and storing tenant provider credentials, key rotation, audit trail for credential usage.
- Webhooks: delivering request completion events to tenants, retry logic, failure handling, signing payloads.
- Provider health monitoring: circuit breaker implementation, alerting on degradation, automatic failover testing.
- Request archival: streaming from Postgres to Parquet on S3, querying historical data with DuckDB or Athena.
- Multiregion: running Relay in US, EU, AP with tenant routing to nearest region, cross-region cost aggregation.