Routing AI Inference Across 25 Providers (and Letting Users Bring Their Own Keys)
How RoleCall routes inference across internal servers, a load-balanced proxy, and user-provided API keys — with E2E encryption, tier enforcement, and graceful fallback.
When you're building an AI product, you start with one model from one provider. Then you add a second for fallback. Then a third because your users want it. Then someone asks "can I use my own API key?" and suddenly you're building an inference router.
At RoleCall, we support 25+ providers, internal model servers, a load-balanced proxy, and full BYOK (Bring Your Own Key) — all behind a single executeInference() call. Here's how we got there and why every AI product eventually needs something like this.
Why not just pick one provider?
Three reasons pushed us away from single-provider dependency:
Reliability. Every provider goes down. OpenAI has outages. Anthropic has capacity limits. Google has quota surprises. If your product depends on one provider, their downtime is your downtime. We needed fallback chains that switch automatically.
Cost. Different models have wildly different price/performance ratios. A simple lorebook lookup doesn't need Claude Opus — Gemini Flash at 1/40th the cost handles it fine. We needed to route different workloads to different models.
User choice. Our power users had strong opinions about models. Some swore by Claude for prose quality. Others preferred local models for privacy. Instead of arguing, we decided to let them bring their own keys.
The routing decision tree
Every inference request flows through the same path. The router examines the model ID and determines where to send it:
Model IDs use a prefix pattern that determines routing:
- `local/*` and `utility/*` — free-tier models on our NemoAI proxy, no rate limits
- `work/*` and `closed/*` — premium models through the proxy, tier-enforced
- BYOK models — routed to the user's configured provider
- Server-referenced models — routed to admin-managed inference servers
This pattern-based routing means we can add new models without changing routing code. A new model behind the proxy just needs the right prefix.
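A minimal sketch of that decision tree (the function name, return labels, and parameters are illustrative; only the prefixes come from the list above):

```typescript
type Route = "proxy-free-tier" | "proxy-premium" | "byok" | "inference-server";

// Illustrative routing sketch — BYOK and server-referenced models are
// flagged on the model record; everything else routes by prefix.
function routeModel(modelId: string, isByok = false, serverId?: string): Route {
  if (isByok) return "byok";               // user's own provider key
  if (serverId) return "inference-server"; // admin-managed inference server
  if (modelId.startsWith("local/") || modelId.startsWith("utility/")) {
    return "proxy-free-tier";              // NemoAI proxy, no rate limits
  }
  if (modelId.startsWith("work/") || modelId.startsWith("closed/")) {
    return "proxy-premium";                // NemoAI proxy, tier-enforced
  }
  throw new Error(`No route for model ${modelId}`);
}
```

Adding a model behind the proxy is then just a naming decision: pick the right prefix and the router already knows where it goes.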
The provider abstraction layer
25 providers, 3 fundamentally different API formats. OpenAI uses chat/completions with messages. Anthropic uses /v1/messages with content blocks. Google uses generateContent with parts. And then there are 20+ OpenAI-compatible providers that are mostly standard but each have quirks.
We built an adapter layer that normalizes everything:
Each adapter implements four methods:
- `buildRequest()` — transform our internal format into the provider's API format
- `parseResponse()` — normalize the provider's response back to our format
- `parseStreamChunk()` — parse SSE chunks during streaming
- `testConnection()` — validate credentials and fetch available models
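In TypeScript, the interface might look something like this (type names and request shapes are assumptions, not RoleCall's actual definitions), with a toy OpenAI-style adapter covering the two synchronous methods:

```typescript
// Hypothetical internal shapes — field names are illustrative.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string }
interface InferenceRequest { model: string; messages: ChatMessage[]; stream?: boolean }
interface NormalizedResponse { content: string }

interface ProviderAdapter {
  buildRequest(req: InferenceRequest): { path: string; body: unknown };
  parseResponse(raw: unknown): NormalizedResponse;
  parseStreamChunk(line: string): string | null;
  testConnection(): Promise<string[]>; // returns available model IDs
}

// Toy OpenAI-style adapter (partial — only the synchronous methods).
const openAIAdapter = {
  buildRequest(req: InferenceRequest) {
    return {
      path: "/chat/completions",
      body: { model: req.model, messages: req.messages, stream: req.stream ?? false },
    };
  },
  parseStreamChunk(line: string): string | null {
    if (!line.startsWith("data: ") || line === "data: [DONE]") return null;
    return JSON.parse(line.slice(6)).choices?.[0]?.delta?.content ?? null;
  },
};
```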
The OpenAI adapter handles 18 providers through subclasses that override specific behaviors. DeepSeek needs special request formatting. OpenRouter has reasoning output extraction. NanoGPT switches between subscription and pay-per-use endpoints. Z.AI toggles between common and coding modes. Each quirk is isolated in its adapter.
BYOK: Bring Your Own Key
BYOK was the feature our users asked for most — and the one with the hardest security requirements. You're asking users to give you their API keys. Those keys have billing attached. If you leak them, someone else runs up their tab.
E2E encryption
We built an end-to-end encryption system where the server never has access to user API keys at rest:
The user generates a 12-word BIP39 recovery phrase (like a crypto wallet). This derives an AES-256-GCM encryption key that stays in their browser. When they add a provider, the browser encrypts the API key locally and sends only the ciphertext to our server. We store it but can't read it.
When they start a chat, the browser decrypts the key locally and sends the plaintext in the request body. The server uses it for that one request and forgets it.
This means even if our database is compromised, the API keys are useless — they're encrypted with keys we don't have.
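The shape of the scheme, sketched in Node for testability (the browser side would use WebCrypto instead of `node:crypto`; function names are illustrative, and taking the first 32 seed bytes as the AES key is our assumption — BIP39 itself specifies PBKDF2-SHA512 over the mnemonic with the salt `"mnemonic"` and 2048 iterations):

```typescript
import { createCipheriv, createDecipheriv, pbkdf2Sync, randomBytes } from "node:crypto";

// Derive a 256-bit AES key from the 12-word recovery phrase.
function keyFromMnemonic(mnemonic: string): Buffer {
  return pbkdf2Sync(mnemonic.normalize("NFKD"), "mnemonic", 2048, 32, "sha512");
}

// AES-256-GCM: random 96-bit IV per encryption; IV + auth tag + ciphertext
// are packed into one base64 blob, which is all the server ever stores.
function encryptApiKey(apiKey: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(apiKey, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]).toString("base64");
}

function decryptApiKey(blob: string, key: Buffer): string {
  const buf = Buffer.from(blob, "base64");
  const decipher = createDecipheriv("aes-256-gcm", key, buf.subarray(0, 12));
  decipher.setAuthTag(buf.subarray(12, 28));
  return Buffer.concat([decipher.update(buf.subarray(28)), decipher.final()]).toString("utf8");
}
```

Because the derivation runs entirely client-side, losing the recovery phrase means losing the keys — the same trade-off crypto wallets make.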
Server mode for group features
E2E encryption has a limitation: the server can't use keys autonomously. For features like group chat auto-responses (where the server needs to make inference calls without the user's browser being open), we offer an optional server encryption mode. The server encrypts/decrypts using its own key. Less secure, but necessary for certain features. Users opt in explicitly.
BYOK request flow
When a request hits a BYOK model:
- Router detects `model.is_byok = true` with a matching provider ID
- Skip all tier enforcement — the user is paying their own provider, not us
- Retrieve the provider config — either from the request body (E2E) or decrypt from DB (server mode)
- Get the provider's adapter (OpenAI, Anthropic, Google, etc.)
- `adapter.buildRequest()` — format for the specific provider
- Make the request with SSRF validation on custom base URLs
- `adapter.parseResponse()` — normalize back
- Log usage for the user's analytics (no billing on our side)
If the provider returns a 401, we mark the config as validation_status: "invalid" so the user sees it in their settings.
For custom endpoints, we even handle URL fallback — if the base URL 404s, we try appending /chat/completions, then /v1/chat/completions, then /api/v1/chat/completions, covering the common layouts.
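The fallback order can be expressed as a simple candidate list (function name is illustrative; the path suffixes are the ones above):

```typescript
// Candidate URL layouts, tried in order when the base URL 404s.
function candidateEndpoints(baseUrl: string): string[] {
  const base = baseUrl.replace(/\/+$/, ""); // strip trailing slashes
  return [
    base,
    `${base}/chat/completions`,
    `${base}/v1/chat/completions`,
    `${base}/api/v1/chat/completions`,
  ];
}
```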
NemoAI: our load-balanced proxy
For users on RoleCall's own models (not BYOK), requests go through NemoAI — our internal proxy that load-balances across multiple provider credentials.
The proxy maintains provider pools — multiple credential sets per provider type with health tracking:
- `isHealthy` — currently operational
- `errorCount` — consecutive failures (threshold: 3)
- `lastUsed` / `usageCount` — for round-robin distribution
When a provider pool accumulates too many errors, the proxy automatically falls back to the next provider in the chain. For example, if our primary Gemini pool hits 3 consecutive errors, traffic shifts to the Claude pool. This is transparent to the user — they just see uninterrupted service.
The proxy also handles model remapping. When a user requests one model but our routing logic determines a different upstream model is better (or cheaper), the proxy substitutes transparently.
Retry logic uses exponential backoff: 3 retries max, starting at 1 second, doubling each time.
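That policy is a few lines of code (the wrapper name is illustrative):

```typescript
// Exponential backoff matching the stated policy: 3 retries max,
// 1s initial delay, doubling each attempt (1s, 2s, 4s).
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```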
Tier enforcement
Not everything is unlimited. RoleCall has subscription tiers that gate model access, request quotas, and rate limits:
| Tier | Monthly Requests | Per Minute | Max Context |
|---|---|---|---|
| Director | Unlimited | Unlimited | Unlimited |
| Premium | 10,000 | 100 | 200k tokens |
| Standard | 1,000 | 10 | 128k tokens |
| Standing Room | 100 | 5 | 32k tokens |
Enforcement happens before inference, not after. Rate limiting uses an atomic Postgres check — a row-level lock that prevents race conditions when multiple requests hit simultaneously. No Redis needed, no distributed coordination — just a database function that increments and checks in a single transaction.
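A sketch of the atomic check, assuming a postgres.js-style tagged-template client (table and column names are hypothetical). A single conditional `UPDATE` both increments and enforces the limit, so concurrent requests serialize on the row lock and the counter can never overshoot:

```typescript
type Sql = (strings: TemplateStringsArray, ...values: unknown[]) => Promise<{ request_count: number }[]>;

// Returns true if the request is within quota (and the counter was bumped),
// false if the user is over the limit — checked before any inference runs.
async function tryConsumeQuota(sql: Sql, userId: string, monthlyLimit: number): Promise<boolean> {
  const rows = await sql`
    UPDATE usage_counters
       SET request_count = request_count + 1
     WHERE user_id = ${userId}
       AND request_count < ${monthlyLimit}
     RETURNING request_count`;
  return rows.length > 0; // no row matched → quota exhausted
}
```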
BYOK requests skip tier enforcement entirely. If you're paying your own provider, we don't limit you.
Streaming without blocking
Every inference request can be streaming (SSE). The tricky part: we need to count tokens and log usage without blocking the stream.
We use a TransformStream that intercepts chunks as they flow through:
let streamedContent = "";
let capturedUsage = null;
const decoder = new TextDecoder();

const transform = new TransformStream({
  transform(chunk, controller) {
    // Pass through immediately — don't block the stream
    controller.enqueue(chunk);
    // Parse in background for token counting
    const lines = decoder.decode(chunk, { stream: true }).split("\n");
    for (const line of lines) {
      if (line.startsWith("data: ") && line !== "data: [DONE]") {
        try {
          const data = JSON.parse(line.slice(6));
          streamedContent += data.choices?.[0]?.delta?.content || "";
          if (data.usage) capturedUsage = data.usage;
        } catch {
          // Partial JSON split across chunk boundaries — skip it;
          // counting is best-effort and must never break the stream
        }
      }
    }
  },
  async flush() {
    // Stream complete — log usage asynchronously
    logUsage(...).catch(console.error);
  }
});
The stream passes through to the client untouched. Token counting happens as a side effect. Usage logging fires asynchronously after the stream completes. The user never waits for our bookkeeping.
For token estimation, we prefer provider-reported counts (requested via stream_options: { include_usage: true }). If the provider doesn't support it, we fall back to content length divided by 4 characters per token — rough but sufficient for usage tracking.
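The preference order reduces to a small helper (name and usage shape are illustrative):

```typescript
// Prefer provider-reported usage; fall back to the ~4 chars/token heuristic.
function estimateTokens(
  reportedUsage: { total_tokens?: number } | null,
  streamedContent: string,
): number {
  if (reportedUsage?.total_tokens) return reportedUsage.total_tokens;
  return Math.ceil(streamedContent.length / 4);
}
```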
Cost tracking
Every inference call gets a cost estimate based on token counts and model pricing:
calculateCost(usage, providerId, modelId)
// → { inputCost, outputCost, totalCost, isFree }
Pricing comes from the models table (synced periodically from provider APIs). OpenRouter is particularly helpful here — their /api/v1/models endpoint includes per-token pricing for every model they support.
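A simplified sketch of the calculation — it takes the pricing row directly instead of looking it up by provider and model ID, and the per-million-token pricing shape is our assumption about how the synced table is organized:

```typescript
// Hypothetical pricing shape: USD per million tokens.
interface ModelPricing { inputPerMTok: number; outputPerMTok: number }
interface Usage { prompt_tokens: number; completion_tokens: number }

function calculateCost(usage: Usage, pricing: ModelPricing | null) {
  if (!pricing) return { inputCost: 0, outputCost: 0, totalCost: 0, isFree: true };
  const inputCost = (usage.prompt_tokens / 1_000_000) * pricing.inputPerMTok;
  const outputCost = (usage.completion_tokens / 1_000_000) * pricing.outputPerMTok;
  return { inputCost, outputCost, totalCost: inputCost + outputCost, isFree: false };
}
```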
For BYOK, we track the cost but don't bill — it's informational, helping users understand their spending across providers.
Security patterns
A few security details worth mentioning:
SSRF protection. When users provide custom base URLs (BYOK), we validate them against internal IP ranges, private networks, and localhost. We also set redirect: "error" in fetch to prevent redirect-based bypass.
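A literal-hostname version of that check (function name is illustrative; note the caveat in the comment — a production check must also resolve DNS and validate the resulting IP, since an attacker-controlled domain can point at an internal address):

```typescript
// Rejects localhost and RFC 1918 / link-local literals in custom base URLs.
function isForbiddenHost(urlString: string): boolean {
  const { hostname } = new URL(urlString);
  if (hostname === "localhost" || hostname === "[::1]" || hostname === "0.0.0.0") return true;
  // IPv4 loopback, private, and link-local ranges (partial list).
  return /^(127\.|10\.|192\.168\.|169\.254\.|172\.(1[6-9]|2\d|3[01])\.)/.test(hostname);
}
```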
Timing-safe API key validation. Checking API keys can leak information through response timing — an invalid key returns faster than a valid one, revealing which keys exist. We enforce a minimum response time of 100ms, sleeping to pad short responses. Even on invalid keys, the response takes the same time.
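The padding pattern, sketched as a generic wrapper (name is illustrative):

```typescript
// Pad every validation response to at least `minMs`, so response timing
// does not reveal whether the key lookup short-circuited early.
async function withMinDuration<T>(fn: () => Promise<T>, minMs = 100): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    const elapsed = Date.now() - start;
    if (elapsed < minMs) await new Promise((r) => setTimeout(r, minMs - elapsed));
  }
}
```

The `finally` block matters: rejections get padded too, so an exception path can't leak timing either.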
Key hint masking. The UI shows sk-****abcd — enough to identify which key you're looking at, not enough to use it. The full key only exists in the encrypted blob.
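Generating the hint is trivial but worth doing server-side at encryption time, so the plaintext never needs to be touched again just to render settings (function name is illustrative):

```typescript
// Produce a display hint like "sk-****abcd" — prefix and last 4 chars only.
function keyHint(apiKey: string): string {
  const prefix = apiKey.startsWith("sk-") ? "sk-" : "";
  return `${prefix}****${apiKey.slice(-4)}`;
}
```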
What we learned
Start with the abstraction, not the providers. We built the adapter interface before adding the second provider. That early investment paid for itself a hundred times over — adding provider #25 took 30 minutes, not a week.
BYOK is a trust accelerator. Users who were hesitant about our platform became advocates after adding their own keys. They felt in control. They could use their preferred models without asking us. And we didn't have to worry about scaling inference costs for them.
Atomic rate limiting in Postgres is underrated. Everyone reaches for Redis. But if your database can handle the load (ours can — rate limit checks are single-row locks), keeping it in Postgres means one fewer service to operate, one fewer failure point, and transactional consistency with your usage logs.
Never store what you don't need. E2E encryption isn't just a security feature — it's a liability shield. We can honestly tell users: we can't read your keys even if we wanted to. That's a stronger promise than "we promise not to look."
The inference router is the most boring part of RoleCall — and that's exactly how infrastructure should be. Boring, reliable, and invisible to the user. They pick a model, press send, and get a response. Everything else is plumbing.