Caching
The opt-in response cache and how to control it.
Unified Gateway can cache full responses in Redis and serve identical repeat requests without hitting a provider. It is opt-in per request and off by default — nothing is cached unless the client asks for it.
Enable it per request
Send the cache headers on the inference request:
| Header | Values | Effect |
|---|---|---|
| x-unified-cache | true or 1 | Enables caching for this request. Any other value, or an absent header, means no caching. |
| x-unified-cache-ttl | seconds | Time to live of the stored entry. Default 300 (5 min), max 86400 (24 h). |
curl -X POST $BASE/v1/chat/completions \
  -H "Authorization: Bearer unified-..." \
  -H "x-unified-cache: true" -H "x-unified-cache-ttl: 600" \
  -H "content-type: application/json" -d '{
    "model": "general",
    "messages": [{ "role": "user", "content": "Capital of France?" }]
  }'

The first call is a miss (served from the provider and stored); identical calls within the TTL are hits served from Redis. A cache hit is recorded in request_logs with cache_hit = true and bills as zero cost.
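The same request can be assembled from client code. The following is a minimal Python sketch, not an official client: `cache_headers` and `build_chat_request` are hypothetical helper names, and rejecting out-of-range TTLs on the client side is an assumption, since this page does not say how the gateway treats them. The request is only built, not sent.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Optional

DEFAULT_TTL_SECONDS = 300   # documented default (5 min), applied by the gateway
MAX_TTL_SECONDS = 86400     # documented maximum (24 h)


def cache_headers(enabled: bool = True, ttl_seconds: Optional[int] = None) -> dict:
    """Return the opt-in cache headers for one request (hypothetical helper)."""
    if not enabled:
        return {}  # an absent header means the request is not cached
    headers = {"x-unified-cache": "true"}
    if ttl_seconds is not None:
        # Assumption: validate on the client instead of relying on
        # gateway behavior that this page does not describe.
        if not 1 <= ttl_seconds <= MAX_TTL_SECONDS:
            raise ValueError(f"ttl_seconds must be in 1..{MAX_TTL_SECONDS}")
        headers["x-unified-cache-ttl"] = str(ttl_seconds)
    return headers


def build_chat_request(base_url: str, api_key: str, body: dict,
                       **cache_opts) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with cache headers."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "content-type": "application/json",
        **cache_headers(**cache_opts),
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Omitting `ttl_seconds` leaves the TTL header off, so the documented default of 300 seconds applies.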
How entries are scoped
- Per virtual key. The cache namespace is the virtual key's id, so one key never sees another key's cached responses.
- The master key never caches. Operator traffic always goes to the provider, even with the header set — this keeps administrative and test calls from polluting client caches.
- By exact request. The cache key is derived from the call type, namespace, and a hash of the canonical request, so any change to the prompt, model, or parameters produces a different entry.
- Size cap. Responses larger than 512 KB (typically big embedding batches) are served normally but not stored — the per-hit saving does not justify the Redis memory, and they rarely repeat identically.
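The scoping rules above can be illustrated with a short Python sketch. It is not the gateway's actual implementation: the key layout, the use of sorted-key JSON as the canonical form, SHA-256 as the hash, and 1 KB = 1024 bytes are all assumptions made for illustration, and `cache_key` and `is_storable` are hypothetical names.

```python
from __future__ import annotations

import hashlib
import json

# Documented cap is 512 KB; treating 1 KB as 1024 bytes is an assumption.
MAX_CACHEABLE_BYTES = 512 * 1024


def cache_key(call_type: str, namespace: str, request_body: dict) -> str:
    """Derive a cache key from call type, namespace, and a request hash."""
    # Assumption: "canonical" means JSON with sorted keys and no extra
    # whitespace, so key order in the client payload does not matter.
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    # Assumption: SHA-256; the real hash function is not documented here.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"cache:{call_type}:{namespace}:{digest}"


def is_storable(response_bytes: bytes, is_master_key: bool) -> bool:
    """Apply the two documented exclusions before storing a response."""
    if is_master_key:
        return False  # master-key traffic is never cached
    # Responses larger than the cap are served but not stored.
    return len(response_bytes) <= MAX_CACHEABLE_BYTES
```

Under these assumptions, two requests that differ only in JSON key order map to the same entry, while a change to any parameter (for example `temperature`) or to the virtual key id yields a different key.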
Invalidate
Operators can clear cached entries with the master key:
# Everything
DELETE /admin/cache
# Narrow by call type and/or namespace (a virtual key id)
DELETE /admin/cache?callType=chat&namespace=<virtual-key-id>

Both filters are optional; omitting them widens the match. The response reports how many entries were removed. Entries also expire on their own once the TTL elapses.
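For scripted invalidation, the DELETE call can be built as in the Python sketch below. `invalidation_request` is a hypothetical helper, and sending the master key as a Bearer token is an assumption carried over from the inference example; this page does not show the admin authentication scheme or the shape of the response body. The request is built but not sent.

```python
from __future__ import annotations

import urllib.request
from typing import Optional
from urllib.parse import urlencode


def invalidation_request(base_url: str, master_key: str,
                         call_type: Optional[str] = None,
                         namespace: Optional[str] = None) -> urllib.request.Request:
    """Build a DELETE /admin/cache request with the optional filters."""
    params = {}
    if call_type is not None:
        params["callType"] = call_type
    if namespace is not None:
        params["namespace"] = namespace  # a virtual key id

    url = f"{base_url}/admin/cache"
    if params:
        # With no filters the bare URL clears everything.
        url += "?" + urlencode(params)

    return urllib.request.Request(
        url,
        # Assumption: the master key uses the same Bearer scheme.
        headers={"Authorization": f"Bearer {master_key}"},
        method="DELETE",
    )
```

Calling it with only `namespace` clears every call type for one virtual key; calling it with only `call_type` clears that call type across all keys.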
When to use it
Good fits: deterministic or low-temperature prompts, repeated embedding lookups, idempotent retries, and read-heavy endpoints behind a cache-friendly client. Avoid it for high-temperature or personalized generations where every call should be fresh — leave the header off and they never touch the cache.