Gateway Configuration Guide
Overview
For detailed design information about the Gateway, please refer to the Gateway Architecture Design Document.
Service Discovery
For detailed design information about service discovery, please refer to the Service Discovery Design Document.
Configuration
The Gateway discovers backend inference instances through Redis or etcd. The discovery backend is selected via --llm-backend-discovery and configured with backend-specific flags.
General flags
| Flag | Default | Description |
|---|---|---|
|  |  | Discovery backend for LLM instances: |
Redis discovery flags
| Flag | Default | Description |
|---|---|---|
|  |  | Redis host |
|  |  | Redis port |
|  |  | Redis username |
|  |  | Redis password |
|  |  | Redis socket timeout in seconds |
|  |  | Redis retry times on connection failure |
|  |  | TTL in milliseconds for discovery entries; entries older than this are considered expired |
|  |  | Polling interval in milliseconds for refreshing the instance list from Redis |
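The TTL rule in the table above can be illustrated with a minimal sketch. It assumes each discovery entry carries a last-heartbeat timestamp in milliseconds; the entry layout and the names `live_instances` and `last_seen_ms` are hypothetical and not taken from the Gateway source.

```python
from typing import Dict, List


def live_instances(entries: Dict[str, int], now_ms: int, ttl_ms: int) -> List[str]:
    """Return instance IDs whose last heartbeat is no older than ttl_ms.

    entries maps a (hypothetical) instance ID to its last-seen timestamp in ms.
    """
    return sorted(
        instance_id
        for instance_id, last_seen_ms in entries.items()
        if now_ms - last_seen_ms <= ttl_ms
    )


if __name__ == "__main__":
    entries = {"inst-a": 9_500, "inst-b": 4_000}
    # inst-b was last seen 6_000 ms ago, which exceeds the 3_000 ms TTL.
    print(live_instances(entries, now_ms=10_000, ttl_ms=3_000))
```

A shorter TTL removes dead instances faster but tolerates fewer missed heartbeats, so in practice it would typically be set to several heartbeat periods.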
etcd discovery flags
| Flag | Default | Description |
|---|---|---|
|  |  | etcd endpoints, comma-separated (e.g. |
|  |  | etcd username |
|  |  | etcd password |
|  |  | etcd dial timeout in seconds |
|  |  | etcd lease TTL in seconds; instances whose lease expires are automatically removed |
|  |  | Periodic full-refresh interval in seconds (safety net for missed Watch events) |
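To show why a periodic full refresh works as a safety net, the sketch below compares a watch-maintained view with a full listing and reports the drift. The function name `reconcile` and the use of plain string sets are assumptions for illustration; the Gateway's actual reconciliation logic is not described in this guide.

```python
from typing import Set, Tuple


def reconcile(known: Set[str], snapshot: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Compare the watch-maintained view with a full listing.

    Returns (missing, stale): instances present in the snapshot but absent
    locally, and instances held locally that no longer exist in the snapshot.
    """
    missing = snapshot - known
    stale = known - snapshot
    return missing, stale


if __name__ == "__main__":
    # Suppose a Watch event for "c" joining and "a" leaving was missed.
    missing, stale = reconcile({"a", "b"}, {"b", "c"})
    print(sorted(missing), sorted(stale))
```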
Static endpoint discovery flags
For development or testing, instances can be specified as a static comma-separated list instead of using Redis or etcd:
| Flag | Default | Description |
|---|---|---|
|  |  | Static backend endpoints (e.g. |
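As a rough illustration of handling a comma-separated endpoint list, the sketch below splits the value and drops blanks. The helper name `parse_endpoints` and the sample addresses are placeholders chosen for this example; they are not the Gateway's parser or its documented example values.

```python
from typing import List


def parse_endpoints(raw: str) -> List[str]:
    """Split a comma-separated endpoint list, trimming whitespace and blanks."""
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    # Placeholder addresses; a trailing comma or extra spaces are tolerated.
    print(parse_endpoints("10.0.0.1:8000, 10.0.0.2:8000,"))
```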
Deployment Example
Redis-based discovery (default): See deploy/base/redis.yaml for the Redis deployment and any standard deployment example (e.g. deploy/neutral/lite-mode-scheduling/load-balance/).
etcd-based discovery: See deploy/etcd-discovery/ for a 3-node etcd StatefulSet deployment with an integration test (test_etcd_failover.sh) that validates Gateway discovery recovery across single-node failure, quorum loss, full cluster restart with data loss, and instance pod restart.
PDD Forwarding Protocol
Configuration
Key configuration flags (cmd/config/config.go):
| Flag | Default | Description |
|---|---|---|
|  |  | PDD protocol type: vllm-kvt or vllm-mooncake |
|  |  | Enable staged scheduling mode; when false, batched scheduling mode is used |
Deployment Example
For a complete Kubernetes deployment example with the vllm-mooncake PDD protocol, see deploy/pd/full-mode-scheduling/load-balance. For the vllm-kvt PDD protocol, see deploy/pd-kvs/full-mode-scheduling/load-balance.
Traffic Splitting
Configuration
| Flag | Default | Description |
|---|---|---|
|  |  | Routing policy: |
|  |  | JSON array of route endpoint configurations |
|  |  | Max retries for internal routing on retryable errors before triggering fallback |
|  |  | Enable retry queue for 429 responses from fallback endpoints |
|  |  | Max queued 429-retry tasks |
|  |  | Concurrent goroutines processing 429-retry tasks |
|  |  | Max 429 retries per request |
|  |  | Initial backoff delay (ms) for 429 retries |
|  |  | Max backoff delay (ms) for 429 retries |
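The interaction between the initial delay, the maximum delay, and the per-request retry limit can be sketched as follows. This assumes a doubling (exponential) backoff, which this guide does not confirm; the function name and the sample values are illustrative only and are not the Gateway's defaults.

```python
from typing import List


def backoff_schedule(initial_ms: int, max_ms: int, max_retries: int) -> List[int]:
    """Delays (ms) before each retry: doubling from initial_ms, capped at max_ms."""
    return [min(initial_ms * (2 ** attempt), max_ms) for attempt in range(max_retries)]


if __name__ == "__main__":
    # Illustrative values: start at 100 ms, cap at 1000 ms, allow 5 retries.
    print(backoff_schedule(initial_ms=100, max_ms=1000, max_retries=5))
```

Under these assumptions, the cap bounds the worst-case wait per retry, while the retry limit bounds the total time a single request can spend in the retry queue.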
Route Config JSON Format
The --route-config flag accepts a JSON array. Each element describes one endpoint:
| Field | Type | Description |
|---|---|---|
|  | string | Endpoint URL. Set to |
|  | string | API key for authentication. The gateway sets the |
|  | string | Reserved. Carried in the route config but not used to modify the proxied request; the original request model is forwarded as-is |
|  | bool | Whether this endpoint participates in the fallback chain |
|  | int | Weight for weight-based routing |
|  | string | Model name prefix pattern for prefix-based routing (e.g. |
Weight-based example:
[
  {
    "base_url": "local",
    "weight": 50
  },
  {
    "base_url": "http://vllm-external:8000",
    "weight": 50,
    "fallback": true
  }
]
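To make the effect of the `weight` field concrete, the sketch below picks a route with probability proportional to its weight. It is an assumption about how weight-based routing could behave, not the Gateway's implementation; `pick_by_weight` is a hypothetical helper.

```python
import random
from typing import Dict, List, Optional


def pick_by_weight(routes: List[Dict], rng: Optional[random.Random] = None) -> Dict:
    """Pick one route with probability proportional to its "weight" field."""
    if rng is None:
        rng = random.Random()
    weights = [route.get("weight", 0) for route in routes]
    if sum(weights) <= 0:
        raise ValueError("at least one route needs a positive weight")
    return rng.choices(routes, weights=weights, k=1)[0]


if __name__ == "__main__":
    routes = [
        {"base_url": "local", "weight": 50},
        {"base_url": "http://vllm-external:8000", "weight": 50, "fallback": True},
    ]
    # With equal weights, each endpoint should receive roughly half the picks.
    print(pick_by_weight(routes)["base_url"])
```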
Prefix-based example:
[
  {
    "prefix": "Qwen/Qwen3-*",
    "base_url": "local"
  },
  {
    "prefix": "Qwen/Qwen2.5-*",
    "base_url": "http://vllm-external:8000",
    "fallback": true
  }
]
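The prefix patterns above can be read as globs over the requested model name. The sketch below assumes shell-style glob matching and first-match-wins ordering; both are assumptions, and `match_route` is a hypothetical helper rather than the Gateway's matcher. The model names in the demo are sample inputs.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional


def match_route(model: str, routes: List[Dict]) -> Optional[Dict]:
    """Return the first route whose "prefix" glob matches the model name."""
    for route in routes:
        pattern = route.get("prefix")
        # fnmatchcase is case-sensitive and lets "*" match any characters.
        if pattern and fnmatchcase(model, pattern):
            return route
    return None


if __name__ == "__main__":
    routes = [
        {"prefix": "Qwen/Qwen3-*", "base_url": "local"},
        {"prefix": "Qwen/Qwen2.5-*", "base_url": "http://vllm-external:8000", "fallback": True},
    ]
    for model in ("Qwen/Qwen3-8B", "Qwen/Qwen2.5-7B", "other/model"):
        route = match_route(model, routes)
        print(model, "->", route["base_url"] if route else None)
```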
Deployment Example
For complete Kubernetes deployment examples with service routing, see:
Prefix-based routing: deploy/traffic-splitting/prefix/
Weight-based routing: deploy/traffic-splitting/weight/
Each example includes an integration test that verifies routing and fallback behavior by sending requests through the gateway and checking that traffic is correctly distributed and fallback occurs on failure.
Traffic Mirror
Configuration
Traffic mirroring asynchronously copies a configurable percentage of requests to a secondary target without affecting client responses. Configure mirroring via a JSON file mounted at /mnt/mirror.json inside the gateway pod.
Mirror configuration fields
| Field | Type | Description |
|---|---|---|
|  | bool | Master switch for mirroring |
|  | string | Base URL of the mirror target |
|  | float64 | Percentage of requests to mirror (0-100) |
|  | float64 | Mirror request timeout in ms (0 = use request context) |
|  | string | Override Authorization header for mirror requests |
|  | bool | Enable mirror-related logging in gateway |
Example configuration
{
  "Enable": true,
  "Target": "http://mirror-target:8000",
  "Ratio": 10,
  "Timeout": 5000,
  "Authorization": "",
  "EnableLog": true
}
With this configuration, approximately 10% of requests are mirrored to the target endpoint.
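One way to picture how a percentage translates into a per-request decision is uniform random sampling, sketched below. This is an assumption about the sampling strategy, not a description of the Gateway's code; `should_mirror` is a hypothetical helper that takes the `Enable` and `Ratio` values from the example.

```python
import random
from typing import Optional


def should_mirror(enable: bool, ratio: float, rng: Optional[random.Random] = None) -> bool:
    """Decide whether to mirror one request, given Enable and Ratio (0-100)."""
    if not enable or ratio <= 0:
        return False
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform in [0.0, 1.0), so this holds for about ratio% of calls.
    return rng.random() * 100 < ratio


if __name__ == "__main__":
    rng = random.Random(1)
    mirrored = sum(should_mirror(True, 10, rng) for _ in range(20_000))
    print(f"mirrored {mirrored} of 20000 requests")
```

Because each request is sampled independently under this assumption, the observed share only approaches the configured ratio over many requests.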
Hot-reload
The gateway checks /mnt/mirror.json every 10 seconds and applies changes without a restart. There are two ways to update the file:
Via ConfigMap (standard Kubernetes update, may take up to 60s):
kubectl edit configmap mirror-config -n <namespace>
Direct pod write (immediate effect):
kubectl exec -it deployment/gateway -n <namespace> -c gateway -- \
sh -c 'echo '\''{"Enable":false,"Target":"http://mirror-target:8000","Ratio":0}'\'' > /mnt/mirror.json'
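The polling behavior described above can be approximated with a modification-time check, sketched below. Whether the Gateway compares modification times, file contents, or something else is not stated in this guide, so the mechanism and the helper name `reload_if_changed` are assumptions; the demo writes to a temporary file rather than /mnt/mirror.json.

```python
import json
import os
import tempfile
from typing import Optional, Tuple


def reload_if_changed(path: str, last_mtime: float) -> Optional[Tuple[dict, float]]:
    """Return (config, mtime) if the file changed since last_mtime, else None."""
    mtime = os.stat(path).st_mtime
    if mtime == last_mtime:
        return None
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle), mtime


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "mirror.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"Enable": False, "Target": "http://mirror-target:8000", "Ratio": 0}, handle)
        first = reload_if_changed(path, last_mtime=0.0)
        print(first[0] if first else "unchanged")
        # A second check with the recorded mtime reports no change.
        print(reload_if_changed(path, last_mtime=first[1]))
```

A real poller would call such a check in a loop with a 10-second sleep and keep the previous configuration if the new file fails to parse.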
Deployment example
For a complete Kubernetes deployment example with traffic mirroring, see deploy/traffic-mirror/. This example includes an integration test (test_traffic_mirror.sh) that verifies:
Requests are mirrored when mirroring is enabled
Hot-reload disables mirroring without restarting the gateway
Hot-reload re-enables mirroring and traffic resumes