# Scheduler Configuration Guide

## Overview

For detailed design information about the Scheduler, please refer to
the [Scheduler Architecture Design Document](../design/scheduler/index.md).

---

## Service Discovery

### Configuration

The Scheduler discovers backend inference instances differently depending on the scheduling mode:

- **Lite-mode**: Uses the same Redis/etcd discovery as the Gateway (via `--llm-backend-discovery`).
- **Full-mode**: Uses CMS (Cluster Management Service) exclusively, polling instance metadata and status from a dedicated Redis instance.

#### Lite-mode discovery flags

These flags are shared with the Gateway. See the [Gateway Service Discovery configuration](gateway_conf.md#service-discovery) for the full flag reference (`--llm-backend-discovery`, `--discovery-redis-*`, `--discovery-etcd-*`).

#### Full-mode CMS flags

| Flag                              | Default   | Description                                          |
|-----------------------------------|-----------|------------------------------------------------------|
| `--cms-redis-host`                | `"redis"` | CMS Redis host                                      |
| `--cms-redis-port`                | `"6379"`  | CMS Redis port                                      |
| `--cms-redis-username`            | `""`      | CMS Redis username                                  |
| `--cms-redis-password`            | `""`      | CMS Redis password                                  |
| `--cms-redis-timeout`             | `1.0`     | CMS Redis socket timeout in seconds                 |
| `--cms-redis-retry-times`         | `1`       | CMS Redis retry times on connection failure          |
| `--cms-pull-status-interval-ms`   | `500`     | Polling interval (ms) for instance status from CMS   |
| `--cms-pull-metadata-interval-ms` | `10000`   | Polling interval (ms) for instance metadata from CMS |

---

## Policy Framework

### Usage and Extension Guidelines

- **Choosing a policy and mode**
    - SLO-constrained workloads with profiling data: **SLO + full-mode**.
    - Utilization-focused workloads with cache locality needs: **load-balance + full-mode + cache-aware scheduling**.
    - Simple deployments without CMS: **load-balance + lite-mode**, with `num_requests` or `num_tokens` as the load
      metric.

- **Switching between full-mode and lite-mode**
    - Flag: `--enable-full-mode-scheduling`.
    - **Full-mode (default)**: `--enable-full-mode-scheduling=true`.
    - **Lite-mode**: `--enable-full-mode-scheduling=false`.
    - Advanced scheduling features require full-mode.

- **Extending with new policies**
    - Reuse existing metrics, filters, and selectors; rewire their combinations per inference type.
    - For new signals: implement `instanceSchedulingMetric` in `metrics.go`, register it in `getSchedulingMetric`,
      and configure it in `schedule_policy_registry.go`.
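
The selection guidance above can be condensed into a small decision helper. This is an illustrative sketch only: the `workload` struct and the `recommendFlags` function are hypothetical names that do not exist in the Scheduler code base, and the fallback case is an assumption. Only the flag names and values come from this guide.

```go
package main

import "fmt"

// workload is a hypothetical description of a deployment, used only in this sketch.
type workload struct {
	HasCMS             bool // a CMS Redis is deployed (needed for full-mode)
	HasSLOProfiling    bool // latency SLOs apply and TTFT/TPOT profiling data exists
	NeedsCacheLocality bool // prefix-cache hits matter for this workload
}

// recommendFlags maps the guidance in this section to flag strings.
func recommendFlags(w workload) []string {
	switch {
	case !w.HasCMS:
		// Simple deployments without CMS: load-balance + lite-mode.
		return []string{
			"--enable-full-mode-scheduling=false",
			"--scheduling-policy=load-balance",
		}
	case w.HasSLOProfiling:
		// SLO + full-mode (the profiling data path flags are also required).
		return []string{
			"--enable-full-mode-scheduling=true",
			"--scheduling-policy=slo",
		}
	case w.NeedsCacheLocality:
		// load-balance + full-mode + cache-aware scheduling.
		return []string{
			"--enable-full-mode-scheduling=true",
			"--scheduling-policy=load-balance",
			"--enable-cache-aware-scheduling=true",
		}
	default:
		// Assumed fallback: the documented defaults.
		return []string{
			"--enable-full-mode-scheduling=true",
			"--scheduling-policy=load-balance",
		}
	}
}

func main() {
	fmt.Println(recommendFlags(workload{HasCMS: true, NeedsCacheLocality: true}))
}
```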

---

## Cache-Aware Scheduling

### Prerequisites

Llumnix's cache-aware scheduling requires the inference engine to use a global KV cache store (e.g. Mooncake). Llumnix
queries the global KV cache store's metadata service to obtain prefix cache hit information for each request, so this
feature is only applicable when such a store is deployed.

Cache-aware scheduling is **disabled by default**. It is only supported in **full-mode scheduling**; the scheduler
forcibly disables it when full-mode is not enabled.

### Configuration

Key configuration flags (`cmd/config/config.go`):

| Flag                                     | Default       | Description                                               |
|------------------------------------------|---------------|-----------------------------------------------------------|
| `--enable-cache-aware-scheduling`        | `false`       | Enable or disable cache-aware scheduling                  |
| `--cache-aware-scheduling-min-tokens`    | `1024`        | Minimum prompt length to trigger cache-aware logic        |
| `--kvs-backend`                          | `mooncake`    | KVS metadata service backend                              |
| `--kvs-hash-algo`                        | `sha256_cbor` | Hash algorithm for prompt chunking                        |
| `--kvs-chunk-size`                       | `256`         | Token chunk size                                          |
| `--kvs-enable-save-unfull-chunk`         | `false`       | Whether to hash the last incomplete chunk                 |
| `--kvs-retry-times`                      | `5`           | Retry count for metadata service queries                  |
| `--kvs-retry-interval-ms`                | `100`         | Interval between retries                                  |
| `--kvs-metadata-service-down-duration-s` | `30`          | Duration to treat metadata service as down after failures |
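
The three retry-related flags interact roughly like a retry loop with a cool-down period. The sketch below is a minimal model under stated assumptions, not the Scheduler's implementation: `kvsQuerier`, `errServiceDown`, and `runDemo` are hypothetical names, and it assumes that `--kvs-retry-times` counts retries after the first attempt and that the cool-down starts once all attempts have failed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errServiceDown is a hypothetical sentinel returned while the service is treated as down.
var errServiceDown = errors.New("kvs metadata service treated as down")

// kvsQuerier holds the three documented settings plus injectable clock hooks.
type kvsQuerier struct {
	retryTimes    int           // --kvs-retry-times
	retryInterval time.Duration // --kvs-retry-interval-ms
	downDuration  time.Duration // --kvs-metadata-service-down-duration-s
	downUntil     time.Time
	now           func() time.Time
	sleep         func(time.Duration)
}

// query runs do once plus up to retryTimes retries (assumed semantics).
// After every attempt fails, later calls fail fast until downDuration has elapsed.
func (q *kvsQuerier) query(do func() error) error {
	if q.now().Before(q.downUntil) {
		return errServiceDown
	}
	var err error
	for attempt := 0; attempt <= q.retryTimes; attempt++ {
		if err = do(); err == nil {
			return nil
		}
		if attempt < q.retryTimes {
			q.sleep(q.retryInterval)
		}
	}
	q.downUntil = q.now().Add(q.downDuration)
	return err
}

// runDemo drives the querier with a fake clock and reports what happened.
func runDemo() (attempts int, failedFast bool, recovered bool) {
	clock := time.Unix(0, 0)
	q := &kvsQuerier{
		retryTimes:    5,
		retryInterval: 100 * time.Millisecond,
		downDuration:  30 * time.Second,
		now:           func() time.Time { return clock },
		sleep:         func(d time.Duration) { clock = clock.Add(d) },
	}
	failing := func() error { attempts++; return errors.New("timeout") }

	_ = q.query(failing)                                     // first attempt plus five retries
	failedFast = errors.Is(q.query(failing), errServiceDown) // no new attempt during cool-down
	clock = clock.Add(31 * time.Second)                      // cool-down has elapsed
	recovered = q.query(func() error { return nil }) == nil
	return attempts, failedFast, recovered
}

func main() {
	fmt.Println(runDemo())
}
```

Injecting `now` and `sleep` keeps the sketch deterministic; a real client would use the wall clock.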

Environment variables:

| Variable       | Default | Description                                                                                         |
|----------------|---------|-----------------------------------------------------------------------------------------------------|
| `GO_HASH_SEED` | `"0"`   | Hash seed for algorithms that require one (e.g. `sha256_cbor`). Must match the seed used by the KVS |

### Deployment Example

For a complete Kubernetes deployment example with PD disaggregation, Mooncake as KVS, and cache-aware scheduling
enabled, see `deploy/pd-kvs/full-mode-scheduling/load-balance/`.

---

## Predictor-Enhanced Scheduling

### Configuration

#### Scheduler Flags

| Flag                                     | Default | Description                                                                                  |
|------------------------------------------|---------|----------------------------------------------------------------------------------------------|
| `--enable-predictor-enhanced-scheduling` | `false` | Enable predictor-enhanced scheduling                                                         |
| `--max-num-batched-tokens`               | `65536` | Maximum tokens per prefill batch; must match the inference engine's `max_num_batched_tokens` |
| `--num-predictor-warmup-samples`         | `20`    | Minimum profiling samples before fitting the predictor model                                 |

#### Engine Environment Variables

| Variable                   | Default | Description                                                       |
|----------------------------|---------|-------------------------------------------------------------------|
| `LLUMNIX_ENABLE_PROFILING` | `0`     | Set to `1` to enable profiling data collection on the engine side |
| `LLUMNIX_PROFILING_STEPS`  | `50`    | Number of profiling samples to collect per instance               |

### Constraints

- **Full-mode only**: Requires `--enable-full-mode-scheduling=true` and CMS. The scheduler forcibly disables this
  feature when full-mode is not enabled.
- **`max-num-batched-tokens` alignment**: The scheduler-side `--max-num-batched-tokens` must match the inference
  engine's `max_num_batched_tokens` configuration. A mismatch causes inaccurate step simulation and degrades prediction
  quality.
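
Both constraints can be verified before rollout with a small pre-flight check. The following is a sketch under assumptions: `predictorConfig` and `checkPredictorConfig` are hypothetical names, and the engine-side value is assumed to be supplied by the operator (the guide does not say the scheduler reads it from the engine).

```go
package main

import "fmt"

// predictorConfig mirrors the scheduler flags relevant to the constraints above.
type predictorConfig struct {
	EnableFullMode      bool // --enable-full-mode-scheduling
	EnablePredictor     bool // --enable-predictor-enhanced-scheduling
	MaxNumBatchedTokens int  // --max-num-batched-tokens
}

// checkPredictorConfig applies the full-mode constraint and reports a mismatch
// with the engine's max_num_batched_tokens as a warning.
func checkPredictorConfig(cfg predictorConfig, engineMaxNumBatchedTokens int) (predictorConfig, []string) {
	var warnings []string
	if cfg.EnablePredictor && !cfg.EnableFullMode {
		// Documented behaviour: the feature is forcibly disabled outside full-mode.
		cfg.EnablePredictor = false
		warnings = append(warnings, "predictor-enhanced scheduling disabled: full-mode scheduling is off")
	}
	if cfg.EnablePredictor && cfg.MaxNumBatchedTokens != engineMaxNumBatchedTokens {
		warnings = append(warnings, fmt.Sprintf(
			"--max-num-batched-tokens=%d does not match engine max_num_batched_tokens=%d",
			cfg.MaxNumBatchedTokens, engineMaxNumBatchedTokens))
	}
	return cfg, warnings
}

func main() {
	cfg := predictorConfig{EnableFullMode: true, EnablePredictor: true, MaxNumBatchedTokens: 65536}
	checked, warnings := checkPredictorConfig(cfg, 8192)
	fmt.Println(checked.EnablePredictor, warnings)
}
```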

---

## SLO-Aware Scheduling

### Configuration

| Flag                            | Setting    | Description                            |
|---------------------------------|------------|----------------------------------------|
| `--enable-full-mode-scheduling` | `true`     | Must be `true` for SLO policy          |
| `--scheduling-policy`           | `slo`      | Set to `slo` to enable SLO policy      |
| `--ttft-profiling-data-path`    | (required) | Path to TTFT profiling JSON file       |
| `--tpot-profiling-data-path`    | (required) | Path to TPOT profiling JSON file       |
| `--ttft-slo`                    | `6000.0`   | Target TTFT SLO in milliseconds        |
| `--tpot-slo`                    | `50.0`     | Target TPOT SLO in milliseconds        |
| `--ttft-slo-dispatch-threshold` | `1.0`      | Multiplier for TTFT dispatch threshold |
| `--tpot-slo-dispatch-threshold` | `1.0`      | Multiplier for TPOT dispatch threshold |

The effective dispatch threshold is computed as `SLO * DispatchThreshold`. Instances whose predicted latency exceeds
this threshold are filtered out. If no instance meets the threshold, the scheduling error `ErrorNoAvailableEndpoint`
is returned.
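
The dispatch filter described above can be sketched as follows. This is a simplified model, not the Scheduler's code: the `instance` type and `filterBySLO` function are hypothetical, and the `ErrorNoAvailableEndpoint` variable here is a local stand-in for the real error of the same name.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorNoAvailableEndpoint is a local stand-in for the scheduler error named above.
var ErrorNoAvailableEndpoint = errors.New("no available endpoint")

// instance is a hypothetical view of a backend with its predicted latency.
type instance struct {
	ID                 string
	PredictedLatencyMs float64
}

// filterBySLO keeps instances whose predicted latency does not exceed
// SLO * DispatchThreshold and fails when none qualify.
func filterBySLO(instances []instance, sloMs, dispatchThreshold float64) ([]instance, error) {
	limit := sloMs * dispatchThreshold
	var kept []instance
	for _, in := range instances {
		if in.PredictedLatencyMs <= limit {
			kept = append(kept, in)
		}
	}
	if len(kept) == 0 {
		return nil, ErrorNoAvailableEndpoint
	}
	return kept, nil
}

func main() {
	candidates := []instance{{"prefill-0", 4500}, {"prefill-1", 7200}}

	// TTFT SLO 6000 ms with threshold 1.0: only prefill-0 stays under the limit.
	kept, err := filterBySLO(candidates, 6000, 1.0)
	fmt.Println(kept, err)

	// Threshold 0.5 lowers the limit to 3000 ms, so every instance is filtered out.
	kept, err = filterBySLO(candidates, 6000, 0.5)
	fmt.Println(kept, err)
}
```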

### Best Practices

1. **Accurate profiling data**: Collect profiling data on the same hardware configuration and engine launch parameters
   as production. Inaccurate profiling data leads to poor latency predictions.

2. **Conservative thresholds**: Start with `--ttft-slo-dispatch-threshold` and `--tpot-slo-dispatch-threshold` values
   slightly below 1.0 to allow some margin for prediction errors.

3. **Monitor actual latencies**: Compare predicted latencies with actual observed latencies and adjust profiling data if
   systematic bias is detected.

### Deployment Example

For a complete Kubernetes deployment example with SLO-aware scheduling, see `deploy/slo-aware/base/`.

---

## Adaptive PD Scheduling

### Configuration

| Flag                                                | Setting                                          | Description                                                                    |
|-----------------------------------------------------|--------------------------------------------------|--------------------------------------------------------------------------------|
| `--enable-full-mode-scheduling`                     | `true`                                           | Requires full-mode scheduling.                                                 |
| `--scheduling-policy`                               | `slo`                                            | Must be set to `slo` for adaptive PD.                                          |
| `--enable-adaptive-pd`                              | `true`                                           | Enable adaptive PD scheduling.                                                 |
| `--tpot-slo`                                        | `50`                                             | TPOT SLO target (ms).                                                          |
| `--tpot-slo-dispatch-threshold`                     | `0.85`                                           | Fraction of TPOT SLO used as the dispatch filter threshold.                    |
| `--colocated-rescheduling-mode`                     | `true`                                           | Enable colocated rescheduling mode (standalone mode is also supported).        |
| `--rescheduling-interval-ms`                        | `100`                                            | Rescheduling interval (ms).                                                    |
| `--rescheduling-policies`                           | `binpacking_mitigation,binpacking_consolidation` | Rescheduling policies.                                                         |
| `--tpot-migrate-out-ceil-threshold`                 | `0.95`                                           | Fraction of TPOT SLO above which overload rescheduling triggers.               |
| `--tpot-migrate-out-floor-threshold`                | `0.60`                                           | Fraction of TPOT SLO below which underload rescheduling triggers.              |
| `--enable-instance-status-local-account` (Optional) | `true`                                           | Enable instance status local account.                                          |

### Deployment Example

For a complete Kubernetes deployment example with adaptive PD scheduling, see `deploy/slo-aware/adaptive-pd/`.

---

## Rescheduler

### Configuration

#### Core Rescheduling Flags

| Flag                             | Default                                                           | Description                                   |
|----------------------------------|-------------------------------------------------------------------|-----------------------------------------------|
| `--enable-rescheduling`          | `false`                                                           | Enable rescheduling                           |
| `--rescheduling-policies`        | `"decode_load,prefill_failover,decode_failover,neutral_failover"` | Comma-separated list of rescheduling policies |
| `--rescheduling-interval-ms`     | `500`                                                             | Interval between rescheduling iterations      |
| `--colocated-rescheduling-mode`  | `false`                                                           | Run rescheduler inside scheduler process      |
| `--standalone-rescheduling-mode` | `false`                                                           | Run rescheduler as separate process           |

#### Load Balance Configuration

| Flag                                    | Default                            | Description                                                                                                                                  |
|-----------------------------------------|------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| `--rescheduling-decode-load-metric`     | `"kv_cache_usage_ratio_projected"` | Load metric for decode instances                                                                                                             |
| `--rescheduling-decode-load-threshold`  | `1.0`                              | Threshold for source/destination filtering: instances >= this value are migration sources, instances < this value are migration destinations |
| `--rescheduling-neutral-load-metric`    | `"kv_cache_usage_ratio_projected"` | Load metric for neutral instances                                                                                                            |
| `--rescheduling-neutral-load-threshold` | `1.0`                              | Threshold for source/destination filtering: instances >= this value are migration sources, instances < this value are migration destinations |
| `--rescheduling-load-balance-threshold` | `0.0`                              | Minimum load difference required to trigger migration                                                                                        |
| `--rescheduling-load-balance-scope`     | `"cluster"`                        | Balancing scope: `cluster` or `unit`                                                                                                         |
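
The load threshold and the load-balance threshold play different roles: the former splits instances into migration sources and destinations, the latter decides whether a given pair differs enough to justify a migration. A minimal sketch under assumptions follows; `loadedInstance`, `splitByThreshold`, and `worthMigrating` are hypothetical names, and treating the load-balance threshold as an inclusive minimum is an assumption.

```go
package main

import "fmt"

// loadedInstance is a hypothetical view of an instance and its load metric value,
// for example kv_cache_usage_ratio_projected.
type loadedInstance struct {
	ID   string
	Load float64
}

// splitByThreshold applies the documented rule: load >= threshold marks a
// migration source, load < threshold marks a migration destination.
func splitByThreshold(instances []loadedInstance, threshold float64) (sources, destinations []loadedInstance) {
	for _, in := range instances {
		if in.Load >= threshold {
			sources = append(sources, in)
		} else {
			destinations = append(destinations, in)
		}
	}
	return sources, destinations
}

// worthMigrating checks the minimum load difference between a source and a destination.
func worthMigrating(src, dst loadedInstance, balanceThreshold float64) bool {
	return src.Load-dst.Load >= balanceThreshold
}

func main() {
	instances := []loadedInstance{{"decode-0", 1.2}, {"decode-1", 0.4}, {"decode-2", 1.0}}
	sources, destinations := splitByThreshold(instances, 1.0)
	fmt.Println(sources, destinations)
	fmt.Println(worthMigrating(sources[0], destinations[0], 0.5))
}
```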

#### Adaptive PD Configuration

| Flag                                 | Default                                                           | Description                                                                                  |
|--------------------------------------|-------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| `--enable-adaptive-pd`               | `false`                                                           | Enable adaptive PD scheduling                                                                |
| `--scheduling-policy`                | `"load-balance"`                                                  | Must be set to `slo` for adaptive PD                                                         |
| `--tpot-slo`                         | `50`                                                              | TPOT SLO target (ms)                                                                         |
| `--tpot-slo-dispatch-threshold`      | `0.85`                                                            | Fraction of TPOT SLO used as dispatch/destination filter threshold                           |
| `--tpot-migrate-out-ceil-threshold`  | `0.95`                                                            | Fraction of TPOT SLO above which mitigating rescheduling triggers (source filter)            |
| `--tpot-migrate-out-floor-threshold` | `0.60`                                                            | Fraction of TPOT SLO below which consolidating rescheduling triggers (source filter)         |
| `--rescheduling-policies`            | `"decode_load,prefill_failover,decode_failover,neutral_failover"` | Rescheduling policies (use `binpacking_mitigation,binpacking_consolidation` for adaptive PD) |
| `--rescheduling-interval-ms`         | `500`                                                             | Interval between rescheduling iterations (use `100` for adaptive PD)                         |

> **Note**: For details, see [Adaptive PD Scheduling](../design/scheduler/adaptive_pd_scheduling.md).

#### Failover Configuration

| Flag                           | Default      | Description                                                      |
|--------------------------------|--------------|------------------------------------------------------------------|
| `--failover-domain`            | `"instance"` | Failure domain: `instance`, `node`, `instance-unit`, `node-unit` |
| `--instance-staleness-seconds` | `60`         | Time after which an instance is considered stale                 |

#### Migration Request Configuration

| Flag                              | Default   | Description                                                                 |
|-----------------------------------|-----------|-----------------------------------------------------------------------------|
| `--rescheduling-req-select-rule`  | `"TOKEN"` | Migration request selection rule: `NUM_REQ`, `TOKEN`, `RATIO`               |
| `--rescheduling-req-select-order` | `"SR"`    | Migration request selection order: `LCR`, `FCR`, `LR`, `SR`, `FCW`, `FCWSR` |
| `--rescheduling-req-select-value` | `1024`    | Number of requests/tokens or KV cache ratio to migrate                      |

#### gRPC Configuration

| Flag                                  | Default | Description                               |
|---------------------------------------|---------|-------------------------------------------|
| `--llumlet-grpc-connection-pool-size` | `10`    | Size of gRPC connection pool per instance |
| `--llumlet-grpc-timeout-seconds`      | `5`     | Timeout for gRPC migration calls          |

#### Deployment Modes

| Mode       | Flag                                  | Process                  |
|------------|---------------------------------------|--------------------------|
| Colocated  | `--colocated-rescheduling-mode=true`  | Inside scheduler process |
| Standalone | `--standalone-rescheduling-mode=true` | Separate process         |