# Adaptive Speculative Decoding

Adaptive speculative decoding lets SGLang adjust `speculative_num_steps` and `speculative_num_draft_tokens` at runtime instead of keeping a single fixed value for the whole server lifetime.
It is designed for workloads whose accept length changes over time, where one static step count is rarely optimal.

## Current support

- Only `--speculative-algorithm EAGLE`
- Only `--speculative-eagle-topk 1`
- If either condition is not met, SGLang falls back to static speculative settings

## Why adaptive steps help

`speculative_num_steps` controls how many draft-model autoregressive steps run in each speculative round. In practice, the best value depends on the current workload.

- If `num_steps` is too small, the draft model could have produced more accepted tokens, but the round stops too early.
- If `num_steps` is too large, the draft model produces many candidate tokens that the target model rejects, so extra draft work is wasted.

Real traffic often moves between high-acceptance and low-acceptance phases, so one fixed step count is usually a compromise. Adaptive mode tries to follow the workload instead of hard-coding a single global `num_steps`.

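The trade-off can be made concrete with a toy model. The sketch below assumes each drafted token is accepted independently with probability `p`, that a round stops at the first rejection, and that each step drafts one token. The model and both function names are illustrative assumptions, not how SGLang accounts for acceptance.

```python
def expected_accepted(p: float, num_steps: int) -> float:
    """Expected accepted draft tokens in one round of a single-chain draft.

    Toy assumption: token i counts only if tokens 1..i were all accepted,
    each independently with probability p.
    """
    return sum(p ** i for i in range(1, num_steps + 1))


def expected_wasted(p: float, num_steps: int) -> float:
    """Drafted tokens the target model is expected to reject or never reach."""
    return num_steps - expected_accepted(p, num_steps)


if __name__ == "__main__":
    # Compare a high-acceptance and a low-acceptance phase across the tiers.
    for p in (0.9, 0.3):
        for steps in (1, 3, 7):
            print(p, steps,
                  round(expected_accepted(p, steps), 2),
                  round(expected_wasted(p, steps), 2))
```

Under this model, seven steps yield about 4.7 accepted tokens at `p = 0.9`, but only about 0.43 at `p = 0.3`, where more than six of the seven draft steps are wasted.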
## Design overview

The adaptive mechanism has three pieces:

- `AdaptiveSpeculativeParams`: the EMA-based policy
- `SpecRuntimeState`: the per-tier runtime state bundle
- `AdaptiveController`: the coordinator that chooses a tier and activates the matching runtime state

At startup, SGLang pre-builds one runtime state per candidate tier. By default, the candidate tiers are `candidate_steps = [1, 3, 7]`.

```text
┌──────────────────────────────────────────────────────────┐
│ SpecRuntimeState                                         │
│                                                          │
│  speculative_num_steps / speculative_num_draft_tokens    │
│                                                          │
│  ┌────────────────┐ ┌────────────────┐ ┌──────────────┐  │
│  │ Draft stage    │ │ Verify stage   │ │ Extend stage │  │
│  │                │ │                │ │              │  │
│  │ attn_backend   │ │ attn_backend   │ │ attn_backend │  │
│  │ cuda_graph     │ │ cuda_graph     │ │ cuda_graph   │  │
│  └────────────────┘ └────────────────┘ └──────────────┘  │
└──────────────────────────────────────────────────────────┘
```

This matters because `CudaGraphRunner` is shape-dependent. Each candidate tier owns its own graph and backend state, so runtime switching is a reference swap, not an online graph recapture.

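The "pre-build once, swap a reference later" idea can be sketched as follows. `TierState` and `TierPool` are hypothetical stand-ins for `SpecRuntimeState` and the controller's pool, and the `num_steps + 1` draft-token count is an assumption taken from the 3-step / 4-token pairing in the usage example; none of this is SGLang's actual class layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierState:
    """Hypothetical stand-in for SpecRuntimeState."""
    num_steps: int
    num_draft_tokens: int
    captured_graphs: str  # placeholder for attention backends and CUDA graphs


class TierPool:
    """Builds every tier once, then switches by swapping a reference."""

    def __init__(self, candidate_steps, initial_steps):
        # Assumption: with topk = 1, a tier drafts num_steps + 1 tokens.
        self.states = {
            steps: TierState(steps, steps + 1, f"graphs[{steps}]")
            for steps in candidate_steps
        }
        self.active = self.states[initial_steps]

    def activate(self, steps: int) -> TierState:
        # No rebuild or recapture here: only the active reference changes.
        self.active = self.states[steps]
        return self.active


if __name__ == "__main__":
    pool = TierPool([1, 3, 7], initial_steps=3)
    print(pool.activate(7))
```

The cost of this design is paid up front: every extra tier adds startup time and graph memory, which is why the tier list is kept short.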
## Runtime flow

The adaptive update happens after verify and affects the next round, not the current one:

```text
┌─────────────────────────────────────────────────────────────────────┐
│ EAGLEWorker.forward_batch_generation(): decode path                 │
│                                                                     │
│  ① draft(batch)                                                     │
│     │  draft model multi-step generation with current tier          │
│     v                                                               │
│  ② verify(batch, spec_info)                                         │
│     │  target model tree verification                               │
│     │  → produces accept_length_per_req                             │
│     v                                                               │
│  ③ forward_draft_extend_after_decode(batch)                         │
│     │  draft model KV-cache catch-up                                │
│     v                                                               │
│  ④ adaptive_controller.on_verify_complete(accept_lengths)           │
│     │                                                               │
│     │  update EMA, apply warmup / interval / hysteresis gates       │
│     │  if tier changed, select a pre-built state from pool          │
│     v                                                               │
│     worker.apply_runtime_state(state)                               │
│                                                                     │
│  Tier switch happens after the current round completes.             │
│  Backends and CUDA graphs are never swapped mid-round.              │
└─────────────────────────────────────────────────────────────────────┘
```

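The ordering in the diagram can be reduced to a few lines. This is a minimal sketch with the draft, verify, and extend stages elided; `run_rounds` and `toy_policy` are hypothetical names, and the policy is deliberately trivial so that only the timing of the switch is visible.

```python
from typing import Callable, List, Sequence


def run_rounds(
    accepts_per_round: Sequence[Sequence[int]],
    choose_tier: Callable[[int, Sequence[int]], int],
    initial_tier: int,
) -> List[int]:
    """Return the tier each round actually ran with."""
    tier = initial_tier
    tiers_used = []
    for accept_lengths in accepts_per_round:
        # Steps 1-3 (draft, verify, draft-extend) all run with the current tier.
        tiers_used.append(tier)
        # Step 4: the controller decides only after the round has finished,
        # so a new tier first takes effect in the following round.
        tier = choose_tier(tier, accept_lengths)
    return tiers_used


def toy_policy(current: int, accept_lengths: Sequence[int]) -> int:
    """Illustrative rule only: go high on good acceptance, low otherwise."""
    average = sum(accept_lengths) / len(accept_lengths)
    return 7 if average > 2 else 1


if __name__ == "__main__":
    print(run_rounds([[5], [0], [0]], toy_policy, initial_tier=3))
```

Here the high acceptance observed in the first round raises the tier used in the second round, never the round that produced the measurement.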
## How the policy decides

After each verify pass, SGLang reads the accepted draft length per request, computes the batch average, smooths it with an exponential moving average (EMA), and switches among the pre-built candidate tiers `[1, 3, 7]` by default.

The decision logic is intentionally conservative:

- `warmup_batches` skips the first few batches
- `update_interval` avoids switching every batch
- `down_hysteresis` and `up_hysteresis` reduce oscillation

Conceptually, the policy probes one step beyond the observed acceptance:

```text
target_steps ≈ clamp(round(ema_accept_len) + 1, min(candidate_steps), max(candidate_steps))
```

So if recent requests consistently accept more drafted tokens, the policy tends to move up. If they start rejecting earlier, it tends to move down.

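A minimal sketch of this decision loop is shown below. It follows the conceptual formula above and the warmup and interval gates, but several details are assumptions: the class name `EmaTierPolicy`, seeding the EMA with the first batch average, snapping the target to the nearest candidate with ties going to the smaller tier, and the exact batch on which the interval fires. The hysteresis margins are left out because the text does not spell out how they are applied.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class EmaTierPolicy:
    """Hypothetical, simplified stand-in for the EMA-based policy."""
    candidate_steps: List[int] = field(default_factory=lambda: [1, 3, 7])
    ema_alpha: float = 0.2
    warmup_batches: int = 10
    update_interval: int = 5
    current_steps: int = 3
    ema_accept_len: Optional[float] = None
    batches_seen: int = 0

    def target_steps(self) -> int:
        # clamp(round(ema) + 1, min, max); note Python rounds halves to even.
        lo, hi = min(self.candidate_steps), max(self.candidate_steps)
        return max(lo, min(hi, round(self.ema_accept_len) + 1))

    def on_verify_complete(self, accept_lengths: Sequence[int]) -> int:
        if not accept_lengths:
            return self.current_steps
        batch_avg = sum(accept_lengths) / len(accept_lengths)
        if self.ema_accept_len is None:
            self.ema_accept_len = batch_avg  # assumption: seed with first batch
        else:
            a = self.ema_alpha
            self.ema_accept_len = (1 - a) * self.ema_accept_len + a * batch_avg
        self.batches_seen += 1

        # Gate 1: observe only during warmup.
        if self.batches_seen <= self.warmup_batches:
            return self.current_steps
        # Gate 2: recompute once every update_interval batches after warmup.
        if (self.batches_seen - self.warmup_batches) % self.update_interval:
            return self.current_steps

        target = self.target_steps()
        # Assumption: pick the nearest tier, preferring the smaller on a tie.
        self.current_steps = min(
            self.candidate_steps, key=lambda s: (abs(s - target), s)
        )
        return self.current_steps
```

With the defaults, a server that starts at 3 steps and sees every request accept 6 tokens stays at 3 through batch 14 and moves to 7 on batch 15, the first recompute point after warmup in this sketch.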
## Usage

Enable adaptive mode with `--speculative-adaptive`. `--speculative-adaptive-config` is optional, but the speculative setup must still satisfy the requirements listed under "Current support".

```bash
python3 -m sglang.launch_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
    --speculative-eagle-topk 1 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 4 \
    --speculative-adaptive
```

If you want to override the defaults, add `--speculative-adaptive-config /path/to/adaptive_spec.json`.

Example config:

```json
{
  "candidate_steps": [1, 3, 7],
  "ema_alpha": 0.2,
  "warmup_batches": 10,
  "update_interval": 5
}
```

## Config file reference

The config file is optional. Any omitted keys use defaults.

| Key | Default | Meaning |
|---|---|---|
| `candidate_steps` | `[1, 3, 7]` | Discrete `speculative_num_steps` tiers that adaptive mode can switch between |
| `ema_alpha` | `0.2` | EMA smoothing factor for accepted draft length |
| `update_interval` | `5` | Recompute interval, in verify batches, after warmup |
| `warmup_batches` | `10` | Number of verify batches to observe before switching |
| `down_hysteresis` | `-0.25` | Extra margin before moving to a smaller step |
| `up_hysteresis` | `0.0` | Extra margin before moving to a larger step |

The initial `--speculative-num-steps` is snapped to the nearest value in `candidate_steps`.

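The two behaviors described here, default filling and snapping, might look like the sketch below. The defaults are copied from the table; the function names and the tie-breaking rule (prefer the smaller tier when two candidates are equally close) are assumptions, not SGLang's loader.

```python
import json
from typing import Optional, Sequence

# Defaults as listed in the table above.
DEFAULTS = {
    "candidate_steps": [1, 3, 7],
    "ema_alpha": 0.2,
    "update_interval": 5,
    "warmup_batches": 10,
    "down_hysteresis": -0.25,
    "up_hysteresis": 0.0,
}


def load_adaptive_config(json_text: Optional[str] = None) -> dict:
    """Overlay user-provided keys on the defaults; omitted keys keep defaults."""
    merged = dict(DEFAULTS)
    if json_text:
        merged.update(json.loads(json_text))
    return merged


def snap_to_candidate(initial_steps: int, candidate_steps: Sequence[int]) -> int:
    """Nearest candidate tier; ties go to the smaller tier (assumption)."""
    return min(candidate_steps, key=lambda s: (abs(s - initial_steps), s))


if __name__ == "__main__":
    cfg = load_adaptive_config('{"candidate_steps": [2, 4], "ema_alpha": 0.5}')
    print(cfg["warmup_batches"], snap_to_candidate(3, cfg["candidate_steps"]))
```

For example, with the default tiers an initial `--speculative-num-steps 4` would start at tier 3, and 6 would start at tier 7.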
## Monitoring

You can inspect the active tier and acceptance metric via `/server_info`:

```bash
curl -s http://127.0.0.1:30000/server_info | jq '.internal_states[0] | {speculative_num_steps, avg_spec_accept_length}'
```

- `speculative_num_steps` is the current active tier
- `avg_spec_accept_length` helps explain whether the server is likely to move up or down

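The same check can be scripted in Python, for example to log the tier over time. This sketch uses only the standard library and the two fields shown in the `jq` filter above; the helper names, the default address, and the assumption that the response has exactly that shape are illustrative.

```python
import json
from urllib.request import urlopen


def summarize(server_info: dict) -> dict:
    """Pick the adaptive-decoding fields out of a /server_info payload."""
    state = server_info["internal_states"][0]
    return {
        "speculative_num_steps": state.get("speculative_num_steps"),
        "avg_spec_accept_length": state.get("avg_spec_accept_length"),
    }


def fetch_server_info(base_url: str = "http://127.0.0.1:30000",
                      timeout: float = 5.0) -> dict:
    """GET <base_url>/server_info and decode the JSON body."""
    with urlopen(f"{base_url}/server_info", timeout=timeout) as response:
        return json.load(response)
```

Against a running server, `print(summarize(fetch_server_info()))` reports the active tier next to the acceptance metric; calling it in a loop shows when the tier changes.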
## Tuning tips

- Start with the default candidate tiers `[1, 3, 7]`
- Use fewer tiers if you want lower startup and graph-memory overhead
- Increase `ema_alpha` to react faster, or lower it for more stability
- Increase `warmup_batches` or `update_interval` if tier switching is too noisy
- If your workload is already stable and one static setting is well tuned, adaptive mode may not help much

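To get a feel for what a given `ema_alpha` means, it helps to estimate how many verify batches the average needs to follow a sustained shift. The sketch below assumes the standard EMA form `ema = (1 - alpha) * ema + alpha * batch_avg`, under which the remaining gap to a new level shrinks by a factor of `(1 - alpha)` per batch; the function name is hypothetical.

```python
import math


def batches_to_track(ema_alpha: float, fraction: float = 0.9) -> int:
    """Batches needed for the EMA to cover `fraction` of a sustained shift.

    After n batches at the new level the remaining gap is (1 - alpha) ** n,
    so we solve (1 - alpha) ** n <= 1 - fraction for the smallest integer n.
    """
    if not 0.0 < ema_alpha < 1.0:
        raise ValueError("ema_alpha must be strictly between 0 and 1")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - ema_alpha))


if __name__ == "__main__":
    for alpha in (0.1, 0.2, 0.5):
        print(alpha, batches_to_track(alpha))
```

With the default `ema_alpha` of 0.2 the average needs about 11 batches to cover 90% of a shift, while 0.5 needs 4. The `update_interval` gate then adds its own delay before a tier change can occur.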