mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-07-01 04:08:10 +00:00
Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com> Co-authored-by: Liangsheng Yin <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
# Evaluating New Models with SGLang

This document provides commands for evaluating a model's accuracy and performance. Before open-sourcing a new model, we strongly suggest running these commands to verify that the scores match your internal benchmark results.

**For cross verification, please submit the commands for installation, server launching, and benchmark running, along with all scores and hardware requirements, when open-sourcing your models.**

[Reference: MiniMax M2](https://github.com/sgl-project/sglang/pull/12129)

## Accuracy
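
The commands in this document all target a server listening on port 30000, so it can help to confirm the server is ready before starting an evaluation. The following is a minimal sketch, not part of SGLang: `wait_for_server` is a hypothetical helper, and it assumes the server exposes a `/health` HTTP endpoint and that `curl` is installed.

```bash
#!/usr/bin/env bash
# Sketch: poll the server until it answers, then start an evaluation.
# Assumes the server was started separately, for example with:
#   python -m sglang.launch_server --model-path <model-path> --port 30000

# Poll http://HOST:PORT/health until it succeeds or the attempts run out.
# Arguments: host, port, max attempts (default 60), delay in seconds (default 5).
# The /health endpoint is an assumption; adjust it to your deployment.
wait_for_server() {
  local host="$1" port="$2" attempts="${3:-60}" delay="${4:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if curl --silent --fail --max-time 5 --output /dev/null \
        "http://${host}:${port}/health"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example usage (left commented out so that sourcing this file has no side effects):
# wait_for_server 127.0.0.1 30000 && \
#   python -m sglang.test.run_eval --eval-name mmlu --port 30000
```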

### LLMs

SGLang provides built-in scripts to evaluate common benchmarks.

**MMLU**

```bash
python -m sglang.test.run_eval \
  --eval-name mmlu \
  --port 30000 \
  --num-examples 1000 \
  --max-tokens 8192
```

**GSM8K**

```bash
python -m sglang.test.few_shot_gsm8k \
  --host 127.0.0.1 \
  --port 30000 \
  --num-questions 200 \
  --num-shots 5
```

**HellaSwag**

```bash
python benchmark/hellaswag/bench_sglang.py \
  --host 127.0.0.1 \
  --port 30000 \
  --num-questions 200 \
  --num-shots 20
```

**GPQA**

```bash
python -m sglang.test.run_eval \
  --eval-name gpqa \
  --port 30000 \
  --num-examples 198 \
  --max-tokens 120000 \
  --repeat 8
```

```{tip}
For reasoning models, add `--thinking-mode <mode>` (e.g., `qwen3`, `deepseek-v3`). You may skip it if the model has forced thinking enabled.
```
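
When a benchmark is run several times, whether through a flag such as `--repeat` in the GPQA command above or by hand, reporting the spread alongside the mean makes cross verification easier. The sketch below assumes you have collected one score per line yourself; `mean_std` is a hypothetical helper, not an SGLang tool, and the sample scores are made up for illustration.

```bash
#!/usr/bin/env bash
# Print the mean and sample standard deviation of numbers read from stdin,
# one number per line. Blank lines are ignored.
mean_std() {
  awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) { print "no data"; exit 1 }
      mean = sum / n
      # Sample variance (divides by n - 1); defined as 0 for a single value.
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0   # guard against floating-point round-off
      printf "mean=%.4f std=%.4f n=%d\n", mean, sqrt(var), n
    }'
}

# Illustrative input only; these are not real benchmark results.
printf '%s\n' 0.70 0.72 0.74 | mean_std
```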

**HumanEval**

```bash
pip install human_eval

python -m sglang.test.run_eval \
  --eval-name humaneval \
  --num-examples 10 \
  --port 30000
```

### VLMs

**MMMU**

```bash
python benchmark/mmmu/bench_sglang.py \
  --port 30000 \
  --concurrency 64
```

```{tip}
You can set max tokens by passing `--extra-request-body '{"max_tokens": 4096}'`.
```

For models capable of processing video, we recommend extending the evaluation to include `VideoMME`, `MVBench`, and other relevant benchmarks.

## Performance

Performance benchmarks measure **Latency** (Time To First Token, TTFT) and **Throughput** (tokens/second).

### LLMs

**Latency-Sensitive Benchmark**

This simulates a low-concurrency scenario (e.g., a single user) to measure latency.

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --dataset-name random \
  --num-prompts 10 \
  --max-concurrency 1
```

**Throughput-Sensitive Benchmark**

This simulates a high-traffic scenario to measure maximum system throughput.

```bash
python -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --dataset-name random \
  --num-prompts 1000 \
  --max-concurrency 100
```

**Single Batch Performance**

You can also benchmark the performance of processing a single batch offline.

```bash
python -m sglang.bench_one_batch_server \
  --model <model-path> \
  --batch-size 8 \
  --input-len 1024 \
  --output-len 1024
```

For more granular results, run `sglang.bench_serving` at several concurrency levels:

- **Low Concurrency**: `--num-prompts 10 --max-concurrency 1`
- **Medium Concurrency**: `--num-prompts 80 --max-concurrency 16`
- **High Concurrency**: `--num-prompts 500 --max-concurrency 100`
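
The three concurrency levels in the list above can be swept with a small loop. This is a dry-run sketch: it only prints the commands so they can be reviewed first. `bench_cmd`, `BENCH_HOST`, and `BENCH_PORT` are hypothetical names introduced here, and the remaining flags are copied from the `bench_serving` examples earlier in this document.

```bash
#!/usr/bin/env bash
# Dry-run sweep over the low, medium, and high concurrency settings.
# BENCH_HOST and BENCH_PORT are illustrative variables, not SGLang settings.
BENCH_HOST="${BENCH_HOST:-0.0.0.0}"
BENCH_PORT="${BENCH_PORT:-30000}"

# Print one bench_serving command for a prompt count and a concurrency limit.
bench_cmd() {
  local num_prompts="$1" max_concurrency="$2"
  echo "python -m sglang.bench_serving --backend sglang" \
    "--host ${BENCH_HOST} --port ${BENCH_PORT} --dataset-name random" \
    "--num-prompts ${num_prompts} --max-concurrency ${max_concurrency}"
}

# Each entry is "num_prompts:max_concurrency".
for pair in 10:1 80:16 500:100; do
  bench_cmd "${pair%%:*}" "${pair##*:}"
done

# To execute the printed commands one after another, pipe the output of this
# script into "bash" once the server is up.
```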

## Reporting Results

For each evaluation, please report:

1. **Metric Score**: Accuracy % (LLMs and VLMs); Latency (ms) and Throughput (tok/s) (LLMs only).
2. **Environment settings**: GPU type/count, SGLang commit hash.
3. **Launch configuration**: Model path, TP size, and any special flags.
4. **Evaluation parameters**: Number of shots, examples, max tokens.
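
The environment settings in the list above can be gathered with a short script. This is a sketch under stated assumptions: it expects `nvidia-smi` on NVIDIA hardware and an SGLang source checkout, and falls back to "unknown" otherwise. `SGLANG_DIR`, `gpu_info`, `commit_hash`, and `env_report` are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Collect GPU type/count and the SGLang commit hash into a short text report.

# List GPU models with a count per model, if nvidia-smi is available.
gpu_info() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name --format=csv,noheader | sort | uniq -c
  else
    echo "unknown (nvidia-smi not found)"
  fi
}

# Print the commit hash of the checkout at SGLANG_DIR (defaults to the
# current directory). SGLANG_DIR is an illustrative variable name.
commit_hash() {
  git -C "${SGLANG_DIR:-.}" rev-parse HEAD 2>/dev/null \
    || echo "unknown (not a git checkout)"
}

env_report() {
  echo "GPU: $(gpu_info)"
  echo "SGLang commit: $(commit_hash)"
}

env_report
```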