---
title: "Offline Engine API"
metatags:
  description: "Use SGLang's offline engine for direct batch inference without HTTP server overhead. Supports sync/async and streaming modes."
---

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Nest Asyncio

If you use the **Offline Engine** in IPython or in other code that already runs an event loop (nested event loops), apply `nest_asyncio` first:

```python Example
import nest_asyncio

nest_asyncio.apply()
```

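The patch matters in environments that already run an event loop. As a minimal sketch using only the standard library, you could check for a running loop before deciding whether the patch is needed. The helper names `event_loop_is_running` and `check_inside_loop` are hypothetical and not part of SGLang:

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # A coroutine only executes while a loop is running.
    return event_loop_is_running()
```

In a plain script the helper returns `False`. In a Jupyter kernel it would typically return `True`, which is the situation where `nest_asyncio.apply()` is needed.
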
## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/tree/main/examples/runtime/hidden_states).

See [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python Example
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python Example
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Generate completions for the whole batch in one blocking call
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

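Batch results are often written to disk rather than printed. The following is a minimal sketch of pairing prompts with outputs and serializing them as JSON Lines. It assumes only what the example above shows, namely that each output is a dict with a `"text"` key. The helper `to_jsonl` and the record field names are hypothetical, and the sample data is a stand-in rather than real engine output:

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the outputs above (dicts with a "text" key).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

The length check guards against `zip` silently dropping items when the two lists differ in size.
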
### Streaming Synchronous Generation

```python Example
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the response and merge the chunks into a single string
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```

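The "overlap removal" mentioned above refers to dropping text that a new chunk repeats from what has already been received. The sketch below illustrates that idea only. It is not the implementation behind `stream_and_merge`, and the names `strip_overlap` and `merge_chunks` are hypothetical:

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing overlap between consecutive pieces."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello", "lo wor", "world"]))  # prints "Hello world"
```

One limitation of this heuristic: text that is genuinely repeated at a chunk boundary is indistinguishable from overlap and would be trimmed as well.
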
### Non-streaming Asynchronous Generation

```python Example
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch without blocking the event loop
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```

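One reason to prefer the asynchronous form is that other coroutines can make progress while generation is awaited. The sketch below shows this with `asyncio.gather`. It does not call SGLang: `fake_async_generate` is a hypothetical stand-in for an awaitable batch call, and `heartbeat` and `run_both` are illustrative names as well:

```python
import asyncio


async def fake_async_generate(prompts, delay=0.01):
    """Hypothetical stand-in for an awaitable batch call; not SGLang's API."""
    await asyncio.sleep(delay)  # pretend the engine is busy
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks, interval=0.001):
    """Unrelated coroutine that keeps running while generation is awaited."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        count += 1
    return count


async def run_both(prompts):
    # gather() runs both awaitables concurrently and returns their results
    # in the order the awaitables were passed in.
    outputs, ticks = await asyncio.gather(
        fake_async_generate(prompts), heartbeat(3)
    )
    return outputs, ticks
```

Calling `asyncio.run(run_both(["a", "b"]))` returns the stand-in outputs together with the tick count, which shows that both coroutines ran to completion.
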
### Streaming Asynchronous Generation

```python Example
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```

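The loop above prints each cleaned chunk as it arrives. If you also need the full text afterwards, the chunks can be collected while iterating. This is a minimal sketch of that pattern with a hypothetical stand-in generator (`fake_stream`) in place of the engine. `collect_stream` is likewise an illustrative name, and the sketch assumes the chunks do not overlap:

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in that yields text in small, non-overlapping chunks."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect_stream(stream):
    """Consume an async iterator of string chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream("The future of AI is bright")))
print(full_text)
```

Appending to a list and joining once at the end avoids repeated string concatenation on long streams.
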
```python Example
# Shut down the engine once all requests are finished
llm.shutdown()
```
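
If an exception is raised between creating the engine and calling `shutdown()`, the call is skipped. One way to guard against that is a small context manager, sketched below. It assumes only that the object has a `shutdown()` method, as shown above. `shutdown_on_exit` and `FakeEngine` are hypothetical names, and whether `sgl.Engine` offers its own context-manager support is not covered here:

```python
from contextlib import contextmanager


@contextmanager
def shutdown_on_exit(engine):
    """Yield engine and call engine.shutdown() afterwards, even on error."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in used only to make this sketch runnable."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
with shutdown_on_exit(engine) as active:
    pass  # run generation calls with `active` here
print(engine.closed)  # prints True
```
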