mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
236 lines
7.0 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Offline Engine API\n",
    "\n",
    "SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:\n",
    "\n",
    "- Offline Batch Inference\n",
    "- Custom Server on Top of the Engine\n",
    "\n",
    "This document focuses on offline batch inference and demonstrates four inference modes:\n",
    "\n",
    "- Non-streaming synchronous generation\n",
    "- Streaming synchronous generation\n",
    "- Non-streaming asynchronous generation\n",
    "- Streaming asynchronous generation\n",
    "\n",
    "You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Nest Asyncio\n",
    "\n",
    "To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:\n",
    "\n",
    "```python\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Usage\n",
    "\n",
    "The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).\n",
    "\n",
    "Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Offline Batch Inference\n",
    "\n",
    "The SGLang offline engine supports batch inference with efficient scheduling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Launch the offline engine\n",
    "import asyncio\n",
    "\n",
    "import sglang as sgl\n",
    "import sglang.test.doc_patch\n",
    "from sglang.utils import async_stream_and_merge, stream_and_merge\n",
    "\n",
    "llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The president of the United States is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "outputs = llm.generate(prompts, sampling_params)\n",
    "for prompt, output in zip(prompts, outputs):\n",
    "    print(\"===============================\")\n",
    "    print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
    "    \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
    "    \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\n",
    "    \"temperature\": 0.2,\n",
    "    \"top_p\": 0.9,\n",
    "}\n",
    "\n",
    "print(\"\\n=== Testing synchronous streaming generation with overlap removal ===\\n\")\n",
    "\n",
    "for prompt in prompts:\n",
    "    print(f\"Prompt: {prompt}\")\n",
    "    merged_output = stream_and_merge(llm, prompt, sampling_params)\n",
    "    print(\"Generated text:\", merged_output)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
    "    \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
    "    \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print(\"\\n=== Testing asynchronous batch generation ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    outputs = await llm.async_generate(prompts, sampling_params)\n",
    "\n",
    "    for prompt, output in zip(prompts, outputs):\n",
    "        print(f\"\\nPrompt: {prompt}\")\n",
    "        print(f\"Generated text: {output['text']}\")\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
    "    \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
    "    \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print(\"\\n=== Testing asynchronous streaming generation (no repeats) ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    for prompt in prompts:\n",
    "        print(f\"\\nPrompt: {prompt}\")\n",
    "        print(\"Generated text: \", end=\"\", flush=True)\n",
    "\n",
    "        # Use the overlap-aware helper instead of calling async_generate directly,\n",
    "        # so each chunk can be printed as it arrives without repeated text\n",
    "        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):\n",
    "            print(cleaned_chunk, end=\"\", flush=True)\n",
    "\n",
    "        print()  # New line after each prompt\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.shutdown()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}