---
title: "Offline Engine API"
metatags:
    description: "Use SGLang's offline engine for direct batch inference without HTTP server overhead. Supports sync/async and streaming modes."
---

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Nest Asyncio

To use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code:

```python Example
import nest_asyncio

nest_asyncio.apply()
```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/tree/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python Example
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python Example
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

### Streaming Synchronous Generation

```python Example
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```

### Non-streaming Asynchronous Generation

```python Example
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```

### Streaming Asynchronous Generation

```python Example
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```

```python Example
llm.shutdown()
```
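
The streaming examples above rely on helpers described as performing "overlap removal". As a rough illustration of that idea only, the sketch below merges text chunks while dropping any prefix of a new chunk that already appears at the end of the accumulated text. The names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical and chosen for this example; this is not presented as SGLang's actual implementation, and it runs without an engine.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical overlapping chunks standing in for streamed engine output.
fake_chunks = ["The capital", "capital of France", " of France is Paris."]
print(merge_chunks(fake_chunks))
```

One limitation of this naive approach is that it cannot tell a repeated chunk from text that legitimately repeats: `merge_chunks(["ha", "ha"])` yields `"ha"`, not `"haha"`. It is therefore only suitable when chunks are known to carry overlapping context.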