mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-07-01 04:08:10 +00:00
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> Co-authored-by: Alison Shao <alisonshao@mac.lan> Co-authored-by: Mick <mickjagger19@icloud.com>
380 lines
12 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Query VLM with Offline Engine\n",
    "\n",
    "This tutorial shows how to use SGLang's **offline Engine API** to query VLMs, using Qwen2.5-VL and Llama 4 as examples. It covers three calling approaches:\n",
    "\n",
    "1. **Basic Call**: Directly pass images and text.\n",
    "2. **Processor Output**: Use a HuggingFace processor for data preprocessing.\n",
    "3. **Precomputed Embeddings**: Pre-calculate image features to improve inference efficiency."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "## Understanding the Three Input Formats\n",
    "\n",
    "SGLang supports three ways to pass visual data, each optimized for different scenarios:\n",
    "\n",
    "### 1. **Raw Images** - Simplest approach\n",
    "- Pass PIL Images, file paths, URLs, or base64 strings directly\n",
    "- SGLang handles all preprocessing automatically\n",
    "- Best for: Quick prototyping, simple applications\n",
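    "\n",
    "As a minimal sketch of the base64 form, the snippet below encodes a local image file using only the Python standard library. The helper name `image_file_to_base64` and the file name are hypothetical and not part of SGLang; the assumption is that the resulting string can be placed in `image_data` where a PIL image would otherwise go.\n",
    "\n",
    "```python\n",
    "import base64\n",
    "\n",
    "\n",
    "def image_file_to_base64(path: str) -> str:\n",
    "    # Read the raw image bytes and return them as a base64-encoded text string.\n",
    "    with open(path, \"rb\") as f:\n",
    "        return base64.b64encode(f.read()).decode(\"utf-8\")\n",
    "\n",
    "\n",
    "# Hypothetical usage; \"example_image.png\" is an assumed local file.\n",
    "# image_data = [image_file_to_base64(\"example_image.png\")]\n",
    "```\n",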
    "\n",
    "### 2. **Processor Output** - For custom preprocessing\n",
    "- Pre-process images with a HuggingFace processor\n",
    "- Pass the complete processor output dict with `format: \"processor_output\"`\n",
    "- Best for: Custom image transformations, integration with existing pipelines\n",
    "- Requirement: Must use `input_ids` instead of a text prompt\n",
    "\n",
    "### 3. **Precomputed Embeddings** - For maximum performance\n",
    "- Pre-calculate visual embeddings using the vision encoder\n",
    "- Pass embeddings with `format: \"precomputed_embedding\"`\n",
    "- Best for: Repeated queries on the same images, caching, high-throughput serving\n",
    "- Performance gain: Avoids redundant vision encoder computation (30-50% speedup)\n",
    "\n",
    "**Key Rule**: Within a single request, use only one format for all images. Don't mix formats.\n",
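    "\n",
    "One way to guard against mixing formats is a small pre-flight check before calling `generate`. The sketch below is illustrative only: `item_format` and `check_single_format` are hypothetical helpers, not SGLang APIs, and they assume that dict items carry a `format` key as in this tutorial while everything else counts as a raw image.\n",
    "\n",
    "```python\n",
    "def item_format(item):\n",
    "    # Dict items carry an explicit \"format\" key; anything else is treated as a raw image.\n",
    "    if isinstance(item, dict):\n",
    "        return item.get(\"format\", \"raw\")\n",
    "    return \"raw\"\n",
    "\n",
    "\n",
    "def check_single_format(image_data):\n",
    "    # Raise if the items of one request use more than one input format.\n",
    "    formats = {item_format(item) for item in image_data}\n",
    "    if len(formats) > 1:\n",
    "        raise ValueError(f\"Mixed image formats in one request: {sorted(formats)}\")\n",
    "    # Return the single format in use, or None for an empty list.\n",
    "    return next(iter(formats), None)\n",
    "```\n",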
    "\n",
    "The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "## Querying Qwen2.5-VL Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "import sglang.test.doc_patch  # noqa: F401\n",
    "\n",
    "model_path = \"Qwen/Qwen2.5-VL-3B-Instruct\"\n",
    "chat_template = \"qwen2-vl\"\n",
    "example_image_url = \"https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import BytesIO\n",
    "import requests\n",
    "from PIL import Image\n",
    "\n",
    "from sglang.srt.parser.conversation import chat_templates\n",
    "\n",
    "image = Image.open(BytesIO(requests.get(example_image_url).content))\n",
    "\n",
    "conv = chat_templates[chat_template].copy()\n",
    "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n",
    "conv.append_message(conv.roles[1], \"\")\n",
    "conv.image_data = [image]\n",
    "\n",
    "print(\"Generated prompt text:\")\n",
    "print(conv.get_prompt())\n",
    "print(f\"\\nImage size: {image.size}\")\n",
    "image"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "### Basic Offline Engine API Call"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sglang import Engine\n",
    "\n",
    "llm = Engine(model_path=model_path, chat_template=chat_template, log_level=\"warning\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n",
    "print(\"Model response:\")\n",
    "print(out[\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "### Call with Processor Output\n",
    "\n",
    "Use a HuggingFace processor to preprocess the text and images, then pass the `processor_output` directly to `Engine.generate`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoProcessor\n",
    "\n",
    "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
    "processor_output = processor(\n",
    "    images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n",
    ")\n",
    "\n",
    "out = llm.generate(\n",
    "    input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n",
    "    image_data=[dict(processor_output, format=\"processor_output\")],\n",
    ")\n",
    "print(\"Response using processor output:\")\n",
    "print(out[\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "### Call with Precomputed Embeddings\n",
    "\n",
    "You can pre-calculate image features so that the vision encoder does not run again for repeated queries on the same image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoProcessor\n",
    "from transformers import Qwen2_5_VLForConditionalGeneration\n",
    "\n",
    "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
    "model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval()\n",
    "vision = model.model.visual.cuda()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "processor_output = processor(\n",
    "    images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n",
    ")\n",
    "\n",
    "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n",
    "\n",
    "precomputed_embeddings = vision(\n",
    "    processor_output[\"pixel_values\"].cuda(), processor_output[\"image_grid_thw\"].cuda()\n",
    ")\n",
    "precomputed_embeddings = precomputed_embeddings.pooler_output\n",
    "\n",
    "multi_modal_item = dict(\n",
    "    processor_output,\n",
    "    format=\"precomputed_embedding\",\n",
    "    feature=precomputed_embeddings,\n",
    ")\n",
    "\n",
    "out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])\n",
    "print(\"Response using precomputed embeddings:\")\n",
    "print(out[\"text\"])\n",
    "\n",
    "llm.shutdown()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "## Querying Llama 4 Vision Model\n",
    "\n",
    "```python\n",
    "model_path = \"meta-llama/Llama-4-Scout-17B-16E-Instruct\"\n",
    "chat_template = \"llama-4\"\n",
    "\n",
    "from io import BytesIO\n",
    "import requests\n",
    "from PIL import Image\n",
    "\n",
    "from sglang.srt.parser.conversation import chat_templates\n",
    "\n",
    "# Download the same example image\n",
    "image = Image.open(BytesIO(requests.get(example_image_url).content))\n",
    "\n",
    "conv = chat_templates[chat_template].copy()\n",
    "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n",
    "conv.append_message(conv.roles[1], \"\")\n",
    "conv.image_data = [image]\n",
    "\n",
    "print(\"Llama 4 generated prompt text:\")\n",
    "print(conv.get_prompt())\n",
    "print(f\"Image size: {image.size}\")\n",
    "\n",
    "image\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "### Llama 4 Basic Call\n",
    "\n",
    "Llama 4 requires more computational resources, so the engine is configured with multi-GPU tensor parallelism (`tp_size=4`) and a larger context length.\n",
    "\n",
    "```python\n",
    "llm = Engine(\n",
    "    model_path=model_path,\n",
    "    enable_multimodal=True,\n",
    "    attention_backend=\"fa3\",\n",
    "    tp_size=4,\n",
    "    context_length=65536,\n",
    ")\n",
    "\n",
    "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n",
    "print(\"Llama 4 response:\")\n",
    "print(out[\"text\"])\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "### Call with Processor Output\n",
    "\n",
    "Preprocessing the data with a HuggingFace processor ahead of time can reduce computational overhead during inference.\n",
    "\n",
    "```python\n",
    "from transformers import AutoProcessor\n",
    "\n",
    "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
    "processor_output = processor(\n",
    "    images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n",
    ")\n",
    "\n",
    "out = llm.generate(\n",
    "    input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n",
    "    image_data=[dict(processor_output, format=\"processor_output\")],\n",
    ")\n",
    "print(\"Response using processor output:\")\n",
    "print(out[\"text\"])\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "### Call with Precomputed Embeddings\n",
    "\n",
    "```python\n",
    "from transformers import AutoProcessor\n",
    "from transformers import Llama4ForConditionalGeneration\n",
    "\n",
    "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
    "model = Llama4ForConditionalGeneration.from_pretrained(\n",
    "    model_path, torch_dtype=\"auto\"\n",
    ").eval()\n",
    "\n",
    "vision = model.vision_model.cuda()\n",
    "multi_modal_projector = model.multi_modal_projector.cuda()\n",
    "\n",
    "print(f'Image pixel values shape: {processor_output[\"pixel_values\"].shape}')\n",
    "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n",
    "\n",
    "# Process image through vision encoder\n",
    "image_outputs = vision(\n",
    "    processor_output[\"pixel_values\"].to(\"cuda\"),\n",
    "    aspect_ratio_ids=processor_output[\"aspect_ratio_ids\"].to(\"cuda\"),\n",
    "    aspect_ratio_mask=processor_output[\"aspect_ratio_mask\"].to(\"cuda\"),\n",
    "    output_hidden_states=False\n",
    ")\n",
    "image_features = image_outputs.last_hidden_state\n",
    "\n",
    "# Flatten image features and pass through multimodal projector\n",
    "vision_flat = image_features.view(-1, image_features.size(-1))\n",
    "precomputed_embeddings = multi_modal_projector(vision_flat)\n",
    "\n",
    "# Build precomputed embedding data item\n",
    "mm_item = dict(\n",
    "    processor_output,\n",
    "    format=\"precomputed_embedding\",\n",
    "    feature=precomputed_embeddings\n",
    ")\n",
    "\n",
    "# Use precomputed embeddings for efficient inference\n",
    "out = llm.generate(input_ids=input_ids, image_data=[mm_item])\n",
    "print(\"Llama 4 precomputed embedding response:\")\n",
    "print(out[\"text\"])\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "custom_cell_magics": "kql",
   "encoding": "# -*- coding: utf-8 -*-",
   "text_representation": {
    "extension": ".py",
    "format_name": "light",
    "format_version": "1.5",
    "jupytext_version": "1.16.1"
   }
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}