Files
sglang/docs_new/docs/basic_usage/native_api.ipynb
Mingyi a3291b5654 Add new Mintlify documentation site (docs_new/) (#23001)
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com>
Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com>
Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com>
Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com>
Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Maitri Shah <shah29maitri@gmail.com>
Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com>
Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com>
Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com>
Co-authored-by: IshhanKheria <ishhankheria06@gmail.com>
Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com>
Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com>
Co-authored-by: longGGGGGG <553746008@qq.com>
Co-authored-by: Richard <richardchen@radixark.ai>
Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com>
Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com>
Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu>
Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com>
Co-authored-by: nimeshas <nimesha.s106@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com>
Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
2026-04-20 15:10:22 -07:00

668 lines
21 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SGLang Native APIs\n",
"\n",
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:\n",
"\n",
"- `/generate` (text generation model)\n",
"- `/get_model_info`\n",
"- `/get_server_info`\n",
"- `/health`\n",
"- `/health_generate`\n",
"- `/flush_cache`\n",
"- `/update_weights`\n",
"- `/encode`(embedding model)\n",
"- `/v1/rerank`(cross encoder rerank model)\n",
"- `/v1/score`(decoder-only scoring)\n",
"- `/classify`(reward model)\n",
"- `/start_expert_distribution_record`\n",
"- `/stop_expert_distribution_record`\n",
"- `/dump_expert_distribution_record`\n",
"- `/tokenize`\n",
"- `/detokenize`\n",
"- A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py)\n",
"\n",
"We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate (text generation model)\n",
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](sampling_params)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = f\"http://localhost:{port}/generate\"\n",
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get Model Info\n",
"\n",
"Get the information of the model.\n",
"\n",
"- `model_path`: The path/name of the model.\n",
"- `is_generation`: Whether the model is used as generation model or embedding model.\n",
"- `tokenizer_path`: The path/name of the tokenizer.\n",
"- `preferred_sampling_params`: The default sampling params specified via `--preferred-sampling-params`. `None` is returned in this example as we did not explicitly configure it in server args.\n",
"- `weight_version`: This field contains the version of the model weights. This is often used to track changes or updates to the models trained parameters.\n",
"- `has_image_understanding`: Whether the model has image-understanding capability.\n",
"- `has_audio_understanding`: Whether the model has audio-understanding capability.\n",
"- `model_type`: The model type from the HuggingFace config (e.g., \"qwen2\", \"llama\").\n",
"- `architectures`: The model architectures from the HuggingFace config (e.g., [\"Qwen2ForCausalLM\"])."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://localhost:{port}/get_model_info\"\n",
"\n",
"response = requests.get(url)\n",
"response_json = response.json()\n",
"print_highlight(response_json)\n",
"assert response_json[\"model_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
"assert response_json[\"is_generation\"] is True\n",
"assert response_json[\"tokenizer_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
"assert response_json[\"preferred_sampling_params\"] is None\n",
"assert response_json.keys() == {\n",
" \"model_path\",\n",
" \"is_generation\",\n",
" \"tokenizer_path\",\n",
" \"preferred_sampling_params\",\n",
" \"weight_version\",\n",
" \"has_image_understanding\",\n",
" \"has_audio_understanding\",\n",
" \"model_type\",\n",
" \"architectures\",\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get Server Info\n",
"Gets the server information including CLI arguments, token limits, and memory pool sizes.\n",
"- Note: `get_server_info` merges the following deprecated endpoints:\n",
" - `get_server_args`\n",
" - `get_memory_pool_size`\n",
" - `get_max_total_num_tokens`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://localhost:{port}/get_server_info\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Health Check\n",
"- `/health`: Check the health of the server.\n",
"- `/health_generate`: Check the health of the server by generating one token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://localhost:{port}/health_generate\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://localhost:{port}/health\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Flush Cache\n",
"\n",
"Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://localhost:{port}/flush_cache\"\n",
"\n",
"response = requests.post(url)\n",
"print_highlight(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Update Weights From Disk\n",
"\n",
"Update model weights from disk without restarting the server. Only applicable for models with the same architecture and parameter size.\n",
"\n",
"SGLang support `update_weights_from_disk` API for continuous evaluation during training (save checkpoint to disk and update weights from disk).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# successful update with same architecture and size\n",
"\n",
"url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
"data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.text)\n",
"assert response.json()[\"success\"] is True\n",
"assert response.json()[\"message\"] == \"Succeeded to update model weights.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# failed update with different parameter size or wrong name\n",
"\n",
"url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
"data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct-wrong\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"response_json = response.json()\n",
"print_highlight(response_json)\n",
"assert response_json[\"success\"] is False\n",
"assert response_json[\"message\"] == (\n",
" \"Failed to get weights iterator: \"\n",
" \"qwen/qwen2.5-0.5b-instruct-wrong\"\n",
" \" (repository not found).\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Encode (embedding model)\n",
"\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings) and will raise an error for generation models.\n",
"Therefore, we launch a new server to server an embedding model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embedding_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
" --host 0.0.0.0 --is-embedding --log-level warning\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# successful encode for embedding model\n",
"\n",
"url = f\"http://localhost:{port}/encode\"\n",
"data = {\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"text\": \"Once upon a time\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"response_json = response.json()\n",
"print_highlight(f\"Text embedding (first 10): {response_json['embedding'][:10]}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(embedding_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## v1/rerank (cross encoder rerank model)\n",
"Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross encoder model like [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) with `attention-backend` `triton` and `torch_native`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reranker_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n",
" --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# compute rerank scores for query and documents\n",
"\n",
"url = f\"http://localhost:{port}/v1/rerank\"\n",
"data = {\n",
" \"model\": \"BAAI/bge-reranker-v2-m3\",\n",
" \"query\": \"what is panda?\",\n",
" \"documents\": [\n",
" \"hi\",\n",
" \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\",\n",
" ],\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"response_json = response.json()\n",
"for item in response_json:\n",
" print_highlight(f\"Score: {item['score']:.2f} - Document: '{item['document']}'\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(reranker_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## v1/score (decoder-only scoring)\n",
"\n",
"Compute token probabilities for specified tokens given a query and items. This is useful for classification tasks, scoring responses, or computing log-probabilities.\n",
"\n",
"Parameters:\n",
"- `query`: Query text\n",
"- `items`: Item text(s) to score\n",
"- `label_token_ids`: Token IDs to compute probabilities for\n",
"- `apply_softmax`: Whether to apply softmax to get normalized probabilities (default: False)\n",
"- `item_first`: Whether items come first in concatenation order (default: False)\n",
"- `model`: Model name\n",
"\n",
"The response contains `scores` - a list of probability lists, one per item, each in the order of `label_token_ids`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
" --host 0.0.0.0 --log-level warning\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Score the probability of different completions given a query\n",
"query = \"The capital of France is\"\n",
"items = [\"Paris\", \"London\", \"Berlin\"]\n",
"\n",
"url = f\"http://localhost:{port}/v1/score\"\n",
"data = {\n",
" \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n",
" \"query\": query,\n",
" \"items\": items,\n",
" \"label_token_ids\": [9454, 2753], # e.g. \"Yes\" and \"No\" token ids\n",
" \"apply_softmax\": True, # Normalize probabilities to sum to 1\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"response_json = response.json()\n",
"\n",
"# Display scores for each item\n",
"for item, scores in zip(items, response_json[\"scores\"]):\n",
" print_highlight(f\"Item '{item}': probabilities = {[f'{s:.4f}' for s in scores]}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(score_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classify (reward model)\n",
"\n",
"SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note that SGLang now treats embedding models and reward models as the same type of models.\n",
"# This will be updated in the future.\n",
"\n",
"reward_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"PROMPT = (\n",
" \"What is the range of the numeric output of a sigmoid node in a neural network?\"\n",
")\n",
"\n",
"RESPONSE1 = \"The output of a sigmoid node is bounded between -1 and 1.\"\n",
"RESPONSE2 = \"The output of a sigmoid node is bounded between 0 and 1.\"\n",
"\n",
"CONVS = [\n",
" [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE1}],\n",
" [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE2}],\n",
"]\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(\"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\")\n",
"prompts = tokenizer.apply_chat_template(CONVS, tokenize=False, return_dict=False)\n",
"\n",
"url = f\"http://localhost:{port}/classify\"\n",
"data = {\"model\": \"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\", \"text\": prompts}\n",
"\n",
"responses = requests.post(url, json=data).json()\n",
"for response in responses:\n",
" print_highlight(f\"reward: {response['embedding'][0]}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(reward_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Capture expert selection distribution in MoE models\n",
"\n",
"SGLang Runtime supports recording the number of times an expert is selected in a MoE model run for each expert in the model. This is useful when analyzing the throughput of the model and plan for optimization.\n",
"\n",
"*Note: We only print out the first 10 lines of the csv below for better readability. Please adjust accordingly if you want to analyze the results more deeply.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"expert_record_server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = requests.post(f\"http://localhost:{port}/start_expert_distribution_record\")\n",
"print_highlight(response)\n",
"\n",
"url = f\"http://localhost:{port}/generate\"\n",
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.json())\n",
"\n",
"response = requests.post(f\"http://localhost:{port}/stop_expert_distribution_record\")\n",
"print_highlight(response)\n",
"\n",
"response = requests.post(f\"http://localhost:{port}/dump_expert_distribution_record\")\n",
"print_highlight(response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(expert_record_server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenize/Detokenize Example (Round Trip)\n",
"\n",
"This example demonstrates how to use the /tokenize and /detokenize endpoints together. We first tokenize a string, then detokenize the resulting IDs to reconstruct the original text. This workflow is useful when you need to handle tokenization externally but still leverage the server for detokenization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokenizer_free_server_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from sglang.utils import print_highlight\n",
"\n",
"base_url = f\"http://localhost:{port}\"\n",
"tokenize_url = f\"{base_url}/tokenize\"\n",
"detokenize_url = f\"{base_url}/detokenize\"\n",
"\n",
"model_name = \"qwen/qwen2.5-0.5b-instruct\"\n",
"input_text = \"SGLang provides efficient tokenization endpoints.\"\n",
"print_highlight(f\"Original Input Text:\\n'{input_text}'\")\n",
"\n",
"# --- tokenize the input text ---\n",
"tokenize_payload = {\n",
" \"model\": model_name,\n",
" \"prompt\": input_text,\n",
" \"add_special_tokens\": False,\n",
"}\n",
"try:\n",
" tokenize_response = requests.post(tokenize_url, json=tokenize_payload)\n",
" tokenize_response.raise_for_status()\n",
" tokenization_result = tokenize_response.json()\n",
" token_ids = tokenization_result.get(\"tokens\")\n",
"\n",
" if not token_ids:\n",
" raise ValueError(\"Tokenization returned empty tokens.\")\n",
"\n",
" print_highlight(f\"\\nTokenized Output (IDs):\\n{token_ids}\")\n",
" print_highlight(f\"Token Count: {tokenization_result.get('count')}\")\n",
" print_highlight(f\"Max Model Length: {tokenization_result.get('max_model_len')}\")\n",
"\n",
" # --- detokenize the obtained token IDs ---\n",
" detokenize_payload = {\n",
" \"model\": model_name,\n",
" \"tokens\": token_ids,\n",
" \"skip_special_tokens\": True,\n",
" }\n",
"\n",
" detokenize_response = requests.post(detokenize_url, json=detokenize_payload)\n",
" detokenize_response.raise_for_status()\n",
" detokenization_result = detokenize_response.json()\n",
" reconstructed_text = detokenization_result.get(\"text\")\n",
"\n",
" print_highlight(f\"\\nDetokenized Output (Text):\\n'{reconstructed_text}'\")\n",
"\n",
" if input_text == reconstructed_text:\n",
" print_highlight(\n",
" \"\\nRound Trip Successful: Original and reconstructed text match.\"\n",
" )\n",
" else:\n",
" print_highlight(\n",
" \"\\nRound Trip Mismatch: Original and reconstructed text differ.\"\n",
" )\n",
"\n",
"except requests.exceptions.RequestException as e:\n",
" print_highlight(f\"\\nHTTP Request Error: {e}\")\n",
"except Exception as e:\n",
" print_highlight(f\"\\nAn error occurred: {e}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(tokenizer_free_server_process)"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}