From 670c488155871a207cca48eee979bda0f66f2a29 Mon Sep 17 00:00:00 2001
From: Jianwei Dong
Date: Tue, 2 Dec 2025 20:04:10 +0800
Subject: [PATCH] [docs]: Add deepseek-v3.2 run tutorial (#1659)

---
 .../deepseek-v3.2-sglang-tutorial.md | 177 ++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md

diff --git a/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md b/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md
new file mode 100644
index 0000000..a8ee454
--- /dev/null
+++ b/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md
@@ -0,0 +1,177 @@
# Running DeepSeek V3.2 with SGLang and KT-Kernel

This tutorial demonstrates how to run DeepSeek V3.2 inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. This setup enables efficient deployment of large MoE models by offloading experts to the CPU.

## Table of Contents

- [Hardware Requirements](#hardware-requirements)
- [Prerequisites](#prerequisites)
- [Step 1: Download Model Weights](#step-1-download-model-weights)
- [Step 2: Quantize CPU Weights](#step-2-quantize-cpu-weights)
- [Step 3: Launch SGLang Server](#step-3-launch-sglang-server)
- [Step 4: Send Inference Requests](#step-4-send-inference-requests)

## Hardware Requirements

**Minimum Configuration:**
- **GPU**: NVIDIA L20 48GB (or equivalent with at least 27GB of VRAM available)
- **CPU**: Intel Xeon with AMX support (e.g., Sapphire Rapids)
- **RAM**: At least 350GB of system memory for the INT4-quantized weights
- **Storage**: ~1TB for model weights (FP8 + INT4 quantized)

**Tested Configuration:**
- **GPU**: NVIDIA L20 48GB
- **CPU**: Intel(R) Xeon(R) Platinum 8488C
- **RAM**: 2TB DDR5
- **OS**: Linux (Ubuntu 20.04+ recommended)

## Prerequisites

Before starting, ensure you have:

1. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)
2. **SGLang installed** - Follow the [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
3. **CUDA toolkit** - Compatible with your GPU (CUDA 11.8+ recommended)
4. **Hugging Face CLI** - For downloading models:
   ```bash
   pip install huggingface-hub
   ```

## Step 1: Download Model Weights

Running DeepSeek V3.2 requires downloading the following model repositories:

1. **DeepSeek-V3.2**
2. **DeepSeek-V3.2-Speciale**

```bash
# Create a directory for models
mkdir -p /path/to/models
cd /path/to/models

# Download DeepSeek-V3.2 (FP8 weights for GPU)
huggingface-cli download deepseek-ai/DeepSeek-V3.2 \
  --local-dir /path/to/deepseek-v3.2

# Download DeepSeek-V3.2-Speciale (if needed)
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Speciale \
  --local-dir /path/to/deepseek-v3.2-speciale
```

**Note:** Replace `/path/to/models` with your actual storage path throughout this tutorial.

## Step 2: Quantize CPU Weights

Convert the FP8 GPU weights to INT4-quantized CPU weights using the provided conversion script.

### Conversion Command

For a 2-NUMA system with 60 physical cores:

```bash
cd /path/to/ktransformers/kt-kernel

python scripts/convert_cpu_weights.py \
  --input-path /path/to/deepseek-v3.2 \
  --input-type fp8 \
  --output /path/to/deepseek-v3.2-INT4 \
  --quant-method int4 \
  --cpuinfer-threads 60 \
  --threadpool-count 2 \
  --no-merge-safetensor
```

## Step 3: Launch SGLang Server

Start the SGLang server with KT-Kernel integration for CPU-GPU heterogeneous inference.
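Before launching, it is worth confirming that the Step 2 output exists where the server will look for it. A minimal sanity check, assuming the example INT4 output path used above (the exact file layout depends on the conversion script version):

```bash
# List the converted INT4 CPU weights from Step 2
# (substitute the directory you passed to --output)
ls -lh /path/to/deepseek-v3.2-INT4 | head

# Rough size check: the INT4 expert weights should occupy
# a few hundred GB, in line with the ~350GB RAM figure above
du -sh /path/to/deepseek-v3.2-INT4
```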
### Launch Command

For a single NVIDIA L20 48GB GPU + 2-NUMA CPU system:

```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model /path/to/deepseek-v3.2 \
  --kt-weight-path /path/to/deepseek-v3.2-INT4 \
  --kt-cpuinfer 60 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 1 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 32 \
  --max-total-tokens 40000 \
  --served-model-name DeepSeek-V3.2 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --kt-method AMXINT4
```

### Resource Usage

- **GPU VRAM:** ~27GB (for 1 GPU expert per layer + attention)
- **System RAM:** ~350GB (for the INT4-quantized CPU experts)

## Step 4: Send Inference Requests

Once the server is running, you can send inference requests through the OpenAI-compatible API.

### Basic Chat Completion Request

```bash
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V3.2",
    "stream": false,
    "messages": [
      {"role": "user", "content": "hi"}
    ]
  }'
```

### Example Response

```json
{
  "id": "adbb44f6aafb4b58b167e42fbbb1eed3",
  "object": "chat.completion",
  "created": 1764675126,
  "model": "DeepSeek-V3.2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hi there! 👋 \n\nThanks for stopping by! How can I help you today? Feel free to ask me anything - I'm here to assist with questions, explanations, conversations, or whatever you need! 😊\n\nIs there something specific on your mind, or would you like to know more about what I can do?",
        "reasoning_content": null,
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 1
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 72,
    "completion_tokens": 67,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  },
  "metadata": {
    "weight_version": "default"
  }
}
```
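### Streaming Chat Completion Request

The basic request above sets `"stream": false`; the OpenAI-compatible API can also stream tokens back as they are generated. A minimal sketch, assuming the same host, port, and served model name as above:

```bash
# Same endpoint as the basic request, but with "stream": true;
# -N disables curl's output buffering so chunks print as they arrive
curl -s -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V3.2",
    "stream": true,
    "messages": [
      {"role": "user", "content": "hi"}
    ]
  }'
```

Following the OpenAI streaming convention, each event line carries a `chat.completion.chunk` JSON payload, and the stream ends with a `data: [DONE]` line.

## Additional Resources

- [KT-Kernel Documentation](../../../kt-kernel/README.md)
- [DeepSeek V3.2 Model Card](https://huggingface.co/deepseek-ai/DeepSeek-V3.2)
- [SGLang GitHub](https://github.com/sgl-project/sglang)
\ No newline at end of file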