support qwen3 next

djw
2025-09-11 11:55:09 +00:00
parent 1bc9213e7b
commit a44b710649
4 changed files with 80 additions and 2 deletions


@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
<h2 id="Updates">🔥 Updates</h2>
* **September 11, 2025**: Support Qwen3-Next. ([Tutorial](./doc/en/Qwen3-Next.md))
* **September 5, 2025**: Support Kimi-K2-0905. ([Tutorial](./doc/en/Kimi-K2.md))
* **July 26, 2025**: Support SmallThinker and GLM4-MoE. ([Tutorial](./doc/en/SmallThinker_and_Glm4moe.md))
* **July 11, 2025**: Support Kimi-K2. ([Tutorial](./doc/en/Kimi-K2.md))

doc/en/Qwen3-Next.md (new file)

@@ -0,0 +1,70 @@
# Qwen3-Next Support for KTransformers
## Introduction
### Overview
We are very pleased to announce that KTransformers now supports Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.
### Model & Resource Links
- Official Qwen3-Next-80B-A3B-Thinking Release:
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- Official Qwen3-Next-80B-A3B-Instruct Release:
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
## Installation Guide
### 1. Resource Requirements
Running the model with all 512 experts requires approximately 320 GB of host memory and 6 GB of GPU memory.
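As a rough sanity check on the 320 GB figure, the weights alone account for essentially all of the host-memory footprint. The bytes-per-parameter value below is an illustrative assumption on our part, not a statement about the precision KTransformers actually uses:

```python
# Back-of-envelope check of the ~320 GB host-memory figure.
# 4 bytes per parameter is an assumption for illustration only;
# actual usage depends on weight precision and runtime overhead.
total_params = 80e9      # Qwen3-Next-80B has ~80B total parameters
bytes_per_param = 4      # assumed footprint per parameter
cpu_gb = total_params * bytes_per_param / 1e9
print(f"~{cpu_gb:.0f} GB of host memory for the weights")
```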
### 2. Prepare Models
```bash
# download the model weights from Hugging Face
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Instruct
```
### 3. Install ktransformers
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).
### 4. Run Qwen3-Next Inference Server
```bash
python ktransformers/server/main.py \
--port 10021 \
--model_path path-to-Qwen3-Next-80B-A3B-Thinking \
--model_name Qwen3NextForCausalLM \
--optimize_config_path <local_path>/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--no-use_cuda_graph \
--backend_type balance_serve
```
### 5. Access server
```bash
curl -X POST http://localhost:10021/v1/chat/completions \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"model": "Qwen3-Next-80B-A3B-Instruct",
"temperature": 0.3,
"top_p": 1.0,
"stream": true
}'
```
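The same request can be issued from Python with only the standard library. Host, port, payload, and model name below mirror the curl example above; the streaming parser is a plain sketch of OpenAI-style SSE handling, not a KTransformers-specific API, and it assumes the server from step 4 is running:

```python
# Minimal streaming client for the OpenAI-compatible endpoint started in step 4.
# Stdlib only; URL, port, and model name are taken from the curl example.
import json
import urllib.request


def extract_delta(sse_line: str) -> str:
    """Pull the incremental text out of one 'data: {...}' SSE line, else ''."""
    line = sse_line.strip()
    if not line.startswith("data: ") or line == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content", "")


def chat(prompt: str, base_url: str = "http://localhost:10021") -> str:
    body = json.dumps({
        "model": "Qwen3-Next-80B-A3B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "top_p": 1.0,
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "accept": "application/json"},
    )
    # HTTPResponse iterates line by line, so each raw line is one SSE event.
    with urllib.request.urlopen(req) as resp:
        return "".join(extract_delta(raw.decode("utf-8")) for raw in resp)


if __name__ == "__main__":
    print(chat("hello"))
```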
### 6. Notes
Due to Qwen3-Next's use of linear attention, CUDA Graph optimization is not yet supported, but it's coming soon! 🚀


@@ -60,6 +60,11 @@ class KQwen3NextForCausalLM(Qwen3NextPreTrainedModel):
        return features

    def reset_conv_states(self):
        for i in range(self.config.num_hidden_layers):
            self.conv_states[i] = None
            self.recurrent_states[i] = None

    def forward(
        self,
@@ -75,7 +80,10 @@ class KQwen3NextForCausalLM(Qwen3NextPreTrainedModel):
        forward_batch_output = ForwardBatchOutput()

        q_len = features[0].size(0)
        if q_len > 1:
            self.reset_conv_states()
        hidden_states = features[0]
        self.attn[cuda_graph_idx].calc_batch_indices(hidden_states.shape[0])
        freqs_cis = self.model.rotary_emb(hidden_states.unsqueeze(0), batch.minibatch.position_ids.unsqueeze(0))


@@ -305,7 +305,6 @@ class BalanceServeThreadContext(ThreadContext):
def run_engine(args, token_queue, broadcast_endpoint, event, kvcache_event):
    args.use_cuda_graph = False  # tmp set
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
    if args.use_cuda_graph:
        engine.model_runner.warmup()