support qwen3 next

djw
2025-09-11 11:55:09 +00:00
parent 1bc9213e7b
commit a44b710649
4 changed files with 80 additions and 2 deletions


@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
<h2 id="Updates">🔥 Updates</h2>
* **September 11, 2025**: Support Qwen3-Next. ([Tutorial](./doc/en/Qwen3-Next.md))
* **September 5, 2025**: Support Kimi-K2-0905. ([Tutorial](./doc/en/Kimi-K2.md))
* **July 26, 2025**: Support SmallThinker and GLM4-MoE. ([Tutorial](./doc/en/SmallThinker_and_Glm4moe.md))
* **July 11, 2025**: Support Kimi-K2. ([Tutorial](./doc/en/Kimi-K2.md))

doc/en/Qwen3-Next.md (new file)

@@ -0,0 +1,70 @@
# Qwen3-Next Support for KTransformers
## Introduction
### Overview
We are very pleased to announce that KTransformers now supports Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.
### Model & Resource Links
- Official Qwen3-Next-80B-A3B-Thinking Release:
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- Official Qwen3-Next-80B-A3B-Instruct Release:
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
## Installation Guide
### 1. Resource Requirements
Running the model with all 512 experts requires approximately 320 GB of host memory and 6 GB of GPU memory.
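As a rough sanity check on the 320 GB figure, the weights alone account for essentially all of the host-memory footprint. The bytes-per-parameter value below is an illustrative assumption on our part, not a statement about the precision KTransformers actually uses:

```python
# Back-of-envelope check of the ~320 GB host-memory figure.
# 4 bytes per parameter is an assumption for illustration only;
# actual usage depends on weight precision and runtime overhead.
total_params = 80e9      # Qwen3-Next-80B has ~80B total parameters
bytes_per_param = 4      # assumed footprint per parameter
cpu_gb = total_params * bytes_per_param / 1e9
print(f"~{cpu_gb:.0f} GB of host memory for the weights")
```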
### 2. Prepare Models
```bash
# download the model weights from Hugging Face
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Instruct
```
### 3. Install ktransformers
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).
### 4. Run Qwen3-Next Inference Server
```bash
python ktransformers/server/main.py \
--port 10021 \
--model_path path-to-Qwen3-Next-80B-A3B-Thinking \
--model_name Qwen3NextForCausalLM \
--optimize_config_path <local_path>/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--no-use_cuda_graph \
--backend_type balance_serve
```
### 5. Access server
```bash
curl -X POST http://localhost:10021/v1/chat/completions \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"model": "Qwen3-Next-80B-A3B-Instruct",
"temperature": 0.3,
"top_p": 1.0,
"stream": true
}'
```
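The same request can be issued from Python with only the standard library. Host, port, payload, and model name below mirror the curl example above; the streaming parser is a plain sketch of OpenAI-style SSE handling, not a KTransformers-specific API, and it assumes the server from step 4 is running:

```python
# Minimal streaming client for the OpenAI-compatible endpoint started in step 4.
# Stdlib only; URL, port, and model name are taken from the curl example.
import json
import urllib.request


def extract_delta(sse_line: str) -> str:
    """Pull the incremental text out of one 'data: {...}' SSE line, else ''."""
    line = sse_line.strip()
    if not line.startswith("data: ") or line == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content", "")


def chat(prompt: str, base_url: str = "http://localhost:10021") -> str:
    body = json.dumps({
        "model": "Qwen3-Next-80B-A3B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "top_p": 1.0,
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "accept": "application/json"},
    )
    # HTTPResponse iterates line by line, so each raw line is one SSE event.
    with urllib.request.urlopen(req) as resp:
        return "".join(extract_delta(raw.decode("utf-8")) for raw in resp)


if __name__ == "__main__":
    print(chat("hello"))
```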
### 6. Notes
Due to Qwen3-Next's use of linear attention, CUDA Graph optimization is not yet supported, but it's coming soon! 🚀


@@ -60,6 +60,11 @@ class KQwen3NextForCausalLM(Qwen3NextPreTrainedModel):
        return features

    def reset_conv_states(self):
        for i in range(self.config.num_hidden_layers):
            self.conv_states[i] = None
            self.recurrent_states[i] = None

    def forward(
        self,
@@ -75,7 +80,10 @@ class KQwen3NextForCausalLM(Qwen3NextPreTrainedModel):
        forward_batch_output = ForwardBatchOutput()

        q_len = features[0].size(0)
        if q_len > 1:
            self.reset_conv_states()
        hidden_states = features[0]
        self.attn[cuda_graph_idx].calc_batch_indices(hidden_states.shape[0])
        freqs_cis = self.model.rotary_emb(hidden_states.unsqueeze(0), batch.minibatch.position_ids.unsqueeze(0))


@@ -305,7 +305,6 @@ class BalanceServeThreadContext(ThreadContext):
def run_engine(args, token_queue, broadcast_endpoint, event, kvcache_event):
    args.use_cuda_graph = False  # tmp set
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
    if args.use_cuda_graph:
        engine.model_runner.warmup()