mirror of
https://github.com/kvcache-ai/ktransformers.git
synced 2026-04-20 14:29:22 +00:00
support qwen3 next
This commit is contained in:
@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

<h2 id="Updates">🔥 Updates</h2>

* **September 11, 2025**: Support Qwen3-Next. ([Tutorial](./doc/en/Qwen3-Next.md))
* **September 5, 2025**: Support Kimi-K2-0905. ([Tutorial](./doc/en/Kimi-K2.md))
* **July 26, 2025**: Support SmallThinker and GLM4-MoE. ([Tutorial](./doc/en/SmallThinker_and_Glm4moe.md))
* **July 11, 2025**: Support Kimi-K2. ([Tutorial](./doc/en/Kimi-K2.md))
70  doc/en/Qwen3-Next.md  Normal file
@@ -0,0 +1,70 @@
# Qwen3-Next Support for KTransformers

## Introduction

### Overview

We are very pleased to announce that KTransformers now supports Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.
### Model & Resource Links

- Official Qwen3-Next-80B-A3B-Thinking Release:
  - https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- Official Qwen3-Next-80B-A3B-Instruct Release:
  - https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
## Installation Guide

### 1. Resource Requirements

Running the model with all 512 experts loaded requires approximately 320 GB of system memory and 6 GB of GPU memory.
### 2. Prepare Models

```bash
# download the model weights
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Instruct
```
### 3. Install ktransformers

To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).

### 4. Run Qwen3-Next Inference Server
```bash
python ktransformers/server/main.py \
  --port 10021 \
  --model_path path-to-Qwen3-Next-80B-A3B-Thinking \
  --model_name Qwen3NextForCausalLM \
  --optimize_config_path <local_path>/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --no-use_cuda_graph \
  --backend_type balance_serve
```
### 5. Access server

```bash
curl -X POST http://localhost:10021/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
```
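The same request body can be built programmatically before posting it to the endpoint above. A minimal sketch using only Python's standard library; the helper name `build_chat_request` is illustrative, not part of ktransformers:

```python
import json

def build_chat_request(content: str,
                       model: str = "Qwen3-Next-80B-A3B-Instruct",
                       temperature: float = 0.3,
                       top_p: float = 1.0,
                       stream: bool = True) -> str:
    """Serialize an OpenAI-compatible chat-completion payload,
    mirroring the curl example above."""
    payload = {
        "messages": [{"role": "user", "content": content}],
        "model": model,
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,
    }
    return json.dumps(payload)

# POST this body to http://localhost:10021/v1/chat/completions
# with a Content-Type: application/json header.
body = build_chat_request("hello")
```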

### 6. Notes

Because Qwen3-Next uses linear attention, CUDA Graph optimization is not yet supported, but it is coming soon! 🚀
@@ -60,6 +60,11 @@ class KQwen3NextForCausalLM(Qwen3NextPreTrainedModel):

        return features

    def reset_conv_states(self):
        for i in range(self.config.num_hidden_layers):
            self.conv_states[i] = None
            self.recurrent_states[i] = None
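The hunk above clears per-layer conv/recurrent state: linear attention carries state across decode steps, so stale state from a previous request must be dropped when a new prompt arrives. A minimal, hypothetical sketch of the same pattern (the class and the `q_len > 1` prefill check are illustrative, not the actual ktransformers implementation):

```python
class RecurrentStateCache:
    """Per-layer conv/recurrent state for a linear-attention model."""

    def __init__(self, num_hidden_layers: int):
        self.num_hidden_layers = num_hidden_layers
        self.conv_states = [None] * num_hidden_layers
        self.recurrent_states = [None] * num_hidden_layers

    def reset_conv_states(self):
        # Same loop shape as the diff: clear both state lists per layer.
        for i in range(self.num_hidden_layers):
            self.conv_states[i] = None
            self.recurrent_states[i] = None

    def step(self, q_len: int):
        # A prefill (q_len > 1) starts a new sequence, so stale
        # state must not leak in; a decode step (q_len == 1) keeps it.
        if q_len > 1:
            self.reset_conv_states()

cache = RecurrentStateCache(num_hidden_layers=4)
cache.conv_states[0] = "stale"
cache.step(q_len=8)   # prefill: state is cleared
```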

    def forward(
        self,

@@ -75,7 +80,10 @@ class KQwen3NextForCausalLM(Qwen3NextPreTrainedModel):

        forward_batch_output = ForwardBatchOutput()

        q_len = features[0].size(0)
        if q_len > 1:
            self.reset_conv_states()

        hidden_states = features[0]
        self.attn[cuda_graph_idx].calc_batch_indices(hidden_states.shape[0])
        freqs_cis = self.model.rotary_emb(hidden_states.unsqueeze(0), batch.minibatch.position_ids.unsqueeze(0))
@@ -305,7 +305,6 @@ class BalanceServeThreadContext(ThreadContext):

def run_engine(args, token_queue, broadcast_endpoint, event, kvcache_event):
    args.use_cuda_graph = False  # tmp set
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
    if args.use_cuda_graph:
        engine.model_runner.warmup()