SmallThinker & GLM-4-MoE Support for KTransformers
Overview
We are excited to announce that KTransformers now supports both SmallThinker and GLM-4-MoE.
- SmallThinker-21BA3B-Instruct (bf16): ~26 TPS on a dual-socket CPU with one consumer-grade GPU, requiring ~84 GB DRAM.
- GLM-4.5-Air (bf16): ~11 TPS on a dual-socket CPU with one consumer-grade GPU, requiring ~440 GB DRAM.
- GLM-4.5-Air (AMX INT8): prefill ~309 TPS / decode ~16 TPS on a dual-socket CPU with one consumer-grade GPU, requiring ~220 GB DRAM.
Model & Resource Links
- SmallThinker-21BA3B-Instruct: https://huggingface.co/PowerInfer/SmallThinker-21BA3B-Instruct
- GLM-4.5-Air (110B): https://huggingface.co/zai-org/GLM-4.5-Air
Installation Guide
1. Resource Requirements
| Model | Precision | Experts | DRAM Needed | GPU Memory Needed* | Decode TPS (approx.) |
|---|---|---|---|---|---|
| SmallThinker-21BA3B-Instruct | bf16 | 32 | ~42 GB | 14 GB | ~26 TPS |
| GLM-4.5-Air | bf16 | 128 | ~220 GB | 14 GB | ~11 TPS |
| GLM-4.5-Air (AMX INT8) | int8 | 128 | ~110 GB | 14 GB | ~16 TPS |
* Exact GPU memory depends on sequence length, batch size, and kernels used. DRAM figures cover a single copy of the weights; the dual-socket numbers in the Overview assume the weights are duplicated on both NUMA nodes, which doubles DRAM.
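As a rough check, the DRAM figures track a weights-only estimate: parameter count times bytes per parameter, doubled when each NUMA node keeps its own copy. A back-of-envelope sketch (the helper below is illustrative, not part of KTransformers):

```python
# Rough weights-only DRAM estimate (ignores KV cache and activations).
# numa_copies=2 models a dual-socket machine that duplicates the weights
# on each NUMA node, matching the Overview numbers above.
def weight_dram_gb(params_billion: float, bytes_per_param: float,
                   numa_copies: int = 1) -> float:
    return params_billion * bytes_per_param * numa_copies

print(weight_dram_gb(21, 2))      # SmallThinker bf16, single copy -> ~42 GB
print(weight_dram_gb(21, 2, 2))   # dual-socket, duplicated        -> ~84 GB
print(weight_dram_gb(110, 2, 2))  # GLM-4.5-Air bf16, dual-socket  -> ~440 GB
print(weight_dram_gb(110, 1, 2))  # GLM-4.5-Air int8, dual-socket  -> ~220 GB
```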
2. Prepare Models
# Example: download the original safetensors (adjust local paths to your setup)
# SmallThinker-21BA3B
huggingface-cli download --resume-download PowerInfer/SmallThinker-21BA3B-Instruct \
  --local-dir ./SmallThinker-21BA3B-Instruct
# GLM-4.5-Air
huggingface-cli download --resume-download zai-org/GLM-4.5-Air \
  --local-dir ./GLM-4.5-Air
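The same downloads can be driven from Python via huggingface_hub (a minimal sketch; snapshot_download resumes interrupted transfers automatically):

```python
from huggingface_hub import snapshot_download

# Fetch both checkpoints into local directories matching the CLI example.
snapshot_download(repo_id="PowerInfer/SmallThinker-21BA3B-Instruct",
                  local_dir="./SmallThinker-21BA3B-Instruct")
snapshot_download(repo_id="zai-org/GLM-4.5-Air",
                  local_dir="./GLM-4.5-Air")
```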
3. Install KTransformers
Follow the official Installation Guide.
pip install ktransformers # or from source if you need bleeding-edge features
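A quick sanity check that the install worked (a trivial sketch):

```python
# Verify that the package is importable and see where it resolved from.
import ktransformers
print("ktransformers loaded from:", ktransformers.__file__)
```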
4. Run SmallThinker-21B Inference Server
python ktransformers/server/main.py \
--port 10021 \
--model_path /abs/path/to/SmallThinker-21B-bf16 \
--model_name SmallThinkerForCausalLM \
--optimize_config_path ktransformers/optimize/optimize_rules/SmallThinker-serve.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--backend_type balance_serve
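In brief: --cache_lens sets the total KV-cache length (in tokens) the scheduler allocates and shares across requests, --chunk_size caps how many tokens the engine processes per step during prefill, --max_batch_size bounds the number of concurrently batched requests, and --backend_type balance_serve selects the multi-concurrency serving backend. The GLM-4.5-Air launch below differs only in port, model, and optimize-rule file.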
5. Run GLM-4-MoE 110B Inference Server
python ktransformers/server/main.py \
--port 10110 \
--model_name Glm4MoeForCausalLM \
--model_path /abs/path/to/GLM-4-MoE-110B-bf16 \
--optimize_config_path ktransformers/optimize/optimize_rules/Glm4Moe-serve.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--backend_type balance_serve
6. Access Server
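Both servers expose an OpenAI-compatible /v1/chat/completions endpoint. First, the SmallThinker server: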
curl -X POST http://localhost:10021/v1/chat/completions \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"model": "SmallThinker-21BA3B-Instruct",
"temperature": 0.3,
"top_p": 1.0,
"stream": true
}'
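The same request against the GLM-4.5-Air server: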
curl -X POST http://localhost:10110/v1/chat/completions \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"model": "GLM-4.5-Air",
"temperature": 0.3,
"top_p": 1.0,
"stream": true
}'
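Because the endpoint follows the OpenAI chat-completions format, any OpenAI-compatible client should work as well. A minimal streaming sketch using the openai Python package (the api_key is a dummy value, on the assumption that the local server does not validate it):

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the local SmallThinker server.
client = OpenAI(base_url="http://localhost:10021/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="SmallThinker-21BA3B-Instruct",
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.3,
    top_p=1.0,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta of the assistant's reply.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```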