Benjamin F 041bdfc636 [New Model] DeepSeek-V4-Flash: kt-kernel MXFP4 MoE + sglang hybrid inference (#1970)
* [feat](kt-kernel): add MXFP4 MoE operator with E2M1 weights × BF16 activations

Implements AMX_FP4_MOE_TP based on the RAWINT4 (k2-moe) CRTP pattern.
FP4 E2M1 weights are nibble-packed and decoded via PSHUFB LUT, then
computed with BF16 activations using _mm512_dpbf16_ps. Supports weight-only
per-kgroup scaling (group_size=32) and tensor parallelism.

Includes a Python validation test covering uniform, alternating, ramp,
and random weight patterns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950)

V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group
scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader
that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}`
naming and converts ue8m0 → bf16 via a lossless bit-cast, register the
model entry, and ship an end-to-end numerical validation script.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [perf](kt-kernel): MXFP4 MoE add mat-mat 4×4 tile, refine mat-vec reduce (#1957)

mat_mul_kgroup previously aliased to fp4_mat_vec_kgroup, leaving large
batches stuck on the per-token path. Implement fp4_mat_mat_kgroup as a
4×4 register tile (MB=NB=4, 16 zmm accumulators) so each PSHUFB decode
of four weight rows is reused across four tokens.

Refactor fp4_mat_vec_kgroup to accumulate four N-rows in parallel and
flush them with a new reduce4 helper, removing per-row reduce_add_ps
calls from the hot loop. Mark mxfp4_to_bf16_32 always_inline.

Add bench/bench_fp4_moe.py with --routing {balanced,concentrated} and
a backend registry so future kernels can be added without changing the
runner.

Dispatch thresholds, derived_init, GeneralMOEConfig handling,
load_weights, write_weights_to_buffer and the TP_MOE specialization are
unchanged.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(loader): avoid uint16 lshift in ue8m0->bf16 conversion

PyTorch CPU has no lshift kernel for UInt16, so the previous
`(scale_t.to(torch.uint16) << 7)` raised NotImplementedError when
loading any V4-Flash MXFP4 routed-expert scale tensor on the host.

Switch to int32 for the shift (kernel exists) and narrow to int16
afterwards. The shifted value max is 255<<7 = 32640, well within
int16 range, so the narrow is lossless. The .view(bfloat16) bit
pattern is identical (bf16 sign bit is always 0 for ue8m0 values).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v4-flash): hybrid CPU/GPU recipe + bump kt-sglang submodule

Bumps third_party/sglang to kvcache-ai/sglang main (3cbd49c29) which now
contains DeepSeek V4 Flash model support + consumer-GPU (SM_120) portable
Triton/TileLang fallbacks (kt-sglang PR #38).

Adds doc/en/DeepSeek-V4-Flash.md tutorial: 8x RTX 5090 hybrid recipe with
the full launch command, OpenAI-compatible /generate + /v1/chat/completions
examples, and the kt chat CLI client.

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-03 10:48:31 +08:00
2026-03-04 16:54:48 +08:00
2026-03-04 16:54:48 +08:00

KTransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference/Fine-tune Optimizations

🎯 Overview | 🚀 Inference | 🎓 SFT | 🔥 Citation | 🚀 Roadmap(2026Q2)

🎯 Overview

KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project now exposes two user-facing capabilities from the kt-kernel source tree: Inference and SFT.

🔥 Updates

  • May 6, 2026: KTransformers at GOSIM Paris 2026 — "Agentic AI on Edge" track. We'll present KT's inference performance on consumer hardware.
  • May 02, 2026: DeepSeek-V4-Flash Support! (Tutorial)
  • Apr 30, 2026: KTransformers v0.6.1 refreshes kt-kernel inference and SFT docs with separate Inference and SFT Quick Start entry points.
  • Mar 26, 2026: Support AVX2-only CPU backend for KT-Kernel inference. (Tutorial)
  • Feb 13, 2026: MiniMax-M2.5 Day0 Support! (Tutorial)
  • Feb 12, 2026: GLM-5 Day0 Support! (Tutorial)
  • Jan 27, 2026: Kimi-K2.5 Day0 Support! (Tutorial) (SFT Tutorial)
  • Jan 22, 2026: Support CPU-GPU Expert Scheduling, Native BF16 and FP8 per channel Precision and AutoDL unified fine-tuning and inference
  • Dec 24, 2025: Support Native MiniMax-M2.1 inference. (Tutorial)
  • Dec 22, 2025: Support RL-DPO fine-tuning with LLaMA-Factory. (Tutorial)
  • Dec 5, 2025: Support Native Kimi-K2-Thinking inference (Tutorial)
  • Nov 6, 2025: Support Kimi-K2-Thinking inference (Tutorial) and fine-tune (Tutorial)
  • Nov 4, 2025: KTransformers Fine-Tuning × LLaMA-Factory Integration. (Tutorial)
  • Oct 27, 2025: Support Ascend NPU. (Tutorial)
  • Oct 10, 2025: Integrating into SGLang. (Roadmap, Blog)
  • Sept 11, 2025: Support Qwen3-Next. (Tutorial)
  • Sept 05, 2025: Support Kimi-K2-0905. (Tutorial)
  • July 26, 2025: Support SmallThinker and GLM4-MoE. (Tutorial)
  • July 11, 2025: Support Kimi-K2. (Tutorial)
  • June 30, 2025: Support 3-layer (GPU-CPU-Disk) prefix cache reuse.
  • May 14, 2025: Support Intel Arc GPU (Tutorial).
  • Apr 29, 2025: Support AMX-Int8、 AMX-BF16 and Qwen3MoE (Tutorial)
  • Apr 9, 2025: Experimental support for LLaMA 4 models (Tutorial).
  • Apr 2, 2025: Support Multi-concurrency. (Tutorial).
  • Mar 15, 2025: Support ROCm on AMD GPU (Tutorial).
  • Mar 5, 2025: Support unsloth 1.58/2.51 bits weights and IQ1_S/FP8 hybrid weights. Support 139K Longer Context for DeepSeek-V3 and R1 in 24GB VRAM.
  • Feb 25, 2025: Support FP8 GPU kernel for DeepSeek-V3 and R1; Longer Context.
  • Feb 15, 2025: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed +15%, up to 16 Tokens/s), update docs and online books.
  • Feb 10, 2025: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. For detailed show case and reproduction tutorial, see here.
  • Aug 28, 2024: Decrease DeepseekV2's required VRAM from 21G to 11G.
  • Aug 15, 2024: Update detailed tutorial for injection and multi-GPU.
  • Aug 14, 2024: Support llamfile as linear backend.
  • Aug 12, 2024: Support multiple GPU; Support new model: mixtral 8*7B and 8*22B; Support q2k, q3k, q5k dequant on gpu.
  • Aug 9, 2024: Support windows native.

📦 Capabilities

🚀 Inference - High-Performance kt-kernel Serving

CPU-optimized kernel operations for heterogeneous LLM inference.

image

Key Features:

  • AMX/AVX Acceleration: Intel AMX and AVX512/AVX2 optimized kernels for INT4/INT8 quantized inference
  • MoE Optimization: Efficient Mixture-of-Experts inference with NUMA-aware memory management
  • Quantization Support: CPU-side INT4/INT8 quantized weights, GPU-side GPTQ support
  • Easy Integration: Clean Python API for SGLang and other frameworks

Quick Start:

cd kt-kernel
pip install .

Use Cases:

  • CPU-GPU hybrid inference for large MoE models
  • Integration with SGLang for production serving
  • Heterogeneous expert placement (hot experts on GPU, cold experts on CPU)

Performance Examples:

Model Hardware Configuration Total Throughput Output Throughput
DeepSeek-R1-0528 (FP8) 8×L20 GPU + Xeon Gold 6454S 227.85 tokens/s 87.58 tokens/s (8-way concurrency)

👉 Full Documentation →


🎓 SFT - Fine-Tuning with LLaMA-Factory

KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning.

KTransformers SFT

Key Features:

  • Multi-Backend Support: CPU/GPU hybrid fine-tuning with INT8/INT4 quantization
  • Ultra-Large MoE Support: Fine-tune models like DeepSeek-V3/R1 on limited GPU memory
  • Faster than ZeRO-Offload: 6-12x training speedup in benchmarked MoE SFT workloads
  • Lower CPU Memory: About half the CPU memory of the previous KT SFT path in the benchmarked setup
  • LLaMA-Factory Integration: Seamless integration with popular fine-tuning framework
Model GPU Memory Training Speed Hardware
DeepSeek-V3 ~80GB total 3.7 it/s 4x RTX 4090
DeepSeek-R1 ~80GB total 3.7 it/s 4x RTX 4090
Qwen3-30B-A3B ~24GB total 8+ it/s 1x RTX 4090

Quick Start:

cd /path/to/LLaMA-Factory
pip install -e .
pip install -r requirements/ktransformers.txt
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
  --config_file examples/ktransformers/accelerate/fsdp2_kt_int8.yaml \
  src/train.py \
  examples/ktransformers/train_lora/qwen3_5moe_lora_sft_kt.yaml

👉 Quick Start → 👉 Full Documentation →


🔥 Citation

If you use KTransformers in your research, please cite our paper:

@inproceedings{10.1145/3731569.3764843,
  title = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
  author = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
  year = {2025}
}

👥 Contributors & Team

Developed and maintained by:

We welcome contributions! Please feel free to submit issues and pull requests.

💬 Community & Support

📦 KT original Code

The original integrated KTransformers framework has been archived to the archive/ directory for reference. The project now organizes the two capabilities above from the kt-kernel source tree for clearer documentation and maintenance.

For the original documentation with full quick-start guides and examples, see:

Description
A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations
Readme Apache-2.0 88 MiB
Languages
Python 57.8%
C++ 33.1%
Cuda 5.1%
Shell 1%
CMake 0.8%
Other 2.1%