## 🎯 Overview
KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project has evolved into two core modules: kt-kernel and KT-SFT.
## 🔥 Updates
- Nov 6, 2025: Support Kimi-K2-Thinking inference and fine-tuning
- Nov 4, 2025: KTransformers Fine-Tuning × LLaMA-Factory Integration
- Oct 27, 2025: Support Ascend NPU
- Oct 10, 2025: Integrating into SGLang (Roadmap, Blog)
- Sept 11, 2025: Support Qwen3-Next
- Sept 05, 2025: Support Kimi-K2-0905
- July 26, 2025: Support SmallThinker and GLM4-MoE
- June 30, 2025: Support 3-layer (GPU-CPU-Disk) prefix cache reuse
- May 14, 2025: Support Intel Arc GPU
- Apr 29, 2025: Support AMX-Int8, AMX-BF16, and Qwen3MoE
- Apr 9, 2025: Experimental support for LLaMA 4 models
- Apr 2, 2025: Support Multi-concurrency
- Mar 15, 2025: Support ROCm on AMD GPU
- Mar 5, 2025: Support unsloth 1.58/2.51-bit weights and IQ1_S/FP8 hybrid weights; 139K longer context for DeepSeek-V3/R1
- Feb 25, 2025: Support FP8 GPU kernel for DeepSeek-V3 and R1
- Feb 10, 2025: Support DeepSeek-R1 and V3 with up to 3–28× speedup
## 📦 Core Modules
### 🚀 kt-kernel - High-Performance Inference Kernels
CPU-optimized kernel operations for heterogeneous LLM inference.
**Key Features:**
- AMX/AVX Acceleration: Intel AMX and AVX512/AVX2 optimized kernels for INT4/INT8 quantized inference
- MoE Optimization: Efficient Mixture-of-Experts inference with NUMA-aware memory management
- Quantization Support: CPU-side INT4/INT8 quantized weights, GPU-side GPTQ support
- Easy Integration: Clean Python API for SGLang and other frameworks
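kt-kernel's quantized kernels are implemented with AMX/AVX intrinsics in C++; as a conceptual illustration only, the symmetric INT8 scheme behind INT8 weight quantization can be sketched in pure Python (the helper names below are hypothetical, not kt-kernel's API):

```python
# Illustrative sketch only: kt-kernel's real kernels use AMX/AVX intrinsics.
# This shows the symmetric per-tensor INT8 scheme conceptually.

def quantize_int8(weights):
    """Symmetric INT8 quantization: w_q = round(w / scale), scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale == 0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [wq * scale for wq in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Per-element reconstruction error is bounded by scale / 2.
```

The same idea extends to INT4 (range ±7) with per-group scales to keep accuracy acceptable at the lower bit width.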
**Quick Start:**

```bash
cd kt-kernel
pip install .
```
**Use Cases:**
- CPU-GPU hybrid inference for large MoE models
- Integration with SGLang for production serving
- Heterogeneous expert placement (hot experts on GPU, cold experts on CPU)
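The hot/cold expert placement idea can be illustrated with a toy frequency-based policy (a hypothetical sketch, not kt-kernel's actual placement logic):

```python
# Toy placement policy (hypothetical, not kt-kernel's real API): put the most
# frequently activated ("hot") experts on the GPU, leave the rest on the CPU.
from collections import Counter

def place_experts(activation_log, num_experts, gpu_slots):
    freq = Counter(activation_log)                       # expert id -> hit count
    ranked = sorted(range(num_experts), key=lambda e: -freq[e])
    hot = set(ranked[:gpu_slots])                        # top-k hottest experts
    return {e: ("gpu" if e in hot else "cpu") for e in range(num_experts)}

# Example: 8 experts, room for 2 on the GPU; experts 3 and 5 dominate the log.
log = [3, 5, 3, 5, 3, 1, 5, 3, 0, 5]
placement = place_experts(log, num_experts=8, gpu_slots=2)
```

In practice the placement would be driven by profiled routing statistics and memory budgets rather than a raw activation log, but the hot/cold split is the core idea.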
**Performance Examples:**
| Model | Hardware Configuration | Total Throughput | Output Throughput |
|---|---|---|---|
| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8-way concurrency) |
### 🎓 KT-SFT - Fine-Tuning Framework
KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning.
**Key Features:**
- Resource Efficient: Fine-tune 671B DeepSeek-V3 with just 70GB GPU memory + 1.3TB RAM
- LoRA Support: Full LoRA fine-tuning with heterogeneous acceleration
- LLaMA-Factory Integration: Seamless integration with popular fine-tuning framework
- Production Ready: Chat, batch inference, and metrics evaluation
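LoRA is what makes these memory numbers possible: instead of updating the full weight matrix W, training learns a low-rank pair (A, B) and uses W + (alpha / r) · BA. A plain-Python sketch of the effective weight (toy code with made-up dimensions, not KT-SFT's implementation):

```python
# Conceptual LoRA sketch (toy code, not KT-SFT's implementation): only the
# low-rank factors A (r x d_in) and B (d_out x r) are trained, so the number
# of trainable parameters is r * (d_in + d_out) instead of d_in * d_out.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    BA = matmul(B, A)                 # (d_out x r) @ (r x d_in) -> d_out x d_in
    s = alpha / r                     # standard LoRA scaling factor
    return [[w + s * d for w, d in zip(wrow, drow)] for wrow, drow in zip(W, BA)]

# d_out = d_in = 2, r = 1: 1 * (2 + 2) = 4 trainable params.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]                    # d_out x r
A = [[0.5, 0.5]]                      # r x d_in
W_eff = lora_effective_weight(W, A, B, alpha=1.0, r=1)
```

For a 671B-parameter model the frozen base weights can stay in CPU RAM (with AMX-accelerated compute) while only the small LoRA factors and activations need GPU memory, which is why 70GB of GPU memory suffices.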
**Performance Examples:**
| Model | Configuration | Throughput | GPU Memory |
|---|---|---|---|
| DeepSeek-V3 (671B) | LoRA + AMX | ~40 tokens/s | 70GB (multi-GPU) |
| DeepSeek-V2-Lite (14B) | LoRA + AMX | ~530 tokens/s | 6GB |
**Quick Start:**

```bash
cd KT-SFT
# Install the environment following KT-SFT/README.md first
USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
```
## 🔥 Citation
If you use KTransformers in your research, please cite our paper:
```bibtex
@inproceedings{10.1145/3731569.3764843,
  title     = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
  author    = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
  year      = {2025}
}
```
## 👥 Contributors & Team
Developed and maintained by:
- MADSys Lab @ Tsinghua University
- Approaching.AI
- Community contributors
We welcome contributions! Please feel free to submit issues and pull requests.
## 💬 Community & Support
- GitHub Issues: Report bugs or request features
- GitHub Discussions: Ask questions and share ideas
- WeChat Group: See archive/WeChatGroup.png
## 📦 Legacy Code
The original integrated KTransformers framework has been archived to the archive/ directory for reference. The project now focuses on the two core modules above for better modularity and maintainability.
For the original documentation with full quick-start guides and examples, see:
- archive/README_LEGACY.md (English)
- archive/README_ZH_LEGACY.md (中文)

