Mirror of https://github.com/kvcache-ai/ktransformers.git — synced 2026-03-14 18:37:23 +00:00
Refactor: restructure repository to focus on kt-kernel and KT-SFT modules (#1581)
* refactor: move legacy code to archive/ directory
  - Moved ktransformers, csrc, third_party, merge_tensors to archive/
  - Moved build scripts and configurations to archive/
  - Kept kt-kernel, KT-SFT, doc, and README files in root
  - Preserved complete git history for all moved files
* refactor: restructure repository to focus on kt-kernel and KT-SFT modules
* fix README
* fix README
* fix README
* fix README
* docs: add performance benchmarks to kt-kernel section
  Add comprehensive performance data for kt-kernel to match KT-SFT's presentation:
  - AMX kernel optimization: 21.3 TFLOPS (3.9× faster than PyTorch)
  - Prefill phase: up to 20× speedup vs baseline
  - Decode phase: up to 4× speedup
  - NUMA optimization: up to 63% throughput improvement
  - Multi-GPU (8×L20): 227.85 tokens/s total throughput with DeepSeek-R1 FP8
  Source: https://lmsys.org/blog/2025-10-22-KTransformers/
  This provides users with concrete performance metrics for both core modules, making it easier to understand the capabilities of each component.
* refactor: improve kt-kernel performance data with specific hardware and models
  Replace generic performance descriptions with concrete benchmarks:
  - Specify exact hardware: 8×L20 GPU + Xeon Gold 6454S, single/dual-socket Xeon + AMX
  - Include specific models: DeepSeek-R1-0528 (FP8), DeepSeek-V3 (671B)
  - Show detailed metrics: total throughput, output throughput, concurrency details
  - Match KT-SFT presentation style for consistency
  This provides users with actionable performance data they can use to evaluate hardware requirements and expected performance for their use cases.
* fix README
* docs: clean up performance table and improve formatting
* add pic for README
* refactor: simplify .gitmodules and backup legacy submodules
  - Remove 7 legacy submodules from root .gitmodules (archive/third_party/*)
  - Keep only 2 active submodules for kt-kernel (llama.cpp, pybind11)
  - Backup complete .gitmodules to archive/.gitmodules
  - Add documentation in archive/README.md for researchers who need legacy submodules
  This reduces initial clone size by ~500MB and avoids downloading unused dependencies.
* refactor: move doc/ back to root directory
  Keep documentation in root for easier access and maintenance.
* refactor: consolidate all images to doc/assets/
  - Move kt-kernel/assets/heterogeneous_computing.png to doc/assets/
  - Remove KT-SFT/assets/ (images already in doc/assets/)
  - Update KT-SFT/README.md image references to ../doc/assets/
  - Eliminates ~7.9MB image duplication
  - Centralizes all documentation assets in one location
* fix pic path for README
.gitmodules (vendored): 22 lines changed
@@ -1,25 +1,3 @@
-[submodule "third_party/llama.cpp"]
-	path = third_party/llama.cpp
-	url = https://github.com/ggerganov/llama.cpp.git
-[submodule "third_party/pybind11"]
-	path = third_party/pybind11
-	url = https://github.com/pybind/pybind11.git
-[submodule "third_party/spdlog"]
-	path = third_party/spdlog
-	url = https://github.com/gabime/spdlog.git
-[submodule "third_party/custom_flashinfer"]
-	path = third_party/custom_flashinfer
-	url = https://github.com/kvcache-ai/custom_flashinfer.git
-	branch = fix-precision-mla-merge-main
-[submodule "third_party/xxHash"]
-	path = third_party/xxHash
-	url = https://github.com/Cyan4973/xxHash.git
-[submodule "third_party/prometheus-cpp"]
-	path = third_party/prometheus-cpp
-	url = https://github.com/jupp0r/prometheus-cpp
-[submodule "third_party/PhotonLibOS"]
-	path = third_party/PhotonLibOS
-	url = https://github.com/alibaba/PhotonLibOS.git
 [submodule "kt-kernel/third_party/llama.cpp"]
 	path = kt-kernel/third_party/llama.cpp
 	url = https://github.com/ggerganov/llama.cpp.git
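The trimmed `.gitmodules` can be inspected programmatically: git-config files are close enough to INI syntax that Python's stdlib `configparser` handles them once the tab indentation is stripped. A small sketch (the entry shown is the `kt-kernel/third_party/llama.cpp` submodule from the diff; the helper name `list_submodules` is illustrative, not part of any repository tooling):

```python
import configparser

# The surviving entry of the trimmed .gitmodules, as shown in the diff.
GITMODULES = """\
[submodule "kt-kernel/third_party/llama.cpp"]
\tpath = kt-kernel/third_party/llama.cpp
\turl = https://github.com/ggerganov/llama.cpp.git
"""

def list_submodules(text: str) -> dict:
    # git-config indents keys with tabs; strip indentation so configparser
    # treats each "key = value" as an ordinary option line.
    cleaned = "\n".join(line.strip() for line in text.splitlines())
    parser = configparser.ConfigParser()
    parser.read_string(cleaned)
    return {
        section.split('"')[1]: dict(parser[section])
        for section in parser.sections()
        if section.startswith("submodule")
    }

subs = list_submodules(GITMODULES)
print(subs["kt-kernel/third_party/llama.cpp"]["url"])
# -> https://github.com/ggerganov/llama.cpp.git
```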
README.md: 277 lines changed
@@ -1,217 +1,136 @@
 <div align="center">
   <!-- <h1>KTransformers</h1> -->
   <p align="center">
-    <picture>
-        <img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
-    </picture>
-  </p>
-  <h3>A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations</h3>
-  <strong><a href="#show-cases">🌟 Show Cases</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#tutorial">📃 Tutorial</a> | <a href="#Citation">🔥 Citation</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 Discussion</a> | <a href="#FAQ">🙋 FAQ</a></strong>
+    <picture>
+        <img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
+    </picture>
+  </p>
+  <h3>High-Performance CPU-GPU Hybrid Inference for Large Language Models</h3>
 </div>

-<h2 id="intro">🎉 Introduction</h2>
-KTransformers, pronounced as Quick Transformers, is designed to enhance your 🤗 <a href="https://github.com/huggingface/transformers">Transformers</a> experience with advanced kernel optimizations and placement/parallelism strategies.
-<br/><br/>
-KTransformers is a flexible, Python-centric framework designed with extensibility at its core.
-By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible
-interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI.
-<br/><br/>
-Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.
+## 🎯 Overview

-<h2 id="Updates">🔥 Updates</h2>
+KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project has evolved into **two core modules**: [kt-kernel](./kt-kernel/) and [KT-SFT](./KT-SFT/).

-* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tuning ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
-* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory integration ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
-* **Oct 27, 2025**: Support Ascend NPU ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
-* **Oct 10, 2025**: Integration into SGLang ([Roadmap](https://github.com/sgl-project/sglang/issues/11425))
-* **Sept 11, 2025**: Support Qwen3-Next ([Tutorial](./doc/en/Qwen3-Next.md))
-* **Sept 05, 2025**: Support Kimi-K2-0905 ([Tutorial](./doc/en/Kimi-K2.md))
-* **July 26, 2025**: Support SmallThinker and GLM4-MoE ([Tutorial](./doc/en/SmallThinker_and_Glm4moe.md))
-* **July 11, 2025**: Support Kimi-K2 ([Tutorial](./doc/en/Kimi-K2.md))
-* **June 30, 2025**: Support 3-layer (GPU-CPU-Disk) [prefix cache](./doc/en/prefix_cache.md) reuse
-* **May 14, 2025**: Support Intel Arc GPU ([Tutorial](./doc/en/xpu.md))
-* **Apr 29, 2025**: Support AMX-Int8, AMX-BF16 and Qwen3MoE ([Tutorial](./doc/en/AMX.md))
+## 🔥 Updates

-https://github.com/user-attachments/assets/fafe8aec-4e22-49a8-8553-59fb5c6b00a2
+* **Nov 6, 2025**: Support Kimi-K2-Thinking inference and fine-tuning
+* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory integration
+* **Oct 27, 2025**: Support Ascend NPU
+* **Oct 10, 2025**: Integration into SGLang ([Roadmap](https://github.com/sgl-project/sglang/issues/11425), [Blog](https://lmsys.org/blog/2025-10-22-KTransformers/))
+* **Sept 11, 2025**: Support Qwen3-Next
+* **Sept 05, 2025**: Support Kimi-K2-0905
+* **July 26, 2025**: Support SmallThinker and GLM4-MoE
+* **June 30, 2025**: Support 3-layer (GPU-CPU-Disk) prefix cache reuse
+* **May 14, 2025**: Support Intel Arc GPU
+* **Apr 29, 2025**: Support AMX-Int8, AMX-BF16 and Qwen3MoE
+* **Apr 9, 2025**: Experimental support for LLaMA 4 models
+* **Apr 2, 2025**: Support multi-concurrency
+* **Mar 15, 2025**: Support ROCm on AMD GPUs
+* **Mar 5, 2025**: Support unsloth 1.58/2.51-bit weights and IQ1_S/FP8 hybrid weights; 139K longer context for DeepSeek-V3/R1
+* **Feb 25, 2025**: Support FP8 GPU kernel for DeepSeek-V3 and R1
+* **Feb 10, 2025**: Support DeepSeek-R1 and V3, up to 3~28× speedup

-* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./doc/en/llama4.md))
-* **Apr 2, 2025**: Support multi-concurrency ([Tutorial](./doc/en/balance-serve.md))
+---

-https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a
+## 📦 Core Modules

-* **Mar 15, 2025**: Support ROCm on AMD GPUs ([Tutorial](./doc/en/ROCm.md))
-* **Mar 5, 2025**: Support unsloth 1.58/2.51-bit weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [longer context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM
-* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [longer context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context)
-* **Feb 15, 2025**: Longer context (from 4K to 8K for 24GB VRAM) and slightly faster speed (+15%, up to 16 tokens/s); updated [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online book](https://kvcache-ai.github.io/ktransformers/)
-* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single (24GB VRAM)/multi-GPU and 382GB DRAM, up to 3~28× speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md)
-* **Aug 28, 2024**: Decrease DeepSeek-V2's required VRAM from 21GB to 11GB
-* **Aug 15, 2024**: Updated detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU usage
-* **Aug 14, 2024**: Support llamafile as linear backend
-* **Aug 12, 2024**: Support multiple GPUs; support new models mixtral 8\*7B and 8\*22B; support q2k, q3k, q5k dequantization on GPU
-* **Aug 9, 2024**: Support Windows native
+### 🚀 [kt-kernel](./kt-kernel/) - High-Performance Inference Kernels

-<!-- * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md). -->
+CPU-optimized kernel operations for heterogeneous LLM inference.

-<h2 id="show-cases">🌟 Show Cases</h2>
+![kt-kernel](./doc/assets/heterogeneous_computing.png)

-<div>
-    <h3>GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
-</div>
+**Key Features:**
+- **AMX/AVX Acceleration**: Intel AMX and AVX512/AVX2 optimized kernels for INT4/INT8 quantized inference
+- **MoE Optimization**: Efficient Mixture-of-Experts inference with NUMA-aware memory management
+- **Quantization Support**: CPU-side INT4/INT8 quantized weights, GPU-side GPTQ support
+- **Easy Integration**: Clean Python API for SGLang and other frameworks

-https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
-
-</p>

-- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM ([Tutorial](./doc/en/DeepseekR1_V3_tutorial.md)).
-    - Prefill speed (tokens/s):
-        - KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
-        - Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to **27.79× speedup**.
-    - Decode speed (tokens/s):
-        - KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
-        - Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.
-    - Upcoming open-source release:
-        - AMX optimizations and selective expert activation will be open-sourced in V0.3.
-        - Currently available only in a preview binary distribution, which can be downloaded [here](./doc/en/DeepseekR1_V3_tutorial.md).
-- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).

-<p align="center">
-    <picture>
-        <img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
-    </picture>
-</p>

-- **Faster Speed:** Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) and [Marlin](https://github.com/IST-DASLab/marlin).
-- **VSCode Integration:** Wrapped into an OpenAI- and Ollama-compatible API for seamless integration as a backend for [Tabby](https://github.com/TabbyML/tabby) and various other frontends.
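The speedup factors quoted in the showcase above follow directly from the raw throughput figures in the bullet list; a quick sanity check (numbers taken from the list, nothing else assumed):

```python
# Throughput figures from the DeepSeek-Coder-V3/R1 showcase (tokens/s).
kt_prefill_best = 286.55    # V0.3, selectively using 6 experts
llamacpp_prefill = 10.31    # llama.cpp, 2x32 cores

kt_decode_best = 13.69      # V0.3, selectively using 6 experts
llamacpp_decode = 4.51      # llama.cpp, 2x32 cores

prefill_speedup = kt_prefill_best / llamacpp_prefill
decode_speedup = kt_decode_best / llamacpp_decode

print(f"prefill: {prefill_speedup:.2f}x")  # prefill: 27.79x
print(f"decode:  {decode_speedup:.2f}x")   # ~3.04x; the README quotes 3.03x
```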
-<p align="center">
-
-https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
-
-</p>

-<!-- <h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
-<p align="center">
-
-https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
-
-* **1M Context InternLM 2.5 7B**: Operates at full bf16 precision, utilizing 24GB VRAM and 150GB DRAM, which is feasible on a local desktop setup. It achieves a 92.88% success rate on the 1M "Needle In a Haystack" test and 100% on the 128K NIAH test.
-
-<p align="center">
-    <picture>
-        <img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
-    </picture>
-</p>
-
-<p align="center">
-    <picture>
-        <img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
-    </picture>
-</p>
-
-* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10× faster than the full-attention approach of llama.cpp.
-
-* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU-offloaded decoding. Compatible with SnapKV, Quest, and InfLLM. Further information is available [here](./doc/en/long_context_introduction.md).
--->

-<strong>More advanced features are coming soon, so stay tuned!</strong>

-<h2 id="quick-start">🚀 Quick Start</h2>
-
-Getting started with KTransformers is simple! Follow the steps below to set up and start using it.
-
-We already support the following vendors:
-
-- Metax
-- Sanechips (ZhuFeng V1.0)
-- Intel
-- Ascend
-- Kunpeng
-- AMD
-
-### 📥 Installation
-
-To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).

-<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
-At the heart of KTransformers is a user-friendly, template-based injection framework.
-This allows researchers to easily replace original torch modules with optimized variants. It also simplifies the process of combining multiple optimizations, allowing the exploration of their synergistic effects.
-
-</br>
-<p align="center">
-    <picture>
-        <img alt="Inject-Struction" src="https://github.com/user-attachments/assets/6b4c1e54-9f6d-45c5-a3fc-8fa45e7d257e" width=65%>
-    </picture>
-</p>
-
-Given that vLLM already serves as a great framework for large-scale deployment optimizations, KTransformers is particularly focused on local deployments that are constrained by limited resources. We pay special attention to heterogeneous computing opportunities, such as GPU/CPU offloading of quantized models. For example, we support the efficient <a href="https://github.com/Mozilla-Ocho/llamafile/tree/main">Llamafile</a> and <a href="https://github.com/IST-DASLab/marlin">Marlin</a> kernels for CPU and GPU, respectively. More details can be found <a href="doc/en/operators/llamafile.md">here</a>.

-<h3>Example Usage</h3>
-To utilize the provided kernels, users only need to create a YAML-based injection template and add the call to `optimize_and_load_gguf` before using the Transformers model.
-
-```python
-with torch.device("meta"):
-    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
-...
-generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
-```
+**Quick Start:**
+```bash
+cd kt-kernel
+pip install .
+```

-In this example, the AutoModel is first initialized on the meta device to avoid occupying any memory resources. Then, `optimize_and_load_gguf` iterates through all sub-modules of the model, matches rules specified in your YAML rule file, and replaces them with advanced modules as specified.
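The match-and-iterate behavior described here can be sketched without torch at all. Below is a minimal, framework-free illustration of the idea: walk a module tree, match each dotted name against a regex and a class, and swap in a replacement. The `Module`, `Rule`, and `walk_and_replace` names are toy stand-ins invented for this sketch, not the real KTransformers API:

```python
import re
from dataclasses import dataclass, field

# Toy stand-ins for torch modules; illustrative only.
@dataclass
class Module:
    children: dict = field(default_factory=dict)

class Linear(Module): pass            # the module type to be replaced
class OptimizedLinear(Module): pass   # stand-in for an injected kernel

@dataclass
class Rule:
    name_pattern: str   # regex over the dotted module path
    match_class: type   # only replace modules of this class
    replacement: type

def walk_and_replace(module, rules, prefix=""):
    """Recurse through the module tree; swap any child whose dotted
    name and class both match a rule (the spirit of optimize_and_load_gguf)."""
    for name, child in list(module.children.items()):
        path = f"{prefix}.{name}" if prefix else name
        for rule in rules:
            if re.match(rule.name_pattern, path) and isinstance(child, rule.match_class):
                module.children[name] = rule.replacement(children=child.children)
                break
        walk_and_replace(module.children[name], rules, path)

model = Module(children={"layers": Module(children={"0": Module(children={"proj": Linear()})})})
walk_and_replace(model, [Rule(r"^layers\..*proj$", Linear, OptimizedLinear)])
print(type(model.children["layers"].children["0"].children["proj"]).__name__)
# -> OptimizedLinear
```

Note how the rule requires both the name regex and the class to match, mirroring the `match: {name, class}` semantics of the YAML templates.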
+**Use Cases:**
+
+- CPU-GPU hybrid inference for large MoE models
+- Integration with SGLang for production serving
+- Heterogeneous expert placement (hot experts on GPU, cold experts on CPU)

-After injection, the original `generate` interface is available, but we also provide a compatible `prefill_and_generate` method, which enables further optimizations like CUDAGraph to improve generation speed.
-
-<h3>How to customize your model</h3>

+**Performance Examples:**
+
+| Model | Hardware Configuration | Total Throughput | Output Throughput |
+|-------|------------------------|------------------|-------------------|
+| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8-way concurrency) |

-A detailed tutorial of injection and multi-GPU usage, using DeepSeek-V2 as an example, is given [here](doc/en/injection_tutorial.md).

+👉 **[Full Documentation →](./kt-kernel/README.md)**

-Below is an example of a YAML template for replacing all original Linear modules with Marlin, an advanced 4-bit quantization kernel.

+---

-```yaml
-- match:
-    name: "^model\\.layers\\..*$"  # regular expression
-    class: torch.nn.Linear  # only match modules matching name and class simultaneously
-  replace:
-    class: ktransformers.operators.linear.KTransformerLinear  # optimized kernel on quantized data types
-    device: "cpu"  # which device to load this module on when initializing
-    kwargs:
-      generate_device: "cuda"
-      generate_linear_type: "QuantizedLinearMarlin"
-```

+### 🎓 [KT-SFT](./KT-SFT/) - Fine-Tuning Framework
+
+KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning.
+
+(image: KT-SFT)
+
+**Key Features:**
+
+- **Resource Efficient**: Fine-tune 671B DeepSeek-V3 with just **70GB GPU memory** + 1.3TB RAM
+- **LoRA Support**: Full LoRA fine-tuning with heterogeneous acceleration
+- **LLaMA-Factory Integration**: Seamless integration with the popular fine-tuning framework
+- **Production Ready**: Chat, batch inference, and metrics evaluation
+
+**Performance Examples:**
+
+| Model | Configuration | Throughput | GPU Memory |
+|-------|--------------|------------|------------|
+| DeepSeek-V3 (671B) | LoRA + AMX | ~40 tokens/s | 70GB (multi-GPU) |
+| DeepSeek-V2-Lite (14B) | LoRA + AMX | ~530 tokens/s | 6GB |
+
+**Quick Start:**
+```bash
+cd KT-SFT
+# Install the environment following KT-SFT/README.md
+USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
+```

-Each rule in the YAML file has two parts: `match` and `replace`. The `match` part specifies which module should be replaced, and the `replace` part specifies the module to be injected into the model along with the initialization keywords.
+👉 **[Full Documentation →](./KT-SFT/README.md)**

-You can find example rule templates for optimizing DeepSeek-V2 and Qwen2-57B-A14B, two SOTA MoE models, in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory. These templates are used to power the `local_chat.py` demo.
+---

-If you are interested in our design principles and the implementation of the injection framework, please refer to the [design document](doc/en/deepseek-v2-injection.md).
+## 🔥 Citation

-<h2 id="Citation">🔥 Citation</h2>
+If you use KTransformers in your research, please cite our paper:

-If you use KTransformers for your research, please cite our [paper](https://madsys.cs.tsinghua.edu.cn/publication/ktransformers-unleashing-the-full-potential-of-cpu/gpu-hybrid-inference-for-moe-models/):

-```
+```bibtex
 @inproceedings{10.1145/3731569.3764843,
-  title = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
-  author = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
-  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
-  year = {2025}
+  title = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
+  author = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
+  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
+  year = {2025}
 }
 ```

-<h2 id="ack">Acknowledgment and Contributors</h2>
+## 👥 Contributors & Team

-The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, SGLang and FlashInfer. We are planning to contribute back to the community by upstreaming our modifications.
+Developed and maintained by:
+- [MADSys Lab](https://madsys.cs.tsinghua.edu.cn/) @ Tsinghua University
+- [Approaching.AI](http://approaching.ai/)
+- Community contributors

-KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
+We welcome contributions! Please feel free to submit issues and pull requests.

-<h2 id="ack">Discussion</h2>
+## 💬 Community & Support

-If you have any questions, feel free to open an issue. Alternatively, you can join our WeChat group for further discussion. QR code: [WeChat Group](WeChatGroup.png)
+- **GitHub Issues**: [Report bugs or request features](https://github.com/kvcache-ai/ktransformers/issues)
+- **GitHub Discussions**: [Ask questions and share ideas](https://github.com/kvcache-ai/ktransformers/discussions)
+- **WeChat Group**: See [archive/WeChatGroup.png](./archive/WeChatGroup.png)

-<h2 id="FAQ">🙋 FAQ</h2>
+## 📦 Legacy Code

-Some common questions are answered in the [FAQ](doc/en/FAQ.md).
+The original integrated KTransformers framework has been archived to the [`archive/`](./archive/) directory for reference. The project now focuses on the two core modules above for better modularity and maintainability.
+
+For the original documentation with full quick-start guides and examples, see:
+- [archive/README_LEGACY.md](./archive/README_LEGACY.md) (English)
+- [archive/README_ZH_LEGACY.md](./archive/README_ZH_LEGACY.md) (中文)
README_ZH.md: 232 lines changed
@@ -1,166 +1,132 @@
|
||||
<div align="center">
|
||||
<!-- <h1>KTransformers</h1> -->
|
||||
<p align="center">
|
||||
|
||||
<picture>
|
||||
<img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
|
||||
|
||||
</picture>
|
||||
|
||||
</p>
|
||||
<h3>一个用于体验尖端 LLM 推理优化的灵活框架</h3>
|
||||
<strong><a href="#show-cases">🌟 案例展示</a> | <a href="#quick-start">🚀 快速入门</a> | <a href="#tutorial">📃 教程</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 讨论</a> | <a href="#FAQ">🙋 常见问题</a> </strong>
|
||||
<picture>
|
||||
<img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
|
||||
</picture>
|
||||
</p>
|
||||
<h3>高性能 CPU-GPU 异构大语言模型推理</h3>
|
||||
</div>
|
||||
|
||||
<h2 id="intro">🎉 介绍</h2>
|
||||
KTransformers(发音为 Quick Transformers)旨在通过先进的内核优化和放置/并行策略来增强您对 🤗 [Transformers](https://github.com/huggingface/transformers) 的体验。
|
||||
<br/><br/>
|
||||
KTransformers 是一个以 Python 为中心的灵活框架,其核心是可扩展性。通过用一行代码实现并注入优化模块,用户可以获得与 Transformers 兼容的接口、符合 OpenAI 和 Ollama 的 RESTful API,甚至是一个简化的类似 ChatGPT 的 Web 界面。
|
||||
<br/><br/>
|
||||
我们对 KTransformers 的愿景是成为一个用于实验创新 LLM 推理优化的灵活平台。如果您需要任何其他功能,请告诉我们。
|
||||
## 🎯 项目概述
|
||||
|
||||
<h2 id="Updates">🔥 更新</h2>
|
||||
KTransformers 是一个专注于大语言模型高效推理和微调的研究项目,通过 CPU-GPU 异构计算实现资源受限环境下的模型部署。项目已演进为**两个核心模块**:[kt-kernel](./kt-kernel/) 和 [KT-SFT](./KT-SFT/)。
|
||||
|
||||
* **2025 年 2 月 15 日**:为DeepSeek-V3/R1支持[FP8 GPU内核](./doc/en/fp8_kernel.md); 支持更长的上下文([教程](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context)).
|
||||
* **2025 年 2 月 15 日**:长上下文(从4K到8K,24GB VRAM) & 稍快的速度(+15%)(最快 16 Tokens/s),文档请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md) 和 [在线指南](https://kvcache-ai.github.io/ktransformers/) 。
|
||||
* **2025 年 2 月 10 日**:支持 Deepseek-R1 和 V3 在单个(24GB VRAM)/多 GPU 和 382G DRAM 上运行,速度提升高达 3~28 倍。详细教程请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md)。
|
||||
* **2024 年 8 月 28 日**:支持 InternLM2.5-7B-Chat-1M 模型下的 1M 上下文,使用 24GB 的 VRAM 和 150GB 的 DRAM。详细教程请参见 [这里](./doc/en/long_context_tutorial.md)。
|
||||
* **2024 年 8 月 28 日**:将 DeepseekV2 所需的 VRAM 从 21G 降低到 11G。
|
||||
* **2024 年 8 月 15 日**:更新了详细的 [教程](doc/en/injection_tutorial.md),介绍注入和多 GPU 的使用。
|
||||
* **2024 年 8 月 14 日**:支持 llamfile 作为线性后端。
|
||||
* **2024 年 8 月 12 日**:支持多 GPU;支持新模型:mixtral 8\*7B 和 8\*22B;支持 q2k、q3k、q5k 在 GPU 上的去量化。
|
||||
* **2024 年 8 月 9 日**:支持 Windows。
|
||||
## 🔥 更新
|
||||
|
||||
<h2 id="show-cases">🌟 案例展示</h2>
|
||||
* **2025年11月6日**:支持 Kimi-K2-Thinking 推理和微调
|
||||
* **2025年11月4日**:KTransformers 微调 × LLaMA-Factory 集成
|
||||
* **2025年10月27日**:支持 Ascend NPU
|
||||
* **2025年10月10日**:集成到 SGLang ([路线图](https://github.com/sgl-project/sglang/issues/11425), [博客](https://lmsys.org/blog/2025-10-22-KTransformers/))
|
||||
* **2025年9月11日**:支持 Qwen3-Next
|
||||
* **2025年9月5日**:支持 Kimi-K2-0905
|
||||
* **2025年7月26日**:支持 SmallThinker 和 GLM4-MoE
|
||||
* **2025年6月30日**:支持 3层(GPU-CPU-磁盘)前缀缓存复用
|
||||
* **2025年5月14日**:支持 Intel Arc GPU
|
||||
* **2025年4月29日**:支持 AMX-Int8、AMX-BF16 和 Qwen3MoE
|
||||
* **2025年4月9日**:实验性支持 LLaMA 4 模型
|
||||
* **2025年4月2日**:支持多并发
|
||||
* **2025年3月15日**:支持 AMD GPU 的 ROCm
|
||||
* **2025年3月5日**:支持 unsloth 1.58/2.51 bits 权重和 IQ1_S/FP8 混合权重;DeepSeek-V3/R1 支持 139K 长上下文
|
||||
* **2025年2月25日**:支持 DeepSeek-V3 和 R1 的 FP8 GPU 内核
|
||||
* **2025年2月10日**:支持 Deepseek-R1 和 V3,速度提升最高达 3~28 倍
|
||||
|
||||
<div>
|
||||
<h3>在仅 24GB VRAM 的桌面上运行 GPT-4/o1 级别的本地 VSCode Copilot</h3>
|
||||
</div>
|
||||
---
|
||||
|
||||
https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
|
||||
## 📦 核心模块
|
||||
|
||||
</p>
|
||||
### 🚀 [kt-kernel](./kt-kernel/) - 高性能推理内核
|
||||
|
||||
- **[NEW!!!] 本地 671B DeepSeek-Coder-V3/R1**:使用其 Q4_K_M 版本,仅需 14GB VRAM 和 382GB DRAM 即可运行(教程请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md))。
|
||||
- 预填充速度(tokens/s):
|
||||
- KTransformers:54.21(32 核)→ 74.362(双插槽,2×32 核)→ 255.26(优化的 AMX 基 MoE 内核,仅 V0.3)→ 286.55(选择性使用 6 个专家,仅 V0.3)
|
||||
- 与 llama.cpp 在 2×32 核下相比,达到 **27.79× 速度提升**。
|
||||
- 解码速度(tokens/s):
|
||||
- KTransformers:8.73(32 核)→ 11.26(双插槽,2×32 核)→ 13.69(选择性使用 6 个专家,仅 V0.3)
|
||||
- 与 llama.cpp 在 2×32 核下相比,达到 **3.03× 速度提升**。
|
||||
- 即将开源发布:
|
||||
- AMX 优化和选择性专家激活将在 V0.3 中开源。
|
||||
- 目前仅在预览二进制分发中可用,可从 [这里](./doc/en/DeepseekR1_V3_tutorial.md) 下载。
|
||||
面向异构 LLM 推理的 CPU 优化内核操作库。
|
||||
|
||||
- **本地 236B DeepSeek-Coder-V2**:使用其 Q4_K_M 版本,仅需 21GB VRAM 和 136GB DRAM 即可运行,甚至在 [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench) 中得分超过 GPT4-0613。
|
||||

|
||||
|
||||
<p align="center">
|
||||
<picture>
|
||||
<img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
|
||||
</picture>
|
||||
</p>
|
||||
**核心特性:**
|
||||
- **AMX/AVX 加速**:Intel AMX 和 AVX512/AVX2 优化内核,支持 INT4/INT8 量化推理
|
||||
- **MoE 优化**:高效的专家混合推理,支持 NUMA 感知内存管理
|
||||
- **量化支持**:CPU 端 INT4/INT8 量化权重,GPU 端 GPTQ 支持
|
||||
- **易于集成**:简洁的 Python API,可集成到 SGLang 等框架
|
||||
|
||||
- **更快的速度**:通过 MoE 卸载和注入来自 [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) 和 [Marlin](https://github.com/IST-DASLab/marlin) 的高级内核,实现了 2K 提示预填充 126 tokens/s 和生成 13.6 tokens/s 的速度。
|
||||
- **VSCode 集成**:封装成符合 OpenAI 和 Ollama 的 API,可无缝集成到 [Tabby](https://github.com/TabbyML/tabby) 和其他前端的后端。
|
||||
|
||||
<p align="center">
|
||||
|
||||
https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
|
||||
|
||||
</p>
|
||||
|
||||
<!-- <h3>在仅 24GB VRAM 的桌面上进行 1M 上下文本地推理</h3>
|
||||
<p align="center"> -->
|
||||
|
||||
<!-- https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12 -->
|
||||
<!--
|
||||
* **1M 上下文 InternLM 2.5 7B**:以全 bf16 精度运行,使用 24GB VRAM 和 150GB DRAM,可在本地桌面设置中实现。在 1M "针在干草堆中" 测试中达到 92.88% 的成功率,在 128K NIAH 测试中达到 100%。
|
||||
|
||||
<p align="center">
|
||||
<picture>
|
||||
<img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
|
||||
</picture>
|
||||
</p>
|
||||
|
||||
<p align="center">
|
||||
<picture>
|
||||
<img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
|
||||
</picture>
|
||||
</p>
|
||||
|
||||
* **增强的速度**:使用稀疏注意力,通过 llamafile 内核实现 1M 上下文生成 16.91 tokens/s 的速度。这种方法比 llama.cpp 的全注意力方法快 10 倍以上。
|
||||
|
||||
* **灵活的稀疏注意力框架**:提供了一个灵活的块稀疏注意力框架,用于 CPU 卸载解码。与 SnapKV、Quest 和 InfLLm 兼容。更多信息请参见 [这里](./doc/en/long_context_introduction.md)。 -->
|
||||
|
||||
<strong>更多高级功能即将推出,敬请期待!</strong>
|
||||
|
||||
<h2 id="quick-start">🚀 快速入门</h2>
|
||||
|
||||
|
||||
KTransformers 的入门非常简单!请参考我们的[安装指南]((https://kvcache-ai.github.io/ktransformers/))进行安装。
|
||||
|
||||
<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework. It lets researchers easily replace original torch modules with optimized variants, and it simplifies combining multiple optimizations so that their synergistic effects can be explored.
</br>
<p align="center">
  <picture>
    <img alt="Inject-Struction" src="https://github.com/user-attachments/assets/6b4c1e54-9f6d-45c5-a3fc-8fa45e7d257e" width=65%>
  </picture>
</p>

Given that vLLM already serves as a great framework for large-scale deployment optimizations, KTransformers focuses on local deployments that are constrained by limited resources. We pay special attention to heterogeneous computing opportunities, such as GPU/CPU offloading of quantized models. For example, we support the efficient <a href="https://github.com/Mozilla-Ocho/llamafile/tree/main">Llamafile</a> and <a href="https://github.com/IST-DASLab/marlin">Marlin</a> kernels for CPU and GPU, respectively. More details can be found <a href="doc/en/operators/llamafile.md">here</a>.

<h3>Example Usage</h3>
To utilize the provided kernels, users only need to create a YAML-based injection template and add a call to `optimize_and_load_gguf` before using the Transformers model.
```python
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
...
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```

In this example, the AutoModel is first initialized on the meta device to avoid occupying any memory resources. Then, `optimize_and_load_gguf` iterates through all sub-modules of the model, matches the rules specified in your YAML rule file, and replaces them with the advanced modules as specified.

After injection, the original `generate` interface is still available, but we also provide a compatible `prefill_and_generate` method, which enables further optimizations such as CUDAGraph to improve generation speed.

<h3>How to Customize Your Model</h3>

A detailed tutorial on injection and multi-GPU usage, using DeepSeek-V2 as an example, is available [here](doc/en/injection_tutorial.md).

Below is an example YAML template for replacing all original Linear modules with Marlin, an advanced 4-bit quantization kernel.

```yaml
- match:
    name: "^model\\.layers\\..*$"  # regular expression
    class: torch.nn.Linear  # only match modules that match both the name and the class
  replace:
    class: ktransformers.operators.linear.KTransformerLinear  # optimized kernel for quantized data types
    device: "cpu"  # the device on which to load this module during initialization
    kwargs:
      generate_device: "cuda"
      generate_linear_type: "QuantizedLinearMarlin"
```

Each rule in the YAML file has two parts: `match` and `replace`. The `match` part specifies which modules should be replaced, and the `replace` part specifies the module to be injected into the model along with its initialization keywords.

You can find example rule templates for optimizing DeepSeek-V2 and Qwen2-57B-A14 in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory. These templates power the `local_chat.py` demo.

If you are interested in our design principles and the implementation of the injection framework, please refer to the [design document](doc/en/deepseek-v2-injection.md).

---

**kt-kernel Quick Start:**

```bash
cd kt-kernel
pip install .
```

**Use Cases:**
- CPU-GPU hybrid inference for large MoE models
- Integration with SGLang for production serving
- Heterogeneous expert placement (hot experts on GPU, cold experts on CPU)

**Performance Example:**

| Model | Hardware | Total Throughput | Output Throughput |
|------|---------|---------|-----------|
| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8-way concurrency) |

👉 **[Full Documentation →](./kt-kernel/README.md)**

---

### 🎓 [KT-SFT](./KT-SFT/) - Fine-tuning Framework

KTransformers × LLaMA-Factory integration, supporting fine-tuning of ultra-large MoE models.

![KT-SFT Fine-tuning Framework](doc/assets/image_sft.png)

**Key Features:**
- **Resource Efficient**: Fine-tune the 671B DeepSeek-V3 with only **70GB VRAM** + 1.3TB RAM
- **LoRA Support**: Full LoRA fine-tuning with heterogeneous acceleration
- **LLaMA-Factory Integration**: Seamless integration with the popular fine-tuning framework
- **Production Ready**: Supports chat, batch inference, and metric evaluation

**Performance Example:**

| Model | Configuration | Throughput | GPU VRAM |
|------|------|--------|----------|
| DeepSeek-V3 (671B) | LoRA + AMX | ~40 tokens/s | 70GB (multi-GPU) |
| DeepSeek-V2-Lite (14B) | LoRA + AMX | ~530 tokens/s | 6GB |

**Quick Start:**

```bash
cd KT-SFT
# Set up the environment following KT-SFT/README.md
USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
```

👉 **[Full Documentation →](./KT-SFT/README.md)**

---
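The match/replace mechanism described in the injection tutorial can be sketched framework-agnostically. The following is a minimal illustration, not the actual KTransformers implementation: the `Rule` class, the toy `Linear`/`MarlinLinear` classes, and the flat name-to-module mapping (standing in for `model.named_modules()`) are all hypothetical.

```python
import re

# Toy stand-ins for torch modules (hypothetical; for illustration only).
class Linear:
    pass

class MarlinLinear(Linear):
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class Rule:
    """One match/replace rule, mirroring the YAML rule structure."""
    def __init__(self, name_pattern, match_class, replace_factory, **kwargs):
        self.name_re = re.compile(name_pattern)
        self.match_class = match_class
        self.replace_factory = replace_factory
        self.kwargs = kwargs

def inject(modules, rules):
    """Replace every module whose dotted name and class both match a rule."""
    for name, module in list(modules.items()):
        for rule in rules:
            if rule.name_re.match(name) and isinstance(module, rule.match_class):
                modules[name] = rule.replace_factory(**rule.kwargs)
                break
    return modules

# A flat name -> module mapping, standing in for model.named_modules().
model = {
    "model.layers.0.mlp.gate_proj": Linear(),
    "model.layers.0.self_attn": object(),  # not a Linear: left untouched
    "lm_head": Linear(),                   # name does not match: left untouched
}
rules = [Rule(r"^model\.layers\..*$", Linear, MarlinLinear, generate_device="cuda")]
injected = inject(model, rules)
print(type(injected["model.layers.0.mlp.gate_proj"]).__name__)  # MarlinLinear
print(type(injected["lm_head"]).__name__)                       # Linear
```

Note how a rule fires only when both the name regex and the class check succeed, matching the `match.name` / `match.class` semantics of the YAML templates.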
## 🔥 Citation

If you use KTransformers in your research, please cite our paper:

```bibtex
@inproceedings{10.1145/3731569.3764843,
  title = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
  author = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
  year = {2025}
}
```

<h2 id="ack">Acknowledgments and Contributors</h2>

The development of KTransformers builds on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, SGLang, and FlashInfer. We plan to contribute back to the community by upstreaming our modifications.

KTransformers is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members of <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.
## 👥 Contributors & Team

Developed and maintained by:
- [MADSys Lab](https://madsys.cs.tsinghua.edu.cn/), Tsinghua University
- [Approaching.AI](http://approaching.ai/)
- Community contributors
<h2 id="discussion">💬 Community & Support</h2>

We welcome contributions! Feel free to submit issues and pull requests.

- **GitHub Issues**: [Report bugs or request features](https://github.com/kvcache-ai/ktransformers/issues)
- **GitHub Discussions**: [Ask questions and share ideas](https://github.com/kvcache-ai/ktransformers/discussions)
- **WeChat Group**: See the QR code in [archive/WeChatGroup.png](./archive/WeChatGroup.png)

<h2 id="FAQ">🙋 FAQ</h2>

Answers to some common questions can be found in the [FAQ](doc/en/FAQ.md).
## 📦 Legacy Code

The original, fully integrated KTransformers framework has been archived under the [`archive/`](./archive/) directory for reference. The project now focuses on the two core modules above for better modularity and maintainability.

For the original full documentation (including quick-start guides and examples), see:
- [archive/README_LEGACY.md](./archive/README_LEGACY.md) (English)
- [archive/README_ZH_LEGACY.md](./archive/README_ZH_LEGACY.md) (Chinese)
**File: `archive/.gitmodules`** (new file, vendored, 28 lines)
[submodule "third_party/llama.cpp"]
	path = archive/third_party/llama.cpp
	url = https://github.com/ggerganov/llama.cpp.git
[submodule "third_party/pybind11"]
	path = archive/third_party/pybind11
	url = https://github.com/pybind/pybind11.git
[submodule "third_party/spdlog"]
	path = archive/third_party/spdlog
	url = https://github.com/gabime/spdlog.git
[submodule "third_party/custom_flashinfer"]
	path = archive/third_party/custom_flashinfer
	url = https://github.com/kvcache-ai/custom_flashinfer.git
	branch = fix-precision-mla-merge-main
[submodule "third_party/xxHash"]
	path = archive/third_party/xxHash
	url = https://github.com/Cyan4973/xxHash.git
[submodule "third_party/prometheus-cpp"]
	path = archive/third_party/prometheus-cpp
	url = https://github.com/jupp0r/prometheus-cpp
[submodule "third_party/PhotonLibOS"]
	path = archive/third_party/PhotonLibOS
	url = https://github.com/alibaba/PhotonLibOS.git
[submodule "kt-kernel/third_party/llama.cpp"]
	path = kt-kernel/third_party/llama.cpp
	url = https://github.com/ggerganov/llama.cpp.git
[submodule "kt-kernel/third_party/pybind11"]
	path = kt-kernel/third_party/pybind11
	url = https://github.com/pybind/pybind11.git
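A `.gitmodules` file uses standard INI syntax, so its entries can be enumerated programmatically. A minimal sketch, assuming Python's stdlib `configparser` (the sample string below is just an excerpt of the archived file above):

```python
import configparser

# Excerpt of the archived .gitmodules above, in standard INI form.
gitmodules = """
[submodule "third_party/llama.cpp"]
    path = archive/third_party/llama.cpp
    url = https://github.com/ggerganov/llama.cpp.git
[submodule "third_party/custom_flashinfer"]
    path = archive/third_party/custom_flashinfer
    url = https://github.com/kvcache-ai/custom_flashinfer.git
    branch = fix-precision-mla-merge-main
"""

parser = configparser.ConfigParser()
parser.read_string(gitmodules)

# Each section is one submodule; the section name is the quoted part,
# and 'branch' is an optional key.
submodules = {
    section.split('"')[1]: dict(parser[section])
    for section in parser.sections()
}
print(sorted(submodules))
print(submodules["third_party/custom_flashinfer"]["branch"])
```

This kind of helper could be used, for example, to verify that only the two kt-kernel submodules remain in the root configuration after the restructuring.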
**File: `archive/README.md`** (new file, 103 lines)
# Archive - Legacy KTransformers Code

This directory contains the original integrated KTransformers framework code, archived as part of the repository restructuring.

## 📋 What's Here

This archive preserves the complete original KTransformers implementation, including:

- **Core Framework** (`ktransformers/`): Original integrated inference framework
- **C/C++ Extensions** (`csrc/`): Low-level kernel implementations
- **Third-party Dependencies** (`third_party/`): Vendored external libraries
- **Git Submodules** (`.gitmodules`): Complete submodule configuration for legacy dependencies
- **Build System**: Installation scripts, Dockerfiles, and configuration files
- **Legacy Documentation**: Original README files with full quick-start guides

## 📚 Documentation

### Original README Files

- **[English README (Legacy)](./README_LEGACY.md)**: Complete original English documentation with:
  - Quick Start guides
  - Show cases and benchmarks
  - Injection tutorial
  - Full installation instructions

- **[中文 README (Legacy)](./README_ZH_LEGACY.md)**: Complete original Chinese documentation with:
  - Quick Start guides
  - Show cases and benchmarks
  - Injection tutorial
  - Full installation instructions

## 🔄 Migration to New Structure

The KTransformers project has evolved into two focused modules:

### For Inference (CPU-optimized kernels):
→ Use **[kt-kernel](../kt-kernel/)** instead

### For Fine-tuning (LLaMA-Factory integration):
→ Use **[KT-SFT](../KT-SFT/)** instead

## ⚠️ Status

This code is **archived for reference only**. For active development and support:

- **Inference**: See [kt-kernel](../kt-kernel/)
- **Fine-tuning**: See [KT-SFT](../KT-SFT/)
- **Documentation**: See the [doc](../doc/) directory
- **Issues**: Visit [GitHub Issues](https://github.com/kvcache-ai/ktransformers/issues)

## 🔧 Git Submodules (For Researchers)

The root `.gitmodules` only contains kt-kernel's dependencies, to keep the repository lightweight. If you need to build the legacy code, you can use the archived submodule configuration:

```bash
# Copy the complete submodule configuration
cp archive/.gitmodules .gitmodules

# Initialize the legacy submodules
git submodule update --init --recursive archive/third_party/
```

**Note**: This will download roughly 500MB of additional dependencies.

## 📦 Contents Overview

```
archive/
├── README.md            # This file
├── README_LEGACY.md     # Original English documentation
├── README_ZH_LEGACY.md  # Original Chinese documentation
├── .gitmodules          # Complete git submodule configuration (7 legacy submodules)
├── ktransformers/       # Original framework code
├── csrc/                # C/C++ extensions
├── third_party/         # External dependencies (submodules not initialized by default)
├── setup.py             # Original installation script
├── pyproject.toml       # Python project configuration
├── Dockerfile*          # Container configurations
├── install*.sh          # Installation scripts
└── ...                  # Other legacy files
```

## 💡 Why Archived?

The original monolithic framework has been refactored into modular components for:

1. **Better Maintainability**: Separated concerns between inference and fine-tuning
2. **Easier Integration**: Cleaner APIs for external frameworks (SGLang, LLaMA-Factory)
3. **Focused Development**: Dedicated modules with specific optimization goals
4. **Reduced Complexity**: Smaller, more manageable codebases

## 🔗 Related Resources

- **Main Repository**: [../README.md](../README.md)
- **kt-kernel Documentation**: [../kt-kernel/README.md](../kt-kernel/README.md)
- **KT-SFT Documentation**: [../KT-SFT/README.md](../KT-SFT/README.md)
- **Project Website**: https://kvcache-ai.github.io/ktransformers/

---

<div align="center">
  <sub>Archived on 2025-11 as part of the repository restructuring</sub>
</div>
**File: `archive/README_LEGACY.md`** (new file, 217 lines)
<div align="center">
  <!-- <h1>KTransformers</h1> -->
  <p align="center">
    <picture>
      <img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
    </picture>
  </p>
  <h3>A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations</h3>
  <strong><a href="#show-cases">🌟 Show Cases</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#tutorial">📃 Tutorial</a> | <a href="#Citation">🔥 Citation</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 Discussion</a> | <a href="#FAQ">🙋 FAQ</a></strong>
</div>

<h2 id="intro">🎉 Introduction</h2>
KTransformers, pronounced as Quick Transformers, is designed to enhance your 🤗 <a href="https://github.com/huggingface/transformers">Transformers</a> experience with advanced kernel optimizations and placement/parallelism strategies.
<br/><br/>
KTransformers is a flexible, Python-centric framework designed with extensibility at its core. By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI.
<br/><br/>
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.

<h2 id="Updates">🔥 Updates</h2>

* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tuning ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md)).
* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory integration. ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
* **Oct 27, 2025**: Support Ascend NPU. ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
* **Oct 10, 2025**: Integrating into SGLang. ([Roadmap](https://github.com/sgl-project/sglang/issues/11425))
* **Sept 11, 2025**: Support Qwen3-Next. ([Tutorial](./doc/en/Qwen3-Next.md))
* **Sept 05, 2025**: Support Kimi-K2-0905. ([Tutorial](./doc/en/Kimi-K2.md))
* **July 26, 2025**: Support SmallThinker and GLM4-MoE. ([Tutorial](./doc/en/SmallThinker_and_Glm4moe.md))
* **July 11, 2025**: Support Kimi-K2. ([Tutorial](./doc/en/Kimi-K2.md))
* **June 30, 2025**: Support 3-layer (GPU-CPU-Disk) [prefix cache](./doc/en/prefix_cache.md) reuse.
* **May 14, 2025**: Support Intel Arc GPU ([Tutorial](./doc/en/xpu.md)).
* **Apr 29, 2025**: Support AMX-Int8, AMX-BF16 and Qwen3MoE ([Tutorial](./doc/en/AMX.md)).

https://github.com/user-attachments/assets/fafe8aec-4e22-49a8-8553-59fb5c6b00a2

* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./doc/en/llama4.md)).
* **Apr 2, 2025**: Support multi-concurrency ([Tutorial](./doc/en/balance-serve.md)).

https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a

* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51-bit weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support a 139K [longer context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support the [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [longer context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 15, 2025**: Longer context (from 4K to 8K for 24GB VRAM) and slightly faster speed (+15%, up to 16 tokens/s); updated [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single (24GB VRAM)/multi-GPU and 382GB DRAM, with up to 3~28× speedup. For a detailed show case and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21GB to 11GB.
* **Aug 15, 2024**: Update the detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as a linear backend.
* **Aug 12, 2024**: Support multiple GPUs; support new models: Mixtral 8\*7B and 8\*22B; support q2k, q3k, q5k dequantization on GPU.
* **Aug 9, 2024**: Support Windows native.

<!-- * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md). -->
<h2 id="show-cases">🌟 Show Cases</h2>

<div>
<h3>GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
</div>

https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285

</p>

- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM ([Tutorial](./doc/en/DeepseekR1_V3_tutorial.md)).

  - Prefill Speed (tokens/s):
    - KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
    - Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to a **27.79× speedup**.
  - Decode Speed (tokens/s):
    - KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
    - Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to a **3.03× speedup**.
  - Upcoming Open Source Release:
    - AMX optimizations and selective expert activation will be open-sourced in V0.3.
    - Currently available only in a preview binary distribution, which can be downloaded [here](./doc/en/DeepseekR1_V3_tutorial.md).
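The quoted speedups are simply ratios of the reported throughputs (tokens/s), both measured with 2×32 cores; a quick arithmetic check:

```python
# Speedup = KTransformers throughput / llama.cpp baseline throughput,
# using the best reported V0.3 numbers from the list above.
prefill_kt, prefill_llamacpp = 286.55, 10.31
decode_kt, decode_llamacpp = 13.69, 4.51

print(f"prefill speedup: {prefill_kt / prefill_llamacpp:.2f}x")  # 27.79x
print(f"decode speedup:  {decode_kt / decode_llamacpp:.2f}x")
```

The decode ratio rounds to about 3.03-3.04×, consistent with the figure quoted above.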
- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, it scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).

<p align="center">
  <picture>
    <img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
  </picture>
</p>

- **Faster Speed:** Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) and [Marlin](https://github.com/IST-DASLab/marlin).
- **VSCode Integration:** Wrapped into an OpenAI- and Ollama-compatible API for seamless integration as a backend for [Tabby](https://github.com/TabbyML/tabby) and various other frontends.

<p align="center">

https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c

</p>

<!-- <h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
<p align="center">

https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12

* **1M Context InternLM 2.5 7B**: Operates at full bf16 precision, utilizing 24GB VRAM and 150GB DRAM, which is feasible on a local desktop setup. It achieves a 92.88% success rate on the 1M "Needle In a Haystack" test and 100% on the 128K NIAH test.

<p align="center">
  <picture>
    <img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
  </picture>
</p>

<p align="center">
  <picture>
    <img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
  </picture>
</p>

* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than the full-attention approach of llama.cpp.

* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU-offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).
-->

<strong>More advanced features are coming soon, so stay tuned!</strong>
<h2 id="quick-start">🚀 Quick Start</h2>

Getting started with KTransformers is simple! Follow the steps below to set up and start using it.

We already support the following vendors:

- Metax
- Sanechips (ZhuFeng V1.0)
- Intel
- Ascend
- Kunpeng
- AMD

### 📥 Installation

To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).

<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework. This allows researchers to easily replace original torch modules with optimized variants. It also simplifies the process of combining multiple optimizations, allowing the exploration of their synergistic effects.

</br>
<p align="center">
  <picture>
    <img alt="Inject-Struction" src="https://github.com/user-attachments/assets/6b4c1e54-9f6d-45c5-a3fc-8fa45e7d257e" width=65%>
  </picture>
</p>

Given that vLLM already serves as a great framework for large-scale deployment optimizations, KTransformers is particularly focused on local deployments that are constrained by limited resources. We pay special attention to heterogeneous computing opportunities, such as GPU/CPU offloading of quantized models. For example, we support the efficient <a href="https://github.com/Mozilla-Ocho/llamafile/tree/main">Llamafile</a> and <a href="https://github.com/IST-DASLab/marlin">Marlin</a> kernels for CPU and GPU, respectively. More details can be found <a href="doc/en/operators/llamafile.md">here</a>.

<h3>Example Usage</h3>
To utilize the provided kernels, users only need to create a YAML-based injection template and add a call to `optimize_and_load_gguf` before using the Transformers model.
```python
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
...
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```

In this example, the AutoModel is first initialized on the meta device to avoid occupying any memory resources. Then, `optimize_and_load_gguf` iterates through all sub-modules of the model, matches the rules specified in your YAML rule file, and replaces them with advanced modules as specified.

After injection, the original `generate` interface is available, but we also provide a compatible `prefill_and_generate` method, which enables further optimizations such as CUDAGraph to improve generation speed.

<h3>How to customize your model</h3>

A detailed tutorial on injection and multi-GPU usage, using DeepSeek-V2 as an example, is given [here](doc/en/injection_tutorial.md).

Below is an example of a YAML template for replacing all original Linear modules with Marlin, an advanced 4-bit quantization kernel.
```yaml
- match:
    name: "^model\\.layers\\..*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching both name and class
  replace:
    class: ktransformers.operators.linear.KTransformerLinear  # optimized kernel on quantized data types
    device: "cpu"  # which device to load this module on when initializing
    kwargs:
      generate_device: "cuda"
      generate_linear_type: "QuantizedLinearMarlin"
```

Each rule in the YAML file has two parts: `match` and `replace`. The `match` part specifies which module should be replaced, and the `replace` part specifies the module to be injected into the model along with the initialization keywords.

You can find example rule templates for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models, in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory. These templates are used to power the `local_chat.py` demo.

If you are interested in our design principles and the implementation of the injection framework, please refer to the [design document](doc/en/deepseek-v2-injection.md).
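The injection pass described in this tutorial boils down to walking the model's module tree and swapping matching children in place. The following sketch is illustrative only: the `Module`/`Linear`/`Block` classes are plain-Python stand-ins (not the actual torch or KTransformers APIs), and attribute names like `l0` stand in for the numeric list indices of real layer stacks.

```python
import re

class Module:
    """Minimal stand-in for a torch.nn.Module tree (illustrative only)."""
    def children(self):
        return {k: v for k, v in vars(self).items() if isinstance(v, Module)}

class Linear(Module):
    pass

class QuantizedLinearMarlin(Linear):
    pass

def walk_and_replace(module, prefix, name_re, match_cls, factory):
    """Recursively visit children; replace those whose dotted name and class match."""
    for attr, child in module.children().items():
        dotted = f"{prefix}.{attr}" if prefix else attr
        if name_re.match(dotted) and type(child) is match_cls:
            setattr(module, attr, factory())  # swap the module in place
        else:
            walk_and_replace(child, dotted, name_re, match_cls, factory)

# Build a tiny two-layer "model": layers.l0.proj, layers.l1.proj, head.
class Block(Module):
    def __init__(self):
        self.proj = Linear()

class Layers(Module):
    def __init__(self):
        self.l0 = Block()  # attribute names stand in for list indices
        self.l1 = Block()

class Model(Module):
    def __init__(self):
        self.layers = Layers()
        self.head = Linear()  # outside the matched subtree: left untouched

model = Model()
walk_and_replace(model, "", re.compile(r"^layers\..*$"), Linear, QuantizedLinearMarlin)
print(type(model.layers.l0.proj).__name__)  # QuantizedLinearMarlin
print(type(model.head).__name__)            # Linear
```

As in the YAML rules, a child is only replaced when both its dotted name and its class match, so `head` survives even though it is a `Linear`.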
<h2 id="Citation">🔥 Citation</h2>

If you use KTransformers in your research, please cite our [paper](https://madsys.cs.tsinghua.edu.cn/publication/ktransformers-unleashing-the-full-potential-of-cpu/gpu-hybrid-inference-for-moe-models/):

```bibtex
@inproceedings{10.1145/3731569.3764843,
  title = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
  author = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
  booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
  year = {2025}
}
```

<h2 id="ack">Acknowledgment and Contributors</h2>

The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, SGLang, and FlashInfer. We are planning to contribute back to the community by upstreaming our modifications.

KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.

<h2 id="discussion">Discussion</h2>

If you have any questions, feel free to open an issue. Alternatively, you can join our WeChat group for further discussion. QR Code: [WeChat Group](WeChatGroup.png)

<h2 id="FAQ">🙋 FAQ</h2>

Some common questions are answered in the [FAQ](doc/en/FAQ.md).
**File: `archive/README_ZH_LEGACY.md`** (new file, 166 lines; originally in Chinese, translated below)
<div align="center">
|
||||
<!-- <h1>KTransformers</h1> -->
|
||||
<p align="center">
|
||||
|
||||
<picture>
|
||||
<img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
|
||||
|
||||
</picture>
|
||||
|
||||
</p>
|
||||
<h3>一个用于体验尖端 LLM 推理优化的灵活框架</h3>
|
||||
<strong><a href="#show-cases">🌟 案例展示</a> | <a href="#quick-start">🚀 快速入门</a> | <a href="#tutorial">📃 教程</a> | <a href="https://github.com/kvcache-ai/ktransformers/discussions">💬 讨论</a> | <a href="#FAQ">🙋 常见问题</a> </strong>
|
||||
</div>
|
||||
|
||||
<h2 id="intro">🎉 介绍</h2>
|
||||
KTransformers(发音为 Quick Transformers)旨在通过先进的内核优化和放置/并行策略来增强您对 🤗 [Transformers](https://github.com/huggingface/transformers) 的体验。
|
||||
<br/><br/>
|
||||
KTransformers 是一个以 Python 为中心的灵活框架,其核心是可扩展性。通过用一行代码实现并注入优化模块,用户可以获得与 Transformers 兼容的接口、符合 OpenAI 和 Ollama 的 RESTful API,甚至是一个简化的类似 ChatGPT 的 Web 界面。
|
||||
<br/><br/>
|
||||
我们对 KTransformers 的愿景是成为一个用于实验创新 LLM 推理优化的灵活平台。如果您需要任何其他功能,请告诉我们。
|
||||
|
||||
<h2 id="Updates">🔥 更新</h2>
|
||||
|
||||
* **2025 年 2 月 15 日**:为DeepSeek-V3/R1支持[FP8 GPU内核](./doc/en/fp8_kernel.md); 支持更长的上下文([教程](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context)).
|
||||
* **2025 年 2 月 15 日**:长上下文(从4K到8K,24GB VRAM) & 稍快的速度(+15%)(最快 16 Tokens/s),文档请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md) 和 [在线指南](https://kvcache-ai.github.io/ktransformers/) 。
|
||||
* **2025 年 2 月 10 日**:支持 Deepseek-R1 和 V3 在单个(24GB VRAM)/多 GPU 和 382G DRAM 上运行,速度提升高达 3~28 倍。详细教程请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md)。
|
||||
* **2024 年 8 月 28 日**:支持 InternLM2.5-7B-Chat-1M 模型下的 1M 上下文,使用 24GB 的 VRAM 和 150GB 的 DRAM。详细教程请参见 [这里](./doc/en/long_context_tutorial.md)。
|
||||
* **2024 年 8 月 28 日**:将 DeepseekV2 所需的 VRAM 从 21G 降低到 11G。
|
||||
* **2024 年 8 月 15 日**:更新了详细的 [教程](doc/en/injection_tutorial.md),介绍注入和多 GPU 的使用。
|
||||
* **2024 年 8 月 14 日**:支持 llamfile 作为线性后端。
|
||||
* **2024 年 8 月 12 日**:支持多 GPU;支持新模型:mixtral 8\*7B 和 8\*22B;支持 q2k、q3k、q5k 在 GPU 上的去量化。
|
||||
* **2024 年 8 月 9 日**:支持 Windows。
|
||||
|
||||
<h2 id="show-cases">🌟 案例展示</h2>
|
||||
|
||||
<div>
|
||||
<h3>在仅 24GB VRAM 的桌面上运行 GPT-4/o1 级别的本地 VSCode Copilot</h3>
|
||||
</div>
|
||||
|
||||
https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
|
||||
|
||||
</p>
|
||||
|
||||
- **[NEW!!!] 本地 671B DeepSeek-Coder-V3/R1**:使用其 Q4_K_M 版本,仅需 14GB VRAM 和 382GB DRAM 即可运行(教程请参见 [这里](./doc/en/DeepseekR1_V3_tutorial.md))。
|
||||
- 预填充速度(tokens/s):
|
||||
- KTransformers:54.21(32 核)→ 74.362(双插槽,2×32 核)→ 255.26(优化的 AMX 基 MoE 内核,仅 V0.3)→ 286.55(选择性使用 6 个专家,仅 V0.3)
|
||||
- 与 llama.cpp 在 2×32 核下相比,达到 **27.79× 速度提升**。
|
||||
- 解码速度(tokens/s):
|
||||
- KTransformers:8.73(32 核)→ 11.26(双插槽,2×32 核)→ 13.69(选择性使用 6 个专家,仅 V0.3)
|
||||
- 与 llama.cpp 在 2×32 核下相比,达到 **3.03× 速度提升**。
|
||||
- 即将开源发布:
|
||||
- AMX 优化和选择性专家激活将在 V0.3 中开源。
|
||||
- 目前仅在预览二进制分发中可用,可从 [这里](./doc/en/DeepseekR1_V3_tutorial.md) 下载。
|
||||
|
||||
- **本地 236B DeepSeek-Coder-V2**:使用其 Q4_K_M 版本,仅需 21GB VRAM 和 136GB DRAM 即可运行,甚至在 [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench) 中得分超过 GPT4-0613。
|
||||
|
||||
<p align="center">
|
||||
<picture>
|
||||
<img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
|
||||
</picture>
|
||||
</p>
|
||||
|
||||
- **更快的速度**:通过 MoE 卸载和注入来自 [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) 和 [Marlin](https://github.com/IST-DASLab/marlin) 的高级内核,实现了 2K 提示预填充 126 tokens/s 和生成 13.6 tokens/s 的速度。
|
||||
- **VSCode 集成**:封装成符合 OpenAI 和 Ollama 的 API,可无缝集成到 [Tabby](https://github.com/TabbyML/tabby) 和其他前端的后端。
|
||||
|
||||
<p align="center">
|
||||
|
||||
https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
|
||||
|
||||
</p>
|
||||
|
||||
<!-- <h3>在仅 24GB VRAM 的桌面上进行 1M 上下文本地推理</h3>
|
||||
<p align="center"> -->
|
||||
|
||||
<!-- https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12 -->
|
||||
<!--
|
||||
* **1M 上下文 InternLM 2.5 7B**:以全 bf16 精度运行,使用 24GB VRAM 和 150GB DRAM,可在本地桌面设置中实现。在 1M "针在干草堆中" 测试中达到 92.88% 的成功率,在 128K NIAH 测试中达到 100%。
|
||||
|
||||
<p align="center">
|
||||
<picture>
|
||||
<img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
|
||||
</picture>
|
||||
</p>
|
||||
|
||||
<p align="center">
|
||||
<picture>
|
||||
<img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
|
||||
</picture>
|
||||
</p>
|
||||
|
||||
* **增强的速度**:使用稀疏注意力,通过 llamafile 内核实现 1M 上下文生成 16.91 tokens/s 的速度。这种方法比 llama.cpp 的全注意力方法快 10 倍以上。
|
||||
|
||||
* **灵活的稀疏注意力框架**:提供了一个灵活的块稀疏注意力框架,用于 CPU 卸载解码。与 SnapKV、Quest 和 InfLLm 兼容。更多信息请参见 [这里](./doc/en/long_context_introduction.md)。 -->
|
||||
|
||||
<strong>更多高级功能即将推出,敬请期待!</strong>
|
||||
|
||||
<h2 id="quick-start">🚀 快速入门</h2>
|
||||
|
||||
|
||||
KTransformers 的入门非常简单!请参考我们的[安装指南]((https://kvcache-ai.github.io/ktransformers/))进行安装。
|
||||
|
||||
<h2 id="tutorial">📃 简要注入教程</h2>
|
||||
KTransformers 的核心是一个用户友好的、基于模板的注入框架。这使得研究人员可以轻松地将原始 torch 模块替换为优化的变体。它还简化了多种优化的组合过程,允许探索它们的协同效应。
|
||||
</br>
|
||||
<p align="center">
|
||||
<picture>
|
||||
<img alt="Inject-Struction" src="https://github.com/user-attachments/assets/6b4c1e54-9f6d-45c5-a3fc-8fa45e7d257e" width=65%>
|
||||
</picture>
|
||||
</p>
|
||||
|
||||
鉴于 vLLM 已经是一个用于大规模部署优化的优秀框架,KTransformers 特别关注受资源限制的本地部署。我们特别关注异构计算时机,例如量化模型的 GPU/CPU 卸载。例如,我们支持高效的 <a herf="https://github.com/Mozilla-Ocho/llamafile/tree/main">Llamafile</a> 和<a herf="https://github.com/IST-DASLab/marlin">Marlin</a> 内核,分别用于 CPU 和 GPU。 更多详细信息可以在 <a herf="doc/en/operators/llamafile.md">这里</a>找到。
|
||||
|
||||
|
||||
<h3>示例用法</h3>

To utilize the provided kernels, users only need to create a YAML-based injection template and add a call to `optimize_and_load_gguf` before using the Transformers model.

```python
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
...
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```

In this example, the AutoModel is first initialized on the meta device to avoid occupying any memory resources. Then, `optimize_and_load_gguf` iterates through all sub-modules of the model, matches the rules specified in your YAML rule file, and replaces them with the advanced modules as specified.

After injection, the original `generate` interface is still available, but we also provide a compatible `prefill_and_generate` method, which enables further optimizations such as accelerating generation with CUDAGraph.
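The meta-device initialization described above can be seen in isolation. The sketch below is plain PyTorch (2.0+), not KTransformers-specific code: it shows that modules built under `torch.device("meta")` carry only shape and dtype metadata and allocate no real storage.

```python
import torch
import torch.nn as nn

# Parameters created under the "meta" device are placeholders: they keep
# shape and dtype information but are backed by no actual memory.
with torch.device("meta"):
    layer = nn.Linear(4096, 4096)

assert layer.weight.is_meta              # no storage allocated
assert layer.weight.shape == (4096, 4096)
# Real weights must be materialized and loaded afterwards; in KTransformers
# this is the role of optimize_and_load_gguf reading from the GGUF file.
```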
<h3>How to Customize Your Model</h3>

A detailed tutorial on injection and multi-GPU usage, using DeepSeek-V2 as an example, is available [here](doc/en/injection_tutorial.md).

Below is an example YAML template that replaces all original Linear modules with Marlin, an advanced 4-bit quantization kernel.
```yaml
- match:
    name: "^model\\.layers\\..*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching both name and class
  replace:
    class: ktransformers.operators.linear.KTransformerLinear  # optimized kernel on quantized data types
    device: "cpu"  # device on which to load this module at initialization
    kwargs:
      generate_device: "cuda"
      generate_linear_type: "QuantizedLinearMarlin"
```

Each rule in the YAML file has two parts: `match` and `replace`. The `match` part specifies which modules should be replaced, and the `replace` part specifies the module to be injected into the model, along with its initialization keywords.

You can find example rule templates for optimizing DeepSeek-V2 and Qwen2-57B-A14 in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory. These templates power the `local_chat.py` demo.

If you are interested in our design principles and the implementation of the injection framework, please refer to the [design document](doc/en/deepseek-v2-injection.md).

<h2 id="ack">Acknowledgments and Contributors</h2>

The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We have also benefited from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by contributing our modifications upstream.

KTransformers is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members of <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.

<h2 id="ack">Discussion</h2>

If you have any questions, feel free to open an issue. Alternatively, you can join our WeChat group for further discussion. QR code: [WeChat Group](WeChatGroup.png)

<h2 id="FAQ">🙋 FAQ</h2>

Answers to some frequently asked questions can be found in the [FAQ](doc/en/FAQ.md).