add the docs and update README for KSFT
New binary files:

- doc/assets/image-20250801165752484.png (276 KiB)
- doc/assets/image-20250801174517784.png (652 KiB)
- doc/assets/image-20250801174623919.png (175 KiB)
- doc/assets/image-20250911184023795.png (605 KiB)
- doc/assets/image-20250911184455749.png (201 KiB)
- doc/assets/image-20251011010558909.png (224 KiB)
- doc/assets/image-20251016171526210.png (2.6 MiB)
- doc/assets/image-20251016171537997.png (1.3 MiB)
- doc/assets/image-20251016175046882.png (345 KiB)
- doc/assets/image-20251016175848143.png (345 KiB)
- doc/assets/image-20251016182810716.png (781 KiB)
- doc/assets/image-20251016182920722.png (3.3 MiB)
- doc/assets/image-20251016182942726.png (276 KiB)
- doc/assets/image-compare_model.png (3.1 MiB)
- doc/assets/演示文稿1_01.png (1.1 MiB)
- doc/assets/风格化数据集模型输出对比_01.png (1.3 MiB)
doc/en/KTransformers Fine-Tuning -- User Guide.md (new file, 294 lines):
- [KTransformers Fine-Tuning × LLaMA-Factory Integration – User Guide](#ktransformers-fine-tuning-x-llama-factory-integration-–-user-guide)
  - [Introduction](#introduction)
    - [Fine-Tuning Results (Examples)](#fine-tuning-results-examples)
      - [Stylized Dialogue (CatGirl tone)](#stylized-dialogue-catgirl-tone)
      - [Benchmarks](#benchmarks)
        - [Translational-Style dataset](#translational-style-dataset)
        - [AfriMed-QA (short answer)](#afrimed-qa-short-answer)
        - [AfriMed-QA (multiple choice)](#afrimed-qa-multiple-choice)
  - [Quick to Start](#quick-to-start)
    - [Environment Setup](#environment-setup)
    - [Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models](#core-feature-1-use-ktransformers-backend-to-fine-tune-ultra-large-moe-models)
    - [Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)](#core-feature-2-chat-with-the-fine-tuned-model-base--lora-adapter)
    - [Core Feature 3: Batch inference + metrics (base + LoRA adapter)](#core-feature-3-batch-inference--metrics-base--lora-adapter)
  - [KT Fine-Tuning Speed (User-Side View)](#kt-fine-tuning-speed-user-side-view)
    - [End-to-End Performance](#end-to-end-performance)
    - [GPU/CPU Memory Footprint](#gpucpu-memory-footprint)
  - [Conclusion](#conclusion)
# KTransformers Fine-Tuning × LLaMA-Factory Integration – User Guide

**MadSys Lab, KVCache-AI Team, Approaching AI, LLaMA-Factory Team**

## Introduction

From **DeepSeek-V3/R1** to **Qwen3-MoE** and **Kimi-K2**, each wave of open-sourced large models brings leaps in performance and scale. However, many researchers and developers are constrained by expensive GPUs and models with tens or even hundreds of billions of parameters, making it **hard to fine-tune very large models under limited resources**. To bridge this gap, we propose a practical approach: combining **KTransformers** with **LLaMA-Factory**. With just **2–4 RTX 4090s** and a high-memory CPU, you can fine-tune ultra-large MoE models like DeepSeek-671B.
Our goal is to give resource-constrained researchers a **local path to explore fine-tuning ultra-large models**, and also a fast way to customize smaller models (e.g., 14B/30B) for specific scenarios. We validate the setup using **stylized dialogue**, **Westernized translation tone**, and **medical Q&A** as representative tasks, showing that **personalized adaptation can be achieved within hours**.

As shown below, LLaMA-Factory is the unified orchestration/configuration layer for the whole fine-tuning workflow—handling data, training scheduling, LoRA injection, and inference interfaces—while **KTransformers** acts as a pluggable high-performance backend that takes over core operators like Attention/MoE under the same training configs, enabling efficient **GPU+CPU heterogeneous cooperation**.



Within LLaMA-Factory, we compared LoRA fine-tuning with the **HuggingFace**, **Unsloth**, and **KTransformers** backends. KTransformers is the **only workable 4090-class solution** for ultra-large MoE models (e.g., 671B) and also delivers higher throughput and lower GPU memory on smaller MoE models (e.g., DeepSeek-14B).

| LoRA (BF16) + [NekoQA-10K stylized dialogue](https://github.com/mindsRiverPonder/LLM-practice) | HuggingFace Backend | Unsloth Backend | KTransformers Backend |
| --- | --- | --- | --- |
| [14B-DeepSeekV2-Lite] LoRA fine-tuning throughput | 303.58 token/s | 455.37 token/s | 530.38 token/s |
| [14B-DeepSeekV2-Lite] GPU memory | 32.12 GB | 9.64 GB | 6.08 GB |
| [671B-DeepSeekV3] LoRA fine-tuning throughput | <font color='red'>too large to run</font> | <font color='red'>not supported</font> | 40.35 token/s |
| [671B-DeepSeekV3] GPU memory (sum across GPUs) | theoretical 1400 GB † | <font color='red'>not supported</font> | 70 GB † |

† **1400 GB** is a **theoretical** FP16 full-parameter resident footprint (not runnable). **70 GB** is the **measured peak** with the KT strategy (Attention on GPU + layered MoE offload).


### Fine-Tuning Results (Examples)

#### Stylized Dialogue (CatGirl tone)

Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). Goal: improve style consistency and recognizability.

The figure compares responses from the base vs. fine-tuned models. The fine-tuned model maintains the target tone and address terms more consistently (red boxes), validating the effectiveness of **style-transfer fine-tuning**.



#### Benchmarks

We use:

(1) [Translational-Style-ChatLLM](https://github.com/Benson114/Translational-Style-ChatLLM), which asks for an exaggerated, Westernized translation tone—a clear case of stylized customization.

(2) [AfriMed-QA](https://aclanthology.org/2025.acl-long.96/) (ACL 2025), a medical dataset for African contexts with strong domain specificity, including multiple-choice and short-answer sub-tasks—well suited for vertical fine-tuning evaluation.

The tables show metrics before vs. after LoRA fine-tuning. We observe **large improvements** across metrics, verifying fine-tuning effectiveness:
| Translational-Style dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| V2-Lite (no LoRA) | 20.66 | 8.33 | 4.54 | 2.89 | 22.71 | 4.52 | 19.19 |
| **KT-LoRA fine-tuned V2-Lite** | **35.41** | **22.44** | **15.42** | **11.18** | **42.03** | **18.38** | **33.10** |
| V3 base (no LoRA) | 8.49 | 3.34 | 1.62 | 0.96 | 15.91 | 2.55 | 10.07 |
| **KT-LoRA fine-tuned V3** | **37.02** | **23.70** | **16.21** | **11.49** | **43.43** | **18.96** | **34.54** |

| AfriMed-QA (short answer) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| V2-Lite (no LoRA) | 13.58 | 11.12 | 9.10 | 7.23 | 22.48 | 7.81 | 11.73 |
| **KT-LoRA fine-tuned V2-Lite** | **35.90** | **27.63** | **22.99** | **19.15** | **35.25** | **17.50** | **28.44** |
| V3 base (no LoRA) | 12.75 | 10.27 | 8.05 | 5.99 | 20.33 | 5.65 | 10.11 |
| **KT-LoRA fine-tuned V3** | **42.42** | **34.12** | **28.95** | **24.54** | **41.97** | **22.37** | **33.28** |

| AfriMed-QA (multiple choice) | Accuracy |
| --- | --- |
| V2-Lite (no LoRA) | 0.0645 |
| **KT-LoRA fine-tuned V2-Lite** | **0.4812** |
| V3 base (no LoRA) | 0.5833 |
| **KT-LoRA fine-tuned V3** | **0.7930** |

Even for ultra-large MoE models, **KTransformers-backed fine-tuning** achieves strong task performance quickly.
## Quick to Start

This section shows how to install and use **LLaMA-Factory + KTransformers** for fine-tuning and inference:

- Environment setup
- Fine-tune ultra-large MoE models with the KTransformers backend
- Load the fine-tuned model (base + LoRA adapter) for chat/inference
- Batch inference and metric evaluation

### Environment Setup

Follow the example below to set up the **KTransformers** and **LLaMA-Factory** environments together.
To simplify installation, we provide a prebuilt KTransformers wheel, so no local compilation is needed.
The detailed installation steps are as follows.
(Note: make sure your local **Python version**, **Torch version**, **CUDA version**, and the **KTransformers wheel filename** correspond correctly.)
```shell
# 1. Create a conda environment
conda create -n Kllama python=3.10  # choose from: [3.10, 3.11, 3.12, 3.13]
conda activate Kllama
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# 2. Install the LLaMA-Factory environment
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

# 3. Install the KTransformers wheel that matches your Torch and Python versions
#    (the CUDA version may differ from the one in the wheel filename)
pip install ktransformers-0.4.1+cu128torch28fancy-cp310-cp310-linux_x86_64.whl

# 4. Install flash-attention: download the wheel matching your Python and Torch
#    versions from https://github.com/Dao-AILab/flash-attention/releases
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
# The cxx11abi TRUE/FALSE part must match your Torch build; check it with:
#   python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"

# 5. (Optional) Install flash_infer (otherwise the backend defaults to triton)
git clone https://github.com/kvcache-ai/custom_flashinfer.git
pip install custom_flashinfer/
```
**Usage tip:** in the LLaMA-Factory YAML, set `use_kt: true` and pick a `kt_optimize_rule` file to have KTransformers handle the core compute. The features below show typical configs.

### Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models

Run: `USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml`.

Note: you **must** provide a **BF16** model. DeepSeek-V3-671B is released in FP8 by default; convert it with [DeepSeek-V3/inference/fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
```yaml
### model
model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: identity
template: deepseek
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/Kllama_deepseekV3
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### ktransformers
use_kt: true  # use KTransformers as the LoRA SFT backend
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
cpu_infer: 32
chunk_size: 8192
```
`kt_optimize_rule` controls the **placement strategy**. See also [ktransformers/optimize_rules](https://github.com/kvcache-ai/ktransformers/tree/main/ktransformers/optimize/optimize_rules). Naming hints (`*` = wildcard):

| Pattern | Meaning |
| --- | --- |
| `DeepSeek-V2-Lite-Chat-*` / `DeepSeek-V3-Chat-*` | Target model variants |
| `*-sft-*` | Strategy for fine-tuning; rules without it are for inference |
| `*-amx-*` | Use AMX on the CPU; otherwise use **llamafile** |
| `*-multi-gpu-X` | Model parallelism across X GPUs (X omitted → 2 GPUs by default) |

Example: `DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml` = V3-Chat fine-tuning with AMX and 2-GPU model parallelism.

We recommend **AMX acceleration** where available (check with `lscpu | grep amx`). AMX supports BF16/INT8. Example:
```yaml
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cpu"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KSFTExpertsCPU"
      out_device: "cuda"
      backend: "AMXInt8"  # or "AMXBF16" or "llamafile" (default)
```

Training outputs are written to `output_dir` in safetensors format, together with adapter metadata for later loading.


### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)

Run: `llamafactory-cli chat examples/inference/deepseek3_lora_sft_kt.yaml`.

Use the safetensors adapter trained with KT for inference.
```yaml
model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
adapter_name_or_path: saves/Kllama_deepseekV3
template: deepseek
infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
trust_remote_code: true

use_kt: true  # use KTransformers as the backend for LoRA inference
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
cpu_infer: 32
chunk_size: 8192
```
We also support **GGUF** adapters: for a safetensors adapter, set `adapter_name_or_path` to the **directory**; for a GGUF adapter, set it to the **file path**.
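For example, the two cases differ only in what `adapter_name_or_path` points at (the GGUF filename below is hypothetical):

```yaml
# safetensors adapter: point at the training output directory
adapter_name_or_path: saves/Kllama_deepseekV3

# GGUF adapter: point at the adapter file itself (hypothetical filename)
# adapter_name_or_path: saves/Kllama_deepseekV3/adapter.gguf
```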
During loading, LLaMA-Factory maps layer names to KT’s naming. You’ll see logs like `Loaded adapter weight: XXX -> XXX`:



### Core Feature 3: Batch inference + metrics (base + LoRA adapter)

Run: `API_PORT=8000 llamafactory-cli api examples/inference/deepseek3_lora_sft_kt.yaml`.
This serves the KT fine-tuned adapter through the LLaMA-Factory API; all other API usage is identical to native LLaMA-Factory.
```yaml
model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
adapter_name_or_path: saves/Kllama_deepseekV3
template: deepseek
infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
trust_remote_code: true

use_kt: true  # use KTransformers as the backend for LoRA inference
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
cpu_infer: 32
chunk_size: 8192
```
## KT Fine-Tuning Speed (User-Side View)

### End-to-End Performance

**Definitions**

- `step_time`: wall-clock time for a full optimization step (tensor movement + Attention + MoE + other compute).
- `tokens_per_step = GAS × qlen`; `token/s = tokens_per_step / step_time`.

**Settings:** `GAS=16`, `qlen=512` (→ `tokens_per_step = 8192`); LoRA (`r=8, alpha=32, dropout=0.1`); **AMX** enabled; GPU: RTX 4090; CPU: Intel Xeon Platinum 8488C.

**Measured**

- **DeepSeek-V3-671B:** `step_time = 203 s` → `token/s ≈ 8192 / 203 ≈ 40.35`
- **DeepSeek-V2-Lite-14B:** `step_time = 36 s` → `token/s ≈ 8192 / 36 ≈ 227.6`
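These numbers follow directly from the definitions above; a quick check:

```python
# token/s = tokens_per_step / step_time, with tokens_per_step = GAS * qlen
GAS, qlen = 16, 512
tokens_per_step = GAS * qlen  # 8192

for model, step_time in [("DeepSeek-V3-671B", 203), ("DeepSeek-V2-Lite-14B", 36)]:
    print(f"{model}: {tokens_per_step / step_time:.2f} token/s")
```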
### GPU/CPU Memory Footprint

- DeepSeek-V3 (671B; 61 layers, 58 of them MoE): ~**70 GB** total GPU memory (multi-GPU), ~**1.2–1.3 TB** host memory.
- DeepSeek-V2-Lite (14B; 27 layers, 26 of them MoE): ~**5.5 GB** GPU memory, ~**150 GB** host memory.
## Conclusion

By integrating **KTransformers LoRA fine-tuning** into **LLaMA-Factory**, we provide a practical path for efficient training and deployment of MoE LLMs. KT brings cutting-edge optimizations (DeepSeek/Qwen/Kimi support with AMX-accelerated kernels), LoRA enables customization under very low GPU memory, and LLaMA-Factory offers a friendly, unified interface.

This integration (akin to Unsloth-style speedups) means even models with tens to hundreds of billions of parameters can be fine-tuned and deployed with low latency on commodity hardware: **memory savings, speed-ups, and usability** together. We encourage you to try LLaMA-Factory + KT for your next MoE project and follow this guide. Feedback is welcome!
doc/en/KTransformers Fine-Tuning--Developer Technical Notes.md (new file, 221 lines):
- [KTransformers Fine-Tuning × LLaMA-Factory Integration – Developer Technical Notes](#ktransformers-fine-tuning-x-llama-factory-integration-–-developer-technical-notes)
  - [Introduction](#introduction)
  - [Overall View of the KT Fine-Tuning Framework](#overall-view-of-the-kt-fine-tuning-framework)
    - [Attention (LoRA + KT coexist)](#attention-lora--kt-coexist)
    - [MoE (operator encapsulation + backward)](#moe-operator-encapsulation--backward)
      - [Encapsulation](#encapsulation)
      - [Backward (CPU)](#backward-cpu)
    - [Multi-GPU Loading/Training: Placement strategy instead of DataParallel](#multi-gpu-loadingtraining-placement-strategy-instead-of-dataparallel)
  - [KT-LoRA Fine-Tuning Evaluation](#kt-lora-fine-tuning-evaluation)
    - [Setup](#setup)
    - [Results](#results)
      - [Stylized Dialogue (CatGirl tone)](#stylized-dialogue-catgirl-tone)
      - [Translational-Style benchmark (generative)](#translational-style-benchmark-generative)
      - [Medical Vertical Benchmark (AfriMed-SAQ/MCQ)](#medical-vertical-benchmark-afrimed-saqmcq)
      - [Limitations](#limitations)
  - [Speed Tests](#speed-tests)
    - [End-to-End Performance](#end-to-end-performance)
    - [MoE Compute (DeepSeek-V3-671B)](#moe-compute-deepseek-v3-671b)
    - [Memory Footprint](#memory-footprint)
  - [Conclusion](#conclusion)
# KTransformers Fine-Tuning × LLaMA-Factory Integration – Developer Technical Notes

**MadSys Lab, KVCache-AI Team, Approaching AI, LLaMA-Factory Team**

## Introduction

Recent open-source LLMs—from DeepSeek-V3/R1 to Qwen-MoE and Kimi-K2—have surged in performance and scale. Yet due to **compute and memory constraints**, it is difficult for typical researchers to fine-tune trillion-parameter-class models. We therefore integrate **KTransformers** with **LLaMA-Factory** so that, with **2–4 RTX 4090 GPUs** and sufficient CPU memory, one can fine-tune ultra-large Mixture-of-Experts (MoE) models such as DeepSeek-671B.

This architecture bridges resource gaps, enabling **local fine-tuning of ultra-large models**, while also supporting **efficient scenario customization** at 14B/30B scales. We validate on stylized dialogue, Westernized translation tone, and medical Q&A, achieving rapid adaptation within hours.

Architecturally, LLaMA-Factory orchestrates data/config/training, LoRA insertion, and inference; KTransformers is a pluggable, high-performance operator backend that takes over Attention and MoE under the same training code, enabling **GPU+CPU heterogeneity** to accelerate training and reduce GPU memory.



We evaluated LoRA fine-tuning with the HuggingFace default, Unsloth, and KTransformers backends (same settings and data). **KTransformers** is currently the only solution feasible on **2–4×24GB 4090s** for **671B-scale MoE**, and also shows higher throughput and lower GPU memory for 14B MoEs.
| LoRA (BF16) + [NekoQA-10K stylized dialogue](https://github.com/mindsRiverPonder/LLM-practice) | HuggingFace Backend | Unsloth Backend | KTransformers Backend |
| --- | --- | --- | --- |
| [14B-DeepSeekV2-Lite] LoRA fine-tuning throughput | 303.58 token/s | 455.37 token/s | 530.38 token/s |
| [14B-DeepSeekV2-Lite] GPU memory | 32.12 GB | 9.64 GB | 6.08 GB |
| [671B-DeepSeekV3] LoRA fine-tuning throughput | <font color='red'>too large to run</font> | <font color='red'>not supported</font> | 40.35 token/s |
| [671B-DeepSeekV3] GPU memory (sum across GPUs) | theoretical 1400 GB † | <font color='red'>not supported</font> | 70 GB † |

† The **1400 GB** is the **theoretical** FP16 full-resident footprint (not runnable). **70 GB** is the **measured peak** with KT (Attention on GPU + layered MoE offload).

As the table shows, for the 14B model the KTransformers backend achieves roughly 75% higher throughput than the default HuggingFace solution while using only about one-fifth of the GPU memory. For the 671B model, both HuggingFace and Unsloth fail to run on a single 4090, whereas KTransformers performs LoRA fine-tuning at 40 token/s while keeping GPU memory within 70 GB.




## Overall View of the KT Fine-Tuning Framework

We detail how KTransformers takes over core operators in LLaMA-Factory’s fine-tuning framework to optimize Attention and MoE.

DeepSeek-V3/V2 MoE models comprise a small-parameter dense Attention part and a large-parameter sparse MoE part. For illustration, consider layer 2 of DeepSeek-V2-Lite-Chat (from layer 2 onward, every layer includes both Attention and MoE). Attention compute and the KV cache mainly reside on the GPU; the heavyweight MoE part executes primarily on the CPU. We first cover **Attention replacement and inheritance**, then **MoE encapsulation and backend interfacing**, and finally **multi-GPU placement**.
### Attention (LoRA + KT coexist)

KTransformers provides operator injection (`BaseInjectedModule`), and PEFT provides LoRA layer insertion. For fine-tuning, we design `KTransformersLinearLora`, inheriting from both `KTransformersLinear` and `LoraLayer`:

- **Inheritance:** `KTransformersLinearLora` retains KT’s high-performance paths (`prefill_linear`/`generate_linear`) while accepting LoRA parameters (`lora_A`/`lora_B`).
- **Replacement:** during preparation, we replace the original `KTransformersLinear` layers (Q/K/V/O) with `KTransformersLinearLora`, preserving KT optimizations while enabling LoRA trainability.



After replacement, LoRA is inserted at the Q/K/V/O linear transforms (left), and `KTransformersLinearLora` contains both KT fast paths and LoRA matrices (right).


### MoE (operator encapsulation + backward)

#### Encapsulation

Given the large parameter count and sparse compute, we encapsulate the expert computation as a **differentiable black-box operator**—transparent upstream, replaceable downstream.

- **Upstream (PyTorch graph):** we register a custom autograd Function so the MoE layer appears as **a single node**. In the left figure (red box), only `KSFTExpertsCPU` is visible; on the right, the unencapsulated graph expands into routing, dispatch, and FFN experts. Encapsulation makes the MoE layer behave like a standard `nn.Module` with gradients.
- **Downstream (backend):** inside the autograd Function, pybind11 calls C++ extensions for forward/backward. Multiple **pluggable backends** exist (AMX BF16/INT8; **llamafile**), switchable via YAML (e.g., `"backend": "AMXBF16"` vs. `"llamafile"`).


#### Backward (CPU)

MoE backward frequently needs the transposed weights $W^\top$. To avoid repeated runtime transposes, we **precompute and cache** $W^\top$ at load time (blue box). We also **cache necessary intermediate activations** (e.g., expert projections, red box) for reuse in backward, reducing recomputation. We provide backward implementations for **llamafile** and **AMX (INT8/BF16)**, with NUMA-aware optimizations.

<img src="../assets/image-20251016182942726.png" alt="image-20251016182942726" style="zoom:33%;" />
### Multi-GPU Loading/Training: Placement strategy instead of DataParallel

To lower **per-GPU memory peaks** on 2–4 GPUs, we use **model parallelism + explicit placement** rather than DataParallel (which duplicates the whole model on every GPU).

Key changes:

1. **KTrainer:** takes over `.to(device)` to prevent moving the whole model to a single GPU. Using KT’s optimize-rule YAML, each layer declares `device: cuda:0/cuda:1/...` and is **constructed directly on the target GPU** (no extra copies).
2. **Disable automatic DataParallel:** when `USE_KT=1`, we disable the automatic DP wrappers from LLaMA-Factory/HF Trainer to avoid duplication and keep full control over sharding.
3. **Gradient aggregation:** gradients are reduced to `cuda:0`. Intermediate activations stay local; only necessary tensors are transferred, cutting communication/activation overhead.

Thus, KT placement strategies are preserved under multi-GPU fine-tuning. Users choose a `kt_optimize_rule` with `multi-gpu`. For DeepSeek-671B, `DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml` is a typical 2-GPU plan: KV/attention parts live on each GPU, MoE experts are sharded on the CPU, and both GPUs share the workload.
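Such a placement rule uses the same match/replace YAML syntax as other KT optimize rules. The sketch below is illustrative only (the layer-range regexes and attention class path are assumptions, not copied from the shipped `DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml`):

```yaml
# Illustrative 2-GPU split: attention for layers 0-29 on cuda:0, 30-60 on cuda:1
- match:
    name: "^model\\.layers\\.([0-9]|[12][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(3[0-9]|[45][0-9]|60)\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
```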


## KT-LoRA Fine-Tuning Evaluation

### Setup

LLaMA-Factory orchestration, KTransformers backend, LoRA (rank=8, α=32, dropout=0.1, BF16), `GAS=16`, `qlen=512`, and the same KT optimize rule as training. We evaluate (a) stylized dialogue transfer and (b) two **small-scale representative** benchmarks: Translational-Style (generative) and AfriMed-QA (medical vertical; **SAQ** and **MCQ**). AMX is enabled; GPUs: 2×48GB RTX 4090; CPU: Intel Xeon Platinum 8488C.

### Results

#### Stylized Dialogue (CatGirl tone)

Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). The fine-tuned model consistently exhibits the target style (red boxes) versus the neutral/rational base model (blue). This shows that **KT-LoRA injects style features** into the generation distribution at low GPU cost.


#### Translational-Style benchmark (generative)

Dataset: [Translational-Style-ChatLLM](https://github.com/Benson114/Translational-Style-ChatLLM). Metrics: BLEU-1/2/3/4, ROUGE-1/2/L.

| Translational-Style dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| V2-Lite (no LoRA) | 20.66 | 8.33 | 4.54 | 2.89 | 22.71 | 4.52 | 19.19 |
| **KT-LoRA fine-tuned V2-Lite** | **35.41** | **22.44** | **15.42** | **11.18** | **42.03** | **18.38** | **33.10** |
| V3 base (no LoRA) | 8.49 | 3.34 | 1.62 | 0.96 | 15.91 | 2.55 | 10.07 |
| **KT-LoRA fine-tuned V3** | **37.02** | **23.70** | **16.21** | **11.49** | **43.43** | **18.96** | **34.54** |

Under a unified workflow and placement strategy, **both model scales exhibit consistent gains after fine-tuning**, supporting the usability and effectiveness of the “KT backend + LoRA fine-tuning” combination for generative style control, and indicating that KT’s heterogeneous placement and operator optimizations can stably support small-sample adaptation in the style domain.
#### Medical Vertical Benchmark (AfriMed-SAQ/MCQ)

We adopt [AfriMed-QA](https://aclanthology.org/2025.acl-long.96/) (ACL 2025), a domain-specific medical dataset for African contexts with strong scenario-customization characteristics, comprising two formats—multiple-choice questions (MCQ) and short-answer questions (SAQ)—which serve here as the vertical-domain fine-tuning evaluation. For metrics, BLEU/ROUGE are used for SAQ and accuracy for MCQ.

| AfriMed-QA (SAQ) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| V2-Lite (no LoRA) | 13.58 | 11.12 | 9.10 | 7.23 | 22.48 | 7.81 | 11.73 |
| **KT-LoRA fine-tuned V2-Lite** | **35.90** | **27.63** | **22.99** | **19.15** | **35.25** | **17.50** | **28.44** |
| V3 base (no LoRA) | 12.75 | 10.27 | 8.05 | 5.99 | 20.33 | 5.65 | 10.11 |
| **KT-LoRA fine-tuned V3** | **42.42** | **34.12** | **28.95** | **24.54** | **41.97** | **22.37** | **33.28** |

| AfriMed-QA (MCQ) | Accuracy |
| --- | --- |
| V2-Lite (no LoRA) | 0.0645 |
| **KT-LoRA fine-tuned V2-Lite** | **0.4812** |
| V3 base (no LoRA) | 0.5833 |
| **KT-LoRA fine-tuned V3** | **0.7930** |

As the tables show: (1) after KT-LoRA fine-tuning, DeepSeek-V3 (671B) clearly outperforms the fine-tuned DeepSeek-V2-Lite (14B) on both MCQ and SAQ, and also surpasses the V3 base model. Within our small-scale setting, this preliminarily indicates that KT-LoRA fine-tuning of ultra-large models is practically meaningful in vertical domains.

(2) Across both SAQ and MCQ sub-tasks, KT-LoRA delivers consistent gains, indicating that—with KT’s heterogeneous placement and backend operator support—LoRA fine-tuning can effectively inject key vertical-domain knowledge (here, medicine) into the model.
#### Limitations

At present, most of our testing uses **single datasets** at **small scale** (≤ 20k examples), with the goal of providing **existence evidence that KT-LoRA fine-tuning works as a system**, rather than drawing general conclusions about algorithmic generalization or scaling laws. This report presents representative figures; stronger algorithmic claims would require larger sample sizes, multi-lingual/multi-domain datasets, and multi-seed repeated experiments, which are beyond the scope of this work.

**We warmly welcome everyone to join the open-source LLaMA-Factory KT fine-tuning project. If you have additional test results, please record them in the shared community spreadsheet, together with the corresponding `kt_optimize_rule` files, dataset examples, training/evaluation YAMLs, and detailed GPU-memory and CPU configurations, for community reference and reproducibility!**
### Speed Tests

#### End-to-End Performance

**Definitions**

`step_time`: time per optimization step (tensor movement + Attention + MoE + others).

`tokens_per_step = GAS × qlen`; `token/s = tokens_per_step / step_time`. We use `GAS=16`, `qlen=512`, so `tokens_per_step = 8192`.
**Measured**

| Model                | step_time (s) | tokens/step | token/s   |
| -------------------- | ------------- | ----------- | --------- |
| DeepSeek-V3-671B     | 203           | 8192        | **40.35** |
| DeepSeek-V2-Lite-14B | 36            | 8192        | **227.6** |
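The throughput numbers above can be sanity-checked with a few lines of Python (a small sketch using the step times measured in the table):

```python
# Sanity-check the end-to-end throughput figures from the table above.
GAS, qlen = 16, 512            # gradient_accumulation_steps and sequence length
tokens_per_step = GAS * qlen   # 8192 tokens per optimization step

def tokens_per_second(step_time_s: float) -> float:
    """token/s = tokens_per_step / step_time."""
    return tokens_per_step / step_time_s

for model, step_time in [("DeepSeek-V3-671B", 203), ("DeepSeek-V2-Lite-14B", 36)]:
    print(f"{model}: {tokens_per_second(step_time):.2f} token/s")
```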
#### MoE Compute (DeepSeek-V3-671B)

**Theory**

- MoE per-layer, per-token FLOPs (forward + backward), approximately:

$$
\text{FLOPs}_{\text{per-layer, per-token}} \approx c \cdot k \cdot H \cdot I
$$

with $k = 8$ (Top-k), $H = 7168$ (hidden size), $I = 2048$ (intermediate size), and $c \approx 16$ (≈6 forward + ≈10 backward matmuls).

- Per-step across all MoE layers:

$$
\text{FLOPs}_{\text{per-step}} \approx c \cdot qlen \cdot k \cdot H \cdot I \cdot L_{\text{MoE}}
$$

Plugging in $c=16$, $qlen=512$, $k=8$, $H=7168$, $I=2048$, $L_{\text{MoE}}=58$ gives $\text{FLOPs}_{\text{per-step}} \approx 55.8\ \text{TFLOPs}$.
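As a quick check, the per-step estimate can be reproduced numerically (a small sketch plugging in the constants defined above):

```python
# Reproduce the per-step MoE FLOPs estimate from the formula above.
c, qlen, k, H, I, L_moe = 16, 512, 8, 7168, 2048, 58

flops_per_layer_per_token = c * k * H * I          # per MoE layer, per token
flops_per_step = flops_per_layer_per_token * qlen * L_moe

print(f"{flops_per_step / 1e12:.1f} TFLOPs per step")  # → 55.8 TFLOPs per step
```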
**Measured (MoE TFLOPS on CPU)**

If the **MoE-only** time per step is `t_moe` (seconds), then $\text{TFLOPS} = \text{FLOPs}_{\text{per-step}} / t_{\text{moe}}$.

Use the MoE-phase time, not the full `step_time`, to compute MoE throughput.

| TFLOPS  | Forward | Backward |
| ------- | ------- | -------- |
| Average | 17.55   | 18.41    |
### Memory Footprint

- DeepSeek-V3 (671B; 58 MoE layers out of 61): ~**70 GB** total GPU memory, ~**1.2–1.3 TB** host memory.
- DeepSeek-V2-Lite (14B; 26 MoE layers out of 27): ~**5 GB** GPU memory, ~**30 GB** host memory.
## Conclusion

Integrating **KTransformers LoRA** with **LLaMA-Factory** provides a practical path to efficiently train and deploy MoE LLMs. KT contributes placement strategies and operator optimizations (DeepSeek/Qwen/Kimi support with AMX-accelerated kernels), and LoRA enables customization with very low GPU memory; LLaMA-Factory supplies a coherent user-level interface.

This means even tens-to-hundreds-of-billion-parameter MoE models can be fine-tuned and served with low latency on ordinary hardware. The approach balances **memory savings**, **speed**, and **usability**, turning ultra-large models into tools that developers can actually wield.
302
doc/zh/KTransformers Fine-Tuning -- User Guide_zh.md
Normal file
@@ -0,0 +1,302 @@
- [KTransformers 微调 × LLaMA-Factory 集成 – 用户指南](#ktransformers-微调-x-llama-factory-集成-–-用户指南)
- [Introduction](#introduction)
- [Quick to Start](#quick-to-start)
  - [快速上手](#快速上手)
  - [环境安装](#环境安装)
  - [核心功能1:使用KTransformers作为backend,微调超大规模MoE模型](#核心功能1使用ktransformers作为backend微调超大规模moe模型)
  - [核心功能2:与微调后模型(即原模型+LoRA Adapter)聊天,用于交互](#核心功能2与微调后模型即原模型lora-adapter聊天用于交互)
  - [核心功能3:生成微调后模型(即原模型+LoRA Adapter)的API,用于批量生成并评测指标](#核心功能3生成微调后模型即原模型lora-adapter的api用于批量生成并评测指标)
- [KT微调速度性能测试:用户侧](#kt微调速度性能测试用户侧)
  - [端到端性能](#端到端性能)
  - [显存/内存性能](#显存内存性能)
- [结论](#结论)
# KTransformers 微调 × LLaMA-Factory 集成 – 用户指南

**MadSys实验室, KVCache-AI团队, 趋境科技, LLaMA-Factory团队**

## Introduction

从 **DeepSeek-V3/R1** 到 **Qwen3-MoE、Kimi-K2**,每一次超大模型的开源都带来性能与规模上的巨大跃升。然而,多数研究者与开发者受限于昂贵的显卡与动辄数千亿参数的模型,**难以在资源受限条件下微调超大模型**。面对这种差距,我们提出了一种更具可行性的方案:通过 **KTransformers 与 LLaMA-Factory 的结合**,仅需2~4张RTX 4090与较高内存CPU,便可微调DeepSeek-671B等超大规模的MoE模型。

该架构的核心目标是为资源受限下的研究者提供**在本地探索超大规模模型微调的可能性**。同时,也在较小规模(如 14B/30B)提供快速定制特定场景的路径。我们以**风格化对话、西式腔调翻译、医学问答**作为代表任务,验证架构的可行性,并展示在**数小时内达成个性化适配**的可操作性。

如下图所示,LLaMA-Factory 是整个微调流程的统一调度与配置框架,负责数据处理、训练调度、LoRA 插入与推理接口管理;KTransformers 则作为其可插拔的高性能后端,在相同的训练配置下接管 Attention / MoE 等核心算子,实现异构设备(GPU+CPU)的高效协同。

我们在 LLaMA-Factory 框架下,对比评测了 **HuggingFace**、**Unsloth**、**KTransformers** 三种后端的 LoRA 微调方案。结果显示,KTransformers为超大规模的MoE模型(671B等)提供了**4090 级别**的唯一可行方案,并在较小规模的MoE模型(DeepSeek-14B)上也展现了更高的吞吐和更低的显存占用。
| Under LoRA (BF16)+[NekoQA-10K-风格化对话数据集](https://github.com/mindsRiverPonder/LLM-practice) | HuggingFace Backend | Unsloth Backend | KTransformers Backend |
| ------------------------------------------------------------ | ---------------------------------------- | ------------------------------------ | --------------------- |
| [14B-DeepSeekV2-Lite] LoRA Fine-tuning throughput | 303.58 token/s | 455.37 token/s | 530.38 token/s |
| [14B-DeepSeekV2-Lite] GPU Memory | 32.12 GB | 9.64 GB | 6.08 GB |
| [671B-DeepSeekV3] LoRA Fine-tuning throughput | <font color='red'>Too Huge to run</font> | <font color='red'>NOT SUPPORT</font> | 40.35 token/s |
| [671B-DeepSeekV3] GPU Memory(多卡总和) | 理论值1400 GB † | <font color='red'>NOT SUPPORT</font> | 70 GB † |

† **1400 GB** 为**理论显存**(FP16 全参数常驻,非可运行配置);**70 GB** 为 KT 策略(Attention 驻 GPU + MoE分层 offload)下的**实测峰值**。
### 微调效果示例

#### 风格化对话测试(CatGirl风格语气)

数据集:[NekoQA-10K: 面向猫娘语言建模的对话数据集](https://zhuanlan.zhihu.com/p/1934983798233231689),目标是提升风格一致性与可辨识度。

下图对比了原始模型和微调模型的回答,可以看到微调后模型在语气和称谓上更加稳定地保持了猫娘风格(红框部分),验证了**风格迁移微调**的有效性。

#### Benchmark测试

数据集选取:

(1)采用了[西式翻译腔数据集](https://github.com/Benson114/Translational-Style-ChatLLM),该数据集要求模型按西式表达习惯进行夸张的翻译,有明确的定制化风格需求。

(2)采用了[AfriMed-QA](https://aclanthology.org/2025.acl-long.96/)数据集(ACL-2025),作为非洲地区医疗领域的专用数据集,具有很强的场景定制特征,包含选择题和简答题两种形式,非常适合作为垂直领域微调的评估。针对单选和简答形式,我们分别进行测试,结果如下。

下表显示了微调前后模型在这些数据集上的指标变化。可以看到经过 LoRA 微调后,各项指标**大幅提升**,验证了微调的有效性:
| 西式翻译腔数据集 | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ------------------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| V2-Lite原模型(不LoRA微调) | 20.66 | 8.33 | 4.54 | 2.89 | 22.71 | 4.52 | 19.19 |
| **KT-LoRA微调DeepSeek-V2-Lite** | **35.41** | **22.44** | **15.42** | **11.18** | **42.03** | **18.38** | **33.10** |
| V3原模型(不LoRA微调) | 8.49 | 3.34 | 1.62 | 0.96 | 15.91 | 2.55 | 10.07 |
| **KT-LoRA微调DeepSeek-V3** | **37.02** | **23.70** | **16.21** | **11.49** | **43.43** | **18.96** | **34.54** |

| AfriMed-QA数据集(简答任务) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ------------------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| V2-Lite原模型(不LoRA微调) | 13.58 | 11.12 | 9.10 | 7.23 | 22.48 | 7.81 | 11.73 |
| **KT-LoRA微调DeepSeek-V2-Lite** | **35.90** | **27.63** | **22.99** | **19.15** | **35.25** | **17.50** | **28.44** |
| V3原模型(不LoRA微调) | 12.75 | 10.27 | 8.05 | 5.99 | 20.33 | 5.65 | 10.11 |
| **KT-LoRA微调DeepSeek-V3** | **42.42** | **34.12** | **28.95** | **24.54** | **41.97** | **22.37** | **33.28** |

| AfriMed-QA数据集(单选任务) | Accuracy |
| ------------------------------- | ---------- |
| V2-Lite原模型(不LoRA微调) | 0.0645 |
| **KT-LoRA微调DeepSeek-V2-Lite** | **0.4812** |
| V3原模型(不LoRA微调) | 0.5833 |
| **KT-LoRA微调DeepSeek-V3** | **0.7930** |
从以上测试可以看出,即使是参数量巨大的 MoE 模型,通过 KTransformers 后端的高效微调,**也能在特定任务上快速达到理想效果**。

## Quick to Start

### 快速上手

本节将指导您如何安装环境并使用 **LLaMA-Factory + KTransformers** 完成微调和推理。我们将涵盖以下内容:

- 环境依赖的安装配置
- 使用 KTransformers 作为后端微调超大规模 MoE 模型
- 加载微调后的模型(原模型 + LoRA 适配器)进行对话/推理
- 批量推理微调模型并评测指标

### 环境安装

根据下面示例,同时安装KTransformers和LLaMA-Factory环境。这次为了简化KTransformers的安装流程,我们特意封装了wheel包避免本地编译,具体安装步骤如下(注意对应好本地的python版本、torch版本、cuda版本和不同文件名的KTransformers包):
```shell
# 1. 安装conda环境
conda create -n Kllama python=3.10  # choose from: [3.10, 3.11, 3.12, 3.13]
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# 2. 安装llamafactory环境
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

# 3. 安装对应torch和python版本的KTransformers(CUDA版本可以跟whl命名的不一致)
pip install ktransformers-0.4.1+cu128torch28fancy-cp310-cp310-linux_x86_64.whl

# 4. 安装flash-attention:参照python版本和torch版本,从 https://github.com/Dao-AILab/flash-attention/releases 下载对应whl后安装
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
# abi=True/False 可以用下面代码查看:
# import torch
# print(torch._C._GLIBCXX_USE_CXX11_ABI)

# 5. (可选)如果你想使用flash_infer的话(不然默认triton)
git clone https://github.com/kvcache-ai/custom_flashinfer.git
pip install custom_flashinfer/
```
**使用要点**:在 LLaMA-Factory 的配置 YAML 文件中启用 KTransformers 后端,只需设置 `use_kt: true`,并指定相应的 `kt_optimize_rule` YAML 文件,即可切换到底层由 KTransformers 接管计算。下面我们将通过具体功能来说明如何设置这些配置。

### 核心功能1:使用KTransformers作为backend,微调超大规模MoE模型

运行命令:`USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml`。

需要注意的是,必须提供BF16格式模型文件。DeepSeek-V3-671B默认下载是FP8格式,需要通过 [DeepSeek-V3/inference/fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) 转换。
```yaml
### model
model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: identity
template: deepseek
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/Kllama_deepseekV3
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### ktransformers
use_kt: true  # use KTransformers as LoRA sft backend
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
cpu_infer: 32
chunk_size: 8192
```
其中,`kt_optimize_rule`提供了大量默认的YAML文件来控制**KTransformers的放置策略**,下面针对YAML文件名和功能对照特别说明,也可以参考[ktransformers/optimize_rules](https://github.com/kvcache-ai/ktransformers/tree/main/ktransformers/optimize/optimize_rules)(\*指通配符):

| 文件名字段 | 功能特征 |
| --------------------------------------------- | -------------------------------------------------- |
| DeepSeek-V2-Lite-Chat-\*或DeepSeek-V3-Chat-\* | 对应的不同模型 |
| \*-sft-\* | 微调所用的放置策略,其他为推理所用 |
| \*-amx-\* | 使用AMX指令集进行CPU运算,其他为llamafile |
| \*-multi-gpu-X\* | 使用X张GPU进行模型并行(显存共担),X为空默认是2张 |

例如:`examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml`为DeepSeek-V3-Chat模型用AMX指令集进行微调,并调用两卡模型并行。

对于微调任务,我们推荐使用**AMX指令集加速**,可以使用`lscpu | grep amx`查看CPU是否支持AMX指令集。AMX精度支持BF16/Int8,修改方式如下:
```yaml
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cpu"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KSFTExpertsCPU"
      out_device: "cuda"
      backend: "AMXInt8"  # or "AMXBF16" or "llamafile" (default)
```
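作为补充,下面给出一个检测 AMX 支持情况的小示例(仅作示意,读取 Linux 的 `/proc/cpuinfo`,与正文的 `lscpu | grep amx` 等价;标志位名称为 Linux 内核公开的 `amx_tile`/`amx_int8`/`amx_bf16`):

```python
# 检测当前CPU是否支持AMX指令集(仅Linux;等价于 lscpu | grep amx)
from pathlib import Path

AMX_FLAGS = ("amx_tile", "amx_int8", "amx_bf16")  # 分别对应tile寄存器、Int8与BF16算子

def amx_support(flags_text: str) -> set:
    """从cpuinfo的flags字段中筛选出AMX相关标志。"""
    tokens = set(flags_text.split())
    return {f for f in AMX_FLAGS if f in tokens}

def read_cpu_flags(path: str = "/proc/cpuinfo") -> str:
    try:
        return Path(path).read_text()
    except OSError:            # 非Linux环境时返回空串
        return ""

supported = amx_support(read_cpu_flags())
print("AMX支持:", supported or "未检测到(或非Linux环境),建议退回llamafile后端")
```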
输出会保存在`output_dir`里面,默认为safetensor格式,并且保留adapter.json等配套内容以便后续加载。

### 核心功能2:与微调后模型(即原模型+LoRA Adapter)聊天,用于交互

运行命令:`llamafactory-cli chat examples/inference/deepseek3_lora_sft_kt.yaml`。

调用KT微调的adapter(safetensor格式)推理对话。
```yaml
model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
adapter_name_or_path: saves/Kllama_deepseekV3
template: deepseek
infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
trust_remote_code: true

use_kt: true  # 调用KTransformers backend
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx.yaml  # 请选择和LoRA微调的时候保持一致的YAML文件
cpu_infer: 32
chunk_size: 8192
```
同时,我们也支持GGUF格式的adapter进行推理(如果您已经使用了上述LLaMA-Factory+KTransformers的微调方案,就不用管啦~)。

safetensors 场景填**文件所在目录**,GGUF 场景填**文件路径**,也就是说您需要把`adapter_name_or_path`设为具体的GGUF格式文件。

加载过程中适配了KT每层的命名(与torch.save保存下来的常规命名不同),正常映射时会输出日志`Loaded adapter weight: XXX -> XXX`,展示如下。

### 核心功能3:生成微调后模型(即原模型+LoRA Adapter)的API,用于批量生成并评测指标

运行命令:`API_PORT=8000 llamafactory-cli api examples/inference/deepseek3_lora_sft_kt.yaml`。

调用KT微调的adapter给出API,其他API使用逻辑和llamafactory原生方式一致。
```yaml
model_name_or_path: opensourcerelease/DeepSeek-V3-bf16
adapter_name_or_path: saves/Kllama_deepseekV3
template: deepseek
infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
trust_remote_code: true

use_kt: true  # use KTransformers as LoRA sft backend to inference
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml
cpu_infer: 32
chunk_size: 8192
```
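API 启动后即可按 OpenAI 兼容格式批量调用。下面是一个最小调用示意(假设端口为 8000、接口路径为 `/v1/chat/completions`;`build_chat_request`、`chat` 等函数名与 `model` 字段取值均为本文示意,实际以服务端为准):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/v1/chat/completions"  # 假设:API_PORT=8000,OpenAI 兼容路径

def build_chat_request(prompt: str, model: str = "DeepSeek-V3") -> dict:
    """构造一次对话补全请求体(OpenAI 兼容格式;model 字段为示意)。"""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }

def chat(prompt: str) -> str:
    """向本地 API 发送请求并返回回复文本。"""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = request.Request(API_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# 批量评测时循环调用即可,例如:
# answers = [chat(q) for q in questions]
```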
## KT微调速度性能测试:用户侧

### 端到端性能

**测试定义:**

`step_time`:一次优化步(包含 `gradient_accumulation_steps (GAS)` 次累积)的总时间,涵盖 **PyTorch 张量搬运 + Attention + MoE + 其他计算等**。

`tokens_per_step = GAS × qlen`;`token/s = tokens_per_step / step_time`。

**测试设置:**`GAS=16`,`qlen=512`(即每步 8192 tokens);LoRA(`r=8, alpha=32, dropout=0.1`);使用AMX指令集优化;GPU选取RTX 4090,CPU选取Intel Xeon Platinum 8488C。

**实测结果:**

**DeepSeek-V3-671B:**`step_time = 203 s` → `token/s ≈ 8192 / 203` **≈ 40.35 token/s**

**DeepSeek-V2-Lite-14B:**`step_time = 36 s` → `token/s ≈ 8192 / 36` **≈ 227.6 token/s**

### 显存/内存性能

DeepSeek-V3(671B,61层,其中58层有MoE)占用显存(多卡总量)大约**70GB**、内存占用约1.2-1.3TB。

DeepSeek-V2-Lite(14B,27层,其中26层有MoE)占用显存大约**5.5GB**、内存占用约150GB。

## 结论

通过开发 KTransformers LoRA微调并将其集成到 LLaMA‑Factory,我们为希望高效训练与部署 MoE 大模型的用户提供了可行指南。KT 带来最尖端的优化(支持 DeepSeek、Qwen、Kimi 等,配合 AMX 加速 kernel),同时通过 LoRA 微调在极低 GPU 显存下实现定制化。LLaMA‑Factory 则提供友好的统一界面与更广的用户支持。

该集成(类似 Unsloth 补丁所带来的提速)意味着即便是数百亿乃至万亿总参数量的 MoE 模型,也可在普通硬件上完成微调并低延迟部署,**显存节省、速度提升、易用性**三者兼得。我们鼓励用户在下一次 MoE 项目中尝试 LLaMA‑Factory 的 KT 集成,并参考本文档进行操作。也欢迎提出任何问题和建议!
@@ -0,0 +1,227 @@
- [KTransformers 微调 × LLaMA-Factory 集成 – 开发技术篇](#ktransformers-微调-x-llama-factory-集成-–-开发技术篇)
- [Introduction](#introduction)
- [KT微调框架整体性描述](#kt微调框架整体性描述)
  - [Attention 部分(LoRA + KT 特性并存)](#attention-部分lora--kt-特性并存)
    - [继承关系](#继承关系)
    - [替换策略](#替换策略)
  - [MoE 部分(算子封装+backward实现)](#moe-部分算子封装backward实现)
    - [MoE算子封装](#moe算子封装)
    - [MoE 反向优化 (CPU 实现)](#moe-反向优化-cpu-实现)
  - [多卡加载与训练:用“放置策略”而不是 DataParallel](#多卡加载与训练用放置策略而不是-dataparallel)
- [KT-LoRA微调测试](#kt-lora微调测试)
  - [实验设置](#实验设置)
  - [效果测试](#效果测试)
    - [风格化对话测试(CatGirl语气)](#风格化对话测试catgirl语气)
    - [生成式翻译风格基准测试](#生成式翻译风格基准测试)
    - [医疗垂直领域基准(AfriMed-SAQ/MCQ)](#医疗垂直领域基准afrimed-saqmcq)
    - [局限性说明](#局限性说明)
  - [速度测试](#速度测试)
    - [端到端性能](#端到端性能)
    - [MoE部分的计算性能(DeepSeek-V3-671B)](#moe部分的计算性能deepseek-v3-671b)
  - [显存/内存性能](#显存内存性能)
- [结论](#结论)
# KTransformers 微调 × LLaMA-Factory 集成 – 开发技术篇

**MadSys实验室, KVCache-AI团队, 趋境科技, LLaMA-Factory团队**

## Introduction

当今的开源大模型(从 DeepSeek-V3/R1 到 Qwen-MoE 系列以及 Kimi-K2 等)在性能和规模上突飞猛进。然而,受限于**计算资源和显存**,普通研究者难以对这些上千亿乃至更大规模的模型进行微调。为此,我们设计了 **KTransformers** 与 **LLaMA-Factory** 集成的方案,使得仅需 **2~4 张 RTX 4090 GPU** 加上足够的 CPU 内存,就能微调 DeepSeek-671B 这样的超大规模 Mixture-of-Experts (MoE) 模型。

这一架构旨在桥接资源鸿沟,让更多人能够**在本地探索超大模型微调**的可能;同时在相对小一些的模型(如 14B/30B 参数量级)上,也能提供**更高效的场景化定制**途径。我们通过风格化对话、西式翻译语气、医学问答等任务验证了该方案,仅用数小时即可实现模型风格和专业领域的**快速适配**。

从系统架构上看,如下图所示,**LLaMA-Factory** 扮演微调流程的调度中枢,负责统一配置数据和训练流程、插入 LoRA 模块以及管理推理接口;**KTransformers** 则作为可插拔的高性能算子后端,在相同的训练代码下接管底层 **Attention** 和 **MoE** 运算,实现 **GPU+CPU 异构协同**,加速训练并降低显存占用。

为评估该集成的性能优势,我们使用 LLaMA-Factory 分别调用了 HuggingFace 默认后端、Unsloth 后端以及 KTransformers 后端进行 LoRA 微调的对比测试(在相同设置和数据集下)。结果表明,**KTransformers** 是目前唯一能在 2~4 张 24GB 4090卡上微调 **671B 规模 MoE 模型** 的方案;同时在 14B 规模的 MoE 模型上,相比另两种方案也具有**更高的吞吐速率**和**更低的 GPU 显存占用**。

| Under LoRA (BF16)+[NekoQA-10K-风格化对话数据集](https://github.com/mindsRiverPonder/LLM-practice) | HuggingFace Backend | Unsloth Backend | KTransformers Backend |
| ------------------------------------------------------------ | ---------------------------------------- | ------------------------------------ | --------------------- |
| [14B-DeepSeekV2-Lite] LoRA Fine-tuning throughput | 303.58 token/s | 455.37 token/s | 530.38 token/s |
| [14B-DeepSeekV2-Lite] GPU Memory | 32.12 GB | 9.64 GB | 6.08 GB |
| [671B-DeepSeekV3] LoRA Fine-tuning throughput | <font color='red'>Too Huge to run</font> | <font color='red'>NOT SUPPORT</font> | 40.35 token/s |
| [671B-DeepSeekV3] GPU Memory(多卡总和) | 理论值1400 GB † | <font color='red'>NOT SUPPORT</font> | 70 GB † |

† **1400 GB** 为**理论显存**(FP16 全参数常驻,非可运行配置);**70 GB** 为 KT 策略(Attention 驻 GPU + MoE分层 offload)下的**实测峰值**。

上表中可以看出,对于 14B 模型,KTransformers 后端的吞吐量相比 HuggingFace 默认方案提升了约 75%,而显存占用仅为其约 1/5。对于 671B 模型,HuggingFace 和 Unsloth 在单台4090环境下无法运行,而 KTransformers 能以 **40 tokens/s** 的速度LoRA微调,并将 GPU 显存需求控制在 70 GB。
## KT微调框架整体性描述

下面详细展示的是在 LLaMA-Factory 的微调框架中,KTransformers 后端如何接管底层算子并实现 Attention / MoE 的优化结构。

DeepSeek-V3/V2等MoE模型主要包括小参数、密集矩阵的Attention部分和大参数、稀疏矩阵的MoE部分。为了直观说明,我们以 DeepSeek-V2-Lite-Chat 的第 2 层为例(从该层起,每层包含 Attention 与 MoE 两个子模块),其中Attention由GPU承担主要计算与缓存(KV),剩下的大参数量MoE主要由CPU承担。下文将先介绍 **Attention 部分的替换与继承关系**,再介绍 **MoE 部分的封装与后端对接**,最后说明**多卡放置等特性支持**。
### Attention 部分(LoRA + KT 特性并存)

KTransformers 提供了算子模块的注入机制(`BaseInjectedModule`),而 PEFT 库提供了 LoRA 微调的层插入机制。为了在**微调阶段**同时兼容两者,我们设计了 `KTransformersLinearLora` 类,使其同时继承自 KTransformers 的线性层 (`KTransformersLinear`) 和 LoRA 的层基类 (`LoraLayer`)。如下图所示:

- **继承关系**:`KTransformersLinearLora` 同时继承 `KTransformersLinear` 与 `LoraLayer`,既保留 **KT 的高性能算子**(如 `prefill_linear` / `generate_linear`),又能**加载 LoRA参数**(如 `lora_A`、`lora_B` 等矩阵);

- **替换策略**:在微调准备阶段,用 `KTransformersLinearLora` **逐一替换**原 `KTransformersLinear` 层(如下图右侧所示,主要包含Q/K/V/O 等线性层),从而在不破坏 KT 优化的前提下,将 LoRA 注入到模型中,使其参数可训练。

替换完成后,如下图(左)所示,在计算图中相当于在原模型的 Q/K/V/O 四个矩阵乘法位置都插入了 LoRA。下图(右)展示了 `KTransformersLinearLora` 的内部,它同时包含了 KT 模块的高性能计算接口(prefill 和 generate 阶段的方法)以及 LoRA 的 A、B 矩阵等参数。
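LoRA 注入后的线性层计算可以用如下最小示意理解(NumPy 草图,仅演示 $y = xW^\top + \frac{\alpha}{r}\,xA^\top B^\top$ 的数学形式与文中的超参设置,并非 `KTransformersLinearLora` 的真实实现):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 8, 8, 32        # 文中设置:rank=8, alpha=32

W = rng.standard_normal((d_out, d_in))      # 冻结的基础权重
A = rng.standard_normal((r, d_in)) * 0.02   # LoRA A:小随机初始化
B = np.zeros((d_out, r))                    # LoRA B:零初始化,保证初始增量为0

def lora_forward(x):
    """y = x W^T + (alpha/r) * x A^T B^T;只训练A、B,W保持冻结。"""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
y = lora_forward(x)
print(y.shape)  # (2, 8)
```

由于 B 零初始化,微调开始时 LoRA 增量恰好为 0,模型行为与原模型一致;训练中只需更新 A、B 两个低秩矩阵。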
### MoE 部分(算子封装+backward实现)

#### MoE算子封装

考虑到 MoE 参数量大且计算稀疏,我们采用“封装成黑盒算子”的策略处理:将 MoE 专家计算封装为一个**对上游而言透明(单节点)、对下游可替换(多实现)**的可微算子。

- **上游(PyTorch 计算图)**:我们注册自定义 Autograd Function,整个 MoE 专家层在计算图中呈现为**一个节点**。如下左图红框所示,封装后计算图中只有 `KSFTExpertsCPU` 这样一个算子节点;而右图红框为未封装时的细粒度计算图——路由、专家选择以及 FFN 计算都完整展开在计算图中。封装后,对微调过程来说,MoE层就等同于一个普通 `nn.Module`,前向计算可求梯度,反向梯度也由我们自定义的算子返回。
- **下游(后端实现)**:在这个 Autograd Function 内部,我们通过 pybind11 调用 C++ 扩展实现具体的前向和反向计算。这里我们提供了多个**可插拔后端实现**,如 AMX 指令集版本(支持 BF16/INT8 算子优化)和 llamafile 版本。只要遵循同样的接口,即可灵活切换后端。例如在 YAML 优化规则里指定使用 `"backend": "AMXBF16"`,就会调用 AMX 后端;改成 `"llamafile"` 则使用默认后端。
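上述“黑盒算子”的封装思路可以用一个纯 PyTorch 草图来示意(这里用普通矩阵乘代替真实的 pybind11/C++ AMX 后端,`BlackBoxExpert` 为示意命名;注意反向中复用了缓存的中间激活与权重转置):

```python
import torch

class BlackBoxExpert(torch.autograd.Function):
    """示意:把专家FFN封装为计算图中的单个节点,前/反向交给可替换后端。
    这里用纯torch矩阵乘代替真实的C++/AMX后端,仅演示封装方式。"""

    @staticmethod
    def forward(ctx, x, w_up, w_down):
        h = torch.relu(x @ w_up)                   # 中间激活
        ctx.save_for_backward(x, w_up, w_down, h)  # 缓存反向所需张量
        return h @ w_down

    @staticmethod
    def backward(ctx, grad_out):
        x, w_up, w_down, h = ctx.saved_tensors
        grad_h = grad_out @ w_down.T               # 复用权重转置 W^T
        grad_h = grad_h * (h > 0)                  # ReLU 的梯度
        grad_x = grad_h @ w_up.T
        grad_w_up = x.T @ grad_h
        grad_w_down = h.T @ grad_out
        return grad_x, grad_w_up, grad_w_down

x = torch.randn(4, 16, requires_grad=True)
w_up = torch.randn(16, 32, requires_grad=True)
w_down = torch.randn(32, 16, requires_grad=True)
y = BlackBoxExpert.apply(x, w_up, w_down)
y.sum().backward()            # 整个专家层在计算图中只是一个节点
print(x.grad.shape)           # torch.Size([4, 16])
```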
#### MoE 反向优化 (CPU 实现)

在实现 MoE 自定义算子的反向传播时,我们特别优化了大矩阵的梯度计算开销。MoE反向计算需要频繁访问权重转置`Wᵀ`,为避免运行时反复转置带来的开销,我们在加载参数时**预备一份权重转置`Wᵀ`便于复用**(如下图蓝框)。同时,**缓存必要的中间激活**(例如专家层中间投影结果,见下图红框),以便在反向阶段复用,减少重复计算。基于这些缓存,当前已提供 llamafile 与 AMX(INT8/BF16) 的MoE反向计算实现,并针对 NUMA 架构优化内存访问。

<img src="../assets/image-20250911184455749.png" alt="image-20250911184455749" style="zoom: 33%;" />
### 多卡加载与训练:用“放置策略”而不是 DataParallel

为了在使用 2~4 张 GPU 时进一步降低**单卡显存压力**,KTransformers 结合模型并行技术实现了**多卡协同微调**。与常规的 DataParallel 不同,我们没有简单地将整层模型复制到每张卡(那样显存需求会翻倍),而是采用**模型并行 + 显式算子放置**的策略,让不同 GPU 各自承载模型的一部分层。

具体而言,我们对 Transformers Trainer 做了以下改动:

1. **自定义训练器 (KTrainer)**:接管模型加载到设备的逻辑,采用显式的层放置。默认情况下 `transformers` 会在初始化时将模型 `.to(device)` 全部搬移到单块 GPU,我们通过自定义 KTrainer 阻止这一行为;利用 KTransformers 的优化规则 YAML,我们可以在每一层声明 `device: cuda:0/cuda:1/...` 来指定该层所在的设备。这样初始化模型时,各层就直接构建在目标 GPU 上,不需要额外拷贝。

2. **禁用自动 DataParallel**:当设置全局变量`USE_KT=1`时,我们暂时禁用了 LLaMA-Factory 和 HuggingFace Trainer 原本自动启用的多卡 DataParallel 封装,避免了框架层面对模型的重复拷贝,使我们能够完全掌控模型的分片方案。

3. **梯度回传与汇总**:由于模型各部分分散在不同 GPU 上,我们采取梯度汇总到 `cuda:0` 的方式。具体做法是:在反向传播时,仅将所需的梯度张量在设备间传输,而不传输整个模型的中间激活;各 GPU 计算各自部分的梯度,最终在0号卡汇总计算 loss。这种方式减少了不必要的通讯开销和激活冗余。

通过上述手段,我们实现了**多 GPU 下依然遵循 KTransformers 放置策略**的训练方案。用户只需选择合适的 `kt_optimize_rule` 配置文件(例如带有 `multi-gpu` 的 YAML),即可启用默认的模型分片方案。在 DeepSeek-671B 微调中,我们提供的 `DeepSeek-V3-Chat-sft-amx-multi-gpu.yaml` 就是一个两卡模型并行的典型策略:Attention 模块的 KV缓存和部分计算放在每张卡上,MoE 专家层在 CPU 上分片处理,两张卡共同承担全模型的计算。
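按规则逐层解析设备的思路可以草绘如下(规则内容与模块命名均为示意,并非真实的 `kt_optimize_rule` 解析代码):

```python
import re

# 示意:类似 kt_optimize_rule 的"正则 -> 设备"放置规则(规则内容为示意)
PLACEMENT_RULES = [
    (re.compile(r"^layers\.(0|1|2|3)\."), "cuda:0"),
    (re.compile(r"^layers\.(4|5|6|7)\."), "cuda:1"),
    (re.compile(r".*experts.*"), "cpu"),   # 大参数MoE专家留在CPU
]

def resolve_device(module_name: str, default: str = "cuda:0") -> str:
    """返回第一条命中规则对应的设备;都未命中则用默认设备。"""
    for pattern, device in PLACEMENT_RULES:
        if pattern.match(module_name):
            return device
    return default

print(resolve_device("layers.5.attn.q_proj"))    # cuda:1
print(resolve_device("layers.9.mlp.experts.3"))  # cpu
```

实际加载时即可对每个子模块执行 `module.to(resolve_device(name))`,使各层直接构建/落在目标设备上,而不是先整体搬到单卡。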
## KT-LoRA微调测试

### 实验设置

实验均采用 LLaMA-Factory 调度、KTransformers 后端、LoRA 轻量微调范式(超参数:rank = 8、α = 32、dropout = 0.1,BF16,`gradient_accumulation_steps=16`、`qlen=512`)以及与微调阶段一致的 KT 优化规则。我们分别评测了(a)风格化对话的迁移效果,以及(b)两类具有代表性的**定量基准**:西式翻译腔(生成式)与 AfriMed-QA(医疗垂直领域,含**简答生成**与**单项选择**两种子任务)。固定使用AMX指令集优化;GPU选取2张 48G VRAM 的 RTX 4090,CPU选取 Intel Xeon Platinum 8488C。

### 效果测试

#### 风格化对话测试(CatGirl语气)

数据集采用[NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689)进行风格迁移微调,目标是提升语气一致性与可辨识度。

下图展示了原模型与微调后模型的对比。微调后回答在称谓、语气标记与修饰语上更稳定地保持了目标风格(红框),相较原模型的中性与理性表达(蓝框)具有更强的风格可辨识性,说明KT-LoRA 能以较低 GPU 成本,将特定风格特征有效注入到大模型生成分布。
#### 生成式翻译风格基准测试

数据集采用了[西式翻译腔数据集](https://github.com/Benson114/Translational-Style-ChatLLM),要求模型采用夸张的“西式翻译腔”,属生成式风格控制任务,评价指标采用生成任务常见的 BLEU-1/2/3/4 与 ROUGE-1/2/L。

| 西式翻译腔数据集 | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ------------------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| V2-Lite原模型(不LoRA微调) | 20.66 | 8.33 | 4.54 | 2.89 | 22.71 | 4.52 | 19.19 |
| **KT-LoRA微调DeepSeek-V2-Lite** | **35.41** | **22.44** | **15.42** | **11.18** | **42.03** | **18.38** | **33.10** |
| V3原模型(不LoRA微调) | 8.49 | 3.34 | 1.62 | 0.96 | 15.91 | 2.55 | 10.07 |
| **KT-LoRA微调DeepSeek-V3** | **37.02** | **23.70** | **16.21** | **11.49** | **43.43** | **18.96** | **34.54** |

如上表测试结果所示,在统一流程与放置策略下,**两种规模的模型在微调后均出现一致性增益**,支持“KT 后端 + LoRA 微调”组合在生成式风格控制上的可用性与有效性。同时,说明 KT 的异构放置与算子优化能够稳定支撑风格域的小样本适配。
#### 医疗垂直领域基准(AfriMed-SAQ/MCQ)

数据集采用了[AfriMed-QA](https://aclanthology.org/2025.acl-long.96/)数据集(ACL-2025),作为非洲地区医疗领域的专用数据集,具有很强的场景定制特征,包含单选题(MCQ)和简答题(SAQ)两种形式,在本案例中作为垂直领域微调的评估。评估标准上,SAQ 用 BLEU/ROUGE;MCQ 用 Accuracy。

| AfriMed-QA数据集(简答任务SAQ) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ------------------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| V2-Lite原模型(不LoRA微调) | 13.58 | 11.12 | 9.10 | 7.23 | 22.48 | 7.81 | 11.73 |
| **KT-LoRA微调DeepSeek-V2-Lite** | **35.90** | **27.63** | **22.99** | **19.15** | **35.25** | **17.50** | **28.44** |
| V3原模型(不LoRA微调) | 12.75 | 10.27 | 8.05 | 5.99 | 20.33 | 5.65 | 10.11 |
| **KT-LoRA微调DeepSeek-V3** | **42.42** | **34.12** | **28.95** | **24.54** | **41.97** | **22.37** | **33.28** |

| AfriMed-QA数据集(单选任务MCQ) | Accuracy |
| ------------------------------- | ---------- |
| V2-Lite原模型(不LoRA微调) | 0.0645 |
| **KT-LoRA微调DeepSeek-V2-Lite** | **0.4812** |
| V3原模型(不LoRA微调) | 0.5833 |
| **KT-LoRA微调DeepSeek-V3** | **0.7930** |

如上表所示,(1)DeepSeek-V3(671B)经 KT-LoRA 微调后在MCQ和SAQ任务上均明显高于微调后的 DeepSeek-V2-Lite(14B),并且超过 V3 原模型。在我们的小规模设置中,初步说明了KT-LoRA微调巨大参数模型,在垂直领域中具有实际意义。

(2)在 SAQ/MCQ 两类子任务上,KT-LoRA 均带来一致增益,说明在 KT 的异构放置与后端算子支持下,LoRA 微调能够把医疗等垂直领域的知识要点有效注入模型。
#### 局限性说明

目前我们的测试多基于单数据集、小规模(2万条及以下)数据,旨在提供**KT-LoRA微调系统有效性的“存在性证据”**,而非对算法泛化或规模规律的概括性结论。我们报告中主要给出的是代表性数值;若要支持更强的算法结论,需要更大样本、跨语种/跨域多数据集与多随机种子重复实验,本文不作展开。

**我们也特别欢迎大家加入LLaMA-Factory KT微调的开源项目中,如果大家有更多的测试结果,也特别欢迎写在下面的共享表格中,并补充好`kt_optimize_rule` 文件、数据集example、训练/评测 YAML、具体显存与 CPU 配置等,以便大家参考、复现~!**
### 速度测试

#### 端到端性能

**测试定义:**

`step_time`:一次优化步的总耗时(含张量搬运、Attention、MoE 等全部计算)。

`tokens_per_step = GAS × qlen`;`token/s = tokens_per_step / step_time`。本节统一采用 `GAS=16`、`qlen=512`,因此 `tokens_per_step = 8192`。

**实测结果:**

| 模型 | step_time (s) | tokens/step | token/s |
| -------------------- | ------------- | ----------- | --------- |
| DeepSeek-V3-671B | 203 | 8192 | **40.35** |
| DeepSeek-V2-Lite-14B | 36 | 8192 | **227.6** |
#### MoE部分的计算性能(DeepSeek-V3-671B)

**理论估算**

- MoE 每层、每token的前/反向浮点计算总量 (FLOPs) 可近似为:

$$
\text{FLOPs}_{\text{per-layer, per-token}} \approx c \cdot k \cdot H \cdot I
$$

其中:$k = 8$(Top-k 专家数),$H = 7168$(hidden size),$I = 2048$(intermediate size),常数 $c\approx16$(折合前向≈6、反向≈10 的矩阵乘总系数)。

- 每步(全 MoE 层)FLOPs 近似为:

$$
\text{FLOPs}_{\text{per-step}} \approx c \cdot qlen \cdot k \cdot H \cdot I \cdot L_{\text{MoE}}
$$

代入 $c=16, qlen=512, k=8, H=7168, I=2048, L_{\text{MoE}}=58$,得 $\text{FLOPs}_{\text{per-step}} \approx 55.8\ \text{TFLOPs}$。
**实测情况**

MoE部分在CPU上的性能:记每步中 MoE 部分的耗时为 $t_{\text{moe}}$(秒),则每秒浮点计算量 $\text{TFLOPS} = \text{FLOPs}_{\text{per-step}} / t_{\text{moe}}$。

| TFLOPS | Forward | Backward |
| ---------------------- | ------- | -------- |
| 平均值(单位:TFLOPS) | 17.55 | 18.41 |
### 显存/内存性能

DeepSeek-V3(671B,61层,其中58层有MoE)占用显存大约70GB(多卡总量)、内存占用约1.2-1.3TB。

DeepSeek-V2-Lite(14B,27层,其中26层有MoE)占用显存大约5GB、内存占用约30GB。

## 结论

通过将 KTransformers LoRA 微调集成到 LLaMA‑Factory,我们为希望高效训练和部署 MoE 大模型的用户提供了一条可行路径。KT 提供新的放置策略和算子优化(支持 DeepSeek、Qwen、Kimi 等模型,并结合 AMX 指令加速关键内核),配合 LoRA 微调实现了在极低 GPU 显存占用下的模型定制化训练;而 LLaMA‑Factory 则提供了友好的上层接口与配置管理,让这一切变得易于使用。

这种集成意味着即便是拥有数百亿乃至上万亿参数的 MoE 模型,也能够在相对普通的硬件上完成微调,并进行低延迟的推理部署。**显存节省**、**速度提升**和**易用性**在这套方案中达到了一定的平衡。我们期待社区在未来的 MoE 项目中尝试使用 LLaMA‑Factory 与 KTransformers 的组合,并欢迎参考本文档提供的指南进行操作。通过这一方案,超大模型不再是“无法企及”的存在,而成为每个开发者都可能驾驭的工具。