Kimi-K2.5 LoRA SFT Tutorial

This tutorial demonstrates how to perform LoRA Supervised Fine-Tuning (SFT) on Kimi-K2.5 using LlamaFactory with KTransformers as the backend, and then serve the fine-tuned model using SGLang.

The workflow is:

KTransformers + LlamaFactory LoRA SFT → (Optional) LlamaFactory Verification → SGLang Serving

Table of Contents

  • Hardware Requirements
  • Step 0: Environment Setup
  • Step 1: Prepare Model Weights (BF16 for SFT)
  • Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)
  • Step 3: Run LoRA SFT
  • Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)
  • Step 5: Serve with SGLang

Hardware Requirements

Training (LoRA SFT)

  • LlamaFactory + KTransformers
  • GPU: 4 * NVIDIA RTX 4090 24GB (or equivalent with at least 96GB total VRAM available)
  • CPU: x86 CPU with AMX support
  • RAM: At least 2TB system memory
  • Swap can be used if CPU memory is insufficient

Inference (LoRA Adapter + Original Model)

  • SGLang + KTransformers
  • GPU: 2 * NVIDIA RTX 4090 24GB (or equivalent with at least 48GB total VRAM available)
  • CPU: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
  • RAM: At least 600GB system memory
  • Storage: ~600GB for model weights (native INT4 weights; the same weight directory serves both CPU and GPU)
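
To confirm that the CPU provides the required instruction sets, you can check the flags reported by the kernel (a quick Linux check; exact flag names may vary by kernel version):

# AMX (training): expect amx_tile, amx_bf16, amx_int8
lscpu | grep -o 'amx[a-z0-9_]*' | sort -u

# AVX512F (inference): prints the flag if present
lscpu | grep -o 'avx512f' | head -n 1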

Step 0: Environment Setup

We recommend using two separate conda environments:

Environment   Purpose
kt-kernel     Inference & serving (KTransformers + SGLang)
kt-sft        Training (LlamaFactory + KTransformers SFT backend)

0.1 Inference Environment: kt-kernel

conda create -n kt-kernel python=3.11
conda activate kt-kernel

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh

0.2 Install SGLang (Inference / Serving)

Recommended for Kimi-K2.5:

git clone https://github.com/kvcache-ai/sglang.git
cd sglang
git checkout kimi_k2.5
pip install -e "python[all]"

0.3 Training Environment: kt-sft

conda create -n kt-sft python=3.11
conda activate kt-sft

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

0.4 Install KTransformers SFT Dependencies

conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# Install matching wheels (recommended), from https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl
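
To confirm that the wheels installed correctly, a quick import check helps (a minimal smoke test, assuming the default module names ktransformers and flash_attn):

python -c "import ktransformers, flash_attn; print('SFT dependencies OK')"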

Step 1: Prepare Model Weights (BF16 for SFT)

1.1 Download INT4 Weights

KTransformers requires BF16 weights for SFT, so first download the released INT4 weights and then convert them (see 1.2).

# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
  --local-dir /path/to/kimi-k2.5

1.2 Convert INT4 → BF16

The Kimi-K2.5 base model is released in INT4 format; convert it to BF16 before running SFT.
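
The exact conversion entry point may differ between releases, so treat the command below as a hypothetical sketch: the script name and flags are placeholders, assuming a dequantization script shipped with your ktransformers checkout.

# Hypothetical invocation; substitute the conversion script your checkout provides
python ktransformers/kt-kernel/scripts/convert_int4_to_bf16.py \
  --input_path /path/to/kimi-k2.5 \
  --output_path /path/to/kimi-k2.5-bf16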

Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)

2.1 Training YAML (LoRA SFT)

Example file: examples/train_lora/kimik2_lora_sft_kt.yaml

Required fields:

stage: sft
finetuning_type: lora
bf16: true

use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192

Other fields (dataset, output_dir, learning rate, epochs) can be adjusted as usual.
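
For reference, a fuller config might look like the sketch below; the dataset, template, output path, and hyperparameters are illustrative placeholders, not values prescribed by this guide.

# Illustrative LoRA SFT YAML; only the fields listed above are KT-specific
model_name_or_path: /path/to/kimi-k2.5-bf16
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
bf16: true

use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192

dataset: identity                 # placeholder dataset
template: kimi                    # placeholder chat template (assumption)
output_dir: saves/kimi-k2.5-lora  # LoRA adapter output
per_device_train_batch_size: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0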

2.2 Inference YAML (LlamaFactory Verification)

Key requirements:

  • adapter_name_or_path: LoRA output directory
  • infer_backend: ktransformers
  • Same use_kt and kt_optimize_rule as training

This YAML is used only for quick verification, not production serving.
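
A minimal sketch of such a verification YAML, reusing the placeholder paths from above:

model_name_or_path: /path/to/kimi-k2.5-bf16
adapter_name_or_path: saves/kimi-k2.5-lora  # LoRA output directory from training
infer_backend: ktransformers

use_kt: true
kt_optimize_rule: <rule.yaml>  # must match the rule used for training
cpu_infer: 32
chunk_size: 8192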

Step 3: Run LoRA SFT

conda activate kt-sft
cd LLaMA-Factory

USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml

After training, the LoRA adapter is saved to output_dir.
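
A successful run leaves standard PEFT-style adapter files in output_dir (the names below are the usual defaults and may vary with library versions):

ls saves/kimi-k2.5-lora
# adapter_config.json  adapter_model.safetensors  trainer_log.jsonl  ...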

Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)

Before production deployment, we recommend a lightweight sanity check with LlamaFactory.

conda activate kt-sft
cd LLaMA-Factory

llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml

Purpose:

  • Validate LoRA correctness
  • Ensure reproducibility
  • Not for throughput benchmarking

Step 5: Serve with SGLang

5.1 Convert LoRA for SGLang

python ktransformers/kt-kernel/scripts/convert_lora.py \
  --base_path /path/to/kimi-base-model \
  --lora_path /path/to/llamafactory/output_dir \
  --output_path /path/to/lora_converted

5.2 (Optional) Convert CPU Weights to INT8

To reduce CPU memory usage:

python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
  --base_path /path/to/kimi-base-model \
  --output_dir /path/to/kimi-base-model-int8

This produces:

/path/to/kimi-base-model-int8/int8

5.3 Launch SGLang Server with LoRA

conda activate kt-kernel

python -m sglang.launch_server \
  --enable-lora \
  --lora-paths lora1=/path/to/lora_converted \
  --lora-backend triton \
  --model-path /path/to/kimi-base-model \
  --tp 1 \
  --trust-remote-code \
  --context-length 4096 \
  --kt-weight-path /path/to/kimi-base-model-int8/int8 \
  --mem-fraction-static 0.9

Notes:

  • --kt-weight-path points to CPU INT8 weights
  • Adjust tp, context-length, and memory parameters per machine
  • RAW-INT4 inference paths can follow Kimi-K2.5-Native directly
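
Once the server is up, you can smoke-test the adapter through SGLang's native /generate endpoint, selecting the adapter by the name registered in --lora-paths (30000 is SGLang's default port; adjust if you pass --port):

curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Explain LoRA in one sentence.",
    "lora_path": "lora1",
    "sampling_params": {"max_new_tokens": 64, "temperature": 0.7}
  }'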