
Kimi-K2.5 LoRA SFT Tutorial

This tutorial demonstrates how to perform LoRA Supervised Fine-Tuning (SFT) on Kimi-K2.5 using LlamaFactory with KTransformers as the backend, and then serve the fine-tuned model using SGLang.

The workflow is:

KTransformers + LlamaFactory LoRA SFT → (Optional) LlamaFactory Verification → SGLang Serving

Table of Contents

  • Hardware Requirements
  • Step 0: Environment Setup
  • Step 1: Prepare Model Weights (BF16 for SFT)
  • Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)
  • Step 3: Run LoRA SFT
  • Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)
  • Step 5: Serve with SGLang

Hardware Requirements

Training (LoRA SFT)

  • LlamaFactory + KTransformers
  • GPU: 4 * NVIDIA RTX 4090 24GB (or an equivalent setup with at least 48GB total VRAM)
  • CPU: x86 CPU with AMX support
  • RAM: At least 2TB system memory
  • Swap can be used if CPU memory is insufficient

Inference (LoRA Adapter + Original Model)

  • SGLang + KTransformers
  • GPU: 2 * NVIDIA RTX 4090 24GB (or an equivalent setup with at least 48GB total VRAM)
  • CPU: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
  • RAM: At least 600GB system memory
  • Storage: ~600GB for model weights (native INT4 weights; the same weight directory serves both CPU and GPU)

Step 0: Environment Setup

We recommend using two separate conda environments:

Environment   Purpose
kt-kernel     Inference & serving (KTransformers + SGLang)
kt-sft        Training (LlamaFactory + KTransformers SFT backend)

0.1 Inference Environment: kt-kernel

conda create -n kt-kernel python=3.11
conda activate kt-kernel

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh

0.2 Install SGLang (Inference / Serving)

Recommended for Kimi-K2.5:

# Option A: One-click install (from ktransformers root, installs sglang + kt-kernel)
./install.sh

# Option B: pip install
pip install sglang-kt
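
To confirm that the kvcache-ai fork is detected (rather than an upstream PyPI build), you can use the project's diagnostic commands; the behavior described in the comments is what these checks are designed to report:

kt version   # should show the sglang-kt package version, aligned with ktransformers
kt doctor    # should recognize sglang-kt as the kvcache-ai fork installation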

0.3 Training Environment: kt-sft

conda create -n kt-sft python=3.11
conda activate kt-sft

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

0.4 Install KTransformers SFT Dependencies

conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# Install matching wheels (recommended), from https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl
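
As a quick sanity check (a minimal sketch that only verifies the wheels import cleanly):

python -c "import ktransformers, flash_attn; print('SFT dependencies OK')"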

Step 1: Prepare Model Weights (BF16 for SFT)

1.1 Download INT4 Weights

KTransformers requires BF16 weights for SFT, but Kimi-K2.5 is distributed as RAW-INT4, so download the INT4 weights first and convert them in the next step.

# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
  --local-dir /path/to/kimi-k2.5

1.2 Convert INT4 → BF16

The Kimi-K2.5 base model ships in INT4 format; convert it to BF16 before running SFT.
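
The exact conversion entry point depends on your checkout. As a sketch, assuming a dequantization script ships alongside the converters used in Step 5 (the script name and flags below are hypothetical; check kt-kernel/scripts in your checkout for the real converter):

# NOTE: hypothetical script name and flags; see kt-kernel/scripts for the actual tool
python ktransformers/kt-kernel/scripts/convert_int4_to_bf16.py \
  --input_path /path/to/kimi-k2.5 \
  --output_path /path/to/kimi-k2.5-bf16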

Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)

2.1 Training YAML (LoRA SFT)

Example file: examples/train_lora/kimik2_lora_sft_kt.yaml

Required fields:

stage: sft
finetuning_type: lora
bf16: true

use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192

Other fields (dataset, output_dir, learning rate, epochs) can be adjusted as usual.
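
For reference, a fuller training YAML might look like the following sketch. Field names follow standard LlamaFactory conventions; the dataset, template, paths, and hyperparameters are placeholders to adapt:

### model
model_name_or_path: /path/to/kimi-k2.5-bf16
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
bf16: true

### ktransformers backend
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192

### dataset & output (placeholders)
dataset: identity
template: kimi_k2          # placeholder; use the template matching your model
cutoff_len: 2048
output_dir: saves/kimi-k2.5-lora-sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0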

2.2 Inference YAML (LlamaFactory Verification)

Key requirements:

  • adapter_name_or_path: LoRA output directory
  • infer_backend: ktransformers
  • Same use_kt and kt_optimize_rule as training

This YAML is used only for quick verification, not production serving.
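
A minimal sketch (paths and the template value are placeholders; the kt-specific fields mirror the training YAML):

model_name_or_path: /path/to/kimi-k2.5-bf16
adapter_name_or_path: /path/to/llamafactory/output_dir
template: kimi_k2          # placeholder; use the template matching your model
infer_backend: ktransformers
trust_remote_code: true
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192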

Step 3: Run LoRA SFT

conda activate kt-sft
cd LLaMA-Factory

USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml

After training, the LoRA adapter is saved to output_dir.
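
If training completed successfully, output_dir should contain the usual PEFT adapter artifacts, for example:

output_dir/
├── adapter_config.json        # LoRA hyperparameters (rank, alpha, target modules)
├── adapter_model.safetensors  # the trained LoRA weights
└── ...                        # tokenizer files, training logs, checkpoints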

Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)

Before production deployment, we recommend a lightweight sanity check.

conda activate kt-sft
cd LLaMA-Factory

llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml

Purpose:

  • Validate LoRA correctness
  • Ensure reproducibility
  • Not for throughput benchmarking

Step 5: Serve with SGLang

5.1 Convert LoRA for SGLang

python ktransformers/kt-kernel/scripts/convert_lora.py \
  --base_path /path/to/kimi-base-model \
  --lora_path /path/to/llamafactory/output_dir \
  --output_path /path/to/lora_converted

5.2 (Optional) Convert CPU Weights to INT8

To reduce CPU memory usage:

python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
  --base_path /path/to/kimi-base-model \
  --output_dir /path/to/kimi-base-model-int8

This produces:

/path/to/kimi-base-model-int8/int8

5.3 Launch SGLang Server with LoRA

conda activate kt-kernel

python -m sglang.launch_server \
  --enable-lora \
  --lora-paths lora1=/path/to/lora_converted \
  --lora-backend triton \
  --model-path /path/to/kimi-base-model \
  --tp 1 \
  --trust-remote-code \
  --context-length 4096 \
  --kt-weight-path /path/to/kimi-base-model-int8/int8 \
  --mem-fraction-static 0.9

Notes:

  • --kt-weight-path points to CPU INT8 weights
  • Adjust tp, context-length, and memory parameters per machine
  • For RAW-INT4 inference (without the INT8 conversion), follow the Kimi-K2.5-Native guide directly
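
To smoke-test the deployment end to end, send a request to SGLang's native /generate endpoint; lora_path must match the name registered via --lora-paths above (the port, prompt, and sampling parameters are illustrative):

curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Briefly introduce yourself.",
    "lora_path": "lora1",
    "sampling_params": {"max_new_tokens": 128, "temperature": 0.7}
  }'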