* [feat]: simplify sglang installation with submodule, auto-sync CI, and version alignment
- Add kvcache-ai/sglang as git submodule at third_party/sglang (branch = main)
- Add top-level install.sh for one-click source installation (sglang + kt-kernel)
- Add sglang-kt as hard dependency in kt-kernel/pyproject.toml
- Add CI workflow to auto-sync sglang submodule daily and create PR
- Add CI workflow to build and publish sglang-kt to PyPI
- Integrate sglang-kt build into release-pypi.yml (version.py bump publishes both packages)
- Align sglang-kt version with ktransformers via SGLANG_KT_VERSION env var injection
- Update Dockerfile to use submodule and inject aligned version
- Update all 13 doc files, CLI hints, and i18n strings to reference new install methods
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [build]: bump version to 0.5.2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [build]: rename PyPI package from kt-kernel to ktransformers
Users can now `pip install ktransformers` to get everything
(sglang-kt is auto-installed as a dependency).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "[build]: rename PyPI package from kt-kernel to ktransformers"
This reverts commit e0cbbf6364.
* [build]: add ktransformers meta-package for PyPI
`pip install ktransformers` now works as a single install command.
It pulls kt-kernel (which in turn pulls sglang-kt).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [fix]: show sglang-kt package version in kt version command
- Prioritize sglang-kt package version (aligned with ktransformers)
over sglang internal __version__
- Update display name from "sglang" to "sglang-kt"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [fix]: improve sglang-kt detection in kt doctor and kt version
Recognize sglang-kt package name as proof of kvcache-ai fork installation.
Previously both commands fell through to "PyPI (not recommended)" for
non-editable local source installs. Now version.py reuses the centralized
check_sglang_installation() logic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [build]: bump version to 0.5.2.post1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
# Kimi-K2.5 LoRA SFT Tutorial
This tutorial demonstrates how to perform LoRA Supervised Fine-Tuning (SFT) on Kimi-K2.5 using LlamaFactory with KTransformers as the backend, and then serve the fine-tuned model using SGLang.
The workflow is:
KTransformers + LlamaFactory LoRA SFT → (Optional) LlamaFactory Verification → SGLang Serving
## Table of Contents
- Hardware Requirements
- Prerequisites
- Step 0: Environment Setup
- Step 1: Prepare Model Weights (BF16 for SFT)
- Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)
- Step 3: Run LoRA SFT
- Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)
- Step 5: SGLang Serving with LoRA (Recommended Delivery Path)
## Hardware Requirements

### Training (LoRA SFT)

- LlamaFactory + KTransformers
- GPU: 4 * NVIDIA RTX 4090 24GB (or an equivalent setup with at least 48 GB of total VRAM)
- CPU: x86 CPU with AMX support
- RAM: at least 2 TB of system memory
  - Swap can be used if CPU memory is insufficient
### Inference (LoRA Adapter + Original Model)

- SGLang + KTransformers
- GPU: 2 * NVIDIA RTX 4090 24GB (or an equivalent setup with at least 48 GB of total VRAM)
- CPU: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
- RAM: at least 600 GB of system memory
- Storage: ~600 GB for model weights (native INT4 weights; the same weight directory serves both CPU and GPU)
## Step 0: Environment Setup

We recommend using two separate conda environments:

| Environment | Purpose |
|---|---|
| `kt-kernel` | Inference & serving (KTransformers + SGLang) |
| `kt-sft` | Training (LlamaFactory + KTransformers SFT backend) |
### 0.1 Inference Environment: kt-kernel

```shell
conda create -n kt-kernel python=3.11
conda activate kt-kernel

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```
### 0.2 Install SGLang (Inference / Serving)

Recommended for Kimi-K2.5:

```shell
# Option A: one-click install (from the ktransformers repo root; installs sglang + kt-kernel)
./install.sh

# Option B: pip install
pip install sglang-kt
```
### 0.3 Training Environment: kt-sft

```shell
conda create -n kt-sft python=3.11
conda activate kt-sft

git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
```
### 0.4 Install KTransformers SFT Dependencies

```shell
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# Install matching wheels (recommended) from https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl
```
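After installing the wheels, it is worth confirming that the environment actually sees them. The sketch below uses only the standard library; the package names checked are the ones installed above:

```python
# Sanity check: report the installed versions of the SFT dependencies.
# Uses only the standard library; package names match the wheels above.
from importlib import metadata

def installed_version(pkg: str) -> str:
    """Return the installed version of pkg, or 'not installed'."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return "not installed"

for pkg in ("ktransformers", "flash_attn"):
    print(f"{pkg}: {installed_version(pkg)}")
```

If either package reports "not installed", re-check that the wheel versions match your Python and CUDA versions before proceeding.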
## Step 1: Prepare Model Weights (BF16 for SFT)

### 1.1 Download INT4 Weights

KTransformers requires BF16 weights for SFT, but the published Kimi-K2.5 checkpoint is INT4, so download it first and convert it in the next step.

```shell
# Download Kimi-K2.5 (RAW-INT4, used for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
  --local-dir /path/to/kimi-k2.5
```

### 1.2 Convert INT4 → BF16

The Kimi-K2.5 base model ships in INT4 format; convert it to BF16 before running SFT.
## Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)

### 2.1 Training YAML (LoRA SFT)

Example file: `examples/train_lora/kimik2_lora_sft_kt.yaml`

Required fields:

```yaml
stage: sft
finetuning_type: lora
bf16: true
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192
```
Other fields (dataset, output_dir, learning rate, epochs) can be adjusted as usual.
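Putting the required fields together with the usual LlamaFactory options, a complete training config might look like the sketch below. Only the fields listed under "Required fields" above are mandated by this tutorial; the dataset name, template, paths, and hyperparameters are illustrative placeholders:

```yaml
# Sketch of a full LoRA SFT config; placeholder values are marked as such.
model_name_or_path: /path/to/kimi-k2.5-bf16   # BF16 weights from Step 1

stage: sft
do_train: true
finetuning_type: lora

bf16: true
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192

dataset: identity                  # placeholder dataset name
template: kimi                     # placeholder chat template
output_dir: saves/kimi-k2.5-lora   # placeholder output path
learning_rate: 1.0e-4
num_train_epochs: 3.0
per_device_train_batch_size: 1
```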
### 2.2 Inference YAML (LlamaFactory Verification)

Key requirements:

- `adapter_name_or_path`: the LoRA output directory
- `infer_backend: ktransformers`
- The same `use_kt` and `kt_optimize_rule` values as in training

This YAML is used only for quick verification, not production serving.
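Following the key requirements above, a minimal verification config could look like this sketch (all paths and the template name are placeholders):

```yaml
# Sketch of the verification config; mirrors the training backend settings.
model_name_or_path: /path/to/kimi-k2.5-bf16
adapter_name_or_path: saves/kimi-k2.5-lora   # the LoRA output_dir from training
infer_backend: ktransformers
use_kt: true
kt_optimize_rule: <rule.yaml>                # same rule file as training
template: kimi                               # placeholder chat template
```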
## Step 3: Run LoRA SFT

```shell
conda activate kt-sft
cd LlamaFactory
USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
```

After training, the LoRA adapter is saved to `output_dir`.
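To confirm the run actually produced an adapter, you can check `output_dir` for the artifacts LlamaFactory writes. A minimal sketch; the file names follow the standard PEFT convention, and the example path is a placeholder:

```python
# Check that a LoRA adapter directory contains the expected PEFT artifacts:
# the adapter config plus weights in either safetensors or pickle format.
from pathlib import Path

def adapter_ready(output_dir: str) -> bool:
    d = Path(output_dir)
    has_config = (d / "adapter_config.json").exists()
    has_weights = any(
        (d / name).exists()
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    )
    return has_config and has_weights

print(adapter_ready("saves/kimi-k2.5-lora"))  # placeholder path
```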
## Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)

Before production deployment, run a lightweight sanity check:

```shell
conda activate kt-sft
cd LlamaFactory
llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
```

Purpose:

- Validate LoRA correctness
- Ensure reproducibility
- Not for throughput benchmarking
## Step 5: SGLang Serving with LoRA (Recommended Delivery Path)

This is the recommended path for serving the fine-tuned model in production.

### 5.1 Convert LoRA for SGLang

```shell
python ktransformers/kt-kernel/scripts/convert_lora.py \
  --base_path /path/to/kimi-base-model \
  --lora_path /path/to/llamafactory/output_dir \
  --output_path /path/to/lora_converted
```
### 5.2 (Optional) Convert CPU Weights to INT8

To reduce CPU memory usage:

```shell
python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
  --base_path /path/to/kimi-base-model \
  --output_dir /path/to/kimi-base-model-int8
```

This produces `/path/to/kimi-base-model-int8/int8`.
### 5.3 Launch SGLang Server with LoRA

```shell
conda activate kt-kernel

python -m sglang.launch_server \
  --enable-lora \
  --lora-paths lora1=/path/to/lora_converted \
  --lora-backend triton \
  --model-path /path/to/kimi-base-model \
  --tp 1 \
  --trust-remote-code \
  --context-length 4096 \
  --kt-weight-path /path/to/kimi-base-model-int8/int8 \
  --mem-fraction-static 0.9
```

Notes:

- `--kt-weight-path` points to the CPU INT8 weights
- Adjust `tp`, `context-length`, and the memory parameters per machine
- RAW-INT4 inference paths can follow Kimi-K2.5-Native directly
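Once the server is up, requests can be routed through the adapter registered as `lora1`. The sketch below targets SGLang's native `/generate` endpoint using only the standard library; the host, port, and sampling parameters are assumptions for a default local launch:

```python
# Query a locally running SGLang server, routing the request through the
# LoRA adapter registered above via --lora-paths lora1=...
import json
import urllib.request

def build_request(prompt: str, lora_name: str = "lora1") -> dict:
    # "lora_path" selects which registered adapter serves this request
    return {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.7},
        "lora_path": lora_name,
    }

def generate(prompt: str, url: str = "http://127.0.0.1:30000/generate") -> str:
    payload = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

To compare against the base model, send the same prompt without the `lora_path` field and diff the outputs.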