upload hands-on tutorial with KTransformers-FT, especially in customize your KT-FT+LLaMA-Factory (#1597)

* Add files via upload

* upload hands-on tutorial for KTransformers-FT
Peilin Li
2025-11-11 20:54:41 +08:00
committed by GitHub
parent d483147307
commit 148a030026
2 changed files with 621 additions and 0 deletions

@@ -0,0 +1,621 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6201cdec-70f7-4c22-b988-b23ece31979d",
"metadata": {},
"source": [
"<div align=\"center\">\n",
" <!-- <h1>KTransformers</h1> -->\n",
" <p align=\"center\">\n",
"\n",
"<picture>\n",
" <img alt=\"KTransformers\" src=\"https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b\" width=50%>\n",
"\n",
"</picture>\n",
"\n",
"</p>\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "5dcfddc6-d51b-4aa8-b887-f7c817492316",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"# **Introduction**\n",
"[KTransformers](https://github.com/kvcache-ai/ktransformers), is designed to enhance the 🤗 Transformers experience through advanced kernel optimizations and placement/parallelism strategies. \n",
"<br/> <br/>\n",
"This tutorial serves as a guide for KTransformers-ft, aiming to to give resource-constrained researchers a **local path to explore fine-tuning ultra-large models (e.g., 671B/1000B)**, and also a fast way to customize smaller models (e.g., 14B/30B) for specific scenarios. We validate the setup using representative tasks such as stylized dialogue, Westernized translation tone, and medical Q&A, demonstrating that personalized adaptation can be achieved within hours.\n",
"<br/> <br/>\n",
"This tutorial takes DeepSeek-V2-Lite as a code example; for more details, refer to [KTransformers-Fine-Tuning_User-Guide](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/KTransformers-Fine-Tuning_User-Guide.md) and [KTransformers-Fine-Tuning_Developer-Technical-Notes](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/KTransformers-Fine-Tuning_Developer-Technical-Notes.md)."
]
},
{
"cell_type": "markdown",
"id": "b4167684-81f4-4e2b-a486-c33ec3bc92f0",
"metadata": {},
"source": [
"# **Installation**"
]
},
{
"cell_type": "markdown",
"id": "5548a7f8-20d6-4ae4-a575-a3ef7a0ea5f8",
"metadata": {},
"source": [
"### **1. Install torch and clone the repo**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f39051d-eb14-44fa-af82-9ded23144985",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git\n",
"!cd LLaMA-Factory"
]
},
{
"cell_type": "markdown",
"id": "e7dd351f-9102-4d7d-951c-4306df9f4cd7",
"metadata": {},
"source": [
"**(Optional)** If you want to choose your version of torch and cuda, please install separately."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a5afa0c-1ed0-4190-ab50-967e553d6fd2",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"!pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu118"
]
},
{
"cell_type": "markdown",
"id": "711dcc79-056f-4483-a2e1-7e780af1def1",
"metadata": {},
"source": [
"### **2. Install LLaMA-Factory**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "42f09df9-7db8-46e3-b11d-2946a57d2933",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.chdir(\"LLaMA-Factory\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a6a5532-e5cc-463b-bdf8-030e547287fc",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"!pip install -e \".[torch,metrics]\" --no-build-isolation"
]
},
{
"cell_type": "markdown",
"id": "48c19762-70a7-402c-94f9-a71b277eb932",
"metadata": {},
"source": [
"### **3. Install dependency libraries for GCC and CUDA**\n",
"You need to install system-level dependency libraries. `libstdcxx-ng` and `gcc_impl_linux-64` ensure compilation compatibility, while cuda-runtime provides a GPU-accelerated runtime environment. **Please do NOT IGNORE this two commands! `nvidia/label/cuda-11.8.0 cuda-runtime` should be installed for every version of cuda for KT whl.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "202e672a-b30a-4bde-92d5-27500f435b30",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
},
"scrolled": true
},
"outputs": [],
"source": [
"!conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64\n",
"!conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime"
]
},
{
"cell_type": "markdown",
"id": "94e6448f-1e27-4f16-885c-27738c2089dc",
"metadata": {},
"source": [
"### **4. Install ktransformers and flash-attention**\n",
"You need to download the corresponding version of python, cuda and torch from [downloading ktransformers whl](https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.1) and [downloading flash-attention whl](https://github.com/Dao-AILab/flash-attention/releases)."
]
},
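{
"cell_type": "markdown",
"id": "kt-ft-wheel-tag-sketch",
"metadata": {},
"source": [
"Picking a wheel that does not match your environment is an easy mistake, so here is a minimal sketch (not part of KTransformers or LLaMA-Factory) that reads the tags out of a wheel filename and compares the Python tag with the running interpreter. The helper names `parse_wheel_tags` and `matches_interpreter` are hypothetical; the only assumption is the standard wheel naming layout `name-version-python-abi-platform.whl`.\n",
"\n",
"```python\n",
"import sys\n",
"\n",
"def parse_wheel_tags(filename):\n",
"    # Standard wheel layout: name-version[-build]-python-abi-platform.whl\n",
"    stem = filename[:-len('.whl')] if filename.endswith('.whl') else filename\n",
"    head, python_tag, abi_tag, platform_tag = stem.rsplit('-', 3)\n",
"    # The local version label after '+' carries the CUDA/torch build info.\n",
"    local_label = head.split('+', 1)[1] if '+' in head else ''\n",
"    return {'python': python_tag, 'abi': abi_tag,\n",
"            'platform': platform_tag, 'local': local_label}\n",
"\n",
"def matches_interpreter(filename, version_info=sys.version_info):\n",
"    # True when the wheel's CPython tag equals the interpreter's tag.\n",
"    expected = 'cp{}{}'.format(version_info[0], version_info[1])\n",
"    return parse_wheel_tags(filename)['python'] == expected\n",
"\n",
"wheel = 'ktransformers-0.4.1+cu128torch27fancy-cp312-cp312-linux_x86_64.whl'\n",
"print(parse_wheel_tags(wheel))\n",
"print('matches this Python:', matches_interpreter(wheel))\n",
"```\n",
"\n",
"The local version label (here `cu128torch27fancy`) is where the CUDA and torch build is encoded, so compare it by eye with the torch build you installed."
]
},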
{
"cell_type": "code",
"execution_count": 10,
"id": "7c4a5e82-ae9f-490f-9f90-441cdd98041e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n"
]
}
],
"source": [
"import torch\n",
"print(torch._C._GLIBCXX_USE_CXX11_ABI)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "837a2240-818d-499f-a1b5-641fa5c45339",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
},
"scrolled": true
},
"outputs": [],
"source": [
"!pip install ../ktransformers-0.4.1+cu128torch27fancy-cp312-cp312-linux_x86_64.whl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3c78d9e-26e0-4f85-94ff-d6b028b194ac",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"!pip install ../flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
]
},
{
"cell_type": "markdown",
"id": "2593e2cb-5fbd-4d66-94fc-d2d74c4d8f65",
"metadata": {},
"source": [
"# **How to Start**\n",
"## Fine-tuning the Model with LoRA"
]
},
{
"cell_type": "markdown",
"id": "f7db3349-8cdb-48cd-8b63-0ea70fe4af6f",
"metadata": {},
"source": [
"LoRA (Low-Rank Adaptation) fine-tuning only trains small \"adapter\" weights for large models. However, under traditional frameworks, it still needs more than 1400GB GPU VRAM, which hardly handles on the 4090s machine. **KTransformers**, as high-performance backend engine, provides a solution for GPU/CPU Hybrid devices to further cut GPU memory usage and speed up training. As shown below, we compare KTransformers(ours) with other common LoRA fine-tuning backends (HuggingFace and Unsloth). KTransformers is the **only workable 4090-class solution** for ultra-large MoE models (e.g., 671B) and also delivers higher fine-tuning throughput. <br/>\n",
"<div style=\"text-align: center;\">\n",
"<img src=\"https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/按照模型划分的对比图_02.png\" alt=\"kt_unsloth_huggingface_compare\" width=\"70%\" height=\"auto\">\n",
"</div>\n",
"\n",
"To make KTransformers-ft more easy-to-use, we cooperator with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/), a easy and efficiency model fine-tuning framework. As shown below, LLaMA-Factory is the unified configuration layer for the whole fine-tuning workflow. **KTransformers** acts as a high-performance backend that takes over core operators like Attention/MoE under the same training configs, enabling efficient **GPU+CPU heterogeneous cooperation**. <br/>\n",
"<div style=\"text-align: center;\">\n",
"<img src=\"https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/image-20251011010558909.png\" alt=\"image-20251011010558909\" width=\"70%\" height=\"auto\">\n",
"</div>\n",
"\n",
"This combination lets you fine-tune big models (like 671B/1000B) on consumer level GPUs (2-4 RTX 4090s) — no need for expensive hardware. Heres the training command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "baf5b8fc-e910-4531-9f00-a2076c698eff",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml"
]
},
{
"cell_type": "markdown",
"id": "dc80b189-17ac-47a7-9889-b77e7a9d5304",
"metadata": {},
"source": [
"Lets break down the training command (`USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml`):\n",
"- `USE_KT=1`: The \"switch\" to enable KTransformers optimization. \n",
"- `llamafactory-cli train`: The core command to start LLaMA-Factorys fine-tuning tool.\n",
"- `examples/train_lora/deepseek2_lora_sft_kt.yaml`: The configuration file that controls model, data, training rules and KTransformers settings — well detail this next.\n",
"\n",
"**The LLaMA-Factory yaml (e.g. `deepseek2_lora_sft_kt.yaml`) is where you define how the fine-tuning works.** Below is a simplified version, you can use this directly for basic tasks like style transfer or domain Q&A. And Well explain each sections purpose and why the values are set this way in the following part--Custom your KTransformers-FineTuning + LLaMA-Factory.\n",
"```yaml\n",
"### model\n",
"model_name_or_path: deepseek-ai/DeepSeek-V2-Lite\n",
"\n",
"### method\n",
"finetuning_type: lora\n",
"lora_rank: 8\n",
"lora_target: all\n",
"\n",
"### dataset\n",
"dataset: identity\n",
"template: deepseek\n",
"cutoff_len: 2048\n",
"max_samples: 100000\n",
"\n",
"### output\n",
"output_dir: saves/Kllama_deepseekV2\n",
"logging_steps: 10\n",
"save_steps: 500\n",
"\n",
"### train\n",
"per_device_train_batch_size: 1\n",
"gradient_accumulation_steps: 8\n",
"learning_rate: 1.0e-4\n",
"num_train_epochs: 3.0\n",
"\n",
"### ktransformers\n",
"use_kt: true # use KTransformers as LoRA sft backend\n",
"kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml\n",
"cpu_infer: 32\n",
"chunk_size: 8192\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "dac7722d-89dd-40b1-ac27-7ca64e80fe47",
"metadata": {},
"source": [
"## Chat with the Fine-tuned Model: Test Your Customized AI"
]
},
{
"cell_type": "markdown",
"id": "9af428c6-4fce-4320-b3d3-af59726ab9ce",
"metadata": {},
"source": [
"After finishing fine-tuning with KTransformers, **the next step is to chat with your model and verify the results!** This step loads the original base model plus the fine-tuned \"custom plugin\" (LoRA adapter) you saved earlier, letting you interact with the model in real time. \n",
"\n",
"Well use LLaMA-Factorys `chat` command to launch the interactive interface. The core is the LLaMA-Factory YAML configuration file — it tells the tool which model to load, how to optimize inference, and what style of dialogue to use. We take one of the example as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37191db1-a97c-407c-9626-af9fde6dd94f",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml"
]
},
{
"cell_type": "markdown",
"id": "06c18255-66d0-4189-a714-6050160a0637",
"metadata": {},
"source": [
"To know exactly what youre running, we break down the full command (`llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml`):\n",
"- `llamafactory-cli chat`: The core command to launch LLaMA-Factorys interactive chat tool.\n",
"- `examples/inference/deepseek2_lora_sft_kt.yaml`: The configuration file for inference (controls model loading, optimization, and dialogue settings).\n",
"- No need for `USE_KT=1` here — well enable KTransformers directly in the YAML (but it still needs to match the training settings!).\n",
"\n",
"**The LLaMA-Factory configuration file for inference (`examples/inference/deepseek2_lora_sft_kt.yaml`) controls the generate config for specific tasks.** Below is a simplified version, you can use this directly to chat with your fine-tuned model. Most setting is linked to your training config — well still explain the details in next part.\n",
"```yaml\n",
"model_name_or_path: deepseek-ai/DeepSeek-V2-Lite\n",
"adapter_name_or_path: saves/Kllama_deepseekV2\n",
"template: deepseek\n",
"infer_backend: ktransformers # choices: [huggingface, vllm, sglang, ktransformers]\n",
"trust_remote_code: true\n",
"\n",
"use_kt: true # use KTransformers as LoRA sft backend to inference\n",
"kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml\n",
"cpu_infer: 32\n",
"chunk_size: 8192\n",
"```\n",
"`kt_optimize_rule` needs as same as the kt_optimize_rule in LoRA Fine-tuning."
]
},
{
"cell_type": "markdown",
"id": "18814c5c-3b73-44cc-a608-505c1e870437",
"metadata": {},
"source": [
"# **Custom your KTransformers-FineTuning + LLaMA-Factory**"
]
},
{
"cell_type": "markdown",
"id": "8072427f-46d4-41fb-8850-e33a2446e031",
"metadata": {},
"source": [
"Once youve got the basic fine-tuning workflow down, youll likely want to **adapt the process to your specific needs**—whether thats training on your own data, squeezing more performance out of limited GPU memory, or speeding up training for large datasets. Belows a hands-on guide to customizing every part of the process, with clear explanations of why each setting matters and how to tweak it.\n",
"\n",
"## 1. Fine-tuning Customization: Tailor Training to Your Needs \n",
"To start customizing, youll still use the core training command: `USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml`. Notably, it performs even better than the default setup when adapted to your specific needs. <br/>\n",
"### Full example **LLaMA-Factory YAML** for DeepSeek-V2-Lite\n",
"```yaml\n",
"### model\n",
"model_name_or_path: deepseek-ai/DeepSeek-V2-Lite\n",
"trust_remote_code: true\n",
"\n",
"### method\n",
"stage: sft\n",
"do_train: true\n",
"finetuning_type: lora\n",
"lora_rank: 8\n",
"lora_target: all\n",
"\n",
"### dataset\n",
"dataset: identity\n",
"template: deepseek\n",
"cutoff_len: 2048\n",
"max_samples: 100000\n",
"overwrite_cache: true\n",
"preprocessing_num_workers: 16\n",
"dataloader_num_workers: 4\n",
"\n",
"### output\n",
"output_dir: saves/Kllama_deepseekV2Lite\n",
"logging_steps: 10\n",
"save_steps: 500\n",
"plot_loss: true\n",
"overwrite_output_dir: true\n",
"save_only_model: false\n",
"report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]\n",
"\n",
"### train\n",
"per_device_train_batch_size: 1\n",
"gradient_accumulation_steps: 8\n",
"learning_rate: 1.0e-4\n",
"num_train_epochs: 3.0\n",
"lr_scheduler_type: cosine\n",
"warmup_ratio: 0.1\n",
"bf16: true\n",
"ddp_timeout: 180000000\n",
"resume_from_checkpoint: null\n",
"\n",
"### ktransformers\n",
"use_kt: true # use KTransformers as LoRA sft backend\n",
"kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Chat-sft-amx.yaml\n",
"cpu_infer: 32\n",
"chunk_size: 8192\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "6abc1968-6208-4344-9c82-335d7fe1d27c",
"metadata": {},
"source": [
"---\n",
"### A. Pick & Prepare Your Model\n",
"The first step in customization is choosing the right base model, and ensuring it works with KTransformers. The `model_name_or_path` setting (shown in LLaMA-Factory YAML before) controls this, and getting it right avoids common errors.\n",
"- **Use a public model**: Directly set to Hugging Face Hub names (e.g., `deepseek-ai/DeepSeek-V2-Lite`, `Qwen/Qwen2-MoE-72B`). \n",
"- **Use a local model**: Replace with your local folder path (e.g., `/mnt/data/models/DeepSeek-V2-Lite`).\n",
"\n",
"**Critical Requirement**: The model must be in **BF16 format**. \n",
" - FP8 models (like DeepSeek-V3s default release) arent compatible with KTransformers optimization. \n",
" - Fix: Convert FP8 to BF16 with **[this official script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py)**.\n",
"\n",
"---\n",
"\n",
"### B. Tune LoRA: Balance Fitting Capability & Memory \n",
"LoRA trains tiny \"adapter\" weights instead of the entire model. Tweaking these two settings in LLaMA-Factory YAML (`lora_rank`, `lora_target`) lets you balance how well the model learns your data and how much GPU memory it uses:\n",
"\n",
"| Setting | What it does | Scenario & Recommendation |\n",
"|-----------------|-----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|\n",
"| `lora_rank` | Controls the \"power\" of LoRA adapters (higher = more fitting, more memory). | - Small dataset (≤5k samples) or limited GPU: 4-8 (balances speed/memory).<br>- Large dataset (≥20k samples): 16-32 (better fits custom data). |\n",
"| `lora_target` | Which layers get LoRA (applies only to linear layers). | - Quick fine-tuning (e.g., style transfer): `q_proj,v_proj` (only attention layers—faster).<br>- Deep customization (e.g., medical Q&A): `all` (all linear layers—more accurate). |\n",
"\n",
"**Tip**: Pair `lora_rank=8` with `lora_alpha=32` (alpha = 4× rank) for stable training This ratio is tested to work well for most tasks, from chatbots to domain Q&A. \n",
"\n",
"---\n",
"\n",
"### C. Use Your Own Dataset\n",
"Fine-tunings value lies in training on your own data, such as company documents, customer support logs, or domain-specific Q&A. Below is how to replace the default (identity) dataset with yours: \n",
"\n",
"1. **Add a custom dataset**: \n",
" - Step 1: Organize your data into LLaMA-Factorys format (e.g., JSON with `instruction`, `input`, `output` fields—see [dataset examples](https://github.com/hiyouga/LLaMA-Factory/tree/main/data)). \n",
" - Step 2: Register your dataset in [LLaMA-Factory/data/dataset_info.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) (copy the format of built-in datasets—just add your dataset name and file path).\n",
" For example,\n",
" ```json\n",
" \"niko\": {\n",
" \"file_name\": \"../niko_train.json\"\n",
" },\n",
" ```\n",
" - Step 3: You may replace `dataset: identity` in LLaMA-Factory YAML to your dataset name (e.g. `dataset: niko`).\n",
"2. **Tweak dataset settings for better results**: \n",
" - `cutoff_len`: Truncates long texts (e.g., set to 4096 for long documents, 2048 for short dialogues—never exceed `model_max_length`). \n",
" - `max_samples`: Limit samples to avoid overfitting (use 100 for debugging, `None` for full training—great if your dataset is huge). \n",
" - `template`: Must match your model (e.g., `deepseek` for DeepSeek, `llama3` for LLaMA3, more refer to [supported-models](https://github.com/hiyouga/LLaMA-Factory/tree/main?tab=readme-ov-file#supported-models))—mismatched templates break response formatting! \n",
"\n",
"---\n",
"\n",
"### D. Save GPU Memory & Speed Up Training \n",
"If youre hitting GPU memory limits or waiting too long for training, adjust these settings in LLaMA-Factory YAML: \n",
"\n",
"| Challenge | Setting to Tweak | How to Adjust |\n",
"|-------------------------|-------------------------------------------|--------------------------------------------------------------------------------|\n",
"| GPU memory is tight | `per_device_train_batch_size` + `gradient_accumulation_steps` | Set `per_device_train_batch_size=1` (smallest batch) + `gradient_accumulation_steps=16` (simulates a batch of 16—no memory penalty!). |\n",
"| Model overfits (bad generalization) | `lora_dropout` + `num_train_epochs` | Add `lora_dropout: 0.1` (prevents overfitting) + reduce `num_train_epochs` to 2 (3 is default—overtraining hurts!). |\n",
"\n",
"**Key Train Configs Recap**: \n",
"- `learning_rate`: 1e-4~2e-4 for LoRA (stick to this range—too high = unstable, too low = slow learning). \n",
"- `save_steps`: Save checkpoints every 100-500 steps (frequent saves = safe, but dont overdo it—each checkpoint takes storage!). \n",
"- `output_dir`: Customize the save path (e.g., `saves/medical_qa_deepseek` instead of the default—keeps your projects organized!). \n",
"\n",
"---\n",
"\n",
"### E. KTransformers Optimization: Unlock Maximum Performance \n",
"KTransformers is what makes fine-tuning large models (like 671B-parameter DeepSeek-V3) possible on modest hardware. These settings control how it optimizes layer placement (GPU vs. CPU) and computation speed:\n",
"\n",
"| Setting | What it does | How to Customize |\n",
"|-----------------------|-----------------------------------------------------------------------------|----------------------------------------------------------------------------------|\n",
"| `use_kt` | Enables KTransformers backend (must be `true`—otherwise, no optimization!). | Leave as `true`—this is what makes 671B models trainable on 2×4090s! |\n",
"| `cpu_infer` | Number of CPU threads for MoE/linear computations. | Set to half your CPU cores (e.g., 32 for a 64-core CPU—too many threads = bottlenecks!). |\n",
"| `chunk_size` | Block size for long text processing (affects memory and speed). | Default 8192 works for most tasks; increase to 16384 for extra-long texts (e.g., book summaries). |\n",
"| `kt_optimize_rule` | Defines where layers run (GPU/CPU) and which kernels to use (core of KT!). | - Use the pre-built rule for your model (e.g., `DeepSeek-V2-Lite-Chat-sft-amx.yaml`).<br>- For faster speed: Use `AMXInt8`/`AMXBF16` as backend (if your CPU supports AMX—check with `lscpu | grep amx`).<br>- For compatibility: Fall back to `llamafile` if AMX isnt supported. |\n",
"\n",
"#### Example Custom `kt_optimize_rule` (shown in the table above) \n",
"This rule tells KTransformers to offload heavy MoE layers to the CPU (saving GPU memory) and use AMX for fast CPU computation. Use it as a template for your own model: (Details tutorial could be seen in **[here](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/injection_tutorial.md)**)\n",
"```yaml\n",
"- match:\n",
" name: \"^model\\\\.layers\\\\..*\\\\.mlp\\\\.experts$\" # Target all MoE expert layers\n",
" replace:\n",
" class: ktransformers.operators.experts.KTransformersExperts # KT's optimized MoE kernel\n",
" kwargs:\n",
" prefill_device: \"cuda\" # Fast pre-processing on GPU\n",
" prefill_op: \"KExpertsTorch\"\n",
" generate_device: \"cpu\" # Heavy MoE compute on CPU (saves GPU memory)\n",
" generate_op: \"KSFTExpertsCPU\" # KT's SFT-optimized MoE operator\n",
" out_device: \"cuda\" # Send results back to GPU for next steps\n",
" backend: \"AMXInt8\" # Options: AMXInt8 (fastest) > AMXBF16 > llamafile (default)\n",
"```\n",
"**Alert:** Never mix KLinearMarlin with LoRA fine-tuning—replace it with KLinearTorch (as in the example) to avoid compatibility issues!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93840117-084b-44fa-8b2e-6389e4a52bf0",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml"
]
},
{
"cell_type": "markdown",
"id": "c6d0b4db-65f7-4683-88d0-3269c962224c",
"metadata": {},
"source": [
"## 2. Chat with the Fine-tuned Model"
]
},
{
"cell_type": "markdown",
"id": "fdbc5e95-9567-4b8a-94d7-eec410d94a6b",
"metadata": {},
"source": [
"After completing fine-tuning, the next critical step is to test your customized model through real-time interaction. Running `llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml` loads the base model and your fine-tuned LoRA adapter. Belows a detailed guide to customizing the chat process, with clear explanations of each settings role and how to fit it to your specific tasks.\n",
"\n",
"### Full example LLaMA-Factory YAML for inference\n",
"```yaml\n",
"model_name_or_path: deepseek-ai/DeepSeek-V2-Lite\n",
"adapter_name_or_path: saves/Kllama_deepseekV2Lite\n",
"template: deepseek\n",
"infer_backend: ktransformers # choices: [huggingface, vllm, sglang, ktransformers]\n",
"trust_remote_code: true\n",
"\n",
"use_kt: true # use KTransformers as LoRA sft backend to inference\n",
"kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Chat-sft-amx.yaml\n",
"cpu_infer: 32\n",
"chunk_size: 8192\n",
"```\n",
"\n",
"---\n",
"\n",
"### A. Load Your Fine-Tuned Adapter (Two Supported Formats) \n",
"The `adapter_name_or_path` setting in LLaMA-Factory YAML points to your trained LoRA weights. Two formats are supported: \n",
"- **Folder Format (Default)**: If training saved a folder (e.g., `saves/Kllama_deepseekV2`) with `.safetensors` files, set it directly (e.g., `adapter_name_or_path: saves/Kllama_deepseekV2`). \n",
"- **GGUF Format (Single File)**: If you exported the adapter to a `.gguf` file (for portability), set the full path (e.g., `adapter_name_or_path: saves/my_adapter.gguf`). \n",
"\n",
"---\n",
"\n",
"### B. Tweak Response Quality (Generation Configs) \n",
"Optional generation parameters let you adjust the models responses to fit specific use cases, whether you need factual accuracy, creative expression, or concise answers. Add these to your YAML and modify based on your needs:\n",
"```yaml\n",
"# Optional generation configs (add to your inference YAML)\n",
"max_new_tokens: 1024 # Max length of responses (512 = short, 2048 = long)\n",
"temperature: 0.7 # Randomness (0.1 = factual/consistent, 1.0 = creative/diverse)\n",
"top_p: 0.9 # Focus (0.8-0.95 = avoids irrelevant content)\n",
"repetition_penalty: 1.1 # Reduces repetition (1.0 = no penalty, 1.2 = strict)\n",
"```\n",
"\n",
"---\n",
"\n",
"### C. KTransformers Inference Backend \n",
"The KTransformers-related settings directly impact inference performance—they must align with your training configuration to maintain optimization effects (e.g., low memory usage, fast speed):\n",
"- `infer_backend` determines how the model generates responses—pick based on your needs. You need to choose `ktransformers`, if you LoRA fine-tuning it with ktransformers.\n",
"- `use_kt: true`: Must match training—disables KT optimization if set to `false` (slower inference!). \n",
"- `kt_optimize_rule`: Use the **exact same file** as training (e.g., `DeepSeek-V2-Lite-Chat-sft-amx.yaml`)—ensures layers map correctly. \n",
"\n",
"---\n",
"\n",
"### How to Verify Inference Works\n",
"After launching the chat command, check the logs for these key messages to confirm the model is running correctly:\n",
"1. `Loaded adapter weight: XXX -> XXX`: LoRA adapter is loaded correctly. \n",
"2. `KTransformers inference enabled`: KT optimization is active. \n",
"3. `Backend: AMXInt8`: AMX acceleration is working (if supported). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c08b31f7-32a4-4d51-b6c0-d063d7785371",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "KNllama",
"language": "python",
"name": "knllama"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}