mirror of https://github.com/kvcache-ai/ktransformers.git
synced 2026-03-14 18:37:23 +00:00
[doc]: update web doc and kt-kernel doc (#1609)
* [doc]: update web doc and kt-kernel doc
* [doc](book.toml): add book.toml for rust book compile
book.toml (new file, 18 lines)
@@ -0,0 +1,18 @@
[book]
authors = ["kvcache-ai"]
language = "zh-CN"
title = "Ktransformers"
src = "doc"

[output.html]
git-repository-url = "https://github.com/kvcache-ai/ktransformers"
edit-url-template = "https://github.com/kvcache-ai/ktransformers/edit/main/{path}"

[output.html.playground]
editable = true
copy-js = true
# line-numbers = true

[output.html.fold]
enable = true
level = 0
@@ -1,38 +0,0 @@
<div align="center">
  <!-- <h1>KTransformers</h1> -->
  <p align="center">
    <picture>
      <img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
    </picture>
  </p>
</div>

<h2 id="intro">🎉 Introduction</h2>
KTransformers, pronounced as Quick Transformers, is designed to enhance your 🤗 <a href="https://github.com/huggingface/transformers">Transformers</a> experience with advanced kernel optimizations and placement/parallelism strategies.
<br/><br/>
KTransformers is a flexible, Python-centric framework designed with extensibility at its core.
By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible
interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI.
<br/><br/>
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.

<h2 id="Updates">🔥 Updates</h2>

* **May 14, 2025**: Support Intel Arc GPU ([Tutorial](./en/xpu.md)).
* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./en/llama4.md)).
* **Apr 2, 2025**: Support multi-concurrency ([Tutorial](./en/balance-serve.md)).
* **Mar 27, 2025**: Support multi-concurrency.
* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./en/fp8_kernel.md) weights. Support 139K [Longer Context](./en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single GPU (24GB VRAM)/multi-GPU and 382GB DRAM, with up to 3~28x speedup. The detailed tutorial is [here](./en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./en/long_context_tutorial.md).
* **Aug 28, 2024**: Decrease DeepSeek-V2's required VRAM from 21GB to 11GB.
* **Aug 15, 2024**: Update detailed [TUTORIAL](./en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as linear backend.
* **Aug 12, 2024**: Support multiple GPUs; support new models mixtral 8\*7B and 8\*22B; support q2k, q3k, q5k dequantization on GPU.
* **Aug 9, 2024**: Support Windows natively.
doc/README.md (symbolic link, 1 line)
@@ -0,0 +1 @@
../README.md
@@ -2,7 +2,8 @@
[Introduction](./README.md)

# Install
- [Installation Guide](en/install.md)
- [Installation Guide for kt-kernel](en/kt-kernel_intro.md)
- [SFT user guide](en/KTransformers-Fine-Tuning_User-Guide.md)

# Tutorial
- [Deepseek-R1/V3 Show Case/Tutorial](en/DeepseekR1_V3_tutorial.md)
@@ -11,18 +12,13 @@
- [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
- [Use FP8 GPU Kernel](en/fp8_kernel.md)
- [Use AMD GPU](en/ROCm.md)
- [SFT user guide](en/KTransformers-Fine-Tuning_User-Guide.md)
- [SFT developer tech notes](en/KTransformers-Fine-Tuning_Developer-Technical-Notes.md)

# Server
- [Server](en/api/server/server.md)
- [Website](en/api/server/website.md)
- [Tabby](en/api/server/tabby.md)

# For Developer
- [Makefile Usage](en/makefile_usage.md)
<!-- # For Developer
- [Makefile Usage](en/makefile_usage.md) -->

# FAQ
- [FAQ](en/FAQ.md)

# V3 Reproduction
<!-- # V3 Reproduction
- [Success List](en/V3-success.md)
# Benchmark
- [Benchmark](en/benchmark.md)
- [Benchmark](en/benchmark.md) -->
doc/en/FAQ.md (167 lines changed)
@@ -1,167 +1,2 @@
<!-- omit in toc -->
# FAQ
- [Install](#install)
  - [Q: ImportError: /lib/x86\_64-linux-gnu/libstdc++.so.6: version \`GLIBCXX\_3.4.32' not found](#q-importerror-libx86_64-linux-gnulibstdcso6-version-glibcxx_3432-not-found)
  - [Q: DeepSeek-R1 not outputting initial \<think> token](#q-deepseek-r1-not-outputting-initial--token)
- [Usage](#usage)
  - [Q: If I got more VRAM than the model's requirement, how can I fully utilize it?](#q-if-i-got-more-vram-than-the-models-requirement-how-can-i-fully-utilize-it)
  - [Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?](#q-if-i-dont-have-enough-vram-but-i-have-multiple-gpus-how-can-i-utilize-them)
  - [Q: How to get the best performance?](#q-how-to-get-the-best-performance)
  - [Q: My DeepSeek-R1 model is not thinking.](#q-my-deepseek-r1-model-is-not-thinking)
  - [Q: Loading gguf error](#q-loading-gguf-error)
  - [Q: Version \`GLIBCXX\_3.4.30' not found](#q-version-glibcxx_3430-not-found)
  - [Q: When running the bfloat16 moe model, the data shows NaN](#q-when-running-the-bfloat16-moe-model-the-data-shows-nan)
  - [Q: Using fp8 prefill very slow.](#q-using-fp8-prefill-very-slow)
  - [Q: Possible ways to run graphics cards using Volta and Turing architectures](#q-possible-ways-to-run-graphics-cards-using-volta-and-turing-architectures)

## Install
### Q: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found

On Ubuntu 22.04 you need to add the toolchain PPA and upgrade libstdc++6:

```
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
```

From https://github.com/kvcache-ai/ktransformers/issues/117#issuecomment-2647542979
### Q: DeepSeek-R1 not outputting initial <think> token

> From the DeepSeek-R1 doc:<br>
> Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "\<think>\n\n\</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "\<think>\n" at the beginning of every output.

We fix this by manually appending a "\<think>\n" token to the end of the prompt (see local_chat.py); passing the arg `--force_think true` makes local_chat initiate the response with "\<think>\n".

From https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
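The `--force_think` behavior described above amounts to a one-line prompt edit. A minimal sketch (hypothetical helper name, not the actual local_chat.py code):

```python
THINK_OPENER = "<think>\n"

def apply_force_think(prompt: str, force_think: bool = True) -> str:
    """Append the reasoning opener so the model starts its answer inside a think block."""
    return prompt + THINK_OPENER if force_think else prompt

print(repr(apply_force_think("Why is the sky blue?")))
# 'Why is the sky blue?<think>\n'
```

In the real pipeline the opener would be appended to the tokenized prompt rather than the raw string, but the effect on the model is the same.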
## Usage
### Q: If I got more VRAM than the model's requirement, how can I fully utilize it?

1. Get a larger context.
   1. local_chat.py: You can increase the context window size by setting `--max_new_tokens` to a larger value.
   2. server: Increase `--cache_lens` to a larger value.
2. Move more weights to the GPU.
   Refer to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml

```yaml
- match:
    name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layers 4~10 as marlin experts
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0" # run on cuda:0; marlin only supports GPU
      generate_op: "KExpertsMarlin" # use the marlin expert
  recursive: False
```

You can modify the layer range as you want, e.g. change `name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$"` to `name: "^model\\.layers\\.([4-12])\\.mlp\\.experts$"` to move more weights to the GPU.

> Note: The first matched rule in the yaml will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement will be valid.

> Note: Currently, executing experts on the GPU conflicts with CUDA Graph, and without CUDA Graph there is a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization.

> Note: KExpertsTorch is untested.
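When editing the layer range, it can help to check which module names a `match.name` pattern actually selects. A minimal Python sketch (hypothetical layer names, not KTransformers code); note that a numeric span such as 4 through 10 needs an alternation like `([4-9]|10)` in standard regex, since a character class is parsed character by character rather than numerically:

```python
import re

# Select expert modules for layers 4 through 10 by name.
pattern = re.compile(r"^model\.layers\.([4-9]|10)\.mlp\.experts$")

selected = [i for i in range(16)
            if pattern.match(f"model.layers.{i}.mlp.experts")]
print(selected)  # [4, 5, 6, 7, 8, 9, 10]
```

Running a check like this before loading the model avoids silently injecting the wrong set of layers.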
### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?

Use `--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` to load the two-GPU optimized rule yaml file. You may also use it as an example to write your own 4/8-GPU optimized rule yaml file.

> Note: The ktransformers multi-GPU strategy is pipelining, which cannot speed up the model's inference; it only distributes the model's weights across GPUs.

### Q: How to get the best performance?

You have to set `--cpu_infer` to the number of cores you want to use. The more cores you use, the faster the model will run, but more is not always better: set it slightly lower than your actual number of cores.
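A rough starting point for `--cpu_infer` can be derived from the core count. A hedged sketch (heuristic only; `os.cpu_count()` reports logical cores, not physical ones, and the best value is machine-specific):

```python
import os

# Heuristic only: leave a couple of cores free for the OS and GPU driver threads.
logical_cores = os.cpu_count() or 1
suggested_cpu_infer = max(1, logical_cores - 2)
print(f"--cpu_infer {suggested_cpu_infer}")
```

From there, benchmark a short generation while adjusting the value up or down a few cores at a time.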
### Q: My DeepSeek-R1 model is not thinking.

According to DeepSeek, you need to enforce the model to initiate its response with "\<think>\n" at the beginning of every output by passing the arg `--force_think True`.

### Q: Loading gguf error

Make sure you:
1. Have the `gguf` file in the `--gguf_path` directory.
2. The directory only contains gguf files from one model. If you have multiple models, you need to separate them into different directories.
3. The folder name itself should not end with `.gguf`; e.g. `Deep-gguf` is correct, `Deep.gguf` is wrong.
4. The file itself is not corrupted; you can verify this by checking that the sha256sum matches the one from huggingface, modelscope, or hf-mirror.
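Point 4 can be checked with a short stdlib script. A minimal sketch (hypothetical helper name), streaming the file so multi-GB `.gguf` shards don't need to fit in RAM:

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks; suitable for large .gguf files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

# Compare the result against the checksum published on the model page, e.g.:
# print(sha256sum("model-00001-of-00009.gguf"))
```

If the hexdigest differs from the published checksum, re-download that shard before retrying.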
### Q: Version `GLIBCXX_3.4.30' not found

The detailed error:
> ImportError: /mnt/data/miniconda3/envs/xxx/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/xxx/xxx/ktransformers/./cpuinfer_ext.cpython-312-x86_64-linux-gnu.so)

Running `conda install -c conda-forge libstdcxx-ng` can solve the problem.
### Q: When running the bfloat16 moe model, the data shows NaN

The detailed error:

```shell
Traceback (most recent call last):
  File "/root/ktransformers/ktransformers/local_chat.py", line 183, in <module>
    fire.Fire(local_chat)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/root/ktransformers/ktransformers/local_chat.py", line 177, in local_chat
    generated = prefill_and_generate(
  File "/root/ktransformers/ktransformers/util/utils.py", line 204, in prefill_and_generate
    next_token = decode_one_tokens(cuda_graph_runner, next_token.unsqueeze(0), position_ids, cache_position, past_key_values, use_cuda_graph).to(torch_device)
  File "/root/ktransformers/ktransformers/util/utils.py", line 128, in decode_one_tokens
    next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```

**SOLUTION**: This issue on Ubuntu 22.04 is caused by the system's g++ version being too old, so its pre-defined macros do not include avx_bf16. We have tested and confirmed that it works with g++ 11.4 on Ubuntu 22.04.
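A quick way to check whether the installed compiler meets the g++ 11 bar mentioned above is to parse the `g++ --version` banner. A hedged sketch (the banner string below is a sample; in practice capture the real output via `subprocess.run(["g++", "--version"], capture_output=True, text=True).stdout`):

```python
import re

def gxx_major(version_banner: str) -> int:
    """Extract the major version from the first line of `g++ --version` output."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", version_banner)
    if not m:
        raise ValueError("could not parse g++ version")
    return int(m.group(1))

# Sample banner for illustration only.
banner = "g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"
print(gxx_major(banner) >= 11)  # True -> new enough per the note above
```

If the major version is below 11, upgrade g++ (e.g. via the toolchain PPA mentioned in the GLIBCXX answer) and rebuild ktransformers.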
### Q: Using fp8 prefill very slow.

The FP8 kernel is built by JIT, so the first run will be slow; subsequent runs are faster.
### Q: Possible ways to run graphics cards using Volta and Turing architectures

From: https://github.com/kvcache-ai/ktransformers/issues/374

1. First, download the latest source code using git.
2. Then, modify DeepSeek-V3-Chat-multi-gpu-4.yaml in the source code and all related yaml files, replacing all instances of KLinearMarlin with KLinearTorch.
3. Next, compile ktransformers from source until it successfully compiles on your local machine.
4. Then, install flash-attn. It won't be used, but not installing it will cause an error.
5. Then, modify local_chat.py, replacing all instances of flash_attention_2 with eager.
6. Then, run local_chat.py. Be sure to follow the official tutorial's commands and adjust according to your local machine's parameters.
7. During the run, check the memory usage with the top command. The memory capacity on a single CPU must be greater than the complete size of the model. (For multiple CPUs, it's just a copy.)

Finally, confirm that the model is fully loaded into memory and the specific weight layers are fully loaded into GPU memory. Then try to input content in the chat interface and observe whether there are any errors.

Attention: for better performance, you can check this [method](https://github.com/kvcache-ai/ktransformers/issues/374#issuecomment-2667520838) in the issue:
> https://github.com/kvcache-ai/ktransformers/blob/89f8218a2ab7ff82fa54dbfe30df741c574317fc/ktransformers/operators/attention.py#L274-L279
>
> ```diff
> + original_dtype = query_states.dtype
> + target_dtype = torch.half
> + query_states = query_states.to(target_dtype)
> + compressed_kv_with_k_pe = compressed_kv_with_k_pe.to(target_dtype)
> + compressed_kv = compressed_kv.to(target_dtype)
> + attn_output = attn_output.to(target_dtype)
>
> decode_attention_fwd_grouped(query_states, compressed_kv_with_k_pe, compressed_kv, attn_output,
>                              page_table,
>                              position_ids.squeeze(0).to(torch.int32)+1, attn_logits,
>                              4, #num_kv_splits # follow vLLM, fix it TODO
>                              self.softmax_scale,
>                              past_key_value.page_size)
>
> + attn_output = attn_output.to(original_dtype)
> ```
>
> https://github.com/kvcache-ai/ktransformers/blob/89f8218a2ab7ff82fa54dbfe30df741c574317fc/ktransformers/operators/attention.py#L320-L326
>
> ```diff
> - attn_output = flash_attn_func(
> -     query_states,
> -     key_states,
> -     value_states_padded,
> -     softmax_scale=self.softmax_scale,
> -     causal=True,
> - )
> + attn_output = F.scaled_dot_product_attention(
> +     query_states.transpose(1, 2),
> +     key_states.transpose(1, 2),
> +     value_states_padded.transpose(1, 2),
> +     scale=self.softmax_scale,
> +     is_causal=True
> + ).transpose(1, 2)
> ```
# see the issue [FAQ page](https://github.com/kvcache-ai/ktransformers/issues/1608)
doc/en/kt-kernel_intro.md (symbolic link, 1 line)
@@ -0,0 +1 @@
../../kt-kernel/README.md
@@ -1,7 +1,29 @@
# KT-Kernel

High-performance kernel operations for KTransformers, featuring CPU-optimized MoE inference with AMX, AVX, and KML support.
High-performance kernel operations for KTransformers, featuring CPU-optimized MoE inference with AMX, AVX, KML, and BLIS (AMD library) support.

- [Note](#note)
- [Features](#features)
- [Installation](#installation)
  - [Prerequisites](#prerequisites)
  - [Quick Installation (Recommended)](#quick-installation-recommended)
  - [Manual Configuration (Advanced)](#manual-configuration-advanced)
  - [Verification](#verification)
- [Integration with SGLang](#integration-with-sglang)
  - [Installation Steps](#installation-steps)
  - [Complete Example: Qwen3-30B-A3B](#complete-example-qwen3-30b-a3b)
  - [KT-Kernel Parameters](#kt-kernel-parameters)
- [Direct Python API Usage](#direct-python-api-usage)
- [Advanced Options](#advanced-options)
  - [Build Configuration](#build-configuration)
  - [Manual Installation](#manual-installation)
- [Error Troubleshooting](#error-troubleshooting)
  - [CUDA Not Found](#cuda-not-found)
  - [hwloc Not Found](#hwloc-not-found)
- [Weight Quantization](#weight-quantization)
  - [CPU Weights (for "cold" experts on CPU)](#cpu-weights-for-cold-experts-on-cpu)
  - [GPU Weights (for "hot" experts on GPU)](#gpu-weights-for-hot-experts-on-gpu)
- [Before Commit!](#before-commit)

## Note

**Current Support Status:**