KT-Kernel

High-performance kernel operations for KTransformers, featuring CPU-optimized MoE inference with AMX, AVX, KML (ARM), and BLIS (AMD) support.

Note

Current Support Status:

  • Intel CPUs with AMX: Fully supported (using weights converted to INT4/INT8 format)
  • Universal CPU (llamafile backend): Supported (using GGUF-format weights)
  • AMD CPUs with BLIS: Supported (for int8 prefill & decode)

Features

  • CPU-Optimized MoE Kernels: High-throughput MoE expert kernels tuned for the host CPU's instruction set.
  • AMX INT4/INT8 Backend: INT4 / INT8 quantized expert inference backend for AMX-capable servers.
  • Llamafile CPU Backend: AVX2/AVX512-based MoE backend built on Llamafile for universal CPU deployment.
  • NUMA-Aware Execution: Thread pool and memory layout designed for multi-socket / multi-NUMA machines.

Installation

Option 1: Install from PyPI (Prebuilt Wheels)

Choose the version matching your CUDA installation:

# For CUDA 11.8
pip install kt-kernel==0.4.2.cu118

# For CUDA 12.1
pip install kt-kernel==0.4.2.cu121

# For CUDA 12.4
pip install kt-kernel==0.4.2.cu124

# For CUDA 12.6
pip install kt-kernel==0.4.2.cu126

Note: Replace 0.4.2 with the latest version if available.

Features:

  • Automatic CPU detection: Detects your CPU and loads the optimal kernel variant
  • Multi-variant wheel: Includes AMX, AVX512, and AVX2 variants in a single package
  • No compilation needed: Pre-built wheels for Python 3.10, 3.11, 3.12
  • Multiple CUDA versions: Choose the version matching your environment

Requirements:

  • CUDA 11.8+ or 12.x runtime (must match the package version you install)
  • PyTorch 2.0+ (install separately, must match CUDA version)
  • Linux x86-64

CPU Variants Included:

Variant | CPU Support                               | Use Case
--------+-------------------------------------------+---------------------------------------
AMX     | Intel Sapphire Rapids+                    | Best performance on latest Intel CPUs
AVX512  | Intel Skylake-X / Ice Lake / Cascade Lake | AVX512-capable CPUs without AMX
AVX2    | Intel Haswell+, AMD Zen+                  | Maximum compatibility

Check which variant is loaded:

import kt_kernel
print(f"CPU variant: {kt_kernel.__cpu_variant__}")  # 'amx', 'avx512', or 'avx2'
print(f"Version: {kt_kernel.__version__}")

Environment Variables:

# Override automatic CPU detection
export KT_KERNEL_CPU_VARIANT=avx2  # or 'avx512', 'amx'

# Enable debug output
export KT_KERNEL_DEBUG=1
python -c "import kt_kernel"
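
These variables can also be set programmatically. A minimal sketch, assuming variant selection happens when kt_kernel is first imported (which the KT_KERNEL_DEBUG example above suggests), so the override must be in place before that import:

import os

# Assumed behavior: variant selection runs at import time, so set the
# override before kt_kernel is imported for the first time.
os.environ["KT_KERNEL_CPU_VARIANT"] = "avx2"

import kt_kernel
print(kt_kernel.__cpu_variant__)  # expected: 'avx2'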

Option 2: Install from Source (For AMD, ARM, or Custom Builds)

If you need AMD (BLIS), ARM (KML), or custom CUDA versions, build from source:

Prerequisites

First, initialize git submodules:

git submodule update --init --recursive

Quick Installation

Step 0: Create and activate a conda environment (recommended):

conda create -n kt-kernel python=3.11 -y
conda activate kt-kernel

You can install in two explicit steps or in a single step, using the same script.

Option A: Two-step (install dependencies, then build)

# 1) Install system prerequisites (cmake, hwloc, pkg-config)
./install.sh deps

# 2) Build and install kt-kernel (auto-detects CPU instruction set)
#    By default, the script cleans the local ./build directory before compiling
./install.sh build

Option B: One-step

./install.sh

The install script will:

  • Auto-detect CPU capabilities (AMX support)
  • Install cmake via conda (if available)
  • Install system dependencies (libhwloc-dev, pkg-config) based on your OS

What gets configured automatically:

  • AMX CPU detected → NATIVE + AMX=ON
  • No AMX detected → NATIVE + AMX=OFF
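
If you want to preview which path auto-detection will take, you can inspect the relevant CPU feature flags yourself. A minimal sketch, assuming Linux (flag names as they appear in /proc/cpuinfo); this helper is illustrative and not part of kt-kernel:

# Check the CPU feature flags that drive the build configuration (Linux only).
def cpu_flags() -> set:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AMX tiles:", "amx_tile" in flags)  # True → NATIVE + AMX=ON
print("AVX512F:", "avx512f" in flags)
print("AVX2:", "avx2" in flags)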

⚠️ Important for LLAMAFILE backend users: If you have an AMX-capable CPU but plan to use the LLAMAFILE backend, do NOT use the default auto-detection build. Use "manual mode" with CPUINFER_CPU_INSTRUCT set to AVX512 or AVX2 instead of NATIVE to avoid compilation issues (see below).

⚠️ Important for AMD BLIS backend users: for the installation guide, see this issue.

Manual Configuration (Advanced)

If you need specific build options (e.g., for LLAMAFILE backend, compatibility, or binary distribution):

# Example for LLAMAFILE backend on AMX CPU with AVX512
export CPUINFER_CPU_INSTRUCT=AVX512  # Options: NATIVE, AVX512, AVX2, FANCY
export CPUINFER_ENABLE_AMX=OFF       # Options: ON, OFF

# Build only (skip auto-detection of instruction set)
./install.sh build --manual

For advanced build options and binary distribution, see the Build Configuration section. If you encounter issues, refer to Error Troubleshooting.

Verification

python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"

Integration with SGLang

KT-Kernel can be used standalone via Direct Python API or integrated with SGLang for production deployment. This section describes SGLang integration to enable CPU-GPU heterogeneous inference, where "hot" experts run on GPU and "cold" experts run on CPU for optimal resource utilization.

Installation Steps

1. Install SGLang

git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

2. Prepare Weights

You need both GPU weights and CPU-side expert weights for heterogeneous inference. The exact format depends on the backend:

GPU Weights (for all backends):
Use the model weights required by SGLang for GPU inference (for example, the original or already-quantized model directory from Hugging Face).

CPU Weights (AMX backend: AMXINT4 / AMXINT8): Quantize weights to AMX-optimized INT4/INT8 format using the provided script:

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/cpu-weights \
  --quant-method int8  # or int4, or moe_int8 (currently for AMD)

  • --input-path: Path to the original GPU-side weights
  • --input-type: Format of those weights (fp8, fp16, or bf16)

In SGLang integration, --kt-weight-path should point to this converted CPU weights directory.

Supported input formats: FP8, FP16, BF16 → INT4/INT8.

CPU Weights (LLAMAFILE backend: LLAMAFILE): LLAMAFILE uses pre-quantized GGUF weights on the CPU side directly, without running convert_cpu_weights.py. You need to:

  • Download a GGUF model directly from the web (e.g., GGUF repos on Hugging Face / Modelscope);
  • In SGLang integration, use that GGUF directory as --kt-weight-path. KT-Kernel supports multiple GGUF quantization formats such as Q4_K_M, Q4_K, Q5_K, etc.; choose based on your latency and accuracy requirements.
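
If you prefer scripting the download instead of using the CLI, the same GGUF file can be fetched with the huggingface_hub Python API. The repo and file names below match the Qwen3 example later in this README; adjust them for your model:

from huggingface_hub import hf_hub_download

# Download a single pre-quantized GGUF file; the directory it lands in
# is what you pass as --kt-weight-path.
path = hf_hub_download(
    repo_id="Qwen/Qwen3-30B-A3B-GGUF",
    filename="Qwen3-30B-A3B-Q4_K_M.gguf",
    local_dir="/mnt/data/models/Qwen3-30B-A3B-Q4_K_M",
)
print(path)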

3. Launch SGLang Server

Start the SGLang server with your normal SGLang parameters, and add the following KT-Kernel specific parameters to enable CPU-GPU heterogeneous inference:

KT-Kernel Parameters to Add:

  • --kt-method: Backend method (AMXINT4, AMXINT8, or LLAMAFILE)
  • --kt-weight-path: Path to the converted CPU weights
  • --kt-cpuinfer: Number of CPU inference threads (set to physical cores)
  • --kt-threadpool-count: Number of thread pools (set to NUMA node count)
  • --kt-num-gpu-experts: Number of experts to keep on GPU
  • --kt-max-deferred-experts-per-token: Deferred experts for pipelined execution

Example:

python -m sglang.launch_server \
  [your normal SGLang parameters...] \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/cpu-weights \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

See KT-Kernel Parameters section below for detailed parameter tuning guidelines.

Complete Example: Qwen3-30B-A3B

This example demonstrates the full workflow from downloading weights to launching the server, showing both AMX backend and LLAMAFILE backend options.

Hardware Configuration:

  • GPU: NVIDIA RTX 4090 24GB
  • CPU: 2x Intel Xeon Gold 6454S (64 physical cores total, 128 threads, 2 NUMA nodes)
  • Model: Qwen3-30B-A3B

How to verify your system configuration:

# Check CPU configuration
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|Socket\(s\)|NUMA node\(s\)"
# Expected output example:
CPU(s):                                  128
Thread(s) per core:                      2
Socket(s):                               2
NUMA node(s):                            2
# → Physical cores = CPU(s) / Thread(s) per core = 128 / 2 = 64

Parameter Rationale:

  • --kt-cpuinfer 64: Set to physical cores (64), not hyperthreads (128)
  • --kt-threadpool-count 2: 2 NUMA nodes detected (dual-socket system)
  • --kt-num-gpu-experts 32: With 24GB GPU memory, we can fit ~32 experts on GPU for this model (varies by model architecture and actual memory usage)
  • --kt-max-deferred-experts-per-token 2: Enable pipelined execution; allows CPU to process next batch while GPU completes current batch
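
The ~32-expert figure is a back-of-envelope estimate rather than something the scripts compute for you. A rough sketch of the arithmetic, with illustrative dimensions (read the real hidden_size, moe_intermediate_size, and layer count from your model's config.json):

# Rough VRAM cost of GPU-resident experts: each expert holds gate, up, and
# down projection matrices of shape (hidden_size x moe_intermediate_size).
hidden_size = 2048           # illustrative value; model-dependent
moe_intermediate_size = 768  # illustrative value; model-dependent
num_layers = 48              # illustrative value; model-dependent
dtype_bytes = 2              # BF16

per_expert_bytes = 3 * hidden_size * moe_intermediate_size * dtype_bytes
gpu_experts = 32
total_gib = gpu_experts * num_layers * per_expert_bytes / 2**30
print(f"~{total_gib:.1f} GiB for {gpu_experts} experts/layer over {num_layers} layers")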

Option A: AMX Backend (AMXINT8)

For Intel CPUs with AMX instruction set support.

Step 1: Download model weights

# Install huggingface-cli if not already installed
pip install huggingface-hub

# Download model from Hugging Face
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Convert to CPU weights (AMXINT8)

python scripts/convert_cpu_weights.py \
  --input-path /mnt/data/models/Qwen3-30B-A3B \
  --input-type bf16 \
  --output /mnt/data/models/Qwen3-30B-A3B-INT8 \
  --quant-method int8

Step 3: Launch SGLang server

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /mnt/data/models/Qwen3-30B-A3B \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name Qwen3-30B-A3B \
  --enable-mixed-chunk \
  --kt-method AMXINT8 \
  --kt-weight-path /mnt/data/models/Qwen3-30B-A3B-INT8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

Option B: LLAMAFILE Backend (GGUF)

For universal CPUs (no AMX required), using pre-quantized GGUF weights directly.

Step 1: Download GPU weights (original model)

pip install huggingface-hub

huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Download CPU weights (GGUF format)

huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF Qwen3-30B-A3B-Q4_K_M.gguf \
  --local-dir /mnt/data/models/Qwen3-30B-A3B-Q4_K_M

Step 3: Launch SGLang server

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /mnt/data/models/Qwen3-30B-A3B \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name Qwen3-30B-A3B \
  --enable-mixed-chunk \
  --kt-method LLAMAFILE \
  --kt-weight-path /mnt/data/models/Qwen3-30B-A3B-Q4_K_M \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

KT-Kernel Parameters

Parameter                           | Description                                               | Example Value
------------------------------------+-----------------------------------------------------------+------------------------------------------
--kt-method                         | CPU inference backend method                              | AMXINT4, AMXINT8, RAWINT4, or LLAMAFILE
--kt-weight-path                    | Path to quantized CPU weights                             | /path/to/cpu-weights
--kt-cpuinfer                       | Number of CPU inference threads                           | 64 (adjust based on CPU cores)
--kt-threadpool-count               | Number of thread pools for parallel execution             | 2 (typically 1-4)
--kt-num-gpu-experts                | Number of experts to keep on GPU                          | 32 (remaining experts go to CPU)
--kt-max-deferred-experts-per-token | Experts per token to defer for pipelined execution        | 2 (0 to disable, 1-4 recommended)
--kt-gpu-prefill-token-threshold    | Token count threshold for prefill strategy (RAWINT4 only) | ~400

Parameter Guidelines:

  • kt-method: Choose based on your CPU and weight format:

    • AMXINT4: Best performance on AMX CPUs with INT4 quantized weights (may cause a large accuracy drop for some models, e.g., Qwen3-30B-A3B)
    • AMXINT8: Higher accuracy with INT8 quantized weights on AMX CPUs
    • RAWINT4: Native INT4 weights shared by CPU and GPU (AMX backend only, currently supports Kimi-K2-Thinking model). See Kimi-K2-Thinking Native Tutorial for details.
    • LLAMAFILE: GGUF-based backend
  • kt-cpuinfer: Set to the number of physical CPU cores (not hyperthreads); a helper sketch for deriving this and the thread-pool count follows this list.

    • Check physical cores: lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core"
    • Physical cores = CPU(s) / Thread(s) per core
    • Example: If CPU(s)=128 and Thread(s) per core=2, then physical cores = 64
    • Important: Do NOT set to hyperthread count - this will degrade performance
  • kt-threadpool-count: Set to the number of NUMA nodes.

    • Check NUMA count: lscpu | grep "NUMA node(s)"
    • Or use: numactl --hardware | grep "available"
    • Note: NUMA node count is NOT necessarily the number of physical CPUs
      • It represents memory domains, which may be divided within a single CPU or across multiple CPUs
      • Use the NUMA node count from lscpu, regardless of physical CPU count
    • Typical values: 1-2 for single-socket, 2-4 for dual-socket systems
    • This enables better memory bandwidth utilization across NUMA domains
  • kt-num-gpu-experts: Determine based on GPU memory and profiling:

    • More GPU experts = lower latency but higher GPU memory usage (risk of OOM; see the VRAM sketch in the Qwen3 example above)
  • kt-max-deferred-experts-per-token: Enables pipelined execution:

    • 0: Synchronous execution (simpler, higher latency)
    • 1-4: Deferred execution (recommended range; good latency/quality balance, requires tuning)
    • 5-7: Highest latency reduction but may introduce noticeable accuracy loss; use with care
  • kt-gpu-prefill-token-threshold (RAWINT4 only): Controls prefill strategy for native INT4 inference:

    • ≤ threshold: Uses hybrid CPU+GPU prefill. No extra VRAM needed, but performance degrades slowly as token count increases.
    • > threshold: Uses layerwise GPU prefill. Performance scales better with longer sequences, but requires ~9GB+ extra VRAM.
    • Only applicable when --kt-method RAWINT4 is used. Currently supports Kimi-K2-Thinking model only.
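
As referenced in the kt-cpuinfer guideline above, here is a small sketch that derives suggested --kt-cpuinfer and --kt-threadpool-count values by parsing lscpu (Linux only; the helper name is illustrative):

import subprocess

def lscpu_value(field: str) -> int:
    """Return an integer field from lscpu output, e.g. 'CPU(s)'."""
    for line in subprocess.check_output(["lscpu"], text=True).splitlines():
        if line.startswith(field + ":"):
            return int(line.split(":", 1)[1].strip())
    raise KeyError(field)

physical_cores = lscpu_value("CPU(s)") // lscpu_value("Thread(s) per core")
numa_nodes = lscpu_value("NUMA node(s)")
print(f"--kt-cpuinfer {physical_cores}")      # physical cores, never hyperthreads
print(f"--kt-threadpool-count {numa_nodes}")  # one thread pool per NUMA node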

Direct Python API Usage

For standalone usage without SGLang, you can use KT-Kernel directly via Python API:

from kt_kernel import KTMoEWrapper

# Initialize the MoE wrapper
wrapper = KTMoEWrapper(
    layer_idx=0,
    num_experts=8,
    num_experts_per_tok=2,
    hidden_size=4096,
    moe_intermediate_size=14336,
    num_gpu_experts=2,
    cpuinfer_threads=32,
    threadpool_count=2,
    weight_path="/path/to/weights",
    chunked_prefill_size=512,
    method="AMXINT4"  # Options: "AMXINT4", "AMXINT8", "LLAMAFILE"
)

# Load weights (from disk - pre-quantized)
wrapper.load_weights(physical_to_logical_map)

# Or load weights from tensors (online quantization)
wrapper.load_weights_from_tensors(gate_proj, up_proj, down_proj, physical_to_logical_map)

# Run inference
output = wrapper.forward(hidden_states, topk_ids, topk_weights, cuda_stream)

# Or use async API for better performance
wrapper.submit_forward(hidden_states, topk_ids, topk_weights, cuda_stream)
# ... do other work ...
output = wrapper.sync_forward(hidden_states, cuda_stream)

Advanced Options

# Initialize with additional options
wrapper = KTMoEWrapper(
    layer_idx=0,
    num_experts=8,
    num_experts_per_tok=2,
    hidden_size=4096,
    moe_intermediate_size=14336,
    num_gpu_experts=2,
    cpuinfer_threads=32,
    threadpool_count=2,
    weight_path="/path/to/weights",
    chunked_prefill_size=512,
    method="AMXINT4",
    cpu_save=False,  # Keep weights in CPU memory after loading
    max_deferred_experts_per_token=0  # Number of experts to defer (for pipelined execution)
)

# Pre-allocate buffers for specific batch sizes (improves performance)
KTMoEWrapper.set_capture_batch_sizes([1, 2, 4, 8, 16])

# Query captured batch sizes
batch_sizes = KTMoEWrapper.get_capture_batch_sizes()

# Clear buffer cache to free memory
KTMoEWrapper.clear_buffer_cache()

Build Configuration

Manual Installation

If you prefer manual installation without the install.sh script, follow these steps:

1. Install System Dependencies

Prerequisites:

  • cmake (recommended: conda install -y cmake)
  • libhwloc-dev and pkg-config

2. Set Build Configuration

Core Options:

Variable              | Options                        | Description
----------------------+--------------------------------+--------------------------------------------
CPUINFER_CPU_INSTRUCT | NATIVE, AVX512, AVX2, FANCY    | CPU instruction set to use
CPUINFER_ENABLE_AMX   | ON, OFF                        | Enable Intel AMX support
CPUINFER_BUILD_TYPE   | Release, Debug, RelWithDebInfo | Build type (default: Release)
CPUINFER_PARALLEL     | Number                         | Parallel build jobs (default: auto-detect)
CPUINFER_VERBOSE      | 0, 1                           | Verbose build output (default: 0)

Instruction Set Details:

  • NATIVE: Auto-detect and use all available CPU instructions (-march=native) - Recommended for best performance
  • AVX512: Explicit AVX512 support for Skylake-SP and Cascade Lake
  • AVX2: AVX2 support for maximum compatibility
  • FANCY: AVX512 with full extensions (AVX512F/BW/DQ/VL/VNNI) for Ice Lake+ and Zen 4+. Use this when building pre-compiled binaries to distribute to users with modern CPUs. For local builds, prefer NATIVE for better performance.

Example Configurations:

# Maximum performance on AMX CPU
export CPUINFER_CPU_INSTRUCT=NATIVE
export CPUINFER_ENABLE_AMX=ON

# AVX512 CPU without AMX
export CPUINFER_CPU_INSTRUCT=AVX512
export CPUINFER_ENABLE_AMX=OFF

# Compatibility build
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF

# Debug build for development
export CPUINFER_BUILD_TYPE=Debug
export CPUINFER_VERBOSE=1

3. Build and Install

# Editable installation (for development)
pip install -e .

# Standard installation
pip install .

Error Troubleshooting

CUDA Not Found

 -- Looking for a CUDA compiler - NOTFOUND
  CMake Error at CMakeLists.txt:389 (message):
    KTRANSFORMERS_USE_CUDA=ON but CUDA compiler not found

Make sure you have the CUDA toolkit installed and nvcc is in your system PATH.

Try export CMAKE_ARGS="-D CMAKE_CUDA_COMPILER=$(which nvcc)" and reinstall.

hwloc Not Found

Run sudo apt install libhwloc-dev on a Debian-based system, or build hwloc from source (https://www.open-mpi.org/projects/hwloc/):

wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.2.tar.gz
tar -xzf hwloc-2.12.2.tar.gz
cd hwloc-2.12.2
./configure
make
sudo make install

Weight Quantization

For AMX backends (AMXINT4 / AMXINT8), CPU-side experts must be converted to AMX-friendly INT4/INT8 format using the provided script:

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4

Supported formats: FP8, FP16, BF16 → INT4/INT8

For the LLAMAFILE backend (LLAMAFILE), CPU-side experts are loaded directly from GGUF weights. You do not need to run the AMX conversion script; instead, download a GGUF model from the web (e.g., a GGUF repo on Hugging Face) and point weight_path / SGLang --kt-weight-path (or --model when appropriate) to that GGUF directory. KT-Kernel supports multiple GGUF quantization types such as Q4_K_M, Q4_K, Q5_K, etc.


For detailed documentation, advanced options, and low-memory mode, see scripts/README.md.

Before Commit!

Commit messages should follow the Conventional Commits specification: https://www.conventionalcommits.org/

Please format your code before committing:

cmake -B build
cd build
make format

You may need a newer clang-format (at least version 18). In a conda environment:

conda install -c conda-forge clang-format=18
rm -rf build

It's also recommended to install black for Python code formatting:

conda install black