mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-03-15 02:47:22 +00:00

Files

ZiWei Yuan c2b8c60c4e [ci]: add int4_1 & int4_1k (#1653 )

* [feat]: init amd adaption

* [feat]: add blis support

* [fix]: fix setup and moe kernel warpper

* [fix](setup.py): support rebuild with cache and import kt_kernel works
fine

* [feat]: add moe_kernel converter for amd and implement the load
method(haven't tested yet)

* [feat](moe_kernel/moe.hpp): delete unused memory when using save

* [fix](moe_kernel): update PLAIN for pack

* [fix](moe_kernel): rm printf debug

* [fix](moe_kernel): skip gpu experts

* [fix](moe_kernel/moe.hpp): update include memory path

* [feat](moe_kernel/moe.hpp): support expert deferral

* [feat]: finish amd

* [ci]: add int4_1 & int4_1k

---------

Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>

2025-12-02 15:58:14 +08:00

.githooks

[ci]: add int4_1 & int4_1k (#1653 )

2025-12-02 15:58:14 +08:00

bench

update kt-kernel

2025-11-03 15:19:52 +08:00

cmake

add kt-kernel

2025-10-12 05:13:00 +00:00

cpu_backend

[docs]: add contribuing guide and add hooks install (#1613 )

2025-11-15 18:26:49 +08:00

cuda

add kt-kernel

2025-10-12 05:13:00 +00:00

demo

update kt-kernel

2025-11-03 15:19:52 +08:00

examples

fix kt-kernel installation issue (#1603 )

2025-11-12 15:56:02 +08:00

operators

[feat](moe_kernel): add amd blis support (int8) (#1600 )

2025-11-27 12:08:53 +08:00

python

[feat](moe_kernel): add amd blis support (int8) (#1600 )

2025-11-27 12:08:53 +08:00

scripts

fix(scripts): resolve OOM when converting gpu weights and update README (#1640 )

2025-12-01 14:15:14 +08:00

test

[ci]: add int4_1 & int4_1k (#1653 )

2025-12-02 15:58:14 +08:00

.clang-format

add kt-kernel

2025-10-12 05:13:00 +00:00

.gitignore

update kt-kernel

2025-11-03 15:19:52 +08:00

.gitmodules

update kt-kernel

2025-10-24 14:32:45 +08:00

CMakeLists.txt

[feat](moe_kernel): add amd blis support (int8) (#1600 )

2025-11-27 12:08:53 +08:00

CMakePresets.json

[feat](moe_kernel): add amd blis support (int8) (#1600 )

2025-11-27 12:08:53 +08:00

ext_bindings.cpp

Fix kt-kernel for new wrapper (#1588 )

2025-11-10 21:47:34 +08:00

install.sh

[refactor] fix third_party issue (#1632 )

2025-11-20 13:55:55 +08:00

pyproject.toml

add ci (#1642 )

2025-11-25 20:52:08 +08:00

pytest.ini

add ci (#1642 )

2025-11-25 20:52:08 +08:00

README_zh.md

[feat](moe_kernel): add amd blis support (int8) (#1600 )

2025-11-27 12:08:53 +08:00

README.md

[feat](moe_kernel): add amd blis support (int8) (#1600 )

2025-11-27 12:08:53 +08:00

requirements.txt

update README for kt-kernel for installation issues (#1590 )

2025-11-11 14:53:45 +08:00

setup.py

[feat](moe_kernel): add amd blis support (int8) (#1600 )

2025-11-27 12:08:53 +08:00

README.md

KT-Kernel

High-performance kernel operations for KTransformers, featuring CPU-optimized MoE inference with AMX, AVX, KML and blis (amd library) support.

Note
Features
Installation
Verification
Integration with SGLang
Direct Python API Usage
- Advanced Options
Build Configuration
- Manual Installation
Error Troubleshooting
- CUDA Not Found
- hwloc Not Found
Weight Quantization
Before Commit!

Note

Current Support Status:

✅ Intel CPUs with AMX: Fully supported (using weights converted to INT4/INT8 format)
✅ Universal CPU (llamafile backend): Supported (using GGUF-format weights)
✅ AMD CPUs with BLIS: Supported (for int8 prefill & decode)

Features

CPU-Optimized MoE Kernels: High-throughput MoE expert kernels optimized for instruction sets.
AMX INT4/INT8 Backend: INT4 / INT8 quantized expert inference backend for AMX-capable servers.
Llamafile CPU Backend: AVX2/AVX512-based MoE backend built on Llamafile for universal CPU deployment.
NUMA-Aware Execution: Thread pool and memory layout designed for multi-socket / multi-NUMA machines.

Installation

Prerequisites

First, initialize git submodules:

git submodule update --init --recursive

Quick Installation (Recommended)

Step 0: Create and activate a conda environment (recommended):

conda create -n kt-kernel python=3.11 -y
conda activate kt-kernel

You can now install in two clear steps using the same script.

Option A: Two-step (specify dependencies installation and build separately)

# 1) Install system prerequisites (cmake, hwloc, pkg-config)
./install.sh deps

# 2) Build and install kt-kernel (auto-detects CPU instruction set)
#    By default, the script cleans the local ./build directory before compiling
./install.sh build

Option B: One-step

./install.sh

The install script will:

Auto-detect CPU capabilities (AMX support)
Install cmake via conda (if available)
Install system dependencies (libhwloc-dev, pkg-config) based on your OS

What gets configured automatically:

AMX CPU detected → NATIVE + AMX=ON
No AMX detected → NATIVE + AMX=OFF

⚠️ Important for LLAMAFILE backend users: If you have an AMX-capable CPU but plan to use the LLAMAFILE backend, do NOT use the default auto-detection build. Use "manual mode" with CPUINFER_CPU_INSTRUCT set to AVX512 or AVX2 instead of NATIVE to avoid compilation issues (see below).

Manual Configuration (Advanced)

If you need specific build options (e.g., for LLAMAFILE backend, compatibility, or binary distribution):

# Example for LLAMAFILE backend on AMX CPU with AVX512
export CPUINFER_CPU_INSTRUCT=AVX512  # Options: NATIVE, AVX512, AVX2, FANCY
export CPUINFER_ENABLE_AMX=OFF       # Options: ON, OFF

# Build only (skip auto-detection of instruction set)
./install.sh build --manual

For advanced build options and binary distribution, see the Build Configuration section. If you encounter issues, refer to Error Troubleshooting.

Verification

python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"

Integration with SGLang

KT-Kernel can be used standalone via Direct Python API or integrated with SGLang for production deployment. This section describes SGLang integration to enable CPU-GPU heterogeneous inference, where "hot" experts run on GPU and "cold" experts run on CPU for optimal resource utilization.

Installation Steps

1. Install SGLang

git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

2. Prepare Weights

You need both GPU weights and CPU-side expert weights for heterogeneous inference. The exact format depends on the backend:

GPU Weights (for all backends):
Use the model weights required by SGLang for GPU inference (for example, the original or already-quantized model directory from Hugging Face).

CPU Weights (AMX backend: AMXINT4 / AMXINT8): Quantize weights to AMX-optimized INT4/INT8 format using the provided script:

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/cpu-weights \
  --quant-method int8  # or int4 or moe_int8 (for amd now)

--input-path: Path to GPU-side original weights
--input-type: Depends on your GPU weights type (fp8, fp16, or bf16)

In SGLang integration, --kt-weight-path should point to this converted CPU weights directory.

Supported input formats: FP8, FP16, BF16 → INT4/INT8.

CPU Weights (LLAMAFILE backend: LLAMAFILE): LLAMAFILE uses pre-quantized GGUF weights on the CPU side directly, without running convert_cpu_weights.py. You need to:

Download a GGUF model directly from the web (e.g., GGUF repos on Hugging Face / Modelscope);
In SGLang integration, use that GGUF directory as --kt-weight-path. KT-Kernel supports multiple GGUF quantization formats such as Q4_KM, Q4_K, Q5_K, etc. Choose based on your latency and accuracy requirements.

3. Launch SGLang Server

Start the SGLang server with your normal SGLang parameters, and add the following KT-Kernel specific parameters to enable CPU-GPU heterogeneous inference:

KT-Kernel Parameters to Add:

--kt-method: Backend method (AMXINT4, AMXINT8, or LLAMAFILE)
--kt-weight-path: Path to the converted CPU weights
--kt-cpuinfer: Number of CPU inference threads (set to physical cores)
--kt-threadpool-count: Number of thread pools (set to NUMA node count)
--kt-num-gpu-experts: Number of experts to keep on GPU
--kt-max-deferred-experts-per-token: Deferred experts for pipelined execution

Example:

python -m sglang.launch_server \
  [your normal SGLang parameters...] \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/cpu-weights \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

See KT-Kernel Parameters section below for detailed parameter tuning guidelines.

Complete Example: Qwen3-30B-A3B

This example demonstrates the full workflow from downloading weights to launching the server, showing both AMX backend and LLAMAFILE backend options.

Hardware Configuration:

GPU: NVIDIA RTX 4090 24GB
CPU: 2x Intel Xeon Gold 6454S (64 physical cores total, 128 threads, 2 NUMA nodes)
Model: Qwen3-30B-A3B

How to verify your system configuration:

# Check CPU configuration
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|Socket\(s\)|NUMA node\(s\)"
# Expected output example:
CPU(s):                                  128
Thread(s) per core:                      2
Socket(s):                               2
NUMA node(s):                            2
# → Physical cores = CPU(s) / Thread(s) per core = 128 / 2 = 64

Parameter Rationale:

--kt-cpuinfer 64: Set to physical cores (64), not hyperthreads (128)
--kt-threadpool-count 2: 2 NUMA nodes detected (dual-socket system)
--kt-num-gpu-experts 32: With 24GB GPU memory, we can fit ~32 experts on GPU for this model (varies by model architecture and actual memory usage)
--kt-max-deferred-experts-per-token 2: Enable pipelined execution; allows CPU to process next batch while GPU completes current batch

Option A: AMX Backend (AMXINT8)

For Intel CPUs with AMX instruction set support.

Step 1: Download model weights

# Install huggingface-cli if not already installed
pip install huggingface-hub

# Download model from Hugging Face
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Convert to CPU weights (AMXINT8)

python scripts/convert_cpu_weights.py \
  --input-path /mnt/data/models/Qwen3-30B-A3B \
  --input-type bf16 \
  --output /mnt/data/models/Qwen3-30B-A3B-INT8 \
  --quant-method int8

Step 3: Launch SGLang server

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /mnt/data/models/Qwen3-30B-A3B \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name Qwen3-30B-A3B \
  --enable-mixed-chunk \
  --kt-method AMXINT8 \
  --kt-weight-path /mnt/data/models/Qwen3-30B-A3B-INT8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

Option B: LLAMAFILE Backend (GGUF)

For universal CPUs (no AMX required), using pre-quantized GGUF weights directly.

Step 1: Download GPU weights (original model)

pip install huggingface-hub

huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Download CPU weights (GGUF format)

huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF Qwen3-30B-A3B-Q4_K_M.gguf \
  --local-dir /mnt/data/models/Qwen3-30B-A3B-Q4_K_M

Step 3: Launch SGLang server

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /mnt/data/models/Qwen3-30B-A3B \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name Qwen3-30B-A3B \
  --enable-mixed-chunk \
  --kt-method LLAMAFILE \
  --kt-weight-path /mnt/data/models/Qwen3-30B-A3B-Q4_K_M \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

KT-Kernel Parameters

Parameter	Description	Example Value
`--kt-method`	CPU inference backend method	`AMXINT4`, `AMXINT8`, or `LLAMAFILE`
`--kt-weight-path`	Path to quantized CPU weights	`/path/to/cpu-weights`
`--kt-cpuinfer`	Number of CPU inference threads	`64` (adjust based on CPU cores)
`--kt-threadpool-count`	Number of thread pools for parallel execution	`2` (typically 1-4)
`--kt-num-gpu-experts`	Number of experts to keep on GPU	`32` (remaining experts go to CPU)
`--kt-max-deferred-experts-per-token`	Number of experts per token to defer for pipelined execution	`2` (0 to disable, 1-4 recommended)

Parameter Guidelines:

kt-method: Choose based on your CPU and weight format:
- AMXINT4: Best performance on AMX CPUs with INT4 quantized weights (May cause huge accuracy drop for some models, e.g., Qwen3-30B-A3B)
- AMXINT8: Higher accuracy with INT8 quantized weights on AMX CPUs
- LLAMAFILE: GGUF-based backend
kt-cpuinfer: Set to the number of physical CPU cores (not hyperthreads).
- Check physical cores: lscpu | grep -E "^CPU$s$|Thread$s$ per core"
- Physical cores = CPU(s) / Thread(s) per core
- Example: If CPU(s)=128 and Thread(s) per core=2, then physical cores = 64
- Important: Do NOT set to hyperthread count - this will degrade performance
kt-threadpool-count: Set to the number of NUMA nodes.
- Check NUMA count: lscpu | grep "NUMA node(s)"
- Or use: numactl --hardware | grep "available"
- Note: NUMA node count is NOT necessarily the number of physical CPUs
  - It represents memory domains, which may be divided within a single CPU or across multiple CPUs
  - Use the NUMA node count from lscpu, regardless of physical CPU count
- Typical values: 1-2 for single-socket, 2-4 for dual-socket systems
- This enables better memory bandwidth utilization across NUMA domains
kt-num-gpu-experts: Determine based on GPU memory and profiling:
- More GPU experts = lower latency but higher GPU memory usage (May cause OOM)
kt-max-deferred-experts-per-token: Enables pipelined execution:
- 0: Synchronous execution (simpler, higher latency)
- 1-4: Deferred execution (recommended range; good latency/quality balance, requires tuning)
- 5-7: Highest latency reduction but may introduce noticeable accuracy loss; use with care

Direct Python API Usage

For standalone usage without SGLang, you can use KT-Kernel directly via Python API:

from kt_kernel import KTMoEWrapper

# Initialize the MoE wrapper
wrapper = KTMoEWrapper(
    layer_idx=0,
    num_experts=8,
    num_experts_per_tok=2,
    hidden_size=4096,
    moe_intermediate_size=14336,
    num_gpu_experts=2,
    cpuinfer_threads=32,
    threadpool_count=2,
    weight_path="/path/to/weights",
    chunked_prefill_size=512,
    method="AMXINT4"  # Options: "AMXINT4", "AMXINT8", "LLAMAFILE"
)

# Load weights (from disk - pre-quantized)
wrapper.load_weights(physical_to_logical_map)

# Or load weights from tensors (online quantization)
wrapper.load_weights_from_tensors(gate_proj, up_proj, down_proj, physical_to_logical_map)

# Run inference
output = wrapper.forward(hidden_states, topk_ids, topk_weights, cuda_stream)

# Or use async API for better performance
wrapper.submit_forward(hidden_states, topk_ids, topk_weights, cuda_stream)
# ... do other work ...
output = wrapper.sync_forward(hidden_states, cuda_stream)

Advanced Options

# Initialize with additional options
wrapper = KTMoEWrapper(
    layer_idx=0,
    num_experts=8,
    num_experts_per_tok=2,
    hidden_size=4096,
    moe_intermediate_size=14336,
    num_gpu_experts=2,
    cpuinfer_threads=32,
    threadpool_count=2,
    weight_path="/path/to/weights",
    chunked_prefill_size=512,
    method="AMXINT4",
    cpu_save=False,  # Keep weights in CPU memory after loading
    max_deferred_experts_per_token=0  # Number of experts to defer (for pipelined execution)
)

# Pre-allocate buffers for specific batch sizes (improves performance)
KTMoEWrapper.set_capture_batch_sizes([1, 2, 4, 8, 16])

# Query captured batch sizes
batch_sizes = KTMoEWrapper.get_capture_batch_sizes()

# Clear buffer cache to free memory
KTMoEWrapper.clear_buffer_cache()

Build Configuration

Manual Installation

If you prefer manual installation without the install.sh script, follow these steps:

1. Install System Dependencies

Prerequisites:

cmake (recommended: conda install -y cmake)
libhwloc-dev and pkg-config

2. Set Build Configuration

Core Options:

Variable	Options	Description
`CPUINFER_CPU_INSTRUCT`	`NATIVE`, `AVX512`, `AVX2`, `FANCY`	CPU instruction set to use
`CPUINFER_ENABLE_AMX`	`ON`, `OFF`	Enable Intel AMX support
`CPUINFER_BUILD_TYPE`	`Release`, `Debug`, `RelWithDebInfo`	Build type (default: `Release`)
`CPUINFER_PARALLEL`	Number	Parallel build jobs (default: auto-detect)
`CPUINFER_VERBOSE`	`0`, `1`	Verbose build output (default: `0`)

Instruction Set Details:

NATIVE: Auto-detect and use all available CPU instructions (-march=native) - Recommended for best performance
AVX512: Explicit AVX512 support for Skylake-SP and Cascade Lake
AVX2: AVX2 support for maximum compatibility
FANCY: AVX512 with full extensions (AVX512F/BW/DQ/VL/VNNI) for Ice Lake+ and Zen 4+. Use this when building pre-compiled binaries to distribute to users with modern CPUs. For local builds, prefer NATIVE for better performance.

Example Configurations:

# Maximum performance on AMX CPU
export CPUINFER_CPU_INSTRUCT=NATIVE
export CPUINFER_ENABLE_AMX=ON

# AVX512 CPU without AMX
export CPUINFER_CPU_INSTRUCT=AVX512
export CPUINFER_ENABLE_AMX=OFF

# Compatibility build
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF

# Debug build for development
export CPUINFER_BUILD_TYPE=Debug
export CPUINFER_VERBOSE=1

3. Build and Install

# Editable installation (for development)
pip install -e .

# Standard installation
pip install .

Error Troubleshooting

CUDA Not Found

 -- Looking for a CUDA compiler - NOTFOUND
  CMake Error at CMakeLists.txt:389 (message):
    KTRANSFORMERS_USE_CUDA=ON but CUDA compiler not found

Make sure you have the CUDA toolkit installed and nvcc is in your system PATH.

Try export CMAKE_ARGS="-D CMAKE_CUDA_COMPILER=$(which nvcc)" and reinstall again.

hwloc Not Found

Run sudo apt install libhwloc-dev if on a Debian-based system or build from source: https://www.open-mpi.org/projects/hwloc/.

wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.2.tar.gz
tar -xzf hwloc-2.12.2.tar.gz
cd hwloc-2.12.2
./configure
make
sudo make install

Weight Quantization

For AMX backends (AMXINT4 / AMXINT8), CPU-side experts must be converted to AMX-friendly INT4/INT8 format using the provided script:

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4

Supported formats: FP8, FP16, BF16 → INT4/INT8

For LLAMAFILE backend (LLAMAFILE), CPU-side experts are loaded directly from GGUF weights. You do not need to run the AMX conversion script; instead, download a GGUF model from the web (e.g., a GGUF repo on Hugging Face) and point weight_path / SGLang --kt-weight-path (or --model when appropriate) to that GGUF directory. KT-Kernel supports multiple GGUF quantization types such as Q4_KM, Q4_K, Q5_K, etc.

For detailed documentation, advanced options, and low-memory mode, see scripts/README.md.

Before Commit!

Commit messages should follow the Conventional Commits specification: https://www.conventionalcommits.org/

Please format your code before committing:

cmake -B build
cd build
make format

You may need a newer clang-format (at least version 18). In a conda environment:

conda install -c conda-forge clang-format=18
rm -rf build

It's also recommended to install black for Python code formatting:

conda install black