Kimi-K2 Support for KTransformers

Introduction

Overview

We are pleased to announce that KTransformers now supports Kimi-K2 and Kimi-K2-0905.

On a single-socket CPU with one consumer-grade GPU, the Q4_K_M model runs at roughly 10 tokens per second (TPS) and requires about 600 GB of DRAM.
On a dual-socket CPU with sufficient system memory, enabling NUMA optimizations raises throughput to about 14 TPS.
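Whether the dual-socket figure is relevant depends on how many NUMA nodes the machine exposes. The sketch below counts them from sysfs on Linux. The helper name `count_numa_nodes` is illustrative and not part of KTransformers, and a node count of two or more is only a hint: some single-socket CPUs also expose several nodes.

```shell
# count_numa_nodes [sysfs_dir]: prints how many nodeN directories exist
# under the given directory (default: the Linux sysfs NUMA location).
count_numa_nodes() {
  dir=${1:-/sys/devices/system/node}
  count=0
  for node in "$dir"/node[0-9]*; do
    # An unmatched glob stays literal, so confirm the entry is a directory.
    [ -d "$node" ] && count=$(( count + 1 ))
  done
  echo "$count"
}

nodes=$(count_numa_nodes)
if [ "$nodes" -ge 2 ]; then
  echo "$nodes NUMA nodes detected: NUMA optimizations may be worth enabling"
else
  echo "$nodes NUMA node(s) detected: expect the single-socket figure"
fi
```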

Installation Guide

1. Resource Requirements

Running the model with 384 experts requires approximately 600 GB of system memory (DRAM) and 14 GB of GPU memory.
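A quick preflight check can compare the machine against these figures before downloading anything. This is a minimal sketch for Linux: `REQUIRED_GB` and `has_enough_ram` are illustrative names, and the threshold treats the approximate 600 GB figure as 600 GiB, which is slightly conservative.

```shell
REQUIRED_GB=600

# has_enough_ram <total_kib> <required_gb>: succeeds when total_kib is at
# least required_gb expressed in KiB (GB is treated as GiB here).
has_enough_ram() {
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# MemTotal in /proc/meminfo is reported in KiB.
if [ -r /proc/meminfo ]; then
  total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  if has_enough_ram "$total_kib" "$REQUIRED_GB"; then
    echo "RAM check passed: $(( total_kib / 1024 / 1024 )) GiB total"
  else
    echo "RAM check failed: $(( total_kib / 1024 / 1024 )) GiB total, about $REQUIRED_GB GB needed"
  fi
fi

# nvidia-smi reports total memory per GPU in MiB; compare against about 14 GB.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
fi
```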

2. Prepare Models

# download gguf
huggingface-cli download --resume-download KVCache-ai/Kimi-K2-Instruct-GGUF
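Before starting the server it can help to confirm that the directory you intend to pass as `--gguf_path` actually contains GGUF files. The sketch below assumes a placeholder `GGUF_DIR` variable and an illustrative `count_gguf` helper, and it only looks at files directly inside the directory. Where the download lands depends on your Hugging Face cache settings, so adjust the path accordingly.

```shell
# count_gguf <dir>: prints how many *.gguf files sit directly inside <dir>.
count_gguf() {
  n=0
  for f in "$1"/*.gguf; do
    # An unmatched glob stays literal, so confirm the entry is a regular file.
    [ -f "$f" ] && n=$(( n + 1 ))
  done
  echo "$n"
}

# GGUF_DIR is a placeholder; point it at the directory holding the weights.
GGUF_DIR=${GGUF_DIR:-./Kimi-K2-Instruct-GGUF}
found=$(count_gguf "$GGUF_DIR")
if [ "$found" -eq 0 ]; then
  echo "no .gguf files found in $GGUF_DIR" >&2
else
  echo "found $found .gguf file(s) in $GGUF_DIR"
fi
```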

3. Install KTransformers

To install KTransformers, follow the official Installation Guide.

4. Run Kimi-K2 Inference Server

python ktransformers/server/main.py \
  --port 10002 \
  --model_path <path_to_safetensor_config> \
  --gguf_path <path_to_gguf_files> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --backend_type balance_serve

5. Access server

curl -X POST http://localhost:10002/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "Kimi-K2",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
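With `"stream": true`, the reply arrives incrementally rather than as one JSON document. Assuming the server follows the usual OpenAI-style server-sent events framing (`data: {...}` lines terminated by `data: [DONE]`), which this guide does not state explicitly, a small filter can strip the framing and leave one JSON chunk per line. The helper name `sse_payloads` is illustrative.

```shell
# sse_payloads: reads a server-sent events stream on stdin, prints the payload
# of each "data: " line, and stops at the "data: [DONE]" sentinel.
sse_payloads() {
  while IFS= read -r line; do
    # Drop a trailing carriage return in case the stream uses CRLF line endings.
    line=${line%"$(printf '\r')"}
    case $line in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}
```

To use it, add `-s -N` to the `curl` command above (silent mode and unbuffered output) and pipe the result into `sse_payloads`. Each printed line can then be handed to a JSON tool of your choice.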