mirror of https://github.com/kvcache-ai/ktransformers.git
[New Model] DeepSeek-V4-Flash: kt-kernel MXFP4 MoE + sglang hybrid inference (#1970)
* [feat](kt-kernel): add MXFP4 MoE operator with E2M1 weights × BF16 activations

  Implements AMX_FP4_MOE_TP based on the RAWINT4 (k2-moe) CRTP pattern. FP4 E2M1 weights are nibble-packed and decoded via a PSHUFB LUT, then computed against BF16 activations using _mm512_dpbf16_ps. Supports weight-only per-kgroup scaling (group_size=32) and tensor parallelism. Includes a Python validation test covering uniform, alternating, ramp, and random weight patterns.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950)

  V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}` naming and converts ue8m0 → bf16 via a lossless bit-cast, register the model entry, and ship an end-to-end numerical validation script.

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [perf](kt-kernel): MXFP4 MoE: add mat-mat 4×4 tile, refine mat-vec reduce (#1957)

  mat_mul_kgroup previously aliased to fp4_mat_vec_kgroup, leaving large batches stuck on the per-token path. Implement fp4_mat_mat_kgroup as a 4×4 register tile (MB=NB=4, 16 zmm accumulators) so each PSHUFB decode of four weight rows is reused across four tokens. Refactor fp4_mat_vec_kgroup to accumulate four N-rows in parallel and flush them with a new reduce4 helper, removing per-row reduce_add_ps calls from the hot loop. Mark mxfp4_to_bf16_32 always_inline. Add bench/bench_fp4_moe.py with --routing {balanced,concentrated} and a backend registry so future kernels can be added without changing the runner. Dispatch thresholds, derived_init, GeneralMOEConfig handling, load_weights, write_weights_to_buffer, and the TP_MOE specialization are unchanged.

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(loader): avoid uint16 lshift in ue8m0 -> bf16 conversion

  PyTorch CPU has no lshift kernel for UInt16, so the previous `(scale_t.to(torch.uint16) << 7)` raised NotImplementedError when loading any V4-Flash MXFP4 routed-expert scale tensor on the host. Switch to int32 for the shift (a CPU kernel exists) and narrow to int16 afterwards. The maximum shifted value is 255 << 7 = 32640, well within int16 range, so the narrowing is lossless. The .view(bfloat16) bit pattern is identical (the bf16 sign bit is always 0 for ue8m0 values).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v4-flash): hybrid CPU/GPU recipe + bump kt-sglang submodule

  Bumps third_party/sglang to kvcache-ai/sglang main (3cbd49c29), which now contains DeepSeek V4-Flash model support plus consumer-GPU (SM_120) portable Triton/TileLang fallbacks (kt-sglang PR #38). Adds the doc/en/DeepSeek-V4-Flash.md tutorial: an 8x RTX 5090 hybrid recipe with the full launch command, OpenAI-compatible /generate and /v1/chat/completions examples, and the kt chat CLI client.

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
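For reference, the snippet below is a minimal Python sketch of the two decode steps the commits above describe: the 16-entry E2M1 nibble LUT and the lossless ue8m0 → bf16 scale cast from the loader fix. It is illustrative only; the real work happens in the AVX-512 kernel and the safetensors loader, and the helper names (E2M1_LUT, ue8m0_to_bf16, unpack_mxfp4_row) as well as the low-nibble-first packing order are assumptions, not code from this PR.

import torch

# 16-entry E2M1 lookup table: nibble = sign(1) | exponent(2) | mantissa(1).
# The kernel keeps this table in a register and applies it with PSHUFB;
# a plain index lookup plays the same role here.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=torch.bfloat16,
)

def ue8m0_to_bf16(scale_u8: torch.Tensor) -> torch.Tensor:
    """Bit-cast ue8m0 group scales (pure powers of two, 2**(e - 127)) to bf16.

    bf16 is [sign:1][exp:8][mantissa:7], so shifting the ue8m0 byte into the
    exponent field reproduces the same power of two. The shift runs in int32
    because PyTorch CPU has no uint16 lshift kernel; 255 << 7 == 32640 still
    fits in int16, so the narrowing cast before the view is lossless.
    """
    return (scale_u8.to(torch.int32) << 7).to(torch.int16).view(torch.bfloat16)

def unpack_mxfp4_row(packed_u8: torch.Tensor, scale_u8: torch.Tensor,
                     group_size: int = 32) -> torch.Tensor:
    """Decode one nibble-packed E2M1 row and apply its per-kgroup scales."""
    lo = (packed_u8 & 0x0F).long()                       # low nibble first (assumed)
    hi = (packed_u8 >> 4).long()
    nibbles = torch.stack((lo, hi), dim=-1).flatten()    # two FP4 values per byte
    vals = E2M1_LUT[nibbles]                             # LUT decode to bf16
    vals = vals.view(-1, group_size) * ue8m0_to_bf16(scale_u8).view(-1, 1)
    return vals.flatten()

The kernel performs the same 16-entry lookup with PSHUFB across a full ZMM register, which is why the 4×4 mat-mat tile from #1957 pays off: each decoded group of weight rows is reused across four tokens instead of being re-decoded per token.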
@@ -81,6 +81,26 @@ BUILTIN_MODELS: list[ModelInfo] = [
         description="DeepSeek R1-0528 reasoning model (May 2025, improved reasoning depth)",
         description_zh="DeepSeek R1-0528 推理模型(2025年5月,改进的推理深度)",
     ),
+    ModelInfo(
+        name="DeepSeek-V4-Flash",
+        hf_repo="deepseek-ai/DeepSeek-V4-Flash",
+        aliases=["deepseek-v4-flash", "deepseek-v4", "dsv4", "v4-flash", "v4"],
+        type="moe",
+        default_params={
+            "kt-method": "MXFP4",
+            "kt-gpu-prefill-token-threshold": 4096,
+            "attention-backend": "flashinfer",
+            "max-total-tokens": 100000,
+            "max-running-requests": 16,
+            "chunked-prefill-size": 32768,
+            "mem-fraction-static": 0.80,
+            "watchdog-timeout": 3000,
+            "served-model-name": "DeepSeek-V4-Flash",
+            "disable-shared-experts-fusion": True,
+        },
+        description="DeepSeek V4-Flash MoE model (native MXFP4 experts, MQA + sparse index attention)",
+        description_zh="DeepSeek V4-Flash MoE 模型(原生 MXFP4 专家,MQA + 稀疏索引注意力)",
+    ),
     ModelInfo(
         name="Kimi-K2-Thinking",
         hf_repo="moonshotai/Kimi-K2-Thinking",
@@ -368,6 +388,19 @@ def compute_deepseek_v3_gpu_experts(tensor_parallel_size: int, vram_per_gpu_gb:
     return total_vram // 3


+def compute_deepseek_v4_gpu_experts(tensor_parallel_size: int, vram_per_gpu_gb: float) -> int:
+    """Compute kt-num-gpu-experts for DeepSeek-V4-Flash.
+
+    V4 uses MXFP4 experts (~0.5 bytes/param vs V3 FP8's 1 byte/param) so each GPU
+    can hold ~2x more experts per VRAM unit than V3 at the same fragmentation.
+    """
+    per_gpu_gb = 16
+    if vram_per_gpu_gb < per_gpu_gb:
+        return 0
+    total_vram = int(tensor_parallel_size * (vram_per_gpu_gb - per_gpu_gb))
+    return total_vram * 2 // 3
+
+
 def compute_kimi_k2_thinking_gpu_experts(tensor_parallel_size: int, vram_per_gpu_gb: float) -> int:
     """Compute kt-num-gpu-experts for Kimi K2 Thinking."""
     per_gpu_gb = 16
@@ -393,6 +426,7 @@ MODEL_COMPUTE_FUNCTIONS: dict[str, Callable[[int, float], int]] = {
     "DeepSeek-V3-0324": compute_deepseek_v3_gpu_experts,
     "DeepSeek-V3.2": compute_deepseek_v3_gpu_experts,  # Same as V3-0324
     "DeepSeek-R1-0528": compute_deepseek_v3_gpu_experts,  # Same as V3-0324
+    "DeepSeek-V4-Flash": compute_deepseek_v4_gpu_experts,
     "Kimi-K2-Thinking": compute_kimi_k2_thinking_gpu_experts,
     "MiniMax-M2": compute_minimax_m2_gpu_experts,
     "MiniMax-M2.1": compute_minimax_m2_gpu_experts,  # Same as M2
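To make the VRAM budgeting concrete, here is a small sanity check of the new compute function on the tutorial's 8x RTX 5090 setup. The function body is copied from the diff above; the 32 GB-per-card figure is an assumption taken from the 5090's spec sheet, and the 24 GB case is just an extra illustration.

def compute_deepseek_v4_gpu_experts(tensor_parallel_size: int, vram_per_gpu_gb: float) -> int:
    """Same logic as the diff: subtract 16 GB per GPU, then 2 experts per 3 GB left."""
    per_gpu_gb = 16
    if vram_per_gpu_gb < per_gpu_gb:
        return 0
    total_vram = int(tensor_parallel_size * (vram_per_gpu_gb - per_gpu_gb))
    return total_vram * 2 // 3

# 8x RTX 5090 (32 GB each): 8 * (32 - 16) = 128 GB spare -> 128 * 2 // 3 = 85 GPU experts
print(compute_deepseek_v4_gpu_experts(8, 32.0))  # 85
# 8x 24 GB cards: 8 * (24 - 16) = 64 GB spare -> 64 * 2 // 3 = 42 GPU experts
print(compute_deepseek_v4_gpu_experts(8, 24.0))  # 42

In the repo this function is reached through MODEL_COMPUTE_FUNCTIONS["DeepSeek-V4-Flash"], which the third hunk registers alongside the V3, Kimi-K2, and MiniMax entries.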