[New Model] DeepSeek-V4-Flash: kt-kernel MXFP4 MoE + sglang hybrid inference (#1970)

* [feat](kt-kernel): add MXFP4 MoE operator with E2M1 weights × BF16 activations

Implements AMX_FP4_MOE_TP based on the RAWINT4 (k2-moe) CRTP pattern.
FP4 E2M1 weights are nibble-packed and decoded via PSHUFB LUT, then
computed with BF16 activations using _mm512_dpbf16_ps. Supports weight-only
per-kgroup scaling (group_size=32) and tensor parallelism.

Includes a Python validation test covering uniform, alternating, ramp,
and random weight patterns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950)

V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group
scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader
that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}`
naming and converts ue8m0 → bf16 via a lossless bit-cast, register the
model entry, and ship an end-to-end numerical validation script.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [perf](kt-kernel): MXFP4 MoE add mat-mat 4×4 tile, refine mat-vec reduce (#1957)

mat_mul_kgroup previously aliased to fp4_mat_vec_kgroup, leaving large
batches stuck on the per-token path. Implement fp4_mat_mat_kgroup as a
4×4 register tile (MB=NB=4, 16 zmm accumulators) so each PSHUFB decode
of four weight rows is reused across four tokens.

Refactor fp4_mat_vec_kgroup to accumulate four N-rows in parallel and
flush them with a new reduce4 helper, removing per-row reduce_add_ps
calls from the hot loop. Mark mxfp4_to_bf16_32 always_inline.

Add bench/bench_fp4_moe.py with --routing {balanced,concentrated} and
a backend registry so future kernels can be added without changing the
runner.

Dispatch thresholds, derived_init, GeneralMOEConfig handling,
load_weights, write_weights_to_buffer and the TP_MOE specialization are
unchanged.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(loader): avoid uint16 lshift in ue8m0->bf16 conversion

PyTorch CPU has no lshift kernel for UInt16, so the previous
`(scale_t.to(torch.uint16) << 7)` raised NotImplementedError when
loading any V4-Flash MXFP4 routed-expert scale tensor on the host.

Switch to int32 for the shift (kernel exists) and narrow to int16
afterwards. The shifted value max is 255<<7 = 32640, well within
int16 range, so the narrow is lossless. The .view(bfloat16) bit
pattern is identical (bf16 sign bit is always 0 for ue8m0 values).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v4-flash): hybrid CPU/GPU recipe + bump kt-sglang submodule

Bumps third_party/sglang to kvcache-ai/sglang main (3cbd49c29) which now
contains DeepSeek V4 Flash model support + consumer-GPU (SM_120) portable
Triton/TileLang fallbacks (kt-sglang PR #38).

Adds doc/en/DeepSeek-V4-Flash.md tutorial: 8x RTX 5090 hybrid recipe with
the full launch command, OpenAI-compatible /generate + /v1/chat/completions
examples, and the kt chat CLI client.

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Benjamin F
2026-05-03 10:48:31 +08:00
committed by GitHub
parent fe06c4d355
commit 041bdfc636
12 changed files with 1902 additions and 2 deletions

View File

@@ -81,6 +81,26 @@ BUILTIN_MODELS: list[ModelInfo] = [
description="DeepSeek R1-0528 reasoning model (May 2025, improved reasoning depth)",
description_zh="DeepSeek R1-0528 推理模型2025年5月改进的推理深度",
),
ModelInfo(
name="DeepSeek-V4-Flash",
hf_repo="deepseek-ai/DeepSeek-V4-Flash",
aliases=["deepseek-v4-flash", "deepseek-v4", "dsv4", "v4-flash", "v4"],
type="moe",
default_params={
"kt-method": "MXFP4",
"kt-gpu-prefill-token-threshold": 4096,
"attention-backend": "flashinfer",
"max-total-tokens": 100000,
"max-running-requests": 16,
"chunked-prefill-size": 32768,
"mem-fraction-static": 0.80,
"watchdog-timeout": 3000,
"served-model-name": "DeepSeek-V4-Flash",
"disable-shared-experts-fusion": True,
},
description="DeepSeek V4-Flash MoE model (native MXFP4 experts, MQA + sparse index attention)",
description_zh="DeepSeek V4-Flash MoE 模型(原生 MXFP4 专家MQA + 稀疏索引注意力)",
),
ModelInfo(
name="Kimi-K2-Thinking",
hf_repo="moonshotai/Kimi-K2-Thinking",
@@ -368,6 +388,19 @@ def compute_deepseek_v3_gpu_experts(tensor_parallel_size: int, vram_per_gpu_gb:
return total_vram // 3
def compute_deepseek_v4_gpu_experts(tensor_parallel_size: int, vram_per_gpu_gb: float) -> int:
"""Compute kt-num-gpu-experts for DeepSeek-V4-Flash.
V4 uses MXFP4 experts (~0.5 bytes/param vs V3 FP8's 1 byte/param) so each GPU
can hold ~2x more experts per VRAM unit than V3 at the same fragmentation.
"""
per_gpu_gb = 16
if vram_per_gpu_gb < per_gpu_gb:
return 0
total_vram = int(tensor_parallel_size * (vram_per_gpu_gb - per_gpu_gb))
return total_vram * 2 // 3
def compute_kimi_k2_thinking_gpu_experts(tensor_parallel_size: int, vram_per_gpu_gb: float) -> int:
"""Compute kt-num-gpu-experts for Kimi K2 Thinking."""
per_gpu_gb = 16
@@ -393,6 +426,7 @@ MODEL_COMPUTE_FUNCTIONS: dict[str, Callable[[int, float], int]] = {
"DeepSeek-V3-0324": compute_deepseek_v3_gpu_experts,
"DeepSeek-V3.2": compute_deepseek_v3_gpu_experts, # Same as V3-0324
"DeepSeek-R1-0528": compute_deepseek_v3_gpu_experts, # Same as V3-0324
"DeepSeek-V4-Flash": compute_deepseek_v4_gpu_experts,
"Kimi-K2-Thinking": compute_kimi_k2_thinking_gpu_experts,
"MiniMax-M2": compute_minimax_m2_gpu_experts,
"MiniMax-M2.1": compute_minimax_m2_gpu_experts, # Same as M2