Commit Graph

16 Commits

Author SHA1 Message Date
Benjamin F
bb15fdf47e Release/0.6.2.post3: carry kt-kernel SwiGLU clamp companion missing from post2 2026-05-10 03:55:02 +08:00
Benjamin F
041bdfc636 [New Model] DeepSeek-V4-Flash: kt-kernel MXFP4 MoE + sglang hybrid inference (#1970)
* [feat](kt-kernel): add MXFP4 MoE operator with E2M1 weights × BF16 activations

Implements AMX_FP4_MOE_TP based on the RAWINT4 (k2-moe) CRTP pattern.
FP4 E2M1 weights are nibble-packed and decoded via PSHUFB LUT, then
computed with BF16 activations using _mm512_dpbf16_ps. Supports weight-only
per-kgroup scaling (group_size=32) and tensor parallelism.

Includes a Python validation test covering uniform, alternating, ramp,
and random weight patterns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950)

V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group
scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader
that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}`
naming and converts ue8m0 → bf16 via a lossless bit-cast, register the
model entry, and ship an end-to-end numerical validation script.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [perf](kt-kernel): MXFP4 MoE add mat-mat 4×4 tile, refine mat-vec reduce (#1957)

mat_mul_kgroup previously aliased to fp4_mat_vec_kgroup, leaving large
batches stuck on the per-token path. Implement fp4_mat_mat_kgroup as a
4×4 register tile (MB=NB=4, 16 zmm accumulators) so each PSHUFB decode
of four weight rows is reused across four tokens.

Refactor fp4_mat_vec_kgroup to accumulate four N-rows in parallel and
flush them with a new reduce4 helper, removing per-row reduce_add_ps
calls from the hot loop. Mark mxfp4_to_bf16_32 always_inline.

Add bench/bench_fp4_moe.py with --routing {balanced,concentrated} and
a backend registry so future kernels can be added without changing the
runner.

Dispatch thresholds, derived_init, GeneralMOEConfig handling,
load_weights, write_weights_to_buffer and the TP_MOE specialization are
unchanged.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(loader): avoid uint16 lshift in ue8m0->bf16 conversion

PyTorch CPU has no lshift kernel for UInt16, so the previous
`(scale_t.to(torch.uint16) << 7)` raised NotImplementedError when
loading any V4-Flash MXFP4 routed-expert scale tensor on the host.

Switch to int32 for the shift (kernel exists) and narrow to int16
afterwards. The shifted value max is 255<<7 = 32640, well within
int16 range, so the narrow is lossless. The .view(bfloat16) bit
pattern is identical (bf16 sign bit is always 0 for ue8m0 values).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v4-flash): hybrid CPU/GPU recipe + bump kt-sglang submodule

Bumps third_party/sglang to kvcache-ai/sglang main (3cbd49c29) which now
contains DeepSeek V4 Flash model support + consumer-GPU (SM_120) portable
Triton/TileLang fallbacks (kt-sglang PR #38).

Adds doc/en/DeepSeek-V4-Flash.md tutorial: 8x RTX 5090 hybrid recipe with
the full launch command, OpenAI-compatible /generate + /v1/chat/completions
examples, and the kt chat CLI client.

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-03 10:48:31 +08:00
mrhaoxx
9544a8960d feat(sft): AMX MoE SFT backend with LoRA support (#1936)
* feat(sft): AMX MoE SFT backend with LoRA support

Complete SFT (Supervised Fine-Tuning) backend for MoE models using AMX SIMD:

Core C++ implementation:
- sft_moe.hpp: Forward/backward with LoRA fused operations (~5500 lines)
- moe-sft-tp.hpp: Tensor-parallel wrapper for multi-NUMA
- amx/moe-sft-tp.hpp: AMX-specific TP implementation
- avx_kernels.hpp: AVX512 SIMD kernels for LoRA GEMM
- amx_kernels.hpp: AMX tile kernels for Panel5 rank-outer optimization
- worker_pool: RDTSC profiling, Chrome trace output, SFT timer infrastructure
- ext_bindings.cpp: SFT MOE pybind bindings (BF16/INT8/INT4 + SkipLoRA variants)

Python sft/ submodule (kt_kernel.sft):
- base.py: BaseSFTMoEWrapper with buffer management (template method pattern)
- amx.py: AMXSFTMoEWrapper (weight loading, C++ task construction)
- autograd.py: KTMoEFunction (torch.autograd.Function for distributed training)
- layer.py: KTMoELayerWrapper (nn.Module replacing HF MoE layers)
- arch.py: MOEArchConfig (Qwen3/DeepSeek/Mixtral architecture detection)
- weights.py: Expert weight extraction and checkpoint loading
- lora.py: PEFT LoRA adaptation (view buffers, grad buffers, save/load adapter)
- wrapper.py: wrap_moe_layers_with_kt_wrapper, load_kt_model, build_kt_device_map
- config.py: KTConfig dataclass (DeepSpeed-style opaque config passthrough)
- dist_utils.py: Distributed gather/scatter, checkpoint-phase detection

Design decisions:
- Rank-0-only expert pattern: only rank 0 holds C++ wrapper and expert weights
- DeepSpeed-style integration: accelerate keeps only KTransformersPlugin (framework
  interaction fields), all logic in kt_kernel.sft
- Inference isolation: importing kt_kernel does not load sft/ submodule
- Old field name compatibility: _get_kt_config() converts kt_xxx→xxx automatically

Verified: Qwen3-235B-A22B 4GPU AMXBF16 training, loss converges normally.

* refactor(sft): unify KTConfig field names with kt_ prefix, add share_cache_pool, remove dead code

- KTConfig fields all use kt_ prefix matching dict keys — eliminates
  _OLD_TO_NEW mapping and prefix-stripping in wrapper.py
- Add kt_share_cache_pool field, auto-enabled when gradient_checkpointing
  is on (via training_args.py), flows through to C++ cache allocation
- Remove dead checkpoint detection code: in_ckpt_recompute,
  in_ckpt_first_forward vars (assigned but never read), fallback
  _is_in_checkpoint_first_forward() function, unused inspect import
- Remove redundant env var fallbacks in wrapper.py for share_backward_bb
  and share_cache_pool (KTConfig.__post_init__ already handles env vars)
- Simplify layer.py checkpoint logic to single _checkpoint_hook_mode() check

Verified: Qwen3-235B 3-step training on sap4, loss matches baseline
(1.2886 / 1.9824 / 1.377 vs 1.2886 / 1.9766 / 1.3809)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(sft): share_backward_bb default True, share_cache_pool auto-derived

- kt_share_backward_bb defaults to True (always saves memory)
- kt_share_cache_pool no longer reads from env var; defaults False,
  auto-set to True by trainer_config_process when gradient checkpointing
  is enabled

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add missing gpu_experts_mask=None to KTMoEWrapper call in SFT wrapper

KTMoEWrapper.__new__() requires gpu_experts_mask as a positional argument,
but the SFT wrapper omitted it, causing MoE layer wrapping to fail silently
and FSDP2 to attempt broadcasting all expert weights (OOM/NCCL crash).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(sft): support transformers v5 fused expert format

Fused experts (e.g. Qwen3MoeExperts) store weights as 3D Parameters
(gate_up_proj [E,2I,H], down_proj [E,H,I]) instead of per-expert
nn.Linear modules. PEFT cannot attach LoRA to these, so we create
KT-managed LoRA buffers with kaiming init, nn.Parameter wrappers
for the optimizer, and pre-assigned .grad for C++ backward.

- arch.py: detect_fused_experts() detection
- weights.py: fused format extraction and weight clearing
- wrapper.py: detect fused at wrap time, store _fused_experts/_lora_rank
- lora.py: _create_fused_expert_lora_buffers, save/load fused LoRA,
  get_kt_lora_params collects fused params, deduplicate wrapper finding
- layer.py: handle v5 TopKRouter tuple output, remove dead code
- autograd.py: sync_forward_sft/submit_forward_sft API rename

Verified: v5 loss/expert-LoRA values match v4 baseline, v4 backward compat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(sft): add Qwen3.5 MoE support + fused checkpoint loading

- arch.py: add Qwen3_5Moe arch match, read config from text_config,
  _get_layers_prefix returns model.language_model.layers for Qwen3.5,
  _get_model_container_and_layers searches language_model attr
- weights.py: load_experts_from_checkpoint_files detects fused format
  (gate_up_proj in weight_map) and splits into gate/up/down
- wrapper.py: hidden_size fallback to text_config

Verified: Qwen3.5-35B-A3B (256 experts, fused format) E2E pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [fix](sft): align Python API with C++ backend after v5 refactor

- wrapper.py: pass gpu_experts_mask=None to KTMoEWrapper (required by C++ signature)
- layer.py: rename submit_forward_sft/sync_forward_sft to submit_forward/sync_forward
- autograd.py: rename sync_forward_sft to sync_forward

The sft-v5 refactor (commits 58d7eab, dd1da65) renamed Python-side method
calls but the C++ backend (AMXSFTMoEWrapper) still exposes the original
method names. This caused AttributeError on Qwen3.5-35B and other models.

* align sft branch with main: revert worker_pool, strip sft_timer, fix inference defaults

- Revert worker_pool.cpp/.h to main (remove RDTSC timer, Chrome Trace,
  sft_timer namespace, ITT API, extended do_work_stealing_job API)
- Strip all sft_timer instrumentation from sft-only files (sft_moe.hpp,
  moe-sft-tp.hpp, avx_kernels.hpp)
- Restore pin_memory=True in KExpertsCPUBuffer (inference path)
- Restore fused tensor transpose logic in convert_cpu_weights.py (main layout)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* revert CMakeLists.txt to main: remove debug flags and cpptrace dep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* clean up dev artifacts: remove SFT design docs, debug examples, bench scripts

Remove files not needed in the merge:
- docs/SFT+KTWrapper/ (6 Chinese design docs)
- docs/sft_moe_amx/ (21 dev/debug docs)
- 12 debug/test example scripts
- 6 SFT-specific bench scripts and report

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* remove dev version stamps from ext_bindings, sft_moe, moe-sft-tp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: JimmyPeilinLi <lipeilin@mail.nwpu.edu.cn>
2026-04-22 11:27:01 +08:00
ErvinXie
3903c9afcc (kt-kernel): add numa_nodes parameter for explicit NUMA node mapping (#1891)
Add numa_nodes parameter to BaseMoEWrapper and all subclasses, allowing
users to explicitly specify which NUMA node IDs to use for subpool
mapping instead of always defaulting to sequential [0, 1, ..., N-1].

This enables running multiple KTransformers instances on different NUMA
nodes of the same machine, e.g. --kt-threadpool-count 1 --kt-numa-nodes 1
to bind to NUMA node 1. Previously this required external numactl
workarounds since subpool_numa_map was hardcoded to start from 0.
2026-03-31 10:27:50 +08:00
mrhaoxx
7a9daf0cd4 [feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892)
* feat: support avx2 bf16 fp8 inference

* feat: support avx2 gptq int4 inference

* fix: numeric issues in fp8 dequant

* Tutorial avx2 (#1900)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* Tutorial avx2 (#1901)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* docs: update README.md

---------

Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>
2026-03-27 14:45:02 +08:00
Jianwei Dong
027832c590 [feat](kt-kernel): CPU-GPU experts sched (#1796) 2026-01-16 17:01:15 +08:00
Oql
5edc456749 support Native BF16 format MoE. (#1788)
support Native BF16 format MoE
2026-01-12 14:43:28 +08:00
ErvinXie
d8046e1bb4 Kt minimax (#1742)
[feat]: fp8 kernel and kt-cli support
2025-12-24 15:39:44 +08:00
Jiaqi Liao
fcf8882075 [Feature] Add avx-based kimi-k2 support (#1656)
* support Kimi-K2-Thinking original weight
fix amx kernel bug

* update k2 avx kernel.

* feat: add CPUInfer write buffer task

* [feat]: add kimi k2 cpu write buffer support

- Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights
- Fix down (w2) weight column-wise slicing for different TP configurations
- Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp
- Add comprehensive test cases for weight extraction validation
- Ensure compatibility with Kimi model's MoE architecture

* [fix]: correct write_weight_scale_to_buffer expert offset calculation

Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios.

Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* [feat]: add write buffer wrapper

* [fix] fix comment

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-02 16:01:07 +08:00
ZiWei Yuan
1374b98ee5 [feat](moe_kernel): add amd blis support (int8) (#1600)
* [feat]: init amd adaption

* [feat]: add blis support

* [fix]: fix setup and moe kernel warpper

* [fix](setup.py): support rebuild with cache and import kt_kernel works
fine

* [feat]: add moe_kernel converter for amd and implement the load
method(haven't tested yet)

* [feat](moe_kernel/moe.hpp): delete unused memory when using save

* [fix](moe_kernel): update PLAIN for pack

* [fix](moe_kernel): rm printf debug

* [fix](moe_kernel): skip gpu experts

* [fix](moe_kernel/moe.hpp): update include memory path

* [feat](moe_kernel/moe.hpp): support expert deferral

* [feat]: finish amd

---------

Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>
2025-11-27 12:08:53 +08:00
Jiaqi Liao
9bc00e587b Refactor KTMoEWrapper backend (#1587)
* universal backend for cpu inference
* expert defer
2025-11-10 20:26:15 +08:00
chenht2022
6fe30af50d Merge branch 'main' into develop-cht 2025-11-03 14:35:44 +00:00
ovowei
f854d03bd7 update kt-kernel 2025-11-03 15:19:52 +08:00
chenht2022
dd4377b60b feat: add deferred expert scheduling support 2025-10-31 08:03:37 +00:00
ovowei
28d8663374 fix 2025-10-22 18:14:34 +08:00
Atream
4c5fcf9774 add kt-kernel 2025-10-12 05:13:00 +00:00