Commit Graph

19 Commits

Author SHA1 Message Date
Benjamin F
bb15fdf47e Release/0.6.2.post3: carry kt-kernel SwiGLU clamp companion missing from post2 2026-05-10 03:55:02 +08:00
Benjamin F
041bdfc636 [New Model] DeepSeek-V4-Flash: kt-kernel MXFP4 MoE + sglang hybrid inference (#1970)
* [feat](kt-kernel): add MXFP4 MoE operator with E2M1 weights × BF16 activations

Implements AMX_FP4_MOE_TP based on the RAWINT4 (k2-moe) CRTP pattern.
FP4 E2M1 weights are nibble-packed and decoded via a PSHUFB LUT, and the
dot products with BF16 activations are accumulated using _mm512_dpbf16_ps.
Supports weight-only per-kgroup scaling (group_size=32) and tensor parallelism.

Includes a Python validation test covering uniform, alternating, ramp,
and random weight patterns.
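
A scalar Python reference of what the PSHUFB LUT decode produces (the table lists the 16 E2M1 code points; the helper name and nibble order are illustrative, not the kernel's actual layout):

```python
# The 16 values representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the raw 4-bit code. This is the table a PSHUFB-based decode bakes
# into a vector register.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def decode_fp4_pair(packed_byte: int) -> tuple[float, float]:
    """Unpack one nibble-packed byte into two E2M1 values (low nibble first,
    which is an assumption about the packing order)."""
    return E2M1_LUT[packed_byte & 0xF], E2M1_LUT[packed_byte >> 4]
```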

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950)

V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group
scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader
that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}`
naming and converts ue8m0 → bf16 via a lossless bit-cast, register the
model entry, and ship an end-to-end numerical validation script.
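
For illustration, a tiny sketch of the tensor-name template the loader has to cover (the template is taken from the naming above; the helper itself is hypothetical):

```python
def v4_flash_expert_keys(layer: int, expert: int):
    # Enumerate the {w1, w3, w2} x {weight, scale} tensors of one routed expert.
    for proj in ("w1", "w3", "w2"):
        for kind in ("weight", "scale"):
            yield f"layers.{layer}.ffn.experts.{expert}.{proj}.{kind}"
```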

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [perf](kt-kernel): MXFP4 MoE add mat-mat 4×4 tile, refine mat-vec reduce (#1957)

mat_mul_kgroup previously aliased to fp4_mat_vec_kgroup, leaving large
batches stuck on the per-token path. Implement fp4_mat_mat_kgroup as a
4×4 register tile (MB=NB=4, 16 zmm accumulators) so each PSHUFB decode
of four weight rows is reused across four tokens.
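
A toy NumPy model of the tiling idea, purely to show why the 4×4 tile pays off (names and the decode callback are illustrative; the real kernel keeps everything in zmm registers, not Python lists):

```python
import numpy as np

def tiled_matmul(act, packed_rows, decode_row, MB=4, NB=4):
    # act: [M, K] token activations; packed_rows: N packed weight rows;
    # decode_row: any function mapping a packed row to a length-K float vector.
    M, _ = act.shape
    N = len(packed_rows)
    out = np.zeros((M, N), dtype=np.float32)
    for n0 in range(0, N, NB):
        for m0 in range(0, M, MB):
            # Decode NB weight rows once per MB x NB tile; in the real kernel the
            # decoded rows stay in registers, so one decode serves MB tokens
            # instead of being repeated per token as on the mat-vec path.
            rows = [decode_row(packed_rows[n]) for n in range(n0, min(n0 + NB, N))]
            for m in range(m0, min(m0 + MB, M)):
                for j, row in enumerate(rows):
                    out[m, n0 + j] = float(np.dot(act[m], row))
    return out
```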

Refactor fp4_mat_vec_kgroup to accumulate four N-rows in parallel and
flush them with a new reduce4 helper, removing per-row reduce_add_ps
calls from the hot loop. Mark mxfp4_to_bf16_32 always_inline.

Add bench/bench_fp4_moe.py with --routing {balanced,concentrated} and
a backend registry so future kernels can be added without changing the
runner.
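
A minimal sketch of the kind of registry the runner could use, assuming nothing about the actual bench file beyond the idea described above (all names are hypothetical):

```python
BACKENDS = {}

def register_backend(name):
    # New kernels register under a name; the runner just looks the name up,
    # so adding a backend never touches the runner itself.
    def wrap(factory):
        BACKENDS[name] = factory
        return factory
    return wrap

@register_backend("fp4_kgroup")
def make_fp4_kgroup(config):
    ...  # construct and return the MXFP4 MoE backend under test

# runner side: backend = BACKENDS[args.backend](config)
```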

Dispatch thresholds, derived_init, GeneralMOEConfig handling,
load_weights, write_weights_to_buffer and the TP_MOE specialization are
unchanged.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(loader): avoid uint16 lshift in ue8m0->bf16 conversion

PyTorch CPU has no lshift kernel for UInt16, so the previous
`(scale_t.to(torch.uint16) << 7)` raised NotImplementedError when
loading any V4-Flash MXFP4 routed-expert scale tensor on the host.

Switch to int32 for the shift (kernel exists) and narrow to int16
afterwards. The shifted value max is 255<<7 = 32640, well within
int16 range, so the narrow is lossless. The .view(bfloat16) bit
pattern is identical (bf16 sign bit is always 0 for ue8m0 values).
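
The conversion described above, as a minimal PyTorch sketch (the function name is illustrative):

```python
import torch

def ue8m0_to_bf16(scale_u8: torch.Tensor) -> torch.Tensor:
    # ue8m0 stores only a biased exponent E, representing 2**(E - 127). Placing
    # E in the bf16 exponent field (bits 14..7, sign and mantissa zero) yields
    # that power of two for the exponent range these scales actually use, so
    # the conversion is a bit-cast. Shift in int32 (a CPU lshift kernel exists
    # there), then narrow: 255 << 7 = 32640 fits in int16, so it is lossless.
    shifted = scale_u8.to(torch.int32) << 7
    return shifted.to(torch.int16).view(torch.bfloat16)
```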

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v4-flash): hybrid CPU/GPU recipe + bump kt-sglang submodule

Bumps third_party/sglang to kvcache-ai/sglang main (3cbd49c29) which now
contains DeepSeek V4 Flash model support + consumer-GPU (SM_120) portable
Triton/TileLang fallbacks (kt-sglang PR #38).

Adds doc/en/DeepSeek-V4-Flash.md tutorial: 8x RTX 5090 hybrid recipe with
the full launch command, OpenAI-compatible /generate + /v1/chat/completions
examples, and the kt chat CLI client.
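
For example, once the server from the tutorial is up, an OpenAI-compatible request looks roughly like this (host, port and model name are placeholders, not the tutorial's exact values):

```python
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",  # placeholder address
    json={
        "model": "DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```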

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-03 10:48:31 +08:00
Aliez Ren
02be2bf53f [feat](kt-kernel): add AVX2/AVX-VNNI RAWINT4 MoE backend (#1942)
* [feat](kt-kernel): add AVX2/AVX-VNNI RAWINT4 MoE backend

* Update AVX2 tutorial with AVX2 compilation instructions

Added instructions for forcing AVX2 compilation on AVX512 or AMX machines.

* Add instructions for AVX2 compilation

---------

Co-authored-by: Jiaheng Dai <108478605+jdai0@users.noreply.github.com>
2026-04-30 17:16:49 +08:00
callmegaga
a9411f1d72 Supports vnni-256 for GPTQ INT4 (#1926)
* [feat](kt-kernel): support avx-vnni-256 for gptq int4
2026-04-13 17:59:59 +08:00
Oql
9e6484a538 [fix]: fix --numa-nodes handling (#1904)
* [fix]: fix --numa-nodes handling
2026-03-31 17:50:22 +08:00
ErvinXie
3903c9afcc (kt-kernel): add numa_nodes parameter for explicit NUMA node mapping (#1891)
Add numa_nodes parameter to BaseMoEWrapper and all subclasses, allowing
users to explicitly specify which NUMA node IDs to use for subpool
mapping instead of always defaulting to sequential [0, 1, ..., N-1].

This enables running multiple KTransformers instances on different NUMA
nodes of the same machine, e.g. --kt-threadpool-count 1 --kt-numa-nodes 1
to bind to NUMA node 1. Previously this required external numactl
workarounds since subpool_numa_map was hardcoded to start from 0.
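
A minimal sketch of the mapping change, with hypothetical names:

```python
def make_subpool_numa_map(pool_count: int, numa_nodes=None):
    # Old behaviour: always the sequential mapping [0, 1, ..., N-1].
    # New behaviour: the caller may pass explicit node IDs, e.g. numa_nodes=[1]
    # with pool_count=1 to pin the single subpool to NUMA node 1.
    if numa_nodes is None:
        return list(range(pool_count))
    assert len(numa_nodes) == pool_count, "one node ID per subpool"
    return list(numa_nodes)
```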
2026-03-31 10:27:50 +08:00
mrhaoxx
7a9daf0cd4 [feat](kt-kernel): support avx2-only inference for bf16, fp8 and gptq int4 (#1892)
* feat: support avx2 bf16 fp8 inference

* feat: support avx2 gptq int4 inference

* fix: numeric issues in fp8 dequant

* Tutorial avx2 (#1900)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* Tutorial avx2 (#1901)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* docs: update README.md

---------

Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>
2026-03-27 14:45:02 +08:00
Chen Hongtao
9e69fccb02 [feat]: add mistral moe loader compatibility (#1873)
Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>
2026-02-28 17:50:23 +08:00
Jiaqi Liao
edc48aba37 [fix]: fix wrapper import issue (#1819) 2026-01-28 16:31:56 +08:00
Oql
bf4c8a690b Add Native Precision Tutorial, update worker strategy and README.md (#1807) 2026-01-23 18:00:13 +08:00
Jianwei Dong
027832c590 [feat](kt-kernel): CPU-GPU experts sched (#1796) 2026-01-16 17:01:15 +08:00
Oql
6277da4c2b support GLM 4.7 (#1791)
support GLM 4.7
2026-01-13 17:36:25 +08:00
Oql
5edc456749 support Native BF16 format MoE. (#1788)
support Native BF16 format MoE
2026-01-12 14:43:28 +08:00
ErvinXie
d8046e1bb4 Kt minimax (#1742)
[feat]: fp8 kernel and kt-cli support
2025-12-24 15:39:44 +08:00
Oql
8139c092bf Reduce CPU memory usage during large chunk prefill (Fixes #1676) (#1683)
* fix(amx): add BufferASmallKGroupImpl to fix buffer overflow in from_mat

The original BufferAKGroupImpl::from_mat writes 64 bytes per K_STEP iteration,
which overflows the buffer when K_STEP=32 (as in GemmKernel224Int4SmallKGroup).

BufferASmallKGroupImpl overrides from_mat to write only 32 bytes per iteration.

* perf(k2-moe): optimize memory allocation with pooled buffers

- Replace per-expert buffer allocation with shared memory pools
- Dynamically assign buffer slices based on activated experts
- Add group_size inference from scale tensor shape in amx.py
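
The group_size inference from the last bullet, as a one-line sketch (names hypothetical):

```python
def infer_group_size(weight_cols: int, scale_cols: int) -> int:
    # One scale per k-group along the reduction dimension, so the ratio of the
    # weight and scale shapes recovers group_size (e.g. 4096 / 128 -> 32).
    assert weight_cols % scale_cols == 0
    return weight_cols // scale_cols
```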

* delete kimi k2 forward test

* add TODO comment for pool_count_ calculation
2025-12-08 20:19:07 +08:00
ErvinXie
71f683acec Support Native Kimi K2 Thinking (#1663)
* [feat]: fix k2 prefill

* Update Kimi-K2-Thinking.md

* Create Kimi-K2-Thinking-Native.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking-Native.md

* [perf] optimize K2 MoE weight loading with per-expert pointers

- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis
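
A rough sketch of the per-expert pointer idea from the bullets above (names are illustrative; the real wrapper hands these pointers to the C++ worker pool):

```python
import torch

def gather_expert_ptrs(expert_weights):
    # Instead of torch.stack(expert_weights).contiguous(), which copies every
    # expert tensor, collect one raw data pointer per expert; the C++ side then
    # memcpy's the TP slices in parallel.
    ptrs = []
    for w in expert_weights:
        assert w.is_contiguous(), "each expert tensor must already be contiguous"
        ptrs.append(w.data_ptr())
    return torch.tensor(ptrs, dtype=torch.int64)
```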

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-05 21:53:05 +08:00
Jiaqi Liao
fcf8882075 [Feature] Add avx-based kimi-k2 support (#1656)
* support Kimi-K2-Thinking original weights and fix an AMX kernel bug

* update k2 avx kernel.

* feat: add CPUInfer write buffer task

* [feat]: add kimi k2 cpu write buffer support

- Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights
- Fix down (w2) weight column-wise slicing for different TP configurations
- Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp
- Add comprehensive test cases for weight extraction validation
- Ensure compatibility with Kimi model's MoE architecture
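
Purely as an illustration of the column-wise slicing mentioned above, assuming a [hidden, intermediate] layout for w2 (the actual k2-moe layout may differ):

```python
def slice_w2_columns(w2, tp_size: int, rank: int):
    # Each TP rank owns a contiguous block of intermediate columns of the down
    # projection; the hidden rows are kept whole.
    cols = w2.shape[1] // tp_size
    return w2[:, rank * cols:(rank + 1) * cols]
```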

* [fix]: correct write_weight_scale_to_buffer expert offset calculation

Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios.

Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* [feat]: add write buffer wrapper

* [fix] fix comment

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-02 16:01:07 +08:00
Jiaqi Liao
94c25626dc Fix kt-kernel for new wrapper (#1588)
* update README for kt-kernel

* style: format C++ and Python code in kt-kernel

  - Format C++ files: task_queue, ext_bindings, and MoE operators
  - Format Python utility modules: amx, llamafile, and loader
  - Improve code readability and consistency
2025-11-10 21:47:34 +08:00
Jiaqi Liao
9bc00e587b Refactor KTMoEWrapper backend (#1587)
* universal backend for cpu inference
* expert defer
2025-11-10 20:26:15 +08:00