11 Commits

Author SHA1 Message Date
mrhaoxx
7a9daf0cd4 [feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892)
* feat: support avx2 bf16 fp8 inference

* feat: support avx2 gptq int4 inference

* fix: numeric issues in fp8 dequant

* Tutorial avx2 (#1900)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* Tutorial avx2 (#1901)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* docs: update README.md

---------

Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>
2026-03-27 14:45:02 +08:00
Chen Hongtao
9e69fccb02 [feat]: add mistral moe loader compatibility (#1873)
Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>
2026-02-28 17:50:23 +08:00
VYSE V.E.O
20262b2743 Fix Qwen3.5 FP8 load for VL detection (#1857)
* Fix Qwen3.5 FP8 load for VL detection

1. For VL models (Qwen3.5), change base_key from model.layers.{N} to model.language_model.layers.{N}.

2. Remove the duplicated class BF16SafeTensorLoader(SafeTensorLoader) (only the first, overridden definition).

* Indent type

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-26 15:47:22 +08:00
Jianwei Dong
16a8b98f3e support qwen3.5 (#1846)
2026-02-16 15:48:14 +08:00
Jiaqi Liao
db82d99fa6 feat: add fallback expert prefix lookup in loader.py from kimi_k2.5 (#1822)
2026-01-30 14:09:38 +08:00
Oql
6277da4c2b support GLM 4.7 (#1791)
support GLM 4.7
2026-01-13 17:36:25 +08:00
Oql
5edc456749 support Native BF16 format MoE. (#1788)
support Native BF16 format MoE
2026-01-12 14:43:28 +08:00
ErvinXie
d8046e1bb4 Kt minimax (#1742)
[feat]: fp8 kernel and kt-cli support
2025-12-24 15:39:44 +08:00
Jiaqi Liao
fcf8882075 [Feature] Add avx-based kimi-k2 support (#1656)
* support Kimi-K2-Thinking original weight
fix amx kernel bug

* update k2 avx kernel.

* feat: add CPUInfer write buffer task

* [feat]: add kimi k2 cpu write buffer support

- Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights
- Fix down (w2) weight column-wise slicing for different TP configurations
- Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp
- Add comprehensive test cases for weight extraction validation
- Ensure compatibility with Kimi model's MoE architecture

* [fix]: correct write_weight_scale_to_buffer expert offset calculation

Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios.

Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* [feat]: add write buffer wrapper

* [fix] fix comment

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-02 16:01:07 +08:00
Jiaqi Liao
94c25626dc Fix kt-kernel for new wrapper (#1588)
* update README for kt-kernel

* style: format C++ and Python code in kt-kernel

  - Format C++ files: task_queue, ext_bindings, and MoE operators
  - Format Python utility modules: amx, llamafile, and loader
  - Improve code readability and consistency
2025-11-10 21:47:34 +08:00
Jiaqi Liao
9bc00e587b Refactor KTMoEWrapper backend (#1587)
* universal backend for cpu inference
* expert defer
2025-11-10 20:26:15 +08:00