Jianwei Dong
1f79f6da92
[feat](kt-kernel): Add automatic deployment workflow ( #1719 )
2025-12-16 15:20:06 +08:00
Shaoxu Cheng
f25e58ad69
fix: qwen3-npu bugs; update: add readme-for-qwen3-npu ( #1717 )
* fix: qwen3-npu bugs; update: add readme-for-qwen3-npu
* fix: Correct the README description
2025-12-16 14:27:04 +08:00
RICHARDNAN
18fb8fc897
Npu revise benchmark results and prerequisites ( #1716 )
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Revise Ascend NPU tutorial for Docker deployment
Updated the tutorial for deploying on the Ascend NPU, changing sections from 'Conda部署' (Conda deployment) to '镜像部署' (image deployment) and providing specific commands for Docker container setup and Python environment installation.
* Update DeepseekR1 tutorial for Ascend NPU
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Update W8A8 weight link in tutorial
* Update doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Refactor Docker command and update package manager
Updated Docker run command to simplify device specifications and corrected package manager command from 'apt' to 'yum'.
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Revise benchmark results and prerequisites
Updated performance results and hardware specifications.
* Update doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 14:26:44 +08:00
ZiWei Yuan
34230eaf44
[docs]: Fix image link in README.md ( #1718 )
Updated image link to use raw GitHub URL for better accessibility.
2025-12-15 17:10:15 +08:00
SCDESPERTATE
008de19e16
[fix](kt-kernel): drop the weights held in Python for the C++ weight-loading operation ( #1695 )
2025-12-12 11:42:33 +08:00
Shaoxu Cheng
1e69563363
update: add cache class and ascend ln mlp op for qwen3 adapt npu ( #1708 )
2025-12-11 17:08:35 +08:00
Shaoxu Cheng
cea490a326
update: add ascend attn and experts ops for npu qwen3moe adapt ( #1707 )
* update: add ascend attn and experts ops for npu qwen3moe adapt
* Reorder import statements in custom_ascend_modelling_qwen3.py
* Restore copyright and import statements
Restored copyright information and imports in ascend_experts.py.
2025-12-11 17:08:15 +08:00
Shaoxu Cheng
adcfa9080f
update: Qwen3 MoE model adaptation for NPU (framework) ( #1706 )
2025-12-11 17:07:57 +08:00
ZiWei Yuan
53f6a6d6e1
[feat]: patch kml problem ( #1704 )
2025-12-11 14:40:29 +08:00
Jianwei Dong
c65febe05c
[feat]: Automatically detect whether blis is installed on amd cpus ( #1702 )
2025-12-11 14:25:36 +08:00
RICHARDNAN
6431888928
add deploy in docker image ( #1691 )
2025-12-11 14:11:27 +08:00
ZiWei Yuan
2f1b743050
[docs]: update website doc png ( #1696 )
2025-12-11 13:01:32 +08:00
Oql
e87a042ef0
[fix](kt-kernel): fix write_buffer do numa job ( #1699 )
2025-12-10 16:39:16 +08:00
Shaoxu Cheng
8995378a91
update: add attention and ln ut for npu ( #1698 )
2025-12-10 16:12:26 +08:00
mrhaoxx
f992de55da
[fix](kt-sft): fix peft adaptations for RL tasks ( #1674 )
2025-12-09 14:28:51 +08:00
mrhaoxx
503295fc88
[feat](kt-kernel): refactor convert_cpu_weights.py to support conversion for GLM-4.6V ( #1687 )
Signed-off-by: mrhaoxx <mr.haoxx@gmail.com>
2025-12-09 14:24:41 +08:00
Oql
ac69ea891e
Fix K2 MoE decode bug in buffer management ( #1686 )
2025-12-08 21:08:28 +08:00
Oql
8139c092bf
Reduce CPU memory usage during large chunk prefill ( Fixes #1676 ) ( #1683 )
* fix(amx): add BufferASmallKGroupImpl to fix buffer overflow in from_mat
The original BufferAKGroupImpl::from_mat writes 64 bytes per K_STEP iteration, but when K_STEP=32 (as in GemmKernel224Int4SmallKGroup) this overruns the buffer.
BufferASmallKGroupImpl overrides from_mat to write only 32 bytes per iteration.
* perf(k2-moe): optimize memory allocation with pooled buffers
- Replace per-expert buffer allocation with shared memory pools
- Dynamically assign buffer slices based on activated experts
- Add group_size inference from scale tensor shape in amx.py
* delete kimi k2 forward test
* add TODO comment for pool_count_ calculation
2025-12-08 20:19:07 +08:00
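The pooled-buffer change described in this commit can be sketched in a few lines; `assign_pool_slices` and its arguments are illustrative names, not the actual kt-kernel API — a minimal sketch of carving one shared memory pool into slices for only the activated experts, instead of allocating a buffer per expert:

```python
def assign_pool_slices(pool_size, activated_experts, per_expert_bytes):
    """Carve a shared pool into slices for the currently activated experts.

    Hypothetical sketch: returns {expert_id: (offset, length)}. A real pool
    would also recycle slices across decode steps.
    """
    assert len(activated_experts) * per_expert_bytes <= pool_size, "pool too small"
    slices = {}
    offset = 0
    for expert_id in activated_experts:
        slices[expert_id] = (offset, per_expert_bytes)
        offset += per_expert_bytes
    return slices


# Only experts 3 and 7 are active this step, so only two slices are handed out.
print(assign_pool_slices(4096, [3, 7], 1024))  # → {3: (0, 1024), 7: (1024, 1024)}
```

The point of the design is that peak memory scales with the number of *activated* experts per step, not the total expert count.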
ErvinXie
eefc8cf98d
Update Kimi-K2-Thinking-Native.md ( #1684 )
2025-12-08 19:58:20 +08:00
Jiaqi Liao
f20e5d1da5
Revise prefill strategy and performance metrics ( #1675 )
Updated the prefill strategy descriptions and performance benchmarks in the documentation.
2025-12-06 15:36:04 +08:00
Jiaqi Liao
1d62ac21f7
Update Kimi-K2-Thinking-Native.md ( #1673 )
2025-12-05 23:08:02 +08:00
Jiaqi Liao
69fa7b1a57
Revise installation steps in Kimi-K2 documentation ( #1672 )
Updated installation instructions and added steps for cloning the repository.
2025-12-05 23:05:24 +08:00
Jiaqi Liao
721b6c4c94
[docs] Update Native Kimi-K2-Thinking documentation and kt-kernel parameters ( #1671 )
v0.4.3
2025-12-05 22:46:16 +08:00
Jiaqi Liao
47da806cde
[doc](kt-kernel): add kimi-k2-thinking ( #1670 )
2025-12-05 21:53:59 +08:00
ErvinXie
71f683acec
Support Native Kimi K2 Thinking ( #1663 )
* [feat]: fix k2 prefill
* Update Kimi-K2-Thinking.md
* Create Kimi-K2-Thinking-Native.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking-Native.md
* [perf] optimize K2 MoE weight loading with per-expert pointers
- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-05 21:53:05 +08:00
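The per-expert-pointer optimization above can be illustrated with plain Python; `per_expert_pointers` and `stacked_copy` are hypothetical names, not the kt-kernel API — they contrast no-copy pointer collection with the large up-front copy the commit removed (the C++ worker pool then memcpys the TP slices in parallel):

```python
import ctypes

def per_expert_pointers(expert_weights):
    """One raw address per expert buffer -- no Python-side copy is made."""
    return [ctypes.addressof(ctypes.c_char.from_buffer(w)) for w in expert_weights]

def stacked_copy(expert_weights):
    """The slow path the commit removed: concatenate everything up front."""
    return b"".join(bytes(w) for w in expert_weights)

# Four dummy "expert weights"; real ones would be tensors of many megabytes.
experts = [bytearray(b"\x01" * 8) for _ in range(4)]
ptrs = per_expert_pointers(experts)
assert len(ptrs) == len(experts)          # one pointer per expert, zero copies
assert len(stacked_copy(experts)) == 32   # the bulk copy the optimization avoids
```

Passing pointer arrays keeps the Python side O(num_experts) in work, while the expensive data movement happens once, in parallel, on the C++ side.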
ZiWei Yuan
4850424345
[docs]: add amd blis backend usage guide ( #1669 )
2025-12-05 16:52:26 +08:00
Jiaqi Liao
1ca3a2662e
Add 9#AISoft to the list of contributors ( #1668 )
2025-12-05 15:44:04 +08:00
Jiaqi Liao
0698252484
[fix](kt-kernel): gate RAWINT4 behind AVX512 and avoid AVX2 build break ( #1660 )
2025-12-03 00:43:23 +08:00
Jianwei Dong
670c488155
[docs]: Add deepseek-v3.2 run tutorial ( #1659 )
2025-12-02 20:04:10 +08:00
Jiaqi Liao
fcf8882075
[Feature] Add avx-based kimi-k2 support ( #1656 )
* support Kimi-K2-Thinking original weight
fix amx kernel bug
* update k2 avx kernel.
* feat: add CPUInfer write buffer task
* [feat]: add kimi k2 cpu write buffer support
- Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights
- Fix down (w2) weight column-wise slicing for different TP configurations
- Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp
- Add comprehensive test cases for weight extraction validation
- Ensure compatibility with Kimi model's MoE architecture
* [fix]: correct write_weight_scale_to_buffer expert offset calculation
Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios.
Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* [feat]: add write buffer wrapper
* [fix] fix comment
---------
Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-02 16:01:07 +08:00
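The three TP scenarios listed above (cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp) reduce to an interval-overlap question; `gpu_ranks_for_cpu_slice` below is a hypothetical helper, not the actual implementation, sketching which GPU TP ranks hold the down (w2) weight columns that one CPU TP rank must extract:

```python
def gpu_ranks_for_cpu_slice(cpu_rank, cpu_tp, gpu_tp):
    """GPU TP ranks whose column range overlaps the given CPU rank's range.

    Treat the w2 weight's columns as the interval [0, 1), split evenly
    within each TP group. Hypothetical sketch, not the kt-kernel API.
    """
    lo, hi = cpu_rank / cpu_tp, (cpu_rank + 1) / cpu_tp
    return [g for g in range(gpu_tp) if g / gpu_tp < hi and (g + 1) / gpu_tp > lo]


assert gpu_ranks_for_cpu_slice(0, 2, 2) == [0]      # cpu_tp == gpu_tp: 1-to-1
assert gpu_ranks_for_cpu_slice(1, 4, 2) == [0]      # cpu_tp > gpu_tp: sub-slice of one GPU rank
assert gpu_ranks_for_cpu_slice(0, 1, 2) == [0, 1]   # cpu_tp < gpu_tp: spans several GPU ranks
```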
ZiWei Yuan
c2b8c60c4e
[ci]: add int4_1 & int4_1k ( #1653 )
* [feat]: init amd adaption
* [feat]: add blis support
* [fix]: fix setup and moe kernel wrapper
* [fix](setup.py): support rebuild with cache and import kt_kernel works fine
* [feat]: add moe_kernel converter for amd and implement the load method (haven't tested yet)
* [feat](moe_kernel/moe.hpp): delete unused memory when using save
* [fix](moe_kernel): update PLAIN for pack
* [fix](moe_kernel): rm printf debug
* [fix](moe_kernel): skip gpu experts
* [fix](moe_kernel/moe.hpp): update include memory path
* [feat](moe_kernel/moe.hpp): support expert deferral
* [feat]: finish amd
* [ci]: add int4_1 & int4_1k
---------
Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>
2025-12-02 15:58:14 +08:00
Jianwei Dong
fd78fe520a
fix(scripts): resolve OOM when converting gpu weights and update README ( #1640 )
2025-12-01 14:15:14 +08:00
Peilin Li
e637fedc65
[docs]: Add Full introduction of KT ( #1636 )
2025-11-29 15:46:55 +08:00
Peilin Li
7ee80bbc3d
[docs]: Update README with Python 3.12 and dependency changes ( #1634 )
Updated Python version in installation instructions and adjusted KTransformers and flash-attention wheel filenames accordingly.
2025-11-29 15:46:05 +08:00
mrhaoxx
637c49c83f
[feat](kt-kernel): support qwen3-vl weights convert ( #1648 )
2025-11-27 22:29:09 +08:00
Jianwei Dong
c256150e08
update ci test ( #1647 )
2025-11-27 16:39:48 +08:00
ZiWei Yuan
1374b98ee5
[feat](moe_kernel): add amd blis support (int8) ( #1600 )
* [feat]: init amd adaption
* [feat]: add blis support
* [fix]: fix setup and moe kernel wrapper
* [fix](setup.py): support rebuild with cache and import kt_kernel works fine
* [feat]: add moe_kernel converter for amd and implement the load method (haven't tested yet)
* [feat](moe_kernel/moe.hpp): delete unused memory when using save
* [fix](moe_kernel): update PLAIN for pack
* [fix](moe_kernel): rm printf debug
* [fix](moe_kernel): skip gpu experts
* [fix](moe_kernel/moe.hpp): update include memory path
* [feat](moe_kernel/moe.hpp): support expert deferral
* [feat]: finish amd
---------
Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>
2025-11-27 12:08:53 +08:00
Jianwei Dong
fef6dd98a8
add accuracy and performance test ( #1643 )
2025-11-27 10:56:39 +08:00
Jiaqi Liao
e7d1c1de09
fix(llamafile): resolve deferred experts data race and update README ( #1646 )
2025-11-26 23:19:37 +08:00
Jianwei Dong
51745a9ea1
add ci ( #1642 )
2025-11-25 20:52:08 +08:00
RICHARDNAN
2cffdf7033
[docs]: Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md ( #1638 )
2025-11-24 11:51:07 +08:00
DocShotgun
e72a4fb880
[feat](kt-kernel): Add resume arg to CPU weight conversion ( #1630 )
* [feat]: kt-kernel: Add resume arg to CPU weight conversion
* [docs]: kt-kernel: Document resume arg for CPU weight conversion
* [fix]: kt-kernel: Only print resume layer if in use
* [fix]: kt-kernel: Don't log skipped layers when using resume_layer
2025-11-22 12:00:15 +08:00
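The resume behaviour described above can be sketched in a few lines; `layers_to_convert` is a hypothetical helper (only the `resume_layer` name comes from the commit messages), showing earlier layers skipped without per-layer logging and the resume layer printed only when in use:

```python
def layers_to_convert(num_layers, resume_layer=None):
    """Layer indices still to process when resuming a partial conversion.

    Hypothetical sketch: with resume_layer set, earlier layers are skipped
    silently instead of logging each skipped layer.
    """
    start = resume_layer if resume_layer is not None else 0
    if resume_layer is not None:
        print(f"resuming from layer {resume_layer}")  # only printed if in use
    return list(range(start, num_layers))


assert layers_to_convert(4) == [0, 1, 2, 3]
assert layers_to_convert(4, resume_layer=2) == [2, 3]
```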
Jiaqi Liao
e69c67713f
[refactor] fix third_party issue ( #1632 )
* [refactor]: relocate third_party directory
* [fix]: fix custom_flashinfer for kt-sft
v0.4.2
2025-11-20 13:55:55 +08:00
Jiaqi Liao
46af8fcab5
[doc] fix kt parameters ( #1629 )
2025-11-19 16:41:57 +08:00
Peilin Li
171578a7ec
[refactor]: Change named 'KT-SFT' to 'kt-sft' ( #1626 )
* Change named 'KT-SFT' to 'kt-sft'
* [docs]: update kt-sft name
---------
Co-authored-by: ZiWei Yuan <yzwliam@126.com>
2025-11-17 11:48:42 +08:00
Pory
2887050ca1
[Feature] add Qwen3MoE models for KTransformers-FT ( #1602 )
* add qwen3 attn
* fix KQwen3MoeSparseMoeBlock
* fix bug adapter for llamafactory
---------
Co-authored-by: unknown <xiongchenhui@hisense.ad>
2025-11-16 16:39:19 +08:00
ZiWei Yuan
ab8ad0a110
[docs]: update web doc ( #1625 )
2025-11-16 14:40:22 +08:00
ZiWei Yuan
be6db6f46b
[docs]: improve structure for kt-kernel ( #1624 )
* [docs]: improve structure for kt-kernel
* Update doc/en/kt-kernel/README.md
2025-11-16 13:21:41 +08:00
ZiWei Yuan
133eea037c
[docs]: improve docs structure ( #1623 )
2025-11-16 12:40:59 +08:00
ZiWei Yuan
c2d2edbeef
[docs]: update the web docs structure ( #1622 )
2025-11-16 12:09:44 +08:00