Jianwei Dong
1f79f6da92
[feat](kt-kernel): Add automatic deployment workflow ( #1719 )
2025-12-16 15:20:06 +08:00
Shaoxu Cheng
f25e58ad69
fix: qwen3-npu bugs; update: add readme-for-qwen3-npu ( #1717 )
* fix: qwen3-npu bugs; update: add readme-for-qwen3-npu
* fix: Correct the README description
2025-12-16 14:27:04 +08:00
RICHARDNAN
18fb8fc897
Npu revise benchmark results and prerequisites ( #1716 )
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Revise Ascend NPU tutorial for Docker deployment
Updated the tutorial for deploying on the Ascend NPU, changing sections from 'Conda部署' (Conda deployment) to '镜像部署' (image deployment) and providing specific commands for Docker container setup and Python environment installation.
* Update DeepseekR1 tutorial for Ascend NPU
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Update W8A8 weight link in tutorial
* Update doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Refactor Docker command and update package manager
Updated Docker run command to simplify device specifications and corrected package manager command from 'apt' to 'yum'.
* Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
* Revise benchmark results and prerequisites
Updated performance results and hardware specifications.
* Update doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 14:26:44 +08:00
ZiWei Yuan
34230eaf44
[docs]: Fix image link in README.md ( #1718 )
Updated image link to use raw GitHub URL for better accessibility.
2025-12-15 17:10:15 +08:00
SCDESPERTATE
008de19e16
[fix](kt-kernel): drop the weights held in Python for the C++ weight-loading operation ( #1695 )
2025-12-12 11:42:33 +08:00
Shaoxu Cheng
1e69563363
update: add cache class and ascend ln mlp op for qwen3 adapt npu ( #1708 )
2025-12-11 17:08:35 +08:00
Shaoxu Cheng
cea490a326
update: add ascend attn and experts ops for npu qwen3moe adapt ( #1707 )
* update: add ascend attn and experts ops for npu qwen3moe adapt
* Reorder import statements in custom_ascend_modelling_qwen3.py
* Restore copyright and import statements
Restored copyright information and imports in ascend_experts.py.
2025-12-11 17:08:15 +08:00
Shaoxu Cheng
adcfa9080f
update: Qwen3 MoE model adaptation for NPU (framework) ( #1706 )
2025-12-11 17:07:57 +08:00
ZiWei Yuan
53f6a6d6e1
[feat]: patch kml problem ( #1704 )
2025-12-11 14:40:29 +08:00
Jianwei Dong
c65febe05c
[feat]: Automatically detect whether blis is installed on amd cpus ( #1702 )
2025-12-11 14:25:36 +08:00
RICHARDNAN
6431888928
add deploy in docker image ( #1691 )
2025-12-11 14:11:27 +08:00
ZiWei Yuan
2f1b743050
[docs]: update website doc png ( #1696 )
2025-12-11 13:01:32 +08:00
Oql
e87a042ef0
[fix](kt-kernel): fix write_buffer do numa job ( #1699 )
2025-12-10 16:39:16 +08:00
Shaoxu Cheng
8995378a91
update: add attention and ln ut for npu ( #1698 )
2025-12-10 16:12:26 +08:00
mrhaoxx
f992de55da
[fix](kt-sft): fix peft adaptations for RL tasks ( #1674 )
2025-12-09 14:28:51 +08:00
mrhaoxx
503295fc88
[feat](kt-kernel): refactor convert_cpu_weights.py to support conversion for GLM-4.6V ( #1687 )
Signed-off-by: mrhaoxx <mr.haoxx@gmail.com>
2025-12-09 14:24:41 +08:00
Oql
ac69ea891e
Fix K2 MoE decode bug in buffer management ( #1686 )
2025-12-08 21:08:28 +08:00
Oql
8139c092bf
Reduce CPU memory usage during large chunk prefill ( Fixes #1676 ) ( #1683 )
* fix(amx): add BufferASmallKGroupImpl to fix buffer overflow in from_mat
The original BufferAKGroupImpl::from_mat writes 64 bytes per K_STEP iteration, but when K_STEP=32 (as in GemmKernel224Int4SmallKGroup) this overruns the buffer.
BufferASmallKGroupImpl overrides from_mat to write only 32 bytes per iteration.
* perf(k2-moe): optimize memory allocation with pooled buffers
- Replace per-expert buffer allocation with shared memory pools
- Dynamically assign buffer slices based on activated experts
- Add group_size inference from scale tensor shape in amx.py
* delete kimi k2 forward test
* add TODO comment for pool_count_ calculation
2025-12-08 20:19:07 +08:00
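The pooled-buffer change described in this commit can be sketched in a few lines; `assign_pool_slices` and its arguments are illustrative names, not the actual kt-kernel API — a minimal sketch of carving one shared memory pool into slices for only the activated experts, instead of allocating a buffer per expert:

```python
def assign_pool_slices(pool_size, activated_experts, per_expert_bytes):
    """Carve a shared pool into slices for the currently activated experts.

    Hypothetical sketch: returns {expert_id: (offset, length)}. A real pool
    would also recycle slices across decode steps.
    """
    assert len(activated_experts) * per_expert_bytes <= pool_size, "pool too small"
    slices = {}
    offset = 0
    for expert_id in activated_experts:
        slices[expert_id] = (offset, per_expert_bytes)
        offset += per_expert_bytes
    return slices


# Only experts 3 and 7 are active this step, so only two slices are handed out.
print(assign_pool_slices(4096, [3, 7], 1024))  # → {3: (0, 1024), 7: (1024, 1024)}
```

The point of the design is that peak memory scales with the number of *activated* experts per step, not the total expert count.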
ErvinXie
eefc8cf98d
Update Kimi-K2-Thinking-Native.md ( #1684 )
2025-12-08 19:58:20 +08:00
Jiaqi Liao
f20e5d1da5
Revise prefill strategy and performance metrics ( #1675 )
Updated the prefill strategy descriptions and performance benchmarks in the documentation.
2025-12-06 15:36:04 +08:00
Jiaqi Liao
1d62ac21f7
Update Kimi-K2-Thinking-Native.md ( #1673 )
2025-12-05 23:08:02 +08:00
Jiaqi Liao
69fa7b1a57
Revise installation steps in Kimi-K2 documentation ( #1672 )
Updated installation instructions and added steps for cloning the repository.
2025-12-05 23:05:24 +08:00
Jiaqi Liao
721b6c4c94
[docs] Update Native Kimi-K2-Thinking documentation and kt-kernel parameters ( #1671 )
v0.4.3
2025-12-05 22:46:16 +08:00
Jiaqi Liao
47da806cde
[doc](kt-kernel): add kimi-k2-thinking ( #1670 )
2025-12-05 21:53:59 +08:00
ErvinXie
71f683acec
Support Native Kimi K2 Thinking ( #1663 )
* [feat]: fix k2 prefill
* Update Kimi-K2-Thinking.md
* Create Kimi-K2-Thinking-Native.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking-Native.md
* [perf] optimize K2 MoE weight loading with per-expert pointers
- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-05 21:53:05 +08:00
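The per-expert-pointer optimization above can be illustrated with plain Python; `per_expert_pointers` and `stacked_copy` are hypothetical names, not the kt-kernel API — they contrast no-copy pointer collection with the large up-front copy the commit removed (the C++ worker pool then memcpys the TP slices in parallel):

```python
import ctypes

def per_expert_pointers(expert_weights):
    """One raw address per expert buffer -- no Python-side copy is made."""
    return [ctypes.addressof(ctypes.c_char.from_buffer(w)) for w in expert_weights]

def stacked_copy(expert_weights):
    """The slow path the commit removed: concatenate everything up front."""
    return b"".join(bytes(w) for w in expert_weights)

# Four dummy "expert weights"; real ones would be tensors of many megabytes.
experts = [bytearray(b"\x01" * 8) for _ in range(4)]
ptrs = per_expert_pointers(experts)
assert len(ptrs) == len(experts)          # one pointer per expert, zero copies
assert len(stacked_copy(experts)) == 32   # the bulk copy the optimization avoids
```

Passing pointer arrays keeps the Python side O(num_experts) in work, while the expensive data movement happens once, in parallel, on the C++ side.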
ZiWei Yuan
4850424345
[docs]: add amd blis backend usage guide ( #1669 )
2025-12-05 16:52:26 +08:00
Jiaqi Liao
1ca3a2662e
Add 9#AISoft to the list of contributors ( #1668 )
2025-12-05 15:44:04 +08:00
Jiaqi Liao
0698252484
[fix](kt-kernel): gate RAWINT4 behind AVX512 and avoid AVX2 build break ( #1660 )
2025-12-03 00:43:23 +08:00
Jianwei Dong
670c488155
[docs]: Add deepseek-v3.2 run tutorial ( #1659 )
2025-12-02 20:04:10 +08:00
Jiaqi Liao
fcf8882075
[Feature] Add avx-based kimi-k2 support ( #1656 )
* support Kimi-K2-Thinking original weight
fix amx kernel bug
* update k2 avx kernel.
* feat: add CPUInfer write buffer task
* [feat]: add kimi k2 cpu write buffer support
- Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights
- Fix down (w2) weight column-wise slicing for different TP configurations
- Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp
- Add comprehensive test cases for weight extraction validation
- Ensure compatibility with Kimi model's MoE architecture
* [fix]: correct write_weight_scale_to_buffer expert offset calculation
Fixed the bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were incorrectly calculated. Changed from using per_expert_gpu sizes to using full gpu_tp sizes, ensuring correct memory layout for multi-expert scenarios.
Also added benchmark scripts for k2 moe and write buffer operations, and cleaned up debug output in test files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* [feat]: add write buffer wrapper
* [fix] fix comment
---------
Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-12-02 16:01:07 +08:00
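The three TP scenarios listed above (cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp) reduce to an interval-overlap question; `gpu_ranks_for_cpu_slice` below is a hypothetical helper, not the actual implementation, sketching which GPU TP ranks hold the down (w2) weight columns that one CPU TP rank must extract:

```python
def gpu_ranks_for_cpu_slice(cpu_rank, cpu_tp, gpu_tp):
    """GPU TP ranks whose column range overlaps the given CPU rank's range.

    Treat the w2 weight's columns as the interval [0, 1), split evenly
    within each TP group. Hypothetical sketch, not the kt-kernel API.
    """
    lo, hi = cpu_rank / cpu_tp, (cpu_rank + 1) / cpu_tp
    return [g for g in range(gpu_tp) if g / gpu_tp < hi and (g + 1) / gpu_tp > lo]


assert gpu_ranks_for_cpu_slice(0, 2, 2) == [0]      # cpu_tp == gpu_tp: 1-to-1
assert gpu_ranks_for_cpu_slice(1, 4, 2) == [0]      # cpu_tp > gpu_tp: sub-slice of one GPU rank
assert gpu_ranks_for_cpu_slice(0, 1, 2) == [0, 1]   # cpu_tp < gpu_tp: spans several GPU ranks
```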
ZiWei Yuan
c2b8c60c4e
[ci]: add int4_1 & int4_1k ( #1653 )
* [feat]: init amd adaption
* [feat]: add blis support
* [fix]: fix setup and moe kernel wrapper
* [fix](setup.py): support rebuild with cache and import kt_kernel works fine
* [feat]: add moe_kernel converter for amd and implement the load method (haven't tested yet)
* [feat](moe_kernel/moe.hpp): delete unused memory when using save
* [fix](moe_kernel): update PLAIN for pack
* [fix](moe_kernel): rm printf debug
* [fix](moe_kernel): skip gpu experts
* [fix](moe_kernel/moe.hpp): update include memory path
* [feat](moe_kernel/moe.hpp): support expert deferral
* [feat]: finish amd
* [ci]: add int4_1 & int4_1k
---------
Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>
2025-12-02 15:58:14 +08:00
Jianwei Dong
fd78fe520a
fix(scripts): resolve OOM when converting gpu weights and update README ( #1640 )
2025-12-01 14:15:14 +08:00
Peilin Li
e637fedc65
[docs]: Add Full introduction of KT ( #1636 )
2025-11-29 15:46:55 +08:00
Peilin Li
7ee80bbc3d
[docs]: Update README with Python 3.12 and dependency changes ( #1634 )
Updated Python version in installation instructions and adjusted KTransformers and flash-attention wheel filenames accordingly.
2025-11-29 15:46:05 +08:00
mrhaoxx
637c49c83f
[feat](kt-kernel): support qwen3-vl weights convert ( #1648 )
2025-11-27 22:29:09 +08:00
Jianwei Dong
c256150e08
update ci test ( #1647 )
2025-11-27 16:39:48 +08:00
ZiWei Yuan
1374b98ee5
[feat](moe_kernel): add amd blis support (int8) ( #1600 )
* [feat]: init amd adaption
* [feat]: add blis support
* [fix]: fix setup and moe kernel wrapper
* [fix](setup.py): support rebuild with cache and import kt_kernel works fine
* [feat]: add moe_kernel converter for amd and implement the load method (haven't tested yet)
* [feat](moe_kernel/moe.hpp): delete unused memory when using save
* [fix](moe_kernel): update PLAIN for pack
* [fix](moe_kernel): rm printf debug
* [fix](moe_kernel): skip gpu experts
* [fix](moe_kernel/moe.hpp): update include memory path
* [feat](moe_kernel/moe.hpp): support expert deferral
* [feat]: finish amd
---------
Co-authored-by: mrhaoxx <mr.haoxx@gmail.com>
2025-11-27 12:08:53 +08:00
Jianwei Dong
fef6dd98a8
add accuracy and performance test ( #1643 )
2025-11-27 10:56:39 +08:00
Jiaqi Liao
e7d1c1de09
fix(llamafile): resolve deferred experts data race and update README ( #1646 )
2025-11-26 23:19:37 +08:00
Jianwei Dong
51745a9ea1
add ci ( #1642 )
2025-11-25 20:52:08 +08:00
RICHARDNAN
2cffdf7033
[docs]: Update DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md ( #1638 )
2025-11-24 11:51:07 +08:00
DocShotgun
e72a4fb880
[feat](kt-kernel): Add resume arg to CPU weight conversion ( #1630 )
* [feat]: kt-kernel: Add resume arg to CPU weight conversion
* [docs]: kt-kernel: Document resume arg for CPU weight conversion
* [fix]: kt-kernel: Only print resume layer if in use
* [fix]: kt-kernel: Don't log skipped layers when using resume_layer
2025-11-22 12:00:15 +08:00
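The resume behaviour described above can be sketched in a few lines; `layers_to_convert` is a hypothetical helper (only the `resume_layer` name comes from the commit messages), showing earlier layers skipped without per-layer logging and the resume layer printed only when in use:

```python
def layers_to_convert(num_layers, resume_layer=None):
    """Layer indices still to process when resuming a partial conversion.

    Hypothetical sketch: with resume_layer set, earlier layers are skipped
    silently instead of logging each skipped layer.
    """
    start = resume_layer if resume_layer is not None else 0
    if resume_layer is not None:
        print(f"resuming from layer {resume_layer}")  # only printed if in use
    return list(range(start, num_layers))


assert layers_to_convert(4) == [0, 1, 2, 3]
assert layers_to_convert(4, resume_layer=2) == [2, 3]
```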
Jiaqi Liao
e69c67713f
[refactor] fix third_party issue ( #1632 )
* [refactor]: relocate third_party directory
* [fix]: fix custom_flashinfer for kt-sft
v0.4.2
2025-11-20 13:55:55 +08:00
Jiaqi Liao
46af8fcab5
[doc] fix kt parameters ( #1629 )
2025-11-19 16:41:57 +08:00
Peilin Li
171578a7ec
[refactor]: Change named 'KT-SFT' to 'kt-sft' ( #1626 )
* Change named 'KT-SFT' to 'kt-sft'
* [docs]: update kt-sft name
---------
Co-authored-by: ZiWei Yuan <yzwliam@126.com>
2025-11-17 11:48:42 +08:00
Pory
2887050ca1
[Feature] add Qwen3MoE models for KTransformers-FT ( #1602 )
* add qwen3 attn
* fix KQwen3MoeSparseMoeBlock
* fix bug adapter for llamafactory
---------
Co-authored-by: unknown <xiongchenhui@hisense.ad>
2025-11-16 16:39:19 +08:00
ZiWei Yuan
ab8ad0a110
[docs]: update web doc ( #1625 )
2025-11-16 14:40:22 +08:00
ZiWei Yuan
be6db6f46b
[docs]: improve structure for kt-kernel ( #1624 )
* [docs]: improve structure for kt-kernel
* Update doc/en/kt-kernel/README.md
2025-11-16 13:21:41 +08:00
ZiWei Yuan
133eea037c
[docs]: improve docs structure ( #1623 )
2025-11-16 12:40:59 +08:00
ZiWei Yuan
c2d2edbeef
[docs]: update the web docs structure ( #1622 )
2025-11-16 12:09:44 +08:00