ktransformers

mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-05-24 22:55:49 +00:00

Author	SHA1	Message	Date
Li Tingfang	f1e2b82c74	[fix] Add runtime AMX BF16 check to prevent SIGILL on pre-Sapphire Rapids CPUs (#2018 )	2026-05-21 17:36:12 +08:00
login256	eeeeae5e91	Fix duplicate BF16 loader definition (#1984 )	2026-05-20 15:04:47 +08:00
Jim James	f0772445a1	[perf]: native path for MXFP4 MoE on AVX512F (#2006 ) * [perf]: native path for MXFP4 MoE on AVX512F * [perf]: move inline static constants outside structs	2026-05-18 15:44:33 +08:00
Benjamin F	f05b4009f3	[fix](kt-kernel): fix double mem used by safetensor loader (#1997 ) Release the SafeTensor mmap loader singleton after each layer's load_weights() completes. The C++ engine already holds a deep copy (cpu_infer.sync() guarantees this), so releasing the mmap handles is safe. The next layer recreates the loader on demand. This halves peak memory usage during model loading (e.g. DSv3.2: 1.2T -> 613G). Based on #1966 by @poryfly — adapted to v0.6.2.post3 codebase (adds MXFP4 support missing from the original PR). Co-authored-by: xiongchenhui <xiongchenhui@hisense.com>	2026-05-11 12:00:30 +08:00
Benjamin F	bb15fdf47e	Release/0.6.2.post3: carry kt-kernel SwiGLU clamp companion missing from post2	2026-05-10 03:55:02 +08:00
Jim James	f7c4fa68c5	[fix]: add guard for SFT MoE and remove guard for AMX FP4 MoE on AVX512F+BW (#1980 )	2026-05-08 16:05:22 +08:00
Benjamin F	c465557c23	docs(v4-flash): add optional AMXINT4 CPU-weight conversion path (#1986 ) - Add convert_cpu_weights_ds4.py: dequantizes MXFP4 routed experts (E2M1 + ue8m0, group size 32) on GPU and re-quantizes to AMX-INT4 on CPU. - Document the script as Step 2 in DeepSeek-V4-Flash.md so AMX users can opt into AMXINT4 mode instead of the default MXFP4 CPU experts.	2026-05-08 15:35:05 +08:00
Benjamin F	041bdfc636	[New Model] DeepSeek-V4-Flash: kt-kernel MXFP4 MoE + sglang hybrid inference (#1970 ) * [feat](kt-kernel): add MXFP4 MoE operator with E2M1 weights × BF16 activations Implements AMX_FP4_MOE_TP based on the RAWINT4 (k2-moe) CRTP pattern. FP4 E2M1 weights are nibble-packed and decoded via PSHUFB LUT, then computed with BF16 activations using _mm512_dpbf16_ps. Supports weight-only per-kgroup scaling (group_size=32) and tensor parallelism. Includes a Python validation test covering uniform, alternating, ramp, and random weight patterns. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [feat](kt-kernel): adapt MXFP4 MoE backend for DeepSeek-V4-Flash (#1950) V4-Flash routed experts ship as native MXFP4 (E2M1 nibble + ue8m0 group scale). Expose AMXFP4_KGroup_MOE through NativeMoEWrapper, add a loader that handles V4's `layers.{L}.ffn.experts.{i}.{w1,w3,w2}.{weight,scale}` naming and converts ue8m0 → bf16 via a lossless bit-cast, register the model entry, and ship an end-to-end numerical validation script. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [perf](kt-kernel): MXFP4 MoE add mat-mat 4×4 tile, refine mat-vec reduce (#1957) mat_mul_kgroup previously aliased to fp4_mat_vec_kgroup, leaving large batches stuck on the per-token path. Implement fp4_mat_mat_kgroup as a 4×4 register tile (MB=NB=4, 16 zmm accumulators) so each PSHUFB decode of four weight rows is reused across four tokens. Refactor fp4_mat_vec_kgroup to accumulate four N-rows in parallel and flush them with a new reduce4 helper, removing per-row reduce_add_ps calls from the hot loop. Mark mxfp4_to_bf16_32 always_inline. Add bench/bench_fp4_moe.py with --routing {balanced,concentrated} and a backend registry so future kernels can be added without changing the runner. Dispatch thresholds, derived_init, GeneralMOEConfig handling, load_weights, write_weights_to_buffer and the TP_MOE specialization are unchanged. 
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(loader): avoid uint16 lshift in ue8m0->bf16 conversion PyTorch CPU has no lshift kernel for UInt16, so the previous `(scale_t.to(torch.uint16) << 7)` raised NotImplementedError when loading any V4-Flash MXFP4 routed-expert scale tensor on the host. Switch to int32 for the shift (kernel exists) and narrow to int16 afterwards. The shifted value max is 255<<7 = 32640, well within int16 range, so the narrow is lossless. The .view(bfloat16) bit pattern is identical (bf16 sign bit is always 0 for ue8m0 values). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(v4-flash): hybrid CPU/GPU recipe + bump kt-sglang submodule Bumps third_party/sglang to kvcache-ai/sglang main (3cbd49c29) which now contains DeepSeek V4 Flash model support + consumer-GPU (SM_120) portable Triton/TileLang fallbacks (kt-sglang PR #38). Adds doc/en/DeepSeek-V4-Flash.md tutorial: 8x RTX 5090 hybrid recipe with the full launch command, OpenAI-compatible /generate + /v1/chat/completions examples, and the kt chat CLI client. --------- Co-authored-by: ouqingliang <1692110604@qq.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-03 10:48:31 +08:00
Aliez Ren	02be2bf53f	[feat](kt-kernel): add AVX2/AVX-VNNI RAWINT4 MoE backend (#1942 ) * [feat](kt-kernel): add AVX2/AVX-VNNI RAWINT4 MoE backend * Update AVX2 tutorial with AVX2 compilation instructions Added instructions for forcing AVX2 compilation on AVX512 or AMX machines. * Add instructions for AVX2 compilation --------- Co-authored-by: Jiaheng Dai <108478605+jdai0@users.noreply.github.com>	2026-04-30 17:16:49 +08:00
Peilin Li	ef5822639f	[fix](kt-kernel): pin torch 2.9.1 wheel baseline Pin kt-kernel torch 2.9.1 metadata, update autosetup for cu130 wheels, register kt_kernel.kt_kernel_ext, and bump the sglang submodule.	2026-04-30 00:57:24 +08:00
Peilin Li	0656e01ac1	[docs]: refresh KT install commands (#1958 )	2026-04-27 00:45:43 +08:00
Peilin Li	eeaeb7bfd7	[build]: align kt-kernel torch support with v0.6.1 release (#1948 )	2026-04-24 23:45:15 +08:00
Peilin Li	85308615b9	[build] prepare v0.6.1 SFT wheel packaging on main (#1945 ) * [build]: prepare 0.6.1 SFT wheel packaging on main * [build]: finalize py311+ wheel packaging defaults	2026-04-24 12:08:38 +08:00
mrhaoxx	9544a8960d	feat(sft): AMX MoE SFT backend with LoRA support (#1936 ) * feat(sft): AMX MoE SFT backend with LoRA support Complete SFT (Supervised Fine-Tuning) backend for MoE models using AMX SIMD: Core C++ implementation: - sft_moe.hpp: Forward/backward with LoRA fused operations (~5500 lines) - moe-sft-tp.hpp: Tensor-parallel wrapper for multi-NUMA - amx/moe-sft-tp.hpp: AMX-specific TP implementation - avx_kernels.hpp: AVX512 SIMD kernels for LoRA GEMM - amx_kernels.hpp: AMX tile kernels for Panel5 rank-outer optimization - worker_pool: RDTSC profiling, Chrome trace output, SFT timer infrastructure - ext_bindings.cpp: SFT MOE pybind bindings (BF16/INT8/INT4 + SkipLoRA variants) Python sft/ submodule (kt_kernel.sft): - base.py: BaseSFTMoEWrapper with buffer management (template method pattern) - amx.py: AMXSFTMoEWrapper (weight loading, C++ task construction) - autograd.py: KTMoEFunction (torch.autograd.Function for distributed training) - layer.py: KTMoELayerWrapper (nn.Module replacing HF MoE layers) - arch.py: MOEArchConfig (Qwen3/DeepSeek/Mixtral architecture detection) - weights.py: Expert weight extraction and checkpoint loading - lora.py: PEFT LoRA adaptation (view buffers, grad buffers, save/load adapter) - wrapper.py: wrap_moe_layers_with_kt_wrapper, load_kt_model, build_kt_device_map - config.py: KTConfig dataclass (DeepSpeed-style opaque config passthrough) - dist_utils.py: Distributed gather/scatter, checkpoint-phase detection Design decisions: - Rank-0-only expert pattern: only rank 0 holds C++ wrapper and expert weights - DeepSpeed-style integration: accelerate keeps only KTransformersPlugin (framework interaction fields), all logic in kt_kernel.sft - Inference isolation: importing kt_kernel does not load sft/ submodule - Old field name compatibility: _get_kt_config() converts kt_xxx→xxx automatically Verified: Qwen3-235B-A22B 4GPU AMXBF16 training, loss converges normally. 
* refactor(sft): unify KTConfig field names with kt_ prefix, add share_cache_pool, remove dead code - KTConfig fields all use kt_ prefix matching dict keys — eliminates _OLD_TO_NEW mapping and prefix-stripping in wrapper.py - Add kt_share_cache_pool field, auto-enabled when gradient_checkpointing is on (via training_args.py), flows through to C++ cache allocation - Remove dead checkpoint detection code: in_ckpt_recompute, in_ckpt_first_forward vars (assigned but never read), fallback _is_in_checkpoint_first_forward() function, unused inspect import - Remove redundant env var fallbacks in wrapper.py for share_backward_bb and share_cache_pool (KTConfig.__post_init__ already handles env vars) - Simplify layer.py checkpoint logic to single _checkpoint_hook_mode() check Verified: Qwen3-235B 3-step training on sap4, loss matches baseline (1.2886 / 1.9824 / 1.377 vs 1.2886 / 1.9766 / 1.3809) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(sft): share_backward_bb default True, share_cache_pool auto-derived - kt_share_backward_bb defaults to True (always saves memory) - kt_share_cache_pool no longer reads from env var; defaults False, auto-set to True by trainer_config_process when gradient checkpointing is enabled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add missing gpu_experts_mask=None to KTMoEWrapper call in SFT wrapper KTMoEWrapper.__new__() requires gpu_experts_mask as a positional argument, but the SFT wrapper omitted it, causing MoE layer wrapping to fail silently and FSDP2 to attempt broadcasting all expert weights (OOM/NCCL crash). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(sft): support transformers v5 fused expert format Fused experts (e.g. Qwen3MoeExperts) store weights as 3D Parameters (gate_up_proj [E,2I,H], down_proj [E,H,I]) instead of per-expert nn.Linear modules. 
PEFT cannot attach LoRA to these, so we create KT-managed LoRA buffers with kaiming init, nn.Parameter wrappers for the optimizer, and pre-assigned .grad for C++ backward. - arch.py: detect_fused_experts() detection - weights.py: fused format extraction and weight clearing - wrapper.py: detect fused at wrap time, store _fused_experts/_lora_rank - lora.py: _create_fused_expert_lora_buffers, save/load fused LoRA, get_kt_lora_params collects fused params, deduplicate wrapper finding - layer.py: handle v5 TopKRouter tuple output, remove dead code - autograd.py: sync_forward_sft/submit_forward_sft API rename Verified: v5 loss/expert-LoRA values match v4 baseline, v4 backward compat. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(sft): add Qwen3.5 MoE support + fused checkpoint loading - arch.py: add Qwen3_5Moe arch match, read config from text_config, _get_layers_prefix returns model.language_model.layers for Qwen3.5, _get_model_container_and_layers searches language_model attr - weights.py: load_experts_from_checkpoint_files detects fused format (gate_up_proj in weight_map) and splits into gate/up/down - wrapper.py: hidden_size fallback to text_config Verified: Qwen3.5-35B-A3B (256 experts, fused format) E2E pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [fix](sft): align Python API with C++ backend after v5 refactor - wrapper.py: pass gpu_experts_mask=None to KTMoEWrapper (required by C++ signature) - layer.py: rename submit_forward_sft/sync_forward_sft to submit_forward/sync_forward - autograd.py: rename sync_forward_sft to sync_forward The sft-v5 refactor (commits `58d7eab`, `dd1da65`) renamed Python-side method calls but the C++ backend (AMXSFTMoEWrapper) still exposes the original method names. This caused AttributeError on Qwen3.5-35B and other models. 
* align sft branch with main: revert worker_pool, strip sft_timer, fix inference defaults - Revert worker_pool.cpp/.h to main (remove RDTSC timer, Chrome Trace, sft_timer namespace, ITT API, extended do_work_stealing_job API) - Strip all sft_timer instrumentation from sft-only files (sft_moe.hpp, moe-sft-tp.hpp, avx_kernels.hpp) - Restore pin_memory=True in KExpertsCPUBuffer (inference path) - Restore fused tensor transpose logic in convert_cpu_weights.py (main layout) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * revert CMakeLists.txt to main: remove debug flags and cpptrace dep Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * clean up dev artifacts: remove SFT design docs, debug examples, bench scripts Remove files not needed in the merge: - docs/SFT+KTWrapper/ (6 Chinese design docs) - docs/sft_moe_amx/ (21 dev/debug docs) - 12 debug/test example scripts - 6 SFT-specific bench scripts and report Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * remove dev version stamps from ext_bindings, sft_moe, moe-sft-tp Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: JimmyPeilinLi <lipeilin@mail.nwpu.edu.cn>	2026-04-22 11:27:01 +08:00
callmegaga	a9411f1d72	Supports vnni-256 for GPTQ INT4 (#1926 ) * [feat](kt-kernel): support avx-vnni-256 for gptq int4	2026-04-13 17:59:59 +08:00
Andy18650	f42e94a527	[fix](cli): handle edge cases with empty NUMA nodes (#1929 ) Co-authored-by: Andy18650 <114562805@qq.com>	2026-04-13 16:45:41 +08:00
ErvinXie	279c920a69	Revert "kt-kernel: enable CPUInfer stream bridge for ROCm (#1918 )" (#1925 ) This reverts commit `1dd0a78899`.	2026-04-09 18:43:03 +08:00
guanjiawei	1dd0a78899	kt-kernel: enable CPUInfer stream bridge for ROCm (#1918 )	2026-04-09 12:20:04 +08:00
Jim James	8a427c9321	[feat]: add AVX512F+BW fallback for FP8 and BF16 under AMX backend (#1908 )	2026-04-03 12:46:22 +08:00
Jianwei Dong	db9326302b	chore: bump version to 0.5.3 (#1909 )	2026-04-01 18:58:48 +08:00
Oql	9e6484a538	[fix]: fix --numa-nodes handling (#1904 ) * [fix]: fix --numa-nodes handling	2026-03-31 17:50:22 +08:00
Doctor Shotgun	24cd4fc055	feat(kt-kernel): Add utility script to merge loose layer weights to safetensors (#1886 ) * Add utility script to merge loose layer weights to safetensors * Send warnings and errors to stderr * Fix expert index parsing for MOE_INT4 and MOE_INT8	2026-03-31 10:41:07 +08:00
alin899992	9c18b60556	feat: CPU weight conversion for GLM-5 and MiniMax-M2.5 (#1853 ) * Support for GLM-5 and Minimax-M2.5 Add CPU weight conversion support for GLM-5 and Minimax-M2.5 * fix: remove overly restrictive MiniMax condition and deduplicate code - Remove `args.input_type == "fp8"` from MiniMaxConverter selection so bf16/fp16 MiniMax models no longer fall through to OnlineQuantConverter (which doesn't handle w1/w2/w3 naming and would fail). - Remove OnlineQuantConverter._find_expert_layers() which is identical to the inherited ConverterBase._find_expert_layers(). - Remove redundant expert_key_filter assignment (same as base default). --------- Co-authored-by: ErvinXie <ervinxie@foxmail.com>	2026-03-31 10:39:48 +08:00
ErvinXie	3903c9afcc	(kt-kernel): add numa_nodes parameter for explicit NUMA node mapping (#1891 ) Add numa_nodes parameter to BaseMoEWrapper and all subclasses, allowing users to explicitly specify which NUMA node IDs to use for subpool mapping instead of always defaulting to sequential [0, 1, ..., N-1]. This enables running multiple KTransformers instances on different NUMA nodes of the same machine, e.g. --kt-threadpool-count 1 --kt-numa-nodes 1 to bind to NUMA node 1. Previously this required external numactl workarounds since subpool_numa_map was hardcoded to start from 0.	2026-03-31 10:27:50 +08:00
Doctor Shotgun	bdf4bb76c5	Fix worker pool idle CPU usage (#1902 ) * Replace sleep polling with condition variable blocking	2026-03-30 20:29:17 +08:00
benapple [msft]	7a02daa694	Fix TaskQueue worker thread 100% CPU spin when idle (#1899 ) * initial fix for issue 1858 * [fix]: add done flag check to sync() wait predicate to prevent deadlock during destruction Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ben Appleby <Ben.Appleby@microsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 23:20:42 +08:00
mrhaoxx	7a9daf0cd4	[feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892 ) * feat: support avx2 bf16 fp8 inference * feat: support avx2 gptq int4 inference * fix: numeric issues in fp8 dequant * Tutorial avx2 (#1900) * fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines * docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs * Tutorial avx2 (#1901) * fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines * docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs * docs: update README.md --------- Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>	2026-03-27 14:45:02 +08:00
YIFANCHENGDU	8561a71dd1	[fix] improve Sglang kt-kernel detect time duration (#1887 ) * Increase the timeout for the check that --kt-gpu-prefill-token-threshold is in the help output to 90 seconds. In cloud environments, CUDA initialization and Python module loading can easily exceed 30 seconds. * Update kt-kernel/python/cli/utils/sglang_checker.py add comment about the change Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-18 23:07:40 +08:00
Jianwei Dong	15c624dcae	Fix/sglang kt detection (#1875 ) * [feat]: simplify sglang installation with submodule, auto-sync CI, and version alignment - Add kvcache-ai/sglang as git submodule at third_party/sglang (branch = main) - Add top-level install.sh for one-click source installation (sglang + kt-kernel) - Add sglang-kt as hard dependency in kt-kernel/pyproject.toml - Add CI workflow to auto-sync sglang submodule daily and create PR - Add CI workflow to build and publish sglang-kt to PyPI - Integrate sglang-kt build into release-pypi.yml (version.py bump publishes both packages) - Align sglang-kt version with ktransformers via SGLANG_KT_VERSION env var injection - Update Dockerfile to use submodule and inject aligned version - Update all 13 doc files, CLI hints, and i18n strings to reference new install methods Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [build]: bump version to 0.5.2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [build]: rename PyPI package from kt-kernel to ktransformers Users can now `pip install ktransformers` to get everything (sglang-kt is auto-installed as a dependency). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Revert "[build]: rename PyPI package from kt-kernel to ktransformers" This reverts commit `e0cbbf6364`. * [build]: add ktransformers meta-package for PyPI `pip install ktransformers` now works as a single install command. It pulls kt-kernel (which in turn pulls sglang-kt). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [fix]: show sglang-kt package version in kt version command - Prioritize sglang-kt package version (aligned with ktransformers) over sglang internal __version__ - Update display name from "sglang" to "sglang-kt" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [fix]: improve sglang-kt detection in kt doctor and kt version Recognize sglang-kt package name as proof of kvcache-ai fork installation. 
Previously both commands fell through to "PyPI (not recommended)" for non-editable local source installs. Now version.py reuses the centralized check_sglang_installation() logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [build]: bump version to 0.5.2.post1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 16:54:48 +08:00
Chen Hongtao	9e69fccb02	[feat]: add mistral moe loader compatibility (#1873 ) Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>	2026-02-28 17:50:23 +08:00
VYSE V.E.O	20262b2743	Fix Qwen3.5 FP8 load for VL detection (#1857 ) * Fix Qwen3.5 FP8 load for VL detection 1. For VL models (Qwen3.5), modify base_key: model.layers.{N} -> model.language_model.layers.{N} 2. Clean up the duplicated class BF16SafeTensorLoader(SafeTensorLoader), only the first overridden one. * Indent type Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-02-26 15:47:22 +08:00
Rin	786987a95f	Handle unquoted paths and special characters in model scanner (#1840 ) * Handle unquoted paths and special characters in model scanner * Fix ValueError: capture_output cannot be used with stderr `capture_output=True` internally sets `stderr=PIPE`, which conflicts with `stderr=subprocess.DEVNULL`. Replace `capture_output=True` with explicit `stdout=subprocess.PIPE` to keep stderr suppressed correctly. Also remove redundant `shell=False` (already the default). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: ErvinXie <ervinxie@foxmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 15:44:45 +08:00
Jianwei Dong	16a8b98f3e	support qwen3.5 (#1846 )	2026-02-16 15:48:14 +08:00
Oql	56cbd69ac4	kt-cli enhancement (#1834 ) * [feat]: redesign kt run interactive configuration with i18n support - Redesign kt run with 8-step interactive flow (model selection, inference method, NUMA/CPU, GPU experts, KV cache, GPU/TP selection, parsers, host/port) - Add configuration save/load system (~/.ktransformers/run_configs.yaml) - Add i18n support for kt chat (en/zh translations) - Add universal input validators with auto-retry and Chinese comma support - Add port availability checker with auto-suggestion - Add parser configuration (--tool-call-parser, --reasoning-parser) - Remove tuna command and clean up redundant files - Fix: variable reference bug in run.py, filter to show only MoE models * [feat]: unify model selection UI and enable shared experts fusion by default - Unify kt run model selection table with kt model list display * Add Total size, MoE Size, Repo, and SHA256 status columns * Use consistent formatting and styling * Improve user decision-making with more information - Enable --disable-shared-experts-fusion by default * Change default value from False to True * Users can still override with --enable-shared-experts-fusion * [feat]: improve kt chat with performance metrics and better CJK support - Add performance metrics display after each response * Total time, TTFT (Time To First Token), TPOT (Time Per Output Token) * Accurate input/output token counts using model tokenizer * Fallback to estimation if tokenizer unavailable * Metrics shown in dim style (not prominent) - Fix Chinese character input issues * Replace Prompt.ask() with console.input() for better CJK support * Fixes backspace deletion showing half-characters - Suppress NumPy subnormal warnings * Filter "The value of the smallest subnormal" warnings * Cleaner CLI output on certain hardware environments * [fix]: correct TTFT measurement in kt chat - Move start_time initialization before API call - Previously start_time was set when receiving first chunk, causing TTFT ≈ 0ms - Now 
correctly measures time from request sent to first token received * [docs]: add Clawdbot integration guide - KTransformers enterprise-grade AI assistant deployment solution * [docs]: emphasize recommending Kimi K2.5 as the core model, highlighting enterprise-grade reasoning capability * [docs]: add link to Clawdbot Feishu integration tutorial * [feat]: improve CLI table display, model verification, and chat experience - Add sequence number (#) column to all model tables by default - Filter kt edit to show only MoE GPU models (exclude AMX) - Extend kt model verify to check .json and .py files in addition to weights - Fix re-verification bug where repaired files caused false failures - Suppress tokenizer debug output in kt chat token counting * [fix]: fix cpu cores. --------- Co-authored-by: skqliao <skqliao@gmail.com>	2026-02-04 16:44:54 +08:00
Oql	4f64665758	[docs]: add Qwen3 Coder Next Tutorial (#1833 )	2026-02-04 16:27:10 +08:00
Oql	c28cfcb26e	[fix]: fix k2-moe.hpp load weight (#1830 )	2026-02-03 11:28:49 +08:00
Jiaqi Liao	db82d99fa6	feat: add fallback expert prefix lookup in loader.py from kimi_k2.5 (#1822 )	2026-01-30 14:09:38 +08:00
Jiaqi Liao	edc48aba37	[fix]: fix wrapper import issue (#1819 )	2026-01-28 16:31:56 +08:00
Oql	bf4c8a690b	Add Native Precision Tutorial, update worker strategy and README.md (#1807 )	2026-01-23 18:00:13 +08:00
SCDESPERTATE	b0f827d2a9	[chore](cuda): explicitly use `ele_per_blk` var for better readability (#1784 )	2026-01-23 11:11:08 +08:00
mrhaoxx	b27de4068b	[fix]: fix exp_avx512 for act_fn (#1797 )	2026-01-20 11:07:22 +08:00
Jianwei Dong	027832c590	[feat](kt-kernel): CPU-GPU experts sched (#1796 )	2026-01-16 17:01:15 +08:00
Oql	6277da4c2b	support GLM 4.7 (#1791 ) support GLM 4.7	2026-01-13 17:36:25 +08:00
watamario15	667030d6e6	[kt-kernel]: Fix ignored build configurations in `install.sh` and `CMakeLists.txt` (#1789 ) * Correct variable defaults * Remove CMAKE_BUILD_TYPE setting in CMakeLists	2026-01-12 22:16:19 +08:00
Oql	5edc456749	support Native BF16 format MoE. (#1788 ) support Native BF16 format MoE	2026-01-12 14:43:28 +08:00
Oql	ddb957596f	Fix moe bug. (#1783 ) * [fix]: fix moe.hpp load from file bug. * [fix]: fix all moe hpp init bug. * [fix]: fix moe & awq-moe bug.	2026-01-05 17:02:24 +08:00
Oql	dc6394e501	[fix]: fix moe hpp bug. (#1780 ) fix moe hpp init bug.	2026-01-04 19:32:56 +08:00
Jianwei Dong	9adc91714f	Remove kt-kernel-cuda, kt-kernel uses the version with cuda (#1769 )	2025-12-30 10:23:58 +08:00
ZiWei Yuan	b096b01fbc	[docs]: add kt-cli doc and update corresponding website (#1768 )	2025-12-29 23:06:22 +08:00
ErvinXie	9539ab91eb	Cli (#1765 ) * [feat]: add custom option for kt run * [feat]: depth 3	2025-12-29 15:18:42 +08:00
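
The MXFP4 commits above (#1970 and the follow-up loader fix) describe E2M1 nibble-packed weights with a ue8m0 scale per group of 32, where the scale becomes bf16 by shifting the exponent byte left by 7. The following is a minimal pure-Python sketch of that arithmetic, not the kt-kernel implementation: the helper names are hypothetical, the low-nibble-first packing order is an assumption, and the real loader works on PyTorch tensors (int32 shift, narrow to int16, view as bfloat16) while the kernel decodes with a PSHUFB LUT.

```python
import struct

# Magnitudes of the eight E2M1 codes (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def ue8m0_to_bf16_bits(e):
    """Place the 8-bit ue8m0 exponent into the bf16 exponent field.

    bf16 is 1 sign bit, 8 exponent bits, 7 mantissa bits, so shifting by 7
    leaves the sign and mantissa at zero. The largest result is 255 << 7 = 32640.
    """
    if not 0 <= e <= 255:
        raise ValueError("ue8m0 scale must fit in one byte")
    return e << 7


def bf16_bits_to_float(bits):
    """Interpret a 16-bit bf16 pattern by widening it to float32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0..2 index the LUT."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_group(packed, scale_e):
    """Dequantize one group of packed nibbles with a shared ue8m0 scale.

    Assumption: the low nibble of each byte holds the earlier element.
    """
    scale = bf16_bits_to_float(ue8m0_to_bf16_bits(scale_e))
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values


if __name__ == "__main__":
    # A full group would be 16 bytes (32 nibbles); two bytes keep the demo short.
    print(dequant_group(bytes([0x21, 0xF9]), 128))
```

In this sketch an exponent byte of 127 yields a scale of 1.0 and each step of the byte doubles or halves it, which is why the shift alone is enough to build the bf16 value.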

115 Commits
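
The model scanner fix in #1840 notes that `capture_output=True` conflicts with an explicit `stderr=subprocess.DEVNULL`. A small sketch of that behavior follows; the child command is a made-up stand-in that runs the current Python interpreter, not the scanner's real command.

```python
import subprocess
import sys

# Hypothetical child process: writes one line to stdout and one to stderr.
cmd = [
    sys.executable,
    "-c",
    "import sys; print('out'); print('err', file=sys.stderr)",
]

# capture_output=True sets both stdout and stderr to PIPE internally, so
# passing stderr explicitly alongside it is rejected with ValueError.
try:
    subprocess.run(cmd, capture_output=True, stderr=subprocess.DEVNULL)
    conflict = False
except ValueError:
    conflict = True

# Capture stdout explicitly and discard stderr instead.
result = subprocess.run(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True,
)
captured = result.stdout.strip()

if __name__ == "__main__":
    print(conflict, captured)
```

`shell=False` is left out because it is already the default, as the commit message points out.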