ktransformers

mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-06-06 07:51:38 +00:00

Author	SHA1	Message	Date
mrhaoxx	250e4fe52e	merge: integrate origin/main into sft branch Resolve conflicts: - experts.py: keep SFT mode dispatch, add main's numa_nodes param - experts_base.py: merge numa_nodes into shared _get_cpu_infer - convert_cpu_weights.py: keep SFT version (per-layer shard, tmpfs, batched FP8, backward weights) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 22:40:07 +08:00
mrhaoxx	a789729923	align sft branch with main: revert worker_pool, strip sft_timer, fix inference defaults - Revert worker_pool.cpp/.h to main (remove RDTSC timer, Chrome Trace, sft_timer namespace, ITT API, extended do_work_stealing_job API) - Strip all sft_timer instrumentation from sft-only files (sft_moe.hpp, moe-sft-tp.hpp, avx_kernels.hpp) - Restore pin_memory=True in KExpertsCPUBuffer (inference path) - Restore fused tensor transpose logic in convert_cpu_weights.py (main layout) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 17:39:56 +08:00
mrhaoxx	a98d544833	merge: integrate origin/main into sft branch Resolved 6 conflicts: - CMakeLists.txt: keep cpptrace + debug flag, accept flexible build type - worker_pool.cpp: keep SFT profiling + main's block=1 spin fix - ext_bindings.cpp: keep both SFT MOE bindings and AVX2/BF16/FP8 bindings - common.hpp: keep gpu_experts_mask + SFT backward weight fields - __init__.py: export both generate_gpu_experts_masks and AMXSFTMoEWrapper - experts.py: gpu_experts_mask for inference, num_gpu_experts for SFT, new methods	2026-04-08 23:19:28 +08:00
mrhaoxx	f36699affd	feat(sft): AMX MoE SFT backend with LoRA support Complete SFT (Supervised Fine-Tuning) backend for MoE models using AMX SIMD: Core C++ implementation: - sft_moe.hpp: Forward/backward with LoRA fused operations (~5500 lines) - moe-sft-tp.hpp: Tensor-parallel wrapper for multi-NUMA - amx/moe-sft-tp.hpp: AMX-specific TP implementation - avx_kernels.hpp: AVX512 SIMD kernels for LoRA GEMM - amx_kernels.hpp: AMX tile kernels for Panel5 rank-outer optimization - worker_pool: RDTSC profiling, Chrome trace output, SFT timer infrastructure - ext_bindings.cpp: SFT MOE pybind bindings (BF16/INT8/INT4 + SkipLoRA variants) Python sft/ submodule (kt_kernel.sft): - base.py: BaseSFTMoEWrapper with buffer management (template method pattern) - amx.py: AMXSFTMoEWrapper (weight loading, C++ task construction) - autograd.py: KTMoEFunction (torch.autograd.Function for distributed training) - layer.py: KTMoELayerWrapper (nn.Module replacing HF MoE layers) - arch.py: MOEArchConfig (Qwen3/DeepSeek/Mixtral architecture detection) - weights.py: Expert weight extraction and checkpoint loading - lora.py: PEFT LoRA adaptation (view buffers, grad buffers, save/load adapter) - wrapper.py: wrap_moe_layers_with_kt_wrapper, load_kt_model, build_kt_device_map - config.py: KTConfig dataclass (DeepSpeed-style opaque config passthrough) - dist_utils.py: Distributed gather/scatter, checkpoint-phase detection Design decisions: - Rank-0-only expert pattern: only rank 0 holds C++ wrapper and expert weights - DeepSpeed-style integration: accelerate keeps only KTransformersPlugin (framework interaction fields), all logic in kt_kernel.sft - Inference isolation: importing kt_kernel does not load sft/ submodule - Old field name compatibility: _get_kt_config() converts kt_xxx→xxx automatically Verified: Qwen3-235B-A22B 4GPU AMXBF16 training, loss converges normally.	2026-04-08 23:11:00 +08:00
ErvinXie	3903c9afcc	(kt-kernel): add numa_nodes parameter for explicit NUMA node mapping (#1891) Add numa_nodes parameter to BaseMoEWrapper and all subclasses, allowing users to explicitly specify which NUMA node IDs to use for subpool mapping instead of always defaulting to sequential [0, 1, ..., N-1]. This enables running multiple KTransformers instances on different NUMA nodes of the same machine, e.g. --kt-threadpool-count 1 --kt-numa-nodes 1 to bind to NUMA node 1. Previously this required external numactl workarounds since subpool_numa_map was hardcoded to start from 0.	2026-03-31 10:27:50 +08:00
Jianwei Dong	027832c590	[feat](kt-kernel): CPU-GPU experts sched (#1796)	2026-01-16 17:01:15 +08:00
ErvinXie	a8667ddb58	[fix](test): fix import kt-kernel (#1728)	2025-12-17 19:46:32 +08:00
Jiaqi Liao	e7d1c1de09	fix(llamafile): resolve deferred experts data race and update README (#1646)	2025-11-26 23:19:37 +08:00
Jiaqi Liao	94c25626dc	Fix kt-kernel for new wrapper (#1588) * update README for kt-kernel * style: format C++ and Python code in kt-kernel - Format C++ files: task_queue, ext_bindings, and MoE operators - Format Python utility modules: amx, llamafile, and loader - Improve code readability and consistency	2025-11-10 21:47:34 +08:00
Jiaqi Liao	9bc00e587b	Refactor KTMoEWrapper backend (#1587) * universal backend for cpu inference * expert defer	2025-11-10 20:26:15 +08:00

10 Commits