* support Kimi-K2-Thinking original weight
fix amx kernel bug
* update k2 avx kernel.
* feat: add CPUInfer write buffer task
* [feat]: add kimi k2 cpu write buffer support
- Implement write_weights_to_buffer function in k2-moe.hpp for extracting GPU expert weights
- Fix down (w2) weight column-wise slicing for different TP configurations
- Support three TP scenarios: cpu_tp == gpu_tp, cpu_tp > gpu_tp, cpu_tp < gpu_tp
- Add comprehensive test cases for weight extraction validation
- Ensure compatibility with Kimi model's MoE architecture
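
The three TP scenarios above can be illustrated with a minimal sketch. It is not the code in k2-moe.hpp: the function name `gpu_column_sources`, its parameters, and the assumption that the down (w2) weight is split evenly along its intermediate (column) dimension are all hypothetical, chosen only to show how one interval-overlap rule covers cpu_tp == gpu_tp, cpu_tp > gpu_tp, and cpu_tp < gpu_tp.

```python
def gpu_column_sources(intermediate_size, cpu_tp, gpu_tp, gpu_rank):
    """Hypothetical helper: list the CPU shard pieces that form one GPU column slice.

    Returns (cpu_rank, col_start, col_end) tuples, with columns given
    relative to the start of each CPU shard and col_end exclusive.
    Assumes the intermediate dimension divides evenly by both TP sizes.
    """
    assert intermediate_size % cpu_tp == 0
    assert intermediate_size % gpu_tp == 0
    cpu_cols = intermediate_size // cpu_tp  # columns held by each CPU shard
    gpu_cols = intermediate_size // gpu_tp  # columns needed by each GPU rank

    # Global column range that this GPU rank owns.
    lo, hi = gpu_rank * gpu_cols, (gpu_rank + 1) * gpu_cols

    pieces = []
    for cpu_rank in range(cpu_tp):
        shard_lo = cpu_rank * cpu_cols
        shard_hi = shard_lo + cpu_cols
        # Overlap between the GPU range and this CPU shard, if any.
        start, end = max(lo, shard_lo), min(hi, shard_hi)
        if start < end:
            pieces.append((cpu_rank, start - shard_lo, end - shard_lo))
    return pieces


if __name__ == "__main__":
    print(gpu_column_sources(8, 2, 2, 1))  # cpu_tp == gpu_tp: one whole CPU shard
    print(gpu_column_sources(8, 4, 2, 0))  # cpu_tp > gpu_tp: several CPU shards joined
    print(gpu_column_sources(8, 2, 4, 1))  # cpu_tp < gpu_tp: part of one CPU shard
```

In this sketch, equal TP sizes map each GPU rank to exactly one CPU shard, a larger cpu_tp concatenates neighbouring CPU shards, and a smaller cpu_tp takes a sub-range of a single CPU shard.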
* [fix]: correct write_weight_scale_to_buffer expert offset calculation
Fix a bug in write_weight_scale_to_buffer_task where expert offsets in GPU buffers were calculated incorrectly. Offsets now use the full gpu_tp sizes instead of the per_expert_gpu sizes, which ensures the correct memory layout in multi-expert scenarios.
Also add benchmark scripts for k2 moe and write buffer operations, and clean up debug output in test files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* [feat]: add write buffer wrapper
* [fix]: fix comment
---------
Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
* [feat]: kt-kernel: Add resume arg to CPU weight conversion
* [docs]: kt-kernel: Document resume arg for CPU weight conversion
* [fix]: kt-kernel: Only print resume layer if in use
* [fix]: kt-kernel: Don't log skipped layers when using resume_layer
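
The resume behaviour described in the four commits above could look roughly like the following sketch. It is not the actual conversion script: `convert_layers`, `convert_one`, and `log` are hypothetical names, and only `resume_layer` comes from the commit titles. The sketch assumes that layers below `resume_layer` are skipped without logging and that the resume message appears only when the argument is set.

```python
def convert_layers(num_layers, convert_one, resume_layer=None, log=print):
    """Hypothetical sketch of a resumable per-layer conversion loop.

    convert_one(layer_idx) converts a single layer. Layers with an index
    below resume_layer are assumed to be done already and are skipped.
    Returns the list of layer indices converted in this run.
    """
    # Only mention the resume point when the argument is actually in use.
    if resume_layer is not None:
        log(f"Resuming conversion from layer {resume_layer}")

    converted = []
    for layer_idx in range(num_layers):
        if resume_layer is not None and layer_idx < resume_layer:
            continue  # skipped silently, no per-layer log line
        convert_one(layer_idx)
        converted.append(layer_idx)
    return converted


if __name__ == "__main__":
    done = []
    convert_layers(4, done.append, resume_layer=2)
    print(done)
```

Passing `log` as a parameter keeps the sketch easy to test; a real script would more likely use its existing logging setup.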
* [feat]: update kt-kernel hooks and add contribution guide
* [docs]: add contributing guide
* [style]: format the python file and cpp file in kt-kernel
* update README for kt-kernel
* style: format C++ and Python code in kt-kernel
- Format C++ files: task_queue, ext_bindings, and MoE operators
- Format Python utility modules: amx, llamafile, and loader
- Improve code readability and consistency
* refactor: move legacy code to archive/ directory
- Moved ktransformers, csrc, third_party, merge_tensors to archive/
- Moved build scripts and configurations to archive/
- Kept kt-kernel, KT-SFT, doc, and README files in root
- Preserved complete git history for all moved files
* refactor: restructure repository to focus on kt-kernel and KT-SFT modules
* fix README
* docs: add performance benchmarks to kt-kernel section
Add comprehensive performance data for kt-kernel to match KT-SFT's presentation:
- AMX kernel optimization: 21.3 TFLOPS (3.9× faster than PyTorch)
- Prefill phase: up to 20× speedup vs baseline
- Decode phase: up to 4× speedup
- NUMA optimization: up to 63% throughput improvement
- Multi-GPU (8×L20): 227.85 tokens/s total throughput with DeepSeek-R1 FP8
Source: https://lmsys.org/blog/2025-10-22-KTransformers/
This provides users with concrete performance metrics for both core modules,
making it easier to understand the capabilities of each component.
* refactor: improve kt-kernel performance data with specific hardware and models
Replace generic performance descriptions with concrete benchmarks:
- Specify exact hardware: 8×L20 GPU + Xeon Gold 6454S, Single/Dual-socket Xeon + AMX
- Include specific models: DeepSeek-R1-0528 (FP8), DeepSeek-V3 (671B)
- Show detailed metrics: total throughput, output throughput, concurrency details
- Match KT-SFT presentation style for consistency
This provides users with actionable performance data they can use to evaluate
hardware requirements and expected performance for their use cases.
* fix README
* docs: clean up performance table and improve formatting
* add pic for README
* refactor: simplify .gitmodules and backup legacy submodules
- Remove 7 legacy submodules from root .gitmodules (archive/third_party/*)
- Keep only 2 active submodules for kt-kernel (llama.cpp, pybind11)
- Backup complete .gitmodules to archive/.gitmodules
- Add documentation in archive/README.md for researchers who need legacy submodules
This reduces initial clone size by ~500MB and avoids downloading unused dependencies.
* refactor: move doc/ back to root directory
Keep documentation in root for easier access and maintenance.
* refactor: consolidate all images to doc/assets/
- Move kt-kernel/assets/heterogeneous_computing.png to doc/assets/
- Remove KT-SFT/assets/ (images already in doc/assets/)
- Update KT-SFT/README.md image references to ../doc/assets/
- Eliminates ~7.9MB image duplication
- Centralizes all documentation assets in one location
* fix pic path for README