* update README for kt-kernel
* style: format C++ and Python code in kt-kernel
- Format C++ files: task_queue, ext_bindings, and MoE operators
- Format Python utility modules: amx, llamafile, and loader
- Improve code readability and consistency
* refactor: move legacy code to archive/ directory
- Move ktransformers, csrc, third_party, merge_tensors to archive/
- Move build scripts and configurations to archive/
- Keep kt-kernel, KT-SFT, doc, and README files in root
- Preserve complete git history for all moved files
* refactor: restructure repository to focus on kt-kernel and KT-SFT modules
* fix README
* fix README
* fix README
* fix README
* docs: add performance benchmarks to kt-kernel section
Add comprehensive performance data for kt-kernel to match KT-SFT's presentation:
- AMX kernel optimization: 21.3 TFLOPS (3.9× faster than PyTorch)
- Prefill phase: up to 20× speedup vs baseline
- Decode phase: up to 4× speedup
- NUMA optimization: up to 63% throughput improvement
- Multi-GPU (8×L20): 227.85 tokens/s total throughput with DeepSeek-R1 FP8
Source: https://lmsys.org/blog/2025-10-22-KTransformers/
This provides users with concrete performance metrics for both core modules,
making it easier to understand the capabilities of each component.
* refactor: improve kt-kernel performance data with specific hardware and models
Replace generic performance descriptions with concrete benchmarks:
- Specify exact hardware: 8×L20 GPU + Xeon Gold 6454S, Single/Dual-socket Xeon + AMX
- Include specific models: DeepSeek-R1-0528 (FP8), DeepSeek-V3 (671B)
- Show detailed metrics: total throughput, output throughput, concurrency details
- Match KT-SFT presentation style for consistency
This provides users with actionable performance data they can use to evaluate
hardware requirements and expected performance for their use cases.
* fix README
* docs: clean up performance table and improve formatting
* add picture to README
* refactor: simplify .gitmodules and backup legacy submodules
- Remove 7 legacy submodules from root .gitmodules (archive/third_party/*)
- Keep only 2 active submodules for kt-kernel (llama.cpp, pybind11)
- Backup complete .gitmodules to archive/.gitmodules
- Add documentation in archive/README.md for researchers who need legacy submodules
This reduces initial clone size by ~500MB and avoids downloading unused dependencies.
* refactor: move doc/ back to root directory
Keep documentation in root for easier access and maintenance.
* refactor: consolidate all images to doc/assets/
- Move kt-kernel/assets/heterogeneous_computing.png to doc/assets/
- Remove KT-SFT/assets/ (images already in doc/assets/)
- Update KT-SFT/README.md image references to ../doc/assets/
- Eliminate ~7.9MB of duplicated images
- Centralize all documentation assets in one location
* fix picture path in README
Add documentation for KTransformers SGLang inference deployment, including installation steps, model download links, server launch instructions, and performance benchmarks.