mirror of https://github.com/kvcache-ai/ktransformers.git
synced 2026-04-29 18:51:15 +00:00

[docs]: add contributing guide and add hooks install (#1613)

* [feat]: update kt-kernel hooks and add contribution guide
* [docs]: add contributing guide
* [style]: format the python file and cpp file in kt-kernel

.github/CONTRIBUTING.md (vendored, new file, 139 lines)

@@ -0,0 +1,139 @@
## Before You Commit

Your commit message must follow [Conventional Commits](https://www.conventionalcommits.org/) and your code must be formatted. The Git hooks do most of this work automatically.

### Tool Requirements

You need a recent `clang-format` (>= 18). In a conda environment you can install it with:

```shell
conda install -c conda-forge clang-format=18
```

If you previously configured the build with an older version, remove the build directory and reconfigure:

```shell
rm -rf kt-kernel/build
```

Install `black` for Python formatting:

```shell
conda install black
```

### Install the Hooks

```shell
bash kt-kernel/scripts/install-git-hooks.sh
# or just configure kt-kernel with cmake, which installs the hooks as well
cmake -S kt-kernel -B kt-kernel/build
```

If you need to run the formatters manually:

```shell
cmake -S kt-kernel -B kt-kernel/build
cmake --build kt-kernel/build --target format
```

## Developer Note

Formatting and commit message rules are enforced by Git hooks. After installing `clang-format` and `black`, just commit normally—the hooks will run formatting for you.

> [!NOTE]
> If formatting modifies files, the commit is aborted after staging those changes. Review them and run `git commit` again. Repeat until no further formatting changes appear.

---

### Conventional Commit Regex (Reference)

The commit-msg hook enforces this pattern:

```text
regex='^\[(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert|wip)\](\([^\)]+\))?(!)?: .+'
```

Meaning:

* `[type]` required — one of feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert|wip
* Optional scope: `(scope)` — any characters except `)`
* Optional breaking-change marker: `!` right after the type or scope
* Separator: `: ` (colon + space)
* Subject: free text (at least one character)

Examples:

```text
[feat]: add adaptive batching
[fix(parser)]: handle empty token list
[docs]!: update API section for breaking rename
```

You can bypass the hooks locally (not recommended) with:

```shell
git commit --no-verify
```
## Reminder Before Committing

Commit messages must follow the Conventional Commits specification (https://www.conventionalcommits.org/), and code must meet the formatting requirements. The Git hooks already take care of most of this work.

### Software Requirements

A recent `clang-format` (>= 18) is required; in a conda environment, install it with:

```shell
conda install -c conda-forge clang-format=18
```

If you previously configured with an older version, delete the build directory and reconfigure:

```shell
rm -rf kt-kernel/build
```

Install `black` for formatting Python files:

```shell
conda install black
```

### Installing the Hooks

```shell
bash kt-kernel/scripts/install-git-hooks.sh
# or just configure kt-kernel with cmake
cmake -S kt-kernel -B kt-kernel/build
```

If you need to format manually:

```shell
cmake -S kt-kernel -B kt-kernel/build
cmake --build kt-kernel/build --target format
```

## Developer Notes

This repository uses Git hooks to run code formatting and commit-message checks automatically. Simply install `clang-format` and `black`, then commit as usual; the hooks will format your code for you.

> [!NOTE]
> If formatting modifies files, the hook aborts the commit after staging those changes. Review the modifications and run `git commit` again; repeat until no further formatting changes appear.

### Commit Message Regex (Reference)

The hook checks commit messages against the following regex:

```text
regex='^\[(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert|wip)\](\([^\)]+\))?(!)?: .+'
```

Meaning:

* `[type]` required: feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert|wip
* Optional scope: `(scope)`, which must not contain a closing parenthesis
* Optional breaking-change marker: `!`
* Separator: colon + space (`: `)
* Subject: at least one character

Examples:

```text
[feat]: add adaptive batching
[fix(tokenizer)]: handle empty token list
[docs]!: update API documentation (contains breaking changes)
```

Skipping the hooks (not recommended; emergencies only):

```shell
git commit --no-verify
```
The pre-commit hook script is made monorepo-aware:

```diff
@@ -1,10 +1,12 @@
 #!/usr/bin/bash
-# Pre-commit hook: run clang-format via CMake 'format' target and Black for Python before allowing commit.
-# If formatting makes changes, stage them and abort so user can review.
+# Pre-commit hook: run clang-format via kt-kernel's CMake 'format' target and Black for Python
+# before allowing commit. If formatting makes changes, stage them and abort so user can review.
 set -euo pipefail
 
 REPO_ROOT="$(git rev-parse --show-toplevel)"
-BUILD_DIR="$REPO_ROOT/build"
+# kt-kernel project directory within the monorepo
+KERNEL_DIR="$REPO_ROOT/kt-kernel"
+BUILD_DIR="$KERNEL_DIR/build"
 FORMAT_TARGET="format"
 CLANG_FORMAT_BIN="${CLANG_FORMAT_BIN:-clang-format}"
 BLACK_BIN="${BLACK_BIN:-black}"
@@ -20,10 +22,10 @@ if ! command -v "$BLACK_BIN" >/dev/null 2>&1; then
   echo "[pre-commit] black not found (looked for $BLACK_BIN). Skipping Python format." >&2
 fi
 
-# Configure build directory if missing (quiet)
-if [ ! -d "$BUILD_DIR" ] || [ ! -f "$BUILD_DIR/Makefile" ] && [ ! -f "$BUILD_DIR/build.ninja" ]; then
-  echo "[pre-commit] configuring project (cmake) ..." >&2
-  cmake -S "$REPO_ROOT" -B "$BUILD_DIR" >/dev/null
+# Configure kt-kernel build directory if missing (quiet)
+if [ ! -d "$BUILD_DIR" ] || { [ ! -f "$BUILD_DIR/Makefile" ] && [ ! -f "$BUILD_DIR/build.ninja" ]; }; then
+  echo "[pre-commit] configuring kt-kernel (cmake) ..." >&2
+  cmake -S "$KERNEL_DIR" -B "$BUILD_DIR" >/dev/null
 fi
 
 # Run format target (prefer ninja if present)
@@ -38,15 +40,18 @@ fi
 
 # Run black on staged python files (or entire repo if you prefer)
 if command -v "$BLACK_BIN" >/dev/null 2>&1; then
-  # Get staged python files; if none, skip
-  PY_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep -E '\.py$' || true)
-  if [ -n "$PY_FILES" ]; then
-    echo "[pre-commit] running black on staged python files..." >&2
-    $BLACK_BIN $PY_FILES
-  else
-    # Optionally format all python files; comment out if not desired
-    # $BLACK_BIN "$REPO_ROOT"
-    :
+  # Run black only on kt-kernel's python and scripts directories
+  BLACK_PATHS=""
+  if [ -d "$KERNEL_DIR/python" ]; then
+    BLACK_PATHS="$BLACK_PATHS $KERNEL_DIR/python"
+  fi
+  if [ -d "$KERNEL_DIR/scripts" ]; then
+    BLACK_PATHS="$BLACK_PATHS $KERNEL_DIR/scripts"
+  fi
+  if [ -n "$BLACK_PATHS" ]; then
+    echo "[pre-commit] running black on:$BLACK_PATHS" >&2
+    # shellcheck disable=SC2086
+    $BLACK_BIN $BLACK_PATHS
   fi
 fi
 
```
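The abort-on-change behavior the hook implements (stage the reformatted files, then exit non-zero so the user can review) reduces to a small decision. A sketch in Python; `hook_exit_code` is a hypothetical helper for illustration, not part of the repository:

```python
def hook_exit_code(files_changed_by_format: list[str]) -> int:
    """Decision logic of the pre-commit hook (sketch).

    Abort the commit (non-zero exit) when formatting touched any file,
    so the user can review the newly staged changes and commit again.
    """
    if files_changed_by_format:
        for path in files_changed_by_format:
            print(f"[pre-commit] reformatted: {path}")
        return 1  # abort: formatting changed files
    return 0  # nothing changed; allow the commit


assert hook_exit_code([]) == 0
assert hook_exit_code(["kt-kernel/python/example.py"]) == 1
```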
kt-kernel's CMake configuration now locates the enclosing git worktree before installing the hooks:

```diff
@@ -79,27 +79,40 @@ if(USE_CONDA_TOOLCHAIN)
   set(CMAKE_INSTALL_RPATH_USE_LINK_PATH OFF)
 endif()
 
-## Ensure git hooks are installed when configuring the project
-# If this is a git working copy and the installer exists, run it and fail the CMake configure
-# when installation fails. If no .git directory is present (e.g. source tarball), skip.
-if(EXISTS "${CMAKE_SOURCE_DIR}/.git" AND IS_DIRECTORY "${CMAKE_SOURCE_DIR}/.git")
-  if(EXISTS "${CMAKE_SOURCE_DIR}/scripts/install-git-hooks.sh")
-    message(STATUS "Detected .git; installing git hooks using scripts/install-git-hooks.sh")
-    execute_process(
-      COMMAND sh "${CMAKE_SOURCE_DIR}/scripts/install-git-hooks.sh"
-      WORKING_DIRECTORY "${CMAKE_SOURCE_DIR}"
-      RESULT_VARIABLE _INSTALL_GIT_HOOKS_RESULT
-      OUTPUT_VARIABLE _INSTALL_GIT_HOOKS_OUT
-      ERROR_VARIABLE _INSTALL_GIT_HOOKS_ERR
+## Ensure git hooks are installed when configuring the project (monorepo-aware)
+# If we are inside a git worktree (repo root is outside kt-kernel now), invoke the installer
+# which will link kt-kernel/.githooks into the top-level .git/hooks. Otherwise, skip.
+find_program(GIT_BIN git)
+if(GIT_BIN)
+  execute_process(
+    COMMAND "${GIT_BIN}" rev-parse --show-toplevel
+    WORKING_DIRECTORY "${CMAKE_SOURCE_DIR}"
+    OUTPUT_VARIABLE _GIT_TOP
+    RESULT_VARIABLE _GIT_RV
+    OUTPUT_STRIP_TRAILING_WHITESPACE
+    ERROR_QUIET
+  )
+  if(_GIT_RV EQUAL 0 AND EXISTS "${_GIT_TOP}/.git" AND IS_DIRECTORY "${_GIT_TOP}/.git")
+    if(EXISTS "${CMAKE_SOURCE_DIR}/scripts/install-git-hooks.sh")
+      message(STATUS "Detected git worktree at ${_GIT_TOP}; installing hooks from kt-kernel/.githooks")
+      execute_process(
+        COMMAND sh "${CMAKE_SOURCE_DIR}/scripts/install-git-hooks.sh"
+        WORKING_DIRECTORY "${CMAKE_SOURCE_DIR}"
+        RESULT_VARIABLE _INSTALL_GIT_HOOKS_RESULT
+        OUTPUT_VARIABLE _INSTALL_GIT_HOOKS_OUT
+        ERROR_VARIABLE _INSTALL_GIT_HOOKS_ERR
     )
     if(NOT _INSTALL_GIT_HOOKS_RESULT EQUAL 0)
       message(FATAL_ERROR "Installing git hooks failed (exit ${_INSTALL_GIT_HOOKS_RESULT}).\nOutput:\n${_INSTALL_GIT_HOOKS_OUT}\nError:\n${_INSTALL_GIT_HOOKS_ERR}")
+      endif()
+    else()
+      message(FATAL_ERROR "Required script 'scripts/install-git-hooks.sh' not found in kt-kernel; cannot install hooks.")
     endif()
   else()
-    message(FATAL_ERROR "Repository appears to be a git repo but required script 'scripts/install-git-hooks.sh' was not found. Please ensure hooks installer is present.")
+    message(STATUS "No git worktree detected; skipping git hooks installation")
   endif()
 else()
-  message(STATUS "No .git directory found; skipping git hooks installation")
+  message(STATUS "git not found; skipping git hooks installation")
 endif()
 
 set(CMAKE_CXX_STANDARD 20)
```
Include-order fixes in kt-kernel C++ sources:

```diff
@@ -9,10 +9,10 @@
 **/
 #include "shared_mem_buffer.h"
 
+#include <errno.h>
 #include <numa.h>
 
 #include <cstdio>
-#include <errno.h>
 
 size_t MemoryRequest::total_size() {
   size_t total = 0;
```
```diff
@@ -7,6 +7,7 @@
 #include <cstdio>
 #include <type_traits>
 
+#include "../cpu_backend/shared_mem_buffer.h"
 #include "common.hpp"
 
 // Forward declaration for Llamafile backend type checking
```
```diff
@@ -13,11 +13,11 @@
 #include <vector>
 
 #include "../common.hpp"
-#include "../cpu_backend/shared_mem_buffer.h"
 #include "../moe-tp.hpp"
 #include "api/common.h"
 #include "api/mat_kernel.h"
 #include "llama.cpp/ggml.h"
 
 template <class T, bool PLAIN = true>
 class MOE_KERNEL_TP
 #ifdef FORWARD_TIME_PROFILE
```
Black formatting of the kt-kernel converter script (`convert_cpu_weights.py`):

```diff
@@ -22,7 +22,6 @@ import triton
 import triton.language as tl
 
 
-
 Q_BITS = 4
 STORAGE_BITS = 32
 PACK_NUM = STORAGE_BITS // Q_BITS
@@ -31,6 +30,7 @@ NUMA_NUM = 2
 REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
+
 @triton.jit
 def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
     pid_m = tl.program_id(axis=0)
@@ -51,10 +51,11 @@ def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> t
     assert x.dim() == 2 and s.dim() == 2
     M, N = x.size()
     y = torch.empty_like(x, dtype=torch.get_default_dtype())
-    grid = lambda meta: (triton.cdiv(M, meta['BLOCK_SIZE']), triton.cdiv(N, meta['BLOCK_SIZE']))
+    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_SIZE"]), triton.cdiv(N, meta["BLOCK_SIZE"]))
     weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
     return y
 
 
+
 def load_model_config(input_path: str, input_type: str = None) -> Dict:
     """Load model configuration from config.json
@@ -297,7 +298,6 @@ class ConverterBase:
         handle = self.file_handle_map[file]
         return handle.get_tensor(key)
 
-
     # layers_id -> list[experts_id]
     def _find_expert_layers(self) -> Dict[int, List[int]]:
         """Find all layers and experts in the model"""
@@ -517,7 +517,9 @@ class OnlineQuantConverter(ConverterBase):
         quant_method: str = "int4",
         merge_to_safetensor: bool = True,
     ):
-        super().__init__(input_path, output_path, model_config, cpuinfer_threads, threadpool_count, input_type, merge_to_safetensor)
+        super().__init__(
+            input_path, output_path, model_config, cpuinfer_threads, threadpool_count, input_type, merge_to_safetensor
+        )
         self.quant_method = quant_method
 
         # For FP8, get block size from model_config
@@ -569,11 +571,11 @@ class OnlineQuantConverter(ConverterBase):
         if not os.path.exists(file_path):
             raise FileNotFoundError(f"File not found: {file_path}")
 
-        with open(file_path, 'rb') as f:
+        with open(file_path, "rb") as f:
             binary_data = f.read()
 
         # Determine dtype based on file name
-        if 'scale' in file_path:
+        if "scale" in file_path:
             # Scale tensors are typically float32
             np_array = np.frombuffer(binary_data, dtype=np.float32)
         else:
@@ -616,22 +618,12 @@ class OnlineQuantConverter(ConverterBase):
         # Iterate through all experts
         for expert_id in range(self.num_experts):
             # For each projection (down, gate, up)
-            proj_mappings = [
-                ('down', 'ffn_down_exps'),
-                ('gate', 'ffn_gate_exps'),
-                ('up', 'ffn_up_exps')
-            ]
+            proj_mappings = [("down", "ffn_down_exps"), ("gate", "ffn_gate_exps"), ("up", "ffn_up_exps")]
 
             for proj_name, proj_key in proj_mappings:
                 # Build file patterns
-                quant_pattern = os.path.join(
-                    numa_folder,
-                    f'{amx_method}_{proj_name}_{expert_id}_*Byte_quant_.kt'
-                )
-                scale_pattern = os.path.join(
-                    numa_folder,
-                    f'{amx_method}_{proj_name}_{expert_id}_*Byte_scale_.kt'
-                )
+                quant_pattern = os.path.join(numa_folder, f"{amx_method}_{proj_name}_{expert_id}_*Byte_quant_.kt")
+                scale_pattern = os.path.join(numa_folder, f"{amx_method}_{proj_name}_{expert_id}_*Byte_scale_.kt")
 
                 # Find files using glob
                 quant_files = glob.glob(quant_pattern)
@@ -705,18 +697,18 @@ class OnlineQuantConverter(ConverterBase):
                     raise KeyError(f"Missing down weight_scale_inv for layer {layer_idx}, expert {expert_id}")
 
                 # Load FP8 weights and scales
-                gate_fp8 = self._load_tensor(gate_key).to('cuda')
-                up_fp8 = self._load_tensor(up_key).to('cuda')
-                down_fp8 = self._load_tensor(down_key).to('cuda')
+                gate_fp8 = self._load_tensor(gate_key).to("cuda")
+                up_fp8 = self._load_tensor(up_key).to("cuda")
+                down_fp8 = self._load_tensor(down_key).to("cuda")
 
-                gate_scale_inv = self._load_tensor(gate_scale_key).to('cuda')
-                up_scale_inv = self._load_tensor(up_scale_key).to('cuda')
-                down_scale_inv = self._load_tensor(down_scale_key).to('cuda')
+                gate_scale_inv = self._load_tensor(gate_scale_key).to("cuda")
+                up_scale_inv = self._load_tensor(up_scale_key).to("cuda")
+                down_scale_inv = self._load_tensor(down_scale_key).to("cuda")
 
                 # Dequantize FP8 to BF16 using block-wise scaling
-                gate_weight = weight_dequant(gate_fp8, gate_scale_inv).to('cpu').to(torch.bfloat16).contiguous()
-                up_weight = weight_dequant(up_fp8, up_scale_inv).to('cpu').to(torch.bfloat16).contiguous()
-                down_weight = weight_dequant(down_fp8, down_scale_inv).to('cpu').to(torch.bfloat16).contiguous()
+                gate_weight = weight_dequant(gate_fp8, gate_scale_inv).to("cpu").to(torch.bfloat16).contiguous()
+                up_weight = weight_dequant(up_fp8, up_scale_inv).to("cpu").to(torch.bfloat16).contiguous()
+                down_weight = weight_dequant(down_fp8, down_scale_inv).to("cpu").to(torch.bfloat16).contiguous()
 
             elif self.input_type == "fp16":
                 # Load FP16 and convert to BF16
@@ -804,6 +796,7 @@ class OnlineQuantConverter(ConverterBase):
         print(f"  Keeping layer folder structure at {self.output_path}/_layer_{layer_idx}")
         return {}
 
+
 """
 Example usage(test passed):
 python convert_cpu_weights.py --input-path /mnt/data3/models/DeepSeek-R1-0528/ --input-type fp8 --output /mnt/data3/models/DeepSeek-R1-0528-INT4-test --quant-method int4 --cpuinfer-threads 60 --threadpool-count 2
@@ -811,6 +804,7 @@ python convert_cpu_weights.py --input-path /mnt/data3/models/DeepSeek-R1-0528/ -
 python convert_cpu_weights.py --input-path /mnt/data2/models/Qwen3-Next-80B-A3B-Instruct --input-type bf16 --output /mnt/data2/models/Qwen3-Next-80B-A3B-Instruct-INT4-test --quant-method int4 --cpuinfer-threads 60 --threadpool-count 2
 """
 
+
 def main():
     parser = argparse.ArgumentParser(description="Convert SafeTensors to column major 1D format")
     parser.add_argument("--input-path", "-i", required=True, help="Input directory with safetensors")
@@ -873,12 +867,25 @@ def main():
 
     if quant_method == "awq":
         converter = AWQToColumnMajorConverter(
-            args.input_path, args.output, model_config, args.cpuinfer_threads, args.threadpool_count, input_type=None, merge_to_safetensor=merge_to_safetensor
+            args.input_path,
+            args.output,
+            model_config,
+            args.cpuinfer_threads,
+            args.threadpool_count,
+            input_type=None,
+            merge_to_safetensor=merge_to_safetensor,
         )
     elif quant_method in ["int4", "int8"] and args.input_type in ["fp8", "fp16", "bf16"]:
         # Use OnlineQuantConverter for both INT4 and INT8 quantization
         converter = OnlineQuantConverter(
-            args.input_path, args.output, model_config, args.cpuinfer_threads, args.threadpool_count, args.input_type, quant_method, merge_to_safetensor
+            args.input_path,
+            args.output,
+            model_config,
+            args.cpuinfer_threads,
+            args.threadpool_count,
+            args.input_type,
+            quant_method,
+            merge_to_safetensor,
+        )
         )
     else:
         raise ValueError(
```
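The `weight_dequant` Triton kernel touched above applies one scale per `block_size`-square tile of the weight matrix. A plain NumPy sketch of the same block-wise scaling, on CPU; `block_dequant` is a hypothetical helper for illustration, and the shape of `s` (one scale per tile, `(ceil(M/bs), ceil(N/bs))`) is an assumption:

```python
import numpy as np


def block_dequant(x: np.ndarray, s: np.ndarray, block_size: int = 128) -> np.ndarray:
    """Multiply each block_size x block_size tile of x by its per-block scale.

    CPU sketch of what the Triton weight_dequant kernel computes;
    assumes x is (M, N) and s is (ceil(M/bs), ceil(N/bs)).
    """
    M, N = x.shape
    y = np.empty_like(x, dtype=np.float32)
    for bi in range(s.shape[0]):
        for bj in range(s.shape[1]):
            rows = slice(bi * block_size, min((bi + 1) * block_size, M))
            cols = slice(bj * block_size, min((bj + 1) * block_size, N))
            y[rows, cols] = x[rows, cols] * s[bi, bj]
    return y


# 256x256 matrix of ones with a 2x2 grid of 128-wide blocks:
x = np.ones((256, 256), dtype=np.float32)
s = np.array([[2.0, 3.0], [4.0, 5.0]], dtype=np.float32)
y = block_dequant(x, s)
assert y[0, 0] == 2.0 and y[0, 255] == 3.0
assert y[255, 0] == 4.0 and y[255, 255] == 5.0
```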
Black formatting of the quantization script:

```diff
@@ -34,63 +34,42 @@ from datasets import load_dataset
 
 def parse_args():
     parser = argparse.ArgumentParser(description="Quantize MoE models with selective quantization")
 
     # Required arguments
-    parser.add_argument(
-        "--model_id",
-        type=str,
-        required=True,
-        help="Path to the input model directory"
-    )
-    parser.add_argument(
-        "--output_dir",
-        type=str,
-        required=True,
-        help="Path to save the quantized model"
-    )
+    parser.add_argument("--model_id", type=str, required=True, help="Path to the input model directory")
+    parser.add_argument("--output_dir", type=str, required=True, help="Path to save the quantized model")
 
     # Optional arguments
     parser.add_argument(
         "--quant_type",
         type=str,
         choices=["W4A16", "W8A16"],
         default="W8A16",
-        help="Quantization type: W4A16 (GPTQ4) or W8A16 (GPTQ8). Default: W8A16"
+        help="Quantization type: W4A16 (GPTQ4) or W8A16 (GPTQ8). Default: W8A16",
     )
     parser.add_argument(
-        "--num_calibration_samples",
-        type=int,
-        default=512,
-        help="Number of calibration samples. Default: 512"
+        "--num_calibration_samples", type=int, default=512, help="Number of calibration samples. Default: 512"
     )
     parser.add_argument(
-        "--max_sequence_length",
-        type=int,
-        default=2048,
-        help="Maximum sequence length for calibration. Default: 2048"
+        "--max_sequence_length", type=int, default=2048, help="Maximum sequence length for calibration. Default: 2048"
     )
     parser.add_argument(
         "--dampening_frac",
         type=float,
         default=0.1,
-        help="Dampening fraction to mitigate quantization noise. Default: 0.1"
+        help="Dampening fraction to mitigate quantization noise. Default: 0.1",
     )
     parser.add_argument(
         "--dataset",
         type=str,
         default="HuggingFaceH4/ultrachat_200k",
-        help="Dataset for calibration. Default: HuggingFaceH4/ultrachat_200k"
+        help="Dataset for calibration. Default: HuggingFaceH4/ultrachat_200k",
     )
     parser.add_argument(
-        "--dataset_split",
-        type=str,
-        default="train_sft",
-        help="Dataset split to use. Default: train_sft"
+        "--dataset_split", type=str, default="train_sft", help="Dataset split to use. Default: train_sft"
     )
     parser.add_argument(
-        "--force_cpu",
-        action="store_true",
-        help="Force all computations to CPU (sets CUDA_VISIBLE_DEVICES='')"
+        "--force_cpu", action="store_true", help="Force all computations to CPU (sets CUDA_VISIBLE_DEVICES='')"
     )
     parser.add_argument(
         "--ignore_patterns",
@@ -103,29 +82,22 @@ def parse_args():
             r"re:.*\.shared_expert\..*$",
             r"re:.*\.shared_experts\..*$",
             r"re:.*\.mlp\.shared_expert_gate$",
-            r"re:.*\.linear_attn\..*$"
+            r"re:.*\.linear_attn\..*$",
         ],
-        help="Regex patterns for layers to ignore during quantization"
+        help="Regex patterns for layers to ignore during quantization",
     )
     parser.add_argument(
         "--torch_dtype",
         type=str,
         choices=["bfloat16", "float16", "float32"],
         default="bfloat16",
-        help="PyTorch dtype for model loading. Default: bfloat16"
+        help="PyTorch dtype for model loading. Default: bfloat16",
     )
     parser.add_argument(
-        "--trust_remote_code",
-        action="store_true",
-        help="Allow loading of remote code (required for some models)"
+        "--trust_remote_code", action="store_true", help="Allow loading of remote code (required for some models)"
     )
-    parser.add_argument(
-        "--random_seed",
-        type=int,
-        default=42,
-        help="Random seed for dataset shuffling. Default: 42"
-    )
+    parser.add_argument("--random_seed", type=int, default=42, help="Random seed for dataset shuffling. Default: 42")
 
     return parser.parse_args()
 
 
@@ -152,11 +124,7 @@ def get_torch_dtype(dtype_str):
     Returns:
         torch.dtype: Corresponding PyTorch dtype
     """
-    dtype_map = {
-        "bfloat16": torch.bfloat16,
-        "float16": torch.float16,
-        "float32": torch.float32
-    }
+    dtype_map = {"bfloat16": torch.bfloat16, "float16": torch.float16, "float32": torch.float32}
     return dtype_map[dtype_str]
 
 
@@ -176,18 +144,18 @@ def check_dense_layers_and_update_ignore(model_id, ignore_patterns, trust_remote
        Updated ignore_patterns list with dense layer patterns added
     """
     print("🔍 Checking model configuration for dense layers...")
 
     try:
         # Load model configuration
         config = AutoConfig.from_pretrained(model_id, trust_remote_code=trust_remote_code)
 
         # Check if the model has first_k_dense_replace parameter
-        first_k_dense_replace = getattr(config, 'first_k_dense_replace', None)
+        first_k_dense_replace = getattr(config, "first_k_dense_replace", None)
 
         if first_k_dense_replace is not None and first_k_dense_replace > 0:
             print(f"✅ Found dense layers configuration: first_k_dense_replace = {first_k_dense_replace}")
             print(f"   Adding first {first_k_dense_replace} layers to ignore list...")
 
             # Create regex pattern for dense layers (layers 0 to first_k_dense_replace-1)
             if first_k_dense_replace == 1:
                 dense_pattern = r"re:model\.layers\.0\.mlp\..*$"
@@ -195,18 +163,18 @@ def check_dense_layers_and_update_ignore(model_id, ignore_patterns, trust_remote
                 # For multiple layers, use range pattern
                 layer_range = f"[0-{first_k_dense_replace-1}]"
                 dense_pattern = f"re:model\\.layers\\.{layer_range}\\.mlp\\..*$"
 
             # Add the dense layer pattern to ignore list
             updated_ignore_patterns = ignore_patterns + [dense_pattern]
 
             print(f"   Dense layer pattern added: {dense_pattern}")
             print(f"   This will ignore MLP components in layers 0-{first_k_dense_replace-1}")
 
             return updated_ignore_patterns
         else:
             print("ℹ️  No dense layers detected (first_k_dense_replace not found or is 0)")
             return ignore_patterns
 
     except Exception as e:
         print(f"⚠️  Warning: Could not check model config for dense layers: {e}")
         print("   Proceeding with original ignore patterns...")
@@ -246,11 +214,7 @@ def load_and_prepare_dataset(dataset_name, dataset_split, num_samples, max_lengt
     # Tokenize the data
     def tokenize(sample):
         return tokenizer(
-            sample["text"],
-            padding=False,
-            max_length=max_length,
-            truncation=True,
-            add_special_tokens=False
+            sample["text"], padding=False, max_length=max_length, truncation=True, add_special_tokens=False
         )
 
     ds = ds.map(tokenize, remove_columns=ds.column_names)
@@ -291,9 +255,7 @@ def main():
     # 0) Check for dense layers and update ignore patterns
     # Dense layers in the first few layers should not be quantized
```
|
# Dense layers in the first few layers should not be quantized
|
||||||
updated_ignore_patterns = check_dense_layers_and_update_ignore(
|
updated_ignore_patterns = check_dense_layers_and_update_ignore(
|
||||||
args.model_id,
|
args.model_id, args.ignore_patterns, args.trust_remote_code
|
||||||
args.ignore_patterns,
|
|
||||||
args.trust_remote_code
|
|
||||||
)
|
)
|
||||||
|
|
||||||
# --------------------------------------------------------------------
|
# --------------------------------------------------------------------
|
||||||
@@ -302,13 +264,9 @@ def main():
|
|||||||
print("🔍 Inferring device map...")
|
print("🔍 Inferring device map...")
|
||||||
with init_empty_weights():
|
with init_empty_weights():
|
||||||
dummy = AutoModelForCausalLM.from_pretrained(
|
dummy = AutoModelForCausalLM.from_pretrained(
|
||||||
args.model_id,
|
args.model_id, torch_dtype=torch_dtype, trust_remote_code=args.trust_remote_code
|
||||||
torch_dtype=torch_dtype,
|
|
||||||
trust_remote_code=args.trust_remote_code
|
|
||||||
)
|
|
||||||
device_map = infer_auto_device_map(
|
|
||||||
dummy, no_split_module_classes=dummy._no_split_modules
|
|
||||||
)
|
)
|
||||||
|
device_map = infer_auto_device_map(dummy, no_split_module_classes=dummy._no_split_modules)
|
||||||
del dummy
|
del dummy
|
||||||
|
|
||||||
# Force all modules to CPU for quantization
|
# Force all modules to CPU for quantization
|
||||||
@@ -335,7 +293,7 @@ def main():
|
|||||||
args.num_calibration_samples,
|
args.num_calibration_samples,
|
||||||
args.max_sequence_length,
|
args.max_sequence_length,
|
||||||
tokenizer,
|
tokenizer,
|
||||||
args.random_seed
|
args.random_seed,
|
||||||
)
|
)
|
||||||
|
|
||||||
# --------------------------------------------------------------------
|
# --------------------------------------------------------------------
|
||||||
@@ -373,4 +331,4 @@ def main():
|
|||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
main()
|
main()
|
||||||
|
|||||||
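The ignore-pattern construction in the hunks above can be sketched as a standalone function. The pattern strings come from the diff; the function name and the `re:`-prefix stripping are illustrative assumptions:

```python
import re

# Illustrative sketch of the dense-layer ignore-pattern logic (not the repo's API).
def dense_ignore_pattern(first_k_dense_replace: int) -> str:
    if first_k_dense_replace == 1:
        return r"re:model\.layers\.0\.mlp\..*$"
    # For multiple layers, use a character-class range over the layer index
    layer_range = f"[0-{first_k_dense_replace - 1}]"
    return f"re:model\\.layers\\.{layer_range}\\.mlp\\..*$"

pattern = dense_ignore_pattern(3)
# Strip the "re:" marker (assumed here to be a quantizer-side convention) and compile.
regex = re.compile(pattern[len("re:"):])

assert regex.match("model.layers.0.mlp.gate_proj.weight")
assert regex.match("model.layers.2.mlp.experts.0.up_proj.weight")
assert not regex.match("model.layers.3.mlp.gate_proj.weight")
```

Note that the `[0-{k-1}]` character class only covers single-digit layer indices, which is fine for the small `first_k_dense_replace` values (typically 1–3) this code targets.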
```diff
@@ -9,6 +9,7 @@ from safetensors.torch import load_file, save_file
 
 import gc
 
+
 def weight_dequant_cpu(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
     assert x.dim() == 2 and s.dim() == 2, "Expect 2D tensors for x and s"
     M, N = x.shape
@@ -27,6 +28,7 @@ def weight_dequant_cpu(x: torch.Tensor, s: torch.Tensor, block_size: int = 128)
             y[m0:m1, n0:n1] = sub.to(torch.bfloat16)
     return y
 
+
 def main(fp8_path, bf16_path):
     torch.set_default_dtype(torch.bfloat16)
     os.makedirs(bf16_path, exist_ok=True)
@@ -34,7 +36,7 @@ def main(fp8_path, bf16_path):
     with open(model_index_file, "r") as f:
         model_index = json.load(f)
     weight_map = model_index["weight_map"]
 
     loaded_files = {}
     fp8_weight_names = []
 
@@ -51,7 +53,7 @@ def main(fp8_path, bf16_path):
         file_name = os.path.basename(safetensor_file)
         current_state_dict = load_file(safetensor_file, device="cpu")
         loaded_files[file_name] = current_state_dict
 
         new_state_dict = {}
         for weight_name, weight in current_state_dict.items():
             if weight_name.endswith("_scale_inv"):
@@ -67,17 +69,17 @@ def main(fp8_path, bf16_path):
                     new_state_dict[weight_name] = weight
             else:
                 new_state_dict[weight_name] = weight
 
         new_safetensor_file = os.path.join(bf16_path, file_name)
         save_file(new_state_dict, new_safetensor_file)
 
         if len(loaded_files) > 2:
             oldest_file = next(iter(loaded_files))
             del loaded_files[oldest_file]
             gc.collect()
             if torch.cuda.is_available():
                 torch.cuda.empty_cache()
 
     new_model_index_file = os.path.join(bf16_path, "model.safetensors.index.json")
     for weight_name in fp8_weight_names:
         scale_inv_name = f"{weight_name}_scale_inv"
@@ -87,9 +89,10 @@ def main(fp8_path, bf16_path):
         json.dump({"metadata": {}, "weight_map": weight_map}, f, indent=2)
     print(f"Finish, Result in: {bf16_path}")
 
+
 if __name__ == "__main__":
     parser = ArgumentParser()
     parser.add_argument("--input-fp8-hf-path", type=str, required=True, help="Kimi-K2 FP8 model")
     parser.add_argument("--output-bf16-hf-path", type=str, required=True, help="BF16 model (After convert)")
     args = parser.parse_args()
     main(args.input_fp8_hf_path, args.output_bf16_hf_path)
```
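`weight_dequant_cpu` above multiplies each `block_size × block_size` tile of the FP8 weight by its per-block scale. A minimal pure-Python sketch of that tiling, with nested lists standing in for torch tensors and a toy block size (both assumptions for clarity):

```python
# Toy sketch of block-wise dequantization: one scale per tile of the weight matrix.
def weight_dequant(x, s, block_size=2):
    M, N = len(x), len(x[0])
    y = [[0.0] * N for _ in range(M)]
    for bi, m0 in enumerate(range(0, M, block_size)):
        for bj, n0 in enumerate(range(0, N, block_size)):
            scale = s[bi][bj]  # scale for tile (bi, bj)
            for m in range(m0, min(m0 + block_size, M)):
                for n in range(n0, min(n0 + block_size, N)):
                    y[m][n] = x[m][n] * scale
    return y

x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
s = [[0.5, 2.0]]  # one scale per 2x2 block
y = weight_dequant(x, s)
# → [[0.5, 1.0, 6.0, 8.0], [2.5, 3.0, 14.0, 16.0]]
```

The real function does the same per-tile multiply with torch slicing (`y[m0:m1, n0:n1] = sub.to(torch.bfloat16)`) instead of an explicit inner loop.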
```diff
@@ -48,9 +48,7 @@ def _dequantize_tensor(
     if scales.numel() == weight.numel():
         scales = scales.reshape_as(weight)
     else:
-        raise ValueError(
-            f"Scale shape {scales.shape} incompatible with weight shape {weight.shape}"
-        )
+        raise ValueError(f"Scale shape {scales.shape} incompatible with weight shape {weight.shape}")
     bf16 = (weight.to(torch.float32) * scales).to(torch.bfloat16)
     return bf16.contiguous()
 
@@ -128,9 +126,7 @@ def convert_file(
 
     os.makedirs(os.path.dirname(output_path), exist_ok=True)
     save_file(tensors, output_path)
-    print(
-        f"[done] wrote {output_path} (converted={stats['converted']}, skipped={stats['skipped']})"
-    )
+    print(f"[done] wrote {output_path} (converted={stats['converted']}, skipped={stats['skipped']})")
 
 
 def parse_args() -> argparse.Namespace:
@@ -174,9 +170,7 @@ def main():
         targets = [os.path.join(model_dir, fname) for fname in args.files]
     else:
         targets = [
-            os.path.join(model_dir, name)
-            for name in sorted(os.listdir(model_dir))
-            if name.endswith(".safetensors")
+            os.path.join(model_dir, name) for name in sorted(os.listdir(model_dir)) if name.endswith(".safetensors")
         ]
 
     if not targets:
```
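The `_dequantize_tensor` path above reshapes a flat per-element scale tensor to the weight's shape, multiplies in float32, and rejects mismatched scale counts. Sketched with plain lists in place of tensors (the function name and list representation are assumptions for illustration):

```python
# Illustrative sketch of per-element dequantization with a flat scale vector.
def dequantize(weight, scales_flat):
    rows, cols = len(weight), len(weight[0])
    if len(scales_flat) != rows * cols:
        # Mirrors the ValueError path for incompatible scale shapes.
        raise ValueError(f"Scale count {len(scales_flat)} incompatible with weight shape {rows}x{cols}")
    # reshape_as: index the flat scales in row-major order, same shape as weight.
    return [[weight[r][c] * scales_flat[r * cols + c] for c in range(cols)]
            for r in range(rows)]

out = dequantize([[1, 2], [3, 4]], [0.5, 0.5, 2.0, 2.0])
# → [[0.5, 1.0], [6.0, 8.0]]
```

In the real code the multiply happens in `torch.float32` and the result is cast to `torch.bfloat16` before being made contiguous.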
```diff
@@ -1,24 +1,29 @@
 #!/usr/bin/env sh
-# Install git hooks from .githooks into .git/hooks by creating symlinks (or copying if symlink fails).
+# Install git hooks from kt-kernel/.githooks into the monorepo's .git/hooks by
+# creating symlinks (or copying if symlink fails).
 
 set -eu
 
+# This script lives in kt-kernel/scripts/, so REPO_ROOT = kt-kernel
 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
-GIT_DIR="$REPO_ROOT/.git"
 HOOKS_SRC="$REPO_ROOT/.githooks"
 
+# Detect the top-level Git worktree (the monorepo root: ktransformers)
+GIT_TOP="$(git rev-parse --show-toplevel 2>/dev/null || true)"
+if [ -z "$GIT_TOP" ] || [ ! -d "$GIT_TOP/.git" ]; then
+    echo "[install-git-hooks] Not inside a git worktree; skipping hooks installation." >&2
+    exit 0
+fi
+
+GIT_DIR="$GIT_TOP/.git"
 HOOKS_DEST="$GIT_DIR/hooks"
 
-if [ ! -d "$GIT_DIR" ]; then
-    echo "Not a git repository (no .git directory) at $REPO_ROOT" >&2
-    exit 1
-fi
-
 if [ ! -d "$HOOKS_SRC" ]; then
-    echo "No .githooks directory found at $HOOKS_SRC" >&2
+    echo "[install-git-hooks] No .githooks directory found at $HOOKS_SRC" >&2
     exit 1
 fi
 
-echo "Installing git hooks from $HOOKS_SRC to $HOOKS_DEST"
+echo "[install-git-hooks] Installing git hooks from $HOOKS_SRC to $HOOKS_DEST (repo: $GIT_TOP)"
 
 # Ensure all source hook files are executable so that even if copied (not symlinked) they run.
 for src_hook in "$HOOKS_SRC"/*; do
@@ -49,4 +54,4 @@ for hook in "$HOOKS_SRC"/*; do
     fi
 done
 
-echo "Done. Hooks installed."
+echo "[install-git-hooks] Done. Hooks installed."
```
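The new worktree detection can be exercised in isolation. A throwaway sketch using a scratch repository (assumes `git` is on `PATH`; paths are temporary, not the repo's real layout):

```shell
#!/usr/bin/env sh
set -eu

# Create a scratch repository to demonstrate `git rev-parse --show-toplevel`.
tmp="$(mktemp -d)"
git init -q "$tmp/repo"
mkdir -p "$tmp/repo/nested/dir"
cd "$tmp/repo/nested/dir"

# From any nested directory, rev-parse resolves the top-level worktree;
# outside a worktree it fails, which the `|| true` turns into an empty string.
GIT_TOP="$(git rev-parse --show-toplevel 2>/dev/null || true)"
if [ -n "$GIT_TOP" ] && [ -d "$GIT_TOP/.git" ]; then
    echo "hooks would be installed into $GIT_TOP/.git/hooks"
else
    echo "[install-git-hooks] Not inside a git worktree; skipping." >&2
fi
```

This is why the updated installer works from the kt-kernel subdirectory of the monorepo: it no longer assumes `.git` sits next to `REPO_ROOT`, but asks git for the real top level.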