mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-02 20:51:23 +00:00
SmoothQuant with CK Tile
This example demonstrates SmoothQuant, a quantization technique for transformer models, using the CK Tile programming model. SmoothQuant enables efficient int8 inference by scaling activations and weights to balance quantization error.
Algorithm and Math
Given input $X$ and per-channel scale $S$:
- Scale: $Y_{i,j} = X_{i,j} \cdot S_j$
- Rowwise Dynamic Quantization:
  - For each row, $s = \max(|Y|) / 127$
  - $Q_{i,j} = \text{round}(Y_{i,j} / s)$, $Q_{i,j} \in \text{int8}$
- Output:
  - Quantized tensor $Q$ (int8)
  - Per-row scale $s$ (fp32)
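The formulas above can be checked on the CPU with a small reference sketch. This is plain Python written for illustration, not the CK Tile kernel or its host validation code. The function name `smoothquant_rowwise`, the handling of all-zero rows, and the use of Python's `round()` (which rounds ties to even) are assumptions of the sketch; the kernel's rounding mode may differ.

```python
def smoothquant_rowwise(x, smooth_scale):
    """Reference sketch: scale columns by S, then quantize each row to int8.

    x            -- list of M rows, each a list of N floats
    smooth_scale -- list of N per-channel scales S_j
    Returns (q, row_scale): q holds ints in [-127, 127], row_scale holds M floats.
    """
    q, row_scale = [], []
    for row in x:
        # Scale step: Y_{i,j} = X_{i,j} * S_j
        y = [x_ij * s_j for x_ij, s_j in zip(row, smooth_scale)]
        # Row-wise dynamic scale: s = max(|Y|) / 127
        s = max(abs(v) for v in y) / 127.0
        if s == 0.0:
            # Assumption of this sketch: an all-zero row quantizes to zeros.
            q.append([0] * len(y))
        else:
            # Clamp defensively against floating-point overshoot past 127.
            q.append([max(-127, min(127, round(v / s))) for v in y])
        row_scale.append(s)
    return q, row_scale


x = [[1.0, -2.0, 0.5], [0.0, 4.0, -8.0]]
smooth_scale = [2.0, 1.0, 4.0]
q, row_scale = smoothquant_rowwise(x, smooth_scale)
print(q)          # [[127, -127, 127], [0, 16, -127]]
print(row_scale)  # approximately [2/127, 32/127]
```

Multiplying `q[i][j]` by `row_scale[i]` recovers an approximation of $Y_{i,j}$, which is why the per-row scale is returned alongside the int8 tensor.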
Tile Programming Model
- Tiles: Each thread block processes a tile (row or block).
- Pipeline: Modular, can be extended for further fusion.
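The two bullets above can be pictured with a host-side sketch, assuming each thread block owns a contiguous tile of rows and runs a short sequence of stages over it. All names here (`row_tiles`, `rows_per_block`, the `*_stage` functions) are hypothetical and chosen for illustration; they are not CK Tile APIs, and the real kernel's tile shape may differ.

```python
def row_tiles(m, rows_per_block):
    """Yield (block_id, row_begin, row_end) tiles covering rows 0..m-1."""
    num_blocks = (m + rows_per_block - 1) // rows_per_block  # ceiling division
    for block_id in range(num_blocks):
        begin = block_id * rows_per_block
        yield block_id, begin, min(begin + rows_per_block, m)


# Each stage is a plain function, so a further stage could be appended
# to the sequence in run_block (the "further fusion" idea).
def scale_stage(row, smooth_scale):
    return [v * s for v, s in zip(row, smooth_scale)]


def absmax_stage(y):
    return max(abs(v) for v in y)


def quantize_stage(y, absmax):
    if absmax == 0.0:
        return [0] * len(y), 0.0  # assumption: all-zero rows stay zero
    s = absmax / 127.0
    return [round(v / s) for v in y], s


def run_block(x, smooth_scale, begin, end):
    """Process the rows [begin, end) owned by one simulated thread block."""
    out = []
    for i in range(begin, end):
        y = scale_stage(x[i], smooth_scale)
        out.append(quantize_stage(y, absmax_stage(y)))
    return out


# With the default m = 3328 and a hypothetical 4 rows per block:
tiles = list(row_tiles(3328, 4))
print(len(tiles), tiles[0], tiles[-1])
```

On a GPU the tiles would run concurrently and the absmax step would be a reduction across threads; the sequential loop here only shows which rows belong to which block.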
Build & Run
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # replace <arch> with your GPU target, e.g. gfx90a or gfx942
make tile_smoothquant -j`nproc`
This produces the executable build/bin/tile_smoothquant.
cmdline
args:
-m          m dimension (default:3328)
-n          n dimension (default:4096)
-x_stride   input stride per row, if -1 then equal to n (default:-1)
-y_stride   output stride per row, if -1 then equal to n (default:-1)
-v          cpu validation or not (default:1)
-kname      print kernel name or not (default:1)
-prec       precision (default:fp16)
-warmup     cold iter (default:5)
-repeat     hot iter (default:20)
-json       0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile   json file name to dump results (default:smoothquant.json)
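The `-x_stride` and `-y_stride` rule ("if -1 then equal to n") can be made concrete with a short sketch. It assumes a row-major layout with strides counted in elements rather than bytes, which the help text does not state explicitly. The helper names and the padded stride of 4160 are hypothetical.

```python
def resolve_stride(stride, n):
    """Apply the documented rule: a stride of -1 means 'equal to n'."""
    return n if stride == -1 else stride


def element_offset(i, j, stride):
    """Offset of element (i, j) in a row-major buffer with the given row stride."""
    return i * stride + j


def buffer_size(m, n, stride):
    """Minimum element count of a buffer holding m rows of n elements."""
    return (m - 1) * stride + n if m > 0 else 0


m, n = 3328, 4096                  # defaults of -m and -n
packed = resolve_stride(-1, n)     # default: rows are contiguous
padded = resolve_stride(4160, n)   # hypothetical padded row stride

print(element_offset(2, 5, packed))  # 8197
print(element_offset(2, 5, padded))  # 8325
print(buffer_size(m, n, packed))     # 13631488
```

A stride larger than `n` leaves unused elements at the end of each row, so the input and output buffers can use different row strides while holding the same logical `m` x `n` tensor.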