mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-04-19 14:29:05 +00:00
[CK] Precompute SpaceFillingCurve indices to reduce compile time by 31% (#5041)

## Summary

Optimize `SpaceFillingCurve` in CK to reduce compile time by precomputing all index values into a static constexpr lookup table.

### Problem

- `GetIndex<N>` was instantiated separately for every index value (0 to NumAccesses-1)
- Each instantiation triggered nested `static_for` loops with O(N²) template depth
- This caused **34,000+ template instantiations** taking **69 seconds** in frontend

### Solution

- Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed indices
- Add `compute_single_index()` helper using O(N) `static_for` loops
- Add `compute_all_indices()` to build entire table in one constexpr evaluation
- `GetIndex<N>` becomes simple array lookup: `return index_table[N]`

### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp)

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Total compile time | 120.4s | 83.6s | **-31%** |
| Frontend time | 88.7s | 52.6s | **-41%** |
| GetIndex instantiations | 34,176 | 384 | **-99%** |
| GetIndex time | 69.0s | 0.11s | **-99.8%** |
| SpaceFillingCurve time | 75.7s | 4.3s | **-94%** |

## Test plan

- [x] Builds successfully with `-Werror -Weverything`
- [ ] Run existing unit tests
- [ ] Verify numerical correctness on sample kernels

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
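
### Illustrative sketch

The lookup-table approach described under Solution can be sketched in standalone C++17 roughly as follows. This is not the code from the commit. `MultiIndex` and `SimpleCurve` are hypothetical stand-ins for CK's types. The traversal is assumed to be plain row-major order, whereas the real `SpaceFillingCurve` has its own ordering rules. Ordinary constexpr loops replace `static_for` to keep the example self-contained. Only the names `IndexLookupTable`, `compute_single_index`, `compute_all_indices`, `GetIndex` and `index_table` come from the description above, and their signatures here are guesses.

```cpp
#include <cstddef>

// Hypothetical stand-in for CK's multi-index type.
template <std::size_t nDim>
struct MultiIndex
{
    std::size_t v[nDim];
};

// One precomputed multi-index per access (layout is an assumption).
template <std::size_t NumAccesses, std::size_t nDim>
struct IndexLookupTable
{
    MultiIndex<nDim> data[NumAccesses];
};

// Decompose a linear access id into a row-major multi-index.
// A single pass over the dimensions: O(nDim) work, no nested loops.
template <std::size_t... Lengths>
constexpr MultiIndex<sizeof...(Lengths)> compute_single_index(std::size_t access)
{
    constexpr std::size_t nDim          = sizeof...(Lengths);
    constexpr std::size_t lengths[nDim] = {Lengths...};
    MultiIndex<nDim> idx{};
    for(std::size_t d = nDim; d-- > 0;)
    {
        idx.v[d] = access % lengths[d];
        access /= lengths[d];
    }
    return idx;
}

// Build the whole table in one constexpr evaluation.
template <std::size_t... Lengths>
constexpr auto compute_all_indices()
{
    constexpr std::size_t num_accesses = (std::size_t{1} * ... * Lengths);
    IndexLookupTable<num_accesses, sizeof...(Lengths)> table{};
    for(std::size_t i = 0; i < num_accesses; ++i)
    {
        table.data[i] = compute_single_index<Lengths...>(i);
    }
    return table;
}

// Hypothetical simplified curve: GetIndex<N> is only an array lookup.
template <std::size_t... Lengths>
struct SimpleCurve
{
    static constexpr std::size_t nDim        = sizeof...(Lengths);
    static constexpr std::size_t NumAccesses = (std::size_t{1} * ... * Lengths);
    static constexpr auto index_table        = compute_all_indices<Lengths...>();

    template <std::size_t N>
    static constexpr MultiIndex<nDim> GetIndex()
    {
        static_assert(N < NumAccesses, "access id out of range");
        return index_table.data[N];
    }
};

// Usage: a 2 x 3 curve has 6 accesses; access 4 maps to (1, 1).
static_assert(SimpleCurve<2, 3>::NumAccesses == 6, "");
static_assert(SimpleCurve<2, 3>::GetIndex<4>().v[0] == 1, "");
static_assert(SimpleCurve<2, 3>::GetIndex<4>().v[1] == 1, "");
```

In this sketch each `GetIndex<N>` instantiation still exists, but its body no longer triggers further template instantiations. The table is computed once per curve type, which is the effect the instantiation counts in the results table point to.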