mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-11 08:48:45 +00:00
tmp-develop
5 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8af6e10be4 |
[CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938)
## Problem Each `static_for<0, N, 1>` instantiates its lambda N times (one per `number<I>` type). When nested, intermediate lambdas capture the outer loop variable (a different type per iteration), creating unique closure types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32 = 52 IR functions, of which 20 are intermediate closures that get inlined away but still cost frontend compile time. ck_tile's `static_ford` was supposed to eliminate these intermediates (as old CK's PR #5031 did successfully), but it used a **recursive** `static_ford_impl` that recreated the same closure pattern plus added `reorder_old_to_new`/`reorder_new_to_old` overhead. Additionally, the sequence utility layer (`sequence_sort`, `is_valid_sequence_map`) used recursive template metaprogramming that generated O(N log N) intermediate types for every permutation validation — called on every `reorder_new_to_old`/`reorder_old_to_new` invocation. ## Changes ### 1. Replace `sequence_sort` with constexpr insertion sort Replace recursive merge sort (`sequence_sort_impl` + `sorted_sequence_merge_impl`, O(N log N) intermediate type instantiations) with constexpr insertion sort using `static_array`. O(1) template depth, same `::type` and `::sorted2unsorted_map` API. ### 2. Replace `is_valid_sequence_map` with constexpr check Replace sort-based permutation validation (which instantiated the full `sequence_sort` chain) with a constexpr "seen array" loop. O(N) constexpr steps instead of O(N log N) template instantiations. ### 3. Replace recursive `static_ford` with flat-loop `index_decomposer` Replace `static_ford_impl` (recursive `static_for` nesting + `pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat `index_decomposer` using pre-computed strides. Add `decompose_reordered` alias that folds reordering into decomposition, and `inverse_perm` helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map` → `sequence_sort` chain. ### 4. Eliminate internal lambda via `ford_applier` The flat-loop approach still used `static_for` with a lambda, creating M×N internal lambda instantiations per call site. Replace with `ford_applier` struct that calls `f(decompose<I>{})` directly via fold expression — zero intermediate closures: ```cpp // Before: 2×M×N function instantiations static_for<0, M*N, 1>{}([&](auto i) { f(decompose<i>{}); }); // After: M×N function instantiations (50% reduction) ford_applier<Decomposer, make_index_sequence<M*N>>{}(f); ``` Also unified identity and non-identity order paths into a single template with `constexpr if`. ### 5. Fix const-qualified sequence handling Fix `is_valid_sequence_map` to handle const-qualified sequence types via `remove_cvref_t` in callers (`tensor_adaptor.hpp`, `tile_distribution_encoding.hpp`). ## Results (this PR only, without flattening) ### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5) | Target | Base (s) | Treat (s) | Delta | % | Wins | Significant? | |--------|----------|-----------|-------|---|------|-------------| | **flatmm** | 160.1 | 152.7 | **-7.4s** | **-4.6%** | 6/7 | **YES** (W+=1, p<0.05) | | universal_gemm | 228.4 | 224.7 | -3.7s | -1.6% | 6/7 | Trending (W+=4) | Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5] Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6] ### IR Function Counts (device trace, gfx942) | Target | Metric | Before | After | Delta | % | |--------|--------|--------|-------|-------|---| | **universal_gemm** | InstantiateFunction | 117,715 | 109,165 | **-8,550** | **-7.3%** | | **universal_gemm** | CodeGen Function | 47,912 | 45,044 | **-2,868** | **-6.0%** | | **flatmm** | InstantiateFunction | 100,939 | 95,127 | **-5,812** | **-5.8%** | | **flatmm** | CodeGen Function | 42,651 | 40,367 | **-2,284** | **-5.4%** | Note: The `ford_applier` (commit 3) has minimal additional effect in this PR since ck_tile code does not yet use `static_ford` extensively. Its impact compounds when the follow-up flattening PR #5939 converts 124 `static_for` nests to `static_ford`. Combined results with #5939: flatmm **-7.5%** wall time (p<0.01), CodeGen **-10.5%**. ### ASM Equivalence 7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 + gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale. ## Test plan - [x] `test_ck_tile_static_ford`: 13 behavioral tests (identity/non-identity orders, 1D-4D, unit dimensions, edge cases) - [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map, is_valid_sequence_map edge cases, sequence_unique_sort map round-trip) - [x] ASM equivalence verified (980K lines) - [x] Wilcoxon timing verified (7 trials, flatmm p<0.05) - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
0c7665a331 |
[CK_TILE] Optimize ck_tile::sequence to reduce template instantiation depth [2A] (#5028)
## Summary ### Rationale `ck_tile::sequence` is the most fundamental metaprogramming type in ck_tile — it underpins tensor dimensions, strides, loop bounds, and index calculations. Six of its metafunctions use recursive template instantiation, producing O(N) to O(N²) intermediate types that the compiler must process. When these are used inside deeply nested GEMM pipelines with large dimension counts, the cumulative instantiation overhead becomes a significant contributor to frontend compile time. Measurements on `test_gemm_pipeline_compv6` show 84,288 `InstantiateFunction` calls in the frontend alone. Reducing template instantiation depth in these core utilities has a multiplicative effect because they are called from hundreds of sites. ### What changed | Metafunction | Before | After | |---|---|---| | `sequence::modify` | O(N) recursive split/merge | O(1) pack expansion | | `sequence_gen` | O(log N) recursive binary split | O(1) via `__make_integer_seq` | | `uniform_sequence_gen` | Delegates to `sequence_gen` | O(1) via `__make_integer_seq` | | `sequence_reverse_inclusive_scan` | O(N) recursive | O(1) constexpr for-loop + pack expansion | | `sequence_inclusive_scan` | Computed via reverse + flip | O(1) constexpr for-loop (unified impl) | | `sequence_exclusive_scan` | O(N) recursive merge chain | O(1) constexpr for-loop + pack expansion | | `sequence_map_inverse` | O(N²) recursive modify calls | O(1) constexpr for-loop + pack expansion | Supporting changes: - Portable `__type_pack_element` fallback with `__has_builtin` guard (hipRTC-safe, no `<tuple>` dependency) - Renamed reserved `__integer_sequence` to `integer_sequence_wrapper` - Adopted `static_array` from develop (PR #4355) for constexpr computation - Unified forward and reverse inclusive scan into a single `sequence_inclusive_scan_impl` with `bool Reverse` template parameter - Added `sequence_inclusive_scan` struct (new public API for forward scan direction) - Replaced recursive `sequence_exclusive_scan` (3 template specializations) with `sequence_exclusive_scan_impl` using the same constexpr for-loop pattern as inclusive scan - Rewired `exclusive_scan_sequence` and `prefix_sum_sequence` to use new impl - Added `CK_TILE_HOST_DEVICE` to `exclusive_scan_sequence` and `prefix_sum_sequence` to match sibling scan function annotations ### Technical debt and housekeeping - Unified all `namespace impl` to `namespace detail` across sequence.hpp for consistency - Removed dead comment block (orphaned `integer_sequence` alternative) - Added defensive `static_assert(sizeof...(Is) > 0)` in `sequence_map_inverse::build_inverse` - Converted all multi-line Doxygen blocks from `///` to `/** */` per style guide - Corrected `constexpr static` to `static constexpr` keyword ordering in `static_array` - Added blank line between `#pragma once` and first `#include` in `static_array.hpp` - Trimmed redundant 4-line comment on `sequence_gen_helper` to a one-liner - Moved `sequence_gen` Doxygen comment below `namespace detail` block so it directly precedes the struct it documents - Added Doxygen `@brief`/`@tparam`/`@pre` documentation for `sequence_gen` and `sequence_map_inverse` public APIs - Added `@brief` documentation to `static_array` explaining relationship to `ck_tile::array` - Added scope comment at `namespace detail` openings **Note:** `private:`/`public:` access modifier indentation is enforced at 4 spaces by `.clang-format`. The style guide calls for left-alignment, but the formatter overrides this. Requires a `.clang-format` config change to resolve — not addressable in code. ### `static_array` hardening (from develop's PR #4355) - Added zero-length array guard (`T elems[N > 0 ? N : 1]`) - Added `CK_TILE_HOST_DEVICE` annotations to `operator[]` and `size()` - Added `#include "ck_tile/core/config.hpp"` (IWYU for `CK_TILE_HOST_DEVICE`) ### Value Combined with the `static_ford` changes, measured impact on `test_gemm_pipeline_compv6`: - **Frontend: -28.9%** (InstantiateFunction: 84,288 → 69,439) - **Backend: -13.1%** (CodeGen Functions: 3,170 → 2,203) - **Wall-clock: -16.3%** (611.6s → 512.2s) ### Files changed (4) - `sequence.hpp`: Metafunction optimizations, namespace unification, documentation, style fixes - `static_array.hpp`: Zero-length guard, `CK_TILE_HOST_DEVICE`, documentation, style fixes - `test_sequence.cpp`: 50 unit tests with runtime `EXPECT_EQ` assertions (new file) - `CMakeLists.txt`: Register new test target ## Test plan - [x] 50 runtime unit tests covering all optimized and pre-existing sequence APIs - [x] Edge cases: empty sequences, single-element, larger sizes (N=8), negative values, non-trivial init values - [x] Both functor signatures tested (`operator()(index_t)` and `operator()(number<I>)`) - [x] Both scan reducers (`plus`, `multiplies`) with forward, reverse, inclusive, and exclusive directions - [x] Exclusive scan: sum, product, single, empty, non-zero init - [x] Prefix sum: N+1 output verification, single, empty - [x] Permutation round-trip verification for `sequence_map_inverse` - [x] Full sequence public API coverage: modify, gen, uniform_gen, scans (inclusive, exclusive, prefix sum), map_inverse, make_index_sequence, size/sum/product, push/pop, reverse, extract, merge, arithmetic operators, equality, transform - [x] Portable `__type_pack_element` fallback tested implicitly (same `at_index_t` interface) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> |
||
|
|
9c7b0388a9 |
[CK_TILE] Generate random tensor values with multiple threads (#3324)
[ROCm/composable_kernel commit:
|
||
|
|
0861395425 |
chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template
* Fix build
---------
Co-authored-by: Sami Remes <samremes@amd.com>
[ROCm/composable_kernel commit:
|
||
|
|
1777ce3229 |
[CK_TILE] Enable printing more structures in CK-Tile (#2443)
* Add more printing to core cktile
* Revert other changes in static encoding pattern
* Refactor to using a free print() function
* Remove loops and print just the containers
* Print tuple with better formatting, fix sequence compilation
* Add some tests for print utility
* Add print utility header
* Print for static_encoding_pattern
* add buffer_view printing
* Align vector_traits
* Fix formatting
* Lower-case enum strings
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Remove empty comment lines
* Fix test with lower-case too
* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y
* Add test_print_common.hpp
* add print.hpp in core.hpp
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
[ROCm/composable_kernel commit:
|