mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-10 16:28:38 +00:00
Replace nested static_for lambdas with compile-time search helper (#6696) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary - Add `sequence_find_value` and `find_in_tuple_of_sequences` compile-time search helpers with O(1) template depth - Replace nested `static_for` lambdas in `TensorDescriptor::GetTransformAndItsUpperDimension` and `InitializeElementSize` - Apply same optimizations to `TensorAdaptor` Supersedes #4287. Conflict-resolved rebase of ROCm/composable_kernel#3600 onto current develop. ## Motivation The `TensorDescriptor` and `TensorAdaptor` classes had excessive template instantiation from: 1. Nested `static_for` loops with lambdas creating unique closure types at every call site 2. `generate_tuple` with lambdas causing per-type instantiation overhead The new helpers use constexpr array lookup and pack expansion instead of recursive template patterns, achieving O(1) template depth. ## Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved, `-j1`, `-ftime-trace`) | TU | Baseline (mean) | New (mean) | Delta | Wilcoxon p | Mann-Whitney p | |----|-----------------|------------|-------|-----------|---------------| | `grouped_conv_fwd_xdl_fp16` (host) | 14,886 ms | 13,353 ms | **-10.3%** | **0.002** | **0.0002** | | `grouped_conv_fwd_xdl_fp16` (device) | 27,762 ms | 25,629 ms | **-7.7%** | **0.002** | **0.0002** | | **Total (all TUs)** | **57,732 ms** | **54,030 ms** | **-6.4%** | | | Unrelated TUs (`device_memory`, `host_tensor`, `convolution_parameter`) show no significant difference (p > 0.3), serving as negative controls. ### Methodology - 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) on the same node to eliminate ordering/warmup bias - Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U test (unpaired) - Built with patched clang (LLVM 22) on ctr2-alola-compile-11, `-j1` for accurate per-TU timing - Raw data available in Slurm job 275230 results ## Test plan - [x] 11 unit tests added (5 for `sequence_find_value`, 6 for `find_in_tuple_of_sequences`) - [x] Compile-time benchmark with statistical significance (p < 0.01) - [ ] Full CI Tracking issue: #4229