composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	7c2b979de2	[rocm-libraries] ROCm/rocm-libraries#8573 (commit 04c9f1d) [CK][CK Tile] Drop profiler for experimental builder codegen (#8573) ## Motivation Switch to dispatcher profiler for ck tile conv. ## Technical Details - Switch to dispatcher profiler for ck tile conv. - Drop profiler for experimental codegen - Minor fixes for bwd data printing - Minor fixes for 3d conv in dispatcher codegen ## Test Plan test_grouped_conv*tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 09:38:44 +00:00
Ville Pietilä	60b276647b	[rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39) [CK Tile] Rule-based configuration generation in CK Dispatcher codegen (#8157) ## Motivation The CK Tile Dispatcher code generation for CK Tile Profiler relies on flat JSON files to list the generated configurations. This approach has the following problems - The JSON files are verbose - The JSON files easily get out of sync with the CK Builder .config files from which they were generated. - The JSON file based configuration makes it hard to list explicitly the rules that govern the instance generation. ## Technical Details Replaced the JSON files with a rule based configuration. To preserve the existing functionality, the `profiler` and the `tests` instance sets are generated directly from the CK Builder config files. The JSON config files are removed from source control, and the "on-the-fly" generation guarantees that the Dispatcher codegen uses up to date configurations. This PR introduces six different rule sets for the CK Tile Dispatcher code generation 1. `profiler`: matches with the old JSON set of profiler configurations. 2. `tests`: matches with the old JSON set of tests configurations. 3. `full`: full configuration set created from a rule-based config selection 4. `full-tests`: a subset of `full` for generating configurations for convolution integration tests. 5. `tiny`: a subset of `full-tests` to produce the minimal set of configurations to test the Dispatcher codegen. 6. `default`: the default rules, which correspond to the existing heuristic rules for configuration selection. This ensures that ML based kernel selection doesn't get broken. The main use of the `full` rule set is to define a reasonable solution space for the possible implicit GEMM configurations. We start from the configurations that are allowed by the device architecture. 
The `full` rule set defines the relevant tile sizes for each convolution direction. From the tile size we have a curated mapping to the number of waves over the different GEMM axes, i.e., we describe how many waves each GEMM dimension corresponds to. The GEMM-K wave tile dimension can be computed from the other parameters and does not need to be listed explicitly. An orthogonal axis to the tiling strategy is the vectorization strategy. This is mainly defined by the data type and hardware since, in general, we want to use the maximum possible load widths. The maximum sizes for each convolution direction variant are defined by the implicit GEMM matrix dimensions. For cases where we have a low number of channels per convolution group, we need smaller vector load sizes. These are captured by the `VecStrategy` enumeration in the codegen rules. The problem with the rule based configuration selection is that we "over generate" configurations. The old JSON configurations compose approximately 25% of all configurations that the `full` rule set creates. The additional configurations are valid, but they may not provide any performance benefits. Hence, we keep the `profiler` and `tests` rule sets for now to avoid building an excessive number of configurations by default. The `full` rule set can be taken into use by specifying the CMake configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the `tests` rule set is used, i.e., we don't change the existing behaviour. ## Test Plan Added a new stage in the CI/CD pipeline that ensures the Dispatcher codegen rules are up to date. Otherwise the functionality is covered by the existing CI/CD tests. There are no functional changes to the convolution kernels, only to how the different instances are generated. ## Test Result If the CK Tile conv instances build without errors, the Dispatcher codegen is generating valid code. If all tests in the CI/CD pipeline are passing, the Dispatcher codegen generates valid instances. 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 01:22:50 +00:00
Johannes Graner	01cca38c8e	[rocm-libraries] ROCm/rocm-libraries#8220 (commit 4c04a3a) [CK Tile] WAVELET pipeline for backward-data grouped convolution (#8220) ## Motivation On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was ~18% behind classic CK, with the gap concentrated in the K=2376 3x3 detection-head family where bwd_data spends most of its time. The WAVELET GEMM pipeline already gives uplift for forward and backward-weight conv; this ports it to backward-data and consolidates the now-shared machinery across all three directions. ## Technical Details - Backward-data wavelet support in the tile kernel: launch extra load waves when the pipeline exposes `LaunchBlockSize`, and split the epilogue into math waves (run the CShuffle epilogue) and load waves (`RunBarrierStub`). - Register 7 WAVELET instances (fp16 and bf16), tuned for backward-data's tall-skinny GEMM rather than the forward tile shapes: a big-M `256/128/64` workhorse, a `VecA=4` variant for the `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped (depthwise-style) shapes. - Implement the native backward-data instance parser in `generate_instances.py`. - Deduplicate the wavelet machinery shared by forward, backward-data, and backward-weight: `GroupedConvLaunchBlockSize`, `is_wavelet_pipeline`, and `RunWaveletAwareEpilogue` in `grouped_convolution_utils.hpp`; the three native instance parsers collapse to one parameterized parser. The three kernels now call the shared helpers. ## Test Plan - Rebuild the full profiler instance pools for all three directions (fp16/bf16/fp32, nhwgc/ndhwgc) to exercise the shared helpers across every instantiation. - Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`, `test_grouped_convnd_bwd_data_tile`, `test_grouped_convnd_bwd_weight_tile`. 
- Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK and the non-wavelet tile pool (`profile_wavelet_bwd_data.py`); correctness spot-checked with GPU-reference verification on the new big-M and NumGroupsToMerge instances. ## Test Result - GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6. - Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 -> 1.11, i.e. the tile path goes from ~12% slower than classic to ~8% faster. The largest single backward-data shape (256x100x100->2376) moves from 11% slower than classic to 12.5% faster. - The dedup refactor preserves behavior (net -174 lines across the kernels/generator), confirmed by the full rebuild and the GTests above. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-13 00:10:50 +00:00
Johannes Graner	0b3c297ee2	[rocm-libraries] ROCm/rocm-libraries#8009 (commit 26ab70d) [CK Tile] Add WAVELET pipeline for forward grouped convolution (#8009) ## Motivation CK Tile forward grouped convolution trails classic CK on 3x3 convolutions whose output-channel count is not divisible by 8, where the narrow output store limits the compute CShuffle epilogue. This ports the WAVELET pipeline (added for backward-weight in #7937) to the forward kernel to close that gap. ## Technical Details - Kernel (`grouped_convolution_forward_kernel.hpp`): WAVELET load/math-wave wiring, mirroring the backward-weight implementation; the non-WAVELET path is unchanged. - Generator: implement `parse_native_fwd_instance`, the forward native-instance parser. - Registered WAVELET instances: profiler bf16 3 / fp16 5, tests 1 each. WAVELET requires input channels divisible by 8 (it does not apply to depthwise). The bf16/fp16 instance asymmetry is intentional and measured: the VecC=8 tiles never beat the compute pool in bf16 but win about 20% of divisible-by-8 3x3 shapes in fp16, so VecC=8 is registered for fp16 only. ## Test Plan - Correctness (CPU reference) for every registered profiler instance, across VecC variants. - Per-shape best-instance performance sweep over the 34 RetinaNet shapes (bf16) and a 200-shape cross-model sweep (bf16 and fp16), compared against classic CK. ## Test Result - Correctness: PASS for all instances. - RetinaNet (bf16, vs classic CK): faster on 28 of 34 shapes, geomean +19.5%; the not-divisible-by-8 shapes up to 3.7x. One 1x1 stride-2 shape stays ~20% behind classic CK, unrelated to WAVELET. - Cross-model (200 shapes): WAVELET wins 3x3 not-divisible-by-8 in both dtypes (up to 61% over the next-best compute instance); for divisible-by-8 3x3 it wins about 20% of shapes in fp16 (3-11%) and none in bf16. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. 
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 08:57:39 +00:00
Johannes Graner	b7c8fb164f	[rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d) [CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight instances (#7937) ## Motivation CK Tile had no pipeline competitive with old CK's wavelet on the RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive kernel for spatial small-K shapes. ## Technical Details - New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`): workgroup split into math waves (LDS read + MFMA) and load waves (DRAM read + LDS write). - VGPR role-split: `operator()` has two top-level mutually-exclusive `is_math` branches so the allocator overlays both roles onto the same physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy. Correctness depends on identical `block_sync_lds` counts on both arms plus a matching load-wave barrier stub in the epilogue (`cshuffle_epilogue.hpp`). - Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`): `kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub. Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36 C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27% (googlenet K=320 by 16-23%); the role-split roughly halves the parity gap vs old CK on the 13x13 fp16 shape. ## Test Plan - `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16, `split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a broad 2D sweep. - Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64} (barrier-parity / deadlock check). - `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet instances. ## Test Result - All wavelet instances CPU-verify correct across the split-K sweep; no hangs (dual-arm barrier sequence matches). - Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better vs old CK on the majority of spatial shapes. 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-02 08:51:17 +00:00
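The barrier-parity requirement from #7937 (both the math arm and the load arm must reach the same number of synchronization points) can be illustrated with a host-side toy. This is only an analogy under stated assumptions: Python threads stand in for waves, `threading.Barrier` stands in for `block_sync_lds`, and the names `run_wavelet_toy`, `load_wave`, and `math_wave` are hypothetical and do not appear in CK Tile.

```python
import threading


def run_wavelet_toy(tiles):
    """Toy producer/consumer split with matching barrier counts on both arms."""
    barrier = threading.Barrier(2)  # stand-in for block_sync_lds (assumption)
    lds = [None]                    # one-slot stand-in for shared memory
    out = []

    def load_wave():
        for tile in tiles:
            lds[0] = tile           # "DRAM read + LDS write"
            barrier.wait()          # sync 1: tile is visible to the math arm
            barrier.wait()          # sync 2: math arm has finished reading

    def math_wave():
        for _ in tiles:
            barrier.wait()          # sync 1
            out.append(lds[0] * 2)  # "LDS read + compute"
            barrier.wait()          # sync 2

    threads = [threading.Thread(target=fn) for fn in (load_wave, math_wave)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return out


result = run_wavelet_toy([1, 2, 3])
```

Each arm waits exactly twice per tile. If one arm skipped a wait, the other would block indefinitely, which is the kind of hang the split-K sweep in the test plan checks for.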
Ville Pietilä	78d657c4f7	[rocm-libraries] ROCm/rocm-libraries#7284 (commit e7d25b2) [CK_TILE] Integrate CK Tile Dispatcher code generation into CK Tile Profiler (#7284) ## Motivation CK Tile is going to be delivered to hipDNN via CK Dispatcher. Currently the CK Tile Profiler uses CK Builder to generate the profiled instances from the configuration files that identify the instances that old CK exposes. We need to replace this instance generation with the CK Tile Dispatcher codegen. ## Technical Details The old CK Profiler config files are converted to JSON files that the CK Tile Dispatcher can digest. The conversion script for configurations is kept in source control in case we need to update the JSON configurations later. The dispatcher generates instance libraries per conv direction (fwd, bwd data, and bwd weight) that are linked to the CK Profiler executable. I also implemented codegen for the stream-K and depthwise conv instances. The proposed solution replaces the CK Builder codegen with the CK Tile Dispatcher codegen. There are two new methods that are exposed via the dispatcher backend - `is_supported` - required to enable the profiler workflow where we check the applicability of the kernel instance before running it. - `get_instance_string` - this is mainly for verification. It provides the CK Builder instance string for verifying that the old CK Builder based profiler and the new CK Tile Dispatcher based profiler have the same instances. The rules that limit the generated instances are now collected in a single location under the dispatcher. The CK Builder codegen uses these, which ensures that the two codegen pipelines are in sync. The next step (different PR) is to remove the CK Builder codegen pipeline altogether. 
## Test Plan Verified that the old CK Builder based profiler and the new CK Tile Dispatcher based profiler have the same instances, that is, the Dispatcher based codegen can generate the same instances as the old CK Builder. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-28 21:03:37 +00:00
jakpiase	309d823056	[rocm-libraries] ROCm/rocm-libraries#7466 (commit cc2861f) [CK Tile] Enable hardware OOB buffer load offset trick by default (#7466) ## Summary Enables `CK_TILE_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` inside `config.hpp`. ### Background When loading from global memory with an out-of-bound (OOB) check, CK Tile must suppress invalid lanes. The previous default used a software branch: ```cpp // Old path (oob_conditional_check, no trick) if(!src_thread_element_valid) { return zeros; } return amd_buffer_load_impl(...); ``` This generates divergent control flow: the compiler emits exec-mask save/restore and per-lane comparison SALU instructions, one set per buffer load that touches a padded dimension. ### Change With the trick enabled, invalid lanes are suppressed entirely in hardware: ```cpp // New path (trick enabled) uint32_t shift = src_thread_element_valid ? 0 : 0x80000000; return amd_buffer_load_impl(resource, shift + offset, 0); ``` The `0x80000000` offset overflows the buffer descriptor's declared size, causing the hardware to silently return zero for that lane - no branch, no exec mask manipulation. This matches the behavior of old CK XDL kernels, which use an unconditional load followed by a `v_cndmask` select. ### Expected impact Eliminates ALU overhead from OOB validity branches, which reduces the kernel execution time, especially for memory-bound cases. --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2026-05-21 12:06:40 +02:00
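The equivalence of the two load paths in #7466 can be checked with a small host-side model. This is a sketch, not the CK Tile or hardware API: `buffer_load` is a hypothetical stand-in that returns zero for any offset past the end of the buffer (modelling the descriptor range check), and offsets are assumed to be smaller than `0x80000000` so that adding the shift cannot wrap back into range.

```python
OOB_SHIFT = 0x80000000  # offset bump taken from the commit's snippet


def buffer_load(buffer, offset):
    """Hypothetical model of a range-checked load: out-of-range reads yield 0."""
    return buffer[offset] if offset < len(buffer) else 0


def load_with_branch(buffer, offset, valid):
    """Old path: a software branch suppresses invalid lanes."""
    if not valid:
        return 0
    return buffer_load(buffer, offset)


def load_with_offset_trick(buffer, offset, valid):
    """New path: invalid lanes are pushed out of range, so the load itself returns 0."""
    shift = 0 if valid else OOB_SHIFT
    return buffer_load(buffer, (shift + offset) & 0xFFFFFFFF)  # 32-bit offset


data = [10, 20, 30, 40]
lanes = [(off, valid) for off in range(len(data)) for valid in (True, False)]
branch_results = [load_with_branch(data, off, valid) for off, valid in lanes]
trick_results = [load_with_offset_trick(data, off, valid) for off, valid in lanes]
```

Both paths return the same value for every lane in this model. The difference the commit targets is only in how the zero is produced: by a branch in the first case, by the range check in the second.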
jakpiase	9cf49cd322	[rocm-libraries] ROCm/rocm-libraries#7465 (commit 81f1cf0) [CK TILE] Increase default kPerXdl for grouped convolution instances (#7465) ## Summary Increases the default `kPerXdl` used in CK Tile grouped convolution instance generation for forward, backward-data, and backward-weight operations. ### Changes in `generate_instances.py` - Larger default `kPerXdl` for all fp16/bf16 tile sizes: `get_k_mfma()` now returns `32` for `m/nPerXdl = 16` and `16` for `m/nPerXdl = 32`. - Cap `kPerXdl` to `kPerBlock`: All three parsers (`parse_fwd_instances`, `parse_bwd_weight_instances`, `parse_bwd_data_instances`) now clamp the computed value with `min(..., k_per_block)` to prevent generating invalid instances where `kPerXdl > kPerBlock`. ### Expected impact Higher `kPerXdl` increases the number of MFMA instructions issued per warp per inner-loop iteration, improving arithmetic intensity and reducing pipeline stall overhead for memory-bound shapes.	2026-05-21 12:05:09 +02:00
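The default-and-clamp rule described in #7465 can be sketched as follows. This is a minimal illustration, assuming `get_k_mfma` takes only the `m/nPerXdl` value; the real signature in `generate_instances.py` is not shown in the commit message, and the helper `k_per_xdl` is a hypothetical name.

```python
def get_k_mfma(mn_per_xdl):
    """Default kPerXdl for fp16/bf16 as stated in the commit: 16 -> 32, 32 -> 16."""
    defaults = {16: 32, 32: 16}
    return defaults[mn_per_xdl]


def k_per_xdl(mn_per_xdl, k_per_block):
    """Clamp the default so that kPerXdl never exceeds kPerBlock."""
    return min(get_k_mfma(mn_per_xdl), k_per_block)


unclamped = k_per_xdl(16, 64)   # default fits inside the block
clamped = k_per_xdl(16, 16)     # default of 32 is capped to kPerBlock
```

The clamp matters only for small `kPerBlock` values, where the larger default would otherwise describe an instance that the commit calls invalid.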
JH-Leon-KIM-AMD	720ceb6500	[rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f) [CK Tile] Support multi-vector reads in static encoding patterns (#7528) ## Motivation The thread-raked / warp-raked / block-raked static tile distribution patterns in `ck_tile` silently produce wrong results when the contiguous tile dimension is larger than `warp_size * vector_size`, because the encoding has no per-thread iteration dimension along X. Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in `ConvConfigComputeV3`, the grouped convolution backward-weight example reports about 50 percent wrong values, with errors starting exactly at the `X0X1 = 64` boundary. The second pass over the contiguous dim is never performed. This PR extends the encoding so multi-vector reads in the contiguous tile dimension are supported, while keeping every existing call site bit-for-bit identical. ## Technical Details Three files changed. ### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp` Add a per-thread X iteration dimension in all three raked specializations: - `X0 = min(warp_size, XPerTile / X1)` — threads in X dim - `X1 = min(LargestVec, VecSize)` — vector size per access - `X2 = XPerTile / (X0 X1)` — number of X-iters per thread (new) `X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both `make_2d_static_tile_distribution()` and `make_shuffled_2d_static_tile_distribution()`. The new encoding places `X2` in the middle of the Ys iteration list, which preserves reverse symmetry between the regular `<..., X2, X1>` and shuffled `<X1, X2, ...>` encodings. Patterns updated: `thread_raked`, `warp_raked`, `block_raked`. ### 2. `include/ck_tile/core/tensor/transpose_tile.hpp` Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch alongside the existing `NDimY == 2` branch. The original branch is byte-for-byte unchanged. 
Both branches dispatch to the same `transpose_tile2d_impl_in_thread`, whose body has always been NDimY-generic (iterates with `static_for<0, NDimY, 1>` and `number<NDimY>{}`). ### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py` Removed the two now-obsolete skip guards in `parse_bwd_weight_instances` and `parse_bwd_data_instances`: ```python if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector): print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.") continue ``` Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating, irregular-load shapes, scalar-per-vector > tile size) are kept untouched. ### Compatibility Strict. Every existing caller has `X2 == 1` and therefore hits the original encoding path verbatim. No upstream config or pipeline behavior changes. ## Test Plan The grouped convolution example is the natural exerciser since `GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for both A and B tiles, and all three conv directions share the same `ConvConfigComputeV3`. 
For each test below we ran:
```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}]
```
with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both code paths:

| Test | M / N / K   | VecA/B/C | A path     | B path     | dtype               |
|------|-------------|----------|------------|------------|---------------------|
| T1   | 16/64/32    | 4/8/4    | old (X2=1) | old (X2=1) | fp16                |
| T2   | 128/128/64  | 2/2/2    | old (X2=1) | old (X2=1) | fp16                |
| T3   | 256/256/64  | 1/1/1    | old (X2=1) | new (X2=4) | fp16                |
| T5   | 256/256/64  | 1/1/1    | old (X2=1) | new (X2=4) | fp16 (3 dir)        |
| T4b  | 128/128/128 | 1/1/1    | new (X2=2) | new (X2=2) | fp16 + bf16 (3 dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128 KB > 64 KB limit), independent of this PR.

For the generator change we ran:
```
python3 generate_instances.py --mode profiler --direction all
```
and verified `Skipping instance ... with multiple warps per continous tile dim` no longer appears (count went from non-zero to 0); other skip categories are unchanged. `clang-format-18` was applied to both modified `.hpp` files (matches the repo's `.clang-format`).

## Test Result
- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`. Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3 encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3 `correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs `correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining; other skips unchanged.
Sample run output (T4b, bf16, bwd_data): ``` shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32 pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0 Vector size A: 1, Vector size B: 1, Vector size C: 1 0.934907 ms, 8.34683 TFlops, 34.3178 GB/s Relative error threshold: 0.00390625 Absolute error threshold: 0.25 The CPU verification result is: correct ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-20 17:25:22 +03:00
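The `X0`/`X1`/`X2` split listed in #7528 can be reproduced with plain integer arithmetic. This is a sketch under stated assumptions: `raked_x_split` is a hypothetical helper, `warp_size=64` is assumed from the `X0X1 = 64` boundary mentioned in the motivation, and `largest_vec=8` is a placeholder because the commit message does not give the value of `LargestVec`.

```python
def raked_x_split(x_per_tile, vec_size, warp_size=64, largest_vec=8):
    """Return (X0, X1, X2) for one contiguous tile dimension.

    X1: vector size per access
    X0: threads along X
    X2: X-iterations per thread (the dimension added by the PR)
    """
    x1 = min(largest_vec, vec_size)
    x0 = min(warp_size, x_per_tile // x1)
    x2 = x_per_tile // (x0 * x1)
    return x0, x1, x2


split_128_vec1 = raked_x_split(128, 1)  # matches the X2=2 rows of the test table
split_256_vec1 = raked_x_split(256, 1)  # matches the X2=4 rows
split_128_vec2 = raked_x_split(128, 2)  # X2 == 1, so the original encoding applies
```

Whenever `x_per_tile` is at most `warp_size * vec_size`, `X2` comes out as 1, which corresponds to the unchanged code path the commit describes for existing callers.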
Bartłomiej Kocot	cc5c79a1e7	[rocm-libraries] ROCm/rocm-libraries#5904 (commit f4e261a) [CK][CK Tile] Grouped Conv Backward Weight Streamk instances (#5904) ## Motivation Add streamk instance to grouped convolution backward weight profiler. ## Technical Details - New instances for grouped conv backward weight with streamk ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-05-16 10:49:18 +02:00
Bartłomiej Kocot	067e5e0ca4	[rocm-libraries] ROCm/rocm-libraries#6838 (commit ff7a665) [CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32) (#6838) ## Motivation CK currently has no kernel optimized for depthwise convolution (G=C_in=C_out, C=K=1 per group) and existing generic paths perform poorly for this workload. This PR adds a dedicated depthwise conv forward kernel in CK Tile. ## Technical Details Adds a dedicated depthwise conv2d forward op to CK Tile that performs direct convolution rather than falling back to the generic GEMM path. The kernel is templatized by filter size, stride, and data type, and compiled into ~60 instances covering common configurations (kernel 3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and RDNA (gfx1100/gfx1200) architectures. ## Test Plan - [x] Correctness and performance validated on gfx942, gfx950, and gfx1100, with ckProfiler `grouped_conv_fwd` as baseline. - [ ] MI300A (gfx942) and gfx1200 validation. ## Submission Checklist - [x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1137 --------- Co-authored-by: GenDu <Gen.Du@amd.com>	2026-05-15 15:47:55 +02:00
Bartłomiej Kocot	9348b8eb82	[rocm-libraries] ROCm/rocm-libraries#5646 (commit 05680a4) [CK_TILE] Add conv bwd data tests (#5646) ## Motivation This PR adds tests for CK Tile's convolution backward data operation to enable functionality regression tracking and error-detection. ## Technical Details Currently only NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK(2 dim and 3 dim channel-last) layouts are being tested, since only they are implemented in CK Tile. Current tests support FP16, BF16 and FP32 datatypes and various different convolutions scenarios. The tested instances are listed in `experimental/grouped_convolution_tile_instances` directory. ## Test Result All implemented tests are working properly and passing. --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com> Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>	2026-04-21 21:49:19 +00:00
Ville Pietilä	9e28c5ffea	[rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda) [CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516) ## Motivation We want to close the performance gap between old CK and CK Tile for bwd data convolutions. To achieve this, we need two things - Configurations for the old CK kernel instances such that we can map them into CK Tile instances. - Support in CK profiler to run the CK Tile instance with the same API as for old CK instances. ## Technical Details Extracted kernel configurations from old CK. The codegen python script for CK Tile convs is extended to support also bwd data. The generated instances are added to the CMake build (target `device_grouped_conv_bwd_data_tile_instances`). A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the CK Profiler. The API is the same as for old CK's profiler op `grouped_conv_bwd_data`. --------- Co-authored-by: Ville Pietilä <>	2026-03-25 14:34:13 +00:00
Bartłomiej Kocot	8acb392dcd	[rocm-libraries] ROCm/rocm-libraries#5609 (commit 95afb2c) [CK][CK Tile] Move grouped conv cpp instances to build dir (#5609) ## Motivation Move grouped conv .cpp instances to build dir. Fix generate instances script. ## Technical Details Avoid CI problem when instances in experimental directory are not removed ## Test Plan test_grouped_convnd_*_tile ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 13:19:31 +00:00
Bartłomiej Kocot	e92c207845	[rocm-libraries] ROCm/rocm-libraries#5470 (commit fe3405d) [CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470) ## Motivation Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile window. ## Technical Details - Fix dram step for KM/KN layouts in V1 pipeline - Disable instances which use more threads than warp size in the contiguous dim (not supported in ck tile yet) - Use 1x1 specialization for explicit gemm - Use two stage for vector size == 1 and sizeof(datatype) == 2 - Remove unneeded check, since GetVectorSizeA/B checks whether the vector size is fixed ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-966	2026-03-19 11:59:44 +00:00
Bartłomiej Kocot	8ebabd19d2	[rocm-libraries] ROCm/rocm-libraries#5387 (commit 0c259bd) [CK][CK Tile] Grouped Convolution Backward Weight set of fixes (#5387) ## Motivation Grouped Convolution Backward Weight split k fixes for CK tile kernels ## Technical Details - get k batch from kargs to get deduced k batch - multiply zeroing size by data type size - disable v6 (produces incorrect results) ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result Pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Ville Pietilä <>	2026-03-13 10:18:19 -06:00
Ville Pietilä	3d298d13e4	[rocm-libraries] ROCm/rocm-libraries#5237 (commit ef10dc6) [CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK Tile profiler (#5237) ## Motivation PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The two-stage kernels were not supported in the initial PR. This PR adds the missing bwd weight two-stage kernels to the CK Profiler. ## Technical Details Extended the CK Tile conv builder factory to build also the elementwise ops required for the two-stage kernels. Extended the CK Builder for CK Tile instance to accept the two-stage flag as part of the algorithm configuration. ## Test Plan Added unit tests for CK Builder that verify the two-stage kernel construction. ## Test Result If CI passes, the added unit tests are passing. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Ville Pietilä <>	2026-03-12 19:20:15 -06:00
Bartłomiej Kocot	3741885b52	[rocm-libraries] ROCm/rocm-libraries#5114 (commit 59b8cb5) [CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114) ## Motivation Improve profiling for grouped convolution forward for better comparison between CK and CK Tile ## Technical Details - Include preprocessing time for ck tile - Add flush cache for conv fwd profiler - Switch configs to builder reflect - Add KPerXdl deduce - Add non-grouped ported instances ## Test Plan test_grouped_convnd_fwd_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-786	2026-03-11 23:38:15 +01:00
Bartłomiej Kocot	e262252c4c	[rocm-libraries] ROCm/rocm-libraries#5115 (commit a21861e) [CK][CK Tile] Add grouped conv backward weight tile test and fix tr load in BASE_V1 pipeline (#5115) ## Motivation Test grouped conv backward weight from ck tile and fix incorrect values. ## Technical Details - Add test for CI - Add daily tests - Fix transpose load in BASE_V1 pipeline ## Test Plan test_grouped_convnd_backward_weight_tile ## Test Result in progress ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-783	2026-03-10 03:03:04 +00:00
Ville Pietilä	68b0f420ba	[rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400) [CK_TILE] Add CK Tile bwd weight profiler (#4797) ## Motivation To compare old CK and CK Tile, we need to extend the current CK profiler to support running also CK Tile instances with the same API. In order to have the same instance coverage in CK Tile compared to the old CK, I've added code generation from old CK configurations to CK Tile instances using the CK Builder. ## Technical Details - The codegen python script for CK Tile fwd convs is extended to support also bwd weight and bwd data. - The generated instances are added to the CMake build (target `device_grouped_conv_bwd_weight_tile_instances`). - A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to the CK Profiler. --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2026-03-04 21:49:42 +00:00
Bartłomiej Kocot	d244aaa1c0	[rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6) [CK][CK TILE] Improve oob check (#4791) ## Motivation Improve OOB checks. Remove permutes which have been generated by thread buffer zero clear. The assembly now contains only condmask instead of permute + condmask. Change the number of KPack for generated instances ## Technical Details Remove permute instructions from assembly ## Test Plan test_grouped_convnd_fwd_tile ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com>	2026-02-24 22:40:48 +01:00
Bartłomiej Kocot	2dd2f114b3	[rocm-libraries] ROCm/rocm-libraries#4407 (commit adde219) [CK][CK TILE] Add has hot loop check for pipeline v1 ## Motivation Add has hot loop check for pipeline v1 (v1 basic and v1 basic async). Enable more tests which have been fixed by this change. ## Technical Details Hot loop has been executed without num loop check. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-651 AICK-663	2026-02-11 13:43:01 +00:00
Bartłomiej Kocot	6d6ee8f023	[rocm-libraries] ROCm/rocm-libraries#4457 (commit 258a459) [CK][CK Tile] Temporary disable grouped conv fwd tile comp async instances (#4457) ## Motivation [CK][CK Tile] Temporary disable grouped conv fwd tile comp async instances due to the failures ## Technical Details Disable configs so that these instances are not compiled ## Test Plan test_grouped_convnd_fwd_Tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-11 01:52:59 +00:00
Bartłomiej Kocot	27e0a34e0f	[rocm-libraries] ROCm/rocm-libraries#4406 (commit 61f9f90) [CK] CK Tile grouped convolution direct load ## Motivation CK Tile grouped convolution forward direct load support. ## Technical Details Basic pipeline for direct load and new instances for forward for v1 and v4 pipelines. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-130	2026-02-09 21:09:42 +00:00
jakpiase	731afe535a	[rocm-libraries] ROCm/rocm-libraries#4357 (commit ff3e982) [CK_TILE] Add support and tests for V6 pipeline in conv fwd (#4357) Added support for the conv v6 pipeline in CK Tile's convolution forward kernel. The CK Tile v6 pipeline is the equivalent of old CK's V5 pipeline and should be faster than other pipelines in some cases. This PR also adds tests inside the profiler, which is currently in the experimental directory, so regressions should now be easier to detect.	2026-02-08 19:57:53 +00:00
Bartłomiej Kocot	1ae83137eb	Enable Grouped Conv Tile Fwd Tests daily (#3680)	2026-01-31 15:55:25 -07:00
Bartłomiej Kocot	3d67e6c492	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt	2026-01-27 11:04:11 +02:00
Robin Voetter	cc75948d1c	[CK_BUILDER] conv bwd weight testing (#3618) * ck-builder: restructure testing conv In order to prepare for bwd of conv testing, this commit moves some files and types around so that we can reuse ckt::Args for both forward and backwards convolution. * ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp This will allow us to more easily include fwd.hpp from backwards definitions, which is required for initializing bwd values. * ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3 Turns out that the supplied layout isn't actually supported... * ck-builder: ck and reference conv integration for bwd weight * ck-builder: ck bwd weight execution test * ck-builder: ckt::run support for ck-tile bwd weight * ck-builder: ck tile bwd weight execution test * ck-builder: extra debug printing in MatchesReference * ck-builder: make ckt::run return RunResult This type is more convenient than std::tuple, as it will allow us to use google test matchers with this in the future. * ck-builder: RunResult matcher Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error message about how and why running an algorithm failed. * ck-builder: doc fixes * ck-builder: add missing headers	2026-01-26 23:50:15 +01:00
Bartłomiej Kocot	0727e85e52	[CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518) * [BULDER] Add grouped conv fwd ck tile profiler * [CK TILE] Fix grouped conv kernels splitk and double lds * Updates * Fixes * Move to ckProfiler * Fixes * fix * fix * Change instances to empty list by default * fix * fix * Update grouped_convolution_signatures.hpp * Update grouped_convolution_forward_tile_algs.hpp * [CK TILE] Add grouped convolution forward tests (#3556) * [CK TILE] Add grouped convolution forward tests * fix jenkins * fixes * comments fixes * unit test * unit test fix * Move instances outside builder * fix includes * clang format fix * readme fix * fix includes * fixes	2026-01-19 22:29:01 -07:00

29 Commits