mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-28 18:56:59 +00:00
276863ca874bedeee72fa8f46094de085c258aa6
3401 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0edfcf06e5 |
[rocm-libraries] ROCm/rocm-libraries#7894 (commit 5e66689)
[CK] add credentials to docker manifest inspect call ## Motivation This should fix a CI failure we recently hit after exceeding Docker Hub's unauthenticated pull rate limit: [2026-05-29T16:08:42.447Z] + docker manifest inspect --insecure rocm/composable_kernel:ck_ub24.04_rocm7.13 [2026-05-29T16:08:42.833Z] toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
15c904b460 |
[rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a)
ck_tile: add FillUniformScaleDistribution and fix MX GEMM scale init (#7724) ## Summary ### Problem MX GEMM pipeline tests were passing vacuously: scale bytes were drawn from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷ — far below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so numerical checks passed without exercising the GEMM. ### Changes **`include/ck_tile/host/fill.hpp`** — new `FillUniformScaleDistribution<ScaleType>` functor - Accepts human-readable float bounds and maps them to the raw byte range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE 754 exponent into the type's bias space - Sampling is uniform over raw bytes → uniform over representable values - Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to avoid shifting negative signed integers (C++17 UB) - Adds `assert(min_r <= max_r)` to catch inverted-range UB when both bounds exceed the type's representable range - Provides default member values (0.125f, 2.0f) and `std::optional` seed consistent with sibling fillers - `/** */` Doxygen style with `@note` on snapping asymmetry **`test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`** — fix scale initialization - Replace manual byte-range distribution with `FillUniformScaleDistribution<>{0.125f, 2.0f}` - Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid correlated scale tensors that were causing 60 test failures for fp4+e5m3/e4m3 combinations **`test/ck_tile/utility/test_fill.cpp`** — new unit tests for `FillUniformScaleDistribution` - 16 typed tests across e8m0, e4m3, e5m3: validity, range, reproducibility, coverage, snapping, stress, nullopt seed, and range overload - Test helper `expected_raw_range` mirrors implementation clamping exactly |
||
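The bound-to-byte mapping described in the commit above can be sketched for the e8m0 case. This is an illustrative sketch only, not the `FillUniformScaleDistribution` implementation: the helper names are hypothetical, only e8m0 is covered (value = 2^(byte - 127), byte 255 reserved for NaN), and both bounds are snapped downward, which may differ from the snapping the real functor applies.

```python
import math
import random

E8M0_BIAS = 127      # e8m0 stores only a biased exponent: value = 2 ** (byte - 127)
E8M0_MAX_BYTE = 254  # byte 255 encodes NaN, so it is never sampled


def e8m0_value(byte: int) -> float:
    """Decode a raw e8m0 byte into the scale it represents."""
    return 2.0 ** (byte - E8M0_BIAS)


def e8m0_byte(x: float) -> int:
    """Largest e8m0 byte whose value does not exceed x (x must be > 0)."""
    _, exp = math.frexp(x)          # x = m * 2**exp with 0.5 <= m < 1
    return max(0, min(E8M0_MAX_BYTE, (exp - 1) + E8M0_BIAS))


def sample_scale_bytes(lo: float, hi: float, count: int, seed=None):
    """Draw raw bytes uniformly from the byte range covering [lo, hi]."""
    lo_b, hi_b = e8m0_byte(lo), e8m0_byte(hi)
    assert lo_b <= hi_b, "inverted range"
    rng = random.Random(seed)
    return [rng.randint(lo_b, hi_b) for _ in range(count)]


if __name__ == "__main__":
    # Bytes in the old 40-60 range decode to values far below anything FP16 can hold.
    print(e8m0_value(40), e8m0_value(60))
    print(sample_scale_bytes(0.125, 2.0, 8, seed=11941))
```

With the bounds from the commit (0.125 and 2.0), every sampled byte decodes to a scale between 2^-3 and 2^1, so products no longer underflow to zero.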
|
|
fe085f8a69 |
[rocm-libraries] ROCm/rocm-libraries#7761 (commit 237b766)
[CK][CK TILE] Clean up tile_engine grouped_conv harness (#7761) ## Motivation The tile_engine grouped_conv directory contains ML heuristic validation scripts, which confuses new developers. This PR relocates those scripts into the dispatcher/heuristics directory to maintain separation of concerns. ## Technical Details The grouped_conv tile_engine directory is a benchmarking harness for grouped convolution kernels; ML-heuristic content does not belong there. - Move compare_ml_vs_oracle.py and validate_ml_vs_oracle.py from tile_engine/ops/grouped_conv/ to dispatcher/heuristics/validation/grouped_conv/, and rebase their sys.path / oracle CSV / model dir lookups for the new location (CSV path is now an --oracle-csv flag instead of a hard-coded sibling). - Move GROUPED_CONV_HEURISTIC_REPORT.md (system-level ML report) into dispatcher/heuristics/ where the rest of the heuristic docs live. - Rewrite tile_engine/ops/grouped_conv/README.md as a pure benchmarking / dispatcher-sweep doc (kernel enumeration, JIT pipeline, CSV schema, problem registry), in the style of tile_engine/ops/fmha/README.md. All ML training / model-efficiency content is removed and replaced with a pointer to dispatcher/heuristics/. ## Test Plan The validation scripts were re-wired and tested locally. ## Test Result Tests passed on a local machine. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
d5c9215064 |
[rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f)
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC ## Motivation Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async load instruction for tensor A, which complements the existing MX GEMM FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines are enabled on gfx1250. ## Technical Details The existing gfx950 implementation was extended to support gfx1250 architecture. All three MX FP data types are supported across the two ASICs. It should be noted that while the TDM pipeline uses an emulated 32x32x128 warp-tile instruction, the present submission relies on the built-in 16x16x128 instruction, called 4 times per warp. ## Test Plan Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250 functionality. To help facilitate the testing in development, `example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was introduced to verify various combinations of supported data types and pipeline versions. ## Test Result The present submission is expected to work on both gfx950 and gfx1250 hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. - [x] Relies on #6978 and should only be merged after the changes are merged to the `develop`. |
||
|
|
b619c374eb |
[rocm-libraries] ROCm/rocm-libraries#5438 (commit 7000562)
[CK_TILE] Normalize gpu_target before LDS_SIZE_MAP lookup (#5438) GPU targets passed with feature suffixes (e.g. `gfx950:xnack+`) were falling through to `DEFAULT_LDS_SIZE` instead of matching their entry in `LDS_SIZE_MAP`, potentially causing incorrect tile acceptance/rejection. ## Changes - **`gemm_validation_utils.py`**: Strip everything after `:` from `gpu_target` before the `LDS_SIZE_MAP` lookup; use the normalized base arch name in the error message as well.
```python
# Before
hw_lds_size = LDS_SIZE_MAP.get(gpu_target, DEFAULT_LDS_SIZE)

# After
base_gpu_target = gpu_target.split(":")[0] if gpu_target else gpu_target
hw_lds_size = LDS_SIZE_MAP.get(base_gpu_target, DEFAULT_LDS_SIZE)
```
|
||
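A self-contained version of the lookup in the commit above may make the effect easier to see. The map contents and the default below are placeholder values chosen for illustration, not the values in `gemm_validation_utils.py`, and `lookup_lds_size` is a hypothetical wrapper name.

```python
# Placeholder values for illustration only; the real map lives in
# gemm_validation_utils.py and may contain different entries and sizes.
DEFAULT_LDS_SIZE = 65536
LDS_SIZE_MAP = {"gfx942": 65536, "gfx950": 163840}


def lookup_lds_size(gpu_target):
    """Return the LDS size for a target, ignoring feature suffixes such as ':xnack+'."""
    # Keep only the base arch name; None and empty strings pass through unchanged.
    base_gpu_target = gpu_target.split(":")[0] if gpu_target else gpu_target
    return LDS_SIZE_MAP.get(base_gpu_target, DEFAULT_LDS_SIZE)


if __name__ == "__main__":
    for target in ("gfx950", "gfx950:xnack+", "gfx942:sramecc+:xnack-", None):
        print(target, lookup_lds_size(target))
```

Without the `split(":")`, the suffixed targets in this example would miss their map entry and silently receive the default size.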
|
|
8bd8094012 |
[rocm-libraries] ROCm/rocm-libraries#7833 (commit 8a444cd)
[CK] Replace deprecated load_module function in python (#7833) ## Motivation Recent PyTorch builds with Python 3.15 failed in CK because of the deprecated load_module function. This change should fix the issue. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
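The commit above does not show its diff, so the following is only a sketch of the usual standard-library replacement for `load_module`: build a spec, create the module from it, then execute it. The function name `load_module_from_path` is hypothetical, and the code CK actually changed may look different.

```python
import importlib.util
import sys


def load_module_from_path(name, path):
    """Import a Python source file as module `name` without Loader.load_module()."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path!r}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so imports of `name` inside the file resolve.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module
```

A call such as `load_module_from_path("my_config", "/path/to/my_config.py")` returns the loaded module object; the path here is a placeholder.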
|
|
5d912538d3 |
[rocm-libraries] ROCm/rocm-libraries#7847 (commit b995ef2)
[CK] Remove IsPackedTensor function ## Motivation Fix codegen hipRTC. ## Technical Details Remove the unneeded function, since MakeArgument supports long_index_t strides. ## Test Plan Codegen tests. ## Test Result Passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
78d657c4f7 |
[rocm-libraries] ROCm/rocm-libraries#7284 (commit e7d25b2)
[CK_TILE] Integrate CK Tile Dispatcher code generation into CK Tile Profiler (#7284) ## Motivation CK Tile is going to be delivered to hipDNN via CK Dispatcher. Currently the CK Tile Profiler uses CK Builder to generate the profiled instances from the configuration files that identify the instances that old CK exposes. We need to replace this instance generation with the CK Tile Dispatcher codegen. ## Technical Details The old CK Profiler config files are converted to JSON files that the CK Tile Dispatcher can digest. The conversion script for configurations is kept in source control in case we need to update the JSON configurations later. The dispatcher generates instance libraries per conv direction (fwd, bwd data, and bwd weight) that are linked to the CK Profiler executable. I also implemented codegen for the stream-K and depthwise conv instances. The proposed solution replaces the CK Builder codegen with the CK Tile Dispatcher codegen. Two new methods are exposed via the dispatcher backend - `is_supported` - required to enable the profiler workflow where we check the applicability of the kernel instance before running it. - `get_instance_string` - mainly for verification. It provides the CK Builder instance string for verifying that the old CK Builder based profiler and the new CK Tile Dispatcher based profiler have the same instances. The rules that limit the generated instances are now collected in a single location under the dispatcher. The CK Builder codegen uses these, which ensures that the two codegen pipelines are in sync. The next step (different PR) is to remove the CK Builder codegen pipeline altogether. ## Test Plan Verified that the old CK Builder based profiler and the new CK Tile Dispatcher based profiler have the same instances, that is, the Dispatcher based codegen can generate the same instances as the old CK Builder.
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
bf07a0150e |
[rocm-libraries] ROCm/rocm-libraries#7723 (commit 4ed6c51)
[CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels (#7723) ### Motivation The V3 pipeline (qr_async_trload_v3) for fp8bf16 FMHA kernels did not support LSE (Log-Sum-Exp) output. This PR enables LSE output support for fp8bf16 V3 FMHA kernels, allowing users to retrieve attention statistics alongside attention outputs. ### Technical Details - StandardAttention: lse = softmax_scale * m + log(l) - LogitsSoftCap: lse = (m / log2(e)) + log(l) ### Test Plan Run FMHA forward example with fp8bf16 precision and LSE output enabled: - Test 1: Basic LSE functionality ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1 - Test 2: LSE with LogitsSoftCap (CMakeList should remove Logits filter) ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1 -logits_soft_cap=30.0 |
||
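The two LSE formulas in the commit above can be checked numerically against a direct log-sum-exp. The sketch below rests on an assumed reading of the symbols, which the commit does not spell out: for StandardAttention, `m` is the row maximum of the unscaled scores and `l` the sum of `exp(softmax_scale * (s - m))`; for LogitsSoftCap, the already-capped scores are tracked in the base-2 domain, so `m` is a base-2 maximum and `l` a sum of powers of two. All function names are hypothetical.

```python
import math

LOG2E = math.log2(math.e)


def lse_reference(scores, scale=1.0):
    """Direct log-sum-exp of the scaled scores (no stabilization)."""
    return math.log(sum(math.exp(scale * s) for s in scores))


def lse_standard(scores, softmax_scale):
    """lse = softmax_scale * m + log(l), with m taken over the unscaled scores."""
    m = max(scores)
    l = sum(math.exp(softmax_scale * (s - m)) for s in scores)
    return softmax_scale * m + math.log(l)


def lse_log2_domain(scores):
    """lse = m / log2(e) + log(l), with m and l accumulated in base 2."""
    t = [s * LOG2E for s in scores]       # move scores into the base-2 domain
    m = max(t)
    l = sum(2.0 ** (x - m) for x in t)
    return m / LOG2E + math.log(l)


if __name__ == "__main__":
    row = [0.3, -1.2, 2.5, 0.9]
    print(lse_reference(row, 0.125), lse_standard(row, 0.125))
    print(lse_reference(row), lse_log2_domain(row))
```

Both variants agree with the reference up to floating-point rounding, which is the property a host-side check of the kernel's LSE output would rely on.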
|
|
c1aee52d3d |
[rocm-libraries] ROCm/rocm-libraries#7303 (commit 27b6b8c)
Bump urllib3 from 2.6.3 to 2.7.0 in /projects/composablekernel/docs/sphinx (#7303) Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.6.3 to 2.7.0.
## Release notes
Sourced from [urllib3's releases](https://github.com/urllib3/urllib3/releases).
### 2.7.0
#### 🚀 urllib3 is fundraising for HTTP/2 support
[urllib3 is raising ~$40,000 USD](https://sethmlarson.dev/urllib3-is-fundraising-for-http2-support) to release HTTP/2 support and ensure long-term sustainable maintenance of the project after a sharp decline in financial support. If your company or organization uses Python and would benefit from HTTP/2 support in Requests, pip, cloud SDKs, and thousands of other projects [please consider contributing financially](https://opencollective.com/urllib3) to ensure HTTP/2 support is developed sustainably and maintained for the long-haul. Thank you for your support.
#### Security
Addressed high-severity security issues. Impact was limited to specific use cases detailed in the accompanying advisories; overall user exposure was estimated to be marginal.
- Decompression-bomb safeguards of the streaming API were bypassed:
  1. When `HTTPResponse.drain_conn()` was called after the response had been read and decompressed partially. (Reported by [@Cycloctane](https://github.com/Cycloctane))
  2. During the second `HTTPResponse.read(amt=N)` or `HTTPResponse.stream(amt=N)` call when the response was decompressed using the official [Brotli](https://pypi.org/project/brotli/) library. (Reported by [@kimkou2024](https://github.com/kimkou2024))

  See [GHSA-mf9v-mfxr-j63j](https://github.com/urllib3/urllib3/security/advisories/GHSA-mf9v-mfxr-j63j) for details.
- HTTP pools created using `ProxyManager.connection_from_url` did not strip sensitive headers specified in `Retry.remove_headers_on_redirect` when redirecting to a different host. ([GHSA-qccp-gfcp-xxvc](https://github.com/urllib3/urllib3/security/advisories/GHSA-qccp-gfcp-xxvc) reported by [@christos-spearbit](https://github.com/christos-spearbit))
#### Deprecations and Removals
- Used `FutureWarning` instead of `DeprecationWarning` for better visibility of existing deprecation notices. Rescheduled the removal of deprecated features to version 3.0. (urllib3/urllib3#3763)
- Removed support for end-of-life Python 3.9. (urllib3/urllib3#3720)
- Removed support for end-of-life PyPy3.10. (urllib3/urllib3#4979)
- Bumped the minimum supported pyOpenSSL version to 19.0.0. (urllib3/urllib3#3777)
#### Bugfixes
- Fixed a bug where `HTTPResponse.read(amt=None)` was ignoring decompressed data buffered from previous partial reads. (urllib3/urllib3#3636)
- Fixed a bug where `HTTPResponse.read()` could cache only part of the response after a partial read when `cache_content=True`. (urllib3/urllib3#4967)
- Fixed `HTTPResponse.stream()` and `HTTPResponse.read_chunked()` to handle `amt=0`. (urllib3/urllib3#3793)
- Updated `_TYPE_BODY` type alias to include missing `Iterable[str]`, matching the documented and runtime behavior of chunked request bodies. (urllib3/urllib3#3798)
- Fixed `LocationParseError` when paths resembling schemeless URIs were passed to `HTTPConnectionPool.urlopen()`. (urllib3/urllib3#3352)
- Fixed `BaseHTTPResponse.readinto()` type annotation to accept `memoryview` in addition to `bytearray`, matching the `io.RawIOBase.readinto` contract and enabling use with `io.BufferedReader` without type errors. (urllib3/urllib3#3764)
## Changelog
Sourced from [urllib3's changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst). 2.7.0 (2026-05-07): repeats the Security, Deprecations and Removals, and Bugfixes entries listed in the release notes. ... (truncated)
## Commits
(list truncated) |
||
|
|
016f8891de |
[rocm-libraries] ROCm/rocm-libraries#7815 (commit e34ac06)
[CK] fix daily build of CK for all supported targets. ## Motivation Fixes the daily build of CK packages for all supported targets. Previously, if GPU_TARGETS was not specified, we built CK for all supported targets by default. Recently, the MIOpen team requested that the default change to not build at all when no target is specified (to filter out unsupported targets in TheRock). This PR therefore adds the explicit list of targets to our daily builds. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
58e2ab1fc7 |
[rocm-libraries] ROCm/rocm-libraries#6761 (commit d19f6f1)
[CK] Large tensor gemm workaround (#6761) ## Motivation A customer requested large tensor gemm support for 8bit and 4bit data types. Currently CK triggers a “This GEMM not supported” error. The root cause appears to be the 2 GB limit on the input/output matrix, triggered by buffer offset constraints when testing a larger shape such as M = 699,904 (which is an exact multiple of MPerBlock = 256). ## Technical Details Quick workaround to provide support ASAP. Split the tensors into inputs / outputs smaller than the 2 GB limit. Iterate on the host and call all subproblems without device code changes. Support is restricted to row-wise layout in A, Ds and E. All changes were implemented in DeviceGemm structures to avoid secondary effects on grouped convolutions. Got lots of AI generated comments. Addressed the ones that seemed relevant to the functionality. ## Test Plan Within CK the following examples can be used with modified input sizes: example_gemm_multiply_multiply_xdl_fp8 example_gemm_mx_fp4 Tested with Aiter tuning on provided shapes. ## Test Result All gemms run and provide correct results. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> |
||
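The host-side splitting described in the commit above can be sketched as a row-chunking helper. This is a simplified illustration under stated assumptions, not the DeviceGemm code: it considers a single row-major matrix, treats `row_bytes` as the size of one row in bytes, keeps every chunk strictly below 2 GiB, and aligns chunk boundaries to `m_per_block`. The function name and the example row size are hypothetical.

```python
def split_rows(m, row_bytes, m_per_block=256, limit_bytes=2**31):
    """Split M rows into (start, rows) chunks that each stay below limit_bytes."""
    max_rows = (limit_bytes - 1) // row_bytes   # most rows that fit under the limit
    max_rows -= max_rows % m_per_block          # keep chunk starts block-aligned
    if max_rows == 0:
        raise ValueError("a single block of rows already exceeds the limit")
    return [(start, min(max_rows, m - start)) for start in range(0, m, max_rows)]


if __name__ == "__main__":
    # M comes from the commit message; 8192 bytes per row is an assumed example
    # (for instance K = 8192 with a 1-byte element type).
    for start, rows in split_rows(699_904, row_bytes=8192):
        print(start, rows, rows * 8192)
```

Each `(start, rows)` pair would correspond to one sub-problem launched from the host, with the A and E pointers offset by `start` rows.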
|
|
6a17f951ea |
[rocm-libraries] ROCm/rocm-libraries#7714 (commit 13ae6d6)
[CK_TILE] Restructure naive GEMM tutorial and add tile distribution tutorials (#7714) ## Summary - Flatten naive GEMM tutorial directory structure (remove `block_level/`, `host_level/`, `warp_level/` subdirs) to match the composable_kernel repo layout - Add `CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION` macro switch to toggle between standard and transposed WarpGemm variants - Consolidate 6 verbose markdown files (~2600 lines) into one concise README (~120 lines) - Add 3 tile distribution encoding tutorials with step-by-step "How to read Ps/Ys" annotations: - Tutorial 1: A-matrix DRAM load (256×32) — NDimP=2, coalesced K-splitting - Tutorial 2: B-matrix DRAM load (128×32) — same pattern, fewer iterations - Tutorial 3: C-matrix register layout (32×32) — MFMA m32n32k8 hardware output mapping, standard vs transposed - Tile distribution tutorials guarded to build only for gfx942 and gfx950 |
||
|
|
5dc8fbd1a8 |
[rocm-libraries] ROCm/rocm-libraries#6900 (commit 28608c2)
[CK] Fix and expand CK's commit records in version.h (#6900) ## Motivation In `version.h` of a CK installation `CK_COMMIT_ID` would be empty for out-of-source builds. Additionally, if it worked, it would show the parent repo's (`rocm-libraries`) commit. ## Technical Details Dropped "required" constraint so "unknown" string becomes a graceful option. Changed process of determining the CK commit, now uses `WORKING_DIRECTORY`. Thus, `CK_COMMIT_ID` holds only the last CK-relevant commit. Added `CK_PARENT_COMMIT_ID` which holds the parent's, e.g. `rocm-libraries`, commit. This can be the same as `CK_COMMIT_ID`, or not even applicable, depending on the scenario. ## Test Plan Ran CMake configuration and installation of CK to verify happy path. ## Test Result Commit SHA's showed the expected values depending on the repo state. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
4aecc8de5b |
[rocm-libraries] ROCm/rocm-libraries#7442 (commit b7d57ef)
[CK] CompV4: remove redundant barrier (+5.7% gfx942, +1% gfx950) (#7442)
## Summary
- Remove one redundant `block_sync_lds()` from the pong phase of the CompV4 GEMM pipeline hot loop
- The pong phase had 2 barriers while ping had 1. The second pong barrier (after LDS writes, before global loads) was unnecessary because the sync at the top of the next ping iteration already ensures LDS coherence
- Removing this barrier allows global loads to overlap with LDS write drain, restoring the latency hiding the ping-pong design was built to provide
- Abstracting the ping and pong phases into a generic lambda helps avoid repeating this mistake.

## Benchmark
### gfx942 (MI300X), 86 fp16 GEMM shapes
| Metric | Value |
|---|---|
| Improved (>1%) | **80** |
| Neutral (±1%) | **4** |
| Regressed | **2** |
| Average gain | **+5.7%** |
| Best gain | +18.0% (4096x256x16384) |
| Worst regression | -2.9% (12288x3072x4096) |

### gfx950 (MI355X), 86 fp16 GEMM shapes
| Metric | Value |
|---|---|
| Improved (>1%) | **32** |
| Neutral (±1%) | **54** |
| Regressed | **0** |
| Best gain | +9.0% (4096x2048x28672) |

### Top gains by workload
| Shape (MxNxK) | Source | gfx942 BL | gfx942 Opt | gfx942 Gain | gfx950 BL | gfx950 Opt | gfx950 Gain |
|---|---|---|---|---|---|---|---|
| 4096x256x16384 | bloom_fc2 | 38.3 | 45.2 | **+18.0%** | 75.6 | 77.0 | +1.9% |
| 4096x512x22016 | llama2_7b | 77.8 | 90.8 | **+16.7%** | 152.4 | 154.9 | +1.7% |
| 256x1536x7168 | deepseek | 14.4 | 16.7 | **+16.0%** | 27.2 | 28.0 | +2.8% |
| 4096x1024x22016 | llama2_7b | 156.2 | 180.8 | **+15.7%** | 304.8 | 311.6 | +2.2% |
| 4096x1024x16384 | bloom_fc2 | 154.6 | 178.5 | **+15.4%** | 303.1 | 309.5 | +2.1% |
| 4096x4096x22016 | llama2_7b | 371.0 | 412.3 | **+11.1%** | 819.8 | 823.6 | +0.5% |
| 4096x2048x28672 | llama3_8b | 235.5 | 259.5 | **+10.2%** | 530.0 | 577.7 | **+9.0%** |
| 250880x256x4096 | bloom_logits | 289.0 | 335.9 | **+16.2%** | 595.5 | 599.1 | +0.6% |
| 8192x8192x8192 | square | 411.8 | 432.9 | **+5.1%** | 825.1 | 825.8 | +0.1% |
| 7168x4096x8192 | llama70b | 362.9 | 374.7 | **+3.3%** | 775.8 | 782.5 | +0.9% |

## Hardware counter analysis (rocprof-compute, 8192x8192x8192, gfx942)
| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| s_barrier per ping+pong | 5 | 4 | **-1** |
| MFMA Utilization | 47.8% | 55.5% | **+7.7pp** |
| IPC | 0.17 | 0.21 | **+23.5%** |
| MFMA F16 % of peak | 30.6% | 33.5% | **+2.8pp** |
| VALU (instructions) | 41.67M | 41.67M | identical |
| MFMA (instructions) | 65.91M | 65.91M | identical |
| Spill/Stack Read | 8.27M | 8.27M | identical |

All instruction counts are identical: the optimization removed one synchronization point, not any compute instructions.

## Correctness
- gfx942: GPU verification (`-v=2`) passed on 4 shapes (8192x8192x8192, 4096x4096x4096, 22016x4096x4096, 4096x512x28672)
- gfx950: GPU verification (`-v=2`) passed on all 86 shapes |
||
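The gain columns in the benchmark above appear to be the relative change from the baseline (BL) to the optimized (Opt) figure. The sketch below recomputes a few of them under that assumption, and under the further assumption that BL and Opt are higher-is-better throughput numbers; the commit does not state their unit. `gain_percent` is a hypothetical helper.

```python
def gain_percent(baseline, optimized):
    """Relative improvement of `optimized` over `baseline`, in percent (1 decimal)."""
    return round((optimized / baseline - 1.0) * 100.0, 1)


# (shape, gfx942 BL, gfx942 Opt) copied from the "Top gains by workload" table.
ROWS = [
    ("4096x256x16384", 38.3, 45.2),
    ("4096x512x22016", 77.8, 90.8),
    ("8192x8192x8192", 411.8, 432.9),
]

if __name__ == "__main__":
    for shape, bl, opt in ROWS:
        print(f"{shape}: {gain_percent(bl, opt):+.1f}%")
```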
|
|
c24e528481 |
[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76)
[CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags, required to suppress false warnings from the latest staging compiler, are not recognized by older compiler versions and trigger an avalanche of warnings. A previous attempt to suppress them with the -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that the current approach, checking the clang version, works as intended and makes the warnings go away. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
60df79085d |
[rocm-libraries] ROCm/rocm-libraries#7631 (commit d591a7c)
[CK] Grouped Convolution Global Load/Store support (#7631) ## Motivation Grouped Convolution Global Load/Store support to cover large tensor cases. ## Technical Details Utilize global load for grouped convolution forward kernels. Update indexes to use int64. ## Test Plan - test utils - test conv kernels in next PR with instances ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1255 --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> |
||
|
|
00e1d82ae7 |
[rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9)
[CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732)
## Motivation
Fix silent miscompute in the grouped convolution backward-data kernel (`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride > dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast path that dropped all but the first sub-GEMM, producing zeroed slices of `dx` on the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness without giving up the perf gains PR #6208 delivered on stride=1 shapes.

## Technical Details
- Tighten the flat-descriptor fast-path gate to require `arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch, its original purpose). For stride > 1, the implicit GEMM is split into `gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly; routing them through the flat path required dropping all but the first, which was the source of the bug.
- Stride > 1 now falls through to the existing grouped CShuffle path, which packs all sub-GEMMs into one descriptor array and walks them on-device in a single kernel launch. This is the pre-PR-6208 production path; correctness is established and per-dispatch launch count is minimised.
- Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0) intersection in `test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with `gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this intersection (all stride>1 cases used G=2; all G=1 cases used stride=1), which is why PR #6208's regression slipped past CI.

## Test Plan
- `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke tier, runs on every PR via `smart_build_and_test.sh`).
- End-to-end verify (`verify=1`) via `example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes including the original ALMIOPEN-1959 case and a cross-bucket (`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32` buckets.
- ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate via an environment variable: full kernel-family enumeration, winning kernel + its avg_time reported under each gate. 33/41 shapes completed before the sweep was stopped; the remaining 8 were the largest i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape enumeration budget (not relevant to the verdict).

## Test Result
### Correctness
| Test | Result |
|---|:---:|
| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D, includes 3 new regression shapes) | **12/12 PASSED** in 14.18 s |
| `test_grouped_convnd_bwd_data_interface` (API checks) | **PASSED** in 0.28 s |
| ALMIOPEN-1959 stride=2 (`verify=1`) | **PASSED** |
| stride=1 K3 (`verify=1`) | **PASSED** |
| stride=3 K3 `gemms_count=9` (`verify=1`) | **PASSED** |
| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) | **PASSED** |

### Performance (ckProfiler A/B on gfx942 / MI300X)
Comparing the **post-fix gate** (flat path only when `gemms_count_==1`, column "B") vs the **inner-loop variant** that keeps the flat path on stride>1 (column "A") across 25 stride>1 shapes where production picks a `_v1` instance (so the gate actually fires):

| Stride | Shapes | A wins | Tie | B wins | Notes |
|:------:|:------:|:------:|:---:|:------:|---|
| 1 (sanity, gate moot) | 3 | 0 | 3 | 0 | gate doesn't differentiate; A == B as expected |
| > 1 (gate fires) | 25 | **0** | 11 | **14** | B wins +6% to +32%; A never wins |

Highlights from the firing-gate cases:

| Shape (G=1, stride=2 unless noted) | A ms | B ms | B vs A |
|---|---:|---:|---:|
| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) | 0.183 | 0.171 | **B +6%** |
| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) | 0.054 | 0.045 | **B +17%** |
| i2v-010 (N=1, K=C=384, 3×3, 277×209) | 0.174 | 0.125 | **B +28%** |
| Synthetic 50×50 K3 N=32 K=C=256 | 0.131 | 0.088 | **B +32%** |

Why B wins everywhere the gate fires: for `gemms_count = N`, the flat path needs N kernel launches (one per sub-GEMM), while the grouped path loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) × launch-tax is a structural disadvantage A can't recover from.

### Diff
| File | Lines |
|---|---:|
| `include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp` | +14 / −8 (one extra condition + expanded dispatch comment) |
| `test/.../test_grouped_convnd_bwd_data.cpp` | +9 / −0 (3 new shapes) |

## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
6181eb2adf |
[rocm-libraries] ROCm/rocm-libraries#4279 (commit 5b3f4b7)
[CK_TILE] Stream-K XCD remapping (#4279) ## Proposed changes This PR adds support for XCD remapping as detailed in this [document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A). On gfx942, workgroups are typically scheduled round-robin across XCDs, which can lead to poor locality. We will use a remapping to assign workgroups to contiguous tiles in the XCDs improving the locality and the cache hit rate. This is done through a function that computes this contiguous mapping from this [PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have added to the StreamKTilePartitioner. This will require minimal changes to the Stream-K algorithm, only requiring a remap at the time the workgroups are partitioned. Through this approach we can improve the data locality by improving cache hits therefore closing performance gaps that are seen with the default scheduling. There have been unit tests added to verify the function in isolation. This is an optimization that is not specialized to just Stream-K GEMM and can be applied across GEMM. Note: This only applies to the gfx942 as they introduce the XCDs. Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. 
- [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652) 🧑💻 Originally authored by @arai713 --------- Co-authored-by: Astha <astha.rai713@gmail.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com> |
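To illustrate the remapping idea described above, here is a minimal sketch that assumes workgroup `wg_id` is scheduled on XCD `wg_id % num_xcds` (round-robin). The function name `remap_xcd`, its parameters, and the handling of a remainder are assumptions for illustration; this is not the function added to `StreamKTilePartitioner`.

```python
def remap_xcd(wg_id, num_wgs, num_xcds=8):
    """Map a round-robin-scheduled workgroup id to a tile index so that
    workgroups sharing an XCD work on contiguous tiles (illustrative only)."""
    wgs_per_xcd = (num_wgs + num_xcds - 1) // num_xcds  # ceil division
    # Number of XCDs that receive the full wgs_per_xcd workgroups.
    tall = num_wgs % num_xcds or num_xcds
    xcd = wg_id % num_xcds        # XCD this workgroup lands on (assumed round-robin)
    local = wg_id // num_xcds     # position of the workgroup within that XCD
    if xcd < tall:
        return xcd * wgs_per_xcd + local
    # Remaining XCDs hold one workgroup fewer.
    return tall * wgs_per_xcd + (xcd - tall) * (wgs_per_xcd - 1) + local
```

For example, with 16 workgroups and 8 XCDs, workgroups 0 and 8 both run on XCD 0 and are remapped to the adjacent tiles 0 and 1.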
||
|
|
efc063e462 |
[rocm-libraries] ROCm/rocm-libraries#6511 (commit 867bece)
[CK_TILE] Adding steps in Stream-K Tile Engine (#6511) ## Motivation This PR adds step functionality to the Stream-K instance generator in Tile Engine in order to quickly generate instance configurations within a certain max/min range. To complement this, the Stream-K Tile Engine validation file has been updated for more rigorous validation of generated instances. ## Technical Details - Added _generate_values helper to support min/max/step range-based tile config generation, matching Universal GEMM - Added validate_gemm, validate_whole_wg_cover_configuration, validate_cshuffle_epilogue_distribution, and other supporting functions to the Stream-K validation utils, aligning with the validation already present in the Universal GEMM ## Test Plan Tested using the generation in CK Tile Engine ## Test Result All instances were generated and validated correctly. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
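A range-based helper of the kind described above might look like the following sketch. The name `generate_values_sketch` and its exact behavior (inclusive upper bound, error on a non-positive step) are assumptions; the actual `_generate_values` helper in Tile Engine may differ.

```python
def generate_values_sketch(min_value, max_value, step):
    """Return min_value, min_value + step, ... up to and including max_value
    when it lies on the step grid (hypothetical stand-in for _generate_values)."""
    if step <= 0:
        raise ValueError("step must be positive")
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    return list(range(min_value, max_value + 1, step))
```

A tile-size range of min 64, max 256, step 64 would then expand to four candidate values.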
||
|
|
0df3523ef1 |
[rocm-libraries] ROCm/rocm-libraries#6807 (commit ddda8ac)
[CK_TILE] Add save_matrix_txt() and extract HostTensor I/O to free functions (#6807) ## Summary - Extract `loadtxt`, `savetxt`, and `save_matrix_txt` from `HostTensor` member functions into standalone free functions in `host_tensor_io.hpp` (Single Responsibility Principle) - Add `save_matrix_txt()` for writing 2D tensors to space-separated text files with configurable output limit (default 256x256, pass 0 to dump all) - Supports float, int, and int8_t output formats via a `dtype` parameter - Validate dtype early and throw on unsupported values in all three functions - Update callers in `15_fused_moe/main.cpp` to use free function syntax |
||
|
|
66d6714376 |
[rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd)
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388) ## Motivation Improve precision of mxfp4 without performance penalties. ## Technical Details Since the performance of scale MFMAs is the same when neither A nor B is fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the second GEMM, while the types of Q, K, V stay the same. This improves overall precision significantly because fp6 has 32 non-negative values available for P quantization compared to just 8 values for fp4. A compiler bug was found with `__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in LCOMPILER-561), but a workaround seems to fix all failing instances. ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
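The 8-versus-32 count quoted above can be checked by enumerating the non-negative encodings of each format. The sketch below assumes the OCP MX element layouts E2M1 for fp4 and E2M3 for fp6 (both with exponent bias 1 and no Inf/NaN encodings); the commit message does not say which fp6 variant is used, so E2M3 is an assumption.

```python
def non_negative_values(exp_bits, man_bits, bias):
    """Enumerate all non-negative values of a small float format that has
    subnormals and no Inf/NaN encodings (as assumed for MX fp4/fp6)."""
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                values.append(frac * 2.0 ** (1 - bias))        # subnormal
            else:
                values.append((1 + frac) * 2.0 ** (e - bias))  # normal
    return values


FP4_E2M1 = non_negative_values(exp_bits=2, man_bits=1, bias=1)
FP6_E2M3 = non_negative_values(exp_bits=2, man_bits=3, bias=1)
```

Under these assumptions both formats cover a similar range, but fp6 places four times as many quantization levels inside it.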
||
|
|
760f9e1d0a |
[rocm-libraries] ROCm/rocm-libraries#7104 (commit 0fab8d8)
[CK TILE] Unification Work – Add MFMA specialisations for `fp64_t` (#7104) ## Motivation This PR adds two specialisations related to `fp64_t`. ## Technical Details This adds two new specialisations for MFMA dense builtins, and adjusts ABLayout and CLayout to L{K1BM} and L{M1BN}. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
d65ad35b23 |
[rocm-libraries] ROCm/rocm-libraries#7180 (commit 54aed1e)
[CK] Add rocm_ck spec factories: GemmSpec, makeSpec() (#7180) ## What this PR does This is the third PR in the rocm_ck schema stack: 1. **#7150** — Foundation types (DataType, Layout, Args, Ops) 2. **#7163** — Schema engine (Signature, resolve(), ArchProperties) 3. **#7180 (this)** — Spec factories (GemmSpec, makeSpec()) `makeSpec()` is the bridge between user intent and kernel instantiation. It takes a **Signature** (WHAT to compute — operator graph, dtypes, layouts) and a **GemmAlgorithm** (HOW to compute it — tile sizes, pipeline, partitioning) and produces a validated `GemmSpec` — a structural type ready to use as a non-type template parameter. The key property: **every constraint is enforced at compile time.** An invalid GEMM configuration is a compile error, not a runtime crash or silent corruption. The 33 compile-fail tests are the executable specification of what's allowed. ## What's interesting **Physical tensor table.** Not every tensor in a compute graph needs device memory. The intermediate result of `C = A * B` in a fused GEMM+Add+ReLU lives only in registers. `makeSpec()` walks the operator chain and determines which tensors are physical (need Args slots) and which are intermediate. The output is a fixed-layout table: `[lhs, rhs, output, D0?, D1?, scale?]`. **Epilogue composition.** Instead of a combinatorial explosion of named patterns (GemmAdd, GemmAddRelu, GemmMulSilu, ...), the epilogue is a composable chain of ops. `{GemmOp, AddOp, ReluOp}` produces `epilogue_ops = {Add, Relu}` with the bias tensor automatically slotted as D0. Two consecutive AddOps fold into a single Add with two D tensors via CK Tile's parameter pack. **Signature/Algorithm split.** The same Signature can pair with multiple GemmAlgorithms to produce different tuning variants without changing the mathematical result. This is the foundation for the dispatcher — one operation description, many tile configurations. 
## New types | Type | Role | |------|------| | `GemmSpec` | Validated NTTP kernel descriptor — physical tensors, tile geometry, epilogue chain | | `GemmAlgorithm` | User-facing tuning input — tile sizes, pipeline, partitioning, padding | | `EpilogueOp` | NTTP-compatible projection of the Op variant for epilogue chains | | `Dim3` | M x N x K triple for tile geometry | ## Test coverage - **69 unit tests** — happy paths, layouts, dtypes, quantization, epilogue chains, algorithm variants - **33 compile-fail tests** — one per constraint (tile divisibility, INT8 rules, pipeline restrictions, etc.) - **6 schema compatibility baselines** — frozen specs that break if the schema changes --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
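The epilogue composition rule described above can be sketched as a small fold over the operator chain. This is an illustrative model only: ops are represented as plain strings, the function name `build_epilogue` is hypothetical, and the real `makeSpec()` performs this work at compile time in C++. How non-adjacent `Add` ops are treated is an assumption here (they are kept separate).

```python
def build_epilogue(ops):
    """Return (epilogue_ops, num_d_tensors) for a chain such as
    ["Gemm", "Add", "Relu"]. Consecutive Adds fold into one Add."""
    if not ops or ops[0] != "Gemm":
        raise ValueError("the chain is expected to start with a Gemm op")
    epilogue_ops = []
    num_d_tensors = 0
    for op in ops[1:]:
        if op == "Add":
            num_d_tensors += 1  # each Add consumes one extra D tensor
            if epilogue_ops and epilogue_ops[-1] == "Add":
                continue        # fold into the preceding Add
        epilogue_ops.append(op)
    return epilogue_ops, num_d_tensors
```

In this model a GEMM+Add+ReLU chain yields the epilogue `Add, Relu` with one D tensor, and two consecutive Adds yield a single `Add` with two D tensors.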
||
|
|
6a9c03f692 |
[rocm-libraries] ROCm/rocm-libraries#7450 (commit 402dbad)
[CK_TILE] Use Persistent Scheduling for FMHA BWD Group Deterministic (#7450) ## Motivation FMHA BWD group-mode deterministic currently uses a non-persistent scheduler: each `(batch, head, K-row)` work-item is launched as its own block, with no work-stealing across CUs. On uneven workloads (varlen, GQA, many heads with few K-rows) this leaves CUs idle and forces a larger dq_acc workspace than necessary. This PR ports the persistent + deterministic scheduling already used in batch mode to group mode: a fixed-grid kernel that pre-computes per-CU work ranges on the host and uses sparse dq_acc slot indexing so multiple K-rows handled by the same CU share one accumulator slot via intra-CU atomic adds. Stacked on #7331; merge that first. ## Technical Details Single file changed: `ops/fmha/kernel/fmha_bwd_kernel.hpp`. A new `kUsePersistent` path is added to the group-mode deterministic kernel, mirroring the batch-mode persistent scheduler. The host pre-computes a fixed per-CU partition of the total `(batch, head, K-row)` work and packs it into `cu_states[]` so the GPU consumes it in a single launch. Host preparation happens in four steps: 1. Build per-batch `seqstart` prefix sums. 2. Fill per-batch `(sq_w, nc)` with a placeholder `nsplits` (bumped in step 3). 3. Two-pointer scan over CUs to fill `cu_states[c]` (`isplit`, `head_start`, `c_start`, `w_lo`, `w_hi`), accumulating `nsplits[b]` as `max(cs->isplit + 1)`. 4. Compute compact per-batch dq_acc offsets from the finalized `nsplits`. `isplit` is the sparse dq_acc slot index — one CU's multi-K-row writes share slot `ceil(wc_start / denom)`, enabling intra-CU atomic accumulation instead of one slot per K-row. `denom = max(sq_w, target_w)`, splitting two regimes: - `target_w >= sq_w` (large work): `denom = target_w`, intra-CU atomic optimization engaged. 
- `target_w < sq_w` (sub-K-row sharding, multiple CUs sharing one K-row): `denom = sq_w` collapses to per-K-row indexing (`= c_start`), keeping `isplit ∈ [0, nc-1]` and matching the `nsplits_max = ceil(s_k/kN0) = nc` upper bound that #7331's `GetWorkspaceDeviceSizeUpperBound` assumes for group+det. `isplit` is additionally clamped to `nc-1` to absorb empty CUs (rounded-up `wc_start` past the last K-row); they don't write dq_acc on GPU so the slot value is harmless. `nsplits[b]` is accumulated dynamically in step 3 rather than via a closed form so it tightly matches the actual sparse slots used; step 4 (offsets) follows step 3 since offsets now depend on the dynamic `nsplits`. Group mode also allows batches with `seqlen_q == 0`. The persistent scheduler skips them on the dQ path (no work) but dK/dV are still zero-filled. ## Test Plan Built `tile_example_fmha_bwd` with receipt 5 (fp16, no-bias, no-dropout, `dpad == dvpad`, group + batch) on gfx950 (MI355X). - 8-case smoke (shapes that exercise the sub-K-row regime). - 44-case sweep covering: mask 0/1/2, GQA, var seqlen, `d != d_v`, extreme small seqlen / `nc=1`, CU >> work, huge batch, batch-mode regression. - 12-case perf comparison vs the non-persistent baseline (warmup=10, repeat=50). ## Test Result - All 8 + 44 cases `valid:y`. - Perf: ±5% noise, average -0.4% across the 12 cases — neutral. - Batch-mode deterministic / non-deterministic regression unchanged. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
dd0bea4adf |
[rocm-libraries] direct push (commit c51a0ad)
[CK_TILE] Instruction cache POC Copilot review fixes. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> |
||
|
|
8de4cb72fb |
[rocm-libraries] direct push (commit 49b73ad)
[CK][CK_TILE] POC for Instruction Cache prefetch. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> |
||
|
|
c7fac341de |
[rocm-libraries] ROCm/rocm-libraries#4871 (commit 7d4c040)
[CK] Decouple EpilogueArgs from GridwiseGemm implementation (#4871) This is a duplicate of #4537. I could not re-open it since the target branch got deleted, and could not change the target branch since it was closed... :) ## Motivation Right now, all the Epilogue structs are declared inside the base gridwise struct. They should be independent of it, and the specialization of the selected Epilogue Type should be declared within the kernel function. ## Technical Details All Epilogue structs depend on template parameters that are known to the base Gridwise Gemm struct. In this PR, we export them to be used independently by any struct that might need to extract them. This approach will serve the decoupling purposes for the Epilogues, but also enable future constructs to use and expand this approach. See 30e2a4c01b64bdea68857c7badd9d7cffbf1adb9. One remaining issue is that, when implementing a new Epilogue Type, the developer is not forced to decide where this struct can or cannot be used. To fix this I propose defining an `enum struct EpilogueType` that will be used to fetch the Epilogue specialization through a helper struct. See a943ac8d130e12d6843715b322181186e54ba15c. Note that all the instantiation details will stay in this helper struct. Also note the static assertion in the else statement. ## Test Plan Test with existing CI, as nothing is added/removed. ## Test Result All relevant existing CI tests should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> |
||
|
|
b1975951d4 |
[rocm-libraries] ROCm/rocm-libraries#7179 (commit 05edc86)
[CK] Add rocm_ck schema engine: Signature, resolve(), ArchProperties (#7179) ## Summary A `Signature` is a directed compute graph: tensors are nodes, operators are edges. Shared names between operator outputs and inputs form the graph structure. `resolve()` walks this graph at compile time (`consteval`), inferring dtype, rank, and layout for every tensor — invalid configs become compiler errors, not runtime crashes. **Key design decisions:** - **Operators teach the system about tensors.** `GemmOp` implies rank 2 and Row/Col/Row layout. `AddOp` and `ReluOp` propagate from connected slots. The dtype cascade fills in the rest: per-tensor → signature-wide → error. - **Adding a new op is zero lines in the resolution engine** if it's structurally binary (`lhs/rhs/out`) or unary (`in/out`) — C++20 concepts handle dispatch automatically. Only ops with special semantics need explicit branches. - **TargetSet is a compile-time bitset over GPU targets.** The wave tile validation table is the single source of truth for valid instruction shapes, traced from CK Tile's WarpGemmDispatcher. FP8 tiles are available on gfx942+ via IterateK composition, not gfx950-only. **Reading order:** signature.hpp (data model) → arch_properties.hpp (TargetSet, wave tiles) → resolve.hpp (resolution engine). 3 new headers, 3 unit tests (including diamond DAG coverage), 3 compile-fail tests. Introduces tests/compile_fail/ infrastructure. **Stack**: PR 2 of 3 porting the rocm_ck constexpr schema from experimental to production. 1. Foundation types — DataType, Layout, Args, Ops (#7114) 2. **This PR** — Schema engine (graph resolution) 3. Spec factories — GemmSpec, makeSpec() (#7180 ) Note: We also removed `FmhaBwdOp` for clarity, since that was introduced early and doesn't have tests set up. 
**Depends on**: #7114 ## Test plan - [x] ctest --test-dir build --output-on-failure — unit tests + compile-fail tests pass - [x] Compile-fail tests correctly reject: mixed CDNA+RDNA TargetSet, conflicting layouts, empty quantization scale names --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
74bc86240b |
[rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a)
[CK Tile] Add gemm universal preshuffle to MX GEMM (#5647) ## Motivation Add gemm universal preshuffle support to the existing MX GEMM pipeline. The straightforward way to do this is to port the `mx_flatmm` pipeline to the existing `gemm_mx` framework. ## Technical Details The `mx_flatmm` pipeline was not deleted, to preserve backward compatibility. ## Test Plan Add `preshuffle` option to example: `tile_example_mx_gemm`. Add new configurations with enabled preshuffle to the existing `test/ck_tile/gemm_mx` tests. ## Test Result Example and tests were successful on the `gfx950` architecture in the `Alola` cluster. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com> |
||
|
|
ebb97044f4 |
[rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1)
Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664) ## Motivation Incorrect results were introduced for some conv bwd cases. ## Technical Details This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3. ## Test Plan CI ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ALMIOPEN-1959 |
||
|
|
3ea9ce7e37 |
[rocm-libraries] ROCm/rocm-libraries#6567 (commit 753c7a8)
[CK Tile] Adding WMMA wrappers for sparse builtins (#6567) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It is the third in the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801 and https://github.com/ROCm/rocm-libraries/pull/6014) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse WMMA intrinsics. ## Technical Details This change adds new specializations for WMMA sparse builtins. In total, we add 8 WMMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Tests pass locally; waiting for CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
e02c566795 |
[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> |
||
|
|
fc2862d712 |
[rocm-libraries] ROCm/rocm-libraries#6846 (commit 377def4)
[CK_TILE] Add fmha forward hdim 256 support (#6846) ## Motivation Enable Composable Kernel FMHA forward kernel for **hdim=256 BF16** on AMD gfx950 (MI350X). Prior to this change the (256, 256) head-dim configuration either failed to compile, was filtered out by the compatibility rules, or produced incorrect kernel output due to an LDS layout accounting bug. ## Technical Details Four files changed, all to enable hdim=256 BF16 on gfx950. - **`fmha_fwd.py`** — Allow `(256, 256)` in gfx950 compatibility rule; set `(256,256)` BF16 tile to `M0=128, N0=64` (the LDS-feasible shape on gfx950); emit minimal valid instance set for d=256 to bound compile time. - **`fmha_fwd_kernel.hpp`** — Gate Prefill launch path off for d=256 (`PrefillCase = kM0 > 64 && kQKHeaddim < 256`); the double-buffer Prefill variant overflows the 160 KB LDS budget. - **`trload_policy.hpp`** — **Critical correctness fix**: the LDS layout accounting in `GetSmemSize` was wrong (`max(Q, K+S+V)` instead of `max(Q, K) + V + S`), under-allocating LDS and silently corrupting d=256 output (~2% wrong values). - **`trload.hpp`** — Thread `LoadOnce=true` through all d=256 K-LDS descriptors so the compiler picks the matching XOR swizzle period; recompute the S-tile LDS offset to match the corrected `GetSmemSize` formula. 
## Test Plan Built and ran `tile_example_fmha_fwd` on gfx950 (MI350X) with the canonical d=256 BF16 configurations: ```bash cd build && ninja tile_example_fmha_fwd ./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=1 -h=32 -h_k=2 -s=1024 -s_k=1024 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10 -kname=1 -v=1 ./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=8 -h=32 -h_k=2 -s=16384 -s_k=16384 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10 -kname=1 -v=1 ``` ## Test Result ```bash -b=1 -s=1024 [bf16|batch|bhsd] b:1, h:32/2, s:1024/1024, d:256/256, scale_s:0.0625, bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r, fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink, 0.058 ms, 298.42 TFlops, 618.68 GB/s, valid:y -b=8 -s=16384 [bf16|batch|bhsd] b:8, h:32/2, s:16384/16384, d:256/256, scale_s:0.0625, bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r, fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink, 42.797 ms, 822.18 TFlops, 106.63 GB/s, valid:y ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com> |
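The LDS accounting fix in `GetSmemSize` described in this commit can be illustrated by comparing the two formulas it quotes. The sketch below is not the CK implementation: the function names are hypothetical and the byte sizes used with them are made-up inputs, chosen only to show that the old formula can report less LDS than the corrected one.

```python
def smem_size_old(q, k, v, s):
    """Formula the commit describes as wrong: max(Q, K + S + V)."""
    return max(q, k + s + v)


def smem_size_fixed(q, k, v, s):
    """Corrected formula from the commit: max(Q, K) + V + S."""
    return max(q, k) + v + s


def under_allocation(q, k, v, s):
    """Bytes the old formula would fall short by (never negative)."""
    return smem_size_fixed(q, k, v, s) - smem_size_old(q, k, v, s)
```

For instance, with hypothetical sizes Q=64, K=32, V=32, S=16 (in KB), the old formula gives 80 while the corrected one gives 112, so the old accounting would be short by 32. The corrected value is never smaller than the old one, because `max(Q, K) + V + S` is at least `Q` and at least `K + S + V`.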
||
|
|
5ff7497fa7 |
[rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4)
[CK Tile] Fix Grouped Gemm quant mixed precision (#7537) <Migrate from Internal repo PR> `test_ck_tile_grouped_gemm_quant_tensor` failed for the mixed FP8/BF8 cases: std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant, False, True, False>, std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant, False, True, False> GFX1250 produced incorrect results; GFX950 failed to compile BF8+FP8 and gave incorrect results for FP8+BF8. The cause is a wrong ComputeDataType selection. The fix is to consider the original ADataType and BDataType even when ComputeDataType is not void. To resolve the compile error on gfx950, a bf8 x fp8 16x16x32 warp GEMM is added. |
||
|
|
309d823056 |
[rocm-libraries] ROCm/rocm-libraries#7466 (commit cc2861f)
[CK Tile] Enable hardware OOB buffer load offset trick by default (#7466) ## Summary Enables `CK_TILE_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` inside `config.hpp`. ### Background When loading from global memory with an out-of-bounds (OOB) check, CK Tile must suppress invalid lanes. The previous default used a software branch: ```cpp // Old path (oob_conditional_check, no trick) if(!src_thread_element_valid) { return zeros; } return amd_buffer_load_impl(...); ``` This generates divergent control flow: the compiler emits exec-mask save/restore and per-lane comparison SALU instructions, one set per buffer load that touches a padded dimension. ### Change With the trick enabled, invalid lanes are suppressed entirely in hardware: ```cpp // New path (trick enabled) uint32_t shift = src_thread_element_valid ? 0 : 0x80000000; return amd_buffer_load_impl(resource, shift + offset, 0); ``` The `0x80000000` offset overflows the buffer descriptor's declared size, causing the hardware to silently return zero for that lane - no branch, no exec mask manipulation. This matches the behavior of old CK XDL kernels, which use an unconditional load followed by a `v_cndmask` select. ### Expected impact Eliminates ALU overhead from OOB validity branches, which reduces kernel execution time, especially for memory-bound cases. --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> |
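The equivalence between the branch and the offset trick can be checked with a small host-side model. This is a simulation under assumptions, not device code: `buffer_load` is a hypothetical stand-in for a bounds-checked hardware buffer load (returning 0 for any offset past the declared size), offsets are element indices rather than bytes, and the buffer is assumed to be smaller than `0x80000000` elements.

```python
MASK32 = 0xFFFFFFFF


def buffer_load(buffer, offset):
    """Model of a bounds-checked load: out-of-range offsets read as zero."""
    return buffer[offset] if offset < len(buffer) else 0


def load_with_branch(buffer, offset, valid):
    """Old path: software branch on the validity flag."""
    if not valid:
        return 0
    return buffer_load(buffer, offset)


def load_with_offset_trick(buffer, offset, valid):
    """New path: push invalid lanes out of range instead of branching."""
    shift = 0 if valid else 0x80000000
    return buffer_load(buffer, (shift + offset) & MASK32)
```

In this model both paths return the same value for every in-range offset and validity flag, which is the property the change relies on.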
||
|
|
9cf49cd322 |
[rocm-libraries] ROCm/rocm-libraries#7465 (commit 81f1cf0)
[CK TILE] Increase default kPerXdl for grouped convolution instances (#7465) ## Summary Increases the default `kPerXdl` used in CK Tile grouped convolution instance generation for forward, backward-data, and backward-weight operations. ### Changes in `generate_instances.py` - **Larger default `kPerXdl` for all fp16/bf16 tile sizes**: `get_k_mfma()` now returns `32` for `m/nPerXdl = 16` and `16` for `m/nPerXdl = 32`. - **Cap `kPerXdl` to `kPerBlock`**: All three parsers (`parse_fwd_instances`, `parse_bwd_weight_instances`, `parse_bwd_data_instances`) now clamp the computed value with `min(..., k_per_block)` to prevent generating invalid instances where `kPerXdl > kPerBlock`. ### Expected impact Higher `kPerXdl` increases the number of MFMA instructions issued per warp per inner-loop iteration, improving arithmetic intensity and reducing pipeline stall overhead for memory-bound shapes. |
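The selection-plus-clamp logic described above can be sketched as follows. The function bodies and signatures here are assumptions for illustration (the real `get_k_mfma()` in `generate_instances.py` may take different arguments); only the 16 -> 32 and 32 -> 16 mapping and the `min(..., k_per_block)` clamp come from the commit message.

```python
def get_k_mfma_sketch(mn_per_xdl):
    """Default kPerXdl for fp16/bf16 per the commit: 32 for 16-wide tiles,
    16 for 32-wide tiles (hypothetical signature)."""
    defaults = {16: 32, 32: 16}
    if mn_per_xdl not in defaults:
        raise ValueError(f"unsupported m/nPerXdl: {mn_per_xdl}")
    return defaults[mn_per_xdl]


def k_per_xdl(mn_per_xdl, k_per_block):
    """Clamp so that kPerXdl never exceeds kPerBlock."""
    return min(get_k_mfma_sketch(mn_per_xdl), k_per_block)
```

The clamp only takes effect for small `kPerBlock`; for example, a 16-wide tile with `kPerBlock = 16` falls back to `kPerXdl = 16` instead of the invalid 32.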
||
|
|
e7798e9560 |
[rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac)
Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112) ## Motivation The goal of this work is to apply XOR shuffle (swizzle) to the current `comp_async` GEMM pipeline and the `gemm_mx` pipeline. XOR swizzling has been helpful to avoid LDS bank conflicts, as data are redistributed across LDS banks, such that simultaneous threads accessing different rows land on different LDS banks. ## Technical Details A similar approach to the work in the existing eight-waves pipeline was followed. Currently, XOR swizzle support is available for FP8 and BF8 types. FP4 support is also available for MX GEMM. Should the types not match, or should the async vector width be of an unsupported size, then the pipeline falls through to the previously existing ('unswizzled') path. ## Test Plan Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM pipeline. Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for the MX GEMM pipeline. ## Test Result The tests passed successfully in the `Alola` cluster with MI350 hardware. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
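As a rough illustration of why XOR swizzling avoids bank conflicts, the sketch below models an LDS tile in which the bank of an element is its column index modulo the number of banks. The bank count of 8, the one-element-per-bank layout, and the function names are simplifying assumptions; the actual pipelines swizzle at vector granularity with hardware-specific bank counts.

```python
NUM_BANKS = 8  # illustrative value, not the real LDS bank count


def swizzled_col(row, col):
    """XOR the column with the row index so each row is permuted differently."""
    return col ^ (row % NUM_BANKS)


def banks_hit(col, swizzle):
    """Banks touched when NUM_BANKS threads read the same logical column
    from NUM_BANKS consecutive rows."""
    return {
        (swizzled_col(row, col) if swizzle else col) % NUM_BANKS
        for row in range(NUM_BANKS)
    }
```

Without the swizzle, a column-wise access lands on a single bank (a full conflict in this model); with it, the same access is spread over all banks, while each row still holds a permutation of its original columns.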
||
|
|
c31fc4df52 |
[rocm-libraries] ROCm/rocm-libraries#7311 (commit 79d8cae)
[CK Tile Engine] Daily tier sampling for tile engine GEMM (#7311) Summary - Replace uniform random instance sampling (random.shuffle) with scrambled Sobol + Latin Hypercube + maximin space-filling sampling, per the Tile Engine Benchmark Sampling RFC - Add op-weighted budget allocation via new TILE_ENGINE_SAMPLING_TIER=daily CMake knob that auto-distributes 8,000 instances across ops proportional to registered weights in op_weights.json - Emit chosen_instances.json manifests for reproducibility tracking - Consolidate 5 copies of sampling logic into single _apply_sampling() method on the base class Jenkinsfile changes Replace per-op -D *_MAX_INSTANCES=250 with single -D TILE_ENGINE_SAMPLING_TIER=daily in gfx942/gfx950/gfx1201 stages. Budget auto-distributes (8000 total per GPU target). --------- Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com> |
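The op-weighted budget allocation mentioned above could be sketched as a proportional split with largest-remainder rounding. This is an assumption about one reasonable way to do it, not the Tile Engine implementation: the function name, the rounding rule, and the op names and weights in the usage note are hypothetical, and the real `op_weights.json` schema is not shown in the commit message.

```python
def allocate_budget(weights, total=8000):
    """Split `total` instances across ops proportionally to their weights,
    using largest-remainder rounding so the parts sum exactly to `total`."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    exact = {op: total * w / weight_sum for op, w in weights.items()}
    budget = {op: int(share) for op, share in exact.items()}  # floor
    leftover = total - sum(budget.values())
    # Hand the remaining instances to the ops with the largest fractional part.
    by_remainder = sorted(weights, key=lambda op: exact[op] - budget[op], reverse=True)
    for op in by_remainder[:leftover]:
        budget[op] += 1
    return budget
```

With hypothetical weights `{"gemm": 3, "gemm_preshuffle": 2, "stream_k": 1}`, the 8,000-instance daily budget splits into 4000, 2667, and 1333 instances respectively.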
||
|
|
275629fe34 |
[rocm-libraries] ROCm/rocm-libraries#6014 (commit 2f8259d)
[CK Tile] Adding MFMA wrappers for dense builtins (#6014) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It is the second in the series of PRs (after #5801) that add all the necessary MMA builtins as `amdgcn_mma` structs. This PR focuses on dense MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA dense builtins. In total, we add 55 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Tests pass locally; waiting for CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
cad4f236d4 |
[rocm-libraries] ROCm/rocm-libraries#7603 (commit 4ea31a8)
Bump idna from 3.11 to 3.15 in /projects/composablekernel/docs/sphinx (#7603) Bumps [idna](https://github.com/kjd/idna) from 3.11 to 3.15. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/kjd/idna/blob/master/HISTORY.md">idna's changelog</a>.</em></p> <blockquote> <h2>3.15 (2026-05-12)</h2> <ul> <li>Enforce DNS-length cap on individual labels early in <code>check_label</code>, short-circuiting contextual-rule processing for oversized input while staying compatible with UTS 46 usage.</li> <li>Tidy core helpers: hoist bidi category sets to module-level frozensets (avoiding per-codepoint list construction), simplify length checks, and reuse the shared <code>_unicode_dots_re</code> from <code>idna.core</code> in the codec module.</li> <li>Use <code>raise ... from err</code> for proper exception chaining and switch internal string formatting to f-strings.</li> <li>Allow <code>flit_core</code> 4.x in the build backend.</li> <li>Expand the ruff lint set (flake8-bugbear, flake8-simplify, pyupgrade, perflint) and apply the surfaced fixes; pin lint CI to Python 3.14.</li> <li>Add Dependabot configuration for GitHub Actions.</li> <li>Convert README and HISTORY from reStructuredText to Markdown.</li> <li>Reference CVE-2026-45409 for the 3.14 advisory in place of the initial GHSA identifier.</li> </ul> <p>Thanks to Felix Yan, Stan Ulbrych, and metsw24-max for contributions to this release.</p> <h2>3.14 (2026-05-10)</h2> <ul> <li>Removed opportunity to process long inputs into quadratic time by rejecting oversize inputs up-front. Closes a bypass of the CVE-2024-3651 mitigation. 
[CVE-2026-45409]</li> </ul> <p>Thanks to Stan Ulbrych for reporting the issue.</p> <h2>3.13 (2026-04-22)</h2> <ul> <li>Correct classification error for codepoint U+A7F1</li> </ul> <h2>3.12 (2026-04-21)</h2> <ul> <li>Update to Unicode 17.0.0.</li> <li>Issue a deprecation warning for the transitional argument.</li> <li>Added lazy-loading to provide some performance improvements.</li> <li>Removed vestiges of code related to Python 2 support, including segmentation of data structures specific to Jython.</li> </ul> <p>Thanks to Rodrigo Nogueira for contributions to this release.</p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href=" |
||
|
|
720ceb6500 |
[rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f)
[CK Tile] Support multi-vector reads in static encoding patterns (#7528)

## Motivation

The thread-raked / warp-raked / block-raked static tile distribution patterns in `ck_tile` silently produce wrong results when the contiguous tile dimension is larger than `warp_size * vector_size`, because the encoding has no per-thread iteration dimension along X. Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in `ConvConfigComputeV3`, the grouped convolution backward-weight example reports about 50 percent wrong values, with errors starting exactly at the `X0*X1 = 64` boundary. The second pass over the contiguous dim is never performed.

This PR extends the encoding so multi-vector reads in the contiguous tile dimension are supported, while keeping every existing call site bit-for-bit identical.

## Technical Details

Three files changed.

### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`

Add a per-thread X iteration dimension in all three raked specializations:

- `X0 = min(warp_size, XPerTile / X1)`: threads in X dim
- `X1 = min(LargestVec, VecSize)`: vector size per access
- `X2 = XPerTile / (X0 * X1)`: number of X-iters per thread (new)

`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both `make_2d_static_tile_distribution()` and `make_shuffled_2d_static_tile_distribution()`. The new encoding places `X2` in the middle of the Ys iteration list, which preserves reverse symmetry between the regular `<..., X2, X1>` and shuffled `<X1, X2, ...>` encodings.

Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.

### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`

Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch alongside the existing `NDimY == 2` branch. The original branch is byte-for-byte unchanged. Both branches dispatch to the same `transpose_tile2d_impl_in_thread`, whose body has always been NDimY-generic (iterates with `static_for<0, NDimY, 1>` and `number<NDimY>{}`).

### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py`

Removed the two now-obsolete skip guards in `parse_bwd_weight_instances` and `parse_bwd_data_instances`:

```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
    print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
    continue
```

Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating, irregular-load shapes, scalar-per-vector > tile size) are kept untouched.

### Compatibility

Strict. Every existing caller has `X2 == 1` and therefore hits the original encoding path verbatim. No upstream config or pipeline behavior changes.

## Test Plan

The grouped convolution example is the natural exerciser since `GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for both A and B tiles, and all three conv directions share the same `ConvConfigComputeV3`. For each test below we ran:

```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}]
```

with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both code paths:

| Test | M / N / K | VecA/B/C | A path | B path | dtype |
|------|-------------|----------|------------|------------|---------------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir) |
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3 dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128 KB > 64 KB limit), independent of this PR.

For the generator change we ran:

```
python3 generate_instances.py --mode profiler --direction all
```

and verified `Skipping instance ... with multiple warps per continous tile dim` no longer appears (count went from non-zero to 0); other skip categories are unchanged.

`clang-format-18` was applied to both modified `.hpp` files (matches the repo's `.clang-format`).

## Test Result

- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`. Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3 encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3 `correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs `correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining; other skips unchanged.

Sample run output (T4b, bf16, bwd_data):

```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Cursor <cursoragent@cursor.com> |
||
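The `X0` / `X1` / `X2` split quoted in #7528 above can be checked with a few lines of arithmetic. This is a plain-Python sketch, not CK code: the helper name `raked_x_split`, the default `warp_size=64`, and the `largest_vec` parameter (standing in for `LargestVec`, which the commit message does not define) are assumptions made for illustration.

```python
def raked_x_split(x_per_tile, vec_size, warp_size=64, largest_vec=8):
    """Return (X0, X1, X2) using the formulas quoted in the commit message.

    Hypothetical helper: largest_vec is a placeholder for LargestVec,
    whose real value depends on the data type and is not given above.
    """
    x1 = min(largest_vec, vec_size)        # vector size per access
    x0 = min(warp_size, x_per_tile // x1)  # threads along the X dimension
    x2 = x_per_tile // (x0 * x1)           # X iterations per thread (new)
    return x0, x1, x2


if __name__ == "__main__":
    # Contiguous dims of 64, 128 and 256 elements with vector size 1.
    for x_per_tile in (64, 128, 256):
        print(x_per_tile, raked_x_split(x_per_tile, vec_size=1))
```

With a contiguous dimension of 64 the split gives `X2 == 1` (the old path), while 128 and 256 give `X2 == 2` and `X2 == 4`, which is consistent with the `X2` values listed for T4b and T3 in the test table.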
|
|
b5f8bef97f |
[rocm-libraries] ROCm/rocm-libraries#6088 (commit 6ac353c)
[CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088)

## Motivation

This MR makes all the changes required for the unified architecture to be able to deal with packed datatypes, i.e., int4, fp4, fp6, and bf6. The crux is that layout parameters should be interpreted as describing the pure mathematical matrix fragments, while the ext_vectors and tile distribution encodings describe everything in terms of packed datatype units. This matches how packed types are dealt with in ck_tile and should play nicely with the load and store tile ops once we integrate the unified framework into CK tile.

The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and pk_bf6x32_t, which did not exist before. The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size 1 and 2) were extended to make the subscripting operator work as expected.

The layout test was adapted to be compatible with all packed datatypes, and all new intrinsics were added to the test.

This MR adds ALL intrinsics across ALL architectures which use packed datatypes, as well as ALL scale intrinsics:

- mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6)
- mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6)
- wmma_i32_16x16x16_iu4_w32
- wmma_i32_16x16x16_iu4_w32_gfx12
- wmma_i32_16x16x32_iu4_w32_gfx12

## Testing

All intrinsics were tested on all architectures. |
||
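The distinction drawn in #6088 above between mathematical element counts and packed storage units can be illustrated with int4. The following is a generic Python sketch, not CK's implementation: the function names are hypothetical, and the "two values per byte, low nibble first" ordering is an assumption that may not match the layout CK uses.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of int4 values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in int4")
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


if __name__ == "__main__":
    fragment = [-8, -1, 0, 7, 3, -4]   # 6 mathematical elements
    packed = pack_int4(fragment)       # 3 packed storage units
    print(len(fragment), len(packed), unpack_int4(packed) == fragment)
```

In this toy model a layout would be described in terms of the 6 logical elements, while anything that indexes storage works with the 3 packed bytes.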
|
|
458dd0ac4c |
[rocm-libraries] ROCm/rocm-libraries#7130 (commit 9e1e065)
[CK_TILE] Redesign LDS store API with pre-computed window coordinates (+15% MI355X, +6% MI300X) (#7130)

## Summary

- Redesign the LDS store API to separate window creation from memory transfer
- Add `MakeDistributedLdsStoreWindow` factory, `LocalStore` (fast path), and `LocalStoreWithCoordRecompute` (slow path) to the pipeline base class
- Convert CompV3 as the reference implementation
- Document the slow/fast path distinction across core tensor headers

## Motivation

`LocalPrefill` hides a performance cliff: when given a bare `tile_window_with_static_lengths`, it silently reconstructs `tile_window_with_static_distribution` on every call, paying significant VALU overhead (~96 for typical configurations) for XOR coordinate computation. The cost is invisible at the call site.

The new API makes the cost explicit via three verbs:

| Verb | Method | Cost | When to use |
|------|--------|------|-------------|
| **Create** | `MakeDistributedLdsStoreWindow(bare, dstr)` | VALU (once) | Before hot loop, when VGPR budget allows |
| **Store (fast)** | `LocalStore(precomputed_window, tensor)` | 0 VALU for coords | Pre-computed window available |
| **Store (on-the-fly)** | `LocalStoreWithCoordRecompute(bare, tensor)` | VALU per call | VGPR budget tight, or one-shot stores |

Both `LocalStore` and `LocalStoreWithCoordRecompute` enforce correct window types via `static_assert`. `LocalPrefill` is retained for backward compatibility (69 call sites across 6 pipeline files).

## Performance

### 86 Shapes, CompV3_2 (128×128 tile), fp16, RCR layout

**gfx942 (MI300X): 86/86 improved, 0 regressions. Average gain: +6.2%**

**gfx950 (MI355X): 85/86 improved, 1 neutral, 0 regressions. Average gain: ~+15%**

[Figure: pr7130_perf_chart]

| Shape (MxNxK) | Source | gfx942 | gfx950 |
|---|---|---|---|
| 22016x256x4096 | llama2_7b_fc1 | +5.3% | +11.4% |
| 22016x512x4096 | llama2_7b_pfill | +5.9% | +10.9% |
| 4096x512x22016 | llama2_7b_pfill | +7.6% | +28.5% |
| 22016x1024x4096 | llama2_7b_pfill | +6.1% | +10.1% |
| 4096x1024x22016 | llama2_7b_pfill | +7.4% | +17.2% |
| 22016x4096x4096 | llama2_7b_pfill | +5.2% | +9.3% |
| 4096x4096x22016 | llama2_7b_pfill | +6.0% | +9.3% |
| 4096x4096x4096 | llama2_7b_pfill | +5.7% | +10.6% |
| 28672x256x4096 | llama3_8b_fc1 | +5.4% | +12.2% |
| 28672x512x4096 | llama3_8b_pfill | +4.9% | +6.4% |
| 4096x512x28672 | llama3_8b_pfill | +7.4% | +1.5% |
| 28672x2048x4096 | llama3_8b_pfill | +4.9% | +8.6% |
| 4096x2048x28672 | llama3_8b_pfill | +6.4% | +8.4% |
| 28672x8192x4096 | llama3_8b_pfill | +5.4% | +8.0% |
| 7168x1024x8192 | llama70b_pfill | +6.6% | +10.8% |
| 8192x1024x7168 | llama70b_pfill | +6.4% | +11.4% |
| 7168x4096x8192 | llama70b_pfill | +6.2% | +9.6% |
| 16384x256x4096 | bloom_fc1 | +6.4% | +20.3% |
| 16384x512x4096 | bloom_fc1 | +5.8% | +8.5% |
| 16384x1024x4096 | bloom_fc1 | +6.0% | +10.9% |
| 16384x2048x4096 | bloom_fc1 | +5.3% | +10.1% |
| 16384x3072x4096 | bloom_fc1 | +5.5% | +8.8% |
| 16384x4096x4096 | bloom_fc1 | +5.7% | +8.8% |
| 4096x256x16384 | bloom_fc2 | +7.8% | +33.6% |
| 4096x512x16384 | bloom_fc2 | +7.5% | +31.6% |
| 4096x1024x16384 | bloom_fc2 | +7.1% | +17.1% |
| 4096x2048x16384 | bloom_fc2 | +6.9% | +11.0% |
| 4096x3072x16384 | bloom_fc2 | +6.8% | +11.0% |
| 4096x4096x16384 | bloom_fc2 | +6.7% | +10.3% |
| 12288x256x4096 | bloom_inproj | +6.7% | +22.0% |
| 12288x512x4096 | bloom_inproj | +6.2% | +9.8% |
| 12288x1024x4096 | bloom_inproj | +5.9% | +12.4% |
| 12288x2048x4096 | bloom_inproj | +5.8% | +10.1% |
| 12288x3072x4096 | bloom_inproj | +5.4% | +10.1% |
| 12288x4096x4096 | bloom_inproj | +5.7% | +9.1% |
| 250880x256x4096 | bloom_logits | +2.6% | +0.5% |
| 4096x256x4096 | bloom_outproj | +7.1% | +28.4% |
| 4096x512x4096 | bloom_outproj | +6.8% | +27.4% |
| 4096x1024x4096 | bloom_outproj | +6.5% | +21.3% |
| 4096x2048x4096 | bloom_outproj | +5.9% | +13.1% |
| 4096x3072x4096 | bloom_outproj | +5.9% | +12.0% |
| 16x1536x7168 | deepseek | +7.7% | +34.7% |
| 32x1536x7168 | deepseek | +7.7% | +34.9% |
| 64x1536x7168 | deepseek | +7.6% | +31.3% |
| 128x1536x7168 | deepseek | +7.6% | +25.8% |
| 256x1536x7168 | deepseek | +7.7% | +27.9% |
| 512x1536x7168 | deepseek | +7.6% | +29.1% |
| 1024x1536x7168 | deepseek | +7.3% | +28.8% |
| 2048x1536x7168 | deepseek | +6.9% | +20.5% |
| 4096x1536x7168 | deepseek | +6.3% | +11.0% |
| 8192x1536x7168 | deepseek | +6.2% | +11.3% |
| 16384x1536x7168 | deepseek | +6.0% | +9.1% |
| 20480x1536x7168 | deepseek | +4.8% | +9.3% |
| 16x3072x1536 | deepseek | +6.3% | +25.1% |
| 32x3072x1536 | deepseek | +6.4% | +25.3% |
| 64x3072x1536 | deepseek | +6.4% | +24.8% |
| 1024x1024x1024 | square | +5.5% | +18.7% |
| 2048x2048x2048 | square | +6.0% | +19.2% |
| 3584x3584x3584 | square | +5.3% | +11.2% |
| 5120x5120x5120 | square | +6.1% | +10.0% |
| 6144x6144x6144 | square | +5.5% | +9.8% |
| 8192x8192x8192 | square | +6.0% | +8.2% |
| 1024x4608x1024 | midsize | +4.6% | +4.6% |
| 512x18432x512 | midsize | +1.9% | +10.1% |
| 4096x18432x4096 | midsize | +5.8% | +8.8% |
| 320x8192x320 | stablediff | +4.0% | +11.3% |
| 640x2048x640 | stablediff | +4.5% | +14.0% |
| 320x8192x1280 | stablediff | +5.6% | +20.1% |
| 1x1280x8192 | skinny_m1 | +7.7% | +35.3% |
| 1x8192x1024 | skinny_m1 | +6.0% | +20.3% |
| 1x7168x8192 | skinny_m1 | +7.7% | +36.6% |
| 1x8192x3584 | skinny_m1 | +7.3% | +27.9% |
| 1x13312x6656 | skinny_m1 | +7.6% | +30.3% |
| 1x13312x16384 | skinny_m1 | +7.8% | +4.2% |
| 1x16384x6656 | skinny_m1 | +7.5% | +28.7% |
| 1x16384x16384 | skinny_m1 | +7.7% | +2.3% |
| 16x4096x4096 | skinny_m16 | +7.4% | +31.9% |
| 16x22016x4096 | skinny_m16 | +7.5% | +26.5% |
| 16x28672x4096 | skinny_m16 | +7.0% | +15.1% |
| 16384x1280x8192 | skinny_m16 | +5.6% | +8.7% |
| 16384x8192x1024 | skinny_m16 | +4.5% | +8.8% |
| 2048x4096x2048 | mixed | +4.7% | +9.0% |
| 4096x2048x8192 | mixed | +6.8% | +11.0% |
| 8192x4096x4096 | mixed | +5.2% | +10.0% |
| 1x4096x4096 | mixed | +7.4% | +32.4% |
| 1024x1024x4096 | mixed | +7.1% | +27.4% |

### ISA Hot Loop Diff (LBB1_32, per K-iteration, gfx942)

| Metric | Baseline | Optimized | Delta |
|--------|----------|-----------|-------|
| Total VALU | 621 | 500 | **-121** |
| VGPR / SGPR | 512 / 96 | 512 / 96 | unchanged |

### Hardware Counters: Instruction Mix (gfx950, rocprofiler-compute)

Profiled on MI350X, shape 4096×256×16384 (bloom_fc2). Instruction counts are deterministic hardware counters.

| Metric | Baseline | Optimized | Δ |
|--------|----------|-----------|---|
| **VALU instructions/kernel** | 4,642,473 | 987,958 | **−78.7%** |
| **INT32 VALU** | 2,592,786 | 541,129 | **−79.1%** |
| Instructions / wavefront | 39,178 | 24,400 | −37.7% |
| VGPRs (avg) | 98 | 90 | −8% |
| **MFMA instructions** | 2,059,702 | 2,059,702 | **0%** |
| **LDS instructions** | 1,564,891 | 1,564,891 | **0%** |
| **VMEM instructions** | 520,996 | 520,996 | **0%** |

MFMA as fraction of total instructions: **30.7% → 67.5%**. Eliminating ~3.65M redundant VALU instructions, about 2.05M of them INT32 (XOR coordinate recomputation per K-iteration), leaves the scheduler more headroom for MFMA dispatch, directly explaining the benchmark gains. |
||
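The create / store-fast / store-on-the-fly split described in #7130 above is, at heart, hoisting coordinate computation out of the hot loop. The toy Python model below only illustrates that idea; it is not the CK API. All function names, the dictionary standing in for LDS, and the `col ^ (row % 8)` swizzle are assumptions chosen for the sketch.

```python
class CoordCounter:
    """Counts how many swizzled coordinates were computed."""

    def __init__(self):
        self.count = 0


def xor_swizzle(row, col, counter):
    # Toy XOR swizzle: permute the column index using the row index.
    counter.count += 1
    return row, col ^ (row % 8)


def make_store_window(rows, cols, counter):
    """'Create' verb: compute every destination coordinate once."""
    return [xor_swizzle(r, c, counter) for r in range(rows) for c in range(cols)]


def local_store(window, lds, values):
    """'Store (fast)' verb: reuse the precomputed coordinates."""
    for coord, value in zip(window, values):
        lds[coord] = value


def local_store_with_recompute(rows, cols, lds, values, counter):
    """'Store (on-the-fly)' verb: recompute coordinates on every call."""
    local_store(make_store_window(rows, cols, counter), lds, values)


def run(k_iters, rows=2, cols=8):
    tile = list(range(rows * cols))
    fast, slow = CoordCounter(), CoordCounter()
    lds_fast, lds_slow = {}, {}
    window = make_store_window(rows, cols, fast)  # hoisted out of the loop
    for k in range(k_iters):
        values = [v + k for v in tile]
        local_store(window, lds_fast, values)
        local_store_with_recompute(rows, cols, lds_slow, values, slow)
    return fast.count, slow.count, lds_fast == lds_slow


if __name__ == "__main__":
    print(run(k_iters=4))
```

Both paths write identical data; only the number of coordinate computations differs, growing with the iteration count on the recompute path and staying fixed on the precomputed one. The trade-off named in the table (holding the precomputed window costs registers) has no analogue in this model.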
|
|
9565ca21ec |
[rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2)
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)

## Motivation

Integrate the Eight Waves pipeline in MX GEMM.

## Technical Details

- EightWaves pipeline:
  - Add pipeline, policy and block gemm (internally using the existing implementation used by GEMM and ABQuant)
  - Extend support of the EightWaves policy for FP4 (packed types)
- Async pipeline:
  - Fix pipeline with packed scales (requires MRepeat and NRepeat to be contiguous)
  - A block gemm specific to MX GEMM is defined because the distribution encodings have changed
- CShuffle:
  - Add new functionality to support contiguous MRepeat and NRepeat (defined by `TilesPacked`)
- Examples:
  - Refactor examples to easily switch between configurations (similar to GEMM universal)
  - Scale values are generated consistently with other microscale implementations in CK Tile
  - Add configuration for the EightWaves pipeline
- Tests:
  - Unify existing FP8 and FP4 tests
  - Add tests for the EightWaves pipeline
  - Scale values are generated consistently with other microscale implementations in CK Tile

Note: FP6 support for MX GEMM was added later; Eight Waves pipeline support for it will be added in a following PR.

## Test Plan

Add the new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and FP8.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
f01a8cb28d |
[rocm-libraries] ROCm/rocm-libraries#7547 (commit 7e032ad)
[CK] fix daily builds for pytorch (#7547)

## Motivation

This will restore the daily builds that test whether the latest pytorch code can build with the latest CK code (pulled from the standalone CK repo).

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
fad83d9c90 |
[rocm-libraries] ROCm/rocm-libraries#7016 (commit 2b73c00)
[CK] Fix RDNA3 FMHA tile-load paths (#7016)

## Summary

Fix CK tile FMHA paths needed for RDNA3/RDNA4 targets.

## Details

This PR addresses RDNA-specific issues hit while enabling xFormers CK FMHA on gfx11/gfx12:

- On RDNA3, update FMHA P tile handling so the layout consumed by the second GEMM matches the WMMA path.

## Testing

Validated downstream with xFormers CK/FMHA on gfx1201/gfx1151.

```text
pytest --import-mode=importlib -q \
  tests/test_mem_eff_attention.py::test_forward \
  tests/test_mem_eff_attention.py::test_backward \
  tests/test_mem_eff_attention.py::test_dropout_ck

3844 passed, 5244 skipped, 26 warnings
```

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> |
||
|
|
5169cd14a1 |
[rocm-libraries] ROCm/rocm-libraries#7543 (commit 2b735ff)
Fix for #6207 (#7543)

## Motivation

PR #6207 introduced an error. This PR fixes it.

## Technical Details

Adds a path for GFX1250 in `to_string`.

## Test Plan

The test is already included.

## Test Result

The test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
a11f53564f |
[rocm-libraries] ROCm/rocm-libraries#7530 (commit 378e049)
[CK] Fix FMHA sink dispatch when init_sink_value is set (#7530)

## Summary

- Fix `traits.has_sink` in `fmha_fwd_runner.hpp` to also check `init_sink_value != 0`, so the GPU kernel dispatches with sink support when `-init_sink=1` is passed.
- Gate `run_sink_mask_tests` (StreamLLM) and `run_sink_init_tests` (GPT-OSS) behind opt-in flags `-m` and `-g` in `smoke_test_fwd.sh`. These tests require sink=true kernel instances which are excluded by the `BUILD_TESTING` CMake filter (`*_nsink*`), causing unconditional "not supported yet" failures (48 tests in CI). The opt-in flag approach was borrowed from PR #6057.

## Why gate tests instead of compiling sink=true kernels?

The `BUILD_TESTING` filter in `CMakeLists.txt` uses `*_nsink*` glob patterns for the `fwd` and `fwd_splitkv` APIs, excluding sink=true kernel instances from compilation. We chose opt-in flags over widening the filter because:

- **Compile time**: Enabling sink=true kernels doubles the kernel variants for `fwd` and `fwd_splitkv` APIs. The filter exists specifically to reduce CI build times.
- **Incremental enablement**: Sink support (StreamLLM / GPT-OSS) is still maturing. Gating lets teams opt in explicitly (`smoke_test_fwd.sh -g`) while keeping the default CI path fast.
- **Precedent**: splitkv (`-s`) and appendkv (`-a`) tests already follow this opt-in pattern.

## Test plan

- [ ] Run `smoke_test_fwd.sh -g` with sink=true kernels compiled and verify sink-enabled kernels are dispatched
- [ ] Verify `smoke_test_fwd.sh` still passes without `-m` / `-g` flags
- [ ] Confirm CI no longer fails on sink tests (they are now opt-in) |
||
|
|
3727d5220a |
[rocm-libraries] ROCm/rocm-libraries#5652 (commit 7dc7d1d)
[CK Conv] Wavelet gemm pipeline for bwd_weight convolution (#5652)

## Motivation

In the current CShuffleV3 backward weight kernel, the in-kernel conv-to-GEMM transform generates significant INT32 VALU pressure per MFMA instruction. On VALU-heavy shapes (e.g., G=1, 3×3, C=256), these index computation ops compete with MFMA for VALU issue slots, creating a bottleneck that cannot be resolved by pipeline prefetching alone.

This PR adds a wave-specialized ("wavelet") convolution backward weight kernel that splits workgroup threads into two roles:

- **Load waves**: conv-to-GEMM address computation + global memory loads + LDS writes (all VALU/VMEM)
- **Math waves**: LDS reads + MFMA + CShuffle epilogue (no index computation)

By physically separating the two instruction classes onto different waves, VALU and MFMA execute on different hardware functional units without contention.

## Technical Details

**Core kernel (new files):**

- `gridwise_gemm_xdl_waveletmodel_cshuffle_conv_v3.hpp`: wave-specialized gridwise GEMM for conv bwd weight (2-way split: load + math)
- `device_grouped_conv_bwd_weight_xdl_waveletmodel_cshuffle_v3.hpp`: device op following CShuffleV3 patterns; `BlockSize = TileMathThreadGroupSize` for MFMA wave assignment, `LaunchBlockSize = TileLoad + TileMath` for kernel launch

**Wave pipeline (modified):**

- `gridwise_gemm_waveletmodel.hpp`: load/math wave pipeline structs with `sched_group_barrier` scheduling hints to front-load VMEM reads before address-advance VALU

**Two wave ratios:**

- **(4,4)**: 256 load + 256 math = 512 threads (8 waves). Best on large shapes.
- **(4,2)**: 256 load + 128 math = 384 threads (6 waves). Best on small shapes (fewer sync barriers, denser MFMA per math wave).

**Instance coverage (F16 and BF16 symmetric):**

| Ratio | Tiles | Layouts | ConvSpecs |
|-------|-------|---------|-----------|
| (4,4) | M128×N128, M64×N64, M128×N64, M64×N128 | 2D NHWGC, 3D NDHWGC | Default, Filter1x1Stride1Pad0 |
| (4,2) | M64×N64, M128×N64, M64×N128 | 2D NHWGC | Default, Filter1x1Stride1Pad0 |

**Existing wavelet model fixes:**

- `BlockSize` corrected from `math::max(TileLoad, TileMath)` to `TileMathThreadGroupSize` in the flat-GEMM wavelet device op and gridwise kernel

## Test Plan

- `test_grouped_convnd_bwd_weight` GTest: 34 hardcoded test cases covering 1D/2D/3D, F16/BF16, G=1/2/16, various spatial sizes
- Performance benchmark: all 37 RetinaNet bwd_weight shapes on gfx950

```bash
ninja -C build test_grouped_convnd_bwd_weight
./build/bin/test_grouped_convnd_bwd_weight
```

## Test Result

**Correctness:** 34/34 GTest cases passed (F16/BF16 × 1D/2D/3D × Default/Filter1x1Stride1Pad0 × various G/N/K/C combinations).

**Performance:** Wavelet is the fastest overall instance on 12/37 RetinaNet shapes, all of them G=1, 3×3 convolutions with C=256 (the VALU-heavy target shapes):

| Shape | Uplift vs best baseline |
|-------|------------------------|
| K=36, 7×7 | 1.91x |
| K=36, 100×100 | 1.60x |
| K=36, 13×13 | 1.43x |
| K=36, 25×25 | 1.38x |
| K=36, 50×50 | 1.38x |
| K=256, 100×100 | 1.24x |
| K=256, 13×13, s=2 | 1.20x |
| K=256, 25×25, s=2 | 1.20x |
| K=256, 7×7 | 1.17x |
| K=256, 13×13 | 1.13x |
| K=2376, 50×50 | 1.05x |
| K=2376, 100×100 | 1.06x |

Where wavelet does not win (25/37): 1×1 convolutions (explicit kernel does host-side transform), grouped convolutions with small per-group channels, and shapes where standard CShuffleV3 already amortizes VALU overhead.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: jakpiase <jakpia21@gmail.com> |
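The thread counts quoted for the two wave ratios in #5652 above follow from simple arithmetic. The Python sketch below reproduces them under the assumption of a 64-lane wave, which the commit message implies (256 threads for 4 waves) but does not state; the function and key names are hypothetical, chosen to mirror `BlockSize` and `LaunchBlockSize` as described.

```python
WAVE_SIZE = 64  # assumption: wave64, consistent with 4 waves == 256 threads


def wavelet_block(load_waves, math_waves, wave_size=WAVE_SIZE):
    """Thread-group sizes for a (load_waves, math_waves) wavelet split."""
    load_threads = load_waves * wave_size
    math_threads = math_waves * wave_size
    return {
        # BlockSize = TileMathThreadGroupSize (used for MFMA wave assignment)
        "block_size": math_threads,
        # LaunchBlockSize = TileLoad + TileMath (used for the kernel launch)
        "launch_block_size": load_threads + math_threads,
        "waves": load_waves + math_waves,
    }


if __name__ == "__main__":
    for ratio in ((4, 4), (4, 2)):
        print(ratio, wavelet_block(*ratio))
```

Under that assumption the (4,4) ratio launches 512 threads with a math group of 256, and the (4,2) ratio launches 384 threads with a math group of 128.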