4050 Commits

Author SHA1 Message Date
assistant-librarian[bot]
4d2856612c Merge commit '4c2c18ef486641d1493f3dc272a1e0e079676308' into develop 2026-01-22 02:55:52 +00:00
Michał Kulikowski
04f7e1fce4 [CK][Examples] Extending support for rdna3/4 part 4: (#3264)
* [CK][Examples] Extending support for rdna3/4 part 4:
-example_gemm_xdl_streamk
-example_gemm_xdl_fp16_fp8_v3
-example_gemm_xdl_fp16_v3

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

* [CK][Examples] Revert example\01_gemm\gemm_xdl_streamk parameters change.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

---------

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 4c2c18ef48]
2026-01-21 18:10:16 -08:00
assistant-librarian[bot]
aadd581b8d Merge commit '1040d9b1f53945867d78d0bbcf03de65ee01aea3' into develop 2026-01-21 18:24:44 +00:00
Robin Voetter
2b54a86c04 [CK_BUILDER] Replace reference conv with old ck implementation (#3604)
* ck-builder: remove SPATIAL_DIM parameter from ConvTensorLayouts

This information is already in the SIGNATURE, so it's pointless to pass it
separately. This streamlines the interface of those functions a bit. Also
touches up the style of those files in general.

* ck-builder: implement reference conv using old ck

The old ck implementation is more featureful and better tested.

* ck-builder: replace test_reference_execution reference with old ck

This strips out the ck-tile gpu reference implementation completely.

* ck-builder: clean up test_reference_execution

- Remove unnecessary messages
- Replace EXPECT_TRUE(true) with EXPECT_NO_THROW()

[ROCm/composable_kernel commit: 1040d9b1f5]
2026-01-21 19:18:47 +01:00
andrew clark
5a27de45e5 Sanitizing URL-encoded characters from the image file name (#3622)
[ROCm/composable_kernel commit: 0fbb3bb8c4]
2026-01-21 11:00:53 -07:00
assistant-librarian[bot]
579d2eb5fb Merge commit 'f41f37da969d8f0dbcf590b72e5ac8e74e8846b6' into develop 2026-01-21 16:34:17 +00:00
Yi DING
0bb1c90674 Add CMakePresets.json (#3284)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: f41f37da96]
2026-01-21 08:04:24 -08:00
assistant-librarian[bot]
8fbde9114b Merge commit 'fcc9372c009c8e0a23fece77b582da83b04a654f' into develop 2026-01-21 02:52:11 +00:00
Yi DING
a0935f7669 [CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615)
[ROCm/composable_kernel commit: fcc9372c00]
2026-01-21 09:54:46 +08:00
assistant-librarian[bot]
b2c76ff10f Merge commit 'd5ae81b2922773f7cdf4a02a2e1fd57d0e4df851' into develop 2026-01-20 22:14:29 +00:00
Erwin Terpstra
b079841b10 Implement batched gemm add relu gemm add for rdna4 (#3391)
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation

* wip: many fixes in implementation of batched gemm gemm multiple d

* wip: batched gemm gemm multiple d gridwise op compiling, not working yet

* fix: incorrect d0 grid indexing in batched gemm gemm multiple d

* feat: add instances for batched gemm add relu gemm add

* chore: configure instance with low vector transfer size for odd sizes

* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense

* fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes

* fix: disable odd size tests on XDL archs

* chore: removed temporary logging

* chore: update some references to C tensor to E tensor

* Tentative fix for example template params

* Tentative fix for non-multi-D batched gemm gemm device impl.

* Tentative fix for xdl example template params

* Tentative fix for profiler build on gfx90a

* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions

* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply

* fix: make the gemm1 data types match what happens in the device op

* feat: add d0s/d1s datatypes and layouts to the device op type string

* chore: change element-wise op so addition happens in fp32

* chore: add static asserts for gemm0/gemm1 calculated wave sizes

* chore: also updated other element-wise ops to use fp32 calculations

* chore: log number of supported instances

* chore: update instance comment

* chore: disable kernel timing in example by default

* fix: gemm1 wave size calculation

* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions

* chore: remove increased tolerance in batched gemm gemm multiple d example

* chore: add comment explaining that verification fails for certain input values

* chore: clarify instance comment

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: d5ae81b292]
2026-01-20 13:06:59 -08:00
assistant-librarian[bot]
5f61470a1f Merge commit '91b4102a59c6013d3faeb54f250cf577b2f129ce' into develop 2026-01-20 19:35:23 +00:00
Max Podkorytov
8b842250da Add persistent async input scheduler for GEMM kernels (#3520)
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
  chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks

Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
  chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
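
As an illustration of the modulo-wraparound formula quoted above, here is a minimal Python sketch. It assumes the `/` in the formula is integer division on non-negative indices, as it would be for unsigned integers in C++. The function name `chunk_index` and the sample values are hypothetical and do not come from the commit.

```python
def chunk_index(tile_idx, tile_idx_pivot, tiles_per_chunk, num_chunks):
    """Map a tile index to the chunk whose signal it would wait on.

    Mirrors the formula from the commit message:
        chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
    Integer division is an assumption made for this sketch.
    """
    return ((tile_idx + tile_idx_pivot) // tiles_per_chunk) % num_chunks


# Hypothetical configuration: 2 tiles per chunk, 4 chunks, pivot of 3 tiles.
indices = [chunk_index(t, tile_idx_pivot=3, tiles_per_chunk=2, num_chunks=4)
           for t in range(8)]
print(indices)  # [1, 2, 2, 3, 3, 0, 0, 1]
```

With a non-zero pivot the first tiles start in a later chunk, and the modulo wraps the sequence back to chunk 0 once the last chunk has been passed.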

[ROCm/composable_kernel commit: 91b4102a59]
2026-01-20 10:37:09 -08:00
assistant-librarian[bot]
58bb88f499 Merge commit '8f75869408210cb85e9eb7ff639c4c9dad1331cb' into develop 2026-01-20 18:17:53 +00:00
Linjun-AMD
e227e837be Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)
This reverts commit 217ac48fd83deef3d0d5084815689e8c79958cc1.

[ROCm/composable_kernel commit: 8f75869408]
2026-01-20 09:40:54 -08:00
Estevan Vedovelli
8e5475654b Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598)
* Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions

Enables hipTensor to access the WMMA HW functionalities
for these combinations of datatype on gfx11 and gfx12.

* Fix change to contraction scale tests

* Fix clang-format

[ROCm/composable_kernel commit: 7d8bca7ddc]
2026-01-20 09:39:57 -08:00
assistant-librarian[bot]
6a0cbcb01d Merge commit '4d58c70e6cf76ce6cb40aa6035ebccbb28493f71' into develop 2026-01-20 17:18:34 +00:00
Cong Ma
364ad3d521 [CK TILE GEMM] Add bf8 support to tile engine streamk generator (#3543)
[ROCm/composable_kernel commit: 4d58c70e6c]
2026-01-20 10:01:33 -07:00
assistant-librarian[bot]
a7320b9717 Merge commit '6300ad3c62298dc6fdddfcf19ecd074f7f08fa96' into develop 2026-01-20 16:18:17 +00:00
music-dino
750bd72b3d Batched gemm softmax gemm descriptor fix (#3564)
* Add rocm to prefix path for codegen

* Fix issue with c0_matrix_mask construction

[ROCm/composable_kernel commit: 6300ad3c62]
2026-01-20 07:25:30 -08:00
assistant-librarian[bot]
43058803dc Merge commit 'b09121f86066381f3662fdbdee6a810849a8a1a7' into develop 2026-01-20 10:16:09 +00:00
Wojciech Laskowski
6ad65bc855 WMMA support for batched_gemm_reduce (#3332)
Summary:
- added new device impl of Batched GEMM Reduce for WMMA
- added instance library
- added WMMA impl to the Batched GEMM Reduce tests

[ROCm/composable_kernel commit: b09121f860]
2026-01-20 10:50:46 +01:00
assistant-librarian[bot]
38c7251ed1 Merge commit '0727e85e523aac7a1e82af00f44081cc67f5cde0' into develop 2026-01-20 06:20:32 +00:00
Bartłomiej Kocot
85c5741492 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BUILDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes

[ROCm/composable_kernel commit: 0727e85e52]
2026-01-19 22:29:01 -07:00
assistant-librarian[bot]
895404d62b Merge commit '0517d43d312356c62cc33bea4f0ecc5613e87079' into develop 2026-01-20 00:37:44 +00:00
Cong Ma
c42cd28370 [CK TILE] remove dependency on std chrono (#3599)
* [CK TILE] remove dependency on std chrono

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 0517d43d31]
2026-01-19 15:31:02 -08:00
Linjun-AMD
ecda0fe2e9 [CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: f3aafb9555]
2026-01-19 15:22:33 -08:00
assistant-librarian[bot]
b5bde883eb Merge commit '98abfa4ade0f7b5204adf4da00e95be9453dce74' into develop 2026-01-19 21:13:18 +00:00
Max Podkorytov
8bd33c4a35 Optimize clang-format check in Jenkins CI (#3597)
This change improves the clang-format CI check to be faster and not
depend on git being available in the build environment.

Changes:
- Use `find` instead of `git ls-files` (no git dependency)
- Check all C++ files: *.h, *.hpp, *.cpp, *.h.in, *.hpp.in, *.cpp.in, *.cl
- Exclude build/ and include/rapidjson directories
- Use parallel processing with 8 cores (-P 8) for ~8x speedup
- Show only errors with unified diff format (-u)
- Clear error messages: "ERROR: <file> needs formatting"
- Preserve original logic: run clang-format only when RUN_CPPCHECK=false,
  or run both clang-format and cppcheck when RUN_CPPCHECK=true
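
The file selection described in the list above can be sketched in Python for illustration. This is not the CI script, which per the message uses `find` with parallel workers. The function name `files_to_check` is hypothetical, and treating `build/` and `include/rapidjson` as paths relative to the repository root is an assumption.

```python
from pathlib import Path

# File name endings listed in the commit message.
SUFFIXES = (".h", ".hpp", ".cpp", ".h.in", ".hpp.in", ".cpp.in", ".cl")
# Excluded directories, assumed to be relative to the repository root.
EXCLUDED = (("build",), ("include", "rapidjson"))


def files_to_check(root):
    """Return sorted relative paths of files that would be format-checked."""
    root = Path(root)
    selected = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip anything located under an excluded directory.
        if any(rel.parts[:len(ex)] == ex for ex in EXCLUDED):
            continue
        if path.name.endswith(SUFFIXES):
            selected.append(rel.as_posix())
    return sorted(selected)
```

Calling `files_to_check(".")` from a checkout would list the candidate files. Each one could then be passed to clang-format by a pool of workers.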

Performance:
- Sequential processing: ~93 seconds for 5,899 files
- Parallel with 8 cores: ~12 seconds for 5,899 files
- Per-file processing time: ~15ms
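
The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in the message. The helper name `summarize` and the notion of "parallel efficiency" (speedup divided by worker count) are introduced here for illustration.

```python
def summarize(files, sequential_s, parallel_s, workers):
    """Derive speedup, parallel efficiency, and per-file cost from timings."""
    speedup = sequential_s / parallel_s
    return {
        "speedup": speedup,
        "efficiency": speedup / workers,
        "per_file_ms": sequential_s / files * 1000.0,
    }


stats = summarize(files=5899, sequential_s=93.0, parallel_s=12.0, workers=8)
print(stats["speedup"])                # 7.75
print(round(stats["efficiency"], 3))   # 0.969
print(round(stats["per_file_ms"], 1))  # 15.8
```

A 7.75x speedup on 8 workers is close to linear scaling, which is consistent with the "~8x" claim. The per-file cost comes out slightly above the rounded "~15ms" figure.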

This reduces CI time while maintaining code formatting standards.

[ROCm/composable_kernel commit: 98abfa4ade]
2026-01-19 12:23:06 -08:00
assistant-librarian[bot]
17b4f104b2 Merge commit '66d6a1cfa6807866487becc87cba95a0965f51f9' into develop 2026-01-19 16:15:25 +00:00
dependabot[bot]
ae64f66966 Bump rocm-docs-core[api_reference] from 1.31.2 to 1.31.3 in /docs/sphinx (#3602)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.2 to 1.31.3.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.2...v1.31.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 66d6a1cfa6]
2026-01-19 07:41:59 -08:00
assistant-librarian[bot]
3d4bb495f8 Merge commit '1a6d1b59ef7358e4f07afcc0a163af7aa4b985a9' into develop 2026-01-19 10:16:14 +00:00
Adam Osewski
a9ff38bc89 [CK_BUILDER] Convolution forward transfer concepts. (#3535)
* Rename member variable to better reflect its actual meaning.

* Add transfer checks for conv fwd xdl.

* Validate tensor layouts & vector size conv fwd v3.

* Add combined transfer concepts.

* Add transfer concepts for conv fwd factories.

* Fix clang format

* Add helper instruction to get max mem vector instruction width.

* Apply review comments.

* Rename thread cluster access(->arrange) order concept

* Fix merge artifacts.

* Add generic access order limits into block transfer concept.

[ROCm/composable_kernel commit: 1a6d1b59ef]
2026-01-19 10:54:10 +01:00
assistant-librarian[bot]
e60d79a9a1 Merge commit 'fe40a5d13941b64162cffce9496d1d94a90f80a5' into develop 2026-01-17 08:14:43 +00:00
Erwin Terpstra
9c660bfbe3 Implement batched gemm bias permute for RDNA4 (#3534)
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)

* wip: device struct for WMMA batched contraction multiple d based on new gridwise op

* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

* fix: failure to resolve template parameters when calling new function overload

* fix: passing reference type as parameter instead of underlying types

* fix: merge error caused duplicate definitions

* fix: make sure constness of template and parameters types match

* fix: don't compile batched contraction test on unsupported architectures

* feat: add example for new wmma implementation, and consolidate example code between platforms

* style: return inline instead of with branch

* chore: add extra assert on vector memory access sizes

* chore: clean up some unused variables

* fix: correct tail number calculation, added small cases and extra instances to the test

* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method

[ROCm/composable_kernel commit: fe40a5d139]
2026-01-17 08:30:27 +01:00
assistant-librarian[bot]
9c4010cd17 Merge commit 'f9104ef9b3b794f8e02757cbf2935818f5389dac' into develop 2026-01-17 00:38:39 +00:00
Cong Ma
487f1beee9 [CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584)
[ROCm/composable_kernel commit: f9104ef9b3]
2026-01-16 15:27:39 -08:00
assistant-librarian[bot]
73b0cfde4e Merge commit '3f735c127b8e78b702a31e19cb6e0e35eda3588a' into develop 2026-01-16 19:13:41 +00:00
Johannes Graner
b12d70ae04 [CK Profiler] Restore CPU tensor initialization when verification is not done on GPU (#3594)
* Fix large case init bounds

* Revert "Fix large case init bounds"

This reverts commit 1abca05c6f.

* Restore CPU initialization for do_verification != 2

[ROCm/composable_kernel commit: 3f735c127b]
2026-01-16 10:56:53 -08:00
logicat
fb918acff9 Remove unnecessary hip_fp16 include from stream_config (#3549)
[ROCm/composable_kernel commit: fec81109f1]
2026-01-16 10:40:05 -08:00
John Shumway
0b3ee64c89 Disable CK Builder for SLES15 in Jenkins CI (#3581)
1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder

2. Added a detailed comment explaining why this is necessary:

   - SLES15 is a legacy platform with limited C++20 ecosystem support
   - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 do not reliably support all C++20 features required by the experimental CK Builder

[ROCm/composable_kernel commit: 2d233c838a]
2026-01-16 10:36:23 -08:00
spolifroni-amd
f7614e006b CK Tile: fix some issues (#3557)
* Adding CK Tile documentation

* Updates based on feedback

* Fix tile window API description

* Fix remaining images

* add documentation about flush_cache and rotating_buffer functionality in ck_tile

* Supplement the documentation

* light edit of the ck tile conceptual doc

---------

Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 427d4fb9e9]
2026-01-16 10:34:44 -08:00
Thrupti Raj Lakshmana Gowda
f9ff023328 Fixing GEMM Multi D on Tile Engine (#3583)
[ROCm/composable_kernel commit: de8ee379ad]
2026-01-16 10:17:21 -08:00
assistant-librarian[bot]
d9030f5343 Merge commit '644cdbe3c92f9af16067e539edb4a13e6b9e7c86' into develop 2026-01-16 02:52:08 +00:00
John Shumway
d4990deb79 Merge pull request #3573 from ROCm/jshumway/builder-readme
[ROCm/composable_kernel commit: 644cdbe3c9]
2026-01-15 17:55:04 -08:00
assistant-librarian[bot]
7b405e44b0 Merge commit '086a1f8861ef8c81db854e7f2749458b69121617' into develop 2026-01-15 17:20:33 +00:00
Max Podkorytov
f6d1bb77e0 Add LLM-agnostic Docker and build analysis tools (#3576)
This commit introduces utility tools for building, testing, and analyzing
Composable Kernel. The tools are designed to be LLM-agnostic and can be
used with any AI assistant or directly from the command line.

Tools Added:
============

1. ck-docker - Docker container management
   - Start/stop ROCm-enabled containers
   - Build targets with CMake + Ninja
   - Run tests with gtest filters
   - Auto-detect GPU targets (gfx950, gfx942, etc.)
   - Per-user, per-branch container naming to avoid conflicts

2. ck-build-analysis - Build time profiling
   - Uses Clang's -ftime-trace for compilation analysis
   - Aggregates statistics across multiple trace files
   - Identifies template instantiation bottlenecks
   - Generates detailed Markdown reports with:
     * Compilation phase breakdown
     * Top expensive instantiations
     * Template family analysis
     * Data-driven optimization recommendations
   - Configurable granularity (1µs to 500µs)
   - PEP 723 compliant Python script with auto-dependency management via uv
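
To illustrate what aggregating statistics across multiple trace files can look like, here is a minimal Python sketch. It is not the code in `analyze_build_trace.py`. It assumes each trace is the Chrome trace-event JSON that Clang's `-ftime-trace` writes, with a `traceEvents` list whose complete events have `"ph": "X"` and a `dur` in microseconds. The function name `aggregate_trace_events` and the `min_dur_us` parameter are hypothetical.

```python
from collections import defaultdict


def aggregate_trace_events(traces, min_dur_us=0):
    """Sum event durations by event name across already-parsed trace dicts.

    Returns (name, count, total_us) tuples, most expensive first.
    """
    totals = defaultdict(lambda: {"count": 0, "total_us": 0})
    for trace in traces:
        for event in trace.get("traceEvents", []):
            if event.get("ph") != "X":      # keep only complete events
                continue
            dur = event.get("dur", 0)
            if dur < min_dur_us:            # drop events below the threshold
                continue
            entry = totals[event["name"]]
            entry["count"] += 1
            entry["total_us"] += dur
    rows = [(name, v["count"], v["total_us"]) for name, v in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

In practice each trace dict would come from `json.load` on one trace file, and the sorted rows could feed a report template.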

Key Features:
=============

- LLM-agnostic design (works with any AI assistant)
- Zero-configuration setup with automatic dependency installation
- Comprehensive documentation in script/tools/README*.md
- Security hardening (input validation, no command injection)
- Multi-file trace aggregation for accurate build analysis
- Jinja2-based report generation for customizable output

Implementation:
===============

- script/tools/ck-docker - Main Docker orchestration script
- script/tools/ck-build-analysis - Build analysis orchestration
- script/tools/common.sh - Shared utilities (container mgmt, GPU detection)
- script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer
- script/tools/templates/ - Jinja2 templates for report generation
- script/tools/README*.md - Comprehensive documentation

Directory Structure:
====================

script/tools/
├── README.md                          # Main overview
├── README_ck-docker.md                # ck-docker documentation
├── README_ck-build-analysis.md        # ck-build-analysis documentation
├── ck-docker                          # Docker orchestration script
├── ck-build-analysis                  # Build analysis orchestration
├── common.sh                          # Shared utilities
├── analyze_build_trace.py             # Python analyzer (PEP 723)
└── templates/
    └── build_analysis_report.md.jinja # Report template

The tools follow Unix philosophy: do one thing well, compose easily,
and work from both CLI and programmatic contexts.

[ROCm/composable_kernel commit: 086a1f8861]
2026-01-15 08:30:23 -08:00
assistant-librarian[bot]
f0f4dbbffc Merge commit 'f57395689b92ca1f644e6e549e763f6c293ced22' into develop 2026-01-15 16:19:30 +00:00
dependabot[bot]
fcdc0f7fee Bump rocm-docs-core[api_reference] from 1.31.1 to 1.31.2 in /docs/sphinx (#3577)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.1 to 1.31.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.1...v1.31.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: f57395689b]
2026-01-15 07:49:06 -08:00
Michal Kulikowski
eb0080ab85 [CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation
-Fixing args num in ck examples 68/69

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>


[ROCm/composable_kernel commit: e1f2a44096]
2026-01-15 16:43:02 +01:00