Commit Graph

3949 Commits

Author SHA1 Message Date
assistant-librarian[bot]
5f61470a1f Merge commit '91b4102a59c6013d3faeb54f250cf577b2f129ce' into develop 2026-01-20 19:35:23 +00:00
Max Podkorytov
b8595c5684 Add persistent async input scheduler for GEMM kernels (#3520)
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
  chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks

Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
  chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
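
The chunk index formula above can be sketched in a few lines. This is only an illustration of the modulo wraparound, not the kernel code: the function name `chunk_index` is hypothetical, and it assumes non-negative integer inputs so that Python's `//` matches the integer division implied by `/` in the formula.

```python
def chunk_index(tile_idx: int, tile_idx_pivot: int,
                tiles_per_chunk: int, num_chunks: int) -> int:
    """Return the chunk whose signal a tile would wait on (illustrative only)."""
    if tiles_per_chunk <= 0 or num_chunks <= 0:
        raise ValueError("tiles_per_chunk and num_chunks must be positive")
    # Shift by the pivot, group tiles into chunks, then wrap around.
    return ((tile_idx + tile_idx_pivot) // tiles_per_chunk) % num_chunks


# With 4 tiles per chunk, 3 chunks, and a pivot of 2:
print(chunk_index(0, 2, 4, 3))   # (0 + 2) // 4 = 0, 0 % 3 = 0
print(chunk_index(2, 2, 4, 3))   # (2 + 2) // 4 = 1, 1 % 3 = 1
print(chunk_index(10, 2, 4, 3))  # (10 + 2) // 4 = 3, 3 % 3 = 0 (wraps)
```

The pivot shifts which tiles map to chunk 0, and the final modulo keeps the result within `num_chunks` once the shifted index runs past the last chunk.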

[ROCm/composable_kernel commit: 91b4102a59]
2026-01-20 10:37:09 -08:00
assistant-librarian[bot]
58bb88f499 Merge commit '8f75869408210cb85e9eb7ff639c4c9dad1331cb' into develop 2026-01-20 18:17:53 +00:00
Linjun-AMD
30ac278911 Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)
This reverts commit a0e77e4329.

[ROCm/composable_kernel commit: 8f75869408]
2026-01-20 09:40:54 -08:00
Estevan Vedovelli
4db6fcdf65 Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598)
* Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions

Enables hipTensor to access the WMMA hardware functionality
for these data type combinations on gfx11 and gfx12.

* Fix change to contraction scale tests

* Fix clang-format

[ROCm/composable_kernel commit: 7d8bca7ddc]
2026-01-20 09:39:57 -08:00
assistant-librarian[bot]
6a0cbcb01d Merge commit '4d58c70e6cf76ce6cb40aa6035ebccbb28493f71' into develop 2026-01-20 17:18:34 +00:00
Cong Ma
2df8d912eb [CK TILE GEMM] Add bf8 support to tile engine streamk generator (#3543)
[ROCm/composable_kernel commit: 4d58c70e6c]
2026-01-20 10:01:33 -07:00
assistant-librarian[bot]
a7320b9717 Merge commit '6300ad3c62298dc6fdddfcf19ecd074f7f08fa96' into develop 2026-01-20 16:18:17 +00:00
music-dino
5827d0d892 Batched gemm softmax gemm descriptor fix (#3564)
* Add rocm to prefix path for codegen

* Fix issue with c0_matrix_mask construction

[ROCm/composable_kernel commit: 6300ad3c62]
2026-01-20 07:25:30 -08:00
assistant-librarian[bot]
43058803dc Merge commit 'b09121f86066381f3662fdbdee6a810849a8a1a7' into develop 2026-01-20 10:16:09 +00:00
Wojciech Laskowski
f9a06ea114 WMMA support for batched_gemm_reduce (#3332)
Summary:
- added new device impl of Batched GEMM Reduce for WMMA
- added instance library
- added WMMA impl to the Batched GEMM Reduce tests

[ROCm/composable_kernel commit: b09121f860]
2026-01-20 10:50:46 +01:00
assistant-librarian[bot]
38c7251ed1 Merge commit '0727e85e523aac7a1e82af00f44081cc67f5cde0' into develop 2026-01-20 06:20:32 +00:00
Bartłomiej Kocot
d15cc593ea [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BUILDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes

[ROCm/composable_kernel commit: 0727e85e52]
2026-01-19 22:29:01 -07:00
assistant-librarian[bot]
895404d62b Merge commit '0517d43d312356c62cc33bea4f0ecc5613e87079' into develop 2026-01-20 00:37:44 +00:00
Cong Ma
1a5d3590ef [CK TILE] remove dependency on std chrono (#3599)
* [CK TILE] remove dependency on std chrono

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 0517d43d31]
2026-01-19 15:31:02 -08:00
Linjun-AMD
a0e77e4329 [CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: f3aafb9555]
2026-01-19 15:22:33 -08:00
assistant-librarian[bot]
b5bde883eb Merge commit '98abfa4ade0f7b5204adf4da00e95be9453dce74' into develop 2026-01-19 21:13:18 +00:00
Max Podkorytov
44434d33d5 Optimize clang-format check in Jenkins CI (#3597)
This change improves the clang-format CI check to be faster and not
depend on git being available in the build environment.

Changes:
- Use `find` instead of `git ls-files` (no git dependency)
- Check all C++ files: *.h, *.hpp, *.cpp, *.h.in, *.hpp.in, *.cpp.in, *.cl
- Exclude build/ and include/rapidjson directories
- Use parallel processing with 8 cores (-P 8) for ~8x speedup
- Show only errors with unified diff format (-u)
- Clear error messages: "ERROR: <file> needs formatting"
- Preserve original logic: run clang-format only when RUN_CPPCHECK=false,
  or run both clang-format and cppcheck when RUN_CPPCHECK=true
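
As a rough illustration of the changes listed above, here is a Python sketch of the same idea: select files by pattern, skip the excluded directories, and format-check them in parallel. It is not the Jenkins script, which the commit describes as `find`-based. The names `is_checked`, `find_sources`, `format_diff`, and `check_all` are hypothetical, and the sketch assumes the excluded directories are matched relative to the repository root and that a `clang-format` binary is on `PATH`.

```python
import difflib
import subprocess
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase
from pathlib import Path, PurePosixPath

# File patterns and exclusions taken from the commit message.
PATTERNS = ("*.h", "*.hpp", "*.cpp", "*.h.in", "*.hpp.in", "*.cpp.in", "*.cl")
EXCLUDED_DIRS = (PurePosixPath("build"), PurePosixPath("include/rapidjson"))


def is_checked(rel_path: PurePosixPath) -> bool:
    """True if a repo-relative path matches a pattern and is not excluded."""
    if any(excluded in rel_path.parents for excluded in EXCLUDED_DIRS):
        return False
    return any(fnmatchcase(rel_path.name, pattern) for pattern in PATTERNS)


def find_sources(root: Path) -> list[Path]:
    """Walk the tree without git, mirroring the switch away from `git ls-files`."""
    return sorted(
        path for path in root.rglob("*")
        if path.is_file()
        and is_checked(PurePosixPath(path.relative_to(root).as_posix()))
    )


def format_diff(path: Path) -> str:
    """Unified diff between a file and clang-format's output; empty if clean."""
    original = path.read_text()
    formatted = subprocess.run(
        ["clang-format", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        formatted.splitlines(keepends=True),
        fromfile=str(path), tofile=f"{path} (formatted)",
    ))


def check_all(root: Path, workers: int = 8) -> list[Path]:
    """Check every selected file using `workers` parallel clang-format runs."""
    sources = find_sources(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        diffs = list(pool.map(format_diff, sources))
    failing = [path for path, diff in zip(sources, diffs) if diff]
    for path in failing:
        print(f"ERROR: {path} needs formatting")
    return failing
```

Threads are enough here because each worker spends its time waiting on a `clang-format` subprocess rather than running Python code.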

Performance:
- Sequential processing: ~93 seconds for 5,899 files
- Parallel with 8 cores: ~12 seconds for 5,899 files
- Per-file processing time: ~15ms
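
The figures above can be cross-checked with a little arithmetic. The inputs are the approximate values quoted in this message, so the results are approximate too; the variable names are illustrative.

```python
files = 5899          # files checked, from the message
sequential_s = 93.0   # approximate sequential wall time in seconds
parallel_s = 12.0     # approximate wall time with 8 parallel workers
workers = 8

speedup = sequential_s / parallel_s         # how much faster the parallel run is
efficiency = speedup / workers              # fraction of ideal linear scaling
per_file_ms = sequential_s / files * 1000   # average sequential cost per file

print(f"speedup:    {speedup:.2f}x")        # 7.75x, i.e. roughly 8x
print(f"efficiency: {efficiency:.0%}")      # about 97% of linear scaling
print(f"per file:   {per_file_ms:.1f} ms")  # about 15.8 ms
```

The speedup is close to linear in the number of workers, which is consistent with per-file formatting checks being independent of one another.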

This reduces CI time while maintaining code formatting standards.

[ROCm/composable_kernel commit: 98abfa4ade]
2026-01-19 12:23:06 -08:00
assistant-librarian[bot]
17b4f104b2 Merge commit '66d6a1cfa6807866487becc87cba95a0965f51f9' into develop 2026-01-19 16:15:25 +00:00
dependabot[bot]
56b7aca81d Bump rocm-docs-core[api_reference] from 1.31.2 to 1.31.3 in /docs/sphinx (#3602)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.2 to 1.31.3.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.2...v1.31.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 66d6a1cfa6]
2026-01-19 07:41:59 -08:00
assistant-librarian[bot]
3d4bb495f8 Merge commit '1a6d1b59ef7358e4f07afcc0a163af7aa4b985a9' into develop 2026-01-19 10:16:14 +00:00
Adam Osewski
c03fa25f6f [CK_BUILDER] Convolution forward transfer concepts. (#3535)
* Rename member variable to better reflect its actual meaning.

* Add transfer checks for conv fwd xdl.

* Validate tensor layouts & vector size conv fwd v3.

* Add combined transfer concepts.

* Add transfer concepts for conv fwd factories.

* Fix clang format

* Add helper instruction to get max mem vector instruction width.

* Apply review comments.

* Rename thread cluster access(->arrange) order concept

* Fix merge artifacts.

* Add generic access order limits into block transfer concept.

[ROCm/composable_kernel commit: 1a6d1b59ef]
2026-01-19 10:54:10 +01:00
assistant-librarian[bot]
e60d79a9a1 Merge commit 'fe40a5d13941b64162cffce9496d1d94a90f80a5' into develop 2026-01-17 08:14:43 +00:00
Erwin Terpstra
beffadc5a0 Implement batched gemm bias permute for RDNA4 (#3534)
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)

* wip: device struct for WMMA batched contraction multiple d based on new gridwise op

* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

* fix: failure to resolve template parameters when calling new function overload

* fix: passing reference type as parameter instead of underlying types

* fix: merge error caused duplicate definitions

* fix: make sure constness of template and parameters types match

* fix: don't compile batched contraction test on unsupported architectures

* feat: add example for new wmma implementation, and consolidate example code between platforms

* style: return inline instead of with branch

* chore: add extra assert on vector memory access sizes

* chore: clean up some unused variables

* fix: correct tail number calculation, added small cases and extra instances to the test

* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method

[ROCm/composable_kernel commit: fe40a5d139]
2026-01-17 08:30:27 +01:00
assistant-librarian[bot]
9c4010cd17 Merge commit 'f9104ef9b3b794f8e02757cbf2935818f5389dac' into develop 2026-01-17 00:38:39 +00:00
Cong Ma
80bc8aaf76 [CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584)
[ROCm/composable_kernel commit: f9104ef9b3]
2026-01-16 15:27:39 -08:00
assistant-librarian[bot]
73b0cfde4e Merge commit '3f735c127b8e78b702a31e19cb6e0e35eda3588a' into develop 2026-01-16 19:13:41 +00:00
Johannes Graner
74c4b5df53 [CK Profiler] Restore CPU tensor initialization when verification is not done on GPU (#3594)
* Fix large case init bounds

* Revert "Fix large case init bounds"

This reverts commit 1abca05c6f.

* Restore CPU initialization for do_verification != 2

[ROCm/composable_kernel commit: 3f735c127b]
2026-01-16 10:56:53 -08:00
logicat
2f59f74334 Remove unnecessary hip_fp16 include from stream_config (#3549)
[ROCm/composable_kernel commit: fec81109f1]
2026-01-16 10:40:05 -08:00
John Shumway
c4dce7cb69 Disable CK Builder for SLES15 in Jenkins CI (#3581)
1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder

2. Added a detailed comment explaining why this is necessary:

   - SLES15 is a legacy platform with limited C++20 ecosystem support
   - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 do not reliably support all C++20 features required by the experimental CK Builder

[ROCm/composable_kernel commit: 2d233c838a]
2026-01-16 10:36:23 -08:00
spolifroni-amd
b56d46606d CK Tile: fix some issues (#3557)
* Adding CK Tile documentation

* Updates based on feedback

* Fix tile window API description

* Fix remaining images

* add documentation about flush_cache and rotating_buffer functionality in ck_tile

* Supplement the documentation

* light edit of the ck tile conceptual doc

---------

Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 427d4fb9e9]
2026-01-16 10:34:44 -08:00
Thrupti Raj Lakshmana Gowda
01adec72bf Fixing GEMM Multi D on Tile Engine (#3583)
[ROCm/composable_kernel commit: de8ee379ad]
2026-01-16 10:17:21 -08:00
assistant-librarian[bot]
d9030f5343 Merge commit '644cdbe3c92f9af16067e539edb4a13e6b9e7c86' into develop 2026-01-16 02:52:08 +00:00
John Shumway
4faf012d92 Merge pull request #3573 from ROCm/jshumway/builder-readme
[ROCm/composable_kernel commit: 644cdbe3c9]
2026-01-15 17:55:04 -08:00
assistant-librarian[bot]
7b405e44b0 Merge commit '086a1f8861ef8c81db854e7f2749458b69121617' into develop 2026-01-15 17:20:33 +00:00
Max Podkorytov
79139825a9 Add LLM-agnostic Docker and build analysis tools (#3576)
This commit introduces utility tools for building, testing, and analyzing
Composable Kernel. The tools are designed to be LLM-agnostic and can be
used with any AI assistant or directly from the command line.

Tools Added:
============

1. ck-docker - Docker container management
   - Start/stop ROCm-enabled containers
   - Build targets with CMake + Ninja
   - Run tests with gtest filters
   - Auto-detect GPU targets (gfx950, gfx942, etc.)
   - Per-user, per-branch container naming to avoid conflicts

2. ck-build-analysis - Build time profiling
   - Uses Clang's -ftime-trace for compilation analysis
   - Aggregates statistics across multiple trace files
   - Identifies template instantiation bottlenecks
   - Generates detailed Markdown reports with:
     * Compilation phase breakdown
     * Top expensive instantiations
     * Template family analysis
     * Data-driven optimization recommendations
   - Configurable granularity (1µs to 500µs)
   - PEP 723 compliant Python script with auto-dependency management via uv
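
To make the multi-file aggregation step concrete, here is a minimal sketch of summing `-ftime-trace` events across several trace files. It is not `analyze_build_trace.py`: the function name `aggregate_traces` and the `min_dur_us` filter are hypothetical. It assumes the Chrome trace event JSON that Clang writes (a `traceEvents` list of events with `name`, `ph`, `dur` in microseconds, and an optional `args.detail`), and it assumes the summary events whose names start with `Total ` should be skipped to avoid double counting.

```python
import json
from collections import defaultdict


def aggregate_traces(trace_texts, min_dur_us=0):
    """Sum event durations across traces.

    trace_texts: iterable of JSON strings, one per -ftime-trace output file.
    Returns (name, detail, count, total_us) tuples, most expensive first.
    """
    totals = defaultdict(lambda: [0, 0])  # (name, detail) -> [count, total_us]
    for text in trace_texts:
        for event in json.loads(text).get("traceEvents", []):
            if event.get("ph") != "X":  # keep complete (duration) events only
                continue
            name = event.get("name", "")
            if name.startswith("Total "):  # assumed per-file summary rows
                continue
            dur = event.get("dur", 0)
            if dur < min_dur_us:
                continue
            detail = event.get("args", {}).get("detail", "")
            entry = totals[(name, detail)]
            entry[0] += 1
            entry[1] += dur
    rows = [(name, detail, count, total)
            for (name, detail), (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Two tiny hand-written traces standing in for real compiler output.
trace_a = json.dumps({"traceEvents": [
    {"ph": "X", "name": "InstantiateClass", "dur": 300,
     "args": {"detail": "Foo<int>"}},
    {"ph": "X", "name": "Total InstantiateClass", "dur": 300},
]})
trace_b = json.dumps({"traceEvents": [
    {"ph": "X", "name": "InstantiateClass", "dur": 200,
     "args": {"detail": "Foo<int>"}},
    {"ph": "X", "name": "Frontend", "dur": 1000},
]})

for row in aggregate_traces([trace_a, trace_b]):
    print(row)
```

Keying on both the event name and its detail string lets the same template instantiation be summed when it appears in many translation units.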

Key Features:
=============

- LLM-agnostic design (works with any AI assistant)
- Zero-configuration setup with automatic dependency installation
- Comprehensive documentation in script/tools/README*.md
- Security hardening (input validation, no command injection)
- Multi-file trace aggregation for accurate build analysis
- Jinja2-based report generation for customizable output

Implementation:
===============

- script/tools/ck-docker - Main Docker orchestration script
- script/tools/ck-build-analysis - Build analysis orchestration
- script/tools/common.sh - Shared utilities (container mgmt, GPU detection)
- script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer
- script/tools/templates/ - Jinja2 templates for report generation
- script/tools/README*.md - Comprehensive documentation

Directory Structure:
====================

script/tools/
├── README.md                          # Main overview
├── README_ck-docker.md                # ck-docker documentation
├── README_ck-build-analysis.md        # ck-build-analysis documentation
├── ck-docker                          # Docker orchestration script
├── ck-build-analysis                  # Build analysis orchestration
├── common.sh                          # Shared utilities
├── analyze_build_trace.py             # Python analyzer (PEP 723)
└── templates/
    └── build_analysis_report.md.jinja # Report template

The tools follow Unix philosophy: do one thing well, compose easily,
and work from both CLI and programmatic contexts.

[ROCm/composable_kernel commit: 086a1f8861]
2026-01-15 08:30:23 -08:00
assistant-librarian[bot]
f0f4dbbffc Merge commit 'f57395689b92ca1f644e6e549e763f6c293ced22' into develop 2026-01-15 16:19:30 +00:00
dependabot[bot]
48becfa5ad Bump rocm-docs-core[api_reference] from 1.31.1 to 1.31.2 in /docs/sphinx (#3577)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.1 to 1.31.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.1...v1.31.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: f57395689b]
2026-01-15 07:49:06 -08:00
Michal Kulikowski
fc889be2a5 [CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation
-Fixing args num in ck examples 68/69

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>


[ROCm/composable_kernel commit: e1f2a44096]
2026-01-15 16:43:02 +01:00
Yung-sheng Tu
1bf5861e43 Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)
* add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp

* add examples

* add instances to test

* remove duplicate code between examples

[ROCm/composable_kernel commit: 6df2d70143]
2026-01-15 07:19:31 -08:00
assistant-librarian[bot]
43e3e63175 Merge commit 'e30207985aa5d9d0b53dc837904bf2ac3063a412' into develop 2026-01-15 15:14:37 +00:00
Estevan Vedovelli
e71d3df441 Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: e30207985a]
2026-01-15 09:35:24 -05:00
Jeff Huang
e3eda32062 [FMHA] Enable page size 16 for batch prefill kernel (#3568)
* [FMHA] Enable page size 16 for batch prefill kernel

* Refactor batch prefill KV offset logic to simplify template arguments
- Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args.
- Add static assert to forbid `page_size=1` with vectorized layout.
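
For readers unfamiliar with paged KV caches, the following sketch shows one way a logical token index can be mapped to a physical slot with a page size of 16. It is a simplified host-side illustration, not the kernel's offset logic. The function name `kv_physical_index`, the `page_table` layout, and the power-of-two requirement are assumptions. The check that rejects `page_size=1` with a vectorized layout mirrors the static assert mentioned above, without claiming to reproduce its exact condition.

```python
def kv_physical_index(token_idx, page_table, page_size, vectorized=False):
    """Map a logical token index to a physical KV slot (illustrative only).

    page_table[i] is the physical page that backs logical page i.
    """
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("this sketch assumes a power-of-two page size")
    if vectorized and page_size == 1:
        # Analogous to the static assert: a one-token page is rejected
        # when the layout is vectorized.
        raise ValueError("page_size=1 is not allowed with a vectorized layout")
    log2_page_size = page_size.bit_length() - 1
    logical_page = token_idx >> log2_page_size   # token_idx // page_size
    offset = token_idx & (page_size - 1)         # token_idx % page_size
    return page_table[logical_page] * page_size + offset


page_table = [7, 2, 5]  # hypothetical logical-to-physical page mapping
print(kv_physical_index(0, page_table, 16))   # page 7, offset 0  -> 112
print(kv_physical_index(17, page_table, 16))  # page 2, offset 1  -> 33
print(kv_physical_index(47, page_table, 16))  # page 5, offset 15 -> 95
```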

[ROCm/composable_kernel commit: 993d3e2f0e]
2026-01-15 22:11:44 +08:00
assistant-librarian[bot]
eb83c23157 Merge commit '51226372156901aa20a34ed5146d6bd57c63e519' into develop 2026-01-15 09:16:31 +00:00
John Shumway
6ae6b01721 [CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547)
* Factor helpers out of conv_traits.hpp

* Create a non-templated conv_traits struct

* Migrate to new instance-specific instance_to_conv_traits functions

* Clean up reflection concepts

* Clean up ConvTraits helpers

* Update testing for convolution traits

This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.

* Address reviewer comments and resolve merge conflict

[ROCm/composable_kernel commit: 5122637215]
2026-01-15 10:03:21 +01:00
John Shumway
b4a7cc7524 Update README.md files to match recent code changes
This is mostly adjustments to enum values so that the docs align correctly with the current code.

Also updated the calendar scope of the project to extend through March 2026.


[ROCm/composable_kernel commit: df7ee270a6]
2026-01-15 02:15:29 -05:00
assistant-librarian[bot]
35ec0097e5 Merge commit '8705fdcb0c738907fea74b7ed39c9f73fb9a5892' into develop 2026-01-14 22:14:05 +00:00
Illia Silin
8b415db3d6 add aiter test_batch_prefill and simplify jenkins file a bit (#3570)
[ROCm/composable_kernel commit: 8705fdcb0c]
2026-01-14 14:07:47 -08:00
assistant-librarian[bot]
5386db55e1 Merge commit '7f912909ca2c3cedfa1c6397d75daba4903a6d0d' into develop 2026-01-14 21:07:55 +00:00
Emily Martins
c07c2fa0ab Disable CK Tile Stream-K reduction tests (#3559)
The test_ck_tile_streamk_reduction test suite seems to have transient
failures; hence, we are disabling these tests for now. We will re-enable
them once the bug is resolved.

[ROCm/composable_kernel commit: 7f912909ca]
2026-01-14 14:02:21 -07:00