Commit Graph

3879 Commits

Author SHA1 Message Date
Wojciech Laskowski
6ad65bc855 WMMA support for batched_gemm_reduce (#3332)
Summary:
- added new device impl of Batched GEMM Reduce for WMMA
- added instance library
- added WMMA impl to the Batched GEMM Reduce tests

[ROCm/composable_kernel commit: b09121f860]
2026-01-20 10:50:46 +01:00
assistant-librarian[bot]
38c7251ed1 Merge commit '0727e85e523aac7a1e82af00f44081cc67f5cde0' into develop 2026-01-20 06:20:32 +00:00
Bartłomiej Kocot
85c5741492 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BULDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes

[ROCm/composable_kernel commit: 0727e85e52]
2026-01-19 22:29:01 -07:00
assistant-librarian[bot]
895404d62b Merge commit '0517d43d312356c62cc33bea4f0ecc5613e87079' into develop 2026-01-20 00:37:44 +00:00
Cong Ma
c42cd28370 [CK TILE] remove dependency on std chrono (#3599)
* [CK TILE] remove dependency on std chrono

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 0517d43d31]
2026-01-19 15:31:02 -08:00
Linjun-AMD
ecda0fe2e9 [CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: f3aafb9555]
2026-01-19 15:22:33 -08:00
assistant-librarian[bot]
b5bde883eb Merge commit '98abfa4ade0f7b5204adf4da00e95be9453dce74' into develop 2026-01-19 21:13:18 +00:00
Max Podkorytov
8bd33c4a35 Optimize clang-format check in Jenkins CI (#3597)
This change improves the clang-format CI check to be faster and not
depend on git being available in the build environment.

Changes:
- Use `find` instead of `git ls-files` (no git dependency)
- Check all C++ files: *.h, *.hpp, *.cpp, *.h.in, *.hpp.in, *.cpp.in, *.cl
- Exclude build/ and include/rapidjson directories
- Use parallel processing with 8 cores (-P 8) for ~8x speedup
- Show only errors with unified diff format (-u)
- Clear error messages: "ERROR: <file> needs formatting"
- Preserve original logic: run clang-format only when RUN_CPPCHECK=false,
  or run both clang-format and cppcheck when RUN_CPPCHECK=true

Performance:
- Sequential processing: ~93 seconds for 5,899 files
- Parallel with 8 cores: ~12 seconds for 5,899 files
- Per-file processing time: ~15ms

This reduces CI time while maintaining code formatting standards.

[ROCm/composable_kernel commit: 98abfa4ade]
2026-01-19 12:23:06 -08:00
assistant-librarian[bot]
17b4f104b2 Merge commit '66d6a1cfa6807866487becc87cba95a0965f51f9' into develop 2026-01-19 16:15:25 +00:00
dependabot[bot]
ae64f66966 Bump rocm-docs-core[api_reference] from 1.31.2 to 1.31.3 in /docs/sphinx (#3602)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.2 to 1.31.3.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.2...v1.31.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 66d6a1cfa6]
2026-01-19 07:41:59 -08:00
assistant-librarian[bot]
3d4bb495f8 Merge commit '1a6d1b59ef7358e4f07afcc0a163af7aa4b985a9' into develop 2026-01-19 10:16:14 +00:00
Adam Osewski
a9ff38bc89 [CK_BUILDER] Convolution forward transfer concepts. (#3535)
* Rename member variable to better reflect its actuall meaning.

* Add transfer checks for conv fwd xdl.

* Validate tensor layouts & vector size conv fwd v3.

* Add combined transfer concepts.

* Add transfer concepts for conv fwd factories.

* Fix clang format

* Add helper instruction to get max mem vector instruction width.

* Apply review comments.

* Rename thread cluster access(->arrange) order concept

* FIx merge artifacts.

* Add generic access order limits into block transfer concept.

[ROCm/composable_kernel commit: 1a6d1b59ef]
2026-01-19 10:54:10 +01:00
assistant-librarian[bot]
e60d79a9a1 Merge commit 'fe40a5d13941b64162cffce9496d1d94a90f80a5' into develop 2026-01-17 08:14:43 +00:00
Erwin Terpstra
9c660bfbe3 Implement batched gemm bias permute for RDNA4 (#3534)
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)

* wip: device struct for WMMA batched contraction multiple d based on new gridwise op

* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

* fix: failure to resolve template parameters when calling new function overload

* fix: passing reference type as parameter instead of underlying types

* fix: merge error caused duplicate definitions

* fix: make sure constness of template and parameters types match

* fix: don't compile batched contraction test on unsupported architectures

* feat: add example for new wmma implementation, and consolidate example code between platforms

* style: return inline instead of with branch

* chore: add extra assert on vector memory access sizes

* chore: clean up some unused variables

* fix: correct tail number calculation, added small cases and extra instances to the test

* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method

[ROCm/composable_kernel commit: fe40a5d139]
2026-01-17 08:30:27 +01:00
assistant-librarian[bot]
9c4010cd17 Merge commit 'f9104ef9b3b794f8e02757cbf2935818f5389dac' into develop 2026-01-17 00:38:39 +00:00
Cong Ma
487f1beee9 [CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584)
[ROCm/composable_kernel commit: f9104ef9b3]
2026-01-16 15:27:39 -08:00
assistant-librarian[bot]
73b0cfde4e Merge commit '3f735c127b8e78b702a31e19cb6e0e35eda3588a' into develop 2026-01-16 19:13:41 +00:00
Johannes Graner
b12d70ae04 [CK Profiler] Restore CPU tensor initialization when verification is not done on GPU (#3594)
* Fix large case init bounds

* Revert "Fix large case init bounds"

This reverts commit 1abca05c6f.

* Restore CPU initialization for do_verification != 2

[ROCm/composable_kernel commit: 3f735c127b]
2026-01-16 10:56:53 -08:00
logicat
fb918acff9 Remove unnecessary hip_fp16 include from stream_config (#3549)
[ROCm/composable_kernel commit: fec81109f1]
2026-01-16 10:40:05 -08:00
John Shumway
0b3ee64c89 Disable CK Builder for SLES15 in Jenkins CI (#3581)
1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder

2. Added a detailed comment explaining why this is necessary:

   - SLES15 is a legacy platform with limited C++20 ecosystem support
   - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 does not reliably support all C++20 features required by the experimental CK Builder

[ROCm/composable_kernel commit: 2d233c838a]
2026-01-16 10:36:23 -08:00
spolifroni-amd
f7614e006b CK Tile: fix some issues (#3557)
* Adding CK Tile documentation

* Updates based on feedback

* Fix tile window API description

* Fix remaining images

* add documentation about flush_cache and rotating_buffer functionality in ck_tile

* Supplement the documentation

* light edit of the ck tile conceptual doc

---------

Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 427d4fb9e9]
2026-01-16 10:34:44 -08:00
Thrupti Raj Lakshmana Gowda
f9ff023328 Fixing GEMM Multi D on Tile Engine (#3583)
[ROCm/composable_kernel commit: de8ee379ad]
2026-01-16 10:17:21 -08:00
assistant-librarian[bot]
d9030f5343 Merge commit '644cdbe3c92f9af16067e539edb4a13e6b9e7c86' into develop 2026-01-16 02:52:08 +00:00
John Shumway
d4990deb79 Merge pull request #3573 from ROCm/jshumway/builder-readme
[ROCm/composable_kernel commit: 644cdbe3c9]
2026-01-15 17:55:04 -08:00
assistant-librarian[bot]
7b405e44b0 Merge commit '086a1f8861ef8c81db854e7f2749458b69121617' into develop 2026-01-15 17:20:33 +00:00
Max Podkorytov
f6d1bb77e0 Add LLM-agnostic Docker and build analysis tools (#3576)
This commit introduces utility tools for building, testing, and analyzing
Composable Kernel. The tools are designed to be LLM-agnostic and can be
used with any AI assistant or directly from the command line.

Tools Added:
============

1. ck-docker - Docker container management
   - Start/stop ROCm-enabled containers
   - Build targets with CMake + Ninja
   - Run tests with gtest filters
   - Auto-detect GPU targets (gfx950, gfx942, etc.)
   - Per-user, per-branch container naming to avoid conflicts

2. ck-build-analysis - Build time profiling
   - Uses Clang's -ftime-trace for compilation analysis
   - Aggregates statistics across multiple trace files
   - Identifies template instantiation bottlenecks
   - Generates detailed Markdown reports with:
     * Compilation phase breakdown
     * Top expensive instantiations
     * Template family analysis
     * Data-driven optimization recommendations
   - Configurable granularity (1µs to 500µs)
   - PEP 723 compliant Python script with auto-dependency management via uv

Key Features:
=============

- LLM-agnostic design (works with any AI assistant)
- Zero-configuration setup with automatic dependency installation
- Comprehensive documentation in script/tools/README*.md
- Security hardening (input validation, no command injection)
- Multi-file trace aggregation for accurate build analysis
- Jinja2-based report generation for customizable output

Implementation:
===============

- script/tools/ck-docker - Main Docker orchestration script
- script/tools/ck-build-analysis - Build analysis orchestration
- script/tools/common.sh - Shared utilities (container mgmt, GPU detection)
- script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer
- script/tools/templates/ - Jinja2 templates for report generation
- script/tools/README*.md - Comprehensive documentation

Directory Structure:
====================

script/tools/
├── README.md                          # Main overview
├── README_ck-docker.md                # ck-docker documentation
├── README_ck-build-analysis.md        # ck-build-analysis documentation
├── ck-docker                          # Docker orchestration script
├── ck-build-analysis                  # Build analysis orchestration
├── common.sh                          # Shared utilities
├── analyze_build_trace.py             # Python analyzer (PEP 723)
└── templates/
    └── build_analysis_report.md.jinja # Report template

The tools follow Unix philosophy: do one thing well, compose easily,
and work from both CLI and programmatic contexts.

[ROCm/composable_kernel commit: 086a1f8861]
2026-01-15 08:30:23 -08:00
assistant-librarian[bot]
f0f4dbbffc Merge commit 'f57395689b92ca1f644e6e549e763f6c293ced22' into develop 2026-01-15 16:19:30 +00:00
dependabot[bot]
fcdc0f7fee Bump rocm-docs-core[api_reference] from 1.31.1 to 1.31.2 in /docs/sphinx (#3577)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.1 to 1.31.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.1...v1.31.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: f57395689b]
2026-01-15 07:49:06 -08:00
Michal Kulikowski
eb0080ab85 [CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation
-Fixing args num in ck examples 68/69

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>


[ROCm/composable_kernel commit: e1f2a44096]
2026-01-15 16:43:02 +01:00
Yung-sheng Tu
97f2fa2912 Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)
* add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp

* add examples

* add instances to test

* remove duplicate code between examples

[ROCm/composable_kernel commit: 6df2d70143]
2026-01-15 07:19:31 -08:00
assistant-librarian[bot]
43e3e63175 Merge commit 'e30207985aa5d9d0b53dc837904bf2ac3063a412' into develop 2026-01-15 15:14:37 +00:00
Estevan Vedovelli
09d084bfb4 Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: e30207985a]
2026-01-15 09:35:24 -05:00
Jeff Huang
445ec888ba [FMHA] Enable page size 16 for batch prefill kernel (#3568)
* [FMHA] Enable page size 16 for batch prefill kernel

* Refactor batch prefill KV offset logic to simplify template arguments
- Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args.
- Add static assert to forbid `page_size=1` with vectorized layout.

[ROCm/composable_kernel commit: 993d3e2f0e]
2026-01-15 22:11:44 +08:00
assistant-librarian[bot]
eb83c23157 Merge commit '51226372156901aa20a34ed5146d6bd57c63e519' into develop 2026-01-15 09:16:31 +00:00
John Shumway
753043b27a [CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547)
* Factor helpers out of conv_traits.hpp

* Create a non-templated conv_traits struct

* Migrate to new instance-specific instance_to_conv_traits functions

* Clean up reflection concepts

* Clean up ConvTraits helpers

* Update testing for convolution traits

This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.

* Address reviewer comments and resolve merge conflict

[ROCm/composable_kernel commit: 5122637215]
2026-01-15 10:03:21 +01:00
John Shumway
c756e421db Update README.md files to match recent code changes
This is mostly adjustments to enum values so that the docs align correctly with the current code.

Also updated the calendar scope of the project to extend through March 2026.


[ROCm/composable_kernel commit: df7ee270a6]
2026-01-15 02:15:29 -05:00
assistant-librarian[bot]
35ec0097e5 Merge commit '8705fdcb0c738907fea74b7ed39c9f73fb9a5892' into develop 2026-01-14 22:14:05 +00:00
Illia Silin
3827441343 add aiter test_batch_prefill and simplify jenkins file a bit (#3570)
[ROCm/composable_kernel commit: 8705fdcb0c]
2026-01-14 14:07:47 -08:00
assistant-librarian[bot]
5386db55e1 Merge commit '7f912909ca2c3cedfa1c6397d75daba4903a6d0d' into develop 2026-01-14 21:07:55 +00:00
Emily Martins
8661ee5a16 Disable CK Tile Stream-K reduction tests (#3559)
The test_ck_tile_streamk_reduction test suite seems to have transient
failures; hence, we are disabling these tests for now. We will re-enable
them once the bug is resolved.

[ROCm/composable_kernel commit: 7f912909ca]
2026-01-14 14:02:21 -07:00
John Shumway
c744f9015e [CK_BUILDER] Update owners file for more reviews for CK Builder (#3572)
Adding owners permissions for two leading developers on the CK Builder subproject to help with reviews on that project, especially in the EU time zones.

Remove aska-0096, who has left AMD


[ROCm/composable_kernel commit: f08fb3f748]
2026-01-14 12:43:55 -08:00
Bartłomiej Kocot
8c72adabeb Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566)
[ROCm/composable_kernel commit: a346cfa960]
2026-01-14 12:37:12 -08:00
assistant-librarian[bot]
0211adaff3 Merge commit 'a07c8e38bd5152f2582dd0c8c1f8eef72f1086e5' into develop 2026-01-14 19:12:59 +00:00
Bartłomiej Kocot
9aea6a52ed Fix grouped conv bwd data wmma check (#3562)
[ROCm/composable_kernel commit: a07c8e38bd]
2026-01-14 11:04:37 -08:00
Khushbu Agarwal
7da4e47a5f [CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* G128 working for both prefill and decode shapes for preshufflequant

* clean up after merging with develop

* fixing group 64 for decode shapes

* non preshufflequant working for group size 128

* enable preshuffleb and preshufflequant with variour group sizes

* reduce build time by splitting example into diff datatype files

* Adding tests for preshuffleQuant

* address review comment

* fix for gfx1201

* compile time fix for gfx1201

* clang formatted

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>

[ROCm/composable_kernel commit: 118afa455c]
2026-01-14 10:00:19 -08:00
assistant-librarian[bot]
9a2c80d30f Merge commit '1fc5a3f3ac6bf204305c089f77a898b1f8765903' into develop 2026-01-14 16:16:26 +00:00
Ville Pietilä
2eb573a0e2 Build CK on Windows (#3458)
* CMakeLists.txt hack for Windows.

* Add Windows build instructions.

* Fix  type issue with variadic min function.

* Use std::common_type to fix the variadic min/max functions.

* Enable CPU guard compilation on Windows.

* Suppress warnings related to std::getenv on Windows platform.

* Git ignore the output directory on Windows platform.

* Powershell script for running tests and generating reports.

* Improve test logging.

* Disable non-conv tests.

* Fix Debug build on Windows.

* More debug build changes.

* Update Windows build instructions.

* Enable all tests.

* Test fixes.

* Suppress not found linker options warning.

* Update unsigned long literals and format specifiers to work correctly in Windows

* Fix conv 3D bwd weight bilinear tests on Windows.

* Revert changes on .gitignore.

* Clean-up CMake project file for Windows builds.

* clang-format

* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.

* Fix building examples on Windows.

* Update Readme.

* Remove the suppression of the deprecated warnings.

* Remove Windows specific min/max implementations from CK Tile math core.

* Remove unnecessary no-op on Windows.

---------

Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>

[ROCm/composable_kernel commit: 1fc5a3f3ac]
2026-01-14 07:31:45 -08:00
assistant-librarian[bot]
7501af9cc6 Merge commit 'f173642087ed6034a0ac16188de2f36f4c008945' into develop 2026-01-14 15:15:11 +00:00
Johannes Graner
b313b8eaea [CK] Refactor GPU verification kernel to gather error stats on GPU (#3551)
* Refactor GPU verification kernel to gather erorr stats on GPU

* Check if result is all zero

* non-negative error count doesn't need custom Atomics

* Remove unnecessary AtomicMaxFloat function

* Simpler warp reduction, remove passed flag

* Move verification header to include

* Fix header path in test

* Fix block reduction loop

[ROCm/composable_kernel commit: f173642087]
2026-01-14 16:04:50 +01:00
Johannes Graner
923923ac8d [CK Profiler] Initialize tensors on GPU in CK profiler (#3550)
* Initialize tensors on GPU in CK profiler

* Kick CI

[ROCm/composable_kernel commit: 3ccb15ea02]
2026-01-14 16:04:14 +01:00