Commit Graph

2941 Commits

Author SHA1 Message Date
Illia Silin
70ffbc577b add dockerfile for manylinux (#3651)
[ROCm/composable_kernel commit: 054c437dec]
2026-01-26 09:23:19 -08:00
Ville Pietilä
e587756695 Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639)
Co-authored-by: Ville Pietilä <>

[ROCm/composable_kernel commit: 7ac3794284]
2026-01-25 13:42:23 +01:00
Emily Martins
b6f1e99074 [CK_TILE] Fix alignment in Stream-K workspace buffer (#3625)
* Fix alignment issue in Stream-K workspace buffer

In CK Tile Stream-K, the workspace buffer is used to hold flags and
partials, where the first i bytes holds the flags and the remaining
bytes hold partials. This change adds padding to the flags prefix of the
workspace buffer to ensure the number of bytes is 128B-aligned. Without
this alignment, since workgroups do not skip cache when reading from
partials, they may read stale partials data in cache, leading to
incorrect results. The added padding avoids the stale data reading.

This change also re-enables the test_ck_tile_streamk_reduction tests.

* Compute reference GEMM on GPU for test verification to decrease testing time

[ROCm/composable_kernel commit: f5c2f09036]
2026-01-23 16:14:22 -07:00
chris-tsiaousis-hpc
3c247733af Remove code duplications in batched gemm wmma (#3580)
* Moved device struct for batched gemm wmma to a common file

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Use the common device struct in the scaled batched gemm wmma implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: Remove unused includes and ambiguous comment

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved pointer offset calculation and gridwise argument to common struct

This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common"

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

[ROCm/composable_kernel commit: e1c46ff548]
2026-01-23 12:39:03 -08:00
ltqin
90b3476006 Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit 723b7ce0be2884da131036301892bf9157f51876.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 67f0b74ec6]
2026-01-23 09:03:22 -08:00
Wojciech Laskowski
a0445aff5f WMMA grouped conv fwd large tensor bias bnorm clamp (#3595)
* Added bias_bnorm_clamp for WMMA conv fwd large tensor.

Following operations are added for FP16/BF16 data type and NHWGCxGKYXC layout.
- grouped_conv2d_fwd_bias_bnorm_clamp
- grouped_conv3d_fwd_bias_bnorm_clamp

* changed strategy to handle GemmArgs array

* Adding generic instance

* fixed last nits from reviewers and copilot

[ROCm/composable_kernel commit: 2e08a7e5ab]
2026-01-23 12:20:00 +01:00
Wojciech Laskowski
ee595ee58a WMMA grouped conv fwd large tensor extra flavors (#3582)
* Additional flavors for WMMA conv fwd large tensor

- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accomodate extra tensors

* changed strategy to handle GemmArgs array

* Adding generic instance

* Added generic instance to clamp and bias_clamp ops

[ROCm/composable_kernel commit: 81ee19bd2c]
2026-01-23 12:19:51 +01:00
Bartłomiej Kocot
5c6c0f5ad1 Grouped conv fwd direct load vector=2 (#3632)
[ROCm/composable_kernel commit: 7b3db1a878]
2026-01-23 10:29:59 +01:00
Po Yen Chen
4ded7e5984 Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit ceccf15275645cc64db0a4ae53f5a215c93a7969.

[ROCm/composable_kernel commit: de5a1d730d]
2026-01-22 21:21:19 -08:00
damien-lejeune
10e975e7af Add missing check target in reduce tile engine op (#3631)
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>

[ROCm/composable_kernel commit: f30d04654e]
2026-01-22 16:06:02 -08:00
Vidyasagar Ananthan
74e270abab Speed up glob recurse. (#3626)
[ROCm/composable_kernel commit: eb2dc8f466]
2026-01-22 14:44:47 -08:00
arai713
bfa37887fb Addition of Stream-K tests using Tile Engine (#3514)
* Addition of Stream-K tests using Tile Engine

This change adds an implementation for generating Stream-K tests using Tile Engine.
This will generate various test executables for different combinations based on the
config files. This addition has simple tests running for bf16 and fp16, with both
atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation
of Stream-K in Tile Engine.

* integrating addition of tree reduction and editing the README

* temporarily removing parallel and tree reduction from configs while bugs regarding them are being resolved

[ROCm/composable_kernel commit: b9bb1db5d9]
2026-01-22 12:53:52 -08:00
kensclin
16e6a2c696 GEMM Blockscale ABQuant Optimization (#3620)
* GEMM Blockscale ABQuant Optimization

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix precommit error

* clean

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>

[ROCm/composable_kernel commit: 31a35ecab4]
2026-01-22 09:39:38 -08:00
Vidyasagar Ananthan
7ce0127e8f Adding dispatcher architecture (#3300)
* WIP POC of dispatcher

* Dispatcher python workflow setup.

* Dispatcher cleanup and updates.

Further dispatcher cleanup and updates.

Build fixes

Improvements and python to CK example

Improvements to readme

* Fixes to python paths

* Cleaning up code

* Improving dispatcher support for different arch

Fixing typos

* Fix formatting errors

* Cleaning up examples

* Improving codegeneration

* Improving and fixing C++ examples

* Adding conv functionality (fwd,bwd,bwdw) and examples.

* Fixes based on feedback.

* Further fixes based on feedback.

* Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug.

* Another round of improvements  based on feedback.

* Trimming out unnecessary code.

* Fixing the multi-D implementation.

* Using gpu verification for gemms and fixing convolutions tflops calculation.

* Fix counter usage issue and arch filtering per ops.

* Adding changelog and other fixes.

* Improve examples and resolve critical bugs.

* Reduce build time for python examples.

* Fixing minor bug.

* Fix compilation error.

* Improve installation instructions for dispatcher.

* Add docker based  installation instructions for dispatcher.

* Fixing arch-based filtering to match tile engine.

* Remove dead code and fix arch filtering.

* Minor bugfix.

* Updates after rebase.

* Trimming code.

* Fix copyright headers.

* Consolidate examples, cut down code.

* Minor fixes.

* Improving python examples.

* Update readmes.

* Remove conv functionality.

* Cleanup following conv removable.

[ROCm/composable_kernel commit: 9e049a32a1]
2026-01-22 09:34:33 -08:00
Bartłomiej Kocot
9c3ab51d9b [CK TILE] Fix basic gemm pipelines (#3611)
* [CK TILE] Fix basic pipelines

* fixes

[ROCm/composable_kernel commit: 44f481a45c]
2026-01-22 08:11:18 -06:00
ApoorvaKalyani
ec0f5c82ca Grouped conv_fwd_bias_bnorm_clamp instances and tests (#3525)
* Added bias_bnorm_clamp instances.

* fwd_bias_bnorm_clamp comp instances

* fwd_bias_bnorm_mem_inter and mem_intra instances

* fwd_bias_bnorm_merged_group_instances

* fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances

* Device level changes for fwd_bias_bnorm_clamp

* Added the test to the regression test list.

* Removed the part 2 and 2x instances

* Removed the irrelevant checks in wmma

* Refactored the instances to adapt to new device implementation

* Updated the reference and include files

* enabling tests

* Added missing profiler

* Added missing instance entry , deleted by mistake

* Reduce bias bnorm clamp instances to only a single generic one.

* Clean up cmakelists file

* clang-format

* Change bias bnorm clamp tests to use monotone initialization values to avoid tiny off-integer gemm results on RDNA3 from blowing up.

* Renaming some instance lists and add functions to be more standardized.

* Commented out non default instances.

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: 8daf6ea302]
2026-01-22 09:53:59 +01:00
Linjun-AMD
f6fac4cea6 [CK_TILE][FMHA]Add new tile size for async (#3623)
* Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)"

This reverts commit cfdad49edda4b2ccef92571f23646a8505bb2859.

* Add new tile_size for async pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 0b13697a88]
2026-01-22 16:07:14 +08:00
ltqin
14254656f0 Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel

* add block scale to kernel

* add smoke test

* format

* Revert "format"

This reverts commit 356c3c9706.

* only format my code

* format py

* fix auto not allowd in function prototype

* change instance tttt to ttff

* fix structured binding issue

* change s_acc elementwise op

* async pipeline add block scale

* add quantation P using shift exp2

* precompute (m - shift) once per row

* change blk scale seqstrt ptr name

* fix some name

* fix for  deduction guide

* fix some comments

* add P scale to qr_ksvs_pipeline

* add comment to idx_identity

* change the method of calculating descale block index

* unify naming style: use block_scale_ as name prefix

* unify naming style

* update the CHANGELOG.md

* Add FP8 block scale quantization support for FMHA forward kernel

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: dd0b4294af]
2026-01-21 20:58:26 -08:00
Michał Kulikowski
04f7e1fce4 [CK][Examples] Extending support for rdna3/4 part 4: (#3264)
* [CK][Examples] Extending support for rdna3/4 part 4:
-example_gemm_xdl_streamk
-example_gemm_xdl_fp16_fp8_v3
-example_gemm_xdl_fp16_v3

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

* [CK][Examples] Revert example\01_gemm\gemm_xdl_streamk parameters change.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

---------

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 4c2c18ef48]
2026-01-21 18:10:16 -08:00
Robin Voetter
2b54a86c04 [CK_BUILDER] Replace reference conv with old ck implementation (#3604)
* ck-builder: remove SPATIAL_DIM parameter from ConvTensorLayouts

This information is already in the SIGNATURE, so its pointless to pass it
separately. This streamlines the interface of those functions a bit. Also
touches up the style of those files in general.

* ck-builder: implement reference conv using old ck

The old ck implementation is more featureful and better tested.

* ck-builder: replace test_reference_execution reference with old ck

This strips out the ck-tile gpu reference implementation completely.

* ck-builder: clean up test_reference_execution

- Remove unneccesary messages
- Replace EXPECT_TRUE(true) with EXPECT_NO_THROW()

[ROCm/composable_kernel commit: 1040d9b1f5]
2026-01-21 19:18:47 +01:00
andrew clark
5a27de45e5 Sanitizing URL-encoded characters from the image file name (#3622)
[ROCm/composable_kernel commit: 0fbb3bb8c4]
2026-01-21 11:00:53 -07:00
Yi DING
0bb1c90674 Add CMakePresets.json (#3284)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: f41f37da96]
2026-01-21 08:04:24 -08:00
Yi DING
a0935f7669 [CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615)
[ROCm/composable_kernel commit: fcc9372c00]
2026-01-21 09:54:46 +08:00
Erwin Terpstra
b079841b10 Implement batched gemm add relu gemm add for rdna4 (#3391)
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation

* wip: many fixes in implementation of batched gemm gemm multiple d

* wip: batched gemm gemm multiple d gridwise op compiling, not working yet

* fix: incorrect d0 grid indexing in batched gemm gemm multipled

* feat: add instances for batched gemm add relu gemm add

* chore: configure instance with low vector transfer size for odd sizes

* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense

* fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes

* fix: disable odd size tests on XDL archs

* chore: removed temporary logging

* chore: update some references to C tensor to E tensor

* Tentative fix for example template params

* Tentative fix for non-multi-D batched gemm gemm device impl.

* Tentative fix for xdl example template params

* Tentative fix for profiler build on gfx90a

* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions

* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply

* fix: make the gemm1 data types match what happens in the device op

* feat: add d0s/d1s datatypes and layouts to the device op type string

* chore: change element-wise op so addition happens in fp32

* chore: add static asserts for gemm0/gemm1 calculated wave sizes

* chore: also updated other element-wise ops to use fp32 calculations

* chore: log number of supported instances

* chore: update instance comment

* chore: disable kernel timing in example by default

* fix: gemm1 wave size calculation

* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions

* chore: remove increased tolerance in batched gemm gemm multiple d example

* chore: add comment explaining that verification fails for certain input values

* chore: clarify instance comment

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: d5ae81b292]
2026-01-20 13:06:59 -08:00
Max Podkorytov
8b842250da Add persistent async input scheduler for GEMM kernels (#3520)
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
  chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks

Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
  chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations

[ROCm/composable_kernel commit: 91b4102a59]
2026-01-20 10:37:09 -08:00
Linjun-AMD
e227e837be Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)
This reverts commit 217ac48fd83deef3d0d5084815689e8c79958cc1.

[ROCm/composable_kernel commit: 8f75869408]
2026-01-20 09:40:54 -08:00
Estevan Vedovelli
8e5475654b Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598)
* Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions

Enables hipTensor to access the WMMA HW functionalities
for these combinations of datatype on gfx11 and gfx12.

* Fix change to contraction scale tests

* Fix clang-format

[ROCm/composable_kernel commit: 7d8bca7ddc]
2026-01-20 09:39:57 -08:00
Cong Ma
364ad3d521 [CK TILE GEMM] Add bf8 support to tile engine streamk generator (#3543)
[ROCm/composable_kernel commit: 4d58c70e6c]
2026-01-20 10:01:33 -07:00
music-dino
750bd72b3d Batched gemm softmax gemm descriptor fix (#3564)
* Add rocm to prefix path for codegen

* Fix issue with c0_matrix_mask construction

[ROCm/composable_kernel commit: 6300ad3c62]
2026-01-20 07:25:30 -08:00
Wojciech Laskowski
6ad65bc855 WMMA support for batched_gemm_reduce (#3332)
Summary:
- added new device impl of Batched GEMM Reduce for WMMA
- added instance library
- added WMMA impl to the Batched GEMM Reduce tests

[ROCm/composable_kernel commit: b09121f860]
2026-01-20 10:50:46 +01:00
Bartłomiej Kocot
85c5741492 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BULDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes

[ROCm/composable_kernel commit: 0727e85e52]
2026-01-19 22:29:01 -07:00
Cong Ma
c42cd28370 [CK TILE] remove dependency on std chrono (#3599)
* [CK TILE] remove dependency on std chrono

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 0517d43d31]
2026-01-19 15:31:02 -08:00
Linjun-AMD
ecda0fe2e9 [CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: f3aafb9555]
2026-01-19 15:22:33 -08:00
Max Podkorytov
8bd33c4a35 Optimize clang-format check in Jenkins CI (#3597)
This change improves the clang-format CI check to be faster and not
depend on git being available in the build environment.

Changes:
- Use `find` instead of `git ls-files` (no git dependency)
- Check all C++ files: *.h, *.hpp, *.cpp, *.h.in, *.hpp.in, *.cpp.in, *.cl
- Exclude build/ and include/rapidjson directories
- Use parallel processing with 8 cores (-P 8) for ~8x speedup
- Show only errors with unified diff format (-u)
- Clear error messages: "ERROR: <file> needs formatting"
- Preserve original logic: run clang-format only when RUN_CPPCHECK=false,
  or run both clang-format and cppcheck when RUN_CPPCHECK=true

Performance:
- Sequential processing: ~93 seconds for 5,899 files
- Parallel with 8 cores: ~12 seconds for 5,899 files
- Per-file processing time: ~15ms

This reduces CI time while maintaining code formatting standards.

[ROCm/composable_kernel commit: 98abfa4ade]
2026-01-19 12:23:06 -08:00
dependabot[bot]
ae64f66966 Bump rocm-docs-core[api_reference] from 1.31.2 to 1.31.3 in /docs/sphinx (#3602)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.2 to 1.31.3.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.2...v1.31.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 66d6a1cfa6]
2026-01-19 07:41:59 -08:00
Adam Osewski
a9ff38bc89 [CK_BUILDER] Convolution forward transfer concepts. (#3535)
* Rename member variable to better reflect its actuall meaning.

* Add transfer checks for conv fwd xdl.

* Validate tensor layouts & vector size conv fwd v3.

* Add combined transfer concepts.

* Add transfer concepts for conv fwd factories.

* Fix clang format

* Add helper instruction to get max mem vector instruction width.

* Apply review comments.

* Rename thread cluster access(->arrange) order concept

* FIx merge artifacts.

* Add generic access order limits into block transfer concept.

[ROCm/composable_kernel commit: 1a6d1b59ef]
2026-01-19 10:54:10 +01:00
Erwin Terpstra
9c660bfbe3 Implement batched gemm bias permute for RDNA4 (#3534)
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)

* wip: device struct for WMMA batched contraction multiple d based on new gridwise op

* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

* fix: failure to resolve template parameters when calling new function overload

* fix: passing reference type as parameter instead of underlying types

* fix: merge error caused duplicate definitions

* fix: make sure constness of template and parameters types match

* fix: don't compile batched contraction test on unsupported architectures

* feat: add example for new wmma implementation, and consolidate example code between platforms

* style: return inline instead of with branch

* chore: add extra assert on vector memory access sizes

* chore: clean up some unused variables

* fix: correct tail number calculation, added small cases and extra instances to the test

* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method

[ROCm/composable_kernel commit: fe40a5d139]
2026-01-17 08:30:27 +01:00
Cong Ma
487f1beee9 [CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584)
[ROCm/composable_kernel commit: f9104ef9b3]
2026-01-16 15:27:39 -08:00
Johannes Graner
b12d70ae04 [CK Profiler] Restore CPU tensor initialization when verification is not done on GPU (#3594)
* Fix large case init bounds

* Revert "Fix large case init bounds"

This reverts commit 1abca05c6f.

* Restore CPU initialization for do_verification != 2

[ROCm/composable_kernel commit: 3f735c127b]
2026-01-16 10:56:53 -08:00
logicat
fb918acff9 Remove unnecessary hip_fp16 include from stream_config (#3549)
[ROCm/composable_kernel commit: fec81109f1]
2026-01-16 10:40:05 -08:00
John Shumway
0b3ee64c89 Disable CK Builder for SLES15 in Jenkins CI (#3581)
1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder

2. Added a detailed comment explaining why this is necessary:

   - SLES15 is a legacy platform with limited C++20 ecosystem support
   - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 does not reliably support all C++20 features required by the experimental CK Builder

[ROCm/composable_kernel commit: 2d233c838a]
2026-01-16 10:36:23 -08:00
spolifroni-amd
f7614e006b CK Tile: fix some issues (#3557)
* Adding CK Tile documentation

* Updates based on feedback

* Fix tile window API description

* Fix remaining images

* add documentation about flush_cache and rotating_buffer functionality in ck_tile

* Supplement the documentation

* light edit of the ck tile conceptual doc

---------

Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 427d4fb9e9]
2026-01-16 10:34:44 -08:00
Thrupti Raj Lakshmana Gowda
f9ff023328 Fixing GEMM Multi D on Tile Engine (#3583)
[ROCm/composable_kernel commit: de8ee379ad]
2026-01-16 10:17:21 -08:00
John Shumway
d4990deb79 Merge pull request #3573 from ROCm/jshumway/builder-readme
[ROCm/composable_kernel commit: 644cdbe3c9]
2026-01-15 17:55:04 -08:00
Max Podkorytov
f6d1bb77e0 Add LLM-agnostic Docker and build analysis tools (#3576)
This commit introduces utility tools for building, testing, and analyzing
Composable Kernel. The tools are designed to be LLM-agnostic and can be
used with any AI assistant or directly from the command line.

Tools Added:
============

1. ck-docker - Docker container management
   - Start/stop ROCm-enabled containers
   - Build targets with CMake + Ninja
   - Run tests with gtest filters
   - Auto-detect GPU targets (gfx950, gfx942, etc.)
   - Per-user, per-branch container naming to avoid conflicts

2. ck-build-analysis - Build time profiling
   - Uses Clang's -ftime-trace for compilation analysis
   - Aggregates statistics across multiple trace files
   - Identifies template instantiation bottlenecks
   - Generates detailed Markdown reports with:
     * Compilation phase breakdown
     * Top expensive instantiations
     * Template family analysis
     * Data-driven optimization recommendations
   - Configurable granularity (1µs to 500µs)
   - PEP 723 compliant Python script with auto-dependency management via uv

Key Features:
=============

- LLM-agnostic design (works with any AI assistant)
- Zero-configuration setup with automatic dependency installation
- Comprehensive documentation in script/tools/README*.md
- Security hardening (input validation, no command injection)
- Multi-file trace aggregation for accurate build analysis
- Jinja2-based report generation for customizable output

Implementation:
===============

- script/tools/ck-docker - Main Docker orchestration script
- script/tools/ck-build-analysis - Build analysis orchestration
- script/tools/common.sh - Shared utilities (container mgmt, GPU detection)
- script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer
- script/tools/templates/ - Jinja2 templates for report generation
- script/tools/README*.md - Comprehensive documentation

Directory Structure:
====================

script/tools/
├── README.md                          # Main overview
├── README_ck-docker.md                # ck-docker documentation
├── README_ck-build-analysis.md        # ck-build-analysis documentation
├── ck-docker                          # Docker orchestration script
├── ck-build-analysis                  # Build analysis orchestration
├── common.sh                          # Shared utilities
├── analyze_build_trace.py             # Python analyzer (PEP 723)
└── templates/
    └── build_analysis_report.md.jinja # Report template

The tools follow Unix philosophy: do one thing well, compose easily,
and work from both CLI and programmatic contexts.

[ROCm/composable_kernel commit: 086a1f8861]
2026-01-15 08:30:23 -08:00
dependabot[bot]
fcdc0f7fee Bump rocm-docs-core[api_reference] from 1.31.1 to 1.31.2 in /docs/sphinx (#3577)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.1 to 1.31.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.1...v1.31.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.31.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: f57395689b]
2026-01-15 07:49:06 -08:00
Michal Kulikowski
eb0080ab85 [CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation
-Fixing args num in ck examples 68/69

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>


[ROCm/composable_kernel commit: e1f2a44096]
2026-01-15 16:43:02 +01:00
Yung-sheng Tu
97f2fa2912 Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)
* add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp

* add examples

* add instances to test

* remove duplicate code between examples

[ROCm/composable_kernel commit: 6df2d70143]
2026-01-15 07:19:31 -08:00
Estevan Vedovelli
09d084bfb4 Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: e30207985a]
2026-01-15 09:35:24 -05:00
Jeff Huang
445ec888ba [FMHA] Enable page size 16 for batch prefill kernel (#3568)
* [FMHA] Enable page size 16 for batch prefill kernel

* Refactor batch prefill KV offset logic to simplify template arguments
- Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args.
- Add static assert to forbid `page_size=1` with vectorized layout.

[ROCm/composable_kernel commit: 993d3e2f0e]
2026-01-15 22:11:44 +08:00