3879 Commits

Author SHA1 Message Date
ltqin
002e077401 Fix block scale init value (#3666)
* Make blockscale descale range adaptive to data type max value

* format

[ROCm/composable_kernel commit: 654bec3362]
2026-01-28 12:37:15 -08:00
assistant-librarian[bot]
dbadcf487a Merge commit '42048bdb7d8d931966af76c6dacfedce1c9da90a' into develop 2026-01-28 17:20:56 +00:00
Robin Voetter
97d6e59580 [CK_BUILDER] Integrate CKB validation with CK verification (#3649)
* ck-builder: tensor copy function

This function copies one tensor to another, so that the memory
layout can be changed between them.

* ck-builder: fix ck::bhalf literals

These types don't work properly.

* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it

This reduces the amount of duplicated code a bit.

* ck-builder: add flat tensor iterator

This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.

* ck-builder: integrate validation with ck gpu verification

By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.

* ck-builder: add check_by_accumulations

This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.

* ck: fix test_gpu_verification GenerateRandomData for bhalf

is_integer_it<bhalf_t> yields true, but it is not actually
an integer.

* ck: make gpu_verification kernels be proper persistent kernels

Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.

* ck: clean up gpu_verification.hpp using block_reduce

This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.

* ck-builder: doc typos

* ck-builder: update testing readme with validation interface.

* ck-builder: rebase fixes + review comments

* ck-builder: fix device integer generation with float types

Passing bfloat here causes NaNs due to type_convert performing
a bitcast.

* ck: another bhalf_t bug

CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.

[ROCm/composable_kernel commit: 42048bdb7d]
2026-01-28 17:41:02 +01:00
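
The flat tensor iterator and the iterator-templated verification described in the commit above can be pictured with a minimal C++ sketch. It assumes a 2-D tensor with explicit strides; `FlatIterator2D` and `count_mismatches` are hypothetical names chosen for illustration and are not the types or functions added by the commit.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch; FlatIterator2D and count_mismatches are illustrative
// names, not the types added in the commit above.
template <typename T>
struct FlatIterator2D
{
    const T* base;                      // start of the underlying storage
    std::array<std::size_t, 2> lengths; // logical extent per dimension
    std::array<std::size_t, 2> strides; // element stride per dimension

    // Translate a flat row-major index into a strided memory offset.
    const T& operator[](std::size_t flat) const
    {
        assert(flat < lengths[0] * lengths[1]);
        const std::size_t row = flat / lengths[1];
        const std::size_t col = flat % lengths[1];
        return base[row * strides[0] + col * strides[1]];
    }
};

// Templated over "iterator" types: anything with operator[] works, so raw
// pointers and FlatIterator2D can be mixed freely.
template <typename ItA, typename ItB>
std::size_t count_mismatches(ItA a, ItB b, std::size_t n)
{
    std::size_t mismatches = 0;
    for(std::size_t i = 0; i < n; ++i)
    {
        if(!(a[i] == b[i]))
            ++mismatches;
    }
    return mismatches;
}
```

With this shape, a 2x3 tensor stored column-major (strides `{1, 2}`) can be compared element by element against a plain row-major array without copying either one, which is the kind of adaptation the commit message describes.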
kabrahamAMD
05f1759f0e [CK_BUILDER] Add reflection for wmma and bwd weight instances to ck builder reflection (#3592)
* added reflection for conv_fwd_multiple_d_wmma_cshuffle.hpp

* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle

* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle v3

* added reflection of max_transpose parameters

* fix printing of std optional parameters

* fix use of undefined ck::index

* added conv traits for device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle

* added xdl two stage instance to reflection

* added additional variables

* added reflection for grouped_conv_bwd_weight_multiple_d_wmma_cshuffle, _v3, grouped_conv_two_stage_wmma_cshuffle_v3

* added reflection for device_grouped_conv_bwd_weight_wmma_cshuffle_v3

* added reflection for bwd_weight_wmma_cshuffle

* added comments back in

* add printed output for optional parameters

* update README

* fix typo

* added num_gemm_k_prefetch_stage and small fixes

* modified test string due to reflection of new parameter

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>

[ROCm/composable_kernel commit: d6cccf6093]
2026-01-28 09:33:45 -07:00
assistant-librarian[bot]
78b36a13ab Merge commit 'bc6083bdd466d1e060253e7a49626c923293c483' into develop 2026-01-28 15:18:44 +00:00
Johannes Graner
28efdbd1c9 Update pytorch version in convolution dataset test generation (#3667)
* Update torch version in dataset test gen

[ROCm/composable_kernel commit: bc6083bdd4]
2026-01-28 07:38:10 -07:00
assistant-librarian[bot]
19a000f6c3 Merge commit '8e3d84aba3be5e851de5d6c6c3e9c08cadbce1da' into develop 2026-01-28 08:16:51 +00:00
Yi DING
bb0986e59e [CK_TILE] ABQuant New Preshuffle (#3638)
* Refactor

* Gemm quant improvement

* Change preshuffle

* Fix

* Fix grouped gemm ut

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 8e3d84aba3]
2026-01-27 23:46:49 -08:00
assistant-librarian[bot]
d4b61e4db5 Merge commit '91e32f305fa4d809103431a81594c52240753d40' into develop 2026-01-27 22:14:22 +00:00
damien-lejeune
373d8dd63d [CK Tile] multi reduce improvements (#3607)
* WIP: refactoring

* Swap operation/data nested loops order

* Improve memory coalescing

* Add comments

* Enforce same identity element for the reduce operations

* Re-add compile time constant

* Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>

[ROCm/composable_kernel commit: 91e32f305f]
2026-01-27 12:56:09 -08:00
linqunAMD
e9af74cb84 [ck] add gridwise base class for all xdl kernels (#186) (#3544)
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- To select the correct LDS layout and epilogue behavior, three additional parameters are added:
- ForceNaiveLdsLayout: disables the XOR-based LDS layout when true
- DirectLoad: the pipeline only uses direct load, so the naive layout must be forced and any padding ignored on gfx9
- IsMxGemm: the epilogue has two additional dimensions
2. Move all LDS descriptor layout related functions to the base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper functions to the base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all C epilogue related code to the base class; four implementations are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue

[ROCm/composable_kernel commit: 23cefda140]
2026-01-27 12:49:47 -08:00
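
As a rough picture of how compile-time flags on a shared base class can select an LDS layout, the sketch below uses two `bool` template parameters named after those in the commit message. Everything else is an assumption made for illustration: `GridwiseBaseSketch`, `lds_col`, and the specific XOR swizzle are hypothetical and do not reproduce GridwiseGemm_xdl_cshuffle_base.

```cpp
#include <cstddef>

// Hypothetical base class: the flag names follow the commit message, but the
// body is an illustrative stand-in, not the CK implementation.
template <bool ForceNaiveLdsLayout, bool DirectLoad>
struct GridwiseBaseSketch
{
    // Per the commit message, direct load also forces the naive layout.
    static constexpr bool use_naive_layout = ForceNaiveLdsLayout || DirectLoad;

    // Map a logical (row, col) coordinate to the column used in LDS.
    // Assumption: cols is a power of two, so the XOR result stays in range.
    static constexpr std::size_t lds_col(std::size_t row, std::size_t col, std::size_t cols)
    {
        if constexpr(use_naive_layout)
            return col; // naive layout: no swizzle
        else
            return col ^ (row % cols); // XOR swizzle spreads rows across columns
    }
};

// A derived kernel class would pick the behavior through the flags.
struct ExampleKernelSketch : GridwiseBaseSketch<false, false>
{
};
```

Because the choice is made with `if constexpr`, each instantiation contains only the layout it needs, and derived classes inherit the helpers instead of duplicating them.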
assistant-librarian[bot]
362767eba8 Merge commit 'b737f1dee5a097f8b62156335e21259d8dd2784c' into develop 2026-01-27 19:18:39 +00:00
Michał Kulikowski
8130aa058e [CK]Refactoring threadwise_tensor_slice_transfer_v3r1.hpp (#3263)
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: b737f1dee5]
2026-01-27 10:48:16 -08:00
assistant-librarian[bot]
417ad9c7f1 Merge commit 'b26cb596b0cbea9f40ae36b3f245b5aa7120c5c9' into develop 2026-01-27 18:19:38 +00:00
Illia Silin
71ac48d63a fix some syntax errors (#3658)
[ROCm/composable_kernel commit: b26cb596b0]
2026-01-27 09:59:39 -08:00
assistant-librarian[bot]
21657a1f32 Merge commit '0cc83cb8e8c9d9d926469f862bc1272ef0cf0dc8' into develop 2026-01-27 16:17:03 +00:00
spolifroni-amd
cf363cb35d CK: removed the api reference (#3571)
* removed the api reference

* updating to the latest rocm-docs-core min version

* fixed a formatting issue with buffer views

* removed reference links from code snippets

* removed reference links from code snippets

---------

Co-authored-by: John Afaganis <john.afaganis@amd.com>

[ROCm/composable_kernel commit: 0cc83cb8e8]
2026-01-27 07:36:47 -08:00
assistant-librarian[bot]
fea562ee53 Merge commit 'b66597ed96180ce21e7e6a6678dfc232ed07c800' into develop 2026-01-27 14:20:24 +00:00
Max Podkorytov
078912ec20 Add build time optimization documentation (#3608)
This document describes techniques for reducing C++ template instantiation
overhead in the Composable Kernel codebase, including:

- Replacing recursive templates with pack expansion (O(N) → O(1) depth)
- Using named functors instead of lambdas to share instantiations
- Replacing template recursion with constexpr loops
- Using fold expressions for accumulation operations

These techniques can significantly reduce build times for template-heavy code.

[ROCm/composable_kernel commit: b66597ed96]
2026-01-27 06:07:27 -07:00
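
Two of the techniques listed in the commit above, fold expressions and constexpr loops, can be contrasted with template recursion in a short C++17 sketch. The names `recursive_sum`, `fold_sum`, and `factorial` are made up for this illustration and are not taken from the Composable Kernel sources or from the document the commit adds.

```cpp
// Recursive formulation: one class instantiation per element, depth O(N).
template <int... Is>
struct recursive_sum;

template <>
struct recursive_sum<>
{
    static constexpr int value = 0;
};

template <int I, int... Rest>
struct recursive_sum<I, Rest...>
{
    static constexpr int value = I + recursive_sum<Rest...>::value;
};

// Fold-expression formulation: a single instantiation, no recursion.
template <int... Is>
struct fold_sum
{
    static constexpr int value = (0 + ... + Is);
};

// A constexpr loop replaces what would otherwise be a recursive template.
constexpr int factorial(int n)
{
    int result = 1;
    for(int i = 2; i <= n; ++i)
        result *= i;
    return result;
}

// Both formulations agree; only the number of instantiations differs.
static_assert(recursive_sum<1, 2, 3, 4>::value == fold_sum<1, 2, 3, 4>::value);
```

For a pack of N values, `recursive_sum` asks the compiler for N+1 distinct specializations, while `fold_sum` needs one, which is where the build-time saving comes from.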
assistant-librarian[bot]
d7af03f452 Merge commit '3d67e6c4927a9daea9076fab75b23fb44fdc22b1' into develop 2026-01-27 09:19:15 +00:00
Bartłomiej Kocot
ab6bbbfee1 [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624)
* [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err

* Update test_grouped_convnd_fwd_tile.cpp

* Update test_grouped_convnd_fwd_tile.cpp

* Update conv_tuning_params.hpp

* clang format fix

* Update CMakeLists.txt

[ROCm/composable_kernel commit: 3d67e6c492]
2026-01-27 11:04:11 +02:00
Johannes Graner
eb72f85509 [CK tests] Extend conv GPU reference (#3539)
* test_convnd_fwd

* test_convnd_bwd_data

* test_conv_bwd_data_scale

* test_grouped_convnd_fwd_clamp

* test_grouped_convnd_fwd_scale

* multiple A/B tensors and D tensor for fwd GPU ref

* test_grouped_convnd_fwd_scaleadd_ab

* test_grouped_convnd_fwd_bias_clamp

* test_grouped_convnd_fwd_bilinear

* test_grouped_convnd_fwd_gk_bias_clamp

* Extend GPU reference to enable batchnorm epilogue

* test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp

* test_grouped_conv_bwd_data_bilinear

* test_grouped_convnd_bwd_weight_bilinear

* Add missing template instantiation

* Perform operations in float in reference

* Slightly increase tolerance for batchnorm profiler

* Revert "Slightly increase tolerance for batchnorm profiler"

This reverts commit a3b2475229.

* Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp"

This reverts commit 6da4576060.

* Revert "Extend GPU reference to enable batchnorm epilogue"

This reverts commit e2f75fa10e.

* Clarify variable names

* Refactor elementwise ops into helper functions

* Make helpers C++17-compatible

[ROCm/composable_kernel commit: c190d8d61f]
2026-01-27 09:49:42 +01:00
assistant-librarian[bot]
aa3b7866b0 Merge commit 'cc75948d1c7f732d102c8e31dc007a2ccd07761f' into develop 2026-01-27 01:42:04 +00:00
Robin Voetter
a7b7eae2a1 [CK_BUILDER] conv bwd weight testing (#3618)
* ck-builder: restructure testing conv

In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.

* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp

This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.

* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3

Turns out that the supplied layout isn't actually supported...

* ck-builder: ck and reference conv integration for bwd weight

* ck-builder: ck bwd weight execution test

* ck-builder: ckt::run support for ck-tile bwd weight

* ck-builder: ck tile bwd weight execution test

* ck-builder: extra debug printing in MatchesReference

* ck-builder: make ckt::run return RunResult

This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.

* ck-builder: RunResult matcher

Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.

* ck-builder: doc fixes

* ck-builder: add missing headers

[ROCm/composable_kernel commit: cc75948d1c]
2026-01-26 23:50:15 +01:00
assistant-librarian[bot]
65be39bfd1 Merge commit '8654c0628f83261d3dd64cfb4ec80e9dd2b29fa5' into develop 2026-01-26 22:14:16 +00:00
Andrew Clark
daa6cae5f5 Finished testing failure types. Removed testing code.
[ROCm/composable_kernel commit: 8654c0628f]
2026-01-26 15:09:49 -07:00
Andrew Clark
858387034e Removed working tests. Validating remaining tests.
[ROCm/composable_kernel commit: 402f21d0a6]
2026-01-26 15:09:49 -07:00
Andrew Clark
b555c06b8d Removed working tests. Validating remaining tests.
[ROCm/composable_kernel commit: 1397924c21]
2026-01-26 15:09:49 -07:00
Andrew Clark
8136cf5c72 Testing a pattern to support all text variations
[ROCm/composable_kernel commit: 6c596b9553]
2026-01-26 15:09:49 -07:00
Andrew Clark
b87853431e Removing working cases to test other failure examples
[ROCm/composable_kernel commit: 58e1d03244]
2026-01-26 15:09:49 -07:00
Andrew Clark
8c91ce81bf Adding forcing failure to test notifications
[ROCm/composable_kernel commit: 95768d1b22]
2026-01-26 15:09:49 -07:00
Andrew Clark
18e95f26aa Fixing Jenkinsfile too large error
[ROCm/composable_kernel commit: 786965b95e]
2026-01-26 15:09:49 -07:00
Andrew Clark
e2f587ad01 Updating failure patterns to be more reliable and adding tests to verify they are caught in the logs
[ROCm/composable_kernel commit: 42a731b791]
2026-01-26 15:09:49 -07:00
John Shumway
7565ca5310 Add python analysis scripts for Clang's time trace (#3644)
This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575).

The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration.

Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows.

This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection.

[ROCm/composable_kernel commit: a213ce676b]
2026-01-26 13:44:36 -08:00
assistant-librarian[bot]
63dde06485 Merge commit '2e49b6b2f79d5ab0fe2fca79812affd44de94db7' into develop 2026-01-26 21:13:59 +00:00
Enrico Degregori
f2c7d07666 Padding support for wave transfer (#3537)
* Add padding support with transpose

Also move the check to before writing, storing is_src_valid during reading

* Add/modify instances to use wave transfer for gemm universal

The condition is changed: the vector size of VMEM reads and LDS
writes must now equal 8 in order to use the wave transfer

* Fix clang format

* Modify example

* Fix bwd data

* Add restriction for wave transfer with padding and transpose

Add test case which shows this limitation

* Fix validity checks for 8-bit types

* Add validity check gemm_bias_add_reduce

* Add validity check grouped gemm tile loop

* Fix validity checks new flavours

* Minor fixes

* Fix clang format

[ROCm/composable_kernel commit: 2e49b6b2f7]
2026-01-26 12:57:09 -08:00
assistant-librarian[bot]
1298575103 Merge commit 'bd5fec81afdb6df7f4637128a3ba86dbfd6bcca1' into develop 2026-01-26 20:15:40 +00:00
Thrupti Raj Lakshmana Gowda
ab65977dae Removing [4,64,16] warp tile from Tile Engine (#3643)
[ROCm/composable_kernel commit: bd5fec81af]
2026-01-26 11:56:06 -08:00
yinglu
b980f0febe ck: add CK_USE_GFX950 macro (#3636)
[ROCm/composable_kernel commit: 8942a19d5e]
2026-01-26 11:38:45 -08:00
Aviral Goel
a26adffadf feat: Add Interwave scheduler for aquant memory pipeline (#3540)
* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipeline in aquant

* test: add unit test for aquant memory pipeline

* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipeline in aquant

* fix: compilation error on gfx950

* chore: remove debug statements from the code

* test: resolve merge conflict

* test: remove non rcr unit tests from test suite

[ROCm/composable_kernel commit: b8751e505d]
2026-01-26 11:27:42 -08:00
assistant-librarian[bot]
39405747ab Merge commit '3900e1e7ceacfa32cb8d1522260ed30befd4dae3' into develop 2026-01-26 19:16:22 +00:00
Thomas Ning
0983dea2be Solve the CTAD regression & add the shell file for Docker management in testing (#3634)
* Finished the work

* Fix the pipeline

[ROCm/composable_kernel commit: 3900e1e7ce]
2026-01-26 10:29:28 -08:00
SamiAario-AMD
e01c295551 Re-enable f8 x bf8 tests on compv3 and compv4 (#3605)
* Re-enable f8 x bf8 tests on CompV3 as they now pass

* On CompV4, fp8 x bf8 tests now pass with K_BlockSize I32

* Add a changelog entry

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 834642202c]
2026-01-26 10:23:26 -08:00
chris-tsiaousis-hpc
4de19a1601 Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617)
* Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation

This file includes all shared components: the kernel (shared between the two implementations), the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct, and the argument struct.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

[ROCm/composable_kernel commit: 917f35553a]
2026-01-26 10:20:30 -08:00
assistant-librarian[bot]
06fb853279 Merge commit 'de59c0716c631edfa4742e4309ee11d4379ef6e8' into develop 2026-01-26 18:17:51 +00:00
Max Podkorytov
bebf8c3720 Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585)
This change significantly improves compile-time performance by reducing template
instantiation depth for sequence generation and merging operations:

Optimizations:
- sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using
  __make_integer_seq to generate indices in a single step, then applying the
  functor via pack expansion
- uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq
  with a helper that applies a constant value via pack expansion
- sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction
  strategy. Added direct concatenation specializations for 1-4 sequences to
  avoid recursion in common cases, falling back to binary tree merging for 5+
  sequences

Documentation:
- Added extensive inline comments explaining why sequence_merge cannot achieve
  O(1) depth like sequence_gen (requires computing cumulative sequence lengths
  from heterogeneous inputs, inherently requiring recursion)
- Documented the binary tree reduction approach and why it's superior to fold
  expressions for this use case

Testing:
- Added comprehensive unit tests for uniform_sequence_gen with different values,
  sizes, and edge cases
- Added tests for sequence_gen with custom functors (double, square, identity,
  constant) to verify the new implementation works with arbitrary functors
- Added tests for sequence_merge with 4, 5, and many sequences to verify both
  the direct concatenation path and binary tree reduction path
- Added tests for empty sequence edge cases

[ROCm/composable_kernel commit: de59c0716c]
2026-01-26 10:08:55 -08:00
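
The constant-depth generation described in the commit above can be sketched in standard C++17. This version uses `std::make_index_sequence` in place of the `__make_integer_seq` builtin named in the message, and `seq`, `seq_gen_t`, `uniform_seq_t`, and `square` are hypothetical stand-ins rather than CK's own types. The binary tree merge is not covered here.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Illustrative stand-in for a compile-time integer sequence.
template <int... Is>
struct seq
{
};

// Build seq<F(0), ..., F(N-1)>: the indices arrive as one pack and the
// functor is applied through pack expansion, so no recursion is involved.
template <typename F, std::size_t... Is>
constexpr auto gen_impl(std::index_sequence<Is...>)
{
    return seq<F{}(static_cast<int>(Is))...>{};
}

template <std::size_t N, typename F>
using seq_gen_t = decltype(gen_impl<F>(std::make_index_sequence<N>{}));

// Build seq<Value, ..., Value> of length N with the same pack-expansion idea;
// each index is discarded and only drives the repetition count.
template <int Value, std::size_t... Is>
constexpr auto uniform_impl(std::index_sequence<Is...>)
{
    return seq<(static_cast<void>(Is), Value)...>{};
}

template <std::size_t N, int Value>
using uniform_seq_t = decltype(uniform_impl<Value>(std::make_index_sequence<N>{}));

// Example functor; it must be default-constructible and constexpr-callable.
struct square
{
    constexpr int operator()(int i) const { return i * i; }
};

static_assert(std::is_same_v<seq_gen_t<4, square>, seq<0, 1, 4, 9>>);
```

The helper functions are only ever used inside `decltype`, so they exist purely to deduce the index pack and never run.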
Illia Silin
70ffbc577b add dockerfile for manylinux (#3651)
[ROCm/composable_kernel commit: 054c437dec]
2026-01-26 09:23:19 -08:00
assistant-librarian[bot]
8968bceee4 Merge commit '7ac379428408337a231a86f8a8b7353b5b45aa2d' into develop 2026-01-25 13:22:29 +00:00
Ville Pietilä
e587756695 Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639)
Co-authored-by: Ville Pietilä <>

[ROCm/composable_kernel commit: 7ac3794284]
2026-01-25 13:42:23 +01:00
assistant-librarian[bot]
6a21c125a0 Merge commit 'f5c2f09036cdc22dc8944719215dd47003c50a24' into develop 2026-01-24 00:38:47 +00:00