* Add multi AB support to wave transfer
* Improviments to multi ABD examples
* Add instances and use intrawave v1 instead of interwave
* Apply changes to other transfers
* Wave transfer: add support for multiple internal vgpr buffers
* Fix compilation error gfx11
[ROCm/composable_kernel commit: f16d9100e4]
* Enable bwd weight splitk autodeduction with cap
* Fix error threshold calculations
* Add missing logic to wmma multiple d kernel
* Fix threshold calculation
* Update test with new applicability
[ROCm/composable_kernel commit: fabac7e2c3]
Force merging because I verified this fix manually:
git checkout develop
git pull
ninja smoke-builder (failed to build, as expected)
git checkout rvoetter/ckb-fix
ninja smoke-builder (passed!)
[ROCm/composable_kernel commit: e33f15709f]
* initial commit
* preshuffleQuant support for ABQuant
* fix mxfp4 to use correct QuantGroupSize
* addressing review comments and seperated Preshufflequant for A and B
* updated grouped gemm example for updated traits definition
* fix for CI failure
* updated grouped_gemm_abquant test for updated traits definition
* updated grouped_gemm_abquant test for updated traits definition
[ROCm/composable_kernel commit: 9b168082b7]
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: e3556fed04]
* ck-builder: tensor copy function
This function copies one tensor to another, so that the memory
layout can be changed between them.
* ck-builder: fix ck::bhalf literals
These types don't work properly.
* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it
This reduces the amount of duplicated code a bit.
* ck-builder: add flat tensor iterator
This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.
* ck-builder: integrate validation with ck gpu verification
By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.
* ck-builder: add check_by_accumulations
This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.
* ck: fix test_gpu_verification GenerateRandomData for bhalf
is_integer_it<bhalf_t> yields true, but it is not actually
an integer.
* ck: make gpu_verification kernels be proper persistent kernels
Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.
* ck: clean up gpu_verification.hpp using block_reduce
This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.
* ck-builder: doc typos
* ck-builder: update testing readme with validation interface.
* ck-builder: rebase fixes + review comments
* ck-builder: fix device integer generation with float types
Passing bfloat here causes a nans due to type_convert performing
a bitcast.
* ck: another bhalf_t bug
CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.
[ROCm/composable_kernel commit: 42048bdb7d]
* added reflection for conv_fwd_multiple_d_wmma_cshuffle.hpp
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle v3
* added reflection of max_transpose parameters
* fix printing of std optional parameters
* fix use of undefined ck::index
* added conv traits for device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle
* added xdl two stage instance to reflection
* added additional variables
* added reflection for grouped_conv_bwd_weight_multiple_d_wmma_cshuffle, _v3, grouped_conv_two_stage_wmma_cshuffle_v3,
* added reflection for device_grouped_conv_bwd_weigh_wmma_cshuffle_v3
* added reflection for bwd_weight_wmma_cshuffle
* added comments back in
* add printed output for optional parameters
* update README
* fix typo
* added num_gemm_k_prefetch_stage and small fixes
* modified test string due to reflection of new parameter
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
[ROCm/composable_kernel commit: d6cccf6093]
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
[ROCm/composable_kernel commit: 23cefda140]
* removed the api reference
* updating to the latest rocm-docs-core min version
* fixed a formatting issue with buffer views
* removed reference links from code snippets
* removed reference links from code snippets
---------
Co-authored-by: John Afaganis <john.afaganis@amd.com>
[ROCm/composable_kernel commit: 0cc83cb8e8]
This document describes techniques for reducing C++ template instantiation
overhead in the Composable Kernel codebase, including:
- Replacing recursive templates with pack expansion (O(N) → O(1) depth)
- Using named functors instead of lambdas to share instantiations
- Replacing template recursion with constexpr loops
- Using fold expressions for accumulation operations
These techniques can significantly reduce build times for template-heavy code.
[ROCm/composable_kernel commit: b66597ed96]
* ck-builder: restructure testing conv
In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.
* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp
This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.
* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3
Turns out that the supplied layout isn't actually supported...
* ck-builder: ck and reference conv integration for bwd weight
* ck-builder: ck bwd weight execution test
* ck-builder: ckt::run support for ck-tile bwd weight
* ck-builder: ck tile bwd weight execution test
* ck-builder: extra debug printing in MatchesReference
* ck-builder: make ckt::run return RunResult
This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.
* ck-builder: RunResult matcher
Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.
* ck-builder: doc fixes
* ck-builder: add missing headers
[ROCm/composable_kernel commit: cc75948d1c]
This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575).
The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration.
Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows.
This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection.
[ROCm/composable_kernel commit: a213ce676b]
* Add padding support with transpose
Also move check before writing storing is_src_valid during reading
* Add/modify instances to use wave transfer for gemm universal
Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer
* Fix clang format
* Modify example
* Fix bwd data
* Add restriction for wave transfer with padding and transpose
Add test case which shows this limitation
* Fix validity checks 8 bit types
* Add validity check gemm_bias_add_reduce
* Add validity check grouped gemm tile loop
* Fix validity checks new flavours
* Minor fixes
* Fix clang format
[ROCm/composable_kernel commit: 2e49b6b2f7]
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* test: add unit test for aquant memory pipeline
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* fix: compilation error on gfx950
* chore: remove debug statements from the code
* test: resolve merge conflict
* test: remove non rcr unit tests from test suite
[ROCm/composable_kernel commit: b8751e505d]