1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
This document describes techniques for reducing C++ template instantiation
overhead in the Composable Kernel codebase, including:
- Replacing recursive templates with pack expansion (O(N) → O(1) depth)
- Using named functors instead of lambdas to share instantiations
- Replacing template recursion with constexpr loops
- Using fold expressions for accumulation operations
These techniques can significantly reduce build times for template-heavy code.
* Add padding support with transpose
Also move check before writing storing is_src_valid during reading
* Add/modify instances to use wave transfer for gemm universal
Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer
* Fix clang format
* Modify example
* Fix bwd data
* Add restriction for wave transfer with padding and transpose
Add test case which shows this limitation
* Fix validity checks 8 bit types
* Add validity check gemm_bias_add_reduce
* Add validity check grouped gemm tile loop
* Fix validity checks new flavours
* Minor fixes
* Fix clang format
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* test: add unit test for aquant memory pipeline
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* fix: compilation error on gfx950
* chore: remove debug statements from the code
* test: resolve merge conflict
* test: remove non rcr unit tests from test suite
* Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation
This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct.
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
This change significantly improves compile-time performance by reducing template
instantiation depth for sequence generation and merging operations:
Optimizations:
- sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using
__make_integer_seq to generate indices in a single step, then applying the
functor via pack expansion
- uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq
with a helper that applies a constant value via pack expansion
- sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction
strategy. Added direct concatenation specializations for 1-4 sequences to
avoid recursion in common cases, falling back to binary tree merging for 5+
sequences
Documentation:
- Added extensive inline comments explaining why sequence_merge cannot achieve
O(1) depth like sequence_gen (requires computing cumulative sequence lengths
from heterogeneous inputs, inherently requiring recursion)
- Documented the binary tree reduction approach and why it's superior to fold
expressions for this use case
Testing:
- Added comprehensive unit tests for uniform_sequence_gen with different values,
sizes, and edge cases
- Added tests for sequence_gen with custom functors (double, square, identity,
constant) to verify the new implementation works with arbitrary functors
- Added tests for sequence_merge with 4, 5, and many sequences to verify both
the direct concatenation path and binary tree reduction path
- Added tests for empty sequence edge cases
* Fix alignment issue in Stream-K workspace buffer
In CK Tile Stream-K, the workspace buffer is used to hold flags and
partials, where the first i bytes holds the flags and the remaining
bytes hold partials. This change adds padding to the flags prefix of the
workspace buffer to ensure the number of bytes is 128B-aligned. Without
this alignment, since workgroups do not skip cache when reading from
partials, they may read stale partials data in cache, leading to
incorrect results. The added padding avoids the stale data reading.
This change also re-enables the test_ck_tile_streamk_reduction tests.
* Compute reference GEMM on GPU for test verification to decrease testing time
* Moved device struct for batched gemm wmma to a common file
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Use the common device struct in the scaled batched gemm wmma implementation
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Boy-scout: Remove unused includes and ambiguous comment
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Moved pointer offset calculation and gridwise argument to common struct
This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations.
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common"
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Additional flavors for WMMA conv fwd large tensor
- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accomodate extra tensors
* changed strategy to handle GemmArgs array
* Adding generic instance
* Added generic instance to clamp and bias_clamp ops
* add block scale parameters to kernel
* add block scale to kernel
* add smoke test
* format
* Revert "format"
This reverts commit 356c3c9706.
* only format my code
* format py
* fix auto not allowd in function prototype
* change instance tttt to ttff
* fix structured binding issue
* change s_acc elementwise op
* async pipeline add block scale
* add quantation P using shift exp2
* precompute (m - shift) once per row
* change blk scale seqstrt ptr name
* fix some name
* fix for deduction guide
* fix some comments
* add P scale to qr_ksvs_pipeline
* add comment to idx_identity
* change the method of calculating descale block index
* unify naming style: use block_scale_ as name prefix
* unify naming style
* update the CHANGELOG.md
* Add FP8 block scale quantization support for FMHA forward kernel
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>