Replace specific benchmark numbers with qualitative descriptions since
measurements vary across environments and may become outdated.
Co-authored-by: Claude <noreply@anthropic.com>
* WIP: add splitk to bquant
* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types
* chore: remove temporary test script
* fix: incorrect tile window length for splitted bq tensor window
* chore: improve comments
* test: add unit tests to cover bquant splitk functionality
* fix: conflict resolution by renaming variables
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Change call to the intrinsics
* fix clang format
* Undo changes under include/ck/utility
* Use named variable as vector size
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Adding remaining flavors for grouped conv fwd
As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu
* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.
* Do not build f8 / bf8 only flavor tests on RDNA3
* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.
* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.
* adding int8 and fp16 overloads to the elementwise operations
* fixed copilot nits
* Addressing review comments:
- removed unnecessary examples for dynamic op
- removed unnecessary conv specalizations for all the flavors
- removed spurious bilinear and scale source files
* clang-format
* reduced no of tests
---------
Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
* chore: split block scale example instances in more separate files to speed up compile times
* wip: fp4 scaffolding for abquant
* feat: add fp4 decoding-while-loading to abquant pipeline
* feat: add support for fp4 CPU verification in abquant
* chore: add time tracking to reference calculation
* feat: add a4w4 test for blockscale gemm
* feat: optimize reference calculation by preconverting values to AccType
* feat: add fp4 to fp8 look-up table
* fix: reference to wrong ComputeDataType field in QuantProblem
* feat: type utilities for determining MFMA compute types
* feat: packed fp4 for abquant weight preshuffle
* feat: add separate tests for a4w4 base case, padding and preshuffleB
* fix: fp4 conversion on gfx950 attempting to use non-supported method
* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size
* chore: add fp4 preshuffleb mode to block scale example
* chore: sanity check for packed types being 1 byte
* chore: clarify tensor dimension indices with constants
* chore: replace traits check with specialized check for packed types
* style: some minor refactoring and cleanup
* fix: correct conversion table for FNUZ fp8
* chore: add fp4 instances to main abquant instances again
* chore: use same initialization branch for int4 and fp4
* chore: add missing initialization for fp4 in block scale gemm example
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Add multi AB support to wave transfer
* Improviments to multi ABD examples
* Add instances and use intrawave v1 instead of interwave
* Apply changes to other transfers
* Wave transfer: add support for multiple internal vgpr buffers
* Fix compilation error gfx11
* Enable bwd weight splitk autodeduction with cap
* Fix error threshold calculations
* Add missing logic to wmma multiple d kernel
* Fix threshold calculation
* Update test with new applicability
* initial commit
* preshuffleQuant support for ABQuant
* fix mxfp4 to use correct QuantGroupSize
* addressing review comments and seperated Preshufflequant for A and B
* updated grouped gemm example for updated traits definition
* fix for CI failure
* updated grouped_gemm_abquant test for updated traits definition
* updated grouped_gemm_abquant test for updated traits definition
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* ck-builder: tensor copy function
This function copies one tensor to another, so that the memory
layout can be changed between them.
* ck-builder: fix ck::bhalf literals
These types don't work properly.
* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it
This reduces the amount of duplicated code a bit.
* ck-builder: add flat tensor iterator
This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.
* ck-builder: integrate validation with ck gpu verification
By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.
* ck-builder: add check_by_accumulations
This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.
* ck: fix test_gpu_verification GenerateRandomData for bhalf
is_integer_it<bhalf_t> yields true, but it is not actually
an integer.
* ck: make gpu_verification kernels be proper persistent kernels
Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.
* ck: clean up gpu_verification.hpp using block_reduce
This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.
* ck-builder: doc typos
* ck-builder: update testing readme with validation interface.
* ck-builder: rebase fixes + review comments
* ck-builder: fix device integer generation with float types
Passing bfloat here causes a nans due to type_convert performing
a bitcast.
* ck: another bhalf_t bug
CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
This document describes techniques for reducing C++ template instantiation
overhead in the Composable Kernel codebase, including:
- Replacing recursive templates with pack expansion (O(N) → O(1) depth)
- Using named functors instead of lambdas to share instantiations
- Replacing template recursion with constexpr loops
- Using fold expressions for accumulation operations
These techniques can significantly reduce build times for template-heavy code.
* Add padding support with transpose
Also move check before writing storing is_src_valid during reading
* Add/modify instances to use wave transfer for gemm universal
Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer
* Fix clang format
* Modify example
* Fix bwd data
* Add restriction for wave transfer with padding and transpose
Add test case which shows this limitation
* Fix validity checks 8 bit types
* Add validity check gemm_bias_add_reduce
* Add validity check grouped gemm tile loop
* Fix validity checks new flavours
* Minor fixes
* Fix clang format
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* test: add unit test for aquant memory pipeline
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* fix: compilation error on gfx950
* chore: remove debug statements from the code
* test: resolve merge conflict
* test: remove non rcr unit tests from test suite
* Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation
This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct.
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
* Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>