[CK TILE] fix numerical errors of preshuffle_b
This pull request introduces several improvements and fixes related to
quantized grouped GEMM (General Matrix Multiply) pipelines and their
supporting utilities.
# The numerical issue
## Steps to reproduce
```bash
Run
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4
```
# Solution
The main changes address type correctness, improve data layout and
shuffling logic, and expand test coverage to better validate different
GEMM configurations.
**Key changes include:**
### Data layout and shuffling logic
* Refactored the logic in `shuffle_b_permuteN` to use `constexpr`
variables for `KLane` and `ItemsPerAccess`, simplifying tile view
construction and correcting the permutation order for improved
efficiency and correctness (`tensor_shuffle_utils.hpp`).
* Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline
policies to account for internal data type conversion (e.g., from
`pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in
quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`,
`gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`).
[[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275)
[[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105)
### Test infrastructure enhancements
* Unit tests did not catch this issue since there were no tests for fp8.
Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`)
to support additional GEMM tile shapes and updated tests to run with
these configurations for broader coverage
(`test_gemm_pipeline_util.hpp`).
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103)
[[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant
pipeline (#4268)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [X] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [X] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [X] Any dependent changes have been merged
=?UTF-8?q?Increase=20tolerance=20for=20FP16=20GEMM=20test?=
=?UTF-8?q?s=20to=20handle=20non-deterministic=20ro=E2=80=A6=20(#4335)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
…unding
Three tests were failing intermittently with small errors (0.01-1.5%)
due to non-deterministic FP16 accumulation order from GPU thread
scheduling:
- test_ck_tile_batched_gemm
- test_ck_tile_grouped_gemm_preshuffle
- test_ck_tile_grouped_gemm_multi_d
These tests use kbatch=1 (no split-K), so errors are from
order-dependent rounding, not atomics. Increased tolerances from 1e-3 to
2e-3 (0.2%) to account for FP16 precision limits while still catching
real bugs.
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Stream-K smoke test config file generation
This change converts the stream-k smoke tests to use tile engine. Since
the m, n, and k values dependent on the CU count of a device, the
configs are generated during the Configuration Phase.
* Compute GEMM reference on GPU
* Remove redundant Stream-K tests
Removing redundant tests that are now run via tile engine.
* Fix relative and absolute tolerance calculation
This change updates the Stream-K tile engine interface to ensure that
num_wgs_per_tile is propaged and passed into the compare_results
function to calculate the rel and abs tolerance. Before, split-k was
used, which is incorrect for Stream-K since the split-k value is
always 1.
* Cleanup imports, types, and other misc items
This commit makes the following changes:
- Uses Typing module for nested type hints
- Uses quotes around cu_count_arg argument in generate_configs.cmake in
if statements
- Adds explicit include for tuple in test_gemm_streamk_simple.cpp
- Adds a type for the tiles argument in argparser to check argument
validity
* Use CU count as return value for better parsing
* Add reduction tests for bf16, fp8, and bf8
* WIP: add splitk to bquant
* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types
* chore: remove temporary test script
* fix: incorrect tile window length for splitted bq tensor window
* chore: improve comments
* test: add unit tests to cover bquant splitk functionality
* fix: conflict resolution by renaming variables
* chore: split block scale example instances in more separate files to speed up compile times
* wip: fp4 scaffolding for abquant
* feat: add fp4 decoding-while-loading to abquant pipeline
* feat: add support for fp4 CPU verification in abquant
* chore: add time tracking to reference calculation
* feat: add a4w4 test for blockscale gemm
* feat: optimize reference calculation by preconverting values to AccType
* feat: add fp4 to fp8 look-up table
* fix: reference to wrong ComputeDataType field in QuantProblem
* feat: type utilities for determining MFMA compute types
* feat: packed fp4 for abquant weight preshuffle
* feat: add separate tests for a4w4 base case, padding and preshuffleB
* fix: fp4 conversion on gfx950 attempting to use non-supported method
* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size
* chore: add fp4 preshuffleb mode to block scale example
* chore: sanity check for packed types being 1 byte
* chore: clarify tensor dimension indices with constants
* chore: replace traits check with specialized check for packed types
* style: some minor refactoring and cleanup
* fix: correct conversion table for FNUZ fp8
* chore: add fp4 instances to main abquant instances again
* chore: use same initialization branch for int4 and fp4
* chore: add missing initialization for fp4 in block scale gemm example
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* initial commit
* preshuffleQuant support for ABQuant
* fix mxfp4 to use correct QuantGroupSize
* addressing review comments and seperated Preshufflequant for A and B
* updated grouped gemm example for updated traits definition
* fix for CI failure
* updated grouped_gemm_abquant test for updated traits definition
* updated grouped_gemm_abquant test for updated traits definition
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* test: add unit test for aquant memory pipeline
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* fix: compilation error on gfx950
* chore: remove debug statements from the code
* test: resolve merge conflict
* test: remove non rcr unit tests from test suite
* Re-enable f8 x bf8 tests on CompV3 as they now pass
* On CompV4, fp8 x bf8 tests now pass with K_BlockSize I32
* Add a changelog entry
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Fix alignment issue in Stream-K workspace buffer
In CK Tile Stream-K, the workspace buffer is used to hold flags and
partials, where the first i bytes holds the flags and the remaining
bytes hold partials. This change adds padding to the flags prefix of the
workspace buffer to ensure the number of bytes is 128B-aligned. Without
this alignment, since workgroups do not skip cache when reading from
partials, they may read stale partials data in cache, leading to
incorrect results. The added padding avoids the stale data reading.
This change also re-enables the test_ck_tile_streamk_reduction tests.
* Compute reference GEMM on GPU for test verification to decrease testing time
* Addition of Stream-K tests using Tile Engine
This change adds an implementation for generating Stream-K tests using Tile Engine.
This will generate various test executables for different combinations based on the
config files. This addition has simple tests running for bf16 and fp16, with both
atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation
of Stream-K in Tile Engine.
* integrating addition of tree reduction and editing the README
* temporarily removing parallel and tree reduction from configs while bugs regarding them are being resolved
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
The test_ck_tile_streamk_reduction test suite seems to have transient
failures; hence, we are disabling these tests for now. We will re-enable
them once the bug is resolved.
* formatted
* formatted
* formatting
* formatting
* formatting
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* enable prefill shapes
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Add support for rowcol and tensor GEMM operations
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update README
* adding preshuffle quant as new parameter and its associated new files
* remove debugging statements
* adding test
* enable preshuffle quant with permuteN
* updating readme and correcponding gemmconfigs
* updating cmake file
* fixing CI failures for grouped quant gemm
* debugging permuteN
* debugging
* debugging PermuteN
* initial commit
* resolving merge conflicts
* adding test cases
* initial commit with prints
* debugging
* fine-grained working
* debugging medium grained
* fixing the tile window
* formatting
* enabling prefill shapes
* working prefill shapes
* formatted
* clean up
* code cleanup
* bug fix after merging with develop
* G128 working for both prefill and decode shapes for preshufflequant
* clean up after merging with develop
* fixing group 64 for decode shapes
* non preshufflequant working for group size 128
* enable preshuffleb and preshufflequant with variour group sizes
* reduce build time by splitting example into diff datatype files
* Adding tests for preshuffleQuant
* address review comment
* fix for gfx1201
* compile time fix for gfx1201
* clang formatted
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
* solve compiler issue
* solve the gfx950 mfma shuffle regression
* refactor jenkinsfile to handle arch name better
* [CK TILE] set divisor to count of thread along k dimension
* fix the compiler error
* solve degradation
* Finish the multiplies fix
* fix the scales
* solve compilation error
* solve the composes
* solve the error of tile sweeper
* fix the test and example
* fix for gfx950
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
* WIP
* Add Unit tests for the Multi Reduction Kernel
* clang format
* Rename multiblock to threadwise
* Multiblock WIP
* Fix multi reduce multi block unit tests
* Multi Reduce Tile Engine: WIP
* refactoring + try addressing precision error
* Fix multiops examples
* Cleanup
* Clean up tile engine's reduce op
* Update changelog
* Fix remod/clang
* Fix dates
* Fix documentation & missing file
* Fix comments
* Use the update_tile api in the multi-block kernel
* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests
* Add TileParitioner
* Cleanup
* Add warning when no data to process, in the example
* Refactoring Reduce kernel Tile Partioner + cleanup
* Move the tile partioner to its own file
* Add missing includes
* Fix copyright header with update_amd_copyright_headers.py
* Fix change of interface in Reduce2dProblem
---------
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* update grouped_gemm blockwise kernel
* update config
* update kernel
* update examples
* remove test code for now
* sync test files with origin/develop
* update example
* fix code lint
* fix code-lint
* update test code
* run clang format
* run pre-commit
* update api
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* hacky poc
* fix tile engine
* kill the lambda
* maybe fix test build
* more fixes
* clang-format
* save temp
* clang-format
* mostly fix examples
* clang-format
* remove dead code
* more cleanup
* fix fmha bwd build (default epilogue set/add appears to be broken)
* fix default epilogue tests but not correctness
* clang-format
* fix bquant
* clang-format
* cleanup dead code
* rearrange make windows for readability
* restore changes to IsSupportedArgument
* fix smoke-builder
* clang-format
* fixup rename class
* build fixes
* clang-format
* fix builder
* fixup
* remove set from builder tests
* fix test
* clang-format
* re-refactor the kernels
* clang-format
* fix header license
* remove memory operation from conv bwd test
* clang-format
* clang-format example,include
* clang-format test
* build fixes
* clang-format
* solve compilation error
* fix the CI
* solve compilation error
* clang format
* solve merge conflict
* solve merge conflict
* solve the gfx11 error
* solve test error
* moar build fixes
* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
* feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time
* chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times
* chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems
* fix: missing check for transposed load in bquant pipeline
* chore: lower unit test tensors dimensions a bit for faster tests
* chore: set grouped gemm example M/N warp tile to 16
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* fix for splitk if splitk < grid
* add different splitk implementation
* minor bugfix for streamk gemm
* Add test
---------
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* refactor reduce kernel
- Rename Reduce kernel as per convention
- Move kept_dim and reduce_dims from runtime to compile-time parameters
- Update Reduce2dProblem template to include KeptDim, ReduceDims, and
Rank
- Remove IsSupportedArgument validation function as it's unnecessary.
Not using the GuaranteedLastDimensionVectorStride while making tensor
view or descriptor which removes the bounds enforced earlier. We still
calculate and use vector size.
- Update reduce example to demonstrate NCHW->NHW reduction with
non-contiguous support
- Update tests
Kernel now handles both contiguous and non-contiguous memory layout.
* fix compile errors
* Support A/B Quantization in Blockscale GEMM
* Support A/B Quantization in Blockscale GEMM
* Support A/B Quantization in Blockscale GEMM
* Support A/B Quantization in Blockscale GEMM
* Support A/B Quantization in Blockscale GEMM
* Implement review suggested changes
* Implement review suggested changes
* Sync with develop
* fix pre-commit error
* Add unit tests for blockscale AB-Quantization
* fix pre-commit error
* fix pre-commit error
* fix compile error
* fix compile error
* fix clang-format
* fix clang-format
* fix enumeration values not handled in switch
* rebase file
* Add missing enums to data_type_sizeof (#3430)
Fixes broken build on gfx942. This was some test code that got merged at the same time.
* [CK_BUILDER] CK Tile header installation for builder, algorithm concept improvements (#3419)
* Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen
* Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts
---------
Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com>
* Add build trace diagnostics to CI. (#3432)
* generate and visualize build traces for all archs
* generate build traces in all cases
* fix jenkins logic
* fix typo
* use more threads for parsing dependency map
* add script to parse ninja traces and issue warnings
* fix python script syntax and header
* fix python syntax one more time
* fix python syntax
* Support A/B Quantization in Blockscale GEMM
* Implement review suggested changes
* Sync with develop
* Add unit tests for blockscale AB-Quantization
* fix enumeration values not handled in switch
* rebase file
* rebase file
---------
Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: DarylHawkinsAMD <Daryl.Hawkins@amd.com>
Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* build: reduce build time for bqaunt unit tests by splitting into multiple cpp
* reduce the test case & add the gfx10 support
* fix: copyright header for new file
* chore: add copyright to pass the CI
* build: Hot fix to reduce massive build time by just disabling the instances
* Update include/ck_tile/core/config.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: khushbu <khuagarw@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* CK Tile Stream-K Tree Reduction
This change adds the first implementation of the Stream-K tree reduction
strategy into CK Tile. The tree reduction reduces the the number of
steps for accumulating results for a tile from O(N) to O(logN) where N
is the number of workgroups contributing to a C tile.
Additionally, in the original non-atomic reduction strategy, atomics
were used to set the flags buffer and to read from the flags buffer.
Howeover, through investigation with the tree reduciton, atomics with
default (relaxed) semantics were not enough to guarantee workgroups
would not read stale data, leading to incorrect results. Stronger
acquire/release memory orderings are too expensive. So, this change
also eliminates the use of atomics for setting the flags. Instead, we
leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby
avoiding the use of atomics.
Prelimiary tests were also added for the normal reduction and tree
reduction. More will be added in a future PR via tile engine.
* Move Stream-K kernel files to a subdirectory
* Cleanup Code Style & Handle Unsupported Reductions
This change makes the following small changes:
- Add an explicit else block for unimplemented reduction strategies
- Clarify type of sk_flags_ptr via auto*
- Add description for extra_iters_before_me variable
* Run new copyright script on new files
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* rebase to new develop
* fix clang format
* update code according to reviewer's comment
* Update README.md
* update code according to reviewer's comment
* update code according to reviewer's comment
* Update CMakeLists.txt
* Update README.md
* Update CMakeLists.txt
* Delete files
* Delete files
* Add unit tests
* Update test_gemm_quant_base.hpp
* merge bf16*fp4 example to develop branch
* fix clang format
* fix clang format
* Update CMakeLists.txt
* fix ci test
* fix clang format
* resolve conflicts
---------
Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* First version of split-K autodeduction.
* Fix circular dependency and kernel construction.
* Fix tolerance calculation for bwd weight example.
* Simplify kernel construction.
* Fix kernel launching bug for split-K autodeduce.
* Add split-K autodeduction support for the two stage example.
* Fix a corner case.
* Fix clang-format.
* Fix clang-format for inc files.
* Add missing header.
* Prevent too large split-K values.
* Fix formatting.
* Add unit tests for IsSupportedArgument in grouped bwd conv.
* clang-format.
* Fix merge conflicts.
* Address feedback from code review.
* clang-format
* Fix new tests after merge.
---------
Co-authored-by: Ville Pietilä <>
* WIP: preparing to add transpose bq support
* WIP: handle both row/col layout for BQ windows/tile dstr
* Fix build
* WIP: adding some test, debugging numerical errors
* Fix all but pkint4 tests
* Remove test_gemm_quant_typed.cpp again
* update disabled tests
* add conversion from pkint4 for b matrix
* fix formatting
* fix formatting
* Fix tr_load and use override b datatype for clarity
* fix formatting
* make bquant preshuffle tests bqlayout column-major
* wip: add aquant to grouped gemm quant example
* fix: properly handle hot loop count in aquant pipeline
* fix: add separate GemmConfig structs for AQuant, automatically select the correct one
* feat: finish support for a non-persistent kernel invocation for grouped gemm quant, and add support code to example
* refactor: cleaned up grouped gemm quant example a bit by reusing pipeline selection logic
* chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants
* feat: add quant grouped gemm tests cases for aquant (regular and transpose C) and non-persistent kernel
* fix: update base pipeline classes according to changes in develop branch
* Revert "chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants"
This reverts commit b3fd4d326d.
* feat: remove aquant config from grouped gemm quant example, update to add persistency as runtime parameter
* chore: removed work-around for aquant bug that has been fixed
* chore: fix typo in command-line parameters
* fix: correct K warp tile size for gfx950
* chore: incorrect warp tile configuration on gfx942
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* feat(gemm_quant): add RRR and CRR layout support for aquant gemm
* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm
* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests
* fix: test cases compilation failure due to PR# 2095
* fix: make condition to filter out tests for gfx950 more explicit
* need to support the gfx950
* fix: add layout suppot for gfx950
* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)
* WIP: add support for pk_int4_t for aquant mode layouts RR and CR
* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant
* fix: compile time error for gfx950
* fix: minor bug where is_a_load_tr_v() was mising
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* docs: update changelog with new layout support info
* Update CHANGELOG.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor: break test instances into multiple cpp files to reduce build time (#3319)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* refactor: break test instances into multiple cpp files to reduce build time
* chore: rename file for better code readability
* fix: merge conflict resolution
* fix: remove memory pipeline because new layout is not compatible
* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()
* refactor: address review comments
* solve the conflict
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>