[CK_TILE] Add blockscale GEMM support for EightWarps on
gfx950 (#4280)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
gemm blockscale eightwarps support
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
[CK_TILE] Fix MMA concepts compiler error
## Motivation
CK Tile is required to support certain older OSs; on these OSs, cpp 20
is not fully supported. For ROCm 7.2, compiler errors occur on one of
these older OSs. An example of this error is as follows:
```bash
/composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments
34 | { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>;
|
```
The goal of this PR is to resolve these compiler errors.
## Technical Details
The existing guards around the mma concepts only check if the concepts
language feature is supported, as follows:
```cpp
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L
// ...
template <typename CtrlFlags>
concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) {
// Flag members for Gfx9 MFMA instructions
{ CtrlFlags::Cbsz } -> std::convertible_to<int>;
{ CtrlFlags::Abid } -> std::convertible_to<int>;
{ CtrlFlags::Blgp } -> std::convertible_to<int>;
};
#endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L
```
That said, in cases where functionality from the `<concepts>` header is
used (e.g., `std::convertible_to`), this guard fails to check whether
the `<concepts>` header is available.
This change adds an additional check to the concepts that make use of
functionality from the `<concepts>` header to ensure the header is
available.
## Test Plan
I tested the changes on the relevant docker for gfx90a, gfx950, and
gfx942 and the compiler issue is not present.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* chore: split block scale example instances in more separate files to speed up compile times
* wip: fp4 scaffolding for abquant
* feat: add fp4 decoding-while-loading to abquant pipeline
* feat: add support for fp4 CPU verification in abquant
* chore: add time tracking to reference calculation
* feat: add a4w4 test for blockscale gemm
* feat: optimize reference calculation by preconverting values to AccType
* feat: add fp4 to fp8 look-up table
* fix: reference to wrong ComputeDataType field in QuantProblem
* feat: type utilities for determining MFMA compute types
* feat: packed fp4 for abquant weight preshuffle
* feat: add separate tests for a4w4 base case, padding and preshuffleB
* fix: fp4 conversion on gfx950 attempting to use non-supported method
* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size
* chore: add fp4 preshuffleb mode to block scale example
* chore: sanity check for packed types being 1 byte
* chore: clarify tensor dimension indices with constants
* chore: replace traits check with specialized check for packed types
* style: some minor refactoring and cleanup
* fix: correct conversion table for FNUZ fp8
* chore: add fp4 instances to main abquant instances again
* chore: use same initialization branch for int4 and fp4
* chore: add missing initialization for fp4 in block scale gemm example
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* add block scale parameters to kernel
* add block scale to kernel
* add smoke test
* format
* Revert "format"
This reverts commit 356c3c9706.
* only format my code
* format py
* fix auto not allowd in function prototype
* change instance tttt to ttff
* fix structured binding issue
* change s_acc elementwise op
* async pipeline add block scale
* add quantation P using shift exp2
* precompute (m - shift) once per row
* change blk scale seqstrt ptr name
* fix some name
* fix for deduction guide
* fix some comments
* add P scale to qr_ksvs_pipeline
* add comment to idx_identity
* change the method of calculating descale block index
* unify naming style: use block_scale_ as name prefix
* unify naming style
* update the CHANGELOG.md
* Add FP8 block scale quantization support for FMHA forward kernel
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
* Rename member variable to better reflect its actuall meaning.
* Add transfer checks for conv fwd xdl.
* Validate tensor layouts & vector size conv fwd v3.
* Add combined transfer concepts.
* Add transfer concepts for conv fwd factories.
* Fix clang format
* Add helper instruction to get max mem vector instruction width.
* Apply review comments.
* Rename thread cluster access(->arrange) order concept
* FIx merge artifacts.
* Add generic access order limits into block transfer concept.
* solve compiler issue
* solve the gfx950 mfma shuffle regression
* refactor jenkinsfile to handle arch name better
* [CK TILE] set divisor to count of thread along k dimension
* fix the compiler error
* solve degradation
* Finish the multiplies fix
* fix the scales
* solve compilation error
* solve the composes
* solve the error of tile sweeper
* fix the test and example
* fix for gfx950
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
* WIP
* Add Unit tests for the Multi Reduction Kernel
* clang format
* Rename multiblock to threadwise
* Multiblock WIP
* Fix multi reduce multi block unit tests
* Multi Reduce Tile Engine: WIP
* refactoring + try addressing precision error
* Fix multiops examples
* Cleanup
* Clean up tile engine's reduce op
* Update changelog
* Fix remod/clang
* Fix dates
* Fix documentation & missing file
* Fix comments
* Use the update_tile api in the multi-block kernel
* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests
* Add TileParitioner
* Cleanup
* Add warning when no data to process, in the example
* Refactoring Reduce kernel Tile Partioner + cleanup
* Move the tile partioner to its own file
* Add missing includes
* Fix copyright header with update_amd_copyright_headers.py
* Fix change of interface in Reduce2dProblem
---------
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Refactor function amd_buffer_load_invalid_element_return_zero to avoid
the inefficient ASM code generated by compiler.
Compiler generates suboptimal assembly for ternary operator, causing excessive VGPR usage
Tested compilers:
- Rocm 7.0.1
- Rocm 7.1.1
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* [CK-TILE] Guard against compiler lexer diagnostic
A recent change to Clang added a lexer-level diagnostic about that C2y
language feature. Since that is lexer level, the `__extension__`
compiler built-in does not work as it is only respected *after* the
lexer when parsing.
This change adds guarding pragmas to disable the diagnostic in the
lexer and not lead to warnings being treated as errors.
* Fixing still existing build issue
Once the one warning was removed, another one poppoed up. Both are
related to the same c2y feature. Thus, ignoring both.
* clang-format handling
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* build: reduce build time for bqaunt unit tests by splitting into multiple cpp
* reduce the test case & add the gfx10 support
* fix: copyright header for new file
* chore: add copyright to pass the CI
* build: Hot fix to reduce massive build time by just disabling the instances
* Update include/ck_tile/core/config.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: khushbu <khuagarw@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* rebase to new develop
* fix clang format
* update code according to reviewer's comment
* Update README.md
* update code according to reviewer's comment
* update code according to reviewer's comment
* Update CMakeLists.txt
* Update README.md
* Update CMakeLists.txt
* Delete files
* Delete files
* Add unit tests
* Update test_gemm_quant_base.hpp
* merge bf16*fp4 example to develop branch
* fix clang format
* fix clang format
* Update CMakeLists.txt
* fix ci test
* fix clang format
* resolve conflicts
---------
Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic
* correct clang-format
* removed unused rtol_atol variable from example code
* clang format correction
* remove unused varable max_accumulated_value from example
* First look at mfma / wmma unification
* Refactor
* Re-org file structure
* Restructure transform selection and WaveWiseMma class
* Update license files. Add missing gfx1151 support. Change wave size for HOST to 1. Update datatypes naming consistency
* Fixes default MmaSelector implentation
* Adds unit tests for amdgcn_mma and arch
* Consolidate common arch id checks to constexpr functions. Strongly type ids as amdgcn_target_arch_id object.
* Refactor is_any_value_of
* Fixes mma_selector logic
* Fix typo
* Add mma selector test for tile decomposition
* Fix compilation of mma.hpp
* Revert back to c++17 compatibility
* Fix compiler error by returning index_t from get_warp_size()
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fixes compiler error for missing is_wave32() function
* Fixes compiler error for host wave_size() should be 64
* Fixes compiler errors where __cpp_concepts is not defined
* Fixes compiler errors where __cpp_concepts is not defined
* Fix test failure for host is wave64 by default
---------
Co-authored-by: Chris Millette <you@example.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Add missing copyright statements
* Use ck_tile::host_tensor_descriptor instead of a custom lambda
* Refactor use of check_data_type in test classes
* Use TEST_SUITE_NAME with TYPED_TEST_SUITE
* Remove an unused namespace
* Make dim3 const
* Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile
* Add missing precision type combinations to CompV4 from CompV3
* Move the INT8 tests around for consistency with KernelTypesCompV3Wmma
* Add missing precision type combinations to CompV3Wmma from CompV3
* Remove the basic and universal tests and their dependencies
* On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B
* Use ADataType and BDataType instead of ComputeDataType for WarpGemm
* Explicitly set some return types to void
* Use more general typenames in InterleavedPKTypeLoader
* Add load_interleaved_pk_type.hpp to common.hpp
* Use std::is_same_v in load_int4_tile
* Add handling of LoadTranspose to load_int4_tile
* Factor out common code in several places using load_int4_tile
* Add support for pk_int4_t using load_int4_tile
* Fix formatting
* fix: fix bug in print tile window when printing bf8/fp8 tiles
* test(print_tile_window_range): add unit tests to maintain function integrity
* fix: fp8 numerical mismatch error on gfx950 by adding DCK_TILE_USE_OCP_FP8
Fixes LDS bank conflicts on gfx950 for universal gemm v3 pipeline
Replaces hardcoded LDS layer calculations with dynamic computation using the new architecture helpers
Adds architecture-specific helper function get_n_lds_banks()
Changes function attributes from CK_TILE_HOST_DEVICE to CK_TILE_DEVICE in universal gemm policy