* HasHotLoop is a constexpr
* Remove an unused function
* Remove some unused include statements
* Add implementation and tests for fp8 x bf8 weight preshuffle GEMM
* Add implementation and tests for fp8 x bf8 in CK Tile basic and universal GEMMs
* Remove two barrier calls that HotLoopScheduler already calls
* No need to suppress a variable that hasn't been declared
* Replace six arg_parser arguments with constexpr literals
* Simplify run_gemm_test_prec_type
* The strides don't need to be passed via arg_parser as we use their default values
* The layouts don't need to be passed as arguments twice
* Pass M N and K as regular arguments, not using the argument parser
* We can now remove the argument parser
* Add a common file for precision types to be used in testing
* Convert basic and universal GEMM tests to use gtest
* Make GemmConfig a test parameter, and form test cases as the cartesian product GemmConfigs x PrecTypes
* Add GemmConfigComputeV4 to the GEMM configs to run the universal tests on
* Added a changelog entry
* Add missing copyright statements
* ifndef-define-endif is not needed with pragma once
* Fix a comment
* Add F8 x BF8 tests for CompV4 in test_gemm_pipeline_kernel_types.hpp
* Disable the unreliable test MoeSortingCase4
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Pass hdim to tile_example_fmha_fwd in fp8 tests
* Add WMMA support to fwd FMHA pipelines
* Tune tile sizes a bit for less spilling
fp16 256 is still quite slow
* Fix Q grad tile distribution for warp size = 32 and hdim >= 256
With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.
* Use code based on BlockDropout in BlockDropoutBwd
* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal
* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
* any number of warps (power-of-2), not only 4;
* kN1 = 16, not only 32;
* any number of splits;
* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now
* Replace hard-coded kN1 in dispatch code with the requested tile size
* Add gfx12-specific tile sizes for split KV
* Pass GPU architecture to kernel generation scripts
This is still a temporary solution.
* Build and run FMHA CI tests for gfx12
* Fix issue after merging
* Fix bwd tile sizes
The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).
* Use hardware f32->f8 on gfx12, remove v_perm
__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
32-bit dword) is used to store results (two f8 values).
* Update changelog
* Add WMMA support to pagedkv
* Fix scripts after rebasing
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Fix names after cherry-picking
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Do not use filters related to qr_async_trload
They disable tiles/pipelines which are valid for gfx12.
* Use different dstr encoding when C is transposed
* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code
Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution
Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.
* Generalize usages of WarpGemmDispatcher for MFMA and WMMA
WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.
* Mark has_load_tr as maybe_unused
There are no transpose loading for RDNA.
* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code
* Detect BlockSize on host based on warp size of the current device
If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.
* Fix calculation of grid size for combine kernel with warp size = 32
* Add missing includes and header
* Support multiple archs in one binary for fwd
* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill
* Support multiple archs in one binary for bwd
* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
they can be used as fallbacks (similarly to fwd);
* Extract common code from register_traits
* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"
To simplify merging , the proper fix is in develop already.
* Support new numerical d paddings in trait ordering checks
* Build fp32 tests only on gfx9
* Do not use hardcoded M0 = 64 for dot bwd kernel
* Use textwrap.indent from standard library
* Make fp8 pipelines on gfx12 consistent with gfx9
* Update tests for current pipelines
* Make ninja check more responsive in CI
ninja buffers output so this job looks hanging.
* Support fp8fp32 by limiting O vector size
The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).
* Remove unused cmake options
* Unify including amd_buffer_addressing.hpp/_builtins.hpp
* Temporarily use amd_buffer_addressing.hpp on >=gfx10
amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
* 1 scalar for exec masks instead of 2,
* gfx12 uses different instruction names etc.
* Update asm in bf16 conversions to work with warp 32
* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd
* Add arch tags to kernels/host funcs, compile for each arch separately
* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename
* Add workaround for miscompilation of bwd with padded hdim
SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.
* Fix add_gtest_executable for absolute paths
Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.
* Run only hdim 128 smoke tests for fp8fp32
There are no instances for hdim 64 and 256.
* Format py with ruff to simplify merging develop
* Fix incorrect var name
* Codegen for gfx9,gfx950 when --targets is not specified
Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).
* Combine arch-related values into ArchTrait
This more centralized approach removes duplication of various formatting templates.
* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"
Some code is extracted into a function.
* Add indexing support to pooling operator
- Add IndexDataType template parameter to pooling problem and kernel
definitions
- Enable pooling kernel to output indices of selected elements during
max/absmax pooling
- Add overloaded operators for Max and AbsMax that track when values
change using bool changed parameter
- Support optional index buffer allocation and management in device
memory
- Modify BlockReduce2d classes to handle index tensors alongside value
tensors
- Add separate shared memory allocation for index data in cross-warp
reductions
- Create validate_pool_indices function to verify index correctness
- Modify pool3d.cpp example to demonstrate index output functionality
- Add tests for index output
* fixes
* Refactor BlockReduce2D functions to get rid auxiliary private types.
* comment resolutions and some changes to block_reduce2d
- index reference implementation improved
- reduce_operator.hpp cleanedup
- updated the block_reduce2d.hpp to have index calculation for
BlockReduce2dLinearCrossWarpSync as well
* conditionally used variable declaration improvement
- the conditionally used vairbales are used only when indexing is
enabled. To inform the compiler that they may be unused and declare them
with least size possible. This may allow it to be optimized compared to
the previous declarations
* comment resolutions
* lexical ordering of the indicies
- introduced accumulate methods that handle the intermediate steps if
needed to order the indexes
* add reduce_operator_accumulate.hpp to core.hpp
---------
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* [CK_TILE] fmha: Add query padding support to backward pass
Introduces support for query sequence padding (q_padding) in the FMHA backward pass kernels.
- Passing `seqlen_q_ptr` to the backward kernels to distinguish logical from physical sequence lengths.
- Updating `OGradDotO`, `ConvertQGrad`, and `DQDKDV` kernels to respect logical lengths and handle zero-length sequences.
- Aligning LSE indexing in the forward kernel with the padded layout for consistency.
- Adding a new GTest suite (`test_fmha_bwd_kernel_padding.cpp`) with comprehensive tests for various padding scenarios, including zero-length
sequences and deterministic mode.
* fix clang format
* Adapt fmha_bwd_runner.cpp to new q, kv sequence padding
Add backward q/kv sequence padding unit tests.
* [CK_TILE] fmha: Unify sequence length and padding handling
Refactor the handling of sequence lengths and padding in the
FMHA forward and backward kernels to provide a more unified and flexible
interface.
- Replaced `seqstart_padded_*_ptr` with a more robust system that uses
`seqstart_*_ptr` for physical sequence lengths and introduces
`seqlen_*_ptr` and `cu_seqlen_*_ptr` for logical (unpadded) lengths.
- Established a clear order of precedence for determining sequence
length: cumulative lengths (`cu_seqlen_*_ptr`) take priority,
followed by per-sequence lengths (`seqlen_*_ptr`), and finally
physical lengths derived from `seqstart_*_ptr`.
- Clarified the distinction between "group mode" and "batch mode" and
how sequence lengths are handled in each case.
- Renamed `cu_seqlen_kv_ptr` to `cu_seqlen_k_ptr` for consistency.
- Updated comments and documentation to reflect the new argument
structure and usage.
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* update test cases
* format codes
* use GTEST_FAIL
* add bquant to grouped_gemm
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* fix a bug in test_grouped_gemm_util
* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm
* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* feat: add bf8 support
* chore: remove unnecessary decltype usage
* chore: add default quant_mode to function signature as fallback
* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel
Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.
* chore: set default quant mode in function signature
* test: add additional test cases to cover edge case of no hotloop
* chore: clang formatting
---------
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Addition of streamk fp8 example for CK Tile
* Adding in bf8 streamk example in CK Tile
* Refactoring fp8/bf8 unit tests
Refactored the unit tests for fp8/bf8 to utilize the test harness.
Implemented smoke tests with layouts: CCR, CRR, RCR, RRR for fp8/bf8.
The tests are using 128x128x32 for the tile configuration, as other
configurations revealed implementation gaps that are currently being
documented.
* Implement argument passing to element-wise functions for fwd convolution
* Add files for fwd + bias + clamp example
* Implement Bias
* Implement Clamp
* Elementwise function composition
* Composition unit test
* Implement fwd + bias + clamp example
* Simplify argument passing and composition
* elfunc -> bias_and_clamp
* Rename function to specify example
* Move element-wise function instantiation to kernel
* Make bias a runtime tensor
* No ugly namespace aliasing
* Initialize element-wise function on host
* Remove function initialization helper, simplify Compose initialization
* Remove unintended LSP compatibility patch
* Clean up includes and unused code
* Switch names in cshuffle epilogue
* Move CDElementwise to conv traits
* Re-add required include
* Initialize bias in same way as other tensors
* Better type specification for ds pointer
* Disable 1D convolution
* Add warning for non-group-constant bias
* Persistent Stream-K Kernel Implementation
This change implements an operator() function in the
reboot::StreamKKernel class that is enabled when the Persistent flag is
set to true. In this case, the data-parallel portion and the Stream-K
portion of the kernel are fully persistent.
The changes were made in the reboot namespace. A future PR will remove
the old Stream-K kernel class and remove the reboot namespace.
* Unit Tests for Persistent Stream-K Kernel
This change contains the inital test suite for the Persitent Stream-K
Kernel. The files contain "reboot" in the name; a future PR will remove
tests for the old Stream-K Kernel and remove the "reboot" naming.
A future commit will add tests for the non-persistent kernel.
Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase
class. This allows us to estimate the number of accumulations done per
macro tile in C to use during validation when computing relative and
absolute tolerance.
* Adding implementation for the Non-Persistent Stream-K kernel
This code is adding the operator() function for the Non-Persistent Stream-K
kernel. Persistency of the kernel is determined through a template argument.
The Non-Persistent kernel will allocate additional workgroups for the data
parallel section, leading to a different structure for processing the data
parallel and Stream-K sections.
There has been an addition to the TilePartitioner to get access to the whether
Persistent has been set to true or false in the StreamKKernel.
* Adding in the tests for the Non-Persistent Stream-K kernel
* Refactor Stream-K Reboot Unit Tests
This commit makes the following changes:
- Update test cases to determine M, N, and K based on the number of CUs.
This ensures that each test case is one of Edge Case, SK Only, DP
Only, or DP + 2 Tile SK regardless of the architecture.
- Since the DP + 2 Tile SK test case takes long to run, this change
moves this case into a separate .inc file and labels it as an extended
test.
- Since the extended test takes > 30 seconds to run, this test is added
to the list of regression tests.
* Fix spelling errors in comments for test cases
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Changes based on review
Removed const volatile for typenames
Set up alias for is_tuple_t
Naming changes for clarity: GemmCommon -> BaseGemm
Moved std::enable_if_t out of template parameters and changed to a return type for operator()
Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* clang format
* use GTEST_FAIL
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* change cmake
* change code based on comments
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
fix transpose_vectors logic for 2x2 8-bit tiles
add a test which goes through this code path.
factor out constexpr'd cases into smaller functions.
add inline docs about the data movement
impact: gemms with 8-bit non-rcr inputs on gfx942
* Reading gpuname from target for gemm in ck tile engine
* Reading gpuname from target for gemm preshuffle in ck tile engine
* Reading gpuname from target for gemm preshuffle in ck tile engine
* Get GPU changes for GEMM Muti D in TILE ENGINE
* Addressing errors for gpu name in cktileengine
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.
This change ensures uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused inlcudes
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent that means that both the data parallel and Stream-K sections
are data parallel. If it's non-persistent that means that only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.
The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.
Unit tests for the derived structs are included.
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.
This change also includes unit tests for the base tile partitioner.
* GH-2368 Adding a basic glossary
GH-2368 Minor edits
GH-2368 Adding missing READMEs and standardization.
resolving readme updates
GH-2368 Minor improvements to documentation.
Improving some readmes.
Further improvement for readmes.
Cleaned up the documentation in 'client_example' (#2468)
Update for PR
Update ACRONYMS.md to remove trivial terms
Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.
revise 37_transpose readme
revise 36_copy readme
Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.
Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.
Remove references to the Tile Engine in README files across multiple examples
* GH-2368 Adding a basic glossary
GH-2368 Minor edits
GH-2368 Adding missing READMEs and standardization.
resolving readme updates
GH-2368 Minor improvements to documentation.
Improving some readmes.
Further improvement for readmes.
Cleaned up the documentation in 'client_example' (#2468)
Update for PR
Update ACRONYMS.md to remove trivial terms
Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.
revise 37_transpose readme
revise 36_copy readme
Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.
Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.
Remove references to the Tile Engine in README files across multiple examples
Refine README files by removing outdated references to the Tile Engine
* Updates based on PR feedback 1
* Updates based on PR feedback 2
* Updates based on PR feedback 3
* Updates based on PR feedback 4
* Updates based on PR feedback 5
* Updates based on PR feedback 6
* Updates based on PR feedback 7
* Updates based on PR feedback 8
* Content Modification of CK Tile Example
* Modify the ck_tile gemm config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite
* Account for stride when computing K offsets for A and B tensor
This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correct regardless of
whether A and B are row or column major.
* Move helper code to test_gemm_streamk_util.hpp
* Separate tests into smoke/regression/extended. Add bf16 datatype
* Run clang-format
* Refactor combinatorial macro expansion and naming
* Adjust the initialization values to account for better tolerance on bf16
* Correct BF16 datatypes in comments
* Move the extended tests under the REGRESSION_TESTS label
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fixes compilation error on SLES15 with GCC 7 for gfx942 builds:
error: 'vector' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]
Changes:
- Explicitly specify template argument for `std::vector<mode_enum>` instead of relying on C++17 CTAD
- Maintains compatibility with both older (GCC 7) and newer compilers
* debugging
* debugging for prefill shapes
* comment unused code
* fix for prefill shapes
* clearing up the code
* add int4 to universal gemm example
* clang formatted
* adding test for prefill shapes in block scale gemm
* lil improv on the block pipeline
* Address Review Comment
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* reuse local prefetch logic from compute v4 pipeline
add single-tile test
explicit lambda capture
reuse lds block descriptors from base policy for the transposed case
match the test case kernel configuration with compute v4
* add comments
* Pooling 2D/3D with refernce
* Tests & cleanup
- added test for ppoling
- cleanup
- removed 2d example
* Comment resolution
- README added
- example target name rectified
- appropriate arg description and comments added
* clang-format
* appropriate blocksize calc
* modifications for future indexing addition
- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future
* some basic fixes
* comment resolutions
* comment resolutions
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* first try to understand how tile engine works
* 1st implemented unit tests
* manage different types for unit tests
* manage using different config files to have different unit tests
* manage different layouts
* making instances and running them by unit test
* Add reference calculation
* manage different input dimension combination
* add splitk to unit tests. clean code.
* remove unused files
* clean and test with a simple json file
* [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle
When TransposeC and QuantPreshuffle are both true, Aquant generates
correct result.
* [CK TILE GEMM] Support Aquant GEMM with transposeC and preshuffle
- Add unit tests
* Fix bug in is_quantpreshuffle_enabled
* clang format
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* check in pipeline and policy
for async load in mi350, need to make sure TileAccessPattern is warp_raked or block_raked
solve merge conflicts
* fix cmakelists
* make it build
* fix? buffer async fence
* relax fences; it appears it only is needed between pairs of ping-pongs
* remove fences
* remove fences
* cleanup and reformat
* add steps annotations
* comment all pipeline steps / remove unexplainable syncs
* clang-format
* add comment
* cleanup kernel types for test
* fix comment
* fix hardcoded warp size
* faithfully copy block gemm from compute v4 policy to async policy
* make async test gfx950 only
* fix cmake logic
* set separate compile options for async
* refine comment in policy
* try update hotloop scheduler
* cleanup comments
* test more K block sizes
* unhardcode Ks, sort of
* add large odd test case
* fix build for quant
* add comment to hot loop scheduler and rename enum
* reformat
* reword the pipeline description
* reformat
* address review / add static asserts / typo fix
* update changelog
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* style: clang format
* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults
* feat(grouped_gemm_multi_d): add functionality to run persistant kernel
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* style: clang format
* fix: incorrect validation method and Dtensor layout in test suite
* tests: add unit tests for grouped_gemm_multi_d persistent kernels
* parent 5b0af640369b93849335b126d6826b204ccc43a3
author AviralGoelAMD <aviral.goel@amd.com> 1758919991 +0000
committer AviralGoelAMD <aviral.goel@amd.com> 1759338256 +0000
docs: updated changelog with new feature info
fix wp gemm bug when permuteN is false (#2935)
* fix wp gemm bug when permuteN is false
* code clean
---------
Co-authored-by: valarLip <340077269@qq.com>
fix copy-paste bug in get_matrix_b; re-enable all tests in multi_abd (#2939)
[CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934)
* Fix validation of rotary embedding with time_kernel_
When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.
* Fix synchronization issue in splitkv combine pipeline
Different warps can read and then rewrite the same values of lse_acc_lds.
Sometimes warps progress at different speeds, one warp can rewrite
values that are still being read by another warp.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:
bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"
[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Add F32 MFMA warp gemms
* Support f32 in fwd FMHA
* Implement transpose_vectors for 4-byte types (float)
* Fix unexpected implicit f32->uint32 cast in buffer_store<4>
__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
* Support F32 in bwd FMHA
hdim = 256 is disabled for now because it uses too much memory on gfx90a
* Support Headdim = 48 (divisible by 16) in fwd
* Add fp32-specific receipts (800 and 801)
* Tune fwd tiles
* Tune bwd tiles
* Use small tiles only for small seqlen_q
* Fix after rebasing
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Remove constraints and adjust filtering for fp32
Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.
* Add fp32 tests
* Make splitkv and appendkv compile for fp32 only
There are no instances yet, but API still must compile when only fp32 is
requested.
* Remove unimportant f32 instances
* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS
* Replace magic numbers with a constant, improve comments for dropout
* Update changelog
* Fix condition that dq_acc must be set to zero when mask is used
The change was introduced in #2799
* Replace warp_uniform with recently added amd_wave_read_first_lane
* Add hdim = 96 and 192 to fwd
Use git ls-files to select candidate files for clang format
This change ensures that the files being selected for clang format validation are exactly the ones tracked by the git repo we are testing. This protects against an known issue where the repo being tested contained "stray files" from a previous test.
[CK_TILE] Fixing Type Conversions in PassThroughPack8 (#2769)
* Change the return type of run_gemm_combinations in the basic tests
* Change the return type of run_gemm_combinations in the universal tests
* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4
* Add universal GEMM test for fp8 x pk_i4
* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.
* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>
* No need for utility in test_ck_tile_elementwise_1d
* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8
* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant
* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union
* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced
* Convert from float to bf16 during compilation rather than using magic values
* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8
* Comment out the basic test for fp16 x pk_i4 as it does not pass
* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8
* Add basic and universal GEMM tests for bf8 x pk_i4
* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now
* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now
* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp
* Use explicit macros for enabling and disabling the the constexpr lookup based converters
* Fix two failing tests
* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant
* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16
* On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t
Grouped Conv Bwd Data out index calculation optimizations (#2917)
* Grouped Conv Bwd Data index calculation optimizations
* fixes
* refactor instances
* gfx12 fixes
* temporary disable splitK for gfx12
[CK] Fix example_grouped_conv_bwd_data_xdl_fp16 with ksplit = 2 (#2943)
root cause: AK1 and BK1 may different in class template. so we need calculate k0 per block separately when ksplit is not 1.
[CK][Examples] Extending support for rdna3/4 in following examples: (#2884)
* [CK][Examples] Extending support for rdna3/4 in following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
* [CK][Examples] Extending support for rdna3/4 in following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
---------
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
hot fix check eid range (#2924)
* hot fix check eid range
* fix clang format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Weight Preshuffle Block Scale gemm support (#2877)
* initial commit
* remove extra files
* fixing errors
* updated ReadMe file for mapping of diff quants with diff configs
* addressing review comments
* addressing review comments
* Resolved merge conflicts
* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled
The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.
* initial commit
* debugging
* working fp8 for init constant
* fp8 working with all inits
* updated block level code with comments
* changing the loop iter
* debugging
* debugging
* debugging
* code fix
* code clean up
* clang formatted
* Add comment
* code cleanup
* clang formatted
* merge conflicts fixes
* applying the latest int4 changes to the piepline
* fixing test code for updated traits
* Adding gtest
* review comments addressed
* addressing review comments
* remove c++20 code
* added flush cache changes
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>
increase time limit for AITER tests (#2948)
Code style clean-up and documentation
The following changes were made:
- Clean-up of variable namings
- Addition of README
- Removal of num_cu and occupancy args; such options are meant for
testing purposes and should not be exposed to the user
- Removal of CK_TILE_PIPELINE_MEMORY macro and PipelineTypeTraits class
since we only support one pipeline at the moment.
Fix timing issue in CK_TILE GEMM example (#2940)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults
* feat(grouped_gemm_multi_d): add functionality to run persistant kernel
* fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint
Fix timing issue in CK_TILE GEMM example (#2940)
* style: clang format
* refactor: removed unused file
[CK] Add command option instance_index and param_mask to run partial ck test (#2889)
* [CK] Add command option instance_index and param_mask to run partial ck test
Many CK test are instance test. it will loop all instance in the instance library. It causes test often out-of-time if we run test on simulator/emulator.
This PR add option instance_index and param_mask to reduce the workload of instance test
instance_index: only run test 1 available instance with specified index.
param_mask: filter the embedded parameter with specified mask
* fix CI error
* fix clang format
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
[CK_TILE]enhance elementwise test (#2683)
* enhance elementwise
* fix ci issues
* initial commit
* remove extra files
* fixing errors
* updated ReadMe file for mapping of diff quants with diff configs
* addressing review comments
* addressing review comments
* Resolved merge conflicts
* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled
The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.
* initial commit
* debugging
* working fp8 for init constant
* fp8 working with all inits
* updated block level code with comments
* changing the loop iter
* debugging
* debugging
* debugging
* code fix
* code clean up
* clang formatted
* Add comment
* code cleanup
* clang formatted
* merge conflicts fixes
* applying the latest int4 changes to the piepline
* fixing test code for updated traits
* Adding gtest
* review comments addressed
* addressing review comments
* remove c++20 code
* added flush cache changes
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>
* Change the return type of run_gemm_combinations in the basic tests
* Change the return type of run_gemm_combinations in the universal tests
* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4
* Add universal GEMM test for fp8 x pk_i4
* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.
* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>
* No need for utility in test_ck_tile_elementwise_1d
* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8
* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant
* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union
* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced
* Convert from float to bf16 during compilation rather than using magic values
* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8
* Comment out the basic test for fp16 x pk_i4 as it does not pass
* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8
* Add basic and universal GEMM tests for bf8 x pk_i4
* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now
* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now
* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp
* Use explicit macros for enabling and disabling the the constexpr lookup based converters
* Fix two failing tests
* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant
* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16
* On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Add F32 MFMA warp gemms
* Support f32 in fwd FMHA
* Implement transpose_vectors for 4-byte types (float)
* Fix unexpected implicit f32->uint32 cast in buffer_store<4>
__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
* Support F32 in bwd FMHA
hdim = 256 is disabled for now because it uses too much memory on gfx90a
* Support Headdim = 48 (divisible by 16) in fwd
* Add fp32-specific receipts (800 and 801)
* Tune fwd tiles
* Tune bwd tiles
* Use small tiles only for small seqlen_q
* Fix after rebasing
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Remove constraints and adjust filtering for fp32
Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.
* Add fp32 tests
* Make splitkv and appendkv compile for fp32 only
There are no instances yet, but API still must compile when only fp32 is
requested.
* Remove unimportant f32 instances
* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS
* Replace magic numbers with a constant, improve comments for dropout
* Update changelog
* Fix condition that dq_acc must be set to zero when mask is used
The change was introduced in #2799
* Replace warp_uniform with recently added amd_wave_read_first_lane
* Add hdim = 96 and 192 to fwd
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* feat: generalized grouped_gemm_kernel.hpp
* feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* docs: add multi d info and trim down outdated content
* tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility
* style: clang format
* fix: incorrect validation method and Dtensor layout in test suite