* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)
- Group Mode Padding: Introduces the `-s_qpad` argument to support
physically padded layouts. Kernels now use padded start pointers
(`seqstart_padded_*_ptr`) for memory addressing.
- Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
arguments for efficient processing of variable-length sequences by
passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.
- FMHA examples: Support padding and variable length both in
group and batch mode. Dispatcher is updated as well (dispatch to
kPadSeqLenK enabled pipeline).
- New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
`test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
`benchmark_fwd_v3.sh` as well. These test cases and benchmarks
specifically validate and benchmark the new padding and variable-length
functionality in both group and batch modes.
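To illustrate the difference between effective and physically padded layouts described above, here is a minimal host-side sketch that builds cumulative start offsets from per-sequence lengths. The helper name `cumulative_offsets` and the example lengths are hypothetical and are not taken from the CK sources; only the idea of passing cumulative effective lengths and padded start offsets comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a CK API): exclusive prefix sum over per-sequence
// lengths. Returns lens.size() + 1 offsets, with offsets[0] == 0 and
// offsets[i + 1] - offsets[i] == lens[i].
inline std::vector<int32_t> cumulative_offsets(const std::vector<int32_t>& lens)
{
    std::vector<int32_t> offsets(lens.size() + 1, 0);
    for(std::size_t i = 0; i < lens.size(); ++i)
    {
        assert(lens[i] >= 0); // a sequence length cannot be negative
        offsets[i + 1] = offsets[i] + lens[i];
    }
    return offsets;
}
```

With effective lengths {3, 5, 2} this yields {0, 3, 8, 10}, and with padded lengths {4, 8, 4} it yields {0, 4, 12, 16}. Under these assumptions, the padded offsets would play the role of memory addressing and the effective lengths would bound the actual computation.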
* [CK_TILE] Fix build error in fmha unit tests
* [CK_TILE] add mqa, gqa to sequence padding unit tests
* [CK_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI
* [CK_TILE] remove unnecessary MakeKargs overload in FmhaFwdV3Kernel and FmhaFwdKernel
* Remove C++20 code
C++20 features should not be used in CK. Remove all C++20 code.
* fix c++17 build
* format
* fix merge issue
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* upgrade default docker to rocm7.0.1
* turn on build and test on gfx950 by default
* use rocm-dev instead of rocm
* link libhiprtc for codegen targets
* resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards
---------
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
* Update grouped_gemm example and pipeline
* find the root cause of the error: the transpose was not enabled correctly on gfx950
* Fix v3 pipeline, row and col major
* Disable f8 datatype tests, it fails on gfx950
* fix the abd test by clearing the unsupported runtime argument
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
* disable cast_tile_pk_fp16_fp32 on gfx950
* fix wrong encoding when hdim is not a power of 2
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
* Have a workable version for SGPR
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* substitute with the new sgpr read api
* update the CHANGELOG
* change to static for logic
* rename gemm_group_quant to gemm_quant
* Add TensorWise quant mode
* Cshuffle epilogue tests with tensor scaling
* Add tensor quant to example
* Don't use readfirstlane for reading scales - doesn't work for some reason
* Add to changelog
* revert include - from a merge problem?
* revert common.hpp include
* revert host.hpp include
* remove unused utility function
* rename quant pipeline problem
* refactor quant tests
* remove aquant utils
* use TEST_F
* fix all tests by changing gemm config
* Use typed tests
* fix copyright
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
* Run ctest with --output-on-failure
* Fix synchronization issues in bwd pipelines
The bwd kernel reuses the same area of LDS for ds (SGrad), bias and
dbias (BiasGrad). This means that there must be block_sync_lds between
loading one tensor and storing another to the same area.
Heavy instructions like MFMA/WMMA and global loads are executed between
reuses of the same memory so in MOST cases loading is finished by all
warps before storing is started. However, sometimes warps progress at
different speeds.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:
bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure
* change host using fp16 to check
* fp8 to fp8 compare
* rewrite input parameters
* add not squant
* remove some output code
* for scale = 1
* format
* saturates only for fp8
* add fp8bf16 data type
* add fp8bf16 data type
* fix test fp8 code
* add run_fp8bf16_tests
* change fmha fwd example parameter (adding fp8bf16)
* Support fp8bf16 for Aiter
* Support aiter fp8bf16 in c++
* fix comment about fp8 in readme.md
* add fp8fp32
* add fp8fp32 test
* remove range_q etc.
* format
* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type
* add fp8bf16 to data_type function
* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8
* format
* reset atol for fp8
* fix bug for atol
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
* Factor out the three separate copies of load_interleaved_pk_type into a common utility class
* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the weight preshuffle GEMM example
* Remove a duplicate function
* Add support for B tensor type pk_int4_t for the weight preshuffle GEMM, with tests included
* I4 support introduced more failing test cases that mirror the existing ones for F8
* Simplify the check for which tests to skip (they all have F8 as A tensor type)
* Add a changelog entry
* add the test for v2 wp pipeline, polish the code, add the support of int4 for v2 wp pipeline
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)"
This reverts commit 03b59f8c76.
* fix compile error on gfx12x
* only run tf32 example on gfx942
* only build tf32 instance on gfx942
* ckProfiler:only support tf32 in gfx942
* delete useless messages
* fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num
* WIP: add temp config to stress test numerical error correction
* refactor: remove comments
- profiler for gemm quantization for DL/XDL
- tests for gemm quantization for DL/XDL
- implementation for gemm quantization for WMMA
- profiler/tests for gemm quantization for WMMA
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function
Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset. But, the only member
variable of the SplitKBatchOffset class used in the MakeGemmTensorViews
function was splitted_k (an int32_t). The splitted_k value was used as
part of defining the dimensions of the tensor view. That said, for
Stream K, we do not need to use the SplitKBatchOffset class since we are
not using Split K. Thus, this commit changes the splitk_batch_offset
parameter to an int32_t called k_size. This will avoid the constraint of
requiring a caller of MakeGemmTensorViews to use the SplitKBatchOffset
class while still providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.
* StreamK Kernel RunGemm Implementation
Stream K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:
1. The UniversalGemmKernel::RunGemm function computes num_loop based on
a static function of the TilePartitioner. That said, for Stream K,
num_loop must be computed using a member function (namely
GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires the use of a
SplitKBatchOffset object which is not used for Stream K since we are
not using Split K.
Thus, this change adds a RunGemm function in the StreamKKernel class.
* initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm
* Fix indexing and offset issues for StreamK
These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by
MPerBlock or NPerBlock, respectively. This ensures tile window origins
are at the correct locations.
- Fix bug in the tile partitioner's GetTileIdxWithOffset. Now, we apply
divmod to the given references to ensure correct values are available
to the caller.
- Added documentation in the Stream-K operator()
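The indexing fix above can be sketched as follows: a linear macro-tile index is split with div/mod into tile coordinates, and each coordinate is then scaled by the per-block tile size to obtain the tile window origin. All names here (`TileOrigin`, `tile_origin`, `tiles_n`) are hypothetical, and the row-major ordering of tiles is an assumption; this is not the actual `GetTileIdxWithOffset` implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type for illustration; not part of the CK Tile API.
struct TileOrigin
{
    uint32_t m; // element offset of the tile window along M
    uint32_t n; // element offset of the tile window along N
};

// Maps a linear macro-tile index to the element origin of its tile window.
// Assumes tiles are numbered row-major, with tiles_n tiles along N.
inline TileOrigin tile_origin(uint32_t tile_idx,
                              uint32_t tiles_n,
                              uint32_t m_per_block,
                              uint32_t n_per_block)
{
    assert(tiles_n > 0);
    const uint32_t tile_m = tile_idx / tiles_n; // tile row
    const uint32_t tile_n = tile_idx % tiles_n; // tile column
    // Scale tile coordinates by the tile size to get element offsets.
    return TileOrigin{tile_m * m_per_block, tile_n * n_per_block};
}
```

For example, with 4 tiles along N and 256x128 tiles, tile index 5 maps to tile coordinates (1, 1) and therefore to the element origin (256, 128). Omitting the multiplication would place the window at (1, 1), which is the kind of error the fix addresses.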
* Initial gtests for Stream-K
These changes add an initial gtest suite for the CK Tile Stream-K
kernel. Currently, due to bugs in the StreamKTilePartitioner (which will
be handled in a future PR), there are validation issues for certain
cases which may differ on different architectures. Thus, we opted to run
cases that are only fully data-parallel (skipping others). A guard was
added to Stream-K's IsSupportedArgument method to ensure that callers
are aware of this constraint. Additionally, to ensure testing
reproducibility, options for setting the number of CUs and occupancy
were added to MakeKernelArgs.
* Use GemmPipeline operator() variant that takes hot loop and tail num
In Stream-K, the num_loop value varies per WG and per iteration of a
Stream-K loop. So instead, we use the version of the GemmPipeline's
operator() function that takes in has_hot_loop and tail_num. This is
similar to what is done in Grouped GEMM.
* changes from review: comments, move readfirstlane, remove ifndef
* Switch direction of C tensor traversal & add padding guard
Prior to this change, WGs travelled backwards through their assigned
macro tiles in the C tensor. For instance, if WG0 is responsible for C
tiles 0 and 1, it would first visit tile 1 then tile 0. This means that
the iter_end decrements in each iteration of the stream-K while loop.
Since we are working with unsigned integers, the subtraction operation
may not be safe. Thus, this change makes it so that WGs travel forward
so that their iter_start is incremented and their iter_end remains
fixed.
Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.
Together, these changes make it so that the algorithm is correct when
sk_blocks is greater than zero.
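A minimal sketch of the forward traversal, assuming each workgroup owns a half-open iteration range [iter_start, iter_end) and every macro tile spans a fixed number of iterations. The function name `visit_tiles_forward` and the parameter `iters_per_tile` are hypothetical; the real kernel tracks more state than this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: returns the macro tiles touched by a workgroup's
// iteration range [iter_start, iter_end), visited in increasing order.
inline std::vector<uint32_t> visit_tiles_forward(uint32_t iter_start,
                                                 uint32_t iter_end,
                                                 uint32_t iters_per_tile)
{
    assert(iters_per_tile > 0);
    std::vector<uint32_t> visited;
    // A workgroup with an empty range does no work at all.
    while(iter_start < iter_end)
    {
        const uint32_t tile = iter_start / iters_per_tile;
        // End of the current tile, clamped to this workgroup's range.
        const uint32_t tile_end = std::min((tile + 1) * iters_per_tile, iter_end);
        visited.push_back(tile);
        // iter_start only ever increases and iter_end stays fixed, so no
        // unsigned subtraction is needed to advance the loop.
        iter_start = tile_end;
    }
    return visited;
}
```

With 4 iterations per tile, the range [2, 9) visits tiles 0, 1 and 2 in that order, and an empty range such as [5, 5) visits nothing.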
* Disable StreamK_M256_N256_K256_SKBlocks12 test case
This instance involves >=3 WGs contributing to each macro tile in C. Due
to the use of atomics, this is resulting in precision errors. These
errors will not persist once the reduction strategy is implemented. We
will re-enable this test then.
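One plausible reading of why the threshold is three contributors (an interpretation, not something the commit states): floating-point addition is commutative but not associative, so two partial results give the same sum in either arrival order, while three or more can give different sums depending on the order in which the atomic adds land. The helper `sum_in_order` below is a hypothetical host-side illustration of that effect, not kernel code.

```cpp
#include <cassert>

// Accumulates three partial results into a zero-initialized accumulator in
// the given arrival order, as successive atomic adds would.
inline float sum_in_order(float a, float b, float c)
{
    float acc = 0.0f;
    acc += a;
    acc += b;
    acc += c;
    return acc;
}
```

In IEEE 754 single precision, `sum_in_order(1e8f, -1e8f, 1.0f)` gives 1.0f, while `sum_in_order(1e8f, 1.0f, -1e8f)` gives 0.0f, because 1.0f is absorbed when added to 1e8f. The same three inputs therefore produce different results depending only on arrival order.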
---------
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
* [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (c254f3d7b4)
WarpPerBlock_M * WarpPerBlock_N is not equal to ThreadPerBlock_M * ThreadPerBlock_N / warpSize, so BlockSize should be calculated from WarpPerBlock_M * WarpPerBlock_N.
For compatibility with wave32, the function GetBlockSize is added to calculate the correct size on the host side.
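The calculation described above can be sketched as a small host-side helper. The name `block_size_from_warps` and its signature are hypothetical and are not the actual `GetBlockSize` interface; the sketch only assumes that the block size is the number of warps per block times the warp size queried at run time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: derives the block size from the warp layout and the
// warp size of the target device (32 on wave32, 64 on wave64).
inline int32_t block_size_from_warps(int32_t warp_per_block_m,
                                     int32_t warp_per_block_n,
                                     int32_t warp_size)
{
    assert(warp_per_block_m > 0 && warp_per_block_n > 0);
    assert(warp_size == 32 || warp_size == 64);
    return warp_per_block_m * warp_per_block_n * warp_size;
}
```

For a 2x2 warp layout this gives 256 threads on wave64 and 128 threads on wave32, which is why a value fixed at compile time for one warp size would be wrong on the other.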
* fix blocksize for all kernels related to Generic2dBlockShape
* remove constexpr for blocks
## What's New
Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension.
## Bug Fix
Fixed 32-bit integer overflow that caused crashes with 6+ splits:
- Use `long_index_t` for batch offset calculations
- Remove redundant GemmM initialization in constructors
## How It Works
- Automatically splits batch dimension when tensor exceeds 2GB
- Uses grid.z dimension for parallel processing of splits
- Each split processes a subset of batches independently
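As a rough sketch of the splitting logic and of why 64-bit offsets matter, the helpers below choose a number of batch splits that keeps each split within a size limit and compute a split's byte offset. The function names, the exact splitting rule, and the `long_index_t` alias (a stand-in for the CK type of the same name) are assumptions for illustration; the real kernel may pick split sizes differently.

```cpp
#include <cassert>
#include <cstdint>

using long_index_t = int64_t; // stand-in for CK's long_index_t

constexpr long_index_t kTwoGiB = long_index_t{1} << 31;

// Hypothetical rule: fit as many whole batches as possible under the limit
// (at least one), then round the number of splits up.
inline long_index_t num_batch_splits(long_index_t n,
                                     long_index_t bytes_per_batch,
                                     long_index_t limit = kTwoGiB)
{
    assert(n > 0 && bytes_per_batch > 0 && limit > 0);
    long_index_t batches_per_split = limit / bytes_per_batch;
    if(batches_per_split < 1)
    {
        batches_per_split = 1;
    }
    return (n + batches_per_split - 1) / batches_per_split;
}

// Byte offset of a split's first batch. Computed in 64 bits: the product
// can exceed the int32_t range once several splits are involved.
inline long_index_t split_byte_offset(long_index_t split_idx,
                                      long_index_t batches_per_split,
                                      long_index_t bytes_per_batch)
{
    return split_idx * batches_per_split * bytes_per_batch;
}
```

With 512 MiB per batch, 8 batches fit into 2 splits of 4 batches each. The offset of split index 5 with the same split size is 10737418240 bytes, which does not fit in a 32-bit integer. In a kernel, the split index would presumably come from the grid's z dimension.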
## Testing
Verified with tile_example_grouped_conv_fwd:
- n=3000 (6 splits) ✓
- n=3500 (7 splits) ✓
- n=10480 (40 splits) ✓
Added gemm + reduce instance library for RDNA4. This includes:
- New device implementation running GEMM and reduction kernel
- instances for wmma (xdl parity)
- examples for wmma (xdl parity)
- tests for existing xdl and wmma