* Initial commit. create batched_contraction_kernel file
* initial problem definition
* implement initial example to launch kernel
* add universal gemm to contraction. initial phase
* complete implementation for special case all Dims are 1 and no Ds
* clean code
* initial changes to support multi dimensional G
* more progress in implementing multiple G
* tmp commit
* manage dynamic NumDimG in kernel
* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit
* implement the example for general Multi dimension G M N K and test different reference calculation algorithms
* 2 functions for reference using multi dimensional and flat indexing
* clean the code for multi-dimensional G, M, N, K contraction and add some logs
* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E
* some cleaning on kernel
* clean the code for calculating the offsets from flatten batch number
* Start adding MultiD support to kernel and example
* more changes to manage multi D in kernel and example
* manage passing multi d to kernel and testing.
* complete multi D support in kernel. modify example code to support it
* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning
* Minor fix
* Generalize example code for variable NumD tensors and apply cleanup based on review feedback
* Refactored code and addressed review feedback
* refactoring, cleaning, add documents, in kernel side and example codes
* Optimize batch offset calculation in kernel
* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
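The batch offset calculation for multi-dimensional G mentioned in the commits above can be sketched as follows. This is a minimal illustration, not the kernel's actual code: the function name `calculate_batch_offset`, the row-major ordering of the G dimensions (last dimension fastest), and the use of `std::array` for lengths and strides are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (name and signature are assumptions for illustration).
// Decomposes a flattened batch index into NumDimG per-dimension indices,
// treating the last G dimension as the fastest-varying one, and accumulates
// the element offset from the per-dimension strides.
template <std::size_t NumDimG>
std::int64_t calculate_batch_offset(std::int64_t flat_batch,
                                    const std::array<std::int64_t, NumDimG>& g_lengths,
                                    const std::array<std::int64_t, NumDimG>& g_strides)
{
    std::int64_t offset = 0;
    // Walk the dimensions from last to first.
    for(std::size_t i = NumDimG; i-- > 0;)
    {
        assert(g_lengths[i] > 0);
        const std::int64_t idx = flat_batch % g_lengths[i]; // index along dimension i
        flat_batch /= g_lengths[i];                         // remaining outer index
        offset += idx * g_strides[i];
    }
    return offset;
}
```

With G lengths {2, 3} and strides {100, 10}, flat batch 4 maps to indices (1, 1) and an offset of 110. The same routine would be applied per tensor (A, B, each D, and E), each with its own strides.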
* debugging
* debugging for prefill shapes
* comment unused code
* fix for prefill shapes
* clearing up the code
* add int4 to universal gemm example
* clang formatted
* adding test for prefill shapes in block scale gemm
* small improvement to the block pipeline
* Address Review Comment
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Pooling 2D/3D with reference
* Tests & cleanup
- added test for pooling
- cleanup
- removed 2d example
* Comment resolution
- README added
- example target name rectified
- appropriate arg description and comments added
* clang-format
* appropriate blocksize calc
* modifications for future indexing addition
- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future
* some basic fixes
* comment resolutions
* comment resolutions
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* WIP: add memory pipeline boiler plate code that compiles and works for one block
* WIP: tail handling works for memory pipeline
* WIP: numerical errors appear to be gone after adding block_sync_lds()
* fix: numerical error with memory pipeline by adding block_sync_lds() and new tail handler
* refactor: remove debug print statements and lints
* fix: remove redundant sync barriers
* chore: remove lint
* fix: remove unused code from tile handler and remove redundant block_sync_lds()
* fix: correct parent struct name for memory pipeline
* fix: remove static assert check from parent struct and add it to child struct because not all child structs need the static assert
* fix: defer block sync lds to just before prefill
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* style: clang format
* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults
* feat(grouped_gemm_multi_d): add functionality to run persistent kernel
* fix: incorrect validation method and Dtensor layout in test suite
* tests: add unit tests for grouped_gemm_multi_d persistent kernels
* parent 5b0af640369b93849335b126d6826b204ccc43a3
author AviralGoelAMD <aviral.goel@amd.com> 1758919991 +0000
committer AviralGoelAMD <aviral.goel@amd.com> 1759338256 +0000
docs: updated changelog with new feature info
fix wp gemm bug when permuteN is false (#2935)
* fix wp gemm bug when permuteN is false
* code clean
---------
Co-authored-by: valarLip <340077269@qq.com>
fix copy-paste bug in get_matrix_b; re-enable all tests in multi_abd (#2939)
[CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934)
* Fix validation of rotary embedding with time_kernel_
When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.
* Fix synchronization issue in splitkv combine pipeline
Different warps can read and then rewrite the same values of lse_acc_lds.
When warps progress at different speeds, one warp can rewrite
values that are still being read by another warp.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:
bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"
[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to produce more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Add F32 MFMA warp gemms
* Support f32 in fwd FMHA
* Implement transpose_vectors for 4-byte types (float)
* Fix unexpected implicit f32->uint32 cast in buffer_store<4>
__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly cast to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
* Support F32 in bwd FMHA
hdim = 256 is disabled for now because it uses too much memory on gfx90a
* Support Headdim = 48 (divisible by 16) in fwd
* Add fp32-specific receipts (800 and 801)
* Tune fwd tiles
* Tune bwd tiles
* Use small tiles only for small seqlen_q
* Fix after rebasing
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Remove constraints and adjust filtering for fp32
Custom constraints are no longer needed because now the smallest tile
is selected automatically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.
* Add fp32 tests
* Make splitkv and appendkv compile for fp32 only
There are no instances yet, but API still must compile when only fp32 is
requested.
* Remove unimportant f32 instances
* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS
* Replace magic numbers with a constant, improve comments for dropout
* Update changelog
* Fix condition that dq_acc must be set to zero when mask is used
The change was introduced in #2799
* Replace warp_uniform with recently added amd_wave_read_first_lane
* Add hdim = 96 and 192 to fwd
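The philox seed/offset regression noted in this commit (64-bit values truncated by a 32-bit broadcast) can be illustrated on the host. This is a hedged sketch only: `broadcast32` is a hypothetical stand-in for a 32-bit-only wave broadcast such as `__builtin_amdgcn_readfirstlane`, and splitting the value into two halves is one possible way to avoid the truncation, not necessarily what the actual fix does.

```cpp
#include <cstdint>

// Hypothetical stand-in for a broadcast that only handles 32-bit values.
// On the host it simply returns its argument.
inline std::uint32_t broadcast32(std::uint32_t v) { return v; }

// Problematic pattern: passing a 64-bit value through a 32-bit broadcast
// silently drops the upper 32 bits.
inline std::uint64_t broadcast64_truncating(std::uint64_t v)
{
    return broadcast32(static_cast<std::uint32_t>(v));
}

// One possible remedy (an assumption for illustration): broadcast the low
// and high halves separately, then recombine them.
inline std::uint64_t broadcast64_split(std::uint64_t v)
{
    const std::uint32_t lo = broadcast32(static_cast<std::uint32_t>(v));
    const std::uint32_t hi = broadcast32(static_cast<std::uint32_t>(v >> 32));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

For a seed such as 0x123456789ABCDEF0, the truncating version yields 0x9ABCDEF0 while the split version returns the original value, which is why seeds that fit in 32 bits never exposed the problem.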
Use git ls-files to select candidate files for clang format
This change ensures that the files being selected for clang format validation are exactly the ones tracked by the git repo we are testing. This protects against a known issue where the repo being tested contained "stray files" from a previous test.
[CK_TILE] Fixing Type Conversions in PassThroughPack8 (#2769)
* Change the return type of run_gemm_combinations in the basic tests
* Change the return type of run_gemm_combinations in the universal tests
* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4
* Add universal GEMM test for fp8 x pk_i4
* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.
* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>
* No need for utility in test_ck_tile_elementwise_1d
* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8
* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant
* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union
* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced
* Convert from float to bf16 during compilation rather than using magic values
* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8
* Comment out the basic test for fp16 x pk_i4 as it does not pass
* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8
* Add basic and universal GEMM tests for bf8 x pk_i4
* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now
* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now
* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp
* Use explicit macros for enabling and disabling the constexpr lookup based converters
* Fix two failing tests
* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant
* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16
* On ROCm 7.0.1 we need an explicit cast from uint16_t to bf16_t
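The union-free float to bf16 round-to-nearest conversion discussed in this commit can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: the names `bf16_rtn_from_bits` and `float_to_bf16_rtn_sketch` are hypothetical, and the NaN handling (forcing a quiet bit) is one common choice. The rounding is pure integer arithmetic and therefore constexpr; the run-time wrapper uses `std::memcpy`, which is well defined but not constexpr, so a fully compile-time version would need a compile-time bit cast provided by the compiler.

```cpp
#include <cstdint>
#include <cstring>

// Round-to-nearest-even on the raw IEEE-754 binary32 bit pattern.
// Integer-only, so it can be evaluated at compile time.
constexpr std::uint16_t bf16_rtn_from_bits(std::uint32_t bits)
{
    // NaN: truncate, but set a mantissa bit so the result stays NaN.
    if((bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0u)
    {
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7FFF plus the lowest kept bit, so exact ties round to even.
    const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

// Run-time wrapper: std::memcpy reads the bits of a float without relying
// on union-based type punning.
inline std::uint16_t float_to_bf16_rtn_sketch(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof(bits));
    return bf16_rtn_from_bits(bits);
}
```

Because the rounding step takes raw bits, a lookup table of bf16 values could be built at compile time from known bit patterns, in the spirit of the "convert during compilation rather than using magic values" item above.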
Grouped Conv Bwd Data out index calculation optimizations (#2917)
* Grouped Conv Bwd Data index calculation optimizations
* fixes
* refactor instances
* gfx12 fixes
* temporary disable splitK for gfx12
[CK] Fix example_grouped_conv_bwd_data_xdl_fp16 with ksplit = 2 (#2943)
Root cause: AK1 and BK1 may differ in the class template, so k0 per block must be calculated separately for each when ksplit is not 1.
[CK][Examples] Extending support for rdna3/4 in following examples: (#2884)
* [CK][Examples] Extending support for rdna3/4 in following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
---------
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
hot fix check eid range (#2924)
* hot fix check eid range
* fix clang format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Weight Preshuffle Block Scale gemm support (#2877)
* initial commit
* remove extra files
* fixing errors
* updated ReadMe file for mapping of diff quants with diff configs
* addressing review comments
* addressing review comments
* Resolved merge conflicts
* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled
The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.
* initial commit
* debugging
* working fp8 for init constant
* fp8 working with all inits
* updated block level code with comments
* changing the loop iter
* debugging
* debugging
* debugging
* code fix
* code clean up
* clang formatted
* Add comment
* code cleanup
* clang formatted
* merge conflicts fixes
* applying the latest int4 changes to the pipeline
* fixing test code for updated traits
* Adding gtest
* review comments addressed
* addressing review comments
* remove c++20 code
* added flush cache changes
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>
increase time limit for AITER tests (#2948)
Code style clean-up and documentation
The following changes were made:
- Clean-up of variable namings
- Addition of README
- Removal of num_cu and occupancy args; such options are meant for
testing purposes and should not be exposed to the user
- Removal of CK_TILE_PIPELINE_MEMORY macro and PipelineTypeTraits class
since we only support one pipeline at the moment.
Fix timing issue in CK_TILE GEMM example (#2940)
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults
* feat(grouped_gemm_multi_d): add functionality to run persistent kernel
* fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint
* style: clang format
* refactor: removed unused file
[CK] Add command option instance_index and param_mask to run partial ck test (#2889)
* [CK] Add command option instance_index and param_mask to run partial ck test
Many CK tests are instance tests that loop over every instance in the instance library, so they often time out when run on a simulator/emulator.
This PR adds the options instance_index and param_mask to reduce the workload of instance tests.
instance_index: run the test only on the one available instance with the specified index.
param_mask: filter the embedded parameters with the specified mask.
* fix CI error
* fix clang format
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
[CK_TILE]enhance elementwise test (#2683)
* enhance elementwise
* fix ci issues
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* style: clang format
* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults
* feat(grouped_gemm_multi_d): add functionality to run persistent kernel
* fix: incorrect validation method and Dtensor layout in test suite
* docs: improved README text based on review comments
* fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint
Addition of initial CK Tile Stream-K example for bf16 and fp16. These
examples are minimal. As more functionality and gtests are added for
Stream-K (coming in future PRs), these examples will be expanded.
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* feat: generalized grouped_gemm_kernel.hpp
* feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* docs: add multi d info and trim down outdated content
* tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility
* style: clang format
* fix: incorrect validation method and Dtensor layout in test suite
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)
- Group Mode Padding: Introduces the `-s_qpad` argument to support
physically padded layouts. Kernels now use padded start pointers
(`seqstart_padded_*_ptr`) for memory addressing.
- Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
arguments for efficient processing of variable-length sequences by
passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.
- FMHA examples: Support padding and variable length both in
group and batch mode. Dispatcher is updated as well (dispatch to
kPadSeqLenK enabled pipeline).
- New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
`test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
`benchmark_fwd_v3.sh` as well. These test cases and benchmarks
specifically validate/benchmark the new padding and variable-length
functionalities in both group and batch modes.
* [CK_TILE] Fix build error in fmha unit tests
* [CK_TILE] add mqa, gqa to sequence padding unit tests
* [CK_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI
* [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel
* Remove C++20 code
C++20 features should not be used in CK. Remove all C++20 code.
* fix c++17 build
* format
* fix merge issue
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
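The "cumulative effective lengths" passed to the kernel in batch mode (see the `-q_eff_lens` / `-kv_eff_lens` item above) amount to a prefix sum over the per-batch effective lengths. The sketch below is illustrative only: the helper name `make_cu_seqlens`, the leading zero entry, and the `int32_t` element type are assumptions, not the layout confirmed by the source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds cumulative sequence lengths from per-batch
// effective lengths. Under the assumed layout, cu[b] is the running total
// before batch b and cu.back() is the total number of effective tokens.
inline std::vector<std::int32_t> make_cu_seqlens(const std::vector<std::int32_t>& eff_lens)
{
    std::vector<std::int32_t> cu(eff_lens.size() + 1, 0);
    for(std::size_t b = 0; b < eff_lens.size(); ++b)
    {
        cu[b + 1] = cu[b] + eff_lens[b];
    }
    return cu;
}
```

For effective lengths {3, 5, 2} this yields {0, 3, 8, 10}; the effective length of batch b can then be recovered as cu[b + 1] - cu[b].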
* Update grouped_gemm example and pipeline
* find the root cause: the transpose was not enabled correctly on gfx950
* Fix v3 pipeline, row and col major
* Disable f8 datatype tests, it fails on gfx950
* fix the abd test by clearing the unsupported runtime argument
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
* rename gemm_group_quant to gemm_quant
* Add TensorWise quant mode
* Cshuffle epilogue tests with tensor scaling
* Add tensor quant to example
* Don't use readfirstlane for reading scales - doesn't work for some reason
* Add to changelog
* revert include - from a merge problem?
* revert common.hpp include
* revert host.hpp include
* remove unused utility function
* rename quant pipeline problem
* refactor quant tests
* remove aquant utils
* use TEST_F
* fix all tests by changing gemm config
* Use typed tests
* fix copyright
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)
- Group Mode Padding: Introduces the `-s_qpad` argument to support
physically padded layouts. Kernels now use padded start pointers
(`seqstart_padded_*_ptr`) for memory addressing.
- Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
arguments for efficient processing of variable-length sequences by
passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.
- FMHA examples: Support padding and variable length both in
group and batch mode. Dispatcher is updated as well (dispatch to
kPadSeqLenK enabled pipeline).
- New padding test cases: Add padding test cases to `smoke_test_fwd.sh`,
and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well.
These test cases and benchmarks specifically validate/benchmark the
new padding and variable-length functionalities in both group and batch modes.
* [CK_TILE] Fix build error in fmha unit tests
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
* change host using fp16 to check
* fp8 to fp8 compare
* rewrite input parameters
* add not squant
* remove some output code
* for scale = 1
* format
* saturates only for fp8
* add fp8bf16 data type
* add fp8bf16 data type
* fix test fp8 code
* add run_fp8bf16_tests
* change fmha fwd example parameter(adding fp8bf16)
* Support fp8bf16 for Aiter
* Support aiter fp8bf16 in c++
* fix comment about fp8 in readme.md
* add fp8fp32
* add fp8fp32 test
* remove range_q etc.
* format
* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type
* add fp8bf16 to data_type function
* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8
* format
* reset atol for fp8
* fix bug for atol
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
* fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num
* WIP: add temp config to stress test numerical error correction
* refactor: remove comments
* test(grouped_gemm): add gtests for the example to maintain its integrity
* test(grouped_gemm_preshuffle): add prefill variant to testbed to cover wider range
* fix: removed residue code to make b_shuffle() work again
* test(grouped_gemm_preshuffle): limit the test suite to gfx942 arch as it fails on gfx90a
* build: add gfx950 as build target for gtests
* test(grouped_gemm_preshuffle): temporarily disable fp8 prec tests due to numerical errors
* fix(grouped_gemm_preshuffle): resolved fp8 tests failure on gfx950 by adding correct compiler flag
* Use lse = false for PagedKV tests
There are no instances with lse = true so splitkv is actually launched
as a fallback.
* Reduce build time by disabling instances that are not tested
1. Refine Reduce2dShape to support both wave32 and wave64
2. Fix example reduce, permute and elementwise on gfx11 and gfx12
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* initial commit
* remove extra files
* fixing errors
* updated ReadMe file for mapping of diff quants with diff configs
* addressing review comments
* addressing review comments
* Resolved merge conflicts
* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled
The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.
---------
Co-authored-by: Cong Ma <congma13@amd.com>
* Improve random number generation
* use different seed for each input (Q, K, V...);
* use deterministic generation of:
* seqstart_q/k (for group mode);
* block_table (for paged-kvcache);
* cache_batch_idx (for kvcache);
* Extract arg_parser-related code from run functions to use them as tests
* Split examples into main programs and fmha runners, build instances separately
* Add dummy tests that use instances and runners
* Fix a missed corner case of f32->f8 conversion
When the value is < min f8 denormal but > min f8 denormal / 2, it must be
rounded to min f8 denormal (i.e. 0b1), not to 0.
* Fix incorrect fp8 scales for P and O in validation code
DataTypeConfig was incorrectly compared with fp8_t.
* Add host generation of dropout random values and use it for validation
Previously host validation (reference_batched_dropout) used random
numbers generated by BlockDropout of the kernel, meaning that incorrect
generation on device (bad distribution, repeated numbers, too many zeros,
etc.) would not trigger any validation errors.
* Implement tests from smoke_test_bwd.sh
* Return result as enum to distinguish failure and missing instance
* Add tests for bwd features: bias, alibi, dropout
* Implement tests from smoke_test_fwd.sh
* Pass seqlen_q/k as vectors to fwd and bwd runners
* Add tests for fwd features: bias, alibi, dropout
* Add tests for pagedkv and splitkv
* Fix conditions when to use splitkv and pagedkv kernels
splitkv was executed only when use_kvcache was set, where use_kvcache == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size).
In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1.
In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance.
* Add tests for appendkv
* Use is_v_rowmajor = true because there are no instances with column layout anymore
* Split public and private compile options for instances
Tests and examples need to know only about CK_TILE_FMHA_FWD_*_API.
* Improve parsing validation in bias and mask
* Pass bias as string for consistency with mask
* Catch parsing and other exceptions
* Add bwd test for deterministic flag
* Initialize fp8 tensors (-init=ufq) similarly to uf
* Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null
seqlen_k cannot be used to determine padding when seqlen_k_ptr is
provided. The actual seqlen_k is taken from seqlen_k_ptr[b].
Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr
may contain arbitrary values.
In the example or tests this produces incorrect results with appendkv
(for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8).
* Fix use_pagedkv value when kvcache = true but page_block_size = 0
In this case block_table_ptr is nullptr which is accessed in the kernel.
* Clean up bwd tests
* Unify fwd tests for f16/bf16 and fp8
* Use better explicit instantiation declaration for fmha_bwd<2>
* Use the same seed for all tests, allow to override it with env variable
* Undo clang-format of one irrelevant file
For some reason my local clang-format-18 and the one in CI work differently.
* Do not build instances and tests on unsupported archs
* Build instance libraries as OBJECT library
* CI: Enable sccache for HIP
There are source files with LANGUAGE HIP, they need
-DCMAKE_HIP_COMPILER_LAUNCHER=sccache
* Add tests to REGRESSION_TESTS
* Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0
The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are
smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf.
* Pass CK_TILE_FMHA_FWD_*_API as INTERFACE compile options
The instances don't actually depend on them, only examples and tests do.
Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS
without recompiling instances that are already in ccache.
* Fix formatting and names
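The f32->f8 corner case fixed in this commit (values between half of the smallest f8 denormal and the smallest denormal must round up to code 0b1, not to 0) can be sketched as a rounding of the magnitude in units of the smallest denormal. This is an illustration, not the library's conversion routine: the helper name `f8_subnormal_code` is hypothetical, the smallest denormal is passed in as a parameter because it depends on the f8 format, and the value 2^-9 used in the note below is an assumption for the example.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: returns the subnormal mantissa code for |x|, i.e.
// |x| expressed in units of the smallest denormal and rounded to nearest
// (ties to even under the default floating-point rounding mode).
// Only meaningful for |x| below the smallest normal value of the format.
inline std::uint8_t f8_subnormal_code(float x, float min_denormal)
{
    const float scaled = std::fabs(x) / min_denormal;
    return static_cast<std::uint8_t>(std::nearbyint(scaled));
}
```

Assuming a smallest denormal of 2^-9, an input of 0.75 * 2^-9 maps to code 1 rather than being flushed to 0, while 0.25 * 2^-9 still rounds to 0.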
BlockWarps and WarpTile in Generic2dBlockShape are wave-size dependent, which causes a mangled name mismatch between host and device side.
Solution: Replace them with ThreadPerBlock and move the BlockWarps and WarpTile calculation into Generic2dBlockShape