This change ensures that the files being selected for clang format validation are exactly the ones tracked by the git repo we are testing. This protects against an known issue where the repo being tested contained "stray files" from a previous test.
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Add F32 MFMA warp gemms
* Support f32 in fwd FMHA
* Implement transpose_vectors for 4-byte types (float)
* Fix unexpected implicit f32->uint32 cast in buffer_store<4>
__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
* Support F32 in bwd FMHA
hdim = 256 is disabled for now because it uses too much memory on gfx90a
* Support Headdim = 48 (divisible by 16) in fwd
* Add fp32-specific receipts (800 and 801)
* Tune fwd tiles
* Tune bwd tiles
* Use small tiles only for small seqlen_q
* Fix after rebasing
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Remove constraints and adjust filtering for fp32
Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.
* Add fp32 tests
* Make splitkv and appendkv compile for fp32 only
There are no instances yet, but API still must compile when only fp32 is
requested.
* Remove unimportant f32 instances
* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS
* Replace magic numbers with a constant, improve comments for dropout
* Update changelog
* Fix condition that dq_acc must be set to zero when mask is used
The change was introduced in #2799
* Replace warp_uniform with recently added amd_wave_read_first_lane
* Add hdim = 96 and 192 to fwd
* Fix validation of rotary embedding with time_kernel_
When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.
* Fix synchronization issue in splitkv combine pipeline
Different warps can read and then rewrite the same values of lse_acc_lds.
Sometimes warps progress at different speeds, one warp can rewrite
values that are still being read by another warp.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:
bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* feat: generalized grouped_gemm_kernel.hpp
* feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* docs: add multi d info and trim down outdated content
* tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility
* style: clang format
* fix: incorrect validation method and Dtensor layout in test suite
* Adding MIOpen at HEAD
* Adding container and also adding CI run for .github paths
* Adding correct flags
* Adding patches
* Adding exception for ck
* rocm-libraries at new path
* adding global safe dir
* reorder
* Fixing paths
* Adding sharding
* * [CK_TILE] Add sequence padding and variable length support in fmha (and v3)
- Group Mode Padding: Introduces the `-s_qpad` argument to support
physically padded layouts. Kernels now use padded start pointers
(`seqstart_padded_*_ptr`) for memory addressing.
- Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
arguments for efficient processing of variable-length sequences by
passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.
- FMHA examples: Support padding and variable length both in
group and batch mode. Dispatcher is updated as well (dispatch to
kPadSeqLenK enabled pipeline).
- New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
`test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
`benchmark_fwd_v3.sh` as well. These test cases and benchmarks that
specifically validate/benchmark the new padding and variable-length
functionalities in both group and batch modes.
* [CK_TILE] Fix build error in fmha unit tests
* [CK_TILE] add mqa, gqa to sequence padding unit tests
* [CI_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI
* [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel
* Remove C++20 code
C++20 features should not be used in CK. Remove all C++20 code.
* fix c++17 build
* format
* fix merge issue
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
The prefix is causing the status updates from
gitStatusWrapper to be unique to the status updates that
are created by the Jenkins server, which creates duplicates
* upgrade default docker to rocm7.0.1
* turn on build and test on gfx950 by default
* use rocm-dev instead of rocm
* link libhiprtc for codegen targets
* resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards
---------
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
* Update grouped_gemm example and pipeline
* find the root cause error in did not enable the transpose in gfx950 correctly
* Fix v3 pipeline, row and col major
* Disable f8 datatype tests, it fails on gfx950
* fix the abd test by clear the runtime argument unsupported
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
* disable cast_tile_pk_fp16_fp32 on gfx950
* fix wrong encoding when hdim is not exponentiation of 2
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>