* check in pipeline and policy
for async load in mi350, need to make sure TileAccessPattern is warp_raked or block_raked
solve merge conflicts
* fix cmakelists
* make it build
* fix? buffer async fence
* relax fences; it appears it only is needed between pairs of ping-pongs
* remove fences
* remove fences
* cleanup and reformat
* add steps annotations
* comment all pipeline steps / remove unexplainable syncs
* clang-format
* add comment
* cleanup kernel types for test
* fix comment
* fix hardcoded warp size
* faithfully copy block gemm from compute v4 policy to async policy
* make async test gfx950 only
* fix cmake logic
* set separate compile options for async
* refine comment in policy
* try update hotloop scheduler
* cleanup comments
* test more K block sizes
* unhardcode Ks, sort of
* add large odd test case
* fix build for quant
* add comment to hot loop scheduler and rename enum
* reformat
* reword the pipeline description
* reformat
* address review / add static asserts / typo fix
* update changelog
* Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe
* also do the same for amd_buffer_addressing_builtins.hpp
* merge with develop
* fix clang format
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Add F32 MFMA warp gemms
* Support f32 in fwd FMHA
* Implement transpose_vectors for 4-byte types (float)
* Fix unexpected implicit f32->uint32 cast in buffer_store<4>
__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
* Support F32 in bwd FMHA
hdim = 256 is disabled for now because it uses too much memory on gfx90a
* Support Headdim = 48 (divisible by 16) in fwd
* Add fp32-specific receipts (800 and 801)
* Tune fwd tiles
* Tune bwd tiles
* Use small tiles only for small seqlen_q
* Fix after rebasing
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Remove constraints and adjust filtering for fp32
Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.
* Add fp32 tests
* Make splitkv and appendkv compile for fp32 only
There are no instances yet, but API still must compile when only fp32 is
requested.
* Remove unimportant f32 instances
* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS
* Replace magic numbers with a constant, improve comments for dropout
* Update changelog
* Fix condition that dq_acc must be set to zero when mask is used
The change was introduced in #2799
* Replace warp_uniform with recently added amd_wave_read_first_lane
* Add hdim = 96 and 192 to fwd
* Have a workable version for SGPR
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* substitute with the new sgpr read api
* update the CHANGELOG
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* change to static for logic
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* Add more printing to core cktile
* Revert other changes in static encoding pattern
* Refactor to using a free print() function
* Remove loops and print just the containers
* Print tuple with better formatting, fix sequence compilation
* Add some tests for print utility
* Add print utility header
* Print for static_encoding_pattern
* add buffer_view printing
* Align vector_traits
* Fix formatting
* Lower-case enum strings
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Remove empty comment lines
* Fix test with lower-case too
* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y
* Add test_print_common.hpp
* add print.hpp in core.hpp
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Expand the bandwidth of direct_global_to_lds for gfx950
* clang-format
* fix the remod.py and script for clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* change bit width check
* refactor macros into constexpr functions
which still get inlined
* wrap s_waitcnt api
* parameterize test
* cleanup
* cleanup fp8 stub
* add fp8 test cases; todo which input parameters are valid?
* replace n for fp8 in test cases
* add large shapes; fp8 fails again
* change input init
* test sync/async
* time the test
* clang-format test
* use float instead of bfloat to cover a 4-byte type
* fix logic - arg sections should be 'or'd
* make block_sync_lds_direct_load interface similar to old ck
* fix a few comment typos
* name common shapes
* revert the example to original logic of not waiting lds
* clang-format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Use read_tr in universal gemm
* Enable all instances back
* Revert example37 changes
* Resolve comments
* resolve comments 2
* Fix assertion msg
* fix the gemm basic
* change index_t to bool for preshuffle variable
* Solve the comment
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
* add for async load builtin
* add async load api
* fix some compiling errors
* fix a compiling error
* fix some compiling errors
* add a pipeline which copies from v4
* add a new pipeline for async load
* fix some compiling errors
* add async load tests
* fix some issues in async load
* fix
* fix async inline assembly
* fix async inline assembly
* add ignore header file
* comment some not gfx950 codes
* comment some not gfx950 codes
* fix a error
* update async load apis
* fix lds descriptor
* fix a compiling error
* fix some compiling errors
* fix a descriptor issue
* update lds descriptor
* change async pipeline's tile distribution pattern from thread to warp
* fix clang format
* update async policy
* fix a CRTP issue
* fix a typo error
* change lds layout
* fix some sync issues
* improve codes
* delete the async test
* fix a commented format issue
* avoid compiling device functions when compile host
* make gemm run
* add the copy kernel support
* finish the feature
* Address comment
* add the support for buffer_builtin
* solved the merging problem
* Comment Addressed
---------
Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>
* add transpose load; no real logic
* fix some compile errors
* fix some issues
* update transpose load logic
* add some fixes
* fix a distribution issue
* update some codes
* add some fix
* can pass; but no logic
* transpose load enable
* update tile transpose
* miss output tile distribution mapping
* hack for transpose 16x16
* update output tensor distribution
* delete unused variables
* fix transpose related codes
* update transpose load example
* exchange the iteration order
* fix 16x16 related dimension transpose
* fix a transpose index issue
* fix a transpose index issue
* fix clang format check
* update load tile transpose related codes
* fix compile errors and pass 16x16 tests
* fix a typo
* update logic
* check other data types
* add transpose load api
* update transpose load api
* fix clang format check
* change file name
* refactor codes
* update code name
* delete some unused codes
* delete the unused oob flag for transpose load
* update tensor view api for transpose load
* update for testing
* fix a typo error
* move transpose ops to example directory
* update transpose api
* update include file
* fix for pr review
* fix compile errors
* add transpose load; no real logic
* fix some compile errors
* fix some issues
* update transpose load logic
* add some fixes
* fix a distribution issue
* update some codes
* add some fix
* can pass; but no logic
* transpose load enable
* update tile transpose
* miss output tile distribution mapping
* hack for transpose 16x16
* update output tensor distribution
* delete unused variables
* fix transpose related codes
* update transpose load example
* exchange the iteration order
* fix 16x16 related dimension transpose
* fix a transpose index issue
* fix a transpose index issue
* fix clang format check
* update load tile transpose related codes
* fix compile errors and pass 16x16 tests
* fix a typo
* update logic
* check other data types
* add transpose load api
* update transpose load api
* fix clang format check
* change file name
* refactor codes
* update code name
* delete some unused codes
* delete the unused oob flag for transpose load
* update tensor view api for transpose load
* update for testing
* fix a typo error
* move transpose ops to example directory
* update transpose api
* update include file
* fix for pr review
* fix compile errors
* change directory name
* delete the duplicated directory
* update cmakelists file
* delete the unused codes
* update function names
* update transpose policy
* update code after remod.py
* update codes
* add some comment
* Polish the instr infrastructure
* build up the fixed instr
* redesign the transpose api, currently it has numerical error
* add the bf16 transpose
* fix some issues
* add some comments
* update document
* Finished the refactor of API and pass through the verification
* fix the merging issue
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround
* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo
---------
Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>
* replace buffer load/store intrinsics with builtins
* fix clang format
* replace buffer load/store intrinsics with built-ins in ck_tile
* fix clang format
* add switch between buffer intrinsics and built-ins
* change the builtins threshold to clang20
* fix clang format
* fix some compilation errors
* revert changes in ck_tile
* revert changes in ck_tile
* delete all root files and folders when CI completes
* try changing the username in CI
* fix groovy syntax
* add user and group id info to ci dockers
* change ownership of all files in CI to jenkins at the end
* update changelog
* Support bf16/fb8/bf8 datatypes for ck_tile/gemm
* remove commented out code.
* Addressing code review comments and enabling universal_gemm for all the supported data types.
* Merge conflict resolution.
* Solve the memory pipeline compilation error. Merge with the new change of CShuffle
* finish the feature, pass the tests
* Fix the pipeline and add the benchmark script for other data types
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>