ThomasNing
ea5f9718ec
workable async copy under gfx950
2025-09-07 18:21:43 -05:00
ThomasNing
285dd07e32
add the copy kernel support
2025-06-30 04:50:31 -05:00
ThomasNing
152b6c0940
make gemm run
2025-06-30 04:18:35 -05:00
joye
bc4c7d26af
avoid compiling device functions when compile host
2025-05-21 16:52:58 +08:00
joye
8e3e59cb1f
fix a commented format issue
2025-05-21 15:28:16 +08:00
joye
b17a665e08
delete the async test
2025-05-21 15:24:20 +08:00
joye
2e296ee963
improve codes
2025-05-21 15:22:33 +08:00
joye
7377bc7200
fix some sync issues
2025-05-21 15:05:52 +08:00
joye
619508f89d
change lds layout
2025-05-21 14:35:05 +08:00
joye
ba86551534
fix a typo error
2025-05-21 14:29:49 +08:00
joye
37a4a9aae4
fix a CRTP issue
2025-05-21 14:13:09 +08:00
joye
88601c8a05
update async policy
2025-05-21 13:31:13 +08:00
joye
ad32373ff1
fix clang format
2025-05-21 11:56:35 +08:00
joyeamd
8470702ac0
change async pipeline's tile distribution pattern from thread to warp
2025-05-21 11:53:13 +08:00
joye
5079e3f3a2
update lds descriptor
2025-05-20 17:14:29 +08:00
joye
fee156a37d
fix a descriptor issue
2025-05-20 16:58:18 +08:00
joye
33c643001e
fix some compiling errors
2025-05-20 15:52:33 +08:00
joye
55f3632901
fix a compiling error
2025-05-20 15:16:53 +08:00
joye
c71b3840e8
fix lds descriptor
2025-05-20 15:00:32 +08:00
joye
3500259bdc
update async load apis
2025-05-20 13:26:49 +08:00
joye
4e7f427b88
fix a error
2025-05-19 17:50:51 +08:00
joye
91113f6464
comment some not gfx950 codes
2025-05-19 10:13:49 +08:00
joye
4a544c4084
comment some not gfx950 codes
2025-05-19 10:11:49 +08:00
joye
5c936dc8f3
add ignore header file
2025-05-19 09:34:26 +08:00
joye
0f2dfa8d38
fix async inline assembly
2025-05-19 09:33:17 +08:00
joye
2a9f0fff5d
fix async inline assembly
2025-05-19 09:30:53 +08:00
joye
bd755d74ce
fix
2025-05-19 08:57:41 +08:00
joye
f0e1dbca49
fix some issues in async load
2025-05-19 08:56:03 +08:00
joye
41f5ccb0e6
add async load tests
2025-05-19 08:43:14 +08:00
joye
7a03addc12
fix some compiling errors
2025-05-19 08:28:29 +08:00
joye
9c429ad0cc
add a new pipeline for async load
2025-05-16 09:25:16 +08:00
joye
acdef41575
add a pipeline which copies from v4
2025-05-15 18:00:30 +08:00
joye
0c15891dea
fix some compiling errors
2025-05-15 14:54:13 +08:00
joye
8c09b31f72
fix a compiling error
2025-05-15 14:39:21 +08:00
joye
2fdcb55cf1
fix some compiling errors
2025-05-15 14:36:02 +08:00
joye
726ae62113
add async load api
2025-05-15 13:36:46 +08:00
joye
4dd9e8c0c8
add for async load builtin
2025-05-15 13:29:39 +08:00
BingYuan.Zhou
41c17d0a95
fix moe sorting build fail (#2190)
* fix moe sorting build fail
* refile code
---------
Co-authored-by: solin <bingzhou@amd.com>
2025-05-14 09:31:26 +08:00
Illia Silin
58f9e9ffbc
Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names
* fix clang format
2025-05-13 10:18:14 -07:00
Bartłomiej Kocot
c53b7bd22e
Switch to v2 pipeline for grouped conv bwd data (#2181)
* Change to old pipeline for grouped conv bwd data
* fix
* fix
* fix
* fix
* fix
* fix
* Fix
2025-05-13 10:14:30 +02:00
Po Yen Chen
2920604786
[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits
* fix bug
* Re-format files
* Allow specifying logits_soft_cap through APIs
* Support turn on/off logits_soft_cap in async pipeline
* Do not generate non-verified kernels
* Align receipt used in Aiter
* Sync logits soft-capping across pipelines
* Re-enable some hdim pipelines
* fix perf
* Add attention variant for logits_soft_cap
* Add newline at end-of-file
* Fix performance
* Add comment to explain logits_soft_cap pre-processing
* Unify code
* Unify floating-point literal style
* Use class data member to silence the compilation error
* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133)
* Send 'mask' along with variant params to the LogitsMask()
* Send block indices to the variant
* Add indices parameters in variant interface
* Fix fmha bwd codegen error
* Allow switch logits_soft_cap impl
* Eliminate register spills
* Fix compilation errors
* Fix wrong LSE
* Fix LSE for splitkv kernel
* Sync splitkv pipeline changes
* Add batch_prefill kernel/pipeline
* Fix codegen error
* Undo changes in CMakeLists.txt
* Merge pipeline filtering check
* Use different code path if kHasLogitsSoftCap=false
* Remove [[maybe_unused]] attribute
* Use pre-existing compile-time flag to instantiate templates
* Sync pipeline changes
* Update CHANGELOG.md
---------
Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
Khushbu Agarwal
f05e45ba59
Disable SMFMA gfx90a (#2184)
* sparsity fix for gfx90a
* reverting tile_engine changes
2025-05-12 09:56:23 -07:00
Thomas Ning
b49f7de81f
Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the work compiled
* Solved the example code, but still have the profiler error
* Finished the feature
* Clang format and update the CHANGELOG
* solve the preshuffle v1 & v2 problem
* Comment Addressed
* Comment Addressed
2025-05-12 09:52:58 -07:00
Thomas Ning
9d1e44e56a
Vectorized Transpose for Batched Transpose CK Tile Operator (#2131)
* Shared Memory for single data point
* CKTile Transpose vectorize CP1
* CKTile Transpose vectorize CP2
* CKTile Transpose vectorize CP2.1
* fixed the compile error of the transpose tile 2d
* Have the correct result for the current test sample
* Changes to printing tensor
* fp8 support added
* Debugging for transpose
* solving the corner issue
* Changed padding flag
* Intermediate Debugging
* Intermediate Debugging
* Intermediate Debugging
* Finished debugging of the transpose op
* Code Cleanup
* Adding edge case smoke tests
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Addressing Review Comment
* Addressing Comments
* Addressing Comments
* Measuring Perf Tests
* Code Cleanup
* Changelog
* Added the running iterations
* clang format
* Fix the changelog
* Fix the compilation error
* change the printing factor
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
2025-05-12 00:41:45 -07:00
Khushbu Agarwal
d8faf1c6a1
Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 (#2172)
* Changes for updating tile distribution for shuffle and transpose
* Fixed swizzle and transpose, removed comments
* clang formatted
* Adding support for bf16 type
* Addressing review comments
2025-05-10 22:40:05 -07:00
Bartłomiej Kocot
6fddb5708c
Add grouped conv fwd bias relu instances (#2179)
* Add grouped conv fwd bias relu instances
* fixes
* fix
2025-05-09 22:52:34 +02:00
jefyang1
6b1a339b6f
Fix grouped conv bwd data tests on gfx950 (#2173)
2025-05-09 09:01:06 -07:00
Mingtao Gu
a23390163d
fix moe gemm2 for gfx950 (#2164)
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-05-09 08:25:31 -07:00
Khushbu Agarwal
ef72a4b9bc
Disable SMFMA for gfx90a (#2182)
2025-05-09 00:18:07 -07:00
Illia Silin
3448e12609
Generate ckProfiler package for gfx942 only. (#2180)
* build CI for gfx942 exclusively
* run the last stage in a docker with user jenkins
* update the image for the last stage
* ignore perf_log if not found
* archive and store all packages
* use ccache for building packages
2025-05-08 13:29:14 -07:00