Commit Graph

590 Commits

Author SHA1 Message Date
Ding, Yi
d3015785cb Fix 'Merge gemm_mx_common.hpp' 2025-05-27 09:08:02 +00:00
aska-0096
71e7346bf4 Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4 2025-05-27 07:32:16 +00:00
aska-0096
137e28d151 temp save, 4.4~4.5 2025-05-27 07:31:16 +00:00
Ding, Yi
85ac576109 Merge gemm_mx_common.hpp 2025-05-27 06:13:03 +00:00
Ding, Yi
123053b685 Merge remote-tracking branch 'origin/wip-f4-wp' into wip-f4 2025-05-27 03:36:38 +00:00
aska-0096
d1d56e89ef fix the correctness issue 2025-05-26 09:29:36 +00:00
Ding, Yi
40af523e2c Add rotating to mx examples 2025-05-26 05:05:54 +00:00
Andriy Roshchenko
f03da29b65 Merge branch origin/wip-f4 into andriy/wip-f4 2025-05-23 22:14:30 +00:00
Andriy Roshchenko
1c91f6bf1e Fix example_gemm_mx build 2025-05-23 22:00:07 +00:00
aska-0096
574d65efed temp save 2025-05-23 14:51:24 +00:00
aska-0096
a4dae9eb86 optimize offset math in dma 2025-05-22 08:15:31 +00:00
aska-0096
7f7c4d35c7 lds conflict free + buffer load lds 2025-05-22 08:04:52 +00:00
Andriy Roshchenko
e302ab8f0c Merge branch origin/develop into wip-fp4 2025-05-22 06:31:47 +00:00
Lin, Qun
97709c4aa1 correct preShuffleBuffer
we should used packed k to do shuffle.
2025-05-22 01:09:13 -05:00
SamiAario-AMD
380bca2b85 Fix 11_add_rmsnorm2d_rdquant (#2207) 2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
aska-0096
e1084fe7d6 tempsave. compile pass, function wrong 2025-05-20 10:57:26 +00:00
Ding, Yi
0c21ae4ead fix fp4 profiler 2025-05-20 03:06:12 +00:00
Aviral Goel
c4929225f6 remove debug statements from CMakeLists (#2204) 2025-05-19 17:31:04 -07:00
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00
aska-0096
f3a296bad4 lds conflict free + buffer load lds 2025-05-19 09:40:39 +00:00
aska-0096
e2c8f98fef generalize the pipeline scheduling. 2025-05-19 02:29:02 +00:00
aska-0096
3e8b07ef58 tempsave; modify the way we represent fp4 2025-05-19 02:28:23 +00:00
aska-0096
248e287866 generalize the pipeline scheduling. 2025-05-16 10:41:59 +00:00
aska-0096
a0379d81e7 modify the way we represent fp4 2025-05-16 09:44:04 +00:00
aska-0096
a1bec7670a tempsave 2025-05-16 08:14:56 +00:00
Ding, Yi
a663ca3395 Fix example compile error 2025-05-16 07:29:47 +00:00
Ding, Yi
dc30e7d025 Add f4 ckProfiler 2025-05-16 07:19:22 +00:00
Ding, Yi
c04d44b5f6 Merge remote-tracking branch 'origin/develop' into wip-f4 2025-05-16 07:11:26 +00:00
Po Yen Chen
8cb0474b3d Use only qr_async pipeline for batch_prefill (#2195) 2025-05-15 11:47:29 -07:00
Ding, Yi
9009d75c7a Pack e8m0 as int32_t 2025-05-15 09:12:17 +00:00
aska-0096
062e16d54a Improve the pipeline 2025-05-15 09:08:36 +00:00
Ding, Yi
4ba9fe186c Use random scale for init1 2025-05-14 05:42:39 +00:00
Illia Silin
58f9e9ffbc Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format
2025-05-13 10:18:14 -07:00
Ding, Yi
521471c956 Fix fp8/bf8 B-row 2025-05-13 10:13:18 +00:00
Ding, Yi
178e361101 Fix fp8/bf8; remove duplicated code 2025-05-13 07:52:13 +00:00
Po Yen Chen
2920604786 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to slience the compilation error

* [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
aska-0096
79246e6cb8 function pass with inline asm hacky 2025-05-12 16:54:44 +00:00
Thomas Ning
b49f7de81f Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the work compiled

* Solved the example code, but still have the profiler error

* Finished the feature

* Clang format and update the CHANGELOG

* solve the preshuffle v1 & v2 problem

* Comment Addressed

* Comment Addressed
2025-05-12 09:52:58 -07:00
Thomas Ning
9d1e44e56a Vectorized Transpose for Batched Transpose CK Tile Operator (#2131)
* Shared Memory for single data point

* CKTile Transpose vectorize CP1

* CKTile Transpose vectorize CP2

* CKTile Transpose vectorize CP2.1

* fixed the compile error of the transpose tile 2d

* Have the correct result for the current test sample

* Changes to printing tensor

* fp8 support added

* Debugging for transpose

* solving the corner issue

* Changed padding flag

* Intermideate Debugging

* Intermidiate Debugging

* Intermediate Debugging

* Finished debugging of the transpose op

* Code Cleanup

* Adding edge case smoke tests

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Addressing Review Comment

* Addressing Comments

* Addressing Comments

* Measuring Perf Tests

* Code Cleanup

* Changlog

* Added the running iterations

* clang format

* Fix the changelog

* Fix the compilation error

* change the printing factor

---------

Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
2025-05-12 00:41:45 -07:00
Ding, Yi
4b19b934e8 fix fp8; fix even/odd 2025-05-12 07:31:28 +00:00
aska-0096
41ea1066ac implement shuffled scale mxfp4gemm, blocker: opsel not effect 2025-05-11 05:54:13 +00:00
aska-0096
6c761bf9b8 tempsave; buggy at passed 4 e8m0 to scaled mfma 2025-05-10 09:57:49 +00:00
aska-0096
0987b0af44 remove unnecessary hacky 2025-05-09 16:07:22 +00:00
Mingtao Gu
a23390163d fix moe gemm2 for gfx950 (#2164)
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-05-09 08:25:31 -07:00
aska-0096
7bde4b8d34 Add pipeline v3. Have some runtime issue and register spill 2025-05-09 09:47:22 +00:00
aska-0096
bb043a3202 remove some unnecessary hacky; enable 256x256x256 tilesize 2025-05-09 07:54:28 +00:00
aska-0096
b2efb06315 Spilt the fp4 target. Fix the known bugs. 128x128x128 sanity checked; remove prints 2025-05-08 15:07:33 +00:00