1998 Commits

Author SHA1 Message Date
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
Andriy Roshchenko
57e0f5df29 MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199)
* Unify test interface for different layouts.

* WIP: Introducing FP4/FP6/FP8 abstractions

* WIP: Introducing packed storage abstraction

* WIP: Introducing packed storage abstraction

* WIP: Improved support for FP6 data type

* Refactor packed storage for f6_t

* WIP: FP6 MFMA test

* Test if we correctly represent all FP6/FP4 numbers

* Additional output for failed FP4 test.

* More failing conversion tests

* Even more failing conversion tests

* Working FP6 MFMA tests

* Expand MX MFMA testing to BF8/6

* Update and verify MX MFMA test for packed types

* Fix fp4 and fp6 conversions on host

* Working MX MFMA tests for FP8/6/4

* Cleanup

* Add missing type

* Cleanup

* Final cleanup

* Restrict FP6/4 values output to CK_LOGGING=1

* Use CHAR_BIT instead of number 8

* Fix typo

* Remove FP6 and FP4 from the list of native types

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
2025-05-19 16:52:51 -05:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00
Yanxing-Shi
5b83f76eb0 fix codegen bug 2025-05-19 14:03:16 +00:00
Yanxing-Shi
b3caa67694 fix csv bug 2025-05-19 11:44:26 +00:00
Yanxing-Shi
9897410acf refactor profiler 2025-05-19 10:42:57 +00:00
Bartłomiej Kocot
6342f6b5e8 Restore oddc instances (#2201) 2025-05-16 18:42:02 -07:00
arai713
5b3430b868 Narrowing error fix for codegen compilation (#2194)
* removed comment with special characters

* fix for arg/template change after merge from develop

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-05-16 11:11:54 -07:00
Illia Silin
40668c9a99 Build and store CK library deb package for all targets daily. (#2196)
* generate and store library package for all targets

* use ninja to build packages for all targets

* make sure to use ftime-trace when using ninja

* make sure build trace only runs on gfx9

* archive lib package and stash only library package
2025-05-16 07:40:53 -07:00
Yanxing-Shi
c821b1253a format python 2025-05-16 10:41:30 +00:00
Yanxing-Shi
adeb051095 Merge remote-tracking branch 'upstream/develop' into support_engine_benchmark 2025-05-16 10:38:59 +00:00
Yanxing-Shi
012c77125a recover benchmark_gemm and fix 2025-05-16 10:37:59 +00:00
Mateusz Ozga
fa3c6811d8 Disable conv for Filter1x1Stride1Pad0 when K or C is even (#2186) 2025-05-16 10:18:47 +02:00
Po Yen Chen
791802b381 [CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198)
* Write soft-sign in inline asm

* Change tile idx computation

* Add macro to turn off soft-sign asm opt

* Use simple for loop to avoid register spill

* Only do block id transform for masking cases
2025-05-16 15:14:46 +08:00
Yanxing-Shi
fb63cd5923 merge 2025-05-16 06:55:32 +00:00
Po Yen Chen
8cb0474b3d Use only qr_async pipeline for batch_prefill (#2195) 2025-05-15 11:47:29 -07:00
Khushbu Agarwal
3d8d6e75e4 Adding validation for tile sizes in Tile Engine (#2189)
* Adding validation for tile sizes

* Add architecture in config, and shuffle lines of code in warp_gemm.hpp

* Enable MFMA for gfx950, and invalid tile handling
2025-05-15 10:28:31 -07:00
Casey-Shi
fde83689d5 Merge branch 'develop' into support_engine_benchmark 2025-05-15 22:21:58 +08:00
Bartłomiej Kocot
7c0e29cc0f Extend 64x64 with 4 waves instances for grouped conv bwd wei (#2187)
* Extend 64x64 with 4 waves instances for grouped conv bwd wei

* Fix

* fix

* fix
2025-05-15 16:21:34 +02:00
Yanxing-Shi
68a4aff0b1 fix config 2025-05-15 14:20:17 +00:00
Yanxing-Shi
d6f31b680a modify changelog 2025-05-15 14:12:45 +00:00
Yanxing-Shi
d4107f55cf remove pydantic module 2025-05-15 13:54:26 +00:00
Yanxing-Shi
fc092038f7 fix README 2025-05-15 12:37:00 +00:00
Yanxing-Shi
ccf18b90e6 add asm cache control 2025-05-15 12:20:28 +00:00
Yanxing-Shi
457315dd8a Merge remote-tracking branch 'upstream/support_engine_benchmark' into support_engine_benchmark 2025-05-15 11:18:11 +00:00
Yanxing-Shi
047f6e4480 python format 2025-05-15 11:16:13 +00:00
Yanxing-Shi
62d2a63f43 add benchmark for cold and warm up 2025-05-15 11:11:18 +00:00
Yanxing-Shi
cfbbae9bd6 fix 2025-05-15 06:15:38 +00:00
illsilin
3d58544b7d add pydantic module to the docker image 2025-05-14 09:59:55 -07:00
Yanxing-Shi
53c4429f37 format 2025-05-14 15:46:50 +00:00
Yanxing-Shi
2843f8e59d Merge remote-tracking branch 'upstream/develop' into support_engine_benchmark 2025-05-14 15:40:26 +00:00
Yanxing-Shi
16654510da fix 2025-05-14 15:12:27 +00:00
Yanxing-Shi
7fa1d4daea add changelog 2025-05-14 13:59:25 +00:00
Yanxing-Shi
4bbe7eca09 add cmake option & modify 2025-05-14 09:17:37 +00:00
BingYuan.Zhou
41c17d0a95 fix moe sorting build fail (#2190)
* fix moe sorting build fail

* refile code

---------

Co-authored-by: solin <bingzhou@amd.com>
2025-05-14 09:31:26 +08:00
Illia Silin
58f9e9ffbc Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format
2025-05-13 10:18:14 -07:00
Yanxing-Shi
58ab4eb617 remove comment 2025-05-13 16:22:22 +00:00
Yanxing-Shi
6086e3641d fix config 2025-05-13 15:56:37 +00:00
Yanxing-Shi
b4053e1ed3 remove config property 2025-05-13 15:35:42 +00:00
Yanxing-Shi
e5a7abd11b move struct 2025-05-13 15:25:08 +00:00
Yanxing-Shi
3140659357 fix 2025-05-13 14:18:16 +00:00
Yanxing-Shi
f4da2e3836 fix 2025-05-13 13:20:09 +00:00
Yanxing-Shi
0c3dc06e8c test success 2025-05-13 13:14:44 +00:00
Yanxing-Shi
6c82b60de6 range config 2025-05-13 09:20:55 +00:00
Bartłomiej Kocot
c53b7bd22e Switch to v2 pipeline for grouped conv bwd data (#2181)
* Change to old pipeline for grouped conv bwd data

* fix

* fix

* fix

* fix

* fix

* fix

* Fix
2025-05-13 10:14:30 +02:00
Yanxing-Shi
a8a19be1b0 merge 2025-05-13 07:39:51 +00:00
Yanxing-Shi
2d3dc763f8 merge 2025-05-13 06:27:16 +00:00
Yanxing-Shi
54d3d9468d fix bug 2025-05-13 05:57:41 +00:00
Po Yen Chen
2920604786 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turning logits_soft_cap on/off in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to silence the compilation error

* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switching logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
Khushbu Agarwal
f05e45ba59 Disable SMFMA gfx90a (#2184)
* sparsity fix for gfx90a

* reverting tile_engine changes
2025-05-12 09:56:23 -07:00