aska-0096
d563dac424
fix performance bug of bpreshuffle f8 gemm
2025-05-29 10:02:46 +00:00
aska-0096
0db8d71dc1
Remove debug infos; Enable flags for blockscale f8
2025-05-29 08:21:54 +00:00
Ding, Yi
35b436c0d9
Clang-format after 2 merges
2025-05-28 11:16:00 +00:00
Ding, Yi
aecac410d0
Merge remote-tracking branch 'origin/f8blk_scale_opt' into wip-f4-mergemoe-2
2025-05-28 11:15:22 +00:00
Ding, Yi
ad7fd89c1d
Merge remote-tracking branch 'origin/feiw/mxfp4_moe_2Stages' into wip-f4
2025-05-28 09:28:26 +00:00
Ding, Yi
857ef9f8c4
Merge preshuffle device
2025-05-28 07:02:28 +00:00
Ding, Yi
e2e0e0025e
Profiler add f4 wp
2025-05-28 05:12:39 +00:00
aska-0096
78d0fd4e65
add vmcnt guard for async copy
2025-05-28 03:47:46 +00:00
aska-0096
65255e12fb
Unconditional Ascale padding
2025-05-28 01:55:23 +00:00
mtgu0705
52b764d59f
update MX moe GEMM1 hotloopscheduling
2025-05-27 20:43:22 -05:00
aska-0096
63c9388881
Pad the M for scale buffer unconditionaly
2025-05-27 11:52:12 +00:00
aska-0096
9da2995163
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 10:23:21 +00:00
aska-0096
04f7265c19
refactor the pipeline
2025-05-27 10:14:45 +00:00
aska-0096
71e7346bf4
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 07:32:16 +00:00
aska-0096
137e28d151
temp save, 4.4~4.5
2025-05-27 07:31:16 +00:00
Ding, Yi
85ac576109
Merge gemm_mx_common.hpp
2025-05-27 06:13:03 +00:00
Ding, Yi
123053b685
Merge remote-tracking branch 'origin/wip-f4-wp' into wip-f4
2025-05-27 03:36:38 +00:00
aska-0096
61748eddba
Add NT flag to B/BScale buffer
2025-05-27 02:26:43 +00:00
Ding, Yi
91eb136937
Fix v1; use M padding
2025-05-26 10:32:26 +00:00
aska-0096
d1d56e89ef
fix the correctness issue
2025-05-26 09:29:36 +00:00
aska-0096
4a3205f94a
Merge branch 'wip-f4-wp' of https://github.com/ROCm/composable_kernel into wip-f4-wp
2025-05-26 02:22:09 +00:00
Lin, Qun
d5e7580473
correct a typo in tail
2025-05-25 19:22:47 -05:00
mtgu0705
a36a747e29
rename the block pipeline
2025-05-24 00:03:43 -05:00
Andriy Roshchenko
f03da29b65
Merge branch origin/wip-f4 into andriy/wip-f4
2025-05-23 22:14:30 +00:00
aska-0096
574d65efed
temp save
2025-05-23 14:51:24 +00:00
feifei14119
ce4e7b39da
gemm1 func pass
2025-05-23 09:26:38 +00:00
joye
8afac88f89
fix f4 pipeline issues
2025-05-23 17:13:10 +08:00
Andriy Roshchenko
715ad01bf2
Fix MX MFMA tests
2025-05-22 21:51:36 +00:00
aska-0096
a4dae9eb86
optimize offset math in dma
2025-05-22 08:15:31 +00:00
aska-0096
7f7c4d35c7
lds conflict free + buffer load lds
2025-05-22 08:04:52 +00:00
Andriy Roshchenko
e302ab8f0c
Merge branch origin/develop into wip-fp4
2025-05-22 06:31:47 +00:00
Ding, Yi
352542c49e
Better kernel selection in device classes
2025-05-22 06:05:10 +00:00
Lin, Qun
6f8e643629
fix 2 typos in fp4_preshuffle
2025-05-21 23:21:00 -05:00
Aviral Goel
fa39c4e798
Add Doxygen Documentation for HostTesnor, HostTensorDescriptor, DeviceMem, FillUniformDistribution ( #2160 )
...
* added documentation for HostTensorDescriptor
* added documentation for DeviceMem and FillUniformDistribution
* fixed merging error
* fixed host_tensor_descriptor error
* clang format
2025-05-21 10:34:30 -07:00
Thomas Ning
1386924749
Add the instances for small sized GEMM in preshuffle and improve CMake Flag ( #2212 )
...
* Add small instance, add the bug fix, & improve the example CMake
* clang format
2025-05-20 15:05:08 -07:00
mtgu0705
582dc7f842
enable fp4 moe no weigth preshuffle, function pass
2025-05-20 08:18:22 -05:00
aska-0096
e1084fe7d6
tempsave. compile pass, function wrong
2025-05-20 10:57:26 +00:00
mtgu0705
589e1dfea9
init mx fp4 B no preshuffle version
2025-05-20 04:40:22 -05:00
mtgu0705
680de28f77
commit with debug info
2025-05-19 21:47:35 -05:00
jefyang1
f18170064d
Use new mfma instructions for FP8 on gfx950 ( #2202 )
...
* Add logic to use new mfma instructions for fp8 bf8
* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format
* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com >
* Fix intrin_mfma f8 calls due to merge mistake
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com >
2025-05-19 17:29:51 -07:00
Andriy Roshchenko
57e0f5df29
MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types ( #2199 )
...
* Unify test interface for different layouts.
* WIP: Introducing FP4/FP6/FP8 abstractions
* WIP: Introducing packed storage abstraction
* WIP: Introducing packed storage abstraction
* WIP: Improved support for FP6 data type
* Refactor packed storage for f6_t
* WIP: FP6 MFMA test
* Test if we correctly represent all FP6/FP4 numbers
* Additional output for failed FP4 test.
* More failing conversion tests
* Even more failing conversion tests
* Working FP6 MFMA tests
* Expand MX MFMA testing to BF8/6
* Update and verify MX MFMA test for packed types
* Fix fp4 and fp6 conversions on host
* Working MX MFMA tests for FP8/6/4
* Cleanup
* Add missing type
* Cleanup
* Final cleanup
* Restrict FP6/4 values output to CK_LOGGING=1
* Use CHAR_BIT instead of number 8
* Fix typo
* Remove FP6 and FP4 from the list of native types
---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com >
2025-05-19 16:52:51 -05:00
mtgu0705
685a01877f
fix a, b scale loading bugs, a, b scale loading now correctly
2025-05-19 06:18:57 -05:00
aska-0096
f3a296bad4
lds conflict free + buffer load lds
2025-05-19 09:40:39 +00:00
Ding, Yi
f0535522e2
Fix blockwise gemm mx v1
2025-05-19 07:22:31 +00:00
mtgu0705
2e6fafaf75
updated code, build passed.
2025-05-18 22:29:32 -05:00
aska-0096
e2c8f98fef
generalize the pipeline scheduling.
2025-05-19 02:29:02 +00:00
aska-0096
3e8b07ef58
tempsave; modify the way we represent fp4
2025-05-19 02:28:23 +00:00
mtgu0705
a4b5a374b9
Merge remote-tracking branch 'origin/wip-f4-pk' into mx_moe_f4_scale_shuffle
2025-05-17 09:49:24 -05:00
mtgu0705
eeeba8901f
update code
2025-05-17 09:28:26 -05:00
mtgu0705
94fb9190be
init moe mx f4 scale shuffle
2025-05-16 14:46:09 -05:00