aska-0096
78d0fd4e65
add vmcnt guard for async copy
2025-05-28 03:47:46 +00:00
aska-0096
65255e12fb
Unconditional Ascale padding
2025-05-28 01:55:23 +00:00
aska-0096
63c9388881
Pad the M for scale buffer unconditionally
2025-05-27 11:52:12 +00:00
aska-0096
9da2995163
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 10:23:21 +00:00
aska-0096
04f7265c19
refactor the pipeline
2025-05-27 10:14:45 +00:00
aska-0096
71e7346bf4
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 07:32:16 +00:00
aska-0096
137e28d151
temp save, 4.4~4.5
2025-05-27 07:31:16 +00:00
Ding, Yi
85ac576109
Merge gemm_mx_common.hpp
2025-05-27 06:13:03 +00:00
Ding, Yi
123053b685
Merge remote-tracking branch 'origin/wip-f4-wp' into wip-f4
2025-05-27 03:36:38 +00:00
aska-0096
61748eddba
Add NT flag to B/BScale buffer
2025-05-27 02:26:43 +00:00
Ding, Yi
91eb136937
Fix v1; use M padding
2025-05-26 10:32:26 +00:00
aska-0096
d1d56e89ef
fix the correctness issue
2025-05-26 09:29:36 +00:00
aska-0096
4a3205f94a
Merge branch 'wip-f4-wp' of https://github.com/ROCm/composable_kernel into wip-f4-wp
2025-05-26 02:22:09 +00:00
Lin, Qun
d5e7580473
correct a typo in tail
2025-05-25 19:22:47 -05:00
Andriy Roshchenko
f03da29b65
Merge branch origin/wip-f4 into andriy/wip-f4
2025-05-23 22:14:30 +00:00
aska-0096
574d65efed
temp save
2025-05-23 14:51:24 +00:00
joye
8afac88f89
fix f4 pipeline issues
2025-05-23 17:13:10 +08:00
aska-0096
a4dae9eb86
optimize offset math in dma
2025-05-22 08:15:31 +00:00
aska-0096
7f7c4d35c7
lds conflict free + buffer load lds
2025-05-22 08:04:52 +00:00
Andriy Roshchenko
e302ab8f0c
Merge branch origin/develop into wip-fp4
2025-05-22 06:31:47 +00:00
Ding, Yi
352542c49e
Better kernel selection in device classes
2025-05-22 06:05:10 +00:00
Lin, Qun
6f8e643629
fix 2 typos in fp4_preshuffle
2025-05-21 23:21:00 -05:00
Thomas Ning
1386924749
Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake
* clang format
2025-05-20 15:05:08 -07:00
aska-0096
e1084fe7d6
tempsave. compile pass, function wrong
2025-05-20 10:57:26 +00:00
jefyang1
f18170064d
Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8
* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format
* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* Fix intrin_mfma f8 calls due to merge mistake
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
aska-0096
f3a296bad4
lds conflict free + buffer load lds
2025-05-19 09:40:39 +00:00
Ding, Yi
f0535522e2
Fix blockwise gemm mx v1
2025-05-19 07:22:31 +00:00
aska-0096
e2c8f98fef
generalize the pipeline scheduling.
2025-05-19 02:29:02 +00:00
aska-0096
3e8b07ef58
tempsave; modify the way we represent fp4
2025-05-19 02:28:23 +00:00
arai713
5b3430b868
Narrowing error fix for codegen compilation (#2194)
* removed comment with special characters
* fix for arg/template change after merge from develop
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-05-16 11:11:54 -07:00
aska-0096
248e287866
generalize the pipeline scheduling.
2025-05-16 10:41:59 +00:00
aska-0096
a0379d81e7
modify the way we represent fp4
2025-05-16 09:44:04 +00:00
Mateusz Ozga
fa3c6811d8
Disable conv for Filter1x1Stride1Pad0 when K or C is even (#2186)
2025-05-16 10:18:47 +02:00
aska-0096
a1bec7670a
tempsave
2025-05-16 08:14:56 +00:00
Ding, Yi
c04d44b5f6
Merge remote-tracking branch 'origin/develop' into wip-f4
2025-05-16 07:11:26 +00:00
Ding, Yi
9009d75c7a
Pack e8m0 as int32_t
2025-05-15 09:12:17 +00:00
aska-0096
062e16d54a
Improve the pipeline
2025-05-15 09:08:36 +00:00
Ding, Yi
e7130d483c
a/b thread_desc stride fix
2025-05-14 05:11:32 +00:00
Bartłomiej Kocot
c53b7bd22e
Switch to v2 pipeline for grouped conv bwd data (#2181)
* Change to old pipeline for grouped conv bwd data
* fix
* fix
* fix
* fix
* fix
* fix
* Fix
2025-05-13 10:14:30 +02:00
Ding, Yi
178e361101
Fix fp8/bf8; remove duplicated code
2025-05-13 07:52:13 +00:00
aska-0096
79246e6cb8
function pass with inline asm hacky
2025-05-12 16:54:44 +00:00
Thomas Ning
b49f7de81f
Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the work compiled
* Solved the example code, but still have the profiler error
* Finished the feature
* Clang format and update the CHANGELOG
* solve the preshuffle v1 & v2 problem
* Comment Addressed
* Comment Addressed
2025-05-12 09:52:58 -07:00
aska-0096
66f93b6e08
tempsave
2025-05-12 07:35:09 +00:00
Ding, Yi
4b19b934e8
fix fp8; fix even/odd
2025-05-12 07:31:28 +00:00
aska-0096
41ea1066ac
implement shuffled scale mxfp4gemm, blocker: opsel not effect
2025-05-11 05:54:13 +00:00
aska-0096
6c761bf9b8
tempsave; buggy at passed 4 e8m0 to scaled mfma
2025-05-10 09:57:49 +00:00
Bartłomiej Kocot
6fddb5708c
Add grouped conv fwd bias relu instances (#2179)
* Add grouped conv fwd bias relu instances
* fixes
* fix
2025-05-09 22:52:34 +02:00
aska-0096
087b20dc1d
clang format
2025-05-09 16:15:10 +00:00
aska-0096
0987b0af44
remove unnecessary hacky
2025-05-09 16:07:22 +00:00
jefyang1
6b1a339b6f
Fix grouped conv bwd data tests on gfx950 (#2173)
2025-05-09 09:01:06 -07:00