Commit Graph

312 Commits

Author SHA1 Message Date
OscarXu
9fdfff82ea Fix v2 topk_weight cal. Add silu asm. 2025-05-20 13:42:06 +08:00
OscarXu
a4ed178b4c Add gemm2 v3 64x128x128 2025-05-19 19:33:04 +08:00
OscarXu
9c9bf1aece Moe blockscale gemm1&gemm2 asm support for aiter. Suppression cmkae flag til new compler. 2025-05-17 04:33:52 -05:00
OscarXu
19aa39b978 Merge remote-tracking branch 'origin/wjx/moe_v3_aiter' into moe_bs_stage1_dev 2025-05-13 10:46:40 +08:00
lalala-sh
d9ab2fe4f5 update moe 2025-05-12 05:08:58 +00:00
OscarXu
3e5005a47b fix compile error. Gemm2 pass. Gemm1 WIP 2025-05-12 11:21:14 +08:00
OscarXu
7729fdb349 temp save for gemm1. Function not ready 2025-05-12 09:32:58 +08:00
Bartłomiej Kocot
6fddb5708c Add grouped conv fwd bias relu instances (#2179)
* Add grouped conv fwd bias relu instances

* fixes

* fix
2025-05-09 22:52:34 +02:00
jefyang1
6b1a339b6f Fix grouped conv bwd data tests on gfx950 (#2173) 2025-05-09 09:01:06 -07:00
OscarXu
03d5a50b45 Merge remote-tracking branch 'origin/wjx/align_v3_pipeline' into moe_bs_stage1_dev 2025-05-09 20:15:11 +08:00
Andriy Roshchenko
cb27e7c77f Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171)
* Re-enable MX GEMM instances

* Fix compilation error when building MX GEMM for multiple architectures
2025-05-08 13:26:03 -06:00
lalala-sh
6c459e8c38 Merge branch 'develop' into wjx/align_v3_pipeline 2025-05-08 18:08:31 +08:00
lalala-sh
def952a178 use mem_op::set when topk=1 2025-05-08 09:49:16 +00:00
OscarXu
586c2a9a15 Merge remote-tracking branch 'origin/f8blk_scale_opt' into moe_blockscale_dev 2025-05-08 16:39:08 +08:00
aska-0096
68c7fe843d temp save, a performant version. 2025-05-08 07:44:06 +00:00
OscarXu
62877fa074 16x16 function merged to moe 2025-05-08 14:33:31 +08:00
OscarXu
8021128572 merge origin/f8blk_scale_opt 2025-05-07 17:10:03 +08:00
OscarXu
1fae385575 add warmup to asm launch 2025-05-07 16:55:42 +08:00
OscarXu
c989bbe3aa Update v1_128x128x128 to 2x2 instead of 4x1 2025-05-07 14:24:30 +08:00
OscarXu
e7fe8587f6 Add asm for no-loop v3_128x128x128 2025-05-06 10:47:29 +08:00
aska-0096
8c40f3d8fd experimental optimization for scale load in blkscale gemm 2025-05-02 07:12:01 +00:00
OscarXu
efbb85be2a Add .co direct asm support by CK_USE_ASM_MOE_STAGE2_BLOCKSCALE 2025-04-30 22:05:54 +08:00
mtgu0705
fe1648f0ce fix device kernel selection. 2025-04-30 13:33:31 +08:00
OscarXu
7024136ca1 Fix no-mainloop scenario 2025-04-29 16:09:49 +08:00
Bartłomiej Kocot
4094ad158a Integrate universal gemm with conv bwd data and add SplitK (#1315)
* Integrate universal gemm with conv bwd data

* Fix multi d kernel

* Add splitK support

* instances refactor

* instances refactor

* refactor

* fixeS

* fixes

* 16x16 instnaces

* Fixes

* Fix

* Fix

* Fix

* Fix

* Fix

* Fixes

* fix

* fix
2025-04-28 23:54:49 +02:00
Anton Gorenko
edd92fc546 DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096)
* Prepare files for DeviceGemm_Wmma_CShuffleV3

* Implement main part of CShuffleV3 with block pipeline v3 for WMMA

* Remove unused functions and template params for A/B descriptors

* Support both gfx11 and gfx12

* Enable SplitK for gfx12 and disable for gfx11

* Added RowColRow layout for DeviceGemmV2 fp16

* Added more instances for Row, Col, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout

* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout

* Fix formatting

* Add documentation

Based on e5ad48a784

* Enable gemm_universal profiling for gfx11/12

* Add WMMA intrinsics for F8/BF8

* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances

* Add BF16 instances and tests

* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
2025-04-28 10:14:21 +05:00
lalala-sh
39ba03f25d Moe gemm activation (#2026)
* fix useless code and remove usless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* impl int64 but result not correct

* int64 index ok now

* input output all ok

* fix uint32

* revert v1 test

* use uint32

* mork to support 13w tokens

* moe sorting fix moebuf

* fix merge

* update moe api fix aiter build

* fix buid

* fuse silu

* silu ok

* acale ok

* add silu

* change code

* gemm2 ok

* gufusion compatible ok, fix warnings

* gu fusion for m32 m64 ok

* support bf16 cshuffle

* i4 gemm2 ok

* i4 gemm2 ok and i4 gemm1 build

* 16x16 run ok

* change flops; change cshuffle dtype

* fuse gelu silu act in moe gemm1

* fp8 with act ready

* int4 act ready

* remove useless changes

* remove useless code change

* fix clang format

* add the arch limit of int4 moe gemm

* fuse moe activation

* fix fp8 16x16

* fix no quant case

* fix bugs

* fix fp8 gufusion bug

* remove useless comments

* refine activation code & complete moe example

* fix int8 bugs

* merge tkw1

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: root <root@hjbog-srdc-51.amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-04-23 10:35:34 +08:00
mtgu0705
066f209640 added moe blockscale, compiler successful, function not correct. 2025-04-19 16:02:51 +08:00
aska-0096
d883056744 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into f8blockscale_bpreshuffle 2025-04-18 06:35:59 +00:00
lalala-sh
bcf5bb41be enable do top k weights in moe stage1 gemm (#2094)
* add switch for mul topk weights

* fix bf16/f16 bugs

* complete
2025-04-18 10:45:49 +08:00
mtgu0705
ef0b5f72af added moe_blockscale_dev branch to delete the per-token scale code. 2025-04-17 14:16:07 +08:00
Andriy Roshchenko
213b203a3c MX GEMM - Parameterized Test Template (#2088)
* Tests for MX FP8 GEMM

* Improve documentation
2025-04-16 19:56:00 -06:00
Andriy Roshchenko
da54464cce MX GEMM - Add MX BF8 example (#2071)
* Add MX GEMM example for MX BF8

* Verified MX FP8 with 16x16x128 scale builtin

* Verify MX BF8 GEMM with BF16 output
2025-04-16 15:25:02 -06:00
Andriy Roshchenko
7106976a72 MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods
2025-04-15 17:17:07 -06:00
Mingtao Gu
56378f810f CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* fix pk_i4_v3 tests failures in Unbuntu env.

* fix pk_i4_t tests failure on Unbuntu issues.

* some fixed.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-04-14 16:58:57 +08:00
Illia Silin
0d4f145078 Fix build issues for multiple targets. (#2077)
* build for multiple targets on gfx942

* add missing ignore statements
2025-04-11 12:12:53 -07:00
Illia Silin
572cd820ce Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic
2025-04-03 15:30:21 -07:00
aledudek
9329432f6c Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments
2025-04-03 13:35:43 +02:00
Bartłomiej Kocot
2ccf914888 Add support for GKCYX grouped conv weight (#2023)
* Grouped conv bwd weight GKCYX support

* fix and changelog

* fix

* fix

* fixes

* comments

* fix
2025-04-02 23:59:49 +02:00
Adam Osewski
e5ad48a784 Basic docs for universal gemm & ck-tile gemm. (#2014)
* Basic docs for universal gemm & ck-tile gemm.

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Reviewers suggestions.

* Align tparam names in doc with class tparams.

* More reviewers fine tuning ;)

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-04-02 11:03:40 +02:00
Bartłomiej Kocot
8c0ab61ece Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
ec742908bd Grouped conv fwd v3 fix for SplitN an G > 1 (#2038)
* Grouped conv fwd v3 fix for SplitN an G > 1

* Remove int8 large test

* Retore int8 test
2025-04-01 13:19:35 -07:00
Bartłomiej Kocot
54c81a1fcf Add support for GKCYX grouped conv fwd (#2015)
* Add support for GKCYX grouped conv fwd

* fixes

* fix

* changelog

* Fixes
2025-03-26 21:13:38 +01:00
Andriy Roshchenko
6660dc6b8e Introduce MX GEMM for FP8 data type (#2000) 2025-03-24 15:41:07 -06:00
aska-0096
d60d23ea8e v1 performance debugging 2025-03-19 12:57:16 +00:00
aledudek
5095906975 Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remvoe maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remvoe comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
c2e4898b4b Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example
2025-03-17 13:32:00 +01:00
aska-0096
24ca573d74 add ckProfiler. performance debugging for blockscale_wp prefill 2025-03-17 07:19:54 +00:00
Illia Silin
4c97cc511e use old instrinsics with staging compiler (#1970) 2025-03-12 07:29:09 -07:00
Haocong WANG
cbd74c2d12 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipelie v1 and v2. v1 is work for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce teh number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-11 10:11:21 -07:00