Commit Graph

922 Commits

Author SHA1 Message Date
aska-0096
5616551115 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa 2025-07-18 05:17:27 +00:00
aska-0096
ae39c84f55 tempsave 2025-07-18 05:16:39 +00:00
Mateusz Ozga
7fc000d7b3 Fix CI clang-format (#2521) 2025-07-17 14:41:29 +02:00
aska-0096
94b6430489 temp save 2025-07-17 10:06:09 +00:00
aska-0096
7e330553dc Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline 2025-07-17 07:24:32 +00:00
Haocong WANG
28072adc3a fix mfma32x32 dispatch (#2490) 2025-07-17 15:24:12 +08:00
Yi DING
f1d8ad2818 [CK_TILE] Use read_tr in universal gemm (#2436)
* Use read_tr in universal gemm

* Enable all instances back

* Revert example37 changes

* Resolve comments

* resolve comments 2

* Fix assertion msg

* fix the gemm basic

* change index_t to bool for preshuffle variable

* Solve the comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
2025-07-16 23:56:22 -07:00
Po Yen Chen
722c22fb15 Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389)" (#2514)
This reverts commit b2dea90116.
2025-07-17 10:09:01 +08:00
linqunAMD
fbd9f32abe [CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459)
1. Port NCHW support from ConvFwd (#2375) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw

Co-authored-by: azhuang <anzhong.huang@amd.com>
2025-07-17 08:19:57 +08:00
Max Podkorytov
21fd7e9538 Merge branch 'develop' into test_copy_fix 2025-07-16 11:23:57 -07:00
linqunAMD
6e76b82059 Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-16 07:58:23 -07:00
Illia Silin
a4bf78ac0e replace obsolete warpSize system variable with the new one (#2496) 2025-07-16 07:39:15 -07:00
aska-0096
d6df7bf851 fix vmcnt shift 2025-07-16 08:55:50 +00:00
aska-0096
40e039e4e4 Improve s_waitcnt_imm calculation 2025-07-16 08:37:07 +00:00
huaiguxu
c1badfd30c Handle moe_fp8 no-mainloop cases. Supprese no-mainloop check (#2438)
Co-authored-by: felix <felix.li@amd.com>
2025-07-16 15:44:34 +08:00
MHYangAMD
3499fe67ff [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1.  Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>
2025-07-16 14:05:26 +08:00
aska-0096
c30f8b709b fix the s_waitcnt_imm calculation 2025-07-16 05:39:50 +00:00
aska-0096
e5cc4af808 Add block_sync_lds_direct_load utility 2025-07-16 03:54:33 +00:00
carlushuang
cfe211cc60 [CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need use local tokens to compute

* support extra chain local_token

* update

* update

* refine some main

* update

* support dispatch_policy

* fix 15 example
2025-07-15 09:42:18 +08:00
Gino Lu
141bf2d54d [CK_TILE] Add pk_fp4 data type (#2422)
* [draft] Add pk_fp4 and test

* Add hw conversion for fp4

* Refine test code and pk_fp4 constructor.

* fix test indent

* modify according to comment.

* fix clang-format

* modify according comments.

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-14 20:35:06 +08:00
Andriy Roshchenko
518dc21ae8 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8
2025-07-11 13:07:05 -06:00
Khushbu Agarwal
d239b91fd5 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example

* some more renames

* clang formatted
2025-07-11 08:27:55 -07:00
Qianfeng
45904b8fd7 Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* i_nhead_ conversion type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>
2025-07-11 18:14:47 +08:00
Illia Silin
1b66f3f4a3 Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short

* revrt back to atomic add using casting
2025-07-10 07:18:56 -07:00
aska-0096
18669925cc temp save, change all instance to 1wave 2025-07-10 04:29:33 +00:00
shay-li77
d814fefe18 support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis

* format code

* add notes for param y_ratio

* fix comments error

* support template and mdiv for ratio mask

* refactor y-ratio mask constructor

* optimize coordinate calculation

* add SimplifiedRatioAttentionMask
2025-07-09 23:18:55 +08:00
Yi DING
032ca60015 [CK_TILE] Avoid compile kernel in host pass (#2475) 2025-07-09 22:27:54 +08:00
Illia Silin
93420ecf89 Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474)
This reverts commit 112b47e885.
2025-07-08 19:01:26 -07:00
Illia Silin
112b47e885 Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short
2025-07-08 18:09:30 -04:00
aska-0096
18686cfe5b tempsave, fmha_decode 2025-07-08 08:37:20 +00:00
Haocong WANG
5557eadce6 [CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* cmake depends & bwd filter order fix

* revert unexpected reformat

* Avoid change fmha bwd filter order for downstream compatibility

* Revert unexpected changes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
Illia Silin
e033a1b4bf fix compilation errors with clang20 (#2464) 2025-07-07 19:40:30 -07:00
Po Yen Chen
b2dea90116 Eliminate warning caused by failed to meet occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>
2025-07-08 09:17:25 +08:00
Thomas Ning
f240ae3248 Enable Async Copy for MI355 (#2425)
* add for async load builtin

* add async load api

* fix some compiling errors

* fix a compiling error

* fix some compiling errors

* add a pipeline which copies from v4

* add a new pipeline for async load

* fix some compiling errors

* add async load tests

* fix some issues in async load

* fix

* fix async inline assembly

* fix async inline assembly

* add ignore header file

* comment some not gfx950 codes

* comment some not gfx950 codes

* fix a error

* update async load apis

* fix lds descriptor

* fix a compiling error

* fix some compiling errors

* fix a descriptor issue

* update lds descriptor

* change async pipeline's tile distribution pattern from thread to warp

* fix clang format

* update async policy

* fix a CRTP issue

* fix a typo error

* change lds layout

* fix some sync issues

* improve codes

* delete the async test

* fix a commented format issue

* avoid compiling device functions when compile host

* make gemm run

* add the copy kernel support

* finish the feature

* Address comment

* add the support for buffer_builtin

* solved the merging problem

* Comment Addressed

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>
2025-07-07 10:08:49 -07:00
Andriy Roshchenko
054f85ab7c MX GEMM - FP6 Example (#2419)
Adds support for MX FP6 data type in MX GEMM block pipeline version v1.
Provides an example of MX FP6 GEMM algorithm.

---------

Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
2025-07-07 10:33:26 -06:00
ltqin
9f4c5d7372 ck tile pagedkv prefill (#2405)
* add prefetching physical block id for pagedkv

* start add pagedkv prefill

* rename pipeline

* add kernel for pagedkv

* add an init version pagedkv prefill

* fix redefine issue

* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args

* generate dispatch code

* add body generating code

* comipling pass

* remove dropout from pagedkv

* set lse to false in generating code

* start changing qr kernel to pagedkv

* init version of  kernerl with pagedkv

* change names of file that are generated

* chang host validation for pagedkv prefill

* using iglp to change blockgemm

* add kernel files to op head file

* show parameters

* rewrite print parameter fun

* add fwd

* remove default parameter of GridSize

* format

* fix nhead issue and add seqlen_k_ptr to batch mode

* format code

* remove no-longer used code

* format

* fix some comments

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-07 16:16:54 +08:00
carlushuang
0aecb5ab68 default skip y point to r (#2457)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-07-06 23:54:34 -07:00
carlushuang
a8742f7e31 [CK_TILE][CORE] enhance slice_tile api (#2430)
* support slice cross p

* fix some bug in y_len

* more case

* fix a bug when R exist

* support -1 to hint end of current length

* format

* change commit
2025-07-06 20:13:12 -07:00
Mingtao Gu
7998ae8969 [CK] Mxfp4 moe blockscale buf2lds version support (#2455)
* change cshuffle size

* added mxfp4 moe async buffer loading without B preshuffle

* added mx moe B shuffling + scale shuffling (async loads)

* minor fix

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-07-06 15:42:00 +08:00
Adam Osewski
3d70c638d1 Always force output clearing for grouped conv bwd data (#2446)
* Always force output clearing

* dont run set zero for residual

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-07-04 07:49:52 -06:00
Max Podkorytov
158ddeb8ce [CK-TILE] File-level documentation for static encoding pattern (#2433)
* add file-level comment

* Finished the write-up

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-07-04 02:26:18 -07:00
Vidyasagar Ananthan
2e971eff90 Removing reference to undefined parameter for ignore statement. (#2447) 2025-07-03 20:10:29 -07:00
damien-lejeune
1183824573 Fix clang in ck develop branch (#2445)
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2025-07-02 10:07:47 -06:00
chenjun
74a34e0f50 fix KPerBlock = 64 a8w8 bpreshulle gemm build fail in gfx950 (#2437)
Co-authored-by: valarLip <340077269@qq.com>
2025-07-02 19:12:07 +08:00
Gino Lu
60eb70f543 Fix return value bug that drops minus sign in some cases. (#2415)
* fix return value bug.

* refine change according to comment.
2025-07-02 14:53:00 +08:00
huaiguxu
e1c5172fdb Huaiguxu/moe fp8 pertoken scale fix (#2391)
* fix pertoken_scale a_scale dimension

* clang-format

* Fix moe_gemm2_fp8 perTokenScale reference and example.
2025-06-27 10:24:34 +08:00
linqunAMD
1749c0409e [CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375)
1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it.
2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance.
3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw.
2025-06-26 08:32:39 +08:00
Thomas Ning
e03293ebce [CK Tile] Int8 Support on CK Tile GEMM (#2267)
* updates to support int8 in 03_gemm example

* added comments, using aliases, helper functions

* test(gemm_universal): add test cases for int8 gemm pipeline

* fix(test_gemm): fix for failing test unit test for int8

* test(ck_tile): add int8 unit test for gemm universal

* refactor(gemm_universal): GPU reference verification for GEMM code improved

* style(gemm_universal): removed extra comments and did clang format

* merging recent changes to universal gemm to tile_engine

* ck tile engine integration work

* feat(tile_engine): add int8 support to tile engine ops/gemm

* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8

* style: Format code with clang-format-12

* refactor(tile_engine): address review comments

* style: removed unhelpful comments & unused variables.

* build: tile engine uses default config

* feat: add int8 support for CK_TILE GEMM

* style: added trailing commas to codegen_utils.py

* refactor: tile engine

* refactor: formatting and code review

* refactor: code formatting for python files

* fix: suppress build warning

* add support for gfx950

* refactor:KWarpTile size in gemms util

* Fix the branch and wrap up the k warp tile

* Add bf8 integration

* refactor: clang format and rebase

---------

Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
2025-06-25 08:20:35 -07:00
Rostyslav Geyyer
daf71fb8e4 Enable fp4 tests (#2329) 2025-06-25 07:38:54 -05:00
linqunAMD
37e1a27537 [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp defintion to a separate struct

* fix clang format

* fix clang format

* keep default bhavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-25 01:07:45 -07:00