Commit Graph

596 Commits

Author SHA1 Message Date
Mingtao Gu
7998ae8969 [CK] Mxfp4 moe blockscale buf2lds version support (#2455)
* change cshuffle size

* added mxfp4 moe async buffer loading without B preshuffle

* added mx moe B shuffling + scale shuffling (async loads)

* minor fix

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-07-06 15:42:00 +08:00
Mateusz Ozga
394e5be10d [CK-TILE DOC] Ck-tile grouped GEMM documentation (#1939)
* Ck-tile readme

* After review

* Review: part1

* Review part 3
2025-07-04 02:56:42 -07:00
huaiguxu
e1c5172fdb Huaiguxu/moe fp8 pertoken scale fix (#2391)
* fix pertoken_scale a_scale dimension

* clang-format

* Fix moe_gemm2_fp8 perTokenScale reference and example.
2025-06-27 10:24:34 +08:00
Thomas Ning
e03293ebce [CK Tile] Int8 Support on CK Tile GEMM (#2267)
* updates to support int8 in 03_gemm example

* added comments, using aliases, helper functions

* test(gemm_universal): add test cases for int8 gemm pipeline

* fix(test_gemm): fix for failing test unit test for int8

* test(ck_tile): add int8 unit test for gemm universal

* refactor(gemm_universal): GPU reference verification for GEMM code improved

* style(gemm_universal): removed extra comments and did clang format

* merging recent changes to universal gemm to tile_engine

* ck tile engine integration work

* feat(tile_engine): add int8 support to tile engine ops/gemm

* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8

* style: Format code with clang-format-12

* refactor(tile_engine): address review comments

* style: removed unhelpful comments & unused variables.

* build: tile engine uses default config

* feat: add int8 support for CK_TILE GEMM

* style: added trailing commas to codegen_utils.py

* refactor: tile engine

* refactor: formatting and code review

* refactor: code formatting for python files

* fix: suppress build warning

* add support for gfx950

* refactor:KWarpTile size in gemms util

* Fix the branch and wrap up the k warp tile

* Add bf8 integration

* refactor: clang format and rebase

---------

Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
2025-06-25 08:20:35 -07:00
linqunAMD
37e1a27537 [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp defintion to a separate struct

* fix clang format

* fix clang format

* keep default bhavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-25 01:07:45 -07:00
Po Yen Chen
50fad03524 [CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403)
* Rename batch_prerfill interface

* Add min_seqlen_q parameter in MakeKargs()
2025-06-25 15:19:21 +08:00
Anton Gorenko
77123600ee Improve fmha_bwd tests performance (#2376)
* Avoid passing indices (std::vector) by value to host tensor's operator()

Each access requires 2 allocations and copies of the vector.

* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification

* Compute ds_hp_host_ref in parallel

This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.

* Do not use ForEach for simple copy and conversion of large tensors

These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
2025-06-24 07:45:24 -07:00
Yi DING
b8212864cf [CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114)
* 160+192

* Add splitkv d160

* cleanup

* fix

* Add change log

* Fix CHANGELOG

* Use static_cast

* Update ignored instance

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-06-24 01:33:31 +08:00
Linjun-AMD
61eb622e85 update the way to compute fmha fwd tflop, include mask type (#2386)
* update the way to compute fwd tflop, include mask type

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* remove unneccessary comment

* add necessary comment

* remove some comment

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu>
2025-06-23 15:53:58 +08:00
Po Yen Chen
7d669440a6 [CK_TILE] Fix compilation errors introduced in #2320, #2219 and #2214 (#2388)
* Fix compilation errors

* Fix more ck_tile example compilation errors
2025-06-23 12:29:15 +08:00
Bartłomiej Kocot
cebdee4d9e [CK TILE] Grouped Convolution Forward Kernel (#2188)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* rebase fixes

* fixes

* fixes
2025-06-20 15:44:36 -07:00
Thomas Ning
df6023e305 fix the mi350 error (#2378) 2025-06-20 12:50:13 -07:00
joyeamd
a2f01141aa transpose load api development (#2177)
* add transpose load; no real logic

* fix some compile errors

* fix some issues

* update transpose load logic

* add some fixes

* fix a distribution issue

* update some codes

* add some fix

* can pass; but no logic

* transpose load enable

* update tile transpose

* miss output tile distribution mapping

* hack for transpose 16x16

* update output tensor distribution

* delete unused variables

* fix transpose related codes

* update transpose load example

* exchange the iteration order

* fix 16x16 related dimension transpose

* fix a transpose index issue

* fix a transpose index issue

* fix clang format check

* update load tile transpose related codes

* fix compile errors and pass 16x16 tests

* fix a typo

* update logic

* check other data types

* add transpose load api

* update transpose load api

* fix clang format check

* change file name

* refactor codes

* update code name

* delete some unused codes

* delete the unused oob flag for transpose load

* update tensor view api for transpose load

* update for testing

* fix a typo error

* move transpose ops to example directory

* update transpose api

* update include file

* fix for pr review

* fix compile errors

* add transpose load; no real logic

* fix some compile errors

* fix some issues

* update transpose load logic

* add some fixes

* fix a distribution issue

* update some codes

* add some fix

* can pass; but no logic

* transpose load enable

* update tile transpose

* miss output tile distribution mapping

* hack for transpose 16x16

* update output tensor distribution

* delete unused variables

* fix transpose related codes

* update transpose load example

* exchange the iteration order

* fix 16x16 related dimension transpose

* fix a transpose index issue

* fix a transpose index issue

* fix clang format check

* update load tile transpose related codes

* fix compile errors and pass 16x16 tests

* fix a typo

* update logic

* check other data types

* add transpose load api

* update transpose load api

* fix clang format check

* change file name

* refactor codes

* update code name

* delete some unused codes

* delete the unused oob flag for transpose load

* update tensor view api for transpose load

* update for testing

* fix a typo error

* move transpose ops to example directory

* update transpose api

* update include file

* fix for pr review

* fix compile errors

* change directory name

* delete the duplicated directory

* update cmakelists file

* delete the unused codes

* update function names

* update transpose policy

* update code after remod.py

* update codes

* add some comment

* Polish the instr infrastructure

* build up the fixed instr

* redesign the transpose api, currently it has numerical error

* add the bf16 transpose

* fix some issues

* add some comments

* update document

* Finished the refactor of API and pass through the verification

* fix the merging issue

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-06-18 01:28:34 -07:00
linqunAMD
7aeec9a901 [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd (#2243)
* [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd

* fix error with the latest develop code.
2025-06-17 21:37:59 -07:00
carlushuang
a4e1248dba [CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335)
* support local_token for hipgraph

* update README

* fix comment

* fix fmoe example
2025-06-18 10:49:43 +08:00
linqunAMD
0eb8974502 [CK_TILE] Support multi-config in tile_example_gemm_universal (#2240)
* [CK_TILE] Support multi-config in tile_example_gemm_universal

Add GemmConfig in run_gemm_example to support multiple tile config.
- It is useful when use you need compare gemm perf with different tile/pipeline config
- we also can use it simplify the code for wmma support in the furture.

* [CK_TILE] Support multi-config in tile_example_gemm_universal

Address review comments

* rebase code and fix clang format.

* fix clang format

* support pipeline v5.

* fix merge conflict

* address review comment

* add missing file

* address review comment v2

* fix build error
2025-06-17 17:27:46 -07:00
John Afaganis
df54667102 Add missing copyright headers (#2359)
* Add missing copyright headers

* empty commit
2025-06-17 14:29:45 -07:00
Satyanvesh Dittakavi
4c57157d50 Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-06-17 11:54:30 -07:00
Thomas Ning
3c4cdfac4f Fix the CK Tile related operators (#2356)
* fix the flatmm

* Fix the pipeline

* address the comment
2025-06-16 17:38:52 -07:00
Illia Silin
5523df4b2d Revert "fix the flatmm (#2349)" (#2352)
This reverts commit d996bc78be.
2025-06-16 07:54:55 -07:00
Thomas Ning
d996bc78be fix the flatmm (#2349) 2025-06-16 02:17:53 -07:00
Illia Silin
a0f4db8d9c check for if misched-bottomup flag is valid (#2341) 2025-06-13 13:34:22 -07:00
Mateusz Ozga
bd96ac9742 [CK_TILE] Multiple-D GEMM example (#2219)
* Multiple d, initial commit

* Check Ds Layout

* Readme and clang format

* Update branch & conflicts

* Multiple D - fix clang-formatter

* Rename elemetwise_op

* Fix CI

* Code review part1

* Remove printf

* Remove unnecessary comment

* Add new tests with Col layout

* Review part 2

* Added support for Multiple D GEMM

* Update comment

* Remove maybe_unused

* Clang-format

* Review part 3

* Add comment to function

* Add comment to function: another

* Take number of params for a refrence function

* Remove additional d param for 0 tensor

* Change name of function

* Fix CI fails
2025-06-13 19:39:11 +02:00
kylasa
5f1ad09b61 Code drop for 2 warp ping pong scheduler along K dimension. (#2276)
* Code drop for 2 warp ping pong scheduler along K dimension.

* Addressing code review comments.

* Addressing Clang formatting issues.

* Addressing build issues.

* Addressing build issues of other GEMM pipelines with ping pong scheduler code drop.

* Fix for LDS memory size for GEMM pipelines.

* Addressing code review feedback comments.

* Change log update.

* Addressing code review comments and build issues.

* Added new policy for pipeline specific logic about LDS needs.

* Clang Fix during build.
2025-06-12 18:24:02 -07:00
Aviral Goel
e5ece14467 fix(gemm_universal): Update gemm_utils.hpp so it builds successfully for memory pipeline (#2336) 2025-06-12 15:27:14 -07:00
Yi DING
37554c31e8 Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297)
* [fix] align v3 gufusion pipeline

* fix device kernel selection.

* Add .co direct asm support by CK_USE_ASM_MOE_STAGE2_BLOCKSCALE

* experimental optimization for scale load in blkscale gemm

* Add asm for no-loop v3_128x128x128

* fix bugs

* tune fp8 example

* Update v1_128x128x128 to 2x2 instead of 4x1

* wip

* add warmup to asm launch

* wip2

* 16x16 function merged to moe

* temp save, a performant version.

* wip3

* Update .co binary to 16x16

* 16x16x128 correct; 64x64x128 failed

* update

* use mem_op::set when topk=1

* add mx fp8 b_preshuffle support, function not yet tested.

* Spilt the fp4 target. Fix the known bugs. 128x128x128 sanity checked; remove prints

* some fixes

* fix update

* remove some unnecessary hacky; enable 256x256x256 tilesize

* update for function debug

* Add pipeline v3. Have some runtime issue and register spill

* Fix pipe v3 correctness issue

* remove unnecessary hacky

* clang format

* fix a bug

* fix the bug, functional test passed

* tempsave; buggy at passed 4 e8m0 to scaled mfma

* added fp4_bpreshuffle example, build failures

* fixed some bugs

* implement shuffled scale mxfp4gemm, blocker: opsel not effect

* hotfix

* fix bugs, build passed

* (M, N, K)=(128, 128, 128) function failed.

* temp save for gemm1. Function not ready

* fix compile error. Gemm2 pass. Gemm1 WIP

* fix bug for a lds read

* update moe

* Compile pass. Gemm1 function WIP

* update moe

* fix fp8; fix even/odd

* tempsave

* update moe

* Revert "update"

This reverts commit 960b2bce1c.

* Revert "use mem_op::set when topk=1"

This reverts commit def952a178.

* Add v3 128x128x128_4x4_16x16.co for gfx950

* temp cmake flag suppression  for aiter test

* add code for mxfp4 gemm, blockscale not supported yet

* gemm1 up-only pass. GU WIP

* function pass with inline asm hacky

* revert unexpected file change

* updated and build passed

* update CE elementOP

* added code for debug

* Gemm1 GUFusion function pass. Perf WIP

* Fix fp8/bf8; remove duplicated code

* disable the scheduler in v3; bring it back when compiler feature ready.

* update moe v1 pipeline

* Add gemm1 v1 32x128x128

* remove schedule barrier

* updated

* Fix fp8/bf8 B-row

* mfma using asm, device result correct, host result need to check

* gemm1 v3 64x128x128 debug

* fix cpu ref

* a/b thread_desc stride fix

* Use random scale for init1

* 16x16x128 input size blockscale function passed

* fix blockscale gemm bug

* tempsave. Almost all instances passed.

* v1 fix for mi350.

* temp save

* debug save

* update debug

* fix the bug, 128x128x256 tile function passed

* v3

* rename moe block selector and pipeline

* Add gemm1 v1

* Add gemm1 v1 to selector

* added mx moe block v3 support, function passed

* compile error fix

* Improve the pipeline

* Pack e8m0 as int32_t

* v1 compile pass. Function not ready

* debug synchronize issue over different GPU/ROCm

* minor fix

* Add profiler filter

* Add f4 ckProfiler

* Fix example compile error

* Add f4 profiler examples

* tempsave

* v1 function pass.

* v3 function pass

* align file and function name

* mx_moe_fp4 ready for aiter with clang-format.

* modify the way we represent fp4

* generalize the pipeline scheduling.

* init moe mx f4 scale shuffle

* Cmakelist diable compiler-bound flags

* mx_fp4 default parameter change

* Moe blockscale gemm1&gemm2 asm support for aiter. Suppression cmkae flag til new compler.

* update code

* tempsave; modify the way we represent fp4

* generalize the pipeline scheduling.

* Add gemm1 gfx942 .co support

* updated code, build passed.

* Update gemm2 asm with latest compiler flag

* Fix mx f4 ckProfiler

* Fix blockwise gemm mx v1

* lds conflict free + buffer load lds

* Add gemm2 v3 64x128x128

* fix a, b scale loading bugs, a, b scale loading now correctly

* Add gemm2 v3 64x128x128

* commit with debug info

* fix fp4 profiler

* Add mx fp4 pileline v1 instances

* Fix v2 topk_weight cal. Add silu asm.

* v2 tok_weight WIP

* init mx fp4 B no preshuffle version

* tempsave. compile pass, function wrong

* enable fp4 moe no weigth preshuffle, function pass

* update the TFlops calculation in the example

* Add gemm2 64x128x128 asm. Fix BF16 ref.

* fix 2 typos in fp4_preshuffle

* Better kernel selection in device classes

* correct preShuffleBuffer

we should used packed k to do shuffle.

* lds conflict free + buffer load lds

* optimize offset math in dma

* Fix fp4 ckProfiler

* Fix MX MFMA tests

* fix f4 pipeline issues

* gemm1 func pass

* update mx moe gemm1_bns tile size to 64x128x256

* update mx moe gemm1 gemm2 TF and BW calculation

* fix typo

* temp save

* Fix example_gemm_mx build

* rename the block pipeline

* correct a typo in tail

* Add rotating to mx examples

* fix the correctness issue

* Fix v1; use M padding

* Add NT flag to B/BScale buffer

* Merge gemm_mx_common.hpp

* temp save, 4.4~4.5

* Fix 'Merge gemm_mx_common.hpp'

* refactor the pipeline

* Pad the M for scale buffer unconditionaly

* update MX moe GEMM1 hotloopscheduling

* change the gemm1 tile from 64x128x128 to 128x64x128

* Unconditional Ascale padding

* Pad shuffled a scale only

* pad ascale

* add vmcnt guard for async copy

* Profiler add f4 wp

* Merge preshuffle device

* Add more fp4 wp instances

* Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format

* Clang-format after 2 merges

* Remove rocm6.3 workaround flags and macro

* Fix fp8 config

* Fix bf8 config

* flag and barrier fix for copmiler branch MainOpSelV3

* Add fp8 profiler instances

* Remove debug infos; Enable flags for blockscale f8

* No asm ver. for merging moe blocksale fp8 into mainline

* update the flag name for f8blockscale

* recover example

* fix performance bug of bpreshuffle f8 gemm

* clang format, remove  single rate mfma restriction for f8

* remove single rate mfma restriction for f8 blockscale gemm

* Fix moe blockscale gemm1 barrier 0x800 for new compiler

* add pipeline v1 for MOE Gemm2

* Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns

* Fix OOB; add MB96 instances

* remove unnecessary files

* fix the cmake issue

* Enable splitk for mxfp4; clang format;

* Generate random tensor values with multiple threads

* Use packed_size_v for A/BPackedSize

* Fix warning

* Fix target_compile_options for disabled target on gfx942

* fix moe pki4 on gfx950

* doc the kGroup definition

* Fix ThreadwiseTensorSliceTransfer_v4::Run (Fuse scale)

* Refactor thread_copy_lds_direct_load; fix gfx942 direct lds load example; fix f16_pki4 example

* Fix unknown compiler flag

* fix two failed examples.

* fix some failure tile size in gfx950 universal gemm. fix test_gemm_fp16

* workaround fix for test_gemm_f32; * We have very limited support for lds direct load if input matrix is not K major

* fix test_gemm_splitk;

* Fix compile for mx_mfma_op

* add mfma selection logic for multipled_v3

* Clean up

* Fix device gemm mx link error

* improve the global atomic pattern

* Revert unnecessary copyright updates

* restore minimum_occupancy logic

* Avoid data race in moe gemm2 ref

* Build fp8 gemm_multiply_multiply and moe only on gfx94/95

* update the instance in device_mx_gemm

* Resolve comments

* Copyright 2025

* Remove unused code

* fix library linking issue

---------

Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-06-12 09:25:59 +08:00
Muhammed Emin Ozturk
6fad1c4874 Stream-K Reduction option as Runtime parameter and Compilation Error Fix (SK- Reduction) (#2145)
* reduction is passed as runtime parameter

* clang

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_streamk_v3.hpp

Co-authored-by: John Afaganis <john.afaganis@amd.com>

* Update include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp


* remove comment

---------
2025-06-11 10:59:44 -07:00
Khushbu Agarwal
bd270fe4bc fix flatmm kernel for bigger size for fp16 datatype (#2302) 2025-06-10 11:13:40 -07:00
Aviral Goel
aed0f5880c Label CMakeLists message() as DEBUG or STATUS for clean build output (#2301)
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)

* - marked all low importance build messages as log_level=DEBUG
2025-06-10 10:46:47 -07:00
Eisuke Kawashima
4e586ca958 chore: unset executable permission (#2303)
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2025-06-10 09:13:59 -07:00
Sami Remes
1c6f83df6c [CK_TILE] Tileloop persistent gemm - resubmit (#2299)
* Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)

This reverts commit 233e274077.

* Add missing header for kentry

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-06 14:18:49 -07:00
Andriy Roshchenko
00247e3c29 Optimized GEMMs for MX FP4/8 (#2294)
Adds V3 GEMM pipeline for MX FP4 and MX FP8 
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275)
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler




Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-06-05 13:54:15 -06:00
Illia Silin
233e274077 Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)
This reverts commit ffb52783d0.
2025-06-05 09:24:00 -07:00
Sami Remes
7ea1508b59 [CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222)
* Add TailHandler for V3, V4 and Mem pipelines

* Adapt examples and tests to use TailHandler

* move tail-handling logic to pipeline in persistent grouped gemm

* Fix Mem pipeline dispatching, add CompV4 dispatching

* Use a macro for handling the many tails of Mem pipeline

* Fix formatting again

* Use const-ref RunFunction, remove unnecessary try_run
2025-06-04 11:50:21 +03:00
Sami Remes
ffb52783d0 [CK_TILE] Tile loop persistent gemm kernel (#2191)
* Implement tile loop persistent gemm kernel

* Enable timing

* Add tests for persistent gemm

* Fix formatting

* Fix gemm_basic

* Rename True/False to Persistent/NonPersistent

* Use only one set of layouts for persistent tests

* Fix gemm example persistent template parameter

* Fix formatting
2025-06-04 11:46:28 +03:00
Anton Gorenko
52b4860a30 WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230)
* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Updates to support mixed precision

* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip

* Added support for F8xF16xF16 to gemm_wmma_universal

* Added support for F16xF8xF16 to gemm_wmma_universal

* Added support for BF16xI4xBF16 to gemm_wmma_universal

* Added support for F16xI4xF16 to gemm_wmma_universal

* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType

* Added missing test class for FP16_KM_NK

* Pre-commit hooks fixes

* Added padding instances for f16xf16xf16

* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Ammending changes for adding support for padding instances for f16xf16xf16

* Fixes for padding instances for f16xf16xf16

* Added padding instances for bf16xbf16, f8xf8

* Added packed instances for bf16xi4xbf16

* Added padding instances for f8xf16xf16

* Added padding instances for f16xf8xf16, f16xi4xf16

* Fixed typos for bf16xbf16xbf16 padding instances

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Fix int8 DTYPES check for gemm_bilinear

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
2025-06-04 12:22:33 +06:00
Khushbu Agarwal
59a85cb4bc [CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262)
* debugging issue

* debugging issue

* debugging

* debugging

* reverting debugging code

* clang formatted

* updating default_config.json

* fix ci failure

* clang formatted
2025-06-03 20:16:10 -07:00
Aviral Goel
11f6c14e03 Add 0 as an acceptable arguement for strides in CK GEMM example (Issue 2037) (#2268)
* add 0 as valid default arguement for strides

* add 0 as valid default arguement for strides

# Conflicts:
#	example/01_gemm/common.hpp
2025-06-03 07:26:58 -07:00
Illia Silin
b76fdbe47f Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274)
* upgrade to rocm6.4.1 and use gfx1x-generic targets

* add rocm version parsing

* fix the gfx10-3-generic syntax in cmake
2025-06-03 07:17:35 -07:00
Khushbu Agarwal
2e38eb4f1c Rotating buffer PR CI fix (#2257)
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"

This reverts commit bbdaf79a52.

* fix regression
2025-06-02 10:25:01 -07:00
slippedJim
57f497452a remove restriction of group mode hd192 no lse (#2252)
Co-authored-by: Jim <jimguo12@amd.com>
2025-05-30 10:14:21 +08:00
Po Yen Chen
28cd0dffc9 [CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251)
* Add constraint on traits/tile/pipeline

* Use kM0=128 if max_seqlen_q == 8192

* Re-format codegen script

* Remove redundant attr name postix

* Fix import error: default field in dataclass

* Use kK0=64 & kK1=64 to hide latency

* Use CU utilization to decide tile size
2025-05-29 18:36:33 +09:00
Illia Silin
bbdaf79a52 Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)
This reverts commit 99857e10e6.
2025-05-28 09:46:52 -06:00
Khushbu Agarwal
99857e10e6 [CK_tile] Add rotating buffer feature for universal gemm (#2200)
* Add rotating buffer feature for universal gemm

* adding changes in tile_engine

* Updated code to merge kernel_launch

* removing comments

* Enable rotating buffer changes to flatmm

* Created diff launch_kernel function for rotating buffer

* Simplfied calculation using macros

* merge code with new changes in tile_engine

* clang formatted

* Redefine macros
2025-05-27 23:00:58 -07:00
Aviral Goel
c52649ad57 Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234)
* added catch statements to examples

* clang format
2025-05-27 22:32:42 -07:00
Zzz9990
ece38b9d7a [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit aa9847e42d.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
SamiAario-AMD
380bca2b85 Fix 11_add_rmsnorm2d_rdquant (#2207) 2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00