Commit Graph

209 Commits

Author SHA1 Message Date
Khushbu Agarwal
395262f196 Adding validation for tile sizes in Tile Engine (#2189)
* Adding validation for tile sizes

* Add architecture in config, and shuffle lines of code in warp_gemm.hpp

* Enable MFMA for gfx950, and invalid tile handling

[ROCm/composable_kernel commit: 3d8d6e75e4]
2025-05-15 10:28:31 -07:00
BingYuan.Zhou
1b3d08bd1b fix moe sorting build fail (#2190)
* fix moe sorting build fail

* refile code

---------

Co-authored-by: solin <bingzhou@amd.com>

[ROCm/composable_kernel commit: 41c17d0a95]
2025-05-14 09:31:26 +08:00
Illia Silin
11b71b421e Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format

[ROCm/composable_kernel commit: 58f9e9ffbc]
2025-05-13 10:18:14 -07:00
Po Yen Chen
c3f74a83b3 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to slience the compilation error

* [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 2920604786]
2025-05-13 12:19:25 +08:00
Khushbu Agarwal
699091846a Disable SMFMA gfx90a (#2184)
* sparsity fix for gfx90a

* reverting tile_engine changes

[ROCm/composable_kernel commit: f05e45ba59]
2025-05-12 09:56:23 -07:00
Thomas Ning
1f7c2a88c0 Vectorized Transpose for Batched Transpose CK Tile Operator (#2131)
* Shared Memory for single data point

* CKTile Transpose vectorize CP1

* CKTile Transpose vectorize CP2

* CKTile Transpose vectorize CP2.1

* fixed the compile error of the transpose tile 2d

* Have the correct result for the current test sample

* Changes to printing tensor

* fp8 support added

* Debugging for transpose

* solving the corner issue

* Changed padding flag

* Intermideate Debugging

* Intermidiate Debugging

* Intermediate Debugging

* Finished debugging of the transpose op

* Code Cleanup

* Adding edge case smoke tests

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Addressing Review Comment

* Addressing Comments

* Addressing Comments

* Measuring Perf Tests

* Code Cleanup

* Changlog

* Added the running iterations

* clang format

* Fix the changelog

* Fix the compilation error

* change the printing factor

---------

Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>

[ROCm/composable_kernel commit: 9d1e44e56a]
2025-05-12 00:41:45 -07:00
Khushbu Agarwal
2dc3190518 Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 (#2172)
* Changes for updating tile distribution for shuffle and transpose

* Fixed swizzle and transpose, removed comments

* clang formatted

* Adding support for bf16 type

* Addressing review comments

[ROCm/composable_kernel commit: d8faf1c6a1]
2025-05-10 22:40:05 -07:00
Khushbu Agarwal
83180829b6 Disable SMFMA for gfx90a (#2182)
[ROCm/composable_kernel commit: ef72a4b9bc]
2025-05-09 00:18:07 -07:00
Thomas Ning
9fa7eed532 Revert "Disable the SMFMA instruction for gfx90a. (#2174)" (#2175)
This reverts commit 95b372278a9a57ac2be99700105674f25e5a3554.

[ROCm/composable_kernel commit: c757046d49]
2025-05-08 00:07:03 -07:00
Khushbu Agarwal
f7db78cfcf Disable the SMFMA instruction for gfx90a. (#2174)
* remove smfma for gfx90a

* clang formatted

[ROCm/composable_kernel commit: a32d907771]
2025-05-07 23:09:22 -07:00
BingYuan.Zhou
2428249bc7 Flatmm merge (#2168)
* sync with function interface of cshuffleepiloge,fix flatmm build fail

* move code from solin/flatmm which add mfma16*16*32fp8 and optimize flatmm

---------

Co-authored-by: solin <bingzhou@amd.com>

[ROCm/composable_kernel commit: 6a3960c1e1]
2025-05-08 12:59:57 +08:00
jakpiase
1653700350 fix for default epilogue (#2167)
[ROCm/composable_kernel commit: cb07ad84d5]
2025-05-07 10:46:53 -07:00
Aviral Goel
12085db8ff [CK_TILE] Add type traits to detect tile window types at compile time (#2158)
* added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline

* added type traits instead of enum to tile_window() and tile_window_linear() with debug comments

* removed comments, added documentation and clang format

[ROCm/composable_kernel commit: 769336b640]
2025-05-07 00:00:39 -07:00
carlushuang
807501ac3d [CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153)
* combine 2-3 as single stage

* support zeroing

* improve long tokens

* update specialization

* b16 ws

* 8bit topk optimize

* update 15 example

[ROCm/composable_kernel commit: 4e9b76f88c]
2025-05-06 17:32:07 +08:00
jakpiase
354a09f55c [CK_TILE] Remove scratch usage from universal gemm (#2001)
* moves kbatch condition outside of kernel

* add reviewer comments

* fixes

* fix tests

* fixes after review

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 0bcb804ad0]
2025-05-05 18:46:44 +02:00
Khushbu Agarwal
5d0b1b733c mfma_32x32x64_fp8/bf8 (#2148)
* support for mfma_32x32x64_fp8

* clang-formatted

* Fixing sparsity in codegen

[ROCm/composable_kernel commit: d58f2b8bd0]
2025-05-01 13:36:24 -07:00
Aviral Goel
7cedc33703 Add documentation for ck_tile::array<T,N> (#2078)
* addded documentation for ck_tile::array<T,N>

* clang format fix

* spelling errros

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* spelling errros

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Revert "spelling errros"

This reverts commit 4179e7d193.

* Revert "spelling errros"

This reverts commit 3f90733dbe.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: John Afaganis <john.afaganis@amd.com>

[ROCm/composable_kernel commit: 1d8ef40760]
2025-04-30 16:43:36 -07:00
Illia Silin
0b07559cbe Revert "Add ck tile examples to package (#1880)" (#2150)
[ROCm/composable_kernel commit: 9a9f59ae69]
2025-04-30 10:20:16 -07:00
Aviral Goel
d79abc2e03 Add Matrix A and Matrix B Swizzle for LDS in Computev4 policy (#2136)
* fixed computev4 policy bug for lds swizzle

* added swizzle for input matrix B

* Improved ComputeV4 policy and pipeline by swizzling A and B

* consolidated LDS descriptor functions in parent struct

[ROCm/composable_kernel commit: 65f182d617]
2025-04-28 18:20:47 -07:00
Khushbu Agarwal
aeb46e6a49 Support for MFMA_16x16x128 for fp8/bf8 (#2125)
* Adding 16x16x128 support for gfx950

* Support for fp8 and bf8

* fix input arguments for MFMA scale instruction

* clang-formatted

* Fixes for lwpck-3145 (#2138)

* Fix lds tile & cmake dep & default epilogue

* Fallback BTypeToUse to ADataType in WOQ cases

* reverting instance json file

* reverting instance json file

---------

Co-authored-by: Yi DING <yi.ding@amd.com>

[ROCm/composable_kernel commit: d107f3c3a5]
2025-04-28 18:19:50 -07:00
jakpiase
28783ec2f4 Add ck tile examples to package (#1880)
* add ck tile examples to package

* Update jenkinsfile

* fix for jenkinsfile

* fix for building ck tile code on non gfx9

* compile ck tile examples only for gfx94

* include ck tile examples in all target

* fix for basic gemm UseStructuredSparsity

* Update CMakeLists.txt

* Update gemm_pipeline_problem.hpp

* add targets to rocm install

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 434d19f696]
2025-04-28 09:53:19 -07:00
Yi DING
77c7fb1e6b Fix fp8 convert & add option for basic example (#2129)
[ROCm/composable_kernel commit: 8add2cf45d]
2025-04-27 16:26:05 -07:00
Khushbu Agarwal
5c17305a82 MFMA_32x32x16 for gfx950 (#2121)
* Enable MFMA_32x32x16 for fp16/BF16 for gfx950

* clang formatted

[ROCm/composable_kernel commit: a2ed34a112]
2025-04-24 10:20:22 -07:00
Illia Silin
08f859b460 make code compliant with std=c++20 (#2123)
[ROCm/composable_kernel commit: 01cb8379cd]
2025-04-24 10:14:52 -07:00
carlushuang
65199be10e [CK_TILE] support gfx950 matrix core in 01_fmha fwd (#2110)
* gfx950 01_fmha fwd

* fix comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 5487289fc4]
2025-04-23 12:40:18 -07:00
Gino Lu
983dac1699 [CK-Tile] warp-gemm support for using V_MFMA_F32_16x16x32_BF16 (#2073)
* draft v_mfma_f32_16x16x32_bf16

* fix error config and add debug code.

* Solve the CShuffle Problem

* draft v_mfma_f32_16x16x32_bf16

* fix error config and add debug code.

* Solve the CShuffle Problem

* fix error while testing new command

* Finished the feature of new mfma 16*16*32

* Addressed the comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 504f563f78]
2025-04-22 15:52:36 -07:00
Thomas Ning
83197c769f MFMA 16x16x32fp8 (#2103)
* add mfma_16x16x32_fp8

* clang format code

* Finished the fix for gemm basic

* clang foramt

* rebuild CI

* recover gemm.hpp

* add MFMA 16*16*32bf8

---------

Co-authored-by: solin <bingzhou@amd.com>

[ROCm/composable_kernel commit: a738e43445]
2025-04-21 10:21:35 -07:00
solin
dd2c3289c9 fix CI build fail
[ROCm/composable_kernel commit: c318ec0778]
2025-04-21 16:00:12 +08:00
BingYuan.Zhou
f6b51a1c7b [flatmm] implement basic fp16 flatmm (#2089)
* [flatmm] implement basic fp16 flatmm

* fix CI build fail

---------

Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>

[ROCm/composable_kernel commit: eaf1f0bf3b]
2025-04-16 16:51:17 +08:00
Thomas Ning
1d62f5edb9 Solve the Static Encoding Pattern compile error when the tile size is too small (#2079)
[ROCm/composable_kernel commit: 269f4f6af5]
2025-04-13 20:09:30 -07:00
jakpiase
addcd203eb [CK_TILE] Add 2:4 structured sparsity support for fp16 gemm (#1957)
* add structured sparsity fp16 support for gemm

* added reviewer suggestions

* update changelog

* update changelog

* add reviewers suggestions

* Minor fix

* clang fix

* fix doxygen

[ROCm/composable_kernel commit: 6c61f4d237]
2025-04-11 12:18:26 +02:00
slippedJim
cca9cca699 add fmha fwd splitkv receipt for aiter c++ api (#2068)
* add s_randval for c++ api

* Fix bug of bias in splitkv

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

[ROCm/composable_kernel commit: 5f885d2b7a]
2025-04-10 23:21:13 +08:00
Juan Manuel Martinez Caamaño
c4e0659e51 Replace inline assembly with builtins in FHMA (#2067)
* Replace inline assembly with builtins in FHMA

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: f14e648e7c]
2025-04-10 09:48:37 +02:00
Illia Silin
9cc561987f enable gfx115x support (#2065)
[ROCm/composable_kernel commit: 3e6d21adeb]
2025-04-09 10:06:42 -07:00
MHYang-gh
4e5b335b5e Make buffer coherence configurable in tensor view (#2041)
* Make buffer coherence configurable in tensor view

* Fix clang-format for tensor_view.hpp

[ROCm/composable_kernel commit: 03ce8729fd]
2025-04-08 15:34:11 -07:00
Max Podkorytov
1eec5cc073 simplify generate_tuple (#2043)
[ROCm/composable_kernel commit: 6ce0797dad]
2025-04-08 09:00:51 -07:00
aledudek
a42081842c [CK_TILE] Fix GEMM Memory Pipeline (#2034)
* [CK_TILE] Fix GEMM Memory Pipeline

* Fix transpose tile

* Add comments

[ROCm/composable_kernel commit: 80aae6119b]
2025-04-08 12:40:04 +02:00
Illia Silin
6d90b2eb50 Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic

[ROCm/composable_kernel commit: 572cd820ce]
2025-04-03 15:30:21 -07:00
Adam Osewski
0a607256cf Basic docs for universal gemm & ck-tile gemm. (#2014)
* Basic docs for universal gemm & ck-tile gemm.

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Reviewers suggestions.

* Align tparam names in doc with class tparams.

* More reviewers fine tuning ;)

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: e5ad48a784]
2025-04-02 11:03:40 +02:00
Seunghoon Lee
0ac4c37028 Fix Windows build. (#2012)
* Remove duplicate using uint64_t.

* Cast before shift.

[ROCm/composable_kernel commit: df32020f93]
2025-04-01 12:22:10 -07:00
Max Podkorytov
70ad8571c2 add a fast compilation path for static for (0..N) (#2005)
* add a fast compilation path for static for (0..N)

* Update functional2.hpp

add comment and put range applier into detail namespace

* Update functional.hpp

ditto for ck-tile

* prettify

* prettify more

* add comment

* clang-format

[ROCm/composable_kernel commit: c59a8bb206]
2025-04-01 12:06:25 -07:00
rocking
2b657d9a2c Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 8a20b62e91]
2025-03-28 21:58:06 +08:00
felix
817752cdb4 hotfix fix sorting int64 (#2025)
* fix sorting int64

* clang format

* fix example issue

* update WA issue #

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: a82f338fb9]
2025-03-28 11:31:52 +08:00
ruanjm
fcbf9630fe [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug ifx

[ROCm/composable_kernel commit: d49abdaa87]
2025-03-25 20:09:45 +08:00
MHYang-gh
1998fcda26 Fix A/B lds transform (#2007)
[ROCm/composable_kernel commit: c027637a8f]
2025-03-22 23:13:50 -07:00
BingYuan.Zhou
764d9468b2 fix ck_tile/basic_gemm build error (#1988)
[ROCm/composable_kernel commit: 5a0d693b86]
2025-03-20 22:01:14 -07:00
carlushuang
f588e4b08e [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: e3c9886cdf]
2025-03-20 11:00:29 -07:00
jakpiase
9bf1c41338 [CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms

* added reviewer comments

* fixed grouped gemm tests

[ROCm/composable_kernel commit: 0e91d32c61]
2025-03-20 11:17:04 +01:00
rocking
8be61cfc9d Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: b819c217e4]
2025-03-20 00:06:45 +08:00
carlushuang
5293517d0a Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit a222573537b139c8a4f6b870910bdd487935efa8.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 3e81279d26]
2025-03-13 11:41:39 +08:00