Commit Graph

2071 Commits

Author SHA1 Message Date
Andriy Roshchenko
a024e11036 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8



[ROCm/composable_kernel commit: 518dc21ae8]
2025-07-11 13:07:05 -06:00
Khushbu Agarwal
f3120e7526 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example

* some more renames

* clang formatted

[ROCm/composable_kernel commit: d239b91fd5]
2025-07-11 08:27:55 -07:00
Qianfeng
337126469c Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* i_nhead_ conversion type to prevent overflow
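
The overflow mentioned above can be sketched as follows. This is only an illustration: `index_t`, `long_index_t`, `head_offset`, and `nhead_stride` are assumed names, and the commit text does not say which expression was actually widened.

```cpp
#include <cassert>
#include <cstdint>

// Assumed types for illustration; not taken from the commit itself.
using index_t      = std::int32_t; // 32-bit index, as kernels commonly use
using long_index_t = std::int64_t; // 64-bit type used for large offsets

// Hypothetical helper: widen the head index BEFORE multiplying, so the
// product is evaluated in 64 bits and cannot wrap at 2^31.
inline long_index_t head_offset(index_t i_nhead, index_t nhead_stride)
{
    return static_cast<long_index_t>(i_nhead) * nhead_stride;
}
```

With both operands left as 32-bit integers, a product such as 70000 * 70000 would exceed the `int32_t` range; converting one operand first keeps the whole multiplication in 64 bits.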

---------

Co-authored-by: ltqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: 45904b8fd7]
2025-07-11 18:14:47 +08:00
Aviral Goel
70900db661 fix(precommit_install): fix bug for bare metal machines (#2448)
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: a26ba690fd]
2025-07-10 11:00:47 -06:00
Andres Lugo
701b95cd25 Update FMHA recipe for Pytorch SDPA integration (#2480)
* Add receipts in splitk and appendk

* remove grouped

* Remove logits

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: aadeffde18]
2025-07-10 09:00:23 -07:00
Illia Silin
e61ceee502 Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short

* revert back to atomic add using casting

[ROCm/composable_kernel commit: 1b66f3f4a3]
2025-07-10 07:18:56 -07:00
Illia Silin
84a7b00497 Fix blockscale fp8 gemm examples (#2476)
* fix blockscale fp8 gemm examples

* refactor the compiler flags

* fix hip version calculation

[ROCm/composable_kernel commit: d9b37c7121]
2025-07-10 07:12:13 -07:00
shay-li77
4f08a02dae support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis

* format code

* add notes for param y_ratio

* fix comments error

* support template and mdiv for ratio mask

* refactor y-ratio mask constructor

* optimize coordinate calculation

* add SimplifiedRatioAttentionMask

[ROCm/composable_kernel commit: d814fefe18]
2025-07-09 23:18:55 +08:00
Yi DING
9f5cf4f49d [CK_TILE] Avoid compile kernel in host pass (#2475)
[ROCm/composable_kernel commit: 032ca60015]
2025-07-09 22:27:54 +08:00
Po Yen Chen
bdce8dbc9b [CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402)
* Wrap tile size mapping as class method

* Wrap pipeline generating as class method

* Add constraint as kernel dispatching criteria

* Support multiple tile size for a (hdim, hdim_v) combination

* Use smaller tile size if CU utilization is low

* Use integer as the key of the tile size map

* Fix type error

* Simply override parent class method return value

* Add attribute to eliminate warning

* Allow using environment variables to turn on/off custom factory

* Unify param naming style

* Add missing HIP runtime include directive

* Fix os.environ.get() usage

[ROCm/composable_kernel commit: ad9863fe05]
2025-07-09 22:01:33 +08:00
Vidyasagar Ananthan
eb26ffa875 New ninja tracing script (#2472)
* Adding ninja log json conversion utility

* Updating to match old ninjatracing

* Updating Jenkins to use new ninjatracing

* Ensuring v7 works

* Removing old ninjatracing from dockerfile

[ROCm/composable_kernel commit: e391b025a0]
2025-07-08 22:36:50 -07:00
Illia Silin
b1be1b8a3a Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474)
This reverts commit cf4002ad26835f0058c0d5d21fd2e1e3f401ea08.

[ROCm/composable_kernel commit: 93420ecf89]
2025-07-08 19:01:26 -07:00
Illia Silin
85af00c08c Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short
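
The compare-and-swap technique named in the last bullet can be illustrated on the host with `std::atomic`. This is a sketch under stated assumptions, not the HIP device code from the commit: `atomic_add_u16` is a hypothetical helper that updates one 16-bit half of a 32-bit word, which is the usual reason a CAS loop is needed when no native 16-bit atomic add is available.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side helper: atomically adds `value` to the low or high
// 16-bit half of a 32-bit word and returns the previous value of that half.
inline std::uint16_t atomic_add_u16(std::atomic<std::uint32_t>& word,
                                    bool high_half,
                                    std::uint16_t value)
{
    const unsigned shift     = high_half ? 16u : 0u;
    const std::uint32_t mask = 0xFFFFu << shift;
    std::uint32_t expected   = word.load(std::memory_order_relaxed);
    for(;;)
    {
        const auto old_half = static_cast<std::uint16_t>((expected >> shift) & 0xFFFFu);
        const auto new_half = static_cast<std::uint16_t>(old_half + value); // wraps mod 2^16
        const std::uint32_t desired =
            (expected & ~mask) | (static_cast<std::uint32_t>(new_half) << shift);
        // On failure, `expected` is refreshed with the current word and we retry.
        if(word.compare_exchange_weak(expected, desired))
            return old_half;
    }
}
```

The other half of the word is preserved on every attempt, so two threads updating different halves of the same word do not lose each other's updates.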

[ROCm/composable_kernel commit: 112b47e885]
2025-07-08 18:09:30 -04:00
Vidyasagar Ananthan
89f226aace Separating ninja build tracing and setting flag to false (#2470)
* Separating ninja build tracing and setting flag to false

* Add ftime-tracing flag

* Fix conditional issue

* Try adding a script block

* Embed Clang analysis in ftime trace block

[ROCm/composable_kernel commit: 33d704a6f9]
2025-07-08 10:52:00 -07:00
Haocong WANG
7c04d93083 [CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* cmake depends & bwd filter order fix

* revert unexpected reformat

* Avoid changing the fmha bwd filter order for downstream compatibility

* Revert unexpected changes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>

[ROCm/composable_kernel commit: 5557eadce6]
2025-07-08 10:42:07 +08:00
Illia Silin
99cf9b9cae fix compilation errors with clang20 (#2464)
[ROCm/composable_kernel commit: e033a1b4bf]
2025-07-07 19:40:30 -07:00
Po Yen Chen
a71dc1245f Eliminate warning caused by failure to meet occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: b2dea90116]
2025-07-08 09:17:25 +08:00
Thomas Ning
1129c9dc4e Enable Async Copy for MI355 (#2425)
* add for async load builtin

* add async load api

* fix some compiling errors

* fix a compiling error

* fix some compiling errors

* add a pipeline which copies from v4

* add a new pipeline for async load

* fix some compiling errors

* add async load tests

* fix some issues in async load

* fix

* fix async inline assembly

* fix async inline assembly

* add ignore header file

* comment some not gfx950 codes

* comment some not gfx950 codes

* fix an error

* update async load apis

* fix lds descriptor

* fix a compiling error

* fix some compiling errors

* fix a descriptor issue

* update lds descriptor

* change async pipeline's tile distribution pattern from thread to warp

* fix clang format

* update async policy

* fix a CRTP issue

* fix a typo error

* change lds layout

* fix some sync issues

* improve codes

* delete the async test

* fix a commented format issue

* avoid compiling device functions when compile host

* make gemm run

* add the copy kernel support

* finish the feature

* Address comment

* add the support for buffer_builtin

* solved the merging problem

* Comment Addressed

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>

[ROCm/composable_kernel commit: f240ae3248]
2025-07-07 10:08:49 -07:00
Andriy Roshchenko
2325a9fe3a MX GEMM - FP6 Example (#2419)
Adds support for MX FP6 data type in MX GEMM block pipeline version v1.
Provides an example of MX FP6 GEMM algorithm.

---------

Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>

[ROCm/composable_kernel commit: 054f85ab7c]
2025-07-07 10:33:26 -06:00
dependabot[bot]
f9c677d27f Bump sphinxcontrib-bibtex from 2.6.4 to 2.6.5 in /docs/sphinx (#2424)
---
updated-dependencies:
- dependency-name: sphinxcontrib-bibtex
  dependency-version: 2.6.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: bfe573d3ba]
2025-07-07 07:30:49 -07:00
spolifroni-amd
2fb9474ff0 updating the doxyfile and the index.rst so that it gets the full API (#2416)
* updating the doxyfile and the index.rst so that it gets the full API

* added recommended doxygen values

[ROCm/composable_kernel commit: 096bf2de41]
2025-07-07 07:29:36 -07:00
rahjain-amd
85c6fd56c5 Fixing Debug build (#2404)
Building `tile_example_fmha_bwd` failed with the error below:

```
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
  358 |         assert(slopes.size() == nhead);
      |                ~~~~~~~~~~~~~ ^  ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
  103 |      (static_cast <bool> (expr)                                         \
      |                           ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:989:16: note: in instantiation of function template specialization 'run<FmhaBwdFp16>' requested here
  989 |         return run<FmhaBwdFp16>(arg_parser) ? 0 : -2;
      |                ^
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
  358 |         assert(slopes.size() == nhead);
      |                ~~~~~~~~~~~~~ ^  ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
  103 |      (static_cast <bool> (expr)                                         \
      |                           ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:993:16: note: in instantiation of function template specialization 'run<FmhaBwdBf16>' requested here
  993 |         return run<FmhaBwdBf16>(arg_parser) ? 0 : -2;
      |                ^
2 errors generated when compiling for gfx942.
```

Fixed with a proper cast.
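
A minimal sketch of such a cast, assuming the check compares a `std::vector` size against an `int` head count as the error text suggests. The helper name `slopes_match_nhead` is hypothetical, and the exact cast used in the commit is not shown in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using index_t = int; // the compiler output reports ck_tile::index_t as 'int'

// Hypothetical helper: true when `slopes` holds exactly one entry per head.
inline bool slopes_match_nhead(const std::vector<float>& slopes, index_t nhead)
{
    // Both operands of == are now unsigned, so -Wsign-compare stays quiet.
    // The nhead >= 0 guard keeps a negative count from wrapping to a huge value.
    return nhead >= 0 && slopes.size() == static_cast<std::size_t>(nhead);
}
```

Casting the signed side to `std::size_t` is one option; casting `slopes.size()` to `index_t` would also silence the warning, at the cost of truncating very large sizes.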

[ROCm/composable_kernel commit: ad593c286f]
2025-07-07 14:46:22 +05:30
ltqin
ba133fe9b7 ck tile pagedkv prefill (#2405)
* add prefetching physical block id for pagedkv

* start add pagedkv prefill

* rename pipeline

* add kernel for pagedkv

* add an init version pagedkv prefill

* fix redefine issue

* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args

* generate dispatch code

* add body generating code

* compiling pass

* remove dropout from pagedkv

* set lse to false in generating code

* start changing qr kernel to pagedkv

* init version of kernel with pagedkv

* change names of file that are generated

* change host validation for pagedkv prefill

* using iglp to change blockgemm

* add kernel files to op head file

* show parameters

* rewrite print parameter fun

* add fwd

* remove default parameter of GridSize

* format

* fix nhead issue and add seqlen_k_ptr to batch mode

* format code

* remove no-longer used code

* format

* fix some comments

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 9f4c5d7372]
2025-07-07 16:16:54 +08:00
carlushuang
8e15d99ddc default skip y point to r (#2457)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 0aecb5ab68]
2025-07-06 23:54:34 -07:00
carlushuang
4ed061c05d [CK_TILE][CORE] enhance slice_tile api (#2430)
* support slice cross p

* fix some bug in y_len

* more case

* fix a bug when R exists

* support -1 to hint end of current length

* format

* change commit

[ROCm/composable_kernel commit: a8742f7e31]
2025-07-06 20:13:12 -07:00
Mingtao Gu
7face91352 [CK] Mxfp4 moe blockscale buf2lds version support (#2455)
* change cshuffle size

* added mxfp4 moe async buffer loading without B preshuffle

* added mx moe B shuffling + scale shuffling (async loads)

* minor fix

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 7998ae8969]
2025-07-06 15:42:00 +08:00
Adam Osewski
a53dbafec9 Always force output clearing for grouped conv bwd data (#2446)
* Always force output clearing

* don't run set zero for residual

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 3d70c638d1]
2025-07-04 07:49:52 -06:00
Mateusz Ozga
38d5c02e8a [CK-TILE DOC] Ck-tile grouped GEMM documentation (#1939)
* Ck-tile readme

* After review

* Review: part1

* Review part 3

[ROCm/composable_kernel commit: 394e5be10d]
2025-07-04 02:56:42 -07:00
Max Podkorytov
70f959ba12 [CK-TILE] File-level documentation for static encoding pattern (#2433)
* add file-level comment

* Finished the write-up

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 158ddeb8ce]
2025-07-04 02:26:18 -07:00
Vidyasagar Ananthan
01975e737e Removing reference to undefined parameter for ignore statement. (#2447)
[ROCm/composable_kernel commit: 2e971eff90]
2025-07-03 20:10:29 -07:00
Vidyasagar Ananthan
bd341803f2 Remove ftime tracing to avoid printing json files (#2452)
* Remove ftime tracing to avoid printing json files

* Factoring out build commands

[ROCm/composable_kernel commit: d2536b91bc]
2025-07-03 07:54:12 -07:00
Vidyasagar Ananthan
7f01554c70 Adding ddembeck to codeowners. (#2449)
Co-authored-by: Dave Dembeck <dave.dembeck@amd.com>

[ROCm/composable_kernel commit: 58d24a7172]
2025-07-02 20:47:09 -07:00
damien-lejeune
85054dc45a Fix clang in ck develop branch (#2445)
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>

[ROCm/composable_kernel commit: 1183824573]
2025-07-02 10:07:47 -06:00
chenjun
2a43b81549 fix KPerBlock = 64 a8w8 bpreshuffle gemm build fail in gfx950 (#2437)
Co-authored-by: valarLip <340077269@qq.com>

[ROCm/composable_kernel commit: 74a34e0f50]
2025-07-02 19:12:07 +08:00
Gino Lu
5ba3c3edd7 Fix return value bug that drops minus sign in some cases. (#2415)
* fix return value bug.

* refine change according to comment.

[ROCm/composable_kernel commit: 60eb70f543]
2025-07-02 14:53:00 +08:00
Aviral Goel
dfb7e0d358 [ckProfiler] Add infrastructure and instances to profile gemm_universal with B preshuffle (#2427)
* works on mi300

* fix(profiler): add error message for unsupported type/layout

* refactor(preshuffle.inc): add type aliases for code readability

[ROCm/composable_kernel commit: 36df1cbd0a]
2025-07-01 18:34:52 -07:00
Thrupti Raj Lakshmana Gowda
6a953648d1 Updating Runtime log for CK Tile Engine (#2431)
* Updating runtime log message for CK TILE ENGINE

* Fixing Clang Format

* Update tile_engine/ops/gemm/README.md

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

---------

Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: a03682cb80]
2025-07-01 10:59:49 -07:00
Aviral Goel
d5748a5c16 Enhancements in precommit_install.sh for Python and CK Tile code (#2400)
* fix(precommit_install): script now installs packages in virtual env

* fix(precommit_install): installs packages in virtual env

* feat(precommit): added ruff for python linting and formatting

* feat(precommit): added ruff for python linting and formatting

* feat(precommit): run ruff when py files are committed

* feat(precommit): remod.py is run when ck_tile modified

* add empty line at the end

* style(precommit.yaml): remove empty line

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: e9036a8fc2]
2025-07-01 01:11:10 -07:00
Vidyasagar Ananthan
7125833c40 Fix an earlier static check error due to assignment of variable in Jenkinsfile (#2420)
* Testing assignment of param fix

* Removing redundant changes

* Adding back unit test runs

* Ensuring Jenkins changes work on develop - to be reverted

* Revert "Ensuring Jenkins changes work on develop - to be reverted"

This reverts commit cf1cab4a43.

[ROCm/composable_kernel commit: 2fa9270a25]
2025-06-28 07:07:14 -07:00
Thomas Ning
189056103f Revert "Enable builds on gfx942 by default and run all tests on develop branc…" (#2418)
This reverts commit e4f117a18e6856d19730c3c8be6cffcb9a3dc12d.

[ROCm/composable_kernel commit: 28a63d7dcb]
2025-06-27 16:40:10 -07:00
huaiguxu
b12ae84a40 Huaiguxu/moe fp8 pertoken scale fix (#2391)
* fix pertoken_scale a_scale dimension

* clang-format

* Fix moe_gemm2_fp8 perTokenScale reference and example.

[ROCm/composable_kernel commit: e1c5172fdb]
2025-06-27 10:24:34 +08:00
linqunAMD
1541713b60 [CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375)
1. When the conv spec is 1x1, stride 1, pad 0, NCHW input is equivalent to matrix A in column-major layout, so only a minor change in the conv transformer is needed to support it.
2. When the output is NKHW, it is equivalent to matrix C in column-major layout, so A and B must be swapped to get the best performance.
3. Add the new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for NCHW.
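
The layout equivalence in point 1 can be checked with plain index arithmetic. The sketch below is an interpretation, not code from the commit: it assumes a packed NCHW tensor and treats each image as a column-major matrix with M = H*W rows and K = C columns. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (n, c, h, w) in a packed NCHW tensor.
constexpr std::size_t nchw_offset(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                                  std::size_t C, std::size_t H, std::size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}

// Offset of A(m, k) in a column-major matrix with M rows (leading dimension M).
constexpr std::size_t column_major_offset(std::size_t m, std::size_t k, std::size_t M)
{
    return m + k * M;
}

// For a 1x1, stride-1, pad-0 convolution the GEMM row index is the pixel
// (m = h * W + w) and the reduction index is the channel (k = c).
// Returns true if every NCHW offset equals the per-image column-major offset.
inline bool nchw_matches_column_major(std::size_t N, std::size_t C, std::size_t H, std::size_t W)
{
    const std::size_t M = H * W;
    for(std::size_t n = 0; n < N; ++n)
        for(std::size_t c = 0; c < C; ++c)
            for(std::size_t h = 0; h < H; ++h)
                for(std::size_t w = 0; w < W; ++w)
                {
                    const std::size_t image_base = n * C * M;
                    if(nchw_offset(n, c, h, w, C, H, W) !=
                       image_base + column_major_offset(h * W + w, c, M))
                        return false;
                }
    return true;
}
```

Under these assumptions no data movement is needed: only the descriptor (row count, leading dimension, per-image base offset) has to change.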


[ROCm/composable_kernel commit: 1749c0409e]
2025-06-26 08:32:39 +08:00
Khushbu Agarwal
d33891768a Enabling diff datatypes for tile_engine and build with more granularity (#2392)
* merging recent changes to universal gemm to tile_engine

* Reducing Linking time by generating less intermediate files

* make small libs to build faster

* Reducing the instances

* reducing instances

* Restoring default config

* Restoring default config

* warp_n reverted in default config

* Adding diff json files for fp8 and fp16, cmake changes for fp8

* Restructure the CMake File

* Added more granularity for build and some debugging code

* removed some of debugging statements

* added fp8 instances

* take datatype from command line to enable both type of json files

* updated README file

* code cleanup

* code cleanup

* updated jenkinsfile

* enable tile_engine daily builds

* updating cmake file

* updated CMakeLists.txt

* Updating CMake code fixing gfx12 build

* Updating CMake code fixing gfx12 build

* Fix CMake file null checks

* fixed traces of rebase

* Update tile_engine/ops/gemm/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update tile_engine/ops/gemm/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update tile_engine/ops/gemm/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* fixing rebase issue

---------

Co-authored-by: khushbu <khuagarw@gmail.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: a14753b86f]
2025-06-25 15:18:24 -07:00
Thomas Ning
90add28587 [CK Tile] Int8 Support on CK Tile GEMM (#2267)
* updates to support int8 in 03_gemm example

* added comments, using aliases, helper functions

* test(gemm_universal): add test cases for int8 gemm pipeline

* fix(test_gemm): fix for failing test unit test for int8

* test(ck_tile): add int8 unit test for gemm universal

* refactor(gemm_universal): GPU reference verification for GEMM code improved

* style(gemm_universal): removed extra comments and did clang format

* merging recent changes to universal gemm to tile_engine

* ck tile engine integration work

* feat(tile_engine): add int8 support to tile engine ops/gemm

* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8

* style: Format code with clang-format-12

* refactor(tile_engine): address review comments

* style: removed unhelpful comments & unused variables.

* build: tile engine uses default config

* feat: add int8 support for CK_TILE GEMM

* style: added trailing commas to codegen_utils.py

* refactor: tile engine

* refactor: formatting and code review

* refactor: code formatting for python files

* fix: suppress build warning

* add support for gfx950

* refactor:KWarpTile size in gemms util

* Fix the branch and wrap up the k warp tile

* Add bf8 integration

* refactor: clang format and rebase

---------

Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>

[ROCm/composable_kernel commit: e03293ebce]
2025-06-25 08:20:35 -07:00
Illia Silin
44656b6230 Enable builds on gfx942 by default and run all tests on develop branch. (#2408)
* add switches for architectures and force develop to run all tests

* move the test condition inside the function

* enable build on gfx942 by default

[ROCm/composable_kernel commit: 6d6f4c76c1]
2025-06-25 08:01:50 -07:00
Rostyslav Geyyer
e1b1dd7476 Enable fp4 tests (#2329)
[ROCm/composable_kernel commit: daf71fb8e4]
2025-06-25 07:38:54 -05:00
linqunAMD
d2ec53a74e [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp definition to a separate struct

* fix clang format

* fix clang format

* keep default behavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 37e1a27537]
2025-06-25 01:07:45 -07:00
Po Yen Chen
b62e551ccb [CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403)
* Rename batch_prerfill interface

* Add min_seqlen_q parameter in MakeKargs()

[ROCm/composable_kernel commit: 50fad03524]
2025-06-25 15:19:21 +08:00
Xiao Li
66d5fb7017 Fix amd_ck_fp8.hpp macro definitions (#2325)
* Fix amd_ck_fp8.hpp macro definitions

1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before.
2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace
   fp8_impl to avoid redefined error when building with rocm 6.4+
   (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h)
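
The two changes above follow a common header pattern, sketched here under stated assumptions. The default values 0 and 1 are placeholders chosen for illustration, and `fnuz_enabled` / `ocp_enabled` are hypothetical stand-ins for the helpers the commit moved into `fp8_impl`; their real signatures are not shown in this log.

```cpp
#include <cassert>

// Define each switch only when the build has not already provided it, so a
// value passed on the compiler command line (-DCK_USE_OCP_FP8=...) wins.
// The defaults below are assumptions for this sketch.
#ifndef CK_USE_FNUZ_FP8
#define CK_USE_FNUZ_FP8 0
#endif

#ifndef CK_USE_OCP_FP8
#define CK_USE_OCP_FP8 1
#endif

// Keeping helpers in a dedicated namespace avoids clashing with identically
// named symbols that another header may declare at global scope.
namespace fp8_impl {
constexpr bool fnuz_enabled() { return CK_USE_FNUZ_FP8 != 0; }
constexpr bool ocp_enabled() { return CK_USE_OCP_FP8 != 0; }
} // namespace fp8_impl
```

Callers then refer to `fp8_impl::ocp_enabled()` explicitly, which stays unambiguous even if a HIP header declares a global helper with a similar name.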


Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>

[ROCm/composable_kernel commit: bac51b6ec0]
2025-06-24 22:46:15 -06:00
Yi DING
c4ba466332 Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393)
[ROCm/composable_kernel commit: c5d9181e1b]
2025-06-24 16:35:54 -07:00