96 Commits

Author SHA1 Message Date
Hosang
859acb5ae7 [rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6)
[CK_TILE] Add LLC-aware FMHA head grouping and head-major
 scheduling on RDNA (#5018)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details
This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).

- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
  - Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
     improving temporal locality of per-head K/V reuse.

Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
    deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.

## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length target: gfx1201 layout: bshd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 21:19:23 +00:00
Yi DING
574c1c121a [rocm-libraries] ROCm/rocm-libraries#5174 (commit a358a21)
[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic
 Mode (#5174)

## Motivation
This PR enables a persistent-kernel execution path for FMHA backward
(dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is
split, stored, and converted back to final gradients.

## Technical Details
- Introduces a persistent-kernel grid mapping in deterministic mode and
updates split-count calculation accordingly.
- Extends kernel kargs to carry batch-related info needed for persistent
scheduling and dQ conversion.
- Refactors dQ store conditions and adds mask-type traits/utilities and
runner logging updates.

## Test Plan
- Jenkins
[base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline)
- Jenkins
[AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline)
- Jenkins
[FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline)
- local FA tests

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 06:14:31 +00:00
Anton Gorenko
2312eef6c3 [rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on
 gfx950 (#4368)

## Motivation

Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline

## Technical Details

The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.

Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
 * both 32x32x64 and 16x16x128 scale MFMAs are supported
 * Q and K scales are applied in hdim, V scales - in seqlen dimension
 * column-major V only
 * batch and group mode
 * bias, Alibi (tested but no instances by default, just like fp8)
 * masking etc.

Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 10:00:52 +00:00
Jeff Huang
6e558658ea [rocm-libraries] ROCm/rocm-libraries#4999 (commit 45f6624)
[CK] Fix 32-bit overflow in batch prefill kernel for >4GB KV
 cache (#4999)

Use SRD rebasing for page_block_size >= kN0: move SRD base pointer to
page start via 48-bit arithmetic, encode only within-page offset in
voffset. Original code path preserved for ps1/ps16 via constexpr-if.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-05 01:09:12 +00:00
Yi DING
b09112bbad [rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c)
[CK_TILE] FMHA BWD Launcher Interface

## Motivation
Reduce memory usage; Be prepared to implement optimizations of reducing
nsplits in deterministic cases.

## Technical Details
This PR introduces a new launcher interface for the FMHA backward
operation, replacing direct function calls with a more structured
approach. The launcher encapsulates kernel dispatch logic and provides
access to computed metadata like the number of dQ acc splits.

**Changes:**
- Added `fmha_bwd_launcher` class that wraps kernel execution and
exposes `dq_acc_splits`
- Moved `fmha_bwd_traits` construction earlier in the execution flow to
support launcher initialization
- Refactored code generation to produce both legacy API and new launcher
constructor

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 01:21:07 +00:00
Anton Gorenko
0d92fffedb [rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11

## Motivation

Add support of gfx11 architectures (RDNA3) to FMHA.

## Technical Details

Distributions (matrix elements to lane registers mapping) of gfx11 WMMA
are completely different from distributions of gfx9 MFMA and gfx12 WMMA.
There are two cases in FMHA where this difference matters:
* usage of results (matrix C) of one GEMM as input (matrix A) of another
GEMM.
* random number generation for dropout (implementation for gfx9 MFMA,
gfx12 WMMA and host validation produce the same results).

Both cases are solved by a special remapping implemented using
`__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`.

Additional changes:
* FMHA tests are now build and run only for those types for which
instances exist (gfx11 supports only fp16 and bf16).
* Two fixes for uninitialized values (`mask.sink` and
`do_fp8_static_quant`): they may contain garbage resulting in incorrect
dispatching logic, sometimes tests report that there are no instance
available for current parameters.
* Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when
they are not requested (i.e. every time), likely has no effect on
performance but makes disassembly a bit clearer.

## Test Plan

```
ninja test_ck_tile_fmha

bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_bwd_fp16
bin/test_ck_tile_fmha_bwd_bf16
```

## Test Result

All tests must pass (some tests may be skipped).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-21 01:15:57 +00:00
Jeff Huang
7b18f5fed2 [rocm-libraries] ROCm/rocm-libraries#4263 (commit f34aec2)
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implement per-page K/V quantization for paged attention:
  - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
  - Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with
computations

## Proposed changes

Please describe the motivation behind the pull request, whether it
enables a new feature or fixes a bug. If there are associated pull
requests or issues, please link them to the pull request.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-04 23:26:20 +00:00
ltqin
67f0b74ec6 Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit de5a1d730d.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-23 09:03:22 -08:00
Po Yen Chen
de5a1d730d Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit dd0b4294af.
2026-01-22 21:21:19 -08:00
ltqin
dd0b4294af Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel

* add block scale to kernel

* add smoke test

* format

* Revert "format"

This reverts commit 356c3c9706.

* only format my code

* format py

* fix auto not allowd in function prototype

* change instance tttt to ttff

* fix structured binding issue

* change s_acc elementwise op

* async pipeline add block scale

* add quantation P using shift exp2

* precompute (m - shift) once per row

* change blk scale seqstrt ptr name

* fix some name

* fix for  deduction guide

* fix some comments

* add P scale to qr_ksvs_pipeline

* add comment to idx_identity

* change the method of calculating descale block index

* unify naming style: use block_scale_ as name prefix

* unify naming style

* update the CHANGELOG.md

* Add FP8 block scale quantization support for FMHA forward kernel

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-21 20:58:26 -08:00
Yi DING
fcc9372c00 [CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615) 2026-01-21 09:54:46 +08:00
Linjun-AMD
717ed0b59f [CK_TILE][FMHA] Enable gpt-oss sink (#3490)
* Enable gptoss sink

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add gptoss sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update CHANGELOG.md

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix test args error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update test_fmha_fwd.cpp

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Revert "update sink test"

This reverts commit 970b4f1686.

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update valid sink_v in splitkv pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix clangformat error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix aiter scale error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qr_ks_vs.hpp

* div scale_s for sink_value

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* update sink_value with bias

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Fix typo in dropout parameter in fmha_batch_prefill_kernel

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* optimized some code

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix splitkv error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update sink reference

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* Update smoke_test_fwd_sink.sh

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-14 21:32:06 +08:00
Thomas Ning
00c46785a8 Shuffle fix for gfx950 (#3491)
* solve compiler issue

* solve the gfx950 mfma shuffle regression

* refactor jenkinsfile to handle arch name better

* [CK TILE] set divisor to count of thread along k dimension

* fix the compiler error

* solve degradation

* Finish the multiplies fix

* fix the scales

* solve compilation error

* solve the composes

* solve the error of tile sweeper

* fix the test and example

* fix for gfx950

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2026-01-13 09:21:29 -08:00
Jeff Huang
cc75a1dc5f [FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter

* add is_sglang_layout to  parameters

* add kv_offset_array_transform to batch async for page size 16

* add kv_last_page_lens to kernel

* change kv layout to [num_total_pages, page_block_size, hdim]

* format

* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
  batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)

* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024

* fix python format

* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]

* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcode values
2. Makes batch prefill kernel traits structures inherent from fmha fwd
   traits
3. Add some static check for Page size, vector size, hdim, ..., etc.

* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration

Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.

**Changes:**
*   Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
*   Updated Kernel, Pipeline, and Traits to template on these Enums.
*   Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
*   Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
*   Updated CodeGen scripts to support new parameters.

This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.

* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets is in same page within a tile.

* fix vgpr spill count

* remove unnecessary t2s functions

* add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py

* support linear kv cache layout

* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.

* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.

* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.

* update changelog

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2026-01-05 18:41:47 +08:00
Po Yen Chen
1c3151963b [CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425)
* Add fp8bf16 support for batch_prefill

* Fix wrong scale_s re-compute logic in batch_prefill

* Fix wrong scale_s re-compute logic in fmha fwd

* Fix batch_prefill codegen error

* Remove no-longer used GetName() function

* Add fp8 logits=True instances

* Update CHANGELOG.md
2025-12-24 10:34:06 +08:00
Po Yen Chen
bfac64953f [CK_TILE][FMHA] Add logits soft-capping support for FAv3 (WIP) (#3355)
* Let fmha_fwd_v3() compatible with fmha_fwd()

* Decouple get_fwd_blobs() and FmhaFwdKernel

* Decouple compatibility checks from get_fwd_blobs()

* Extract product feature checks out from get_fwd_blobs()

* Remove duplicated code in factories and redundant checks

* Remove FmhaFwdKernel<>::GetName()

* Let FmhaFwdApiPool support pipelines with different mask_impl

* Add tile setting for fmha fwd v3 pipeline

* Add fwd v3 instances to tile_example_fmha_fwd manually

* Remove unused function import

* Undo irrelevant changes

* Remove fwd v3 instances from tile_example_fmha_fwd

* Finish fmha fwd v3 kernel instance codegen

* Fix formatting

* Remove unused F_idx attribute

* Add is_generic_attention_mask<> traits

* Add constraints to the fmha fwd v3 pipeline

* Unify traits & problem used for fmha fwd v3

* Unify kernel launch code for fmha fwd v2 & v3

* Unify kernel template selection logic

* Use same kernel codegen template for both v2 & v3

* Rename api() property as render() method

* Allow specifying filter for fmha fwd api pool

* Allow specifying function name when rendering api pool items

* Separate fmha fwd v3 kernel dispatching logic from v2

* Remove lambda assignment

* Add simple v2/v3 dispatch logic

* Stop generating empty if-clauses

Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.

* Use "".join() to concatenate fmha fwd api string content

* Add more feature checks for fmha fwd v3 pipeline

* Check features before dispatch to fmha_fwd_v3()

* Add more feature checks for fmha_fwd_v3()

* Add missing filter call

* Use Tuple to reserve the dtype orders

* Fix wrong pipeline matching logic

* Add fmha fwd v3 group mode instances

* Add functor_transform<>

* Add type constraints to make_tile_window()

* Remove fmha fwd v3 example

* Fix wrong product(aiter mha_fwd()) config

* Fix wrong fmha fwd v2/v3 selection logic

* Fix formatting

* Add comment to warning v3 kernel users

* Fix wrong codegen logics

* Remove unnecessary param

* Fix format

* Add logits soft-capping support for fmha fwd v3 pipeline (WIP)

* Add missing Kargs base type

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-18 16:08:45 +08:00
Linjun-AMD
f5573f56d9 Add attention sink support for FMHA FWD (#3368)
* Revert "Revert "Add attn sink (#2892)" (#3250)"

This reverts commit 5adaa201ed.

* fix conflict

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Add F_sink parameter to FmhaFwdPipeline

* Update tile_fmha_traits.hpp

* Refactor pipeline creation in fmha_fwd.py

Updated the pipeline creation logic to include 'sink' parameter in product combinations and adjusted the FmhaFwdPipeline calls accordingly.

* Update fmha_fwd.py

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/script/correct_test_fwd_sink.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update CHANGELOG.md

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update CHANGELOG with new features and support

* Update fmha_fwd.hpp

* Update CHANGELOG.md

* Update smoke_test_fwd_sink.sh

* Update correct_test_fwd_sink.sh

* Update smoke_test_fwd_sink.sh

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-15 12:21:59 +08:00
Po Yen Chen
05292b3604 [CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153)
* Let fmha_fwd_v3() compatible with fmha_fwd()

* Decouple get_fwd_blobs() and FmhaFwdKernel

* Decouple compatibility checks from get_fwd_blobs()

* Extract product feature checks out from get_fwd_blobs()

* Remove duplicated code in factories and redundant checks

* Remove FmhaFwdKernel<>::GetName()

* Let FmhaFwdApiPool support pipelines with different mask_impl

* Add tile setting for fmha fwd v3 pipeline

* Add fwd v3 instances to tile_example_fmha_fwd manually

* Remove unused function import

* Undo irrelevant changes

* Remove fwd v3 instances from tile_example_fmha_fwd

* Finish fmha fwd v3 kernel instance codegen

* Fix formatting

* Remove unused F_idx attribute

* Add is_generic_attention_mask<> traits

* Add constraints to the fmha fwd v3 pipeline

* Unify traits & problem used for fmha fwd v3

* Unify kernel launch code for fmha fwd v2 & v3

* Unify kernel template selection logic

* Use same kernel codegen template for both v2 & v3

* Rename api() property as render() method

* Allow specifying filter for fmha fwd api pool

* Allow specifying function name when rendering api pool items

* Separate fmha fwd v3 kernel dispatching logic from v2

* Remove lambda assignment

* Add simple v2/v3 dispatch logic

* Stop generating empty if-clauses

Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.

* Use "".join() to concatenate fmha fwd api string content

* Add more feature checks for fmha fwd v3 pipeline

* Check features before dispatch to fmha_fwd_v3()

* Add more feature checks for fmha_fwd_v3()

* Add missing filter call

* Use Tuple to reserve the dtype orders

* Fix wrong pipeline matching logic

* Add fmha fwd v3 group mode instances

* Add functor_transform<>

* Add type constraints to make_tile_window()

* Remove fmha fwd v3 example

* Fix wrong product(aiter mha_fwd()) config

* Fix wrong fmha fwd v2/v3 selection logic

* Fix formatting

* Add comment to warning v3 kernel users

* Fix wrong codegen logics

* Remove unnecessary param

* Fix format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-05 10:31:12 +08:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
rocking
229d43ea0c Fix batch prefill compile fail in aiter (#3279)
* Fix batch prefill aiter compile fail

* Fix compile error
2025-11-25 09:46:32 +08:00
Qianfeng
81042ea574 Fix a bug for qr_ks_vs_async_trload pipeline (#3271) 2025-11-24 21:31:48 +08:00
rocking
5948dbffe4 Support fp8 dynamic quantization for fmha (#3206)
* Support qscale for dynamic quant, remove static quant

* Support hdim=256

* Remove bias test case for fp8

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-11-24 16:28:25 +08:00
asleepzzz
5adaa201ed Revert "Add attn sink (#2892)" (#3250)
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab Add attn sink (#2892)
* enable attn sink

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update attn_sink script

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix some error

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* clang-format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd_kernel'mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix ci error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix format error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_bwd_pipeline_default_policy.hpp

* Update fmha_fwd_runner.hpp

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* update splitkv_pipline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update splitkv&pagedkv pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* add sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update attn_sink result log

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update smoke_test_fwd_sink.sh

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test file

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test script

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

* use constexpr kHasSink for sink in fmha pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update by pre-commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove causal mask setting logic from mask.hpp

Removed the mask setting logic for causal masks.

* fix ci error that some usage of lamada not support in c++17

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update remod.py

* add smoke sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update fmha_pagedkv_prefill.py

* Update FmhaFwdPipeline parameters in fmha_fwd.py

* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix c++17 unsupprot error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

* Fix formatting of sink_seq_end assignment

* Fix indentation for sink_seq_end assignment

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-20 19:24:05 +08:00
Anton Gorenko
d7b3197869 [CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) (#3221)
* Use vectorized stores for dropout randvals

With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of
16 buffer_store_byte. This requires less registers and reduces spilling.

* Calculate dropout randvals for storing and applying only once

Even though it may add a small overhead when storing is not required,
it uses significantly less registers and hence no spilling.
2025-11-19 10:40:12 +05:00
Anton Gorenko
1e77695fe8 [CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests

* Add WMMA support to fwd FMHA pipelines

* Tune tile sizes a bit for less spilling

fp16 256 is still quite slow

* Fix Q grad tile distribution for warp size = 32 and hdim >= 256

With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.

* Use code based on BlockDropout in BlockDropoutBwd

* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal

* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
  because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
  it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
    * any number of warps (power-of-2), not only 4;
    * kN1 = 16, not only 32;
    * any number of splits;

* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now

* Replace hard-coded kN1 in dispatch code with the requested tile size

* Add gfx12-specific tile sizes for split KV

* Pass GPU architecture to kernel generation scripts

This is still a temporary solution.

* Build and run FMHA CI tests for gfx12

* Fix issue after merging

* Fix bwd tile sizes

The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).

* Use hardware f32->f8 on gfx12, remove v_perm

__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
 32-bit dword) is used to store results (two f8 values).

* Update changelog

* Add WMMA support to pagedkv

* Fix scripts after rebasing

* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Fix names after cherry-picking

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Do not use filters related to qr_async_trload

They disable tiles/pipelines which are valid for gfx12.

* Use different dstr encoding when C is transposed

* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code

Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().

* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution

Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.

* Generalize usages of WarpGemmDispatcher for MFMA and WMMA

WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.

* Mark has_load_tr as maybe_unused

There are no transpose loading for RDNA.

* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code

* Detect BlockSize on host based on warp size of the current device

If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.

* Fix calculation of grid size for combine kernel with warp size = 32

* Add missing includes and header

* Support multiple archs in one binary for fwd

* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill

* Support multiple archs in one binary for bwd

* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
  they can be used as fallbacks (similarly to fwd);

* Extract common code from register_traits

* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"

To simplify merging , the proper fix is in develop already.

* Support new numerical d paddings in trait ordering checks

* Build fp32 tests only on gfx9

* Do not use hardcoded M0 = 64 for dot bwd kernel

* Use textwrap.indent from standard library

* Make fp8 pipelines on gfx12 consistent with gfx9

* Update tests for current pipelines

* Make ninja check more responsive in CI

ninja buffers output so this job looks hanging.

* Support fp8fp32 by limiting O vector size

The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).

* Remove unused cmake options

* Unify including  amd_buffer_addressing.hpp/_builtins.hpp

* Temporarily use amd_buffer_addressing.hpp on >=gfx10

amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
 * 1 scalar for exec masks instead of 2,
 * gfx12 uses different instruction names etc.

* Update asm in bf16 conversions to work with warp 32

* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd

* Add arch tags to kernels/host funcs, compile for each arch separately

* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename

* Add workaround for miscompilation of bwd with padded hdim

SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.

* Fix add_gtest_executable for absolute paths

Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.

* Run only hdim 128 smoke tests for fp8fp32

There are no instances for hdim 64 and 256.

* Format py with ruff to simplify merging develop

* Fix incorrect var name

* Codegen for gfx9,gfx950 when --targets is not specified

Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).

* Combine arch-related values into ArchTrait

This more centralized approach removes duplication of various formatting templates.

* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"

Some code is extracted into a function.
2025-10-29 13:31:08 -07:00
Jeff Huang
7c6430eca0 [CK_TILE] fmha: Add query padding support to backward pass (#3097)
* [CK_TILE] fmha: Add query padding support to backward pass

Introduces support for query sequence padding (q_padding) in the FMHA backward pass kernels.
- Passing `seqlen_q_ptr` to the backward kernels to distinguish logical from physical sequence lengths.
- Updating `OGradDotO`, `ConvertQGrad`, and `DQDKDV` kernels to respect logical lengths and handle zero-length sequences.
- Aligning LSE indexing in the forward kernel with the padded layout for consistency.
- Adding a new GTest suite (`test_fmha_bwd_kernel_padding.cpp`) with comprehensive tests for various padding scenarios, including zero-length
  sequences and deterministic mode.

* fix clang format

* Adapt fmha_bwd_runner.cpp to new q, kv sequence padding
Add backward q/kv sequence padding unit tests.

* [CK_TILE] fmha: Unify sequence length and padding handling

Refactor the handling of sequence lengths and padding in the
FMHA forward and backward kernels to provide a more unified and flexible
interface.

- Replaced `seqstart_padded_*_ptr` with a more robust system that uses
  `seqstart_*_ptr` for physical sequence lengths and introduces
  `seqlen_*_ptr` and `cu_seqlen_*_ptr` for logical (unpadded) lengths.
- Established a clear order of precedence for determining sequence
  length: cumulative lengths (`cu_seqlen_*_ptr`) take priority,
  followed by per-sequence lengths (`seqlen_*_ptr`), and finally
  physical lengths derived from `seqstart_*_ptr`.
- Clarified the distinction between "group mode" and "batch mode" and
  how sequence lengths are handled in each case.
- Renamed `cu_seqlen_kv_ptr` to `cu_seqlen_k_ptr` for consistency.
- Updated comments and documentation to reflect the new argument
  structure and usage.

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-10-29 13:56:11 +08:00
Anton Gorenko
1edd250115 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd
2025-09-27 18:03:48 +05:00
Yi DING
32773fe5cb [CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918) 2025-09-26 16:42:59 +08:00
Jeff Huang
518d24e662 Add sequence padding and variable length support in fmha (#2932)
* * [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
   `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
   `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that
   specifically validate/benchmark the new padding and variable-length
   functionalities in both group and batch modes.

* [CK_TILE] Fix build error in fmha unit tests

* [CK_TILE] add mqa, gqa to sequence padding unit tests

* [CI_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI

* [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel
2025-09-26 12:36:27 +08:00
Khushbu Agarwal
b56e5d1d79 Fix for Add the API to load SGPR (#2913)
* Revert "Revert "[CK-Tile] Add the API to load SGPR  (#2878)" (#2904)"

This reverts commit f161b5b738.

* Fix: sgpr minor issue

* cyclic dependency resolved

* clang formatted

* removing unused variable

* clang formatted

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-25 10:32:42 -07:00
ltqin
ab22f91a7c fix fmha fwd kernel name (#2880)
* fix fmha fwd kernel name

* if the input and output types are the same, keep the original code
2025-09-24 20:00:10 -07:00
asleepzzz
f161b5b738 Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)
This reverts commit 2cbbf5dcb3.
2025-09-23 14:33:51 -07:00
Haocong WANG
7b16782d7c [CK_TILE] Fix fmha bwd (#2865)
* Fix fmha bwd filter

* remove unnecessary change

* enable test cases

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-23 19:59:27 +08:00
Thomas Ning
2cbbf5dcb3 [CK-Tile] Add the API to load SGPR (#2878)
* Have a workable version for SGPR

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* substitute with the new sgpr read api

* update the CHANGELOG

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* change to static for logic

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
2025-09-23 01:23:56 -07:00
Haocong WANG
b6e8994386 [CK_TILE] FMHA FWD bug fix (#2888)
* tempsave debug

* fix the bug in fmha fwd_kernel

* Remove unnecessary changes

* Fix the buggy part

* remove fmha fwd known failure cases
2025-09-23 15:00:46 +08:00
Illia Silin
b765fe78f3 Revert "[CK_TILE] Add sequence padding and variable length support in fmha (a…" (#2883)
This reverts commit 86dd59cd01.
2025-09-19 08:15:02 -07:00
Jeff Huang
86dd59cd01 [CK_TILE] Add sequence padding and variable length support in fmha (a… (#2851)
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh`,
   and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well.
   These test cases and benchmarks that specifically validate/benchmark the
   new padding and variable-length functionalities in both group and batch modes.

* [CK_TILE] Fix build error in fmha unit tests

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-19 17:36:49 +08:00
ltqin
dd249f1cd6 Add input fp8 and output bf16 attention (#2726)
* change host using fp16 to check

* fp8 to fp8 compare

* rewrite input parameters

* add not squant

* remove some output code

* for scale = 1

* format

* saturates only for fp8

* add fp8bf16 data type

* add fp8bf16 data type

* fix test fp8 code

* add run_fp8bf16_tests

* change fmha fwd example parameter(adding fp8bf16)

* Support fp8bf16 for Aiter

* Support aiter fp8bf16 in c++

* fix comment about fp8 in readme.md

* add fp8fp32

* add fp8fp32 test

* remove range_q etc.

* format

* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type

* add fp8bf16 to data_type function

* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8

* format

* reset atol for fp8

* fix bug for atol

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-09-19 14:26:43 +08:00
Haocong WANG
59cb906482 [CK_TILE] fix bug when iperm =0 in fmha fwd (#2820)
* fix bug when iperm =0 in fmha fwd

* Disable f8 fmha smoke test until fix pr merged

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-09-16 15:07:10 +08:00
Po Yen Chen
7fbc9d6c97 [CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833)
* Re-mapping thread block indices for causal=True kernels

* Use more intuitive remap_opt value

* Fallback to origin remapping if seqlen_q >= 64K

* Use GenericAttentionMask to reduce mask computation

* Avoid unnecessary boundary check for IsMasking=false case

* Fix wrong kernel entry specifier

* Add s_nop to prevent delay wave0-3

* Refine scheduling

* Remove unnecessary sched_group_barrier()

* Move sched_group_barrier() call to scheduler

* Replace inline asm s_setprio with intrinsics

* Rephrase comments

* Expend some o_acc rescaling insts to avoid SIMD idle

* Fix block idx special mapping logic

* Tune block index mapping for causal=False cases

* Tune block index mapping for causal=True cases

* Fix wrong vmcnt()

* Remove parameter name

* Use boolean option for turn on/off causal mask

* Update benchmark_fwd_v3.sh option usages

* Add option if compiler support it
2025-09-16 11:32:38 +08:00
Yi DING
91178b4011 [CK_TILE] Fix kname & typo in FMHA BWD (#2809) 2025-09-09 15:08:00 -07:00
Po Yen Chen
9f35cde374 [CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765)
* Fix Default2DEpilogue usage

* Fix Default2DEpilogue usage for batch_prefill
2025-09-02 09:51:56 -07:00
Haocong WANG
33418b201f Fix naming issue (#2762) 2025-09-02 11:18:53 +08:00
Po Yen Chen
d876e87fe4 [CK_TILE] Add FAv3 fwd pipeline (#2731)
* Add FAv3 fwd pipeline

* Unpack v_pk_mul to hide v_mov

* Avoid compiler moving l compute across phase

* Sync sched_group_barrier() setting for masking cases
2025-09-01 09:16:45 +08:00
Mateusz Ozga
0758883fa4 [CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752)
* Init commit

* Quick fix, CI fails

* Remove CDElementWise

* Add CDEELementWise

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-28 12:45:50 -07:00
Yi DING
de61e55493 [CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715)
* FMHA avoid unnecessary vmcnt0

Squashed commit of the following:

commit 7bdf6a7eef
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:51 2025 +0000

    merge develop and solve conflicts

commit f21e916a8c
Merge: a7dd2a7d1 0db21053e
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:21 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue

commit a7dd2a7d13
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 02:17:43 2025 +0000

    update bwd

commit 380aa8f311
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Mon Aug 18 19:36:38 2025 +0000

    add restrict to applicable functions

commit b85daba2a3
Author: Ding, Yi <yi.ding@amd.com>
Date:   Mon Aug 18 02:07:03 2025 +0000

    bwd filter

commit 75c4b9372f
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Sat Aug 16 08:15:23 2025 +0000

    remove noinline attr as it causes a lot more s_waitcnt's

commit 598e3fec41
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 12:11:17 2025 +0000

    remove innerloop, move restrict parameters to mainloop and add noinline attribute.

commit 3340408537
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 07:06:51 2025 +0000

    Create inner lambda with restrict parameters, add restrict to some parameters

commit 3bc45ecbc7
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Aug 14 03:43:54 2025 +0000

    save for debug

commit de4db6c4c5
Merge: 108abf00e 68694cb78
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:15:22 2025 +0000

    Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 108abf00e0
Merge: 0810799e2 0f42a92fc
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:14:26 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 68694cb781
Merge: 0810799e2 20288caa2
Author: asleepzzz <hanwen.chang@amd.com>
Date:   Wed Aug 13 00:34:11 2025 +0800

    Merge branch 'develop' into wip-async-tr-fa

commit 0810799e25
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 14:25:50 2025 +0000

    refactor blockgemm change, isolate to v2;

commit fd1eb323af
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:26:13 2025 +0000

    clang format

commit 75f6f6bac4
Merge: bcc05eee6 8e1eb0c1e
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:04:41 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit bcc05eee62
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 08:46:06 2025 +0000

    Fix the bug

commit 96d24497f5
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 04:02:41 2025 +0000

    fix conflict. disable all v-col instance for fmha fwd

commit 1716171be4
Merge: 1c9800790 4fde1646e
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 03:52:34 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 1c98007901
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:53:31 2025 +0000

    clang format

commit f43e903b1d
Merge: 3868ddd70 a7badc6ec
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:52:52 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 3868ddd708
Merge: 498d234ab 191c62967
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:59:40 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 498d234ab8
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:37:37 2025 +0000

    change the warp setting for hdim32 fmha fwd

commit b86f7786e2
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 14:21:09 2025 +0000

    tempsave, update the blocksync functions

commit 7b8052d7ca
Author: aska-0096 <haocwang@amd.com>
Date:   Sun Aug 10 06:00:51 2025 +0000

    fix bug in pki4

commit 76cbbb84a2
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Aug 9 03:25:12 2025 +0000

    fix bugs in gemm

commit 8c101ccb88
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 18:35:53 2025 +0000

    fix bug on non-gfx950

commit efb8549279
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 17:53:19 2025 +0000

    fix bug

commit 729e8785fb
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 15:42:15 2025 +0000

    fix bugs

commit 250dc13c75
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:31:01 2025 +0000

    fix clangformat with 18.1.3

commit 106edeecd9
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:07:40 2025 +0000

    remove non-necessary change

commit 78edd7303b
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:04:02 2025 +0000

    bug fix, clang format;

commit 3b9fb6af38
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 08:08:03 2025 +0000

    Remove unnecessary changes

commit 6bb57c2c57
Merge: 1ecee378d ab2602683
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 07:50:12 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 1ecee378d5
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 06:19:31 2025 +0000

    remove unnecessary files; rename some files

commit b4640a9de6
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 05:46:18 2025 +0000

    merge fa_decode pipeline into fmha_fwd api

commit fe63a646a4
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 6 05:58:43 2025 +0000

    add __restrict__ to tr load

commit 414cad667b
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 5 07:23:51 2025 +0000

    Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

commit 0d12fc944f
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:27:42 2025 +0000

    Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

commit 4f31847de1
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:02:17 2025 +0000

    add vmcnt guard before load ktile

commit 746f4ccb99
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 06:49:01 2025 +0000

    Load Q through lds, implement xor;

commit 2d4e73d2b4
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 1 10:44:54 2025 +0000

    small refactor

commit a28b6e67fe
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 10:25:37 2025 +0000

    upgrade prefill pipeline; simple iglp; consistent data produce and consume order

commit 75cba48682
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 05:13:27 2025 +0000

    enable larger tile size; upgrade xor pattern

commit 69890afc98
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 12:25:33 2025 +0000

    remove all lds bankconflict with xor layouts

commit 8dacc35c4c
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 03:51:06 2025 +0000

    enable prefill overload operator().

commit 13bcc913de
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 25 07:10:01 2025 +0000

    fix the lds alignment caused performance regression

commit af28123cec
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 23 09:05:57 2025 +0000

    remove unnecessary features

commit 14e0ab70c6
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 22 08:04:05 2025 +0000

    tempsave. asynccopy+trload sanity checked

commit 1b468bac0b
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Jul 21 05:55:55 2025 +0000

    tempsave, trload+asyncload done

commit afd96d8180
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 10:04:34 2025 +0000

    compile pass

commit 5616551115
Merge: ae39c84f5 095393276
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:17:27 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit ae39c84f55
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:16:39 2025 +0000

    tempsave

commit 94b6430489
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 10:06:09 2025 +0000

    temp save

commit 7e330553dc
Merge: 18669925c 804f77dce
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 07:24:32 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline

commit 804f77dce5
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 03:10:46 2025 +0000

    move test_copy into test

commit 21627d7ca7
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:41:31 2025 +0000

    remove unnecessary output

commit 287792c44a
Merge: a4221db30 21fd7e953
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:13 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit a4221db304
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:10 2025 +0000

    add input validation and bug fix

commit 21fd7e9538
Merge: d6df7bf85 6e76b8205
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Date:   Wed Jul 16 11:23:57 2025 -0700

    Merge branch 'develop' into test_copy_fix

commit d6df7bf851
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:55:50 2025 +0000

    fix vmcnt shift

commit 40e039e4e4
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:37:07 2025 +0000

    Improve s_waitcnt_imm calculation

commit c30f8b709b
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 05:39:50 2025 +0000

    fix the s_waitcnt_imm calculation

commit ec0a45b29f
Merge: e5cc4af80 6b09f0823
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:57:57 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit e5cc4af808
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:54:33 2025 +0000

    Add block_sync_lds_direct_load utility

commit eea58629cf
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 15 09:39:03 2025 +0000

    fix async copytest bug

commit 18669925cc
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 10 04:29:33 2025 +0000

    temp save, change all instance to 1wave

commit 18686cfe5b
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 8 08:37:20 2025 +0000

    tempsave, fmha_decode

commit 47565f21a5
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Jun 21 15:02:57 2025 +0000

    temp save, waiting for debug

commit e0a634ef97
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jun 19 05:11:52 2025 +0000

    save an example for __bf16 type

commit 4bd5fd4a3c
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 07:27:24 2025 +0000

    fix bwd code

commit 69809d9513
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 06:37:16 2025 +0000

    Fix for fwd/bwd kernel build filter

commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 08:13:18 2025 +0000

    Add restrict to avoid unnecessary vmcnt

---------

Co-authored-by: aska-0096 <haocwang@amd.com>

* Add comments for c-stype cast

* Better comments

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
2025-08-25 20:55:12 +08:00
John Shumway
c71d7ddd74 Remove unsupported use of c++20 concept. (#2719)
Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.
2025-08-24 21:29:23 -07:00
Yi DING
4cfa2c7158 [CK_TILE] FMHA BWD Fix Compilation with Bias (#2682)
* [CK_TILE] FMHA BWD Fix Compilation with Bias

* Fix appendkv kApplyRoPE
2025-08-22 10:01:10 +08:00
linqunAMD
9fcc1ee9fd Support Wave32 in CK_TILE - Part 1 (#2594)
* Support wave32/wave64 in CK_TILE - Part 1

* remove blocksize in kernel launch

* fix build error

* fix clang format

* fix clang format 2

* fix clang format 3

* fix fmha build error

* fix fmha build 2

* fix fmha build 3

* fix build error 4

* address review comment

* update change log

* replace KernelBlockSize with kBlockSize

* fix CI fail

* fix clang format

* address review comment and rebase code.

* fix universal test fail

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-18 10:08:31 -07:00