Commit Graph

73 Commits

Author SHA1 Message Date
assistant-librarian[bot]
45c177c384 [CK_TILE] Extend support of mix precision microscaling BQuant (#4267)
## Proposed changes

Supported type combinations using BQuant=e8m0:
 - A=bf16
 - B=bf16,bf8,fp4

Summary:
- remove usage of `pk_fp4_raw_t`: this is consistent with other implementations
and avoids accounting for the packed size explicitly. In general, the raw
type should not be used because CK Tile internally takes care of the
PackedSize, so using the raw type adds unnecessary complexity to the
implementation
- handle microscaling by checking for the `e8m0` type for BQuant (the
previous implementation was inconsistent); see the sketch below
 - add support for scaling instructions in `DequantPack8`
 - mx pipeline:
   - extend existing pipeline to support different B types
- add support to scale and cast before writing to LDS or after reading
from LDS (this can be defined in the `Problem` by the user)
 - block gemm:
   - mx pipeline is now using block gemm BQuant
- block gemm BQuant can now load from LDS, apply the scale, and then call
the block gemm universal operator. This adds new functionality and removes
code duplication
 - warp gemm:
- add a case to support 128-bit ds_read/write for both A and B when A=16-bit
and B=8-bit
- add examples and tests: note that some tests for bf16/fp4 already
existed but were removed during a previous test refactoring. I added them
back, along with other relevant tests for the new type combinations
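
As a rough illustration of the microscaling dispatch idea above (a minimal
sketch, not the actual CK Tile code; `BQDataType` and the `e8m0_t` stand-in
are illustrative):

```cpp
#include <type_traits>

// Stand-in for the 8-bit exponent-only (e8m0) microscaling scale type.
struct e8m0_t { unsigned char data; };

// A BQuant pipeline is treated as microscaling (MX) exactly when the
// B scale type is e8m0 -- instead of special-casing raw packed types
// such as pk_fp4_raw_t.
template <typename BQDataType>
inline constexpr bool is_mx_bquant_v = std::is_same_v<BQDataType, e8m0_t>;

static_assert(is_mx_bquant_v<e8m0_t>);
static_assert(!is_mx_bquant_v<float>);
```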

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which helps the maintainers
understand the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered



---
🔁 Imported from
[ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689)
🧑‍💻 Originally authored by @EnricoDeg

---------

Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-24 09:55:50 -08:00
ZheWang
c006b10452 Mx fp6 flatmm (#3601)
* add fp6 data-type and support sync/async dwordx3 load/store

* clang-format

* pre-commit

* 1st commit

* default mnk pass ut

* fix a distribution

* fix

* fix B DRAM distribution

* update

* pass ut

* improve perf

* update

* clean code

* resolve copilot comment

* resolve comment

* clang-format

---------

Co-authored-by: ZheWang <zhewan@amd.com>

[ROCm/composable_kernel commit: e6bcd192d4]
2026-02-02 16:04:40 +08:00
jiangyon.ren
ce51308aaf [CK_TILE][FMHA] Add sparse attention VSA (#3341)
* add sparse attention VSA

* fix the pre-commit

* Add jenga test and pre-commit

* add bf16 for vsa

* add jenga support bf16

* remove lse arg

* split kernel code to block & kernel

* fix the pre-commit

* fix the pre-commit

* fix the copyrights

* fix the copyright

* fix the copyright & rename block to pipeline

* fix the copyright and pipeline

* remove lse & dropout & add fmt

* fix the jenga&VSA code review

* remove the useless code & resolved the comments

* remove useless code

* remove useless code

* Clean up code

* Remove more unused code

* Re-format .hpp

* Refactor codegen scripts

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 4d2f8c111e]
2026-01-31 00:59:47 +08:00
Erwin Terpstra
09d443a7ad [CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603)
* chore: split block scale example instances in more separate files to speed up compile times

* wip: fp4 scaffolding for abquant

* feat: add fp4 decoding-while-loading to abquant pipeline

* feat: add support for fp4 CPU verification in abquant

* chore: add time tracking to reference calculation

* feat: add a4w4 test for blockscale gemm

* feat: optimize reference calculation by preconverting values to AccType

* feat: add fp4 to fp8 look-up table

* fix: reference to wrong ComputeDataType field in QuantProblem

* feat: type utilities for determining MFMA compute types

* feat: packed fp4 for abquant weight preshuffle

* feat: add separate tests for a4w4 base case, padding and preshuffleB

* fix: fp4 conversion on gfx950 attempting to use non-supported method

* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size

* chore: add fp4 preshuffleb mode to block scale example

* chore: sanity check for packed types being 1 byte

* chore: clarify tensor dimension indices with constants

* chore: replace traits check with specialized check for packed types

* style: some minor refactoring and cleanup

* fix: correct conversion table for FNUZ fp8

* chore: add fp4 instances to main abquant instances again

* chore: use same initialization branch for int4 and fp4

* chore: add missing initialization for fp4 in block scale gemm example

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 6a6177a246]
2026-01-30 04:40:50 -07:00
ltqin
90b3476006 Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit 723b7ce0be2884da131036301892bf9157f51876.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 67f0b74ec6]
2026-01-23 09:03:22 -08:00
Po Yen Chen
4ded7e5984 Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit ceccf15275645cc64db0a4ae53f5a215c93a7969.

[ROCm/composable_kernel commit: de5a1d730d]
2026-01-22 21:21:19 -08:00
ltqin
14254656f0 Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel

* add block scale to kernel

* add smoke test

* format

* Revert "format"

This reverts commit 356c3c9706.

* only format my code

* format py

* fix auto not allowed in function prototype

* change instance tttt to ttff

* fix structured binding issue

* change s_acc elementwise op

* async pipeline add block scale

* add quantization of P using shift exp2 (see the sketch after this list)

* precompute (m - shift) once per row

* change blk scale seqstrt ptr name

* fix some name

* fix for deduction guide

* fix some comments

* add P scale to qr_ksvs_pipeline

* add comment to idx_identity

* change the method of calculating descale block index

* unify naming style: use block_scale_ as name prefix

* unify naming style

* update the CHANGELOG.md

* Add FP8 block scale quantization support for FMHA forward kernel
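
A minimal sketch of the exp2-shift quantization idea from the commits above
(hypothetical helper; the real pipeline operates on register tiles, not
plain arrays):

```cpp
#include <cmath>

// Safe-softmax with a power-of-two P scale folded into the exponent:
// instead of p = exp2(x - m) followed by a multiply with 2^shift,
// compute exp2(x - (m - shift)). (m - shift) is precomputed once per
// row, so the inner loop stays a single exp2 per element.
void quantize_p_row(const float* x, float* p, int n, float m, float shift)
{
    const float m_minus_shift = m - shift; // once per row
    for(int i = 0; i < n; ++i)
        p[i] = std::exp2(x[i] - m_minus_shift); // == exp2(x - m) * 2^shift
}
```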

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: dd0b4294af]
2026-01-21 20:58:26 -08:00
damien-lejeune
693548d8b2 Dlejeune/ck tile 2d multiple reductions (#3147)
* WIP

* Add Unit tests for the Multi Reduction Kernel

* clang format

* Rename multiblock to threadwise

* Multiblock WIP

* Fix multi reduce multi block unit tests

* Multi Reduce Tile Engine: WIP

* refactoring + try addressing precision error

* Fix multiops examples

* Cleanup

* Clean up tile engine's reduce op

* Update changelog

* Fix remod/clang

* Fix dates

* Fix documentation & missing file

* Fix comments

* Use the update_tile api in the multi-block kernel

* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests

* Add TilePartitioner

* Cleanup

* Add warning when no data to process, in the example

* Refactoring Reduce kernel Tile Partitioner + cleanup

* Move the tile partitioner to its own file

* Add missing includes

* Fix copyright header with update_amd_copyright_headers.py

* Fix change of interface in Reduce2dProblem

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 4216d43da8]
2026-01-09 11:16:37 +01:00
John Afaganis
3830186287 Update unsigned long literals and format specifiers to work correctly in Windows (#3483)
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior in code that assumed unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
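
For example (a small self-contained sketch of the portability issue, not
code from this change):

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main()
{
    // 1UL is only 32 bits wide on LLP64 platforms such as Windows, so a
    // shift by 40 would overflow there; UINT64_C guarantees 64 bits.
    uint64_t mask = UINT64_C(1) << 40;

    // "%lu" likewise assumes long is 64-bit; PRIu64 expands to the
    // correct specifier on both LP64 (Linux) and LLP64 (Windows).
    std::printf("mask = %" PRIu64 "\n", mask);
    return 0;
}
```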



[ROCm/composable_kernel commit: ec23be0b9d]
2026-01-02 22:16:41 -07:00
yadaish
a57f8d8b67 [CK_TILE] support split-k a16w4 gemm1 (#3389)
* initial version to support moe gemm1 split-k

* add missing args

* fix build warning

* update reference

* for split-k disable bias and weight

* remove debug log

* fix format

* fix div by zero errors

* fix cmake config

* update

* resolve conflicts

* remove useless changes

* reformat

* fix

* remove useless changes

* fix ci

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>

[ROCm/composable_kernel commit: dae85ead64]
2025-12-29 23:05:35 +08:00
kensclin
0eb5d4a93f Enable padding blockscale for abquant (#3453)
* Enable padding blockscale for abquant

* run clang-format

* Reduce unnecessary testing

* remove cout

[ROCm/composable_kernel commit: 7f68f3c4fa]
2025-12-24 09:12:40 -08:00
kensclin
9b63a65886 Support A/B Quantization in Blockscale GEMM (#3343)
* Support A/B Quantization in Blockscale GEMM

* Support A/B Quantization in Blockscale GEMM

* Support A/B Quantization in Blockscale GEMM

* Support A/B Quantization in Blockscale GEMM

* Support A/B Quantization in Blockscale GEMM

* Implement review suggested changes

* Implement review suggested changes

* Sync with develop

* fix pre-commit error

* Add unit tests for blockscale AB-Quantization

* fix pre-commit error

* fix pre-commit error

* fix compile error

* fix compile error

* fix clang-format

* fix clang-format

* fix enumeration values not handled in switch

* rebase file

* Add missing enums to data_type_sizeof (#3430)

Fixes broken build on gfx942. This was some test code that got merged at the same time.

* [CK_BUILDER] CK Tile header installation for builder, algorithm concept improvements (#3419)

* Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen
* Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts

---------

Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com>

* Add build trace diagnostics to CI. (#3432)

* generate and visualize build traces for all archs

* generate build traces in all cases

* fix jenkins logic

* fix typo

* use more threads for parsing dependency map

* add script to parse ninja traces and issue warnings

* fix python script syntax and header

* fix python syntax one more time

* fix python syntax

* Support A/B Quantization in Blockscale GEMM

* Implement review suggested changes

* Sync with develop

* Add unit tests for blockscale AB-Quantization

* fix enumeration values not handled in switch

* rebase file

* rebase file

---------

Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: DarylHawkinsAMD <Daryl.Hawkins@amd.com>
Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 0500fcc017]
2025-12-17 07:13:47 -08:00
Linjun-AMD
51886bf22b Add attention sink support for FMHA FWD (#3368)
* Revert "Revert "Add attn sink (#2892)" (#3250)"

This reverts commit e3be392d13e6ee107d823af32aca2d3ff03ca69d.

* fix conflict

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Add F_sink parameter to FmhaFwdPipeline

* Update tile_fmha_traits.hpp

* Refactor pipeline creation in fmha_fwd.py

Updated the pipeline creation logic to include the 'sink' parameter in product combinations and adjusted the FmhaFwdPipeline calls accordingly.

* Update fmha_fwd.py

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/script/correct_test_fwd_sink.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update CHANGELOG.md

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update CHANGELOG with new features and support

* Update fmha_fwd.hpp

* Update CHANGELOG.md

* Update smoke_test_fwd_sink.sh

* Update correct_test_fwd_sink.sh

* Update smoke_test_fwd_sink.sh

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: f5573f56d9]
2025-12-15 12:21:59 +08:00
Enrico Degregori
5c81464568 CK Tile: Enable padding blockscale example (#3417)
* Fix host code padding

* restructure the ref code

* clean up

* Fix compilation error

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 21f06aa47d]
2025-12-14 10:25:47 -08:00
eliotwang
d5645ff481 Bf16*fp4 gemm (#2801)
* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* rebase to new develop

* fix clang format

* update code according to reviewer's comment

* Update README.md

* update code according to reviewer's comment

* update code according to reviewer's comment

* Update CMakeLists.txt

* Update README.md

* Update CMakeLists.txt

* Delete files

* Delete files

* Add unit tests

* Update test_gemm_quant_base.hpp

* merge bf16*fp4 example to develop branch

* fix clang format

* fix clang format

* Update CMakeLists.txt

* fix ci test

* fix clang format

* resolve conflicts

---------

Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 715671e419]
2025-12-11 07:20:29 -08:00
msaffari-amd
e33a88a624 [CK Tile] batched contraction kernel generalizing (#3126)
* Add help for example

* Refactor the batched contraction compute reference to manage stride-aware calculation, plus some code cleanup

* Add stride-aware reference for batched contraction with independent D tensor layouts

* Add -num_d argument for runtime D tensor count selection in batched contraction

* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs

* Add descriptor-based architecture for batched contraction multi-dimensional stride support

* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0

* Add complete multi-dimensional stride support via descriptors

* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm

* Clean up batched contraction: remove old UniversalGemmKernel path

* Clean up batched contraction: remove legacy paths and finalize docs

* Optimize batched contraction example: pass dimension sizes not vectors

* correct the reference calculation, unsigned int to int

* Fix batched_contraction C++17 build errors for gfx90a CI

[ROCm/composable_kernel commit: 2d3020e5b0]
2025-12-02 13:30:27 +01:00
Aviral Goel
216c23b945 chore(copyright): update copyright header for include directory (#3293)
[ROCm/composable_kernel commit: de6466481f]
2025-11-26 11:00:05 -07:00
asleepzzz
06d2e609cd Revert "Add attn sink (#2892)" (#3250)
This reverts commit bbe1d3a917ee92655224c0f1528ace3a7b0e82a8.

[ROCm/composable_kernel commit: 5adaa201ed]
2025-11-20 07:55:15 -08:00
Linjun-AMD
f4ba63deb7 Add attn sink (#2892)
* enable attn sink

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update attn_sink script

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix some error

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* clang-format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd_kernel's mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix ci error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix format error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_bwd_pipeline_default_policy.hpp

* Update fmha_fwd_runner.hpp

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* update splitkv_pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update splitkv&pagedkv pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* add sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update attn_sink result log

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update smoke_test_fwd_sink.sh

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test file

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test script

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

* use constexpr kHasSink for sink in fmha pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update by pre-commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove causal mask setting logic from mask.hpp

Removed the mask setting logic for causal masks.

* fix CI error: some lambda usage is not supported in C++17

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update remod.py

* add smoke sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update fmha_pagedkv_prefill.py

* Update FmhaFwdPipeline parameters in fmha_fwd.py

* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix C++17 unsupported-feature error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

* Fix formatting of sink_seq_end assignment

* Fix indentation for sink_seq_end assignment

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 9fa4e8d5ab]
2025-11-20 19:24:05 +08:00
Yi DING
0d9f230577 [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~

[ROCm/composable_kernel commit: 47e2ed838e]
2025-11-20 10:35:15 +08:00
Sami Remes
9f069d6e35 [CK_TILE] B matrix 2D block scale gemm (#3074)
* Refactor quant group size to be configurable for M/N/K, not just K

* add some asserts for configurations not implemented

* start setting of group size for N dimension

* enable 2d for reference quant gemm

* WIP: trying to figure out tile dstr and/or indexing for scale matrix

* WIP

* Fix handling of n dim blocks in tile windows etc

* remove commented code and enable all tests again

* fix formatting

* Add more specialized tile distributions

* Enable NWarps replication for bquant tile dstr

* fix formatting

* fix format

* Fix some issues from the merge

* fix formatting

* one more fix to tile dstr, and revert debug initialization

* Remove commented code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* simplify conditions that are needed for tile distributions

* only enable the working group sizes in tests

* fix formatting

* Update tile distribution for 2D bquant

* add some documentation and 2d block scale example

* fix formatting

* Add to Changelog and restructure the quant 2d example

* fix CMake

* support the change for blockscale 2d

* fix the test file

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 16e85cf179]
2025-11-02 16:49:20 -08:00
Yi DING
c28e65c6bf [CK_TILE] Add mxfp4 flatmm (#3080)
* Squashed commit of the following:

commit 3e1a851dad834776efbe4fe365ac82c4ed312010
Author: Ding, Yi <yi.ding@amd.com>
Date:   Thu Oct 23 06:10:54 2025 +0000

    Fix & clean after rebase

commit 1edf485092f44411da9a1796a4a6b72d5cdb67c6
Author: Ding, Yi <yi.ding@amd.com>
Date:   Wed Oct 22 10:46:13 2025 +0000

    Squashed commit of the following:

    commit 5276b28a51dac7b5d2106fbae8e78de190ee0de1
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 02:04:27 2025 -0500

        fix bandwidth calculation

    commit d645bb20c6d879154c30ecd82bbff4d2a9206750
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 00:58:59 2025 -0500

        updates

    commit 0fa7e6b88aaf81a36034aa7607746de295de4263
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Sep 19 00:39:46 2025 -0500

        fix a bug, set the A DS_read preload size to 4 for MXFP4

    commit 50cafa824e2267f2b2f0dfeeb93e69a673630c61
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Thu Sep 18 01:19:03 2025 -0500

        fix a_wrap preload issue for large MPerBlock.

    commit e6333bbbc6ef540e24f92095040085f1ed59041e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 21:34:03 2025 -0500

        optimized the VGPR repack issue for MXFP4

    commit e99e4932c401b9f6d1893dd5044c2827d6b3f145
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Wed Sep 17 04:19:44 2025 -0500

        fix time error

    commit 4586ce6da7fba0514f2e01a8124c76b7d494e124
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 03:58:00 2025 -0500

        updated, function passed.

    commit c4f25e7579573db5681b9160f6bdb1349f3566f1
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 22:21:39 2025 -0500

        fix, function partially passed

    commit a51b56eb6b00b99a4e8d2802dbf5b5b5277b54d8
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 03:01:12 2025 -0500

        fix, reference function passed, next check kernel function

    commit 5b02643ebab18960e8f9ba66c6bd2f91774f9cae
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Sep 16 02:29:01 2025 -0500

        let pack/unpack return pk_fp4_t

    commit 76d37c5d4b17530e95c6fced31bff66a35d54b8f
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 20:50:26 2025 -0500

        fix

    commit e5be3e162b9a20e5355bd556d2b27afb6d8bf085
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 15 05:51:06 2025 -0500

        fix bug

    commit 39a024efe4aa773df589712b1290803bb5ab5d1d
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 04:02:05 2025 -0500

        fix core dump issue, function is not correct.

    commit 16c49d268cfe065b5112b960b2d852b26552686a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 03:03:02 2025 -0500

        updates, build pass

    commit fe7a961852dee6eff3be3cf1e0d0fabec5cd42ee
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 00:05:18 2025 -0500

        updates

    commit aaf9fe8022a72df59e04e4d5886dca3ba9c23400
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Sun Sep 14 23:40:28 2025 -0500

        fix bug

    commit a3da89290e1553b85fbf1171c07e93ac0f5584db
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 03:28:50 2025 -0500

        fix interface

    commit c5ff747e72d877461ba61dc19a0fe15527d3161e
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 02:53:50 2025 -0500

        add interface in warp_gemm_impl

    commit 0a48d369e601cc798589fc59e0784bdbfc0a22f9
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 10 05:03:08 2025 -0500

        updates some fixes.

    commit aaa2beca30ff5546d171a2028d1894fd4e131d4e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 9 04:37:42 2025 -0500

        fix after merge ginolu/add_wgmfma_dispatcher

    commit bf87449b09cba690922b2f3f78ba39bf1b1e472e
    Merge: 05ab58e3d 991d7fdbb
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 22:09:15 2025 -0500

        Merge remote-tracking branch 'origin/ginolu/add_wgmfma_dispatcher' into mtgu/cktile_mxfp4_flatmm_dev

    commit 05ab58e3de2b708aceda63d704089c0fa89437ae
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 21:42:47 2025 -0500

        update mx flatmm tail pipeline

    commit 991d7fdbb726d65091a91b5cc2800f798a6661fc
    Merge: ad046084a 41ee8fe31
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:10:23 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit ad046084a2f6e4ebf0cd8b47d0d72b74815061fa
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:09:55 2025 -0500

        fix type error

    commit 42e16b43a035364a42789d7ce45a1e6a7d1d2609
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 04:01:40 2025 -0500

        update hotloop pipeline

    commit c2f69745346545087c8ce24acaba2961bb93ef0b
    Merge: adbeeb90b 91db4cec3
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:22:26 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit adbeeb90be1533f8aeb8c1d5aea6470d45a455a0
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:21:26 2025 -0500

        fix clang format

    commit e2378ac393bb79ac80a8eef84677bffce86d9e0a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 3 10:00:54 2025 -0500

        some updates

    commit bdc18a2269db49ff88e1ef1db30f83ea430d7544
    Merge: 6c5cea2b7 b3886a6d8
    Author: asleepzzz <hanwen.chang@amd.com>
    Date:   Wed Sep 3 13:22:03 2025 +0800

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit 6c5cea2b7a306f5d0ad346cb9baf6370ea2a73fe
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 02:11:02 2025 -0500

        fix vec size error

    commit 76d1dfa352087dfd5867c8909b73726d3a1e853e
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 01:23:39 2025 -0500

        fix format error

    commit a9061aaa1b4bfaa9db102c75b9d74863f39708a9
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Sat Aug 30 03:19:07 2025 -0500

        update codes

    commit 0caa184a271a8824ef40f87de456d0fa2500c8ad
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Aug 29 11:27:33 2025 -0500

        init ck_tile mxfp4 flatmm

    commit 5d46a6635f04bd69b76f7eda1438862e271b987a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 28 08:02:50 2025 +0000

        Add bias for f16xf4 moe_flatmm

    commit dd112dc302d17f541737671a3ac557d7c09ff969
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 27 13:39:47 2025 +0000

        update case construction

    commit b1aca68a073d82c7b3c7bb53286e5f415999edc1
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 26 12:32:29 2025 +0000

        support swiglu activation and use rcpf to accelerate silu

    commit 49235bd42349a84fc2ebd7ad0b100cc2545bb80a
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Aug 26 02:33:55 2025 -0500

        first commit

    commit c169e39d6381b932cf7098cc118db29df91da1cb
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 04:01:59 2025 -0500

        add line to last

    commit 318f9bf317306454941bbf394c1940023edcf0ac
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 03:20:46 2025 -0500

        adjust A_LDS descriptor to avoid bank conflicts

    commit 9d066120ed068d6d102da25d619e170a28a04d18
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Thu Aug 21 09:46:52 2025 -0500

        enable hotloop

    commit 61a895e6b821798970afffd0e9432a21e2f04df8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 09:12:21 2025 +0000

        support atomic_pk_add_bf16 on gfx950

    commit 9f14864e45f21d8c1bc70a94988fb86c2c0017d8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 06:58:55 2025 +0000

        use int64_t as expert stride to avoid overflow

    commit e63af46b32e1139a1e59dee6f46b9971047c4026
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 13:53:32 2025 +0000

        use v4i32 as the storage type for B to avoid repack operation

    commit 6cf0224dd8a229bf2be726ca861c736c9b5f5415
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 06:40:03 2025 +0000

        add pk_fp4_t and e8m0_t support for amd_buffer_load_impl

    commit 67a591f2240b0b035029edad904627f98b3839fd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 04:39:14 2025 +0000

        optimize cvt_pkf4_to_f16 implementation

    commit 51c7126e77e9b17af694eaa57040e487f9d443e8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 19 14:56:46 2025 +0000

        optimize A_LDS descriptor to avoid bank conflicts

    commit c113160f326353290a2878d7b8febf7daed91d71
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 18:43:37 2025 +0000

        fix gate-up when GU_NRepeat > 1

    commit a45ca0e9934ca4bb9114f65621d5c9582d937a45
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 17:28:11 2025 +0000

        add fp16xf4 moe

    commit dc8c8e484804f7bca10c8f0764540af3b5884e83
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Aug 17 17:51:18 2025 +0000

        rename example

    commit b177c967141cfdc401d3f36bf17830fe99893600
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 15 06:20:46 2025 +0000

        remove additional check when e8m0->float

    commit d467f9688c3d35f391e15089135edb1ad1d38b05
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 14 09:34:12 2025 +0000

        eliminate repeat dequant

    commit 1b20674b26ab3ce6bd2f710dd729fd4cc0f79428
    Merge: faa3c0278 7d02625e7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:51:49 2025 +0000

        Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm

    commit faa3c0278cf11b7105a4302dea3a4416520b2cc7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:16:48 2025 +0000

        update f16xMXF4

    commit a2a2e1dab05501cc2136133236c01c08d51db4ea
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 10:48:53 2025 +0000

        update scale-preshuffle for MXF4

    commit eac9667feb899419dda1628164c092b969852660
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 11:24:34 2025 +0000

        update

    commit 7d02625e7678882af653f52c2a4ddaf64568a41c
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 08:38:23 2025 +0000

        optimize gemm2 atomic_add pattern

    commit d5f3c3e3ec72d0e6739467c4dc0b4e209f6d1192
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:59:47 2025 +0000

        update scale for mxfp4

    commit 15db198084614466bd4cfd4943fcb549cab2069a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:56:14 2025 +0000

        update case construction

    commit 5dff349d82a5f70b6eea821d2622df51f90ef200
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 06:03:06 2025 +0000

        update granularity control

    commit d32cdc52144f65ec473f4ec8e45ea23968811184
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 03:42:46 2025 +0000

        fix TileConfig

    commit 26f38c5716304ee5f84e5c4f6f88144d9f3dddaf
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Thu Aug 7 21:37:28 2025 +0800

        Add e8m0 scaled convert into CK_TILE (#2617)

        * first commit

        * remove redundant code

        * modify according to comments.

        * fix type_convert error with scaled_type_convert

    commit 419041478745f65dfec18859e75a13d975089519
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 8 20:19:16 2025 +0000

        add mixed_prec fp16xfp4

    commit 92e2a8b0308b9b107df9d2fd63a961efce706402
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 7 09:22:04 2025 +0000

        debug mixed_prec flatmm

    commit dea3ce80496ebcb00512979f0c3bb897f25e11a5
    Merge: fde443bc3 b4f45fe14
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Aug 6 16:49:47 2025 +0800

        Merge pull request #2626 from ROCm/felix/flatmm_fix_splitk

        fix split k

    commit d480e8150358cc4ef8b05e25afe299141fad4fde
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 6 08:33:33 2025 +0000

        add moe_flatmm

    commit b4f45fe14d11569f34de40c8a205cd6760b61357
    Author: coderfeli <coderfeli@163.com>
    Date:   Wed Aug 6 02:45:31 2025 +0000

        fix split k

    commit fde443bc38fe60e52195817ecb2c7b20d772eedb
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 4 07:16:36 2025 +0000

        fix flatmm with scaling when WarpTileM == 32

    commit 5a0667afa889a5af8c6b8509232eabd50cf5efef
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 11:01:23 2025 +0000

        optimize scaling epilogue

    commit 5c3502bbf71833c6f6f7d4a1cc4f4fd93811f522
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 07:28:38 2025 +0000

        fix wrong config for fp8 scaling

    commit eb2d0653cdb86603cb11539cbac466b6431b58b7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 06:20:30 2025 +0000

        prune debug message

    commit 0c089cb56343a39e02a1ee38e9cabeb71ba35e92
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 04:52:08 2025 +0000

        fix compile error

    commit 61759ca30ce3787f70e228c3919b3e4d354016dd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Jul 29 15:42:58 2025 +0000

        Add persistent option on flatmm for tuning

    commit b36dc5dd55f15fc1ce8eb21637bdec862e56a883
    Author: AMD-dteng <dteng@amd.com>
    Date:   Tue Jul 29 22:48:00 2025 +0800

        update pipeline v1: add atomic IGLP schedule

    commit f886f26994454fc2b4fc3433c86bf699767a2a7c
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 09:09:27 2025 +0000

        fix error log throwing

    commit 4b4686ab144daa9061fbda17f3df4c17600c8e9a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Jul 28 08:24:51 2025 +0000

        crz idea

    commit 7099af44a81be41431ba70ae60827b60116d02d2
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Jul 27 11:57:38 2025 +0000

        Add permuteN optimization when NRepeat % 2 == 0 on flatmm

    commit b147524c92e69a267337c8e48b6e64bcb1483551
    Author: sjfeng <j514681085@icloud.com>
    Date:   Sun Jul 27 17:24:08 2025 +0800

        try to remove c_shuffle_lds

    commit 2dd94f59d1a7740a5689e1713ed45588cd0d55dd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Jul 25 07:41:48 2025 +0000

        fix loop-dim mismatch and improve c_shuffle alu parallelism

    commit 4e93f0c5e27806adc070e4caa81661069295751c
    Merge: 3f12ef5aa 0eb7455f1
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 08:46:51 2025 +0000

        merge flatmm -scale

    commit 3f12ef5aa52ced1bff3bfb57b878358330e9e095
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 16:19:58 2025 +0800

        revert delete of inc file

    commit 08c3a0d184d7581dc5be364f5b36f16fb4a8d6fa
    Author: solin <bingzhou@amd.com>
    Date:   Thu Jul 24 04:38:16 2025 +0000

        reorg flatmm code

    commit 0eb7455f106604d5254ed16b0daeda68e2a148e3
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:12:31 2025 +0000

        fix flatmm syntax error on gfx950

    commit 695ff87e68fdcbe28452c1805cd4dbb643c45495
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:04:22 2025 +0000

        support flatmm scaling

    commit e3c29d9dea8758db96b998982ccc8bd1c4e8298d
    Author: valarLip <340077269@qq.com>
    Date:   Wed Jul 23 08:44:12 2025 +0000

        merge flatmm pipe v0 from dteng_flatmm_opt

    commit 425c366fa4c30426ff36cade89b39fd8cb7b9732
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:38:12 2025 +0800

        build pass

    commit 6b377a9481535696de40f175d7e2159263d21bdc
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 07:20:26 2025 +0000

         fix bug

    commit b6dc58d1ea676fe480c0243ae098c875498f6d6a
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:01:53 2025 +0800

        sync

    commit 904359f401866ee810484e6b8f5b46d79d9e25c8
    Author: valarLip <340077269@qq.com>
    Date:   Tue Jul 22 08:09:35 2025 +0000

        adaptive scheduler instead of Macro definition

    commit f29916c17228c17de9923aab62e7d72d7a30f4e9
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 17 08:40:35 2025 +0000

        fix tail handler bug

    commit e2c60a90929fec955d91db909d50db538d58363b
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 16 10:12:19 2025 +0000

        merge from dteng_flatmm_opt

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

* Fix crash on small M

* Apply suggestion from @Copilot

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: e135dd518d]
2025-10-31 11:29:05 +08:00
Yashvardhan Agarwal
201463f036 [CK_TILE] Add indexing to pooling operator (Lwpck 3892) (#3013)
* Add indexing support to pooling operator

- Add IndexDataType template parameter to pooling problem and kernel
definitions

- Enable pooling kernel to output indices of selected elements during
max/absmax pooling

- Add overloaded operators for Max and AbsMax that track when values
change using a bool `changed` parameter (see the sketch after this list)

- Support optional index buffer allocation and management in device
memory

- Modify BlockReduce2d classes to handle index tensors alongside value
tensors

- Add separate shared memory allocation for index data in cross-warp
reductions

- Create validate_pool_indices function to verify index correctness

- Modify pool3d.cpp example to demonstrate index output functionality

- Add tests for index output

* fixes

* Refactor BlockReduce2D functions to get rid of auxiliary private types.

* comment resolutions and some changes to block_reduce2d

- index reference implementation improved
- reduce_operator.hpp cleaned up
- updated the block_reduce2d.hpp to have index calculation for
BlockReduce2dLinearCrossWarpSync as well

* conditionally used variable declaration improvement

- the conditionally used variables are used only when indexing is
enabled. They are now declared with the smallest possible size and marked
as potentially unused, which may allow better optimization than the
previous declarations

* comment resolutions

* lexical ordering of the indices

- introduced accumulate methods that handle the intermediate steps
needed to order the indices

* add reduce_operator_accumulate.hpp to core.hpp
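
A minimal sketch of the changed-flag idea described above (illustrative
only; the actual CK Tile operator signatures may differ):

```cpp
// Max reduction that reports whether the accumulator took the new value,
// so the caller can update the tracked index in lockstep:
//   acc = max_op(acc, v, changed); if(changed) acc_idx = v_idx;
struct Max
{
    template <typename T>
    T operator()(const T& acc, const T& v, bool& changed) const
    {
        changed = v > acc;
        return changed ? v : acc;
    }
};
```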

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[ROCm/composable_kernel commit: 3052d7c9e6]
2025-10-29 09:58:04 +02:00
Johannes Graner
0690ed26ba [CK_TILE] Add conv fwd + bias + clamp example (#3012)
* Implement argument passing to element-wise functions for fwd convolution

* Add files for fwd + bias + clamp example

* Implement Bias

* Implement Clamp

* Elementwise function composition

* Composition unit test

* Implement fwd + bias + clamp example

* Simplify argument passing and composition

* elfunc -> bias_and_clamp

* Rename function to specify example

* Move element-wise function instantiation to kernel

* Make bias a runtime tensor

* No ugly namespace aliasing

* Initialize element-wise function on host

* Remove function initialization helper, simplify Compose initialization

* Remove unintended LSP compatibility patch

* Clean up includes and unused code

* Switch names in cshuffle epilogue

* Move CDElementwise to conv traits

* Re-add required include

* Initialize bias in same way as other tensors

* Better type specification for ds pointer

* Disable 1D convolution

* Add warning for non-group-constant bias

[ROCm/composable_kernel commit: 5c1974065e]
2025-10-27 18:43:09 +01:00
lalala-sh
0329d71fb9 [CK_TILE] Update flatmm related kernels (#3022)
---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: 211d64e18a]
2025-10-22 22:36:11 +08:00
Johannes Graner
580a54b400 Update pre-commit to fixed versions, run remod for ck_tile (#2895)
* Fix ruff linter errors

* Fix remod dos2unix command

* Clang format

* Ignore utility in remod

* Run remod

* Specify clang-format version in pre-commit

* Specify ruff version

* Include PoolKernelArgs in reference_pool

* Add calculate_total_elements to reference batched contraction

* Fix calculate_total_elements declaration

* Refactor remod pre-commit hook

* Fix Aquant tests

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: d40b50b9d5]
2025-10-16 15:29:17 -07:00
ClementLinCF
7907a466de [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540)
* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm

* Update rmsnorm host reference

* Update tree reduction of rmsnorm for reference host

* Fix cross warp for m > 1 cases

* Add RMSNorm model selectable option for host reference

* Fix save_unquant cases

* Update reference rmsnorm forward function to use enum for model sensitivity

* Update reference rmsnorm calculation for model sensitivity

* Fix m warp for layernorm

* Adjust parameter of reference for twoPass

* Fix clang format

* Run clang-format-overwrite.sh to fix formatting issue

* fix clang format

---------

Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: e1b0bdfbfa]
2025-10-13 11:52:37 -07:00
msaffari-amd
bcc9d9e514 [CK Tile] contraction multi d - kernel & example (#2901)
* Initial commit. create batched_contraction_kernel file

* initial problem definition

* implement initial example to launch kernel

* add universal gemm to contraction. initial phase

* complete implementation for special case all Dims are 1 and no Ds

* clean code

* initial changes to support multi dimensional G

* more progress in implementing multiple G

* tmp commit

* manage dynamic NumDimG in kernel

* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit

* implement the example for general Multi dimension G M N K and test different reference calculation algorithms

* 2 functions for reference using multi dimensional and flat indexing

* clean the code for multi-dimensional G, M, N, K contraction and add some logs

* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E

* some cleaning on kernel

* clean the code for calculating the offsets from the flattened batch number

* Start adding MultiD support to kernel and example

* more changes to manage multi D in kernel and example

* manage passing multi d to kernel and testing.

* complete multi D support in kernel. modify example code to support it

* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning

* Minor fix

* Generalize example code for variable NumD tensors and apply cleanup based on review feedback

* Refactored code and addressed review feedback

* refactoring, cleaning, add documents, in kernel side and example codes

* Optimize batch offset calculation in kernel

* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: e9f0cc83a8]
2025-10-13 12:30:28 +02:00
Yashvardhan Agarwal
8ae2be1027 [CK_TILE] Pooling FWD (Lwpck 3683) (#2956)
* Pooling 2D/3D with reference

* Tests & cleanup

- added test for pooling
- cleanup
- removed 2d example

* Comment resolution

- README added
- example target name rectified
- appropriate arg description and comments added

* clang-format

* appropriate blocksize calc

* modifications for future indexing addition

- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future

* some basic fixes

* comment resolutions

* comment resolutions

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 7b6451b68e]
2025-10-09 16:13:26 +02:00
Anton Gorenko
bc9362af55 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to produce more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
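
A minimal sketch of the fix idea (illustrative HIP device code, not the
actual change): broadcast each 32-bit half separately and reassemble.

```cpp
#include <cstdint>

// __builtin_amdgcn_readfirstlane only handles 32-bit values, so a 64-bit
// seed/offset must be split; passing the uint64_t directly truncates it.
__device__ uint64_t readfirstlane_u64(uint64_t v)
{
    uint32_t lo = __builtin_amdgcn_readfirstlane(static_cast<uint32_t>(v));
    uint32_t hi = __builtin_amdgcn_readfirstlane(static_cast<uint32_t>(v >> 32));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```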

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because the smallest tile is now
selected automatically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd

[ROCm/composable_kernel commit: 1edd250115]
2025-09-27 18:03:48 +05:00
Sami Remes
b6681b5519 [CK_TILE] Tensor-wise scaled quant gemm kernel (#2846)
* rename gemm_group_quant to gemm_quant

* Add TensorWise quant mode

* Cshuffle epilogue tests with tensor scaling

* Add tensor quant to example

* Don't use readfirstlane for reading scales - doesn't work for some reason

* Add to changelog

* revert include - from a merge problem?

* revert common.hpp include

* revert host.hpp include

* remove unused utility function

* rename quant pipeline problem

* refactor quant tests

* remove aquant utils

* use TEST_F

* fix all tests by changing gemm config

* Use typed tests

* fix copyright

[ROCm/composable_kernel commit: 4363a82bd6]
2025-09-19 16:52:35 -07:00
Mateusz Ozga
2a150508c8 [CK_TILE] Multiple-ABD GEMM example (#2788)
* Multi ABD - initial commit

* Clang-format fix

* block gemm, unify the name of CDataType

* Apply changes to mem-pipeline

* Rollback prefix for DType and Layout

* Gemm Kernel Basic, rename

* WMMA config

* Grouped GEMM

* Clang-format

* Dropout, name

* Review v2

* Move element_wise fn to unary, remove the old fns

* clang-format

* Fix issue review

* WP operator adjust to universal gemm

* v2 prepare

* Remove unused comment

* Remove vectorsize

* Rollback

* Adjust pipeline for abd

* Shuffle argument

* CI-fail fix quant

* Fix ag_br pipeline

* Failing tests

* Typo

* Single argument support

[ROCm/composable_kernel commit: 30ab1d6a71]
2025-09-19 01:14:11 +02:00
Anton Gorenko
2ec8ebff31 [CK_TILE] Add gtests for FMHA (#2744)
* Improve random number generation

* use different seed for each input (Q, K, V...);
* use deterministic generation of:
  * seqstart_q/k (for group mode);
  * block_table (for paged-kvcache);
  * cache_batch_idx (for kvcache);

* Extract arg_parser-related code from run functions to use them as tests

* Split examples into main programs and fmha runners, build instances separately

* Add dummy tests that use instances and runners

* Fix a missed corner case of f32->f8 conversion

When a value is < the min f8 denormal but > the min f8 denormal / 2, it must be
rounded to the min f8 denormal (i.e. 0b1), not to 0.
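
A minimal sketch of just this boundary case (assuming e4m3, whose smallest
denormal is 2^-9; positive inputs only, full encoding elided):

```cpp
#include <cstdint>

uint8_t f32_to_f8_tiny_case(float v)
{
    const float min_denorm = 0x1p-9f; // e4m3 smallest subnormal
    if(v < min_denorm)
        // Round-to-nearest: anything above half the smallest denormal
        // becomes 0b1 rather than flushing to zero.
        return (v > min_denorm / 2) ? uint8_t{0b1} : uint8_t{0};
    return 0; // placeholder -- normal/denormal encoding elided
}
```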

* Fix incorrect fp8 scales for P and O in validation code

DataTypeConfig was incorrectly compared with fp8_t.

* Add host generation of dropout random values and use it for validation

Previously host validation (reference_batched_dropout) used random
numbers generated by BlockDropout of the kernel, meaning that incorrect
generation on device (bad distribution, repeated numbers, too many zeros,
etc.) would not trigger any validation errors.

* Implement tests from smoke_test_bwd.sh

* Return result as enum to distinguish failure and missing instance

* Add tests for bwd features: bias, alibi, dropout

* Implement tests from smoke_test_fwd.sh

* Pass seqlen_q/k as vectors to fwd and bwd runners

* Add tests for fwd features: bias, alibi, dropout

* Add tests for pagedkv and splitkv

* Fix conditions when to use splitkv and pagedkv kernels

splitkv was executed only when use_kvcache, which == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size).
In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested, even when num_splitkv > 1.
In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance.

* Add tests for appendkv

* Use is_v_rowmajor = true because there are no instances with column layout anymore

* Split public and private compile options for instances

Tests and examples need to know only about CK_TILE_FMHA_FWD_*_API.

* Improve parsing validation in bias and mask

* Pass bias as string for consistency with mask

* Catch parsing and other exceptions

* Add bwd test for deterministic flag

* Initialize fp8 tensors (-init=ufq) similarly to uf

* Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null

seqlen_k cannot be used to determine padding when seqlen_k_ptr is
provided. The actual seqlen_k is taken from seqlen_k_ptr[b].
Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr
may contain arbitrary values.
In the example or tests this produces incorrect results with appendkv
(for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8).

* Fix use_pagedkv value when kvcache = true but page_block_size = 0

In this case block_table_ptr is nullptr which is accessed in the kernel.

* Clean up bwd tests

* Unify fwd tests for f16/bf16 and fp8

* Use better explicit instantiation declaration for fmha_bwd<2>

* Use the same seed for all tests, allow to override it with env variable

* Undo clang-format of one irrelevant file

For some reason my local clang-format-18 and the one in CI work differently.

* Do not build instances and tests on unsupported archs

* Build instance libraries as OBJECT library

* CI: Enable sccache for HIP

There are source files with LANGUAGE HIP, they need
-DCMAKE_HIP_COMPILER_LAUNCHER=sccache

* Add tests to REGRESSION_TESTS

* Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0

The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are
smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf.

* Pass CK_TILE_FMHA_FWD_*_API as INTERFACE compile options

The instances don't actually depend on them, only examples and tests do.
Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS
without recompiling instances that are already in ccache.

* Fix formatting and names

[ROCm/composable_kernel commit: ec006bb8e0]
2025-09-10 08:06:14 +05:00
Cong Ma
b42e7a85b2 [CK TILE] Support fp8/fp16 with pk_int4_t as data types for tensors A and B (#2805)
- Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats
- Implement A(bf8) B(i4) support in universal GEMM
- Use new implementation for i4 to fp8 conversion in Block Scale
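
A minimal sketch of the packed-int4 layout these bullets rely on (generic
nibble unpacking, not CK's actual conversion routine):

```cpp
#include <cstdint>
#include <utility>

// pk_int4_t stores two signed 4-bit lanes in one byte; each lane is
// sign-extended to int8 before conversion to the compute type (fp16/fp8).
std::pair<int8_t, int8_t> unpack_pk_int4(uint8_t packed)
{
    // Shift left then arithmetic-shift right to sign-extend each nibble.
    int8_t lo = static_cast<int8_t>(packed << 4) >> 4;
    int8_t hi = static_cast<int8_t>(packed) >> 4;
    return {lo, hi};
}
```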

[ROCm/composable_kernel commit: 82890192dd]
2025-09-09 16:40:52 -07:00
Sami Remes
75bd80b1be [CK_TILE] Row/Col quant gemm (#2729)
* Add cshuffle epilogue test

* add the poc implementation to the epilogue and tests

* refactor cshuffle epilogue

* WIP: adding tensor/tile usage to scale_tile

* fix usage of tile_elementwise_inout

* add gemm_quant_kernel for generalizing gemm quant kernel

* Add problem types specific to different quants, add QuantType to Traits

* Add quant_type to quant_kernel template parameters

* Create aq/bq_block_windows and views depending on QuantType

* Use tile windows as inputs in cshuffle epilogue

* Fix some issues in epilogue

* initial new example code for new general gemm quant kernel test

* Fix issues in kernel

* Add verification check for rowcol Quantmode

* use AccDataType instead of AQ in pipeline

* fix aquant preshuffle

* fix formatting

* some cleanup

* remove gemm_aquant_basic.cpp

* remove gemm_aquant_kernel.hpp

* fix tests for the renamed quant kernel

* fix formatting

* clean example files

* fix some merge conflicts

* fix preshufflequant rename issue

* fix some templates after merging with develop

* fix test preshuffle parameter

* fix formatting

* Unify bquant kernel to the common quant kernel

* remove bquant kernel also from common header

* fix formatting

* clean up commented code

* fix formatting config hpp

* fix merge mistake

* Non-const for movable windows

* fix formatting

* Fix grammar in README

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Remove #include<bit> and clean up example

* fix strides

* Add some descriptions for move_windows

---------

Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: c6010f2953]
2025-09-04 16:17:12 -07:00
Bartłomiej Kocot
3e8a6dfb9c [CK Tile] Grouped convolution backward data (#2652)
* base working version for single grouped conv bwd data

* Fix 2d descriptor

* fix groups

* Add 3d support

* fixes

* fixes

* fixes

---------

Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>

[ROCm/composable_kernel commit: 4212bbc170]
2025-08-20 05:29:57 -07:00
Yashvardhan Agarwal
b6f0e98da6 General 2D Reduction Kernel (#2535)
* General 2D Reduction Kernel

* Move the reduction kernel from the example
* Split the code and add the necessary policy, problem, shape files as
per ck_tile convention
* Add/modify the headers
* Modified the example to work with the 'new' kernel
* Added tests for the kernel
* N-D reference reduce
* Added support for N-D input with a transform to 2D (see the sketch after this list)
* Added padding to support input tensors of various sizes
* Bug fix in the thread buffer constructor
* Some comments to explain the reduce2d block kernel
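
A minimal host-side sketch of what the kernel computes (sum shown for concreteness; the reduce op is configurable): the kept dimensions of an N-D input are flattened into M and the reduced ones into K.

```cpp
#include <cstddef>
#include <vector>

// Reference 2D reduction over a row-major [M, K] buffer, reducing along K.
std::vector<float> reduce2d_reference(const std::vector<float>& in,
                                      std::size_t M, std::size_t K)
{
    std::vector<float> out(M, 0.f);
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t k = 0; k < K; ++k)
            out[m] += in[m * K + k];
    return out;
}
```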

* comments resolution

* clang-format

* comments resolution

* clang-format

* clang-format

* comments resolution

* clang-format

[ROCm/composable_kernel commit: 4750b293fe]
2025-08-06 15:36:59 +02:00
Jinchao Xu
d2525548c6 Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611)
Resolves "R_X86_64_32 relocation out of range" errors in grouped conv2d instances
by splitting debug information into separate .dwo files.

Add explicit cast to avoid signed/unsigned comparison warning.

[ROCm/composable_kernel commit: 15eb493152]
2025-08-04 11:26:08 -07:00
Illia Silin
3345f5f417 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config

[ROCm/composable_kernel commit: 504b101da3]
2025-07-28 11:34:07 -07:00
Yashvardhan Agarwal
094e5bad50 [CK_TILE] Support for elementwise kernel (#2246)
* Elementwise kernel implementation

Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>

* Elementwise with generalized nDims

* Adding the n-ary input tensor feature

* Generalize dimensions on top of inputs

* Add TFLOPS + remove std usage for tuples

* 1D basecase optimization

* Cleanup code + refactoring to a common interface

* Generalize to unary and add an example

* Cleanup, refactoring and commenting

* Suggestions for LWPCK-3170: elementwise kernel improvements

* Clang-format: remod.py

* Replace InputTensorType with XDataType as the type of input_tensors

* Add Tuple::apply and use it in ElementWiseKernel::operator() to call the operation with the exact number of arguments in xs
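
The idea behind Tuple::apply, sketched with std::tuple (CK Tile uses its own Tuple type; names here are illustrative):

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

template <class F, class Tup, std::size_t... Is>
decltype(auto) apply_impl(F&& f, Tup&& t, std::index_sequence<Is...>)
{
    // Expand the tuple so f receives exactly one argument per element.
    return std::forward<F>(f)(std::get<Is>(std::forward<Tup>(t))...);
}

template <class F, class Tup>
decltype(auto) apply(F&& f, Tup&& t)
{
    constexpr std::size_t N = std::tuple_size<std::remove_reference_t<Tup>>::value;
    return apply_impl(std::forward<F>(f), std::forward<Tup>(t),
                      std::make_index_sequence<N>{});
}
// apply(op, xs) then calls op(x0, ..., xn-1): the elementwise operation gets
// one argument per input tensor, whatever the arity.
```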

* Move examples to folder 19_elementwise

* Add missing copyright headers and fix some existing ones

* Replace an assert with throw std::runtime_error in elementwise example

* Avoid reading the output by using make_static_distributed_tensor for y_tile

* Removed two unused includes

* No need to move windows to the next block when each workgroup processes a single tile

* Only copy input tensors to the device

* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example

* Add output strides to the kernel, add a transposition example, and update the other examples

* Changes made by remod.py

* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp

* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference

* Fix comments in elementwise_example.cpp

- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors

* Simplify CMakeLists.txt and remove the unused variables this uncovers

* Rename a file and fix some copyright statements

* Changes made by script/clang-format-overwrite.sh

* Add basic unit test for ElementWiseKernel

* Remove left-over uninformative comment in apply unit test

* Changes made by clang-format-overwrite.sh

* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp

* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array

* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally

* Move examples to folder 20_elementwise

* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able to benchmark against old CK

* Fix clang formatting

* Bump up the elementwise example folder number

* Elementwise: add padding + minor cleanup

* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view

* Add isSupportedArg to Elementwise kernel + adapt example and unit tests

* Fix clang-format on the unit test file

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 606b0cc947]
2025-07-24 11:21:45 +02:00
jakpiase
bdb86fee78 [CK_TILE] Grouped Convolution Backward Weight Kernel (#2357)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* resolved conflicts

* rebase fixes

* fixes

* tmp

* add working support for splitk

* minor fix

* fixes

* fixes

* minor fix

* small fix

* Split K and preprocessing fixes

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 6681593864]
2025-07-24 10:41:35 +02:00
Cong Ma
baf244000e ck_tile kernel for gemm with groupwise quantized A tensor (#2473)
* ck_tile kernel for gemm with groupwise quantized A or B tensor.

This change introduces new pipelines with the Intrawave scheduler and block gemm primitives that load the scale tensor into registers to perform dequantization post-MFMA on the C tensor in registers.

Scale tensor data (AQ/BQ) is split across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32

Group size can go as low as the K length of the underlying WarpGemm primitive.

For Gemm problems with a quantized B tensor, this change also introduces preliminary support for the flatmm pipeline, which loads the B tensor directly into registers.
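
A hedged host-side sketch of the post-MFMA dequantization for groupwise A quantization (in the kernel this happens on register tiles; names and layouts here are illustrative):

```cpp
#include <vector>

// C = sum over k-groups of scale_a[m][g] * (partial accumulator of group g),
// i.e. the scale is applied to the C partials after the MFMA step rather
// than to A elements before it.
std::vector<float> gemm_aquant_reference(const std::vector<float>& a_q,     // [M, K]
                                         const std::vector<float>& b,       // [K, N]
                                         const std::vector<float>& scale_a, // [M, K/G]
                                         int M, int N, int K, int G)
{
    std::vector<float> c(M * N, 0.f);
    for(int g = 0; g < K / G; ++g)
        for(int m = 0; m < M; ++m)
            for(int n = 0; n < N; ++n)
            {
                float partial = 0.f; // per-group accumulator (MFMA output tile)
                for(int k = g * G; k < (g + 1) * G; ++k)
                    partial += a_q[m * K + k] * b[k * N + n];
                c[m * N + n] += scale_a[m * (K / G) + g] * partial;
            }
    return c;
}
```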

* [Block Scale Gemm] Only run gemm quant examples on __gfx94__

- Only run gemm quant examples on __gfx94__ for usage of
  `v_cvt_pk_fp8_f32`
- Format the code

* [Block Scale Gemm] Remove Bquant Gemm BlockScale

This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.

* [Block Scale Gemm] Format code with clang-format-12

The latest clang-format (v19) in ROCm 7.0 generates different results than
clang-format-12, which is used in CK CI.

Format code with clang-format-12 for consistency.

* [Block Scale Gemm] Split the k direction loop

- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
   to make the logic clearer.
- Disable C transposition.

* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm

* [Block Scale Gemm] Update copyright

* test

* Add TailHandler

* Move TileDistributionEncodingPatternAQ

* Refactor

* refactor

* fix bug

* fix bug

* help solve the PR comment

* Format the code

* [Block Scale Gemm] Add unit tests

* [Block Scale Gemm] Add support to 16x16x32 MFMA

- Add support to 16x16x32 MFMA
- Fix a bug when exchanging data across lanes

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: e62710e461]
2025-07-23 00:10:16 -07:00
Bartłomiej Kocot
2567f5e538 [CK TILE] Grouped Convolution Forward Kernel (#2188)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* rebase fixes

* fixes

* fixes

[ROCm/composable_kernel commit: cebdee4d9e]
2025-06-20 15:44:36 -07:00
carlushuang
8660f6ef22 [CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335)
* support local_token for hipgraph

* update README

* fix comment

* fix fmoe example

[ROCm/composable_kernel commit: a4e1248dba]
2025-06-18 10:49:43 +08:00
Mateusz Ozga
044a8560f7 [CK_TILE] Multiple-D GEMM example (#2219)
* Multiple d, initial commit

* Check Ds Layout

* Readme and clang format

* Update branch & conflicts

* Multiple D - fix clang-formatter

* Rename elemetwise_op

* Fix CI

* Code review part1

* Remove printf

* Remove unnecessary comment

* Add new tests with Col layout

* Review part 2

* Added support for Multiple D GEMM

* Update comment

* Remove maybe_unused

* Clang-format

* Review part 3

* Add comment to function

* Add comment to function: another

* Take the number of params for a reference function

* Remove additional d param for 0 tensor

* Change name of function

* Fix CI fails

[ROCm/composable_kernel commit: bd96ac9742]
2025-06-13 19:39:11 +02:00
ruanjm
fcbf9630fe [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug fix

[ROCm/composable_kernel commit: d49abdaa87]
2025-03-25 20:09:45 +08:00
Bartłomiej Kocot
2f56999c18 [CK TILE] GEMM with packed i4 (#1885)
* [CK TILE] GEMM with packed i4

* Fixes

* fixes

* fixes

* fixes

[ROCm/composable_kernel commit: 4d9973ec8e]
2025-02-20 09:59:49 +01:00
carlushuang
8ed234da8c [CK_TILE] moe sorting ex kernel to support expert > 128 (#1840)
* moe sorting ex

* fix bug for race condition

* fix bug and optimize large expert

* fix

* optimize with sub_token_oneshot

* support skip empty tokens for expert sorting

* update moe_sorting

* tidy code

[ROCm/composable_kernel commit: c0adab4850]
2025-02-11 17:49:17 +08:00
kylasa
42f4ef8d1c Support for dtypes (fp8, bf8, bf16 and fp16) for the ck_tile/03_gemm example. (#1845)
* Support bf16/fp8/bf8 datatypes for ck_tile/gemm

* remove commented out code.

* Addressing code review comments and enabling universal_gemm for all the supported data types.

* Merge conflict resolution.

* Fix the memory pipeline compilation error; merge with the new CShuffle change

* finish the feature, pass the tests

* Fix the pipeline and add the benchmark script for other data types

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: ab5d027866]
2025-02-07 15:07:06 -07:00
fangche123
e4d8548dc5 add batched_transpose implement (#1660)
* add batched_transpose implement

---------

Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: c5fff071e5]
2025-01-28 16:22:02 -08:00