Commit Graph

405 Commits

Author SHA1 Message Date
Thomas Ning
cdffa8cc83 disable the gfx90a (#3336)
[ROCm/composable_kernel commit: 8459d389ad]
2025-12-02 07:27:37 -08:00
Ville Pietilä
8bf6f9cac8 [CK_TILE] Merge multiple fwd convolution groups into a single GEMM batch. (#3136)
* Merge fwd conv groups in CK Tile.

* Fix building CK fwd convs.

* Add number of merged groups to conv fwd kernel name.

* Get number of merged groups from conv config.

* Rename GemmConfig to ConvConfig.

* Clean-up TODOs.

* Check that the number of conv groups is divisible by the number of merged groups.

* Improve error handling in the conv fwd example.

* Fix clang-format.

* Fix group offsets.

* Fix merge problem.

* Address feedback from code review.

* Fix clang-formatting.
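
The divisibility rule from this change can be sketched as below. This is an illustration only: `gemm_batches_for_groups` and its parameter names are hypothetical and not taken from the CK Tile sources, and the assumption is that each GEMM batch covers exactly `num_merged_groups` convolution groups.

```python
def gemm_batches_for_groups(num_conv_groups: int, num_merged_groups: int) -> int:
    """Return the number of GEMM batches after merging groups (hypothetical helper).

    Assumes every GEMM batch covers exactly `num_merged_groups` conv groups,
    so the group count has to divide evenly.
    """
    if num_merged_groups <= 0:
        raise ValueError("num_merged_groups must be positive")
    if num_conv_groups % num_merged_groups != 0:
        raise ValueError(
            f"{num_conv_groups} conv groups are not divisible by "
            f"{num_merged_groups} merged groups"
        )
    return num_conv_groups // num_merged_groups
```

For example, 32 groups merged 4 at a time would give 8 batches under this assumption, while 30 groups merged 4 at a time would be rejected.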

[ROCm/composable_kernel commit: 66832861ad]
2025-12-02 15:23:32 +02:00
msaffari-amd
b06b6e684c [CK Tile] batched contraction kernel generalizing (#3126)
* Add help for example

* Refactor the reference batched contraction computation to handle stride-aware calculation, plus some code cleanup

* Add stride-aware reference for batched contraction with independent D tensor layouts

* Add -num_d argument for runtime D tensor count selection in batched contraction

* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs

* Add descriptor-based architecture for batched contraction multi-dimensional stride support

* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0

* Add complete multi-dimensional stride support via descriptors

* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm

* Clean up batched contraction: remove old UniversalGemmKernel path

* Clean up batched contraction: remove legacy paths and finalize docs

* Optimize batched contraction example: pass dimension sizes not vectors

* correct the reference calculation, unsigned int to int

* Fix batched_contraction C++17 build errors for gfx90a CI

[ROCm/composable_kernel commit: 2d3020e5b0]
2025-12-02 13:30:27 +01:00
Yi DING
07158d16ad [CK_Tile] Flatmm MX Cleanup & Explicit Offset Calculation (#3286)
[ROCm/composable_kernel commit: f211156ce6]
2025-12-02 14:21:12 +08:00
Cong Ma
a6ec08a1d2 Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325)
* [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples

* [CK TILE GEMM Quant] update tile distribution of aquant

* [CK TILE GEMM Quant] update aquant register offset calculation

* [CK TILE GEMM Quant] Reimplement aquant register offset calculation

* [CK TILE GEMM Quant] Add more unit tests of Aquant

- Test M128xN128xK128

* [CK TILE GEMM Quant] Add more comments to Gemm Aquant

[ROCm/composable_kernel commit: 23fb253c4e]
2025-12-01 15:04:37 -08:00
Aviral Goel
0861395425 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>

[ROCm/composable_kernel commit: 004784ef98]
2025-11-28 13:49:54 -08:00
Sami Remes
1c73a3d480 [CK_TILE] Fix Quant GEMM build (#3320)
* Fix build

* Fix ck_tile example 38 & 40

---------

Co-authored-by: Yi DING <yi.ding@amd.com>

[ROCm/composable_kernel commit: f981554c39]
2025-11-28 20:33:53 +08:00
Cong Ma
f622d546f3 Tile engine for streamk (#3157)
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.

- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
  It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
  the streamk implementation reaches stability.

* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM

* [CK TILE STREAMK] Refactor: Extract common utility functions.

* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation

* Add pre-commit

* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK

* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK

* [CK TILE STREAMK] Update Jenkinsfile

* [CK TILE Engine] Update StreamK tile engine help message

Remove default value messages as they are automatically printed

* [CK TILE Engine] Update StreamK tile engine

- Remove namespace reboot

* [CK TILE Engine] Update StreamK tile engine

- Fix merge error

[ROCm/composable_kernel commit: 30727c48fc]
2025-11-27 15:49:57 -07:00
arai713
6d28d12b62 [CK_TILE] Move DataTypeTraits into a Common File (#3146)
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

[ROCm/composable_kernel commit: 24d88d2472]
2025-11-27 09:09:54 -08:00
Max Podkorytov
0ce4a61da5 [CK Tile] enable building examples by default (#3259)
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch

* fix cpp17 compile error in the ck-tile examples

---------

Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>

[ROCm/composable_kernel commit: 79aae7c7f7]
2025-11-26 16:24:44 -08:00
Aviral Goel
cb0aaf8e90 fix: add dynamic selection of pipelines for aquant mode (#3282)
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly

[ROCm/composable_kernel commit: 35a4b26af0]
2025-11-26 10:58:09 +04:00
Aviral Goel
9f94579e5c chore(copyright): update copyright header for experimental & example directory (#3292)
[ROCm/composable_kernel commit: cd47293869]
2025-11-26 03:09:39 +04:00
Bartłomiej Kocot
91cc903d12 [CK TILE] Grouped Conv Explicit Gemm (#3289)
* [CK TILE] Grouped Conv Explicit Gemm

* fixes

* apply builder fixes

[ROCm/composable_kernel commit: 00dfa2f2ce]
2025-11-25 23:28:35 +01:00
Khushbu Agarwal
a6241c62cc [CK-Tile] fix block scale example for gfx1201 (#3283)
[ROCm/composable_kernel commit: 37ea160088]
2025-11-25 13:10:28 -08:00
Bartłomiej Kocot
083ea723a0 [CK_BUILDER] Add grouped conv bwd ck tile traits (#3281)
* [CK_BUILDER] Add grouped conv bwd ck tile traits

* copilot fixes

[ROCm/composable_kernel commit: 9ac2666d5b]
2025-11-25 14:57:43 +01:00
Aviral Goel
91ffc9dd1e chore(copyright): update copyright header for example directory (#3273)
* chore(copyright): update copyright header for codegen directory

* chore(copyright): update copyright header for example directory

[ROCm/composable_kernel commit: d85f065b15]
2025-11-24 18:02:41 -08:00
rocking
9cb3d700da Fix batch prefill compile fail in aiter (#3279)
* Fix batch prefill aiter compile fail

* Fix compile error

[ROCm/composable_kernel commit: 229d43ea0c]
2025-11-25 09:46:32 +08:00
Thomas Ning
99e6b461db Reorganize KPack in GEMM (#3247)
* add the reorganization of KPack

* fix the compilation error

* fix the compilation error

[ROCm/composable_kernel commit: de6a9590ab]
2025-11-24 12:38:59 -08:00
Khushbu Agarwal
84b12586c6 [CK_Tile] Support for preshuffle weight(B) quant tensor for block scale gemm (#3165)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and corresponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* addressing review comments

* fixing CI issue

* addressing review comments

* formatting

* formatting

* fixing aquant operator overloading

* formatting

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 8111572785]
2025-11-24 07:48:42 -08:00
rocking
ca1a0da0c3 Support fp8 dynamic quantization for fmha (#3206)
* Support qscale for dynamic quant, remove static quant

* Support hdim=256

* Remove bias test case for fp8

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 5948dbffe4]
2025-11-24 16:28:25 +08:00
Johannes Graner
dd7a2d199f [CK Tile] Fix example for conv fwd + bias + clamp (#3235)
* Fix clamp not being applied correctly

* Apply group offsets to D tensors

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 096f0a3b23]
2025-11-24 07:36:26 +01:00
Emily Martins
2963649b29 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Removes the v2 suffix from the new CK Tile Tile Partitioner
  derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names

[ROCm/composable_kernel commit: 2e4b8a8fc4]
2025-11-20 09:32:32 -07:00
asleepzzz
d115b3be4a Revert "Add attn sink (#2892)" (#3250)
This reverts commit cb7f05a8d3.

[ROCm/composable_kernel commit: 5adaa201ed]
2025-11-20 07:55:15 -08:00
Linjun-AMD
cb7f05a8d3 Add attn sink (#2892)
* enable attn sink

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update attn_sink script

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix some error

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* clang-format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd_kernel's mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix ci error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix format error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_bwd_pipeline_default_policy.hpp

* Update fmha_fwd_runner.hpp

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* update splitkv_pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update splitkv&pagedkv pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* add sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update attn_sink result log

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update smoke_test_fwd_sink.sh

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test file

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test script

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

* use constexpr kHasSink for sink in fmha pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update by pre-commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove causal mask setting logic from mask.hpp

Removed the mask setting logic for causal masks.

* fix CI error: some lambda usage is not supported in C++17

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update remod.py

* add smoke sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update fmha_pagedkv_prefill.py

* Update FmhaFwdPipeline parameters in fmha_fwd.py

* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix C++17 unsupported error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

* Fix formatting of sink_seq_end assignment

* Fix indentation for sink_seq_end assignment

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 9fa4e8d5ab]
2025-11-20 19:24:05 +08:00
Yi DING
e27e760d5a [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~

[ROCm/composable_kernel commit: 47e2ed838e]
2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal
5f7f81660d [ck_tile] Pooling example - Improved tile sizes (#3233)
* improved tile sizes

- modified tile sizes for improved example performance

* Update example/ck_tile/36_pooling/pool3d.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 1eb26460aa]
2025-11-19 15:30:18 +01:00
Aviral Goel
adf8515169 feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle… (#3225)
* feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle kernel(s) along with unit test

* docs: Update CHANGELOG.MD

[ROCm/composable_kernel commit: ac70206b2c]
2025-11-18 09:32:27 -05:00
Yi DING
e2060bd1fb [CK_TILE] MX Flatmm Split kernel instances (#3207)
* [CK_TILE] MX Flatmm Split kernel instances

* Fix flatmm example compile

[ROCm/composable_kernel commit: b6720531de]
2025-11-18 13:46:30 +08:00
BingYuan.Zhou
3800080d25 fix build error (#3195)
Co-authored-by: root <root@hjbog-srdc-39.amd.com>

[ROCm/composable_kernel commit: 4d629cd2b0]
2025-11-14 09:46:13 +08:00
Khushbu Agarwal
0366389527 fixing ambiguous shuffle definitions (#3175)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: fb41a7b73b]
2025-11-12 23:44:12 -08:00
Cong Ma
c86dee45e0 [CK TILE GEMM] Refactor block_scale_gemm examples (#3181)
* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Set quant group size to (1, 1, 64) for targets excluding gfx950, where warp tile size (16, 16, 128) is incompatible.

[ROCm/composable_kernel commit: 6fd8ddabe7]
2025-11-12 23:43:40 -08:00
Yashvardhan Agarwal
0ca982f8d5 [CK_Tile] Pooling example readme update (#3174)
* pooling example readme update

- The updated readme explains the transformations of the pooling kernel
using a mermaid diagram

* Update example/ck_tile/36_pooling/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* resolve comments

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: 299c9bca1b]
2025-11-12 07:30:20 -08:00
Aviral Goel
efcd6297d4 Add CK Tile Tutorials Folder with GEMM and COPY Kernel (#3038)
* feat: add tutorial folder with gemm tutorial

* chore: move copy kernel from examples folder to tutorial

* Update tutorial/ck_tile/01_naive_gemm/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tutorial/ck_tile/01_naive_gemm/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* chore: remove handdrawn images

* docs: add write ups to explain the gemm kernel

* docs: add write-ups about the block-level pipeline and static distributed tensors

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: b145a5fe80]
2025-11-11 14:15:49 -06:00
linqunAMD
31400ca622 [CK_TILE] Fix gemm_quant (#3186)
[ROCm/composable_kernel commit: 1b1c46e508]
2025-11-11 08:23:57 -08:00
Thomas Ning
fdccd7a3b4 fix input range (#3188)
[ROCm/composable_kernel commit: 9f33b7cfd3]
2025-11-10 11:08:41 -08:00
Bartłomiej Kocot
5c219f1697 [CK TILE] Convolution remove magic values (#3160)
* [CK TILE] Refactor Conv configs and Conv Elementwise

* fix

* [CK TILE] Convolution remove magic values

* fix partitioner

[ROCm/composable_kernel commit: 2234ff830b]
2025-11-06 11:26:30 +01:00
Cong Ma
53e42f5cce Introduces the new partitioner to implement the reduction StreamK kernel. (#3107)
* Introduces the new partitioner to implement the reduction StreamK kernel

* Add more doc text to functions

* Add persistent-dp option to streamk example

* Update example/ck_tile/40_streamk_gemm/README.md

[ROCm/composable_kernel commit: 5abe4109e0]
2025-11-04 10:32:17 -07:00
Thomas Ning
dceaa603d0 fix the blockscale 2d case (#3148)
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 13ba06f1e7]
2025-11-04 11:55:23 -05:00
Bartłomiej Kocot
7b4d9879da [CK TILE] Refactor Conv configs and Conv Elementwise (#3151)
* [CK TILE] Refactor Conv configs and Conv Elementwise

* fix

[ROCm/composable_kernel commit: 8681ced962]
2025-11-04 15:04:53 +01:00
Bartłomiej Kocot
a8500a082d [CK TILE] Refactor grouped conv fwd large tensor (#3144)
[ROCm/composable_kernel commit: 99f38e4d9b]
2025-11-04 00:34:48 +01:00
Emily Martins
4ec9a97bfe Replace CK_TILE_PIPELINE macros with a common enum
This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3,
CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum
called GemmPipeline to reduce code duplication.


[ROCm/composable_kernel commit: 2ec57a8e70]
2025-11-03 09:35:05 -07:00
Sami Remes
fd7bc5e9ab [CK_TILE] B matrix 2D block scale gemm (#3074)
* Refactor quant group size to be configurable for M/N/K, not just K

* add some asserts for configurations not implemented

* start setting of group size for N dimension

* enable 2d for reference quant gemm

* WIP: trying to figure out tile dstr and/or indexing for scale matrix

* WIP

* Fix handling of n dim blocks in tile windows etc

* remove commented code and enable all tests again

* fix formatting

* Add more specialized tile distributions

* Enable NWarps replication for bquant tile dstr

* fix formatting

* fix format

* Fix some issues from the merge

* fix formatting

* one more fix to tile dstr, and revert debug initialization

* Remove commented code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* simplify conditions that are needed for tile distributions

* only enable the working group sizes in tests

* fix formatting

* Update tile distribution for 2D bquant

* add some documentation and 2d block scale example

* fix formatting

* Add in Changelog and restructure the quant 2d example

* fix CMake

* support the change for blockscale 2d

* fix the test file

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 16e85cf179]
2025-11-02 16:49:20 -08:00
Aviral Goel
8302f9ec4a refactor: remove gemm preshuffle pipeline v1 by removing all references from codebase (#3132)
* test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf

* refactor: deprecate gemm preshuffle pipeline v1 by removing all references from codebase

* Revert "test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf"

This reverts commit 573c08a085.

[ROCm/composable_kernel commit: 73f637894d]
2025-11-02 00:06:28 -04:00
JH-Leon-KIM-AMD
35df1d1b79 [CK TILE] Grouped conv fwd split image (#2970)
* Refactor split-image implementation: simplify code and remove redundant variables

* Add padding debug output to split-image implementation

- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp
- Verified padding works correctly with all tests passing

* Fix sign comparison warning after rebase with origin/develop

- Cast blockIdX from unsigned to signed index_t for comparisons
- Integrated with new GetOutputTileIndex logic from upstream
- Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane

* Fix Split-N with groups bug and clean up unused parameters

- Fixed batch stride calculation to include G dimension for grouped convolutions
- When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups
- Removed unused multi-split parameters (we only support 2-way split)
- All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N

* Implement recursive queue-based split-image detection and calculation

- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp
- Implement recursive binary splitting algorithm (10GB→5GB+5GB→...)
- Correctly handle odd dimensions (61→30+31)
- Calculate proper offsets for each split piece
- Update invoker to use split-image helper

Note: Split detection and calculation work correctly but kernel launching
for individual pieces requires kernel modification to handle different
spatial dimensions (unlike Split-N which uses blockIdx.z).

* WIP: Split-Image investigation - found architecture mismatch

- Split-N modifies N_ directly in transformer constructor
- Split-Image needs different approach due to varying dimensions
- Added split calculation logic for 1D and 2D convolutions
- Still facing memory issues when creating piece transformers

Key finding: Split-N uses blockIdx.z for parallel execution,
while Split-Image needs sequential execution of non-uniform pieces.

* Add 1D split-image implementation for grouped convolution (N=1 working)

Implements split-image for 1D convolution to handle large tensors that
exceed memory thresholds. This is a critical milestone with N=1 fully
working and tested.

Key Changes:
- Invoker: Add split-image logic that splits W dimension in half
- Transformer: Add SplitConvProblem helper for recursive splitting
- Calculate offsets for LEFT and RIGHT pieces
- Launch two kernels sequentially (LEFT then RIGHT)

Implementation Details:
- Binary split: divides W dimension by 2
- LEFT piece: W=0 to W/2, keeps left padding, removes right padding
- RIGHT piece: W/2 to W, removes left padding, keeps right padding
- Offset calculation accounts for stride, dilation, and padding
- Physical memory offset (no padding in memory)
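
The binary split described above can be sketched as follows. This is a simplified illustration, not the code from the invoker: `split_width` and its parameters are hypothetical names, the split is assumed to happen on the output width, and an NWGC layout is assumed so that one step along W skips `G * C` elements. Dilation is left out because it only affects where the LEFT piece ends, not where the RIGHT piece starts.

```python
def split_width(w_out: int, stride: int, pad_left: int, g: int, c: int):
    """Split an output width into LEFT and RIGHT halves (sketch).

    Returns (left_w_out, right_w_out, right_input_offset_in_elements).
    """
    left_w_out = w_out // 2            # odd widths split unevenly, e.g. 61 -> 30 + 31
    right_w_out = w_out - left_w_out

    # First physical input column read by the RIGHT piece. Padding is not
    # stored in memory, so the left padding is subtracted from the start.
    right_in_start = left_w_out * stride - pad_left
    if right_in_start < 0:
        raise ValueError("RIGHT piece would still start inside the left padding")

    # Assumed NWGC layout: moving one column along W skips G * C elements.
    return left_w_out, right_w_out, right_in_start * g * c
```

With `w_out=61`, `stride=1`, `pad_left=1`, `g=2`, `c=8`, the pieces are 30 and 31 columns wide and the RIGHT piece starts 29 columns (464 elements) into the input.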

Test Results (N=1):
94/94 tests passing
- Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups)
- Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries)
- Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops)

Known Limitations:
- Only works with N=1 (single batch)
- N>1 fails when split-image triggers (offset calculation issue with Split-N)
- Root cause: Split-N modifies N in transformer, but offset calculated in invoker
- Solution planned: Move offset calculation to transformer (next phase)

Files Modified:
- grouped_convolution_forward_invoker.hpp: Add split-image logic
- transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper

This commit represents a stable, tested 1D split-image implementation
for N=1 cases. It's an important milestone before extending to N>1
and multi-dimensional splits.

* Add basic split-image implementation for 1D/2D/3D grouped convolution

This is a working baseline implementation that splits large spatial
dimensions to handle memory constraints.

Implementation:
- 1D: W-split for NWGC layout (36/36 tests passing)
- 2D: H-split for NHWGC layout (20/20 tests passing)
- 3D: D-split for NDHWGC layout (verified working)

Features:
- Binary split of outermost spatial dimension
- Sequential LEFT/RIGHT kernel launches
- Proper padding adjustment at split boundaries
- Offset calculation for pointer arithmetic
- Debug output for verification

Threshold: 100KB (configurable in transformer)

Known limitations:
- No safety checks for edge cases (to be added)
- Offset calculated before Split-N (incompatible with N>1, to be fixed)
- No recursive splitting for very large tensors

Next steps:
- Add safety checks (is_possible_to_split_*)
- Move offset calculation to transformer (after Split-N)
- Test with N>1 + split-image combination

* Refactor split-image to unified structure for 1D/2D/3D

Unified the three separate dimension-specific blocks into a single
common implementation with dimension-specific stride calculations.

Benefits:
- Reduced code from 636 → 348 lines (45% reduction)
- Eliminated code duplication
- Easier to maintain and extend
- Single source of truth for split logic

Implementation:
- Common: Binary split, offset calc, padding adjustment, kernel launch
- Dimension-specific: Stride calculation only
  - 1D: stride = G * C
  - 2D: stride = W_in * G * C
  - 3D: stride = H_in * W_in * G * C
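
The three stride formulas above can be written as one small function. This is only a sketch of the arithmetic; `outer_spatial_stride` is a hypothetical name, and the NWGC/NHWGC/NDHWGC layouts are assumed as in the earlier commits.

```python
def outer_spatial_stride(ndim: int, g: int, c: int, w_in: int = 1, h_in: int = 1) -> int:
    """Elements skipped when stepping once along the outermost spatial dimension.

    1D (W-split): G * C
    2D (H-split): W_in * G * C
    3D (D-split): H_in * W_in * G * C
    """
    if ndim == 1:
        return g * c
    if ndim == 2:
        return w_in * g * c
    if ndim == 3:
        return h_in * w_in * g * c
    raise ValueError("only 1D, 2D and 3D convolutions are covered")
```

Everything else in the split (binary split, padding adjustment, kernel launch) is shared, so a single dimension-dependent value like this is all that needs to differ between the 1D, 2D and 3D paths.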

Test results (all passing):
- 1D: 36/36 tests 
- 2D: 20/20 tests 
- 3D: 28/28 tests 
- Total: 84/84 (100%)

All test scenarios verified:
- Varying channels, padding, stride, dilation
- Filter sizes (1x1 pointwise to 7x7)
- Multiple groups (G=1,2,4)
- Odd dimensions
- Complex combinations

* Add safety checks for split-image in all dimensions

Added is_possible_to_split safety checks to prevent crashes when
splitting is not feasible.

Safety checks verify:
1. Output dimension > 1 (can't split single element)
2. RIGHT piece starts after left padding
3. LEFT piece ends within input bounds

If checks fail, falls back to normal kernel launch.
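
One possible reading of the three checks is sketched below. The function name and parameters are hypothetical, and the exact conditions in the real `is_possible_to_split` checks may differ; the formulas here simply assume a split at `out_dim // 2` along the outermost spatial dimension.

```python
def is_possible_to_split(out_dim: int, in_dim: int, stride: int,
                         dilation: int, pad_left: int, filter_size: int) -> bool:
    """Return True if a binary split along this dimension looks safe (sketch)."""
    # 1. A single output element cannot be split.
    if out_dim <= 1:
        return False

    left_out = out_dim // 2

    # 2. The RIGHT piece must start at or after the end of the left padding.
    right_in_start = left_out * stride - pad_left
    if right_in_start < 0:
        return False

    # 3. The last input element read by the LEFT piece must be inside the input.
    left_in_end = (left_out - 1) * stride - pad_left + (filter_size - 1) * dilation
    return left_in_end < in_dim
```

A caller would fall back to the normal, unsplit kernel launch whenever this returns `False`.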

Verified for all dimensions:
- 1D (W-split): Wo=1 case triggers fallback
- 2D (H-split): Ho=1 case triggers fallback
- 3D (D-split): Do=1 case triggers fallback

Original 84 tests still pass - they use normal configurations
that naturally satisfy safety conditions.

Safety checks protect against pathological edge cases with:
- Very small spatial dimensions
- Extreme stride/dilation combinations
- Invalid padding configurations

* Fix Split-N + Split-Image compatibility issue

Fixed critical bug where Split-N and Split-Image working together
caused ~50% incorrect results due to wrong batch stride calculation.

Problem:
- Batch stride was calculated using MODIFIED spatial dimensions
  (e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000)
- Spatial offset was applied globally in invoker, not per-batch in kernel
- Each batch (blockIdx.z) got wrong memory offset

Solution:
1. Store spatial offset in kargs (don't apply to pointer in invoker)
2. Copy correct batch_stride from temp_kargs to left/right kargs
3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset
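
The formula in step 3 can be illustrated with plain index arithmetic. The helper names below are hypothetical and a 1D NWGC layout is assumed; the point of the sketch is that the batch stride comes from the original width while the spatial offset is added per batch.

```python
def batch_stride_nwgc(w_in: int, g: int, c: int) -> int:
    """Elements between consecutive batches for a 1D NWGC tensor."""
    return w_in * g * c


def piece_element_offset(batch_idx: int, batch_stride: int, spatial_offset: int) -> int:
    """ptr = base + (batch * stride) + spatial_offset, expressed in elements."""
    return batch_idx * batch_stride + spatial_offset


# Original width 100000, split into two pieces of 50000 (G=1, C=4 assumed).
correct_stride = batch_stride_nwgc(100000, 1, 4)   # uses ORIGINAL dimensions
wrong_stride = batch_stride_nwgc(50000, 1, 4)      # uses MODIFIED dimensions (the bug)
right_piece_offset = 50000 * 1 * 4                 # start of the RIGHT piece

correct = piece_element_offset(2, correct_stride, right_piece_offset)
wrong = piece_element_offset(2, wrong_stride, right_piece_offset)
```

With the modified stride, every batch after the first lands at the wrong address, which is consistent with the partially incorrect results described in the problem statement.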

Changes:
- grouped_convolution_forward_kernel.hpp:
  * Added spatial_offset_in/out fields to KernelArgs
  * Apply batch + spatial offset in operator()

- grouped_convolution_forward_invoker.hpp:
  * Keep base pointer, store spatial offset in kargs
  * Copy batch_stride from temp_kargs (has original dimensions)

- transform_conv_fwd_to_gemm.hpp:
  * Add debug output for split-image calculation

Results:
- N=1 tests: 84/84 passing (100%)
- N>1 tests: Now all passing (previously ~50% errors)
- Tested: 1D, 2D, 3D with N=1,2,4,8,16,20

* Implement unified threshold for Split-N and Split-Image

This commit consolidates threshold management for both Split-N and
Split-Image operations into a single source of truth, eliminating
code duplication and fixing offset calculation issues.

Key Changes:
============

1. Transformer (transform_conv_fwd_to_gemm.hpp):
   - Moved TwoGB constant to public section for unified access
   - CalculateSplitImage() now takes no parameters
   - Uses internal threshold: TwoGB / sizeof(CDataType)
   - Calculates offsets using N_ (after Split-N) for correctness

2. Kernel (grouped_convolution_forward_kernel.hpp):
   - GetSplitImageInfo() simplified to take no parameters
   - Forwards to transformer's CalculateSplitImage()
   - Clean interface with unified threshold internally

3. Invoker (grouped_convolution_forward_invoker.hpp):
   - Removed redundant threshold calculation
   - Simplified to call kargs.GetSplitImageInfo() with no params
   - Clean early-return pattern (no unnecessary else blocks)
   - Removed duplicate/dead code paths
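
The unified threshold, `TwoGB / sizeof(CDataType)`, amounts to the arithmetic below. This is a sketch under assumptions: `TWO_GB` is taken to be 2^31 bytes, the comparison is assumed to be strict, and both function names are hypothetical.

```python
TWO_GB = 2 ** 31  # assumed value of the TwoGB constant, in bytes


def split_threshold_elements(element_size_bytes: int) -> int:
    """Largest element count that stays within the threshold for this data type."""
    return TWO_GB // element_size_bytes


def needs_split(num_elements: int, element_size_bytes: int) -> bool:
    """True if a tensor of this size exceeds the threshold (strict '>' is an assumption)."""
    return num_elements > split_threshold_elements(element_size_bytes)
```

Keeping the constant and the division in one place is what lets the kernel and invoker call the split query without passing a threshold around.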

Benefits:
=========
- Single source of truth: TwoGB defined once in transformer
- No parameter passing for threshold between components
- Correct offset calculation using N_ (post-Split-N)
- Cleaner code with no duplication
- All tests passing: 1D/2D/3D with various N values

Testing:
========
- Split-Image only (N=1, large spatial): PASS
- Split-N only (N>1, small spatial): PASS
- Both splits active (N>1, large spatial): PASS
- No splits (N=1, small spatial): PASS
- CPU verification correct for all scenarios

* Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded)

The old recursive queue-based implementation has been replaced by the
new CalculateSplitImage() method which is simpler and correctly handles
Split-N + Split-Image interaction.

Changes:
- Wrapped lines 381-1078 in #if 0...#endif
- Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded()
- Preserved for reference but disabled from compilation
- No functional changes - all tests still pass

The new implementation (CalculateSplitImage at line ~2163) provides:
- Correct offset calculation using N_ (after Split-N)
- Simpler binary split logic
- Better integration with unified threshold approach

* Implement recursive split-image with depth limit (MAX_DEPTH=10)

Changes:
- Add depth tracking to SplitPiece struct
- Implement two stopping conditions:
  1. Piece size below threshold (optimal case)
  2. Depth >= MAX_DEPTH (prevents infinite recursion)
- Remove MAX_PIECES limit in favor of depth-based control
- Support up to 2^10 = 1024 pieces with depth 10

This allows handling extreme tensor sizes while ensuring termination.
Pieces larger than threshold will still launch correctly if depth limit reached.

Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.

* Summary of recursive split-image implementation:
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces)
- Two stopping conditions: size below threshold OR max depth reached
- Cumulative offset tracking through all recursion levels
- LEFT piece inherits parent offset, RIGHT accumulates (parent + local)
- Per-batch spatial offset application in kernel operator()
- Batch stride uses original dimensions (before split)
- Works with Split-N: split-N first, then recursive split-image
- Handles odd dimensions, padding, stride, dilation correctly
- All 1D/2D/3D tests pass with CPU verification
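
The queue-based splitting summarized above can be sketched on the host as follows. This is a minimal illustration along one spatial dimension only, not the CK Tile code: `Piece`, `split_rows`, and `max_rows` are hypothetical names, and the real implementation decides by descriptor size in bytes rather than by row count.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical illustration type; not the CK Tile SplitPiece definition.
struct Piece
{
    std::int64_t offset; // first row of this piece within the original image
    std::int64_t size;   // number of rows in this piece
    int depth;           // number of binary splits that produced this piece
};

// Breadth-first binary splitting with two stopping conditions:
// the piece is small enough, or the depth limit has been reached.
std::vector<Piece> split_rows(std::int64_t rows, std::int64_t max_rows, int max_depth = 10)
{
    assert(rows > 0 && max_rows > 0 && max_depth >= 0);
    std::vector<Piece> done;
    std::queue<Piece> pending;
    pending.push({0, rows, 0});
    while(!pending.empty())
    {
        const Piece p = pending.front();
        pending.pop();
        if(p.size <= max_rows || p.depth >= max_depth)
        {
            done.push_back(p); // may still exceed max_rows if the depth limit hit first
            continue;
        }
        const std::int64_t left = p.size / 2; // odd sizes give the extra row to the right
        pending.push({p.offset, left, p.depth + 1});                 // LEFT inherits parent offset
        pending.push({p.offset + left, p.size - left, p.depth + 1}); // RIGHT = parent + local
    }
    return done;
}
```

Because pieces finish at different depths, the returned vector is not necessarily sorted by offset; a caller that needs spatial order would sort it first.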

* Add comment explaining MAX_DEPTH capacity for 2GB threshold

* Refactor: move recursive split-image logic to transformer

- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp
- Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic)
- Encapsulate SplitPiece struct and BFS splitting algorithm in transformer
- Remove unused includes (queue, vector) from invoker
- Add documentation comment for AreDescriptorsSmallerThan2GB()
- Improve code organization and reusability
- No performance overhead (static template function, compiler inlines)
- All tests passing with 2GB production threshold

* Apply clang-format-18 formatting

- Format invoker and transformer files with clang-format-18
- Fix brace placement and alignment
- No functional changes

* Fix clang-format-18 issues in forward kernel

- Remove extra blank lines
- Fix line wrapping for template calls
- Consolidate GetSplitImageInfo() to single line

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Split-Image implementation with temporary fixed divider

- Implemented spatial dimension splitting (Split-Image) for large tensors
- Added piece-based coordinate transformation for 1D/2D/3D convolutions
- Integrated Split-N (batch splitting) with automatic threshold detection
- Fixed M dimension calculation to include batch: M = N × spatial_size
- Added spatial offset support in kernel arguments
- Verified 20/20 test cases passing for Split-Image alone
- Known issue: Split-N + Split-Image combination needs coordinate fix

Implementation Details:
- Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values
- Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts
- Piece descriptors track spatial boundaries and block ranges
- No performance overhead for N=1 cases

* Fix 1D split-image padding issue with per-piece dimensions

- Store actual size per piece to handle non-uniform splits
- Remove dead code from transform utils

* Fix 2D/3D split-image with independent split factors per dimension

Problem: Single split factor caused non-uniform pieces when dimensions
didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed.

Solution: Independent split factor selection for W, H, D dimensions.
Each dimension gets optimal factor based on its own size.

Test Results:
- 1D: 42/42 pass (100%)
- 2D: 25/25 pass (100%)
- Total: 67/67 combinations verified

* Remove unused split-image struct fields

Cleanup of split-image implementation:
- Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct
- These fields were declared but never used in the kernel
- Per-piece dimensions are already stored in pieces[] array
- Reduces struct size and improves code clarity

Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass

* Refactor split-image invoker code for improved readability

- Extract piece calculation logic into calculate_piece lambda helper
- Extract kernel args population into populate_split_image_kargs lambda
- Use aggregate initialization for cleaner struct population
- Reduce nesting depth and improve maintainability
- Fix outdated comment about split-image implementation status

* Refactor split-image code and remove debug prints

- Extract GPU kernel helper lambdas for better readability
- Remove all split-image debug print statements
- Set memory threshold to 2GB for production
- All tests pass with CPU verification

* Add split-image safety constraints and refactor to utils

- Add MAX_TOTAL_PIECES=64 limit to prevent segfault
- Move calculate_spatial_piece to library utils
- Add layout validation (NWGC, NHWGC, NDHWGC only)
- Fix hierarchical splitting to respect piece limits
- Add proper documentation and formatting

* Change split-image from runtime to compile-time branching

Response to @bartekxk review comment:
Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)'

Changes:
- Add EnableSplitImage template parameter to kernel
- Change runtime if to compile-time if constexpr
- Update invoker to instantiate kernel variants with true/false

Benefits:
- Eliminates runtime branching in GPU kernel
- Dead code elimination (each variant is smaller)
- Better compiler optimization

Files modified: 2
Lines changed: 20 total (6 in kernel, 14 in invoker)
Tests: 27/27 passed (100%)
Performance: No regression
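
The runtime-to-compile-time change described above follows a common C++17 pattern, sketched here on the host. `KernelArgs`, `input_offset`, and `dispatch` are hypothetical stand-ins for illustration; only the `EnableSplitImage` parameter name and the `num_spatial_pieces > 1` condition come from the commit message.

```cpp
#include <cassert>

// Hypothetical argument struct; the real kernel arguments carry far more state.
struct KernelArgs
{
    int base_offset;    // offset that applies to every launch
    int spatial_offset; // extra offset used only when the image is split
};

template <bool EnableSplitImage>
int input_offset(const KernelArgs& kargs)
{
    // The branch is resolved at compile time, so each instantiation
    // contains only the path it needs.
    if constexpr(EnableSplitImage)
        return kargs.base_offset + kargs.spatial_offset;
    else
        return kargs.base_offset;
}

// The runtime decision is made once, when choosing which variant to instantiate.
int dispatch(const KernelArgs& kargs, int num_spatial_pieces)
{
    assert(num_spatial_pieces >= 1);
    return num_spatial_pieces > 1 ? input_offset<true>(kargs) : input_offset<false>(kargs);
}
```

The cost of this design is one extra kernel instantiation per configuration, which is the trade-off the "instantiate kernel variants with true/false" step accepts.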

* Add split-image example as separate binary

- Create grouped_convolution_forward_split_image example
- Add grouped_convolution_forward_split_image_invoker.hpp
- Update CMakeLists.txt to build split_image binary

* Replace linear search with binary search in find_piece_id

- Change O(n) to O(log n) for finding piece ownership
- Matches reference implementation in large_tensor_cshuffle
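
A minimal host-side sketch of the O(log n) lookup, assuming each piece owns a contiguous range of block indices and that the first block of every piece is stored in a sorted array. The `block_start` layout and this signature are assumptions for illustration; the actual `find_piece_id` runs in the kernel and may be organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// block_start[i] is the first block index owned by piece i.
// Assumed sorted ascending with block_start[0] == 0.
int find_piece_id(const std::vector<int>& block_start, int block_id)
{
    assert(!block_start.empty() && block_start.front() == 0 && block_id >= 0);
    // Find the first start strictly greater than block_id;
    // the owning piece is the one immediately before it.
    const auto it = std::upper_bound(block_start.begin(), block_start.end(), block_id);
    return static_cast<int>(it - block_start.begin()) - 1;
}
```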

* Simplify split-image code and fix integer overflow

- Extract lambda functions to static helper methods
- Pre-calculate constants in invoker
- Fix integer overflow in tensor size calculation for large tensors
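
The commit does not show the overflow fix itself. The usual remedy for this class of bug is to widen to 64 bits before multiplying, as in the sketch below. `tensor_bytes`, `needs_split`, and the reading of "2GB" as 2^31 bytes are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed interpretation of the 2GB threshold mentioned in the log.
constexpr std::int64_t kTwoGiB = std::int64_t{1} << 31;

std::int64_t tensor_bytes(int n, int h, int w, int c, int elem_bytes)
{
    assert(n >= 0 && h >= 0 && w >= 0 && c >= 0 && elem_bytes > 0);
    // Widening the first operand makes every later multiplication 64-bit.
    // Writing n * h * w * c * elem_bytes with plain int operands would
    // overflow (undefined behavior) once the product exceeds INT_MAX.
    return static_cast<std::int64_t>(n) * h * w * c * elem_bytes;
}

bool needs_split(int n, int h, int w, int c, int elem_bytes)
{
    return tensor_bytes(n, h, w, c, elem_bytes) > kTwoGiB;
}
```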

* Trigger CI rerun - fix merge conflicts

* Fix merge conflict markers

* Fix clang-format: remove space before {}

* Fix clang-format: comment wrapping and Swish constructor

* Rename split_image to large_tensor for clarity

- Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp
- Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp
- Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor
- Updated comments to refer to 'large tensor' instead of 'split-image'

* Update comments and include in large_tensor example

- Updated header comments to use 'large tensor' terminology
- Fixed include path to use large_tensor_invoker.hpp

* Remove test code, restore 2GB threshold

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix build errors after develop merge and complete rename to large_tensor

This commit addresses compilation errors from the develop merge and
completes the rename from split_image to large_tensor.

Changes:
1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp
2. Fix template parameter order in large_tensor_invoker.hpp
   - TransformConvFwdToGemm signature changed in develop
   - NumGroupsToMerge and SplitN parameters swapped positions
3. Fix missing template parameter in GroupedConvFwdHostArgs
4. Fix EpiloguePipeline scope in kernel (merge conflict)
5. Update binary name references in test scripts

* Restore 2GB threshold for split-image

Changed threshold from 100MB (testing) back to 2GB for production use.

* Fix const-correctness in ds_ptr cast

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply clang-format-18

* Update to clang-format-18 formatting

* Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 1fbb47ad30]
2025-11-01 14:18:16 +02:00
Aviral Goel
d17d3f0766 test(grouped_gemm): add unit tests for grouped_gemm bquant with preshuffleB true (#3119)
* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* add bquant to grouped_gemm

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* change cmake

* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm

* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* feat: add bf8 support

* chore: remove unnecessary decltype usage

* chore: add default quant_mode to function signature as fallback

* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel

Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.

* chore: set default quant mode in function signature

* test: add additional test cases to cover edge case of no hotloop

* change code based on comments

* WIP: bquant preshuffle b compiles but gives numerical error

* feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel

* refactor: refactor code after merge commit

* chore: remove print statements

* test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases

---------

Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 8f1274d9b6]
2025-10-31 12:07:06 -07:00
John Afaganis
a987c5dc2e Add copyright notices to missing files (#3133)
[ROCm/composable_kernel commit: 3f996ee738]
2025-10-31 07:35:11 -07:00
Bartłomiej Kocot
fcabe28158 [CK TILE] Clear output buffers for grouped conv bwd (#3127)
[ROCm/composable_kernel commit: c2d7931446]
2025-10-31 14:11:54 +01:00
Yi DING
acec30dd09 [CK_TILE] Add mxfp4 flatmm (#3080)
* Squashed commit of the following:

commit 3e1a851dad834776efbe4fe365ac82c4ed312010
Author: Ding, Yi <yi.ding@amd.com>
Date:   Thu Oct 23 06:10:54 2025 +0000

    Fix & clean after rebase

commit 1edf485092f44411da9a1796a4a6b72d5cdb67c6
Author: Ding, Yi <yi.ding@amd.com>
Date:   Wed Oct 22 10:46:13 2025 +0000

    Squashed commit of the following:

    commit 5276b28a51dac7b5d2106fbae8e78de190ee0de1
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 02:04:27 2025 -0500

        fix bandwidth calculation

    commit d645bb20c6d879154c30ecd82bbff4d2a9206750
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 00:58:59 2025 -0500

        updates

    commit 0fa7e6b88aaf81a36034aa7607746de295de4263
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Sep 19 00:39:46 2025 -0500

        fix a bug, set the A DS_read preload size to 4 for MXFP4

    commit 50cafa824e2267f2b2f0dfeeb93e69a673630c61
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Thu Sep 18 01:19:03 2025 -0500

        fix a_wrap preload issue for large MPerBlock.

    commit e6333bbbc6ef540e24f92095040085f1ed59041e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 21:34:03 2025 -0500

        optimized the VGPR repack issue for MXFP4

    commit e99e4932c401b9f6d1893dd5044c2827d6b3f145
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Wed Sep 17 04:19:44 2025 -0500

        fix time error

    commit 4586ce6da7fba0514f2e01a8124c76b7d494e124
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 03:58:00 2025 -0500

        updated, function passed.

    commit c4f25e7579573db5681b9160f6bdb1349f3566f1
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 22:21:39 2025 -0500

        fix, function partially passed

    commit a51b56eb6b00b99a4e8d2802dbf5b5b5277b54d8
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 03:01:12 2025 -0500

        fix, reference function passed, next check kernel function

    commit 5b02643ebab18960e8f9ba66c6bd2f91774f9cae
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Sep 16 02:29:01 2025 -0500

        let pack/unpack return pk_fp4_t

    commit 76d37c5d4b17530e95c6fced31bff66a35d54b8f
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 20:50:26 2025 -0500

        fix

    commit e5be3e162b9a20e5355bd556d2b27afb6d8bf085
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 15 05:51:06 2025 -0500

        fix bug

    commit 39a024efe4aa773df589712b1290803bb5ab5d1d
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 04:02:05 2025 -0500

        fix core dump issue, function is not correct.

    commit 16c49d268cfe065b5112b960b2d852b26552686a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 03:03:02 2025 -0500

        updates, build pass

    commit fe7a961852dee6eff3be3cf1e0d0fabec5cd42ee
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 00:05:18 2025 -0500

        updates

    commit aaf9fe8022a72df59e04e4d5886dca3ba9c23400
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Sun Sep 14 23:40:28 2025 -0500

        fix bug

    commit a3da89290e1553b85fbf1171c07e93ac0f5584db
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 03:28:50 2025 -0500

        fix interface

    commit c5ff747e72d877461ba61dc19a0fe15527d3161e
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 02:53:50 2025 -0500

        add interface in warp_gemm_impl

    commit 0a48d369e601cc798589fc59e0784bdbfc0a22f9
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 10 05:03:08 2025 -0500

        updates some fixes.

    commit aaa2beca30ff5546d171a2028d1894fd4e131d4e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 9 04:37:42 2025 -0500

        fix after merge ginolu/add_wgmfma_dispatcher

    commit bf87449b09cba690922b2f3f78ba39bf1b1e472e
    Merge: 05ab58e3d 991d7fdbb
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 22:09:15 2025 -0500

        Merge remote-tracking branch 'origin/ginolu/add_wgmfma_dispatcher' into mtgu/cktile_mxfp4_flatmm_dev

    commit 05ab58e3de2b708aceda63d704089c0fa89437ae
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 21:42:47 2025 -0500

        update mx flatmm tail pipeline

    commit 991d7fdbb726d65091a91b5cc2800f798a6661fc
    Merge: ad046084a b2f280046
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:10:23 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit ad046084a2f6e4ebf0cd8b47d0d72b74815061fa
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:09:55 2025 -0500

        fix type error

    commit 42e16b43a035364a42789d7ce45a1e6a7d1d2609
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 04:01:40 2025 -0500

        update hotloop pipeline

    commit c2f69745346545087c8ce24acaba2961bb93ef0b
    Merge: adbeeb90b 8b4be3a0e
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:22:26 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit adbeeb90be1533f8aeb8c1d5aea6470d45a455a0
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:21:26 2025 -0500

        fix clang format

    commit e2378ac393bb79ac80a8eef84677bffce86d9e0a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 3 10:00:54 2025 -0500

        some updates

    commit bdc18a2269db49ff88e1ef1db30f83ea430d7544
    Merge: 6c5cea2b7 feec59755
    Author: asleepzzz <hanwen.chang@amd.com>
    Date:   Wed Sep 3 13:22:03 2025 +0800

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit 6c5cea2b7a306f5d0ad346cb9baf6370ea2a73fe
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 02:11:02 2025 -0500

        fix vec size error

    commit 76d1dfa352087dfd5867c8909b73726d3a1e853e
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 01:23:39 2025 -0500

        fix format error

    commit a9061aaa1b4bfaa9db102c75b9d74863f39708a9
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Sat Aug 30 03:19:07 2025 -0500

        update codes

    commit 0caa184a271a8824ef40f87de456d0fa2500c8ad
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Aug 29 11:27:33 2025 -0500

        init ck_tile mxfp4 flatmm

    commit 5d46a6635f04bd69b76f7eda1438862e271b987a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 28 08:02:50 2025 +0000

        Add bias for f16xf4 moe_flatmm

    commit dd112dc302d17f541737671a3ac557d7c09ff969
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 27 13:39:47 2025 +0000

        update case construction

    commit b1aca68a073d82c7b3c7bb53286e5f415999edc1
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 26 12:32:29 2025 +0000

        support swiglu activation and use rcpf to accelerate silu

    commit 49235bd42349a84fc2ebd7ad0b100cc2545bb80a
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Aug 26 02:33:55 2025 -0500

        first commit

    commit c169e39d6381b932cf7098cc118db29df91da1cb
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 04:01:59 2025 -0500

        add line to last

    commit 318f9bf317306454941bbf394c1940023edcf0ac
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 03:20:46 2025 -0500

        adjust A_LDS descriptor to avoid bankconflict

    commit 9d066120ed068d6d102da25d619e170a28a04d18
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Thu Aug 21 09:46:52 2025 -0500

        enable hotloop

    commit 61a895e6b821798970afffd0e9432a21e2f04df8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 09:12:21 2025 +0000

        support atomic_pk_add_bf16 on gfx950

    commit 9f14864e45f21d8c1bc70a94988fb86c2c0017d8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 06:58:55 2025 +0000

        use int64_t as expert stride to avoid overflow

    commit e63af46b32e1139a1e59dee6f46b9971047c4026
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 13:53:32 2025 +0000

        use v4i32 as the storage type for B to avoid repack operation

    commit 6cf0224dd8a229bf2be726ca861c736c9b5f5415
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 06:40:03 2025 +0000

        add pk_fp4_t and e8m0_t support for amd_buffer_load_impl

    commit 67a591f2240b0b035029edad904627f98b3839fd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 04:39:14 2025 +0000

        optimize cvt_pkf4_to_f16 implementation

    commit 51c7126e77e9b17af694eaa57040e487f9d443e8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 19 14:56:46 2025 +0000

        optimize A_LDS descriptor to avoid bankconflict

    commit c113160f326353290a2878d7b8febf7daed91d71
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 18:43:37 2025 +0000

        fix gate-up when GU_NRepeat > 1

    commit a45ca0e9934ca4bb9114f65621d5c9582d937a45
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 17:28:11 2025 +0000

        add fp16xf4 moe

    commit dc8c8e484804f7bca10c8f0764540af3b5884e83
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Aug 17 17:51:18 2025 +0000

        rename example

    commit b177c967141cfdc401d3f36bf17830fe99893600
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 15 06:20:46 2025 +0000

        remove additional check when e8m0->float

    commit d467f9688c3d35f391e15089135edb1ad1d38b05
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 14 09:34:12 2025 +0000

        eliminate repeat dequant

    commit 1b20674b26ab3ce6bd2f710dd729fd4cc0f79428
    Merge: faa3c0278 7d02625e7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:51:49 2025 +0000

        Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm

    commit faa3c0278cf11b7105a4302dea3a4416520b2cc7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:16:48 2025 +0000

        update f16xMXF4

    commit a2a2e1dab05501cc2136133236c01c08d51db4ea
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 10:48:53 2025 +0000

        update scale-preshuffle for MXF4

    commit eac9667feb899419dda1628164c092b969852660
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 11:24:34 2025 +0000

        update

    commit 7d02625e7678882af653f52c2a4ddaf64568a41c
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 08:38:23 2025 +0000

        optimize gemm2 atomic_add pattern

    commit d5f3c3e3ec72d0e6739467c4dc0b4e209f6d1192
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:59:47 2025 +0000

        update scale for mxfp4

    commit 15db198084614466bd4cfd4943fcb549cab2069a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:56:14 2025 +0000

        update case construction

    commit 5dff349d82a5f70b6eea821d2622df51f90ef200
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 06:03:06 2025 +0000

        update granularity control

    commit d32cdc52144f65ec473f4ec8e45ea23968811184
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 03:42:46 2025 +0000

        fix TileConfig

    commit 26f38c5716304ee5f84e5c4f6f88144d9f3dddaf
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Thu Aug 7 21:37:28 2025 +0800

        Add e8m0 scaled convert into CK_TILE (#2617)

        * first commit

        * remove redundant code

        * modify according to comments.

        * fix type_convert error with scaled_type_convert

    commit 419041478745f65dfec18859e75a13d975089519
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 8 20:19:16 2025 +0000

        add mixed_prec fp16xfp4

    commit 92e2a8b0308b9b107df9d2fd63a961efce706402
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 7 09:22:04 2025 +0000

        debug mixed_prec flatmm

    commit dea3ce80496ebcb00512979f0c3bb897f25e11a5
    Merge: fde443bc3 b4f45fe14
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Aug 6 16:49:47 2025 +0800

        Merge pull request #2626 from ROCm/felix/flatmm_fix_splitk

        fix split k

    commit d480e8150358cc4ef8b05e25afe299141fad4fde
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 6 08:33:33 2025 +0000

        add moe_flatmm

    commit b4f45fe14d11569f34de40c8a205cd6760b61357
    Author: coderfeli <coderfeli@163.com>
    Date:   Wed Aug 6 02:45:31 2025 +0000

        fix split k

    commit fde443bc38fe60e52195817ecb2c7b20d772eedb
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 4 07:16:36 2025 +0000

        fix flatmm with scaling when WarpTileM == 32

    commit 5a0667afa889a5af8c6b8509232eabd50cf5efef
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 11:01:23 2025 +0000

        optimize scaling epilogue

    commit 5c3502bbf71833c6f6f7d4a1cc4f4fd93811f522
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 07:28:38 2025 +0000

        fix wrong config for fp8 scaling

    commit eb2d0653cdb86603cb11539cbac466b6431b58b7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 06:20:30 2025 +0000

        prune debug message

    commit 0c089cb56343a39e02a1ee38e9cabeb71ba35e92
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 04:52:08 2025 +0000

        fix compile error

    commit 61759ca30ce3787f70e228c3919b3e4d354016dd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Jul 29 15:42:58 2025 +0000

        Add persistent option on flatmm for tuning

    commit b36dc5dd55f15fc1ce8eb21637bdec862e56a883
    Author: AMD-dteng <dteng@amd.com>
    Date:   Tue Jul 29 22:48:00 2025 +0800

        update pipeline v1: add atomic IGLP schedule

    commit f886f26994454fc2b4fc3433c86bf699767a2a7c
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 09:09:27 2025 +0000

        fix error log throwing

    commit 4b4686ab144daa9061fbda17f3df4c17600c8e9a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Jul 28 08:24:51 2025 +0000

        crz idea

    commit 7099af44a81be41431ba70ae60827b60116d02d2
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Jul 27 11:57:38 2025 +0000

        Add permuteN optimization when NRepeat % 2 == 0 on flatmm

    commit b147524c92e69a267337c8e48b6e64bcb1483551
    Author: sjfeng <j514681085@icloud.com>
    Date:   Sun Jul 27 17:24:08 2025 +0800

        try to remove c_shuffle_lds

    commit 2dd94f59d1a7740a5689e1713ed45588cd0d55dd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Jul 25 07:41:48 2025 +0000

        fix loop-dim mismatch and improve c_shuffle alu parallelism

    commit 4e93f0c5e27806adc070e4caa81661069295751c
    Merge: 3f12ef5aa 0eb7455f1
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 08:46:51 2025 +0000

        merge flatmm -scale

    commit 3f12ef5aa52ced1bff3bfb57b878358330e9e095
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 16:19:58 2025 +0800

        revert delete of inc file

    commit 08c3a0d184d7581dc5be364f5b36f16fb4a8d6fa
    Author: solin <bingzhou@amd.com>
    Date:   Thu Jul 24 04:38:16 2025 +0000

        reorg  flatmm code

    commit 0eb7455f106604d5254ed16b0daeda68e2a148e3
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:12:31 2025 +0000

        fix flatmm syntax error on gfx950

    commit 695ff87e68fdcbe28452c1805cd4dbb643c45495
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:04:22 2025 +0000

        support flatmm scaling

    commit e3c29d9dea8758db96b998982ccc8bd1c4e8298d
    Author: valarLip <340077269@qq.com>
    Date:   Wed Jul 23 08:44:12 2025 +0000

        merge flatmm pipe v0 from dteng_flatmm_opt

    commit 425c366fa4c30426ff36cade89b39fd8cb7b9732
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:38:12 2025 +0800

        build pass

    commit 6b377a9481535696de40f175d7e2159263d21bdc
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 07:20:26 2025 +0000

         fix bug

    commit b6dc58d1ea676fe480c0243ae098c875498f6d6a
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:01:53 2025 +0800

        sync

    commit 904359f401866ee810484e6b8f5b46d79d9e25c8
    Author: valarLip <340077269@qq.com>
    Date:   Tue Jul 22 08:09:35 2025 +0000

        adaptive scheduler instead of Macro definition

    commit f29916c17228c17de9923aab62e7d72d7a30f4e9
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 17 08:40:35 2025 +0000

        fix tail handler bug

    commit e2c60a90929fec955d91db909d50db538d58363b
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 16 10:12:19 2025 +0000

        merge from dteng_flatmm_opt

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

* Fix crash on small M

* Apply suggestion from @Copilot

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: e135dd518d]
2025-10-31 11:29:05 +08:00
Ville Pietilä
0a49238dd5 Fixed building CK Tile grouped conv fwd bias clamp example. (#3124)
[ROCm/composable_kernel commit: 22d9f99942]
2025-10-30 18:17:48 +02:00
Jimniu
f5200e06c6 Jimniu/ ck tile gemm stride validation (#2710)
* Add stride validation for gemm_basic

* change default stride statement

* Fix build error

* Fix pre-commit failure

* Addressed PR comments

* clear the redundant code

* clang format

---------

Co-authored-by: mkumar16-amd <mkumar16@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 8c4cb4f9f4]
2025-10-29 19:45:09 -07:00