Commit Graph

432 Commits

Author SHA1 Message Date
amd-khushbu
52044c443c resolve merge conflicts with dev 2025-12-03 00:41:54 +00:00
amd-khushbu
b0645c93bf Debugging layout fix 2025-12-03 00:35:53 +00:00
Aviral Goel
6cb0bc2d11 feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193)
* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* feat(gemm_quant): add RRR and CRR layout support for aquant gemm

* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm

* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests

* fix: test cases compilation failure due to PR# 2095

* fix: make condition to filter out tests for gfx950 more explicit

* need to support the gfx950

* fix: add layout suppot for gfx950

* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)

* WIP: add support for pk_int4_t for aquant mode layouts RR and CR

* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant

* fix: compile time error for gfx950

* fix: minor bug where is_a_load_tr_v() was mising

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* docs: update changelog with new layout support info

* Update CHANGELOG.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* refactor: break test instances into multiple cpp files to reduce build time (#3319)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* refactor: break test instances into multiple cpp files to reduce build time

* chore: rename file for better code readability

* fix: merge conflict resolution

* fix: remove memory pipeline because new layout is not compatible

* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()

* refactor: address review comments

* solve the conflict

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-02 14:59:07 -08:00
Thomas Ning
8459d389ad disable the gfx90a (#3336) 2025-12-02 07:27:37 -08:00
Ville Pietilä
66832861ad [CK_TILE] Merge multiple fwd convolution groups into a single GEMM batch. (#3136)
* Merge fwd conv groups in CK Tile.

* Fix building CK fwd convs.

* Add number of merged groups to conv fwd kernel name.

* Get number of merged groups from conv config.

* Rename GemmConfig to ConvConfig.

* Clean-up TODOs.

* Check that number of conv groups must be divisible by the number of merged groups.

* Improve error handling in the conv fwd example.

* Fix clang-format.

* Fix group offsets.

* Fix merge problem.

* Address feedback from code review.

* Fix clang-formatting.
2025-12-02 15:23:32 +02:00
msaffari-amd
2d3020e5b0 [CK Tile] batched contraction kernel generalizing (#3126)
* Add help for example

* Refactore the compute reference batched contraction to manage stride-aware calculation and some code cleanings

* Add stride-aware reference for batched contraction with independent D tensor layouts

* Add -num_d argument for runtime D tensor count selection in batched contraction

* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs

* Add descriptor-based architecture for batched contraction multi-dimensional stride support

* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0

* Add complete multi-dimensional stride support via descriptors

* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm

* Clean up batched contraction: remove old UniversalGemmKernel path

* Clean up batched contraction: remove legacy paths and finalize docs

* Optimize batched contraction example: pass dimension sizes not vectors

* correct the reference calculation, unsigned int to int

* Fix batched_contraction C++17 build errors for gfx90a CI
2025-12-02 13:30:27 +01:00
Yi DING
f211156ce6 [CK_Tile] Flatmm MX Cleanup & Explicite Offset Calculation (#3286) 2025-12-02 14:21:12 +08:00
Cong Ma
23fb253c4e Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325)
* [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples

* [CK TILE GEMM Quant] update tile distribution of aquant

* [CK TILE GEMM Quant] update aquant register offset calculation

* [CK TILE GEMM Quant] Reimplement aquant register offset calculation

* [CK TILE GEMM Quant] Add more unit tests of Aquant

- Test M128xN128xK128

* [CK TILE GEMM Quant] Add more comments to Gemm Aquant
2025-12-01 15:04:37 -08:00
Aviral Goel
004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
Sami Remes
f981554c39 [CK_TILE] Fix Quant GEMM build (#3320)
* Fix build

* Fix ck_tile example 38 & 40

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-11-28 20:33:53 +08:00
Cong Ma
30727c48fc Tile engine for streamk (#3157)
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.

- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
  It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
  the streamk implementation reaches stability.

* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM

* [CK TILE STREAMK] Refactor: Extract common utility functions.

* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation

* Add pre-commit

* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK

* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK

* [CK TILE STREAMK] Update Jenkinsfile

* [CK TILE Engine] Update StreamK tile engine help message

Remove default value messages as they are automatically printed

* [CK TILE Engine] Update StreamK tile engine

- Remove namespace reboot

* [CK TILE Engine] Update StreamK tile engine

- Fix merge error
2025-11-27 15:49:57 -07:00
arai713
24d88d2472 [CK_TILE] Move DataTypeTraits into a Common File (#3146)
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-11-27 09:09:54 -08:00
Max Podkorytov
79aae7c7f7 [CK Tile] enable building examples by default (#3259)
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch

* fix cpp17 compile error in the ck-tile examples

---------

Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-11-26 16:24:44 -08:00
Aviral Goel
35a4b26af0 fix: add dynamic selection of pipelines for aquant mode (#3282)
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly
2025-11-26 10:58:09 +04:00
khuagarw
3447196c8a resolving merge conflicts 2025-11-26 01:22:54 +00:00
khuagarw
f290428926 working code for preshuffleb 2025-11-26 01:13:25 +00:00
khuagarw
7788979fbe initial commit 2025-11-25 23:38:41 +00:00
Aviral Goel
cd47293869 chore(copyright): update copyright header for experimental & example directory (#3292) 2025-11-26 03:09:39 +04:00
Bartłomiej Kocot
00dfa2f2ce [CK TILE] Grouped Conv Explicit Gemm (#3289)
* [CK TILE] Grouped Conv Explicit Gemm

* fixes

* apply builder fixes
2025-11-25 23:28:35 +01:00
Khushbu Agarwal
37ea160088 [CK-Tile] fix block scale example for gfx1201 (#3283) 2025-11-25 13:10:28 -08:00
Bartłomiej Kocot
9ac2666d5b [CK_BUILDER] Add grouped conv bwd ck tile traits (#3281)
* [CK_BUILDER] Add grouped conv bwd ck tile traits

* copilot fixes
2025-11-25 14:57:43 +01:00
Aviral Goel
d85f065b15 chore(copyright): update copyright header for example directory (#3273)
* chore(copyright): update copyright header for codegen directory

* chore(copyright): update copyright header for example directory
2025-11-24 18:02:41 -08:00
rocking
229d43ea0c Fix batch prefill compile fail in aiter (#3279)
* Fix batch prefill aiter compile fail

* Fix compile error
2025-11-25 09:46:32 +08:00
Thomas Ning
de6a9590ab Reorganize of KPack in GEMM (#3247)
* add the reorganize of KPack

* fix the compilation error

* fix the compilation error
2025-11-24 12:38:59 -08:00
khuagarw
04aaf97192 debugging PermuteN 2025-11-24 18:58:51 +00:00
Khushbu Agarwal
8111572785 [CK_Tile] Support for preshuffle weight(B) quant tensor for block scale gemm (#3165)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* addressing review comments

* fixing CI issue

* addressing reveiw comments

* formatting

* formatting

* fixing aquant operator overlaoding

* formatting

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-11-24 07:48:42 -08:00
rocking
5948dbffe4 Support fp8 dynamic quantization for fmha (#3206)
* Support qscale for dynamic quant, remove static quant

* Support hdim=256

* Remove bias test case for fp8

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-11-24 16:28:25 +08:00
Johannes Graner
096f0a3b23 [CK Tile] Fix example for conv fwd + bias + clamp (#3235)
* Fix clamp not being applied correctly

* Apply group offsets to D tensors

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-11-24 07:36:26 +01:00
khuagarw
cf3f9b57b4 Merge branch 'develop' into lwpck-3985 2025-11-20 21:48:16 +00:00
khuagarw
a974a08c2e debugging 2025-11-20 21:26:21 +00:00
Emily Martins
2e4b8a8fc4 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed Revert "Add attn sink (#2892)" (#3250)
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab Add attn sink (#2892)
* enable attn sink

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update attn_sink script

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix some error

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* clang-format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd_kernel'mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix ci error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix format error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_bwd_pipeline_default_policy.hpp

* Update fmha_fwd_runner.hpp

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* update splitkv_pipline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update splitkv&pagedkv pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* add sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update attn_sink result log

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update smoke_test_fwd_sink.sh

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test file

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test script

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

* use constexpr kHasSink for sink in fmha pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update by pre-commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove causal mask setting logic from mask.hpp

Removed the mask setting logic for causal masks.

* fix ci error that some usage of lamada not support in c++17

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update remod.py

* add smoke sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update fmha_pagedkv_prefill.py

* Update FmhaFwdPipeline parameters in fmha_fwd.py

* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix c++17 unsupprot error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

* Fix formatting of sink_seq_end assignment

* Fix indentation for sink_seq_end assignment

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-20 19:24:05 +08:00
Yi DING
47e2ed838e [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal
1eb26460aa [ck_tile] Pooling example - Improved tile sizes (#3233)
* improved tile sizes

- modified tile sizes for improved example performance

* Update example/ck_tile/36_pooling/pool3d.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-11-19 15:30:18 +01:00
khuagarw
2275548400 debugging permuteN 2025-11-18 21:59:30 +00:00
Aviral Goel
ac70206b2c feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle… (#3225)
* feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle kernel(s) along with unit test

* docs: Update CHANGELOG.MD
2025-11-18 09:32:27 -05:00
Yi DING
b6720531de [CK_TILE] MX Flatmm Split kernel instances (#3207)
* [CK_TILE] MX Flatmm Split kernel instances

* Fix flatmm example compile
2025-11-18 13:46:30 +08:00
khuagarw
f5856af85e Merge branch 'develop' into lwpck-3984 2025-11-14 21:36:36 +00:00
BingYuan.Zhou
4d629cd2b0 fix build error (#3195)
Co-authored-by: root <root@hjbog-srdc-39.amd.com>
2025-11-14 09:46:13 +08:00
khuagarw
07700cc4fa fixing CI failures for grouped quant gemm 2025-11-14 01:09:09 +00:00
khuagarw
36f2f873de updating cmake file 2025-11-13 22:18:20 +00:00
khuagarw
48e75593f2 updating readme and correcponding gemmconfigs 2025-11-13 22:16:58 +00:00
khuagarw
903800f9af rebase with develop 2025-11-13 21:39:42 +00:00
Khushbu Agarwal
fb41a7b73b fixing ambiguous shuffle definitions (#3175)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-11-12 23:44:12 -08:00
Cong Ma
6fd8ddabe7 [CK TILE GEMM] Refactor block_scale_gemm examples (#3181)
* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Set quant group size to (1, 1, 64) for targets excluding gfx950, where warp tile size (16, 16, 128) is incompatible.
2025-11-12 23:43:40 -08:00
khuagarw
b8b5709ccf enable preshuffle quant with permuteN 2025-11-12 21:32:02 +00:00
Yashvardhan Agarwal
299c9bca1b [CK_Tile] Pooling example readme update (#3174)
* pooling example readme update

- The updated readme explains the transformations of the pooling kernel
using a mermaid diagram

* Update example/ck_tile/36_pooling/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* resolve comments

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-11-12 07:30:20 -08:00
khuagarw
075c36b5f9 remove debugging statements 2025-11-11 22:33:06 +00:00
khuagarw
869bc5b77b adding preshuffle quant as new parameter and its associated new files 2025-11-11 22:08:53 +00:00