987 Commits

Author SHA1 Message Date
Illia Silin
9f6fc9fe09 fix clang format (#2926) 2025-09-25 09:35:35 -07:00
linqunAMD
f076f207ce [CK] Fix misc issues in CK examples (#2890)
* [CK] Fix misc CK issues

* revert fp8 change, it causes CI fail.

* resubmit fp8 change
2025-09-24 11:28:20 -07:00
Yi DING
fe0a47a011 [CK_TILE] FMHA BWD Add D96 Instances (#2916) 2025-09-24 17:04:23 +08:00
Johannes Graner
15fff74503 [CK Tile] Implement Invoker pattern for remaining grouped convolution examples (#2894)
* Invoker for grouped_conv_fwd

* Invoker for grouped_conv_bwd_data

* Fix incorrect out layout identifier
2025-09-24 10:22:38 +02:00
Jingwei Liao
6805684788 add fmha dtype fp32 (#2914) 2025-09-24 15:28:39 +08:00
Thomas Ning
b159841a06 Fix the gfx950 numerical errors (#2911)
* Update grouped_gemm example and pipeline

* Find the root cause of the error: the transpose was not enabled correctly on gfx950

* Fix v3 pipeline, row and col major

* Disable f8 datatype tests, it fails on gfx950

* Fix the abd test by clearing the unsupported runtime argument

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
2025-09-23 22:54:52 -07:00
Haocong WANG
7b16782d7c [CK_TILE] Fix fmha bwd (#2865)
* Fix fmha bwd filter

* remove unnecessary change

* enable test cases

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-23 19:59:27 +08:00
Haocong WANG
b6e8994386 [CK_TILE] FMHA FWD bug fix (#2888)
* tempsave debug

* fix the bug in fmha fwd_kernel

* Remove unnecessary changes

* Fix the buggy part

* remove fmha fwd known failure cases
2025-09-23 15:00:46 +08:00
Yi DING
ad259eeae2 FMHA BWD Avoid SetZero (#2799) 2025-09-23 14:37:48 +08:00
Enrico Degregori
3d29bff2f0 Wmma support for multiple ABD GEMM (#2803)
* multi_abd wmma support:

 - Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)

* Fix bug in device print function

* Fix unused template parameter

* Fix batched gemm for multiABD gridwise implementation

* Fix gemm_universal_reduce with multiABDs gridwise implementation

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-22 18:49:06 -07:00
jakpiase
624c46866e [CK_TILE] Add conv bwd weight two stage support (#2855)
* resolved conflicts

* add conv bwd weight twostage

* fix one file

* fixes after review

* fixes

* fixes

* Fix

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-09-22 15:31:25 +02:00
Sami Remes
4363a82bd6 [CK_TILE] Tensor-wise scaled quant gemm kernel (#2846)
* rename gemm_group_quant to gemm_quant

* Add TensorWise quant mode

* Cshuffle epilogue tests with tensor scaling

* Add tensor quant to example

* Don't use readfirstlane for reading scales - doesn't work for some reason

* Add to changelog

* revert include - from a merge problem?

* revert common.hpp include

* revert host.hpp include

* remove unused utility function

* rename quant pipeline problem

* refactor quant tests

* remove aquant utils

* use TEST_F

* fix all tests by changing gemm config

* Use typed tests

* fix copyright
2025-09-19 16:52:35 -07:00
Illia Silin
b765fe78f3 Revert "[CK_TILE] Add sequence padding and variable length support in fmha (a…" (#2883)
This reverts commit 86dd59cd01.
2025-09-19 08:15:02 -07:00
Jeff Huang
86dd59cd01 [CK_TILE] Add sequence padding and variable length support in fmha (a… (#2851)
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)

 - Group Mode Padding: Introduces the `-s_qpad` argument to support
   physically padded layouts. Kernels now use padded start pointers
   (`seqstart_padded_*_ptr`) for memory addressing.

 - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
   arguments for efficient processing of variable-length sequences by
   passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.

 - FMHA examples: Support padding and variable length both in
   group and batch mode. Dispatcher is updated as well (dispatch to
   kPadSeqLenK enabled pipeline).

 - New padding test cases: Add padding test cases to `smoke_test_fwd.sh`,
   and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well.
   These test cases and benchmarks specifically validate/benchmark the
   new padding and variable-length functionalities in both group and batch modes.

* [CK_TILE] Fix build error in fmha unit tests

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-19 17:36:49 +08:00
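
The "cumulative effective lengths" mentioned in commit 86dd59cd01 above can be illustrated with a prefix sum. This is only a minimal sketch: the function name `cumulative_offsets` is hypothetical, and the assumption that the kernel expects a leading zero followed by running totals (the usual `cu_seqlen` convention) is not confirmed by the commit message.

```python
from itertools import accumulate


def cumulative_offsets(lengths):
    """Turn per-sequence lengths into cumulative start offsets.

    Hypothetical helper: the result has len(lengths) + 1 entries, begins
    with 0, and entry i is the start offset of sequence i.
    """
    return [0] + list(accumulate(lengths))


# Effective (unpadded) lengths give logical offsets, while padded lengths
# give the physical start offsets used for memory addressing.
effective = cumulative_offsets([3, 5, 2])
padded = cumulative_offsets([4, 8, 4])
```

Under this assumption, the difference between consecutive entries of `effective` recovers each sequence's effective length, while `padded` gives where each sequence physically begins.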
ltqin
dd249f1cd6 Add input fp8 and output bf16 attention (#2726)
* change host using fp16 to check

* fp8 to fp8 compare

* rewrite input parameters

* add not squant

* remove some output code

* for scale = 1

* format

* saturates only for fp8

* add fp8bf16 data type

* add fp8bf16 data type

* fix test fp8 code

* add run_fp8bf16_tests

* change fmha fwd example parameter(adding fp8bf16)

* Support fp8bf16 for Aiter

* Support aiter fp8bf16 in c++

* fix comment about fp8 in readme.md

* add fp8fp32

* add fp8fp32 test

* remove range_q etc.

* format

* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type

* add fp8bf16 to data_type function

* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8

* format

* reset atol for fp8

* fix bug for atol

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-09-19 14:26:43 +08:00
Mateusz Ozga
30ab1d6a71 [CK_TILE] Multiple-ABD GEMM example (#2788)
* Multi ABD - initial commit

* Clang-format fix

* block gemm, unify the name of CDataType

* Apply changes to mem-pipeline

* Rollback prefix for DType and Layout

* Gemm Kernel Basic, rename

* WMMA config

* Grouped GEMM

* Clang-format

* Dropout, name

* Review v2

* Move element_wise fn to unary, remove old fns

* clang-format

* Fix issue review

* WP operator adjust to universal gemm

* v2 prepare

* Remove unused comment

* Remove vectorsize

* Rollback

* Adjust pipeline for abd

* Shuffle argument

* CI-fail fix quant

* Fix ag_br pipeline

* Failing tests

* Typo

* Single argument support
2025-09-19 01:14:11 +02:00
Yi DING
7ee7915e94 [CK_TILE] FMHA Test Ignore Known Errors (#2872) 2025-09-18 16:51:21 +08:00
yinglu
dd7af118d7 TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852)
* Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)"

This reverts commit 03b59f8c76.

* fix compile error on gfx12x

* only run tf32 example on gfx942

* only build tf32 instance on gfx942

* ckProfiler:only support tf32 in gfx942

* delete useless messages
2025-09-17 14:50:15 -07:00
Aviral Goel
7c934b72ab build(grouped_gemm): added appropriate compiler flag to resolve numerical error for fp8 on gfx950 (#2868) 2025-09-17 11:04:21 -07:00
Michał Kulikowski
5c4f52a02a [CK][Examples] - fixing grouped_conv_bwd_weight command parser. (#2840)
-added parameter to change group count for grouped_gemm examples.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-09-17 10:39:48 -07:00
Aviral Goel
db79fad16f fix(grouped_gemm): pipeline selection when tail_num varies per group and leads to numerical error (#2863)
* fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num

* WIP: add temp config to stress test numerical error correction

* refactor: remove comments
2025-09-16 18:43:19 -07:00
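
Why a per-group tail number matters (commit db79fad16f above) can be sketched as follows. Everything here is an illustrative assumption rather than the CK implementation: the helper names, the `k_per_block` and `prefetch_stages` parameters, and the rule "tail = loop count modulo prefetch stages" are all hypothetical.

```python
def num_k_loops(k, k_per_block):
    """Number of main-loop iterations over K (ceiling division)."""
    return (k + k_per_block - 1) // k_per_block


def tail_num(k, k_per_block, prefetch_stages):
    """Hypothetical tail number: iterations left after full prefetch rounds."""
    return num_k_loops(k, k_per_block) % prefetch_stages


def tails_per_group(group_ks, k_per_block, prefetch_stages):
    """Compute the tail number separately for every group's K dimension."""
    return [tail_num(k, k_per_block, prefetch_stages) for k in group_ks]
```

If groups with different K values yield different tails, a pipeline variant chosen once for the whole launch would be wrong for some groups, which is consistent with the numerical error the commit describes.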
Wojciech Laskowski
f97b2a3f5d Added wmma support for gemm quantization: (#2841)
- profiler for gemm quantization for DL/XDL
- tests for gemm quantization for DL/XDL
- implementation for gemm quantization for WMMA
- profiler/tests for gemm quantization for WMMA

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-16 16:23:29 -07:00
Aviral Goel
48e08c6429 test(grouped_gemm): add gtests for the example/grouped_gemm_preshuffle to ensure its integrity (#2811)
* test(grouped_gemm): add gtests for the example to maintain its integrity

* test(grouped_gemm_preshuffle): add prefill variant to testbed to cover wider range

* fix: removed residue code to make b_shuffle() work again

* test(grouped_gemm_preshuffle): limit the test suite to gfx942 arch as it fails on gfx90a

* build: add gfx950 as build target for gtests

* test(grouped_gemm_preshuffle): temporarily disable fp8 prec tests due to numerical errors

* fix(grouped_gemm_preshuffle): resolved fp8 tests failure on gfx950 by adding correct compiler flag
2025-09-16 15:43:30 -07:00
Haocong WANG
59cb906482 [CK_TILE] fix bug when iperm =0 in fmha fwd (#2820)
* fix bug when iperm =0 in fmha fwd

* Disable f8 fmha smoke test until fix pr merged

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-09-16 15:07:10 +08:00
Po Yen Chen
7fbc9d6c97 [CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833)
* Re-mapping thread block indices for causal=True kernels

* Use more intuitive remap_opt value

* Fallback to origin remapping if seqlen_q >= 64K

* Use GenericAttentionMask to reduce mask computation

* Avoid unnecessary boundary check for IsMasking=false case

* Fix wrong kernel entry specifier

* Add s_nop to prevent delay wave0-3

* Refine scheduling

* Remove unnecessary sched_group_barrier()

* Move sched_group_barrier() call to scheduler

* Replace inline asm s_setprio with intrinsics

* Rephrase comments

* Expend some o_acc rescaling insts to avoid SIMD idle

* Fix block idx special mapping logic

* Tune block index mapping for causal=False cases

* Tune block index mapping for causal=True cases

* Fix wrong vmcnt()

* Remove parameter name

* Use boolean option for turn on/off causal mask

* Update benchmark_fwd_v3.sh option usages

* Add option if compiler supports it
2025-09-16 11:32:38 +08:00
linqunAMD
f22740df82 Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725)
* Enable xdl in gfx11 & gfx12

* update cmake file

* fix all instance build (cmake)

* fix batched_gemm_gemm(cmake)

* rebase cmake files

* fix cmake build error

* remove CK_ENABLE_DYNAMIC_WARP_SIZE

* update cmake build error2

* fix gfx11 build

CK_USE_XDL is enabled on gfx11 and gfx12

* fix gfx10 build

* fix gfx11 error

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-09-15 10:59:25 -07:00
Illia Silin
03b59f8c76 Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)
This reverts commit c51102144f.
2025-09-15 08:27:04 -07:00
lym
c51102144f feature:tf32:add initial conv3d fwd kernel support (#2763) 2025-09-15 21:03:00 +08:00
Cong Ma
e5d73da2da [CK TILE GEMM] set correct value to TiledMMAPermuteN_ (#2839)
- TiledMMAPermuteN_ should be set to true when the config is GemmConfigPreshufflePrefill
2025-09-13 20:54:08 -07:00
linqunAMD
b0ee317d83 [CK_TILE] Enable ck_tile tests on gfx11 and gfx12 (#2821)
* [CK_TILE] Enable ck_tile test on gfx11 & gfx12

* revert an unnecessary change

* enable pk_int4 on gfx11 & gfx12

* revert .pre-commit-config.yaml
2025-09-12 12:45:14 -07:00
Anton Gorenko
847834a408 [CK_TILE] FMHA Reduce build time by disabling instances that are not tested (#2834)
* Use lse = false for PagedKV tests

There are no instances with lse = true so splitkv is actually launched
as a fallback.

* Reduce build time by disabling instances that are not tested
2025-09-12 12:44:25 -07:00
Wojciech Laskowski
b25d4d684a WMMA support for GEMM reduce (#2823)
Added gemm + reduce instance library for RDNA4. This includes:

- New device implementation running GEMM and reduction kernel
- instances for wmma (xdl parity)
- examples for wmma (xdl parity)
- tests for existing xdl and wmma
2025-09-12 21:36:43 +02:00
linqunAMD
321627aec5 Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724)
* Fix example

* fix build error

* update pk_i4 & moe test case

* fix all instance build (examples)

* fix batched_gemm_gemm (example)

* disable example_gemm_bias_softmax_gemm_permute on gfx11

* remove unnecessary disable gfx11

* update tests

* update tests2
2025-09-12 08:17:07 -07:00
Michał Kulikowski
ffe9775e70 [CK][EXAMPLES] (#2826)
-Added parameter to enable/disable verification and timing of kernel in various examples that missed it.
-Added parameter to change number of groups to execute in grouped_gemm_examples.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-09-11 12:33:00 -07:00
Aviral Goel
f3239395dc fix(copyright header): add header to missing files (#2807) 2025-09-11 12:27:08 -07:00
Cong Ma
2ed39f8d91 [CK TILE GEMM] Fixed the regression issue with transpose C in Quant Gemm (#2819)
The numerical error was introduced after merging row/col quant and is now fixed.
2025-09-11 11:38:16 -07:00
linqunAMD
60d3e8f504 [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 (#2808)
* [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12

* fix gemm_splitk_two_stage

* revert .pre-commit-config.yaml
2025-09-11 07:27:33 -07:00
linqunAMD
0b9a638f26 [CK_TILE] fix example reduces, permute and elementwise on gfx11 & gfx12 (#2810)
1. Refine Reduce2dShape to support both wave32 and wave64
2. Fix example reduce, permute and elementwise on gfx11 and gfx12

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-11 12:41:20 +08:00
Khushbu Agarwal
80a61afb9b [CK-Tile] Fix quant example code (#2813)
* initial commit

* remove extra files

* fixing errors

* updated ReadMe file for mapping of diff quants with diff configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

---------

Co-authored-by: Cong Ma <congma13@amd.com>
2025-09-10 17:15:39 -07:00
Anton Gorenko
ec006bb8e0 [CK_TILE] Add gtests for FMHA (#2744)
* Improve random number generation

* use different seed for each input (Q, K, V...);
* use deterministic generation of:
  * seqstart_q/k (for group mode);
  * block_table (for paged-kvcache);
  * cache_batch_idx (for kvcache);

* Extract arg_parser-related code from run functions to use them as tests

* Split examples into main programs and fmha runners, build instances separately

* Add dummy tests that use instances and runners

* Fix a missed corner case of f32->f8 conversion

When the value is < min f8 denormal but > min f8 denormal / 2, it must be
rounded to min f8 denormal (i.e. 0b1), not to 0.

* Fix incorrect fp8 scales for P and O in validation code

DataTypeConfig was incorrectly compared with fp8_t.

* Add host generation of dropout random values and use it for validation

Previously host validation (reference_batched_dropout) used random
numbers generated by BlockDropout of the kernel, meaning that incorrect
generation on device (bad distribution, repeated numbers, too many zeros,
etc.) would not trigger any validation errors.

* Implement tests from smoke_test_bwd.sh

* Return result as enum to distinguish failure and missing instance

* Add tests for bwd features: bias, alibi, dropout

* Implement tests from smoke_test_fwd.sh

* Pass seqlen_q/k as vectors to fwd and bwd runners

* Add tests for fwd features: bias, alibi, dropout

* Add tests for pagedkv and splitkv

* Fix conditions when to use splitkv and pagedkv kernels

splitkv was executed only when use_kvcache was true, where use_kvcache == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size).
In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1.
In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance.

* Add tests for appendkv

* Use is_v_rowmajor = true because there are no instances with column layout anymore

* Split public and private compile options for instances

Tests and examples need to know only about CK_TILE_FMHA_FWD_*_API.

* Improve parsing validation in bias and mask

* Pass bias as string for consistency with mask

* Catch parsing and other exceptions

* Add bwd test for deterministic flag

* Initialize fp8 tensors (-init=ufq) similarly to uf

* Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null

seqlen_k cannot be used to determine padding when seqlen_k_ptr is
provided. The actual seqlen_k is taken from seqlen_k_ptr[b].
Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr
may contain arbitrary values.
In the example or tests this produces incorrect results with appendkv
(for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8).

* Fix use_pagedkv value when kvcache = true but page_block_size = 0

In this case block_table_ptr is nullptr which is accessed in the kernel.

* Clean up bwd tests

* Unify fwd tests for f16/bf16 and fp8

* Use better explicit instantiation declaration for fmha_bwd<2>

* Use the same seed for all tests, allow to override it with env variable

* Undo clang-format of one irrelevant file

For some reason my local clang-format-18 and the one in CI work differently.

* Do not build instances and tests on unsupported archs

* Build instance libraries as OBJECT library

* CI: Enable sccache for HIP

There are source files with LANGUAGE HIP, they need
-DCMAKE_HIP_COMPILER_LAUNCHER=sccache

* Add tests to REGRESSION_TESTS

* Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0

The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are
smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf.

* Pass CK_TILE_FMHA_FWD_*_API as INTERFACE compile options

The instances don't actually depend on them, only examples and tests do.
Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS
without recompiling instances that are already in ccache.

* Fix formatting and names
2025-09-10 08:06:14 +05:00
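
The f32->f8 corner case fixed in commit ec006bb8e0 above can be sketched with plain round-to-nearest arithmetic. This is an illustration under stated assumptions, not the library's conversion routine: the function name is hypothetical, and `2**-9` is used only as an example smallest denormal (the value for an E4M3 encoding with exponent bias 7); the format used by the kernels may differ.

```python
def quantize_to_denormal_steps(value, min_denormal):
    """Return how many min-denormal steps a small non-negative value rounds to.

    Hypothetical helper. Python's built-in round() on a float rounds to
    nearest with ties going to the even integer, so any value strictly
    above min_denormal / 2 yields 1 (bit pattern 0b1) rather than 0.
    """
    if value < 0:
        raise ValueError("sketch handles non-negative values only")
    return round(value / min_denormal)


MIN_DENORMAL = 2.0 ** -9  # example value only, see the assumption above

above_half = quantize_to_denormal_steps(0.75 * MIN_DENORMAL, MIN_DENORMAL)
below_half = quantize_to_denormal_steps(0.25 * MIN_DENORMAL, MIN_DENORMAL)
```

Here `above_half` lands on the smallest denormal and `below_half` flushes to zero, which is the behaviour the commit message describes for values on either side of half the smallest denormal.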
linqunAMD
c254f3d7b4 [CK_TILE] Refine Generic2dBlockShape to fix ck_tile example 2,10,11,14 on rdna3 and 4 (#2795)
BlockWarps and WarpTile in Generic2dBlockShape are wave-size dependent, which causes a mangled-name mismatch between the host and device side.

Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape
2025-09-10 08:29:20 +08:00
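
The wave-size dependence behind commit c254f3d7b4 above can be sketched as follows. It is a simplified illustration with a hypothetical helper name; it assumes only that a block's warp count is its thread count divided by the wave size (32 or 64, as referenced elsewhere in this log), and does not reproduce how Generic2dBlockShape actually derives BlockWarps and WarpTile.

```python
def warps_per_block(threads_per_block, wave_size):
    """Number of warps in a block for a given wave size (hypothetical helper)."""
    if threads_per_block % wave_size != 0:
        raise ValueError("threads_per_block must be a multiple of wave_size")
    return threads_per_block // wave_size


# The same thread count gives different warp counts per wave size, so a
# type parameterised on the warp count differs between the two cases,
# while one parameterised on the thread count does not.
wave64_warps = warps_per_block(256, 64)
wave32_warps = warps_per_block(256, 32)
```

Keying the shape on the thread count, as the commit describes, keeps the template arguments identical on host and device, with the wave-dependent values computed inside the shape.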
linqunAMD
df4ee556d6 [CK_TILE] Fix flatmm on gfx11 and gfx12 (#2790)
1. Correct shuffle_b and MakeBFlatDramTileDistribution according to WMMA warp layout
2. Add FlatmmConfig16_Wmma for gfx11 and gfx12
2025-09-10 08:28:00 +08:00
Cong Ma
82890192dd [CK TILE] Support fp8/fp16 with pk_int4_t as data types for tensors A and B (#2805)
- Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats
- Implement A(bf8) B(i4) support in universal GEMM
- Use new implementation for i4 to fp8 conversion in Block Scale
2025-09-09 16:40:52 -07:00
Yi DING
91178b4011 [CK_TILE] Fix kname & typo in FMHA BWD (#2809) 2025-09-09 15:08:00 -07:00
aledudek
e82ccbdaf7 [CK_TILE] Fix Batched GEMM Example GPU verification (#2800)
Added more batched GEMM test cases
2025-09-09 09:30:57 +02:00
lalala-sh
75570d0fa8 [CK_TILE] Add permuteN optimization to remove lds operation in c_shuffle (#2764)
* permuteN optimization to remove lds operation in c_shuffle

* add the change log

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-09-08 22:02:48 -07:00
lalala-sh
e4a7728903 [CK_TILE] add atomic IGLP scheduler for wp gemm (#2739)
* add atomic IGLP scheduler

* clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-09-08 14:57:14 -07:00
kyle-256
4eb415829e [CK_TILE] Implement Row/Col quant grouped gemm (#2786)
* Add cshuffle epilogue test

* add the poc implementation to the epilogue and tests

* refactor cshuffle epilogue

* WIP: adding tensor/tile usage to scale_tile

* fix usage of tile_elementwise_inout

* add gemm_quant_kernel for generalizing gemm quant kernel

* Add problem specific to different quants, add QuantType to Traits

* Add quant_type to quant_kernel template parameters

* Create aq/bq_block_windows and views depending on QuantType

* Use tile windows as inputs in cshuffle epilogue

* Fix some issues in epilogue

* initial new example code for new general gemm quant kernel test

* Fix issues in kernel

* Add verification check for rowcol Quantmode

* use AccDataType instead of AQ in pipeline

* fix aquant preshuffle

* fix formatting

* some cleanup

* remove gemm_aquant_basic.cpp

* remove gemm_aquant_kernel.hpp

* fix tests for the renamed quant kernel

* fix formatting

* clean example files

* fix some merge conflicts

* fix preshufflequant rename issue

* updating

* fix some templates after merging with develop

* fix test preshuffle parameter

* fix formatting

* updating kernels

* change update user

* test username

* update quant_grouped_gemm example

* update example

* Unify bquant kernel to the common quant kernel

* remove bquant kernel also from common header

* fix formatting

* clean up commented code

* update grouped_gemm_quant example

* fix formatting config hpp

* fix merge mistake

* Non-const for movable windows

* fix formatting

* update tileloop pipeline

* Fix grammar in README

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Remove #include<bit> and clean up example

* fix strides

* Add some descriptions for move_windows

* fix tensor print bug

* update quant_grouped_gemm example

* remove useless code

* cleanup code

* clean up code & format code

* fix compile & running bug in grouped_gemm example

---------

Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: liyingli <liyingli@amd.com>
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-09-08 10:25:57 -07:00
Yi DING
5ff205ca79 Reapply "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757) (#2761)
This reverts commit 038ea82315.
2025-09-08 09:21:14 -07:00
Aviral Goel
e279e9420e feat(grouped_gemm): add preshuffle v2 support to grouped gemm example (#2721)
* docs(README): update readme with new build instructions

* feat(grouped_gemm): add support back for non persistent kernel

* refactor(grouped_gemm): simplify tensor creation

* refactor(grouped_gemm): Persistance is now GemmConfig value for easier management

* chore(grouped_gemm): add print statements to ease debugging

* WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration

* fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity

* WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines

* chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle

* style: clang format work so far

* BUG!(grouped_gemm_kernel.hpp): figured out a potential bug causing numerical errors in preshuffle pipeline

* fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors

* refactor(gemm_preshuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm

* chore(grouped_gemm): add/remove debug comments and debug print statements

* feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes

* chore(gemm_profile): add new argument combinations

* fix: branch cleanup, formatting, refactoring

* fix: branch cleanup, formatting, refactoring

* chore(changelog): update changelog to reflect new feature

* address review comments & nit
2025-09-07 14:18:35 -07:00