Commit Graph

937 Commits

Author SHA1 Message Date
Enrico Degregori
f16d9100e4 Multi AB support for wave transfer (#3578)
* Add multi AB support to wave transfer

* Improviments to multi ABD examples

* Add instances and use intrawave v1 instead of interwave

* Apply changes to other transfers

* Wave transfer: add support for multiple internal vgpr buffers

* Fix compilation error gfx11
2026-01-29 10:29:40 -08:00
Khushbu Agarwal
9b168082b7 [CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629)
* initial commit

* preshuffleQuant support for ABQuant

* fix mxfp4 to use correct QuantGroupSize

* addressing review comments and seperated Preshufflequant for A and B

* updated grouped gemm example for updated traits definition

* fix for CI failure

* updated grouped_gemm_abquant test for updated traits definition

* updated grouped_gemm_abquant test for updated traits definition
2026-01-28 19:45:09 -08:00
ltqin
654bec3362 Fix block scale init value (#3666)
* Make blockscale descale range adaptive to data type max value

* format
2026-01-28 12:37:15 -08:00
Yi DING
8e3d84aba3 [CK_TILE] ABQuant New Preshuffle (#3638)
* Refactor

* Gemm quant improvement

* Change preshuffle

* Fix

* Fix grouped gemm ut

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-27 23:46:49 -08:00
Enrico Degregori
2e49b6b2f7 Padding support for wave transfer (#3537)
* Add padding support with transpose

Also move check before writing storing is_src_valid during reading

* Add/modify instances to use wave transfer for gemm universal

Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer

* Fix clang format

* Modify example

* Fix bwd data

* Add restriction for wave transfer with padding and transpose

Add test case which shows this limitation

* Fix validity checks 8 bit types

* Add validity check gemm_bias_add_reduce

* Add validity check grouped gemm tile loop

* Fix validity checks new flavours

* Minor fixes

* Fix clang format
2026-01-26 12:57:09 -08:00
Aviral Goel
b8751e505d feat: Add Interwave scheduler for aquant memory pipeline (#3540)
* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipline in aquant

* test: add unit test for aquant memory pipeline

* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipline in aquant

* fix: compilation error on gfx950

* chore: remove debug statements from the code

* test: resolve merge conflict

* test: remove non rcr unit tests from test suite
2026-01-26 11:27:42 -08:00
ltqin
67f0b74ec6 Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit de5a1d730d.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-23 09:03:22 -08:00
Po Yen Chen
de5a1d730d Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit dd0b4294af.
2026-01-22 21:21:19 -08:00
kensclin
31a35ecab4 GEMM Blockscale ABQuant Optimization (#3620)
* GEMM Blockscale ABQuant Optimization

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix precommit error

* clean

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2026-01-22 09:39:38 -08:00
Bartłomiej Kocot
44f481a45c [CK TILE] Fix basic gemm pipelines (#3611)
* [CK TILE] Fix basic pipelines

* fixes
2026-01-22 08:11:18 -06:00
Linjun-AMD
0b13697a88 [CK_TILE][FMHA]Add new tile size for async (#3623)
* Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)"

This reverts commit 8f75869408.

* Add new tile_size for async pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-22 16:07:14 +08:00
ltqin
dd0b4294af Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel

* add block scale to kernel

* add smoke test

* format

* Revert "format"

This reverts commit 356c3c9706.

* only format my code

* format py

* fix auto not allowd in function prototype

* change instance tttt to ttff

* fix structured binding issue

* change s_acc elementwise op

* async pipeline add block scale

* add quantation P using shift exp2

* precompute (m - shift) once per row

* change blk scale seqstrt ptr name

* fix some name

* fix for  deduction guide

* fix some comments

* add P scale to qr_ksvs_pipeline

* add comment to idx_identity

* change the method of calculating descale block index

* unify naming style: use block_scale_ as name prefix

* unify naming style

* update the CHANGELOG.md

* Add FP8 block scale quantization support for FMHA forward kernel

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-21 20:58:26 -08:00
Michał Kulikowski
4c2c18ef48 [CK][Examples] Extending support for rdna3/4 part 4: (#3264)
* [CK][Examples] Extending support for rdna3/4 part 4:
-example_gemm_xdl_streamk
-example_gemm_xdl_fp16_fp8_v3
-example_gemm_xdl_fp16_v3

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

* [CK][Examples] Revert example\01_gemm\gemm_xdl_streamk parameters change.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

---------

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-21 18:10:16 -08:00
Yi DING
fcc9372c00 [CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615) 2026-01-21 09:54:46 +08:00
Erwin Terpstra
d5ae81b292 Implement batched gemm add relu gemm add for rdna4 (#3391)
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation

* wip: many fixes in implementation of batched gemm gemm multiple d

* wip: batched gemm gemm multiple d gridwise op compiling, not working yet

* fix: incorrect d0 grid indexing in batched gemm gemm multipled

* feat: add instances for batched gemm add relu gemm add

* chore: configure instance with low vector transfer size for odd sizes

* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense

* fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes

* fix: disable odd size tests on XDL archs

* chore: removed temporary logging

* chore: update some references to C tensor to E tensor

* Tentative fix for example template params

* Tentative fix for non-multi-D batched gemm gemm device impl.

* Tentative fix for xdl example template params

* Tentative fix for profiler build on gfx90a

* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions

* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply

* fix: make the gemm1 data types match what happens in the device op

* feat: add d0s/d1s datatypes and layouts to the device op type string

* chore: change element-wise op so addition happens in fp32

* chore: add static asserts for gemm0/gemm1 calculated wave sizes

* chore: also updated other element-wise ops to use fp32 calculations

* chore: log number of supported instances

* chore: update instance comment

* chore: disable kernel timing in example by default

* fix: gemm1 wave size calculation

* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions

* chore: remove increased tolerance in batched gemm gemm multiple d example

* chore: add comment explaining that verification fails for certain input values

* chore: clarify instance comment

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2026-01-20 13:06:59 -08:00
Max Podkorytov
91b4102a59 Add persistent async input scheduler for GEMM kernels (#3520)
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
  chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks

Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
  chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
2026-01-20 10:37:09 -08:00
Linjun-AMD
8f75869408 Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)
This reverts commit f3aafb9555.
2026-01-20 09:40:54 -08:00
Estevan Vedovelli
7d8bca7ddc Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598)
* Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions

Enables hipTensor to access the WMMA HW functionalities
for these combinations of datatype on gfx11 and gfx12.

* Fix change to contraction scale tests

* Fix clang-format
2026-01-20 09:39:57 -08:00
Linjun-AMD
f3aafb9555 [CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-19 15:22:33 -08:00
Erwin Terpstra
fe40a5d139 Implement batched gemm bias permute for RDNA4 (#3534)
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)

* wip: device struct for WMMA batched contraction multiple d based on new gridwise op

* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

* fix: failure to resolve template parameters when calling new function overload

* fix: passing reference type as parameter instead of underlying types

* fix: merge error caused duplicate definitions

* fix: make sure constness of template and parameters types match

* fix: don't compile batched contraction test on unsupported architectures

* feat: add example for new wmma implementation, and consolidate example code between platforms

* style: return inline instead of with branch

* chore: add extra assert on vector memory access sizes

* chore: clean up some unused variables

* fix: correct tail number calculation, added small cases and extra instances to the test

* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method
2026-01-17 08:30:27 +01:00
Michal Kulikowski
e1f2a44096 [CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation
-Fixing args num in ck examples 68/69

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2026-01-15 16:43:02 +01:00
Yung-sheng Tu
6df2d70143 Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)
* add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp

* add examples

* add instances to test

* remove duplicate code between examples
2026-01-15 07:19:31 -08:00
Jeff Huang
993d3e2f0e [FMHA] Enable page size 16 for batch prefill kernel (#3568)
* [FMHA] Enable page size 16 for batch prefill kernel

* Refactor batch prefill KV offset logic to simplify template arguments
- Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args.
- Add static assert to forbid `page_size=1` with vectorized layout.
2026-01-15 22:11:44 +08:00
Khushbu Agarwal
118afa455c [CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* G128 working for both prefill and decode shapes for preshufflequant

* clean up after merging with develop

* fixing group 64 for decode shapes

* non preshufflequant working for group size 128

* enable preshuffleb and preshufflequant with variour group sizes

* reduce build time by splitting example into diff datatype files

* Adding tests for preshuffleQuant

* address review comment

* fix for gfx1201

* compile time fix for gfx1201

* clang formatted

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-14 10:00:19 -08:00
Ville Pietilä
1fc5a3f3ac Build CK on Windows (#3458)
* CMakeLists.txt hack for Windows.

* Add Windows build instructions.

* Fix  type issue with variadic min function.

* Use std::common_type to fix the variadic min/max functions.

* Enable CPU guard compilation on Windows.

* Suppress warnings related to std::getenv on Windows platform.

* Git ignore the output directory on Windows platform.

* Powershell script for running tests and generating reports.

* Improve test logging.

* Disable non-conv tests.

* Fix Debug build on Windows.

* More debug build changes.

* Update Windows build instructions.

* Enable all tests.

* Test fixes.

* Suppress not found linker options warning.

* Update unsigned long literals and format specifiers to work correctly in Windows

* Fix conv 3D bwd weight bilinear tests on Windows.

* Revert changes on .gitignore.

* Clean-up CMake project file for Windows builds.

* clang-format

* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.

* Fix building examples on Windows.

* Update Readme.

* Remove the suppression of the deprecated warnings.

* Remove Windows specific min/max implementations from CK Tile math core.

* Remove unnecessary no-op on Windows.

---------

Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>
2026-01-14 07:31:45 -08:00
Linjun-AMD
717ed0b59f [CK_TILE][FMHA] Enable gpt-oss sink (#3490)
* Enable gptoss sink

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add gptoss sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update CHANGELOG.md

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix test args error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update test_fmha_fwd.cpp

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Revert "update sink test"

This reverts commit 970b4f1686.

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update valid sink_v in splitkv pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix clangformat error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix aiter scale error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qr_ks_vs.hpp

* div scale_s for sink_value

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* update sink_value with bias

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Fix typo in dropout parameter in fmha_batch_prefill_kernel

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* optimized some code

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix splitkv error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update sink reference

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* Update smoke_test_fwd_sink.sh

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-14 21:32:06 +08:00
Thomas Ning
00c46785a8 Shuffle fix for gfx950 (#3491)
* solve compiler issue

* solve the gfx950 mfma shuffle regression

* refactor jenkinsfile to handle arch name better

* [CK TILE] set divisor to count of thread along k dimension

* fix the compiler error

* solve degradation

* Finish the multiplies fix

* fix the scales

* solve compilation error

* solve the composes

* solve the error of tile sweeper

* fix the test and example

* fix for gfx950

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2026-01-13 09:21:29 -08:00
Erwin Terpstra
eb041079a3 Implement grouped gemm tile loop for RDNA4 (#3304)
* feat: grouped gemm tile loop support for RDNA4

* fix: removed extra parameter from grouped gemm example instance

* fix: FP8 check incorrectly enabling FP8 on RDNA3
2026-01-13 07:14:23 +01:00
Jeff Huang
c9f112b026 [FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545)
- Enable page_size=1 support in batch prefill codegen (linear layout only).
- Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly.
- Relax `kPageBlockSize` alignment assertion for the page_size=1 case.
2026-01-13 12:04:43 +08:00
yadaish
32408c8bc0 moe fp8 blockscale use nt (#3524)
* nt on fp8 blockscale

* some improve and tests needs to be fixed

* update

* fix format

* revert useless change

* revert any change in amd_buffer_coherence
2026-01-12 10:48:10 +08:00
damien-lejeune
4216d43da8 Dlejeune/ck tile 2d multiple reductions (#3147)
* WIP

* Add Unit tests for the Multi Reduction Kernel

* clang format

* Rename multiblock to threadwise

* Multiblock WIP

* Fix multi reduce multi block unit tests

* Multi Reduce Tile Engine: WIP

* refactoring + try addressing precision error

* Fix multiops examples

* Cleanup

* Clean up tile engine's reduce op

* Update changelog

* Fix remod/clang

* Fix dates

* Fix documentation & missing file

* Fix comments

* Use the update_tile api in the multi-block kernel

* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests

* Add TileParitioner

* Cleanup

* Add warning when no data to process, in the example

* Refactoring Reduce kernel Tile Partioner + cleanup

* Move the tile partioner to its own file

* Add missing includes

* Fix copyright header with update_amd_copyright_headers.py

* Fix change of interface in Reduce2dProblem

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-01-09 11:16:37 +01:00
Johannes Graner
0a474aa62f [CI, CK examples] Disable time_kernel for CI tests and examples (#3464)
* Disable kernel timing in tests

* default time_kernel = false in old CK examples
2026-01-07 16:30:57 +01:00
Khushbu Agarwal
aaa35f0bbf [CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* clean up after merging with develop

* added comments for the tile window and tile distribution encoding

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-06 12:46:59 -08:00
kyle-256
76696ace44 [CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452)
* update grouped_gemm blockwise kernel

* update config

* update kernel

* update examples

* remove test code for now

* sync test files with origin/develop

* update example

* fix code lint

* fix code-lint

* update test code

* run clang format

* run pre-commit

* update api
2026-01-06 12:36:04 -08:00
kensclin
2309c86054 [CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495)
* [CK_TILE] add preshuffleB mode for ABQuant GEMM

* fix precommit error

* use template method call for cvt_scale_to_fp32

* fix precommit error

* add test code

* fix precommit error

* switch abquant  gemmconfig to default

* Add changelog.md

* fix precommit error

* fix conflict
2026-01-06 12:35:01 -08:00
Jeff Huang
cc75a1dc5f [FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter

* add is_sglang_layout to  parameters

* add kv_offset_array_transform to batch async for page size 16

* add kv_last_page_lens to kernel

* change kv layout to [num_total_pages, page_block_size, hdim]

* format

* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
  batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)

* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024

* fix python format

* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]

* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcode values
2. Makes batch prefill kernel traits structures inherent from fmha fwd
   traits
3. Add some static check for Page size, vector size, hdim, ..., etc.

* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration

Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.

**Changes:**
*   Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
*   Updated Kernel, Pipeline, and Traits to template on these Enums.
*   Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
*   Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
*   Updated CodeGen scripts to support new parameters.

This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.

* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets is in same page within a tile.

* fix vgpr spill count

* remove unnecessary t2s functions

* add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py

* support linear kv cache layout

* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.

* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.

* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.

* update changelog

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2026-01-05 18:41:47 +08:00
Max Podkorytov
e339101e9c [CK-Tile] move out memory operation from cshuffle epilogue class (#3359)
* initial poc

* factor out common parts in operator()

* cv4

* rest of the universal gemm pipelines

* fix test

* remove boilerplate from tile engine

* fix example

* fix example

* format

* fix tests build for gemm

* remove base pipeline codegen from gemm instance builder

* unify v3 logic with the rest of universal gemm pipelines

* fix build for multi abd test

* fix test gemm multi d

* fix build for weight preshuffle

* fix grouped gemm test

* fix grouped gemm multi d test

* fix grouped gemm preshuffle

* fix grouped gemm example except for quant

* fix gemm preshuffle

* fix splitk 2 stage example

* fix batched gemm example

* fix multid example

* fix multiabd example

* fix batched gemm test

* fixup

* fix examples build

* fix grouped gemm test build

* fix smoke builder

* hacky poc

* fix tile engine

* kill the lambda

* maybe fix test build

* more fixes

* clang-format

* save temp

* clang-format

* mostly fix examples

* clang-format

* remove dead code

* more cleanup

* fix fmha bwd build (default epilogue set/add appears to be broken)

* fix default epilogue tests but not correctness

* clang-format

* fix bquant

* clang-format

* cleanup dead code

* rearrange make windows for readability

* restore changes to IsSupportedArgument

* fix smoke-builder

* clang-format

* fixup rename class

* build fixes

* clang-format

* fix builder

* fixup

* remove set from builder tests

* fix test

* clang-format

* re-refactor the kernels

* clang-format

* fix header license

* remove memory operation from conv bwd test

* clang-format

* clang-format example,include

* clang-format test

* build fixes

* clang-format

* solve compilation error

* fix the CI

* solve compilation error

* clang format

* solve merge conflict

* solve merge conflict

* solve the gfx11 error

* solve test error

* moar build fixes

* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-04 03:28:14 -08:00
John Afaganis
ec23be0b9d Update unsigned long literals and format specifiers to work correctly in Windows (#3483)
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
2026-01-02 22:16:41 -07:00
ApoorvaKalyani
53a1e4f551 Grouped convolution backward data WMMA v3 implementation (#3460)
* Added device level implementation for bwd_data_wmma_v3.

* Added first instance of bwd_data_wmma_v3(f16).

* Add support for bwd data in gridwise implementation

Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight

* Initial device implementation of bwd data

* Remove unused template parameters in device impl

* Add one instance for different layout

initial check of device implementation

* Add tests for splitk and for different layouts

* Appended more instances to wmma_v3_f16.

* Added conv_2d bf16 wmma_v3 instances.

* Added conv_3d_bf16 wmma_v3_instances.

* Added conv_3d_f16_wmma_v3_instances.

* Added SplitN test cases for wmma.

* Conv3d_bwd_data_scale_wmma_v3 instances.

* Conv3d_bwd_data_bilinear_wmma_v3_instances

* Renaming the device level instances file to common name , since it is defined for different DataTypes.

* Renaming the instances and fixing typo

* Added the test cases to regression test list

* NCHW support for wmma_v3

* Examples for bf16 and f16 bwd_data_wmma_v3

* Added transpose conditons for device impl

* fixing bugs

* Added the gemm_args array implmentation

* WIP debug conv bwd

* fix splitk

* Grouped gemm fix

* Update CmakeLists with EOF

* Added more instances for tests

* Fixed the run time error in examples and removed 3d conv examples.

* Fixed a typo.

* Updated CmakeLists to removed the 3d convultion deleted files

* Added print error statements for unsupoorted argument

* Added the merge conflict related changes

* Fixed compilation error

* Fixed the InstanceFactory duplication error.

* Removed the print statements and added logs to Arg function

* All the merge conflict related errors resolved

* Added d_tensor tests.

* Added the missing example types of wmm_v3

* Merge error fix

* Corrected the instance name

* Reverted the bias relu change

* Revereted the transpose load local change

* Updated the regression test list with bwd_data_scale

* Revert "Revereted the transpose load local change"

This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.

* Revert "Merge error fix"

This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.

* Reverting the local change

* Added merge error fix

* Build error fix due to merge conflicts

* Added bias_relu example for wmma_v3

* Modified the main method in dtensor tests

* Updated the dtensor tests to pick all the shapes

* Updated the dtensor test shapes.

* Updated the mem operations in tests.

* Added reference func

* Fixed typos in device impl

* Added new header file and modified the include file for 3d tests

* Renamed the test file and added reference func call.

* clang format fix

* Added ignore params

* Modified device impl and tests

* Removed debug print statements and updated dtensor test shapes

* Fixing merge conflicts

* Fixing more merge conflicts

* Fixed copyrights

* Updated the tuned instances to bilinear and scale.

* Adding tuned instances to vanilla wmma_v3

* Removed all unused instances and modified test layouts.

* Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances.

* Fix clang format

* Updated tuned f16/-genric instances

* Formatting the instances file

* Fixed copyrights and clang issues

* Nonsense commit to force git to force

* Removed the transpose instances

* Added verified genric instances

* Fixing namespace errors

* Added todo for failing shapes

* Formatting instance file

* Fix instance list formatting

* Removing unnecessary formats

* Renamed the common file

* Unification of xdl and wmma bwd_data tests

* Updated Cmake

* Added all layout types and deleted code.

* Updated Cmake to add the condition to all tests.

---------

Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2025-12-30 16:25:08 +01:00
yadaish
dae85ead64 [CK_TILE] support split-k a16w4 gemm1 (#3389)
* initial version to support moe gemm1 split-k

* add missing args

* fix build warning

* update reference

* for split-k disable bias and weight

* remove debug log

* fix format

* fix div by zero errors

* fix cmake config

* update

* resolve conflicts

* remove useless changes

* reformat

* fix

* remove useless changes

* fix ci

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>
2025-12-29 23:05:35 +08:00
Yi DING
7ce532eac7 [CK_TILE] Align FMHA BWD Reference with Kernel Implementation (#3486) 2025-12-25 16:12:36 +08:00
Erwin Terpstra
e08efa551f [CK_TILE] Grouped gemm quant tensor layouts (#3414)
* feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time

* chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times

* chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems

* fix: missing check for transposed load in bquant pipeline

* chore: lower unit test tensors dimensions a bit for faster tests

* chore: set grouped gemm example M/N warp tile to 16

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-24 23:01:23 -08:00
kensclin
7f68f3c4fa Enable padding blockscale for abquant (#3453)
* Enable padding blockscale for abquant

* run clang-format

* Reduce unnecessary testing

* remove cout
2025-12-24 09:12:40 -08:00
Po Yen Chen
1c3151963b [CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425)
* Add fp8bf16 support for batch_prefill

* Fix wrong scale_s re-compute logic in batch_prefill

* Fix wrong scale_s re-compute logic in fmha fwd

* Fix batch_prefill codegen error

* Remove no-longer used GetName() function

* Add fp8 logits=True instances

* Update CHANGELOG.md
2025-12-24 10:34:06 +08:00
Yi DING
6864a618f4 [CK_TILE] FMHA Ignore BWD Failed Cases in Smoke Test (#3480) 2025-12-23 13:28:15 +08:00
Yi DING
2220cbaba7 [CK_TILE] MX Flatmm Use Byte Pointer Arithmetic for A Tensor (#3446)
* A as bytes

* Reformat with static_for_product
2025-12-19 10:28:13 +08:00
yadaish
c0ee71d735 Dev/a8w4 and a8w8splitk (#3447)
* Ck moe bs splitk pr (#3440)

* splitk kick-off. Compilation fail

* splitk hack pass

* fix scale offset calc.

* clang-format for a8w8_moe_blk_gemm1 splitk change

* fix testcase error

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>

* Zan/moe a8w4 (#3441)

* update

* update

* update ck moe a8w4

* update

* update

* update

* compile pass

* update

* update

* python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready

* support new a8w4 kernel

* update

* update ck_tile

* re format

* update

* update

* fix conflict

* fix build

* update ck_tile moe

* fix clang format

* fix the problem

* fix accruacy issue

* fix

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-12-19 09:26:52 +08:00
Kiefer van Teutem
2ea710e88b Grouped convolution forward device implementation and base flavors for RDNA3/4 (#2964)
* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* removed unnecessary ck parts from compilation

* initial gemm_add_multiply instance implementations

* fixed profiler help message for gemm_add_multiply

* improved multiply_add profiler layout help

* fixed template arguments for test instances

* added test for gemm_add_multiply

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* switched to splitK interface

* log print added to splitk benchmarks

* revert main cmake comments

* newline change reverted

* added add_fastgelu instances

* revert unintended change in xdl add_fastgelu

* created gemm_add_add_fastgelu instances

* created fastegelu instances

* added tests for all splitk fastgelus

* Added tests.

* multiply_add instances created

* updates to add_multiply splitk instances

* splitk xdl test fixes

* added wmma multiply_multiply instances

* fixed ONLY_XDL_AND_WMMA_KERNELS tag

* Added gemm_add examples for wmma v1 and v3

* fixed / workarounded i8 instances

* Modified the v3 code to added one fp16 bxdl instance.

* added bf16 xdl instance.

* adding gemm_add wmma_cshuffle and other support


(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* add instances into camkelists


(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* This is work in progress, edited the template parameters in order to build

(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype


(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* added datatype and use clang-format-12


(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* Fixing build errors

* Added instances for v3

* Adding instances and executables

* Code update of template parameters modified.

* Renamed file.

* Added tests.

* resolved error tests.

* Fixing build errors

* Updated comments

* removed the changes as per the MR review comment.

* Updated tests.

* fp8 instances - not tested

* Restored the Cmake file that was reverted by mistake during rebase.

* fixed wmma_op test

* Updated comments.

* Updated the template parameter description

* fixed rdna4 instances

* fixed back compatibility on gfx11

* cleanups

* fix ckProfiler

* one more cmake fix

* added fp8 instances

* Updated tests to ad BF16 instances as per review comment

* Added include file and cleaned up(as per review comment)

* Updated and optimized the example code for all types.

* Fixed clang format

* Resolve "Implement `device_gemm_bilinear` for RDNA4"

* test generalization to handle FP16 shuffle better

* added missing changes

* Added bf16 wmma instance for add_relu

* Added f16 wmma instance and corrected bf16 instance errors.

* Added instances to Cmake

* Modified the template parameters to make the instances work.

* Fixed typo in profiler

* Added v3 instances for gemm_add_relu

* addressed core review comments

* Added test for gemm_add_relu wmma instance

* Cleaned up the code.

* Added examples for gemm_add_relu

* Fixing typo to resolve build errors.

* Fixes applied to fix  the precision loss.

* fix billinear test after merge

* Removed the old wmma instances.

* Added wrapper and renamed the wmma_v3 instances

* Updated copyrights and added wrappers.

* Fixes applied according to review comments

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Robin Voetter <robin@streamhpc.com>

* Removed the old wmma instances.

* Updated wrapper for the v3 instances

* removed the old wmma examples

* Renamed the v3 instances

* Deleted the  gtest file added by mistake.

* Updated thge profiler with wrapper

* Fixed test errors.

* Fixed the review comments

* Fixed the if condition MACROS.

* REVERTED THE PROFILER CHANGES

* Revert "REVERTED THE PROFILER CHANGES"

This reverts commit 21cb98546c.

* Revert "Fixed test errors."

This reverts commit 13efcc6fe1.

* Revert "Updated thge profiler with wrapper"

This reverts commit 536f86661d.

* Added missing wrapper instances

* Updated copyrights.

* Fixed typo.

* Fixed copyrights.

* Updated copyrights.

* updated copyrights.

* comments on the atomics workaround

* fixed cmake comment

* Fix bug from merge

* clang-format-18

* Fix compilation error

* multi_abd wmma support:

 - Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)

* Fix bug in device print function

* Fix unused template parameter

* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.

* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.

* Make relevant profilers print the number of valid instances to aid testing.

* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now.  Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.

* Reset output buffer after each run in profile_grouped_conv_fwd_impl().

* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.

* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.

* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.

* Add int8 instances for 2D vanilla grouped conv fwd all layouts.

* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd

* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.

* Implement multi-AB support for grouped conv fwd and add example.

* Add 1D instances

* Add D layout tests to IsSupportedArgument()

* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.

* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.

* Add some more tests for vanilla grouped conv fwd

* Add 2D bias clamp instances and tests

* Add 3D bias clamp instances and tests

* Add 2D and 3D clamp instances and tests

* Unify problem sizes across vanilla and clamp flavor tests

* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.

* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.

* Remove wmma fp8 and bf8 instances when not targetting gfx12

* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build

* Remove old years for newly created files.

* No need to time kernels for now.

* Fixup comments

* Pass struct args to Gridwise Run() function by reference.

* Don't use workspace memory in the case where A needs explicit transposition but B does not.

* Move calculation of rotating memory buffer sizes to Argument member functions.

* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.

* Unify xdl and wmma example code for grouped conv fwd scaleadd ab

* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.

* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.

* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.

* Small post-merge fixup, everything seems to work.

* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.

* Extend scaleadd_ab instance lists

* Extend merged groups instance lists, including adaptations of xdl "2x" instances.

* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.

* Extend "mem" instance lists.

* Extend regular instance lists.

* Fixup comments and ignored kernel arg name

* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.

* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.

* Re-enable sharding for wmma cshufflev3 instances

* Post merge fix to vanilla test

* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.

* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.

* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.

* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.

* Re-enable old wmma instances

* Re-enable Linqun's Xdl Wmma instances

* Small post-merge fixes

* Fix copyright headers

* Remove commented code snippet in gridwise

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.

* Adding tuned instace list for groupoed conv fwd (#3288)

Following flavors are updated with tuned instance list:
  - grouped_conv2d_fwd
  - grouped_conv2d_fwd_bias_clamp
  - grouped_conv2d_fwd_clamp
  - grouped_conv3d_fwd
  - grouped_conv3d_fwd_bias_clamp
  - grouped_conv3d_fwd_clamp
  - grouped_conv3d_fwd_scaleadd_ab

Re-factored instance selection:
  - removed all the unnecessary instance tuples (comp/mem/16x16/generic)
  - removed all unnecessary layouts and data types

* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.

* Splitting grouped conv fwd instances (#3449)

* Disable unnecessary and failing tests related to experimental CK builder

* Disable unnecessary ck builder experimental tests fully

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Wojciech Laskowski <77888887+wj-laskowski@users.noreply.github.com>
2025-12-18 13:12:15 -07:00
Bartłomiej Kocot
700b2ec9c0 Update AMD buffer coherency (#3403)
* Update AMD buffer coherency [AICK-421]

* fixes

* fix

* fixes

* fixes

* Add backward compatilibity

* fix

* fixes

* fix

* fix

* fix

* Update grouped_convolution_backward_weight_kernel.hpp
2025-12-18 10:16:22 +01:00
Yashvardhan Agarwal
15e81397a4 [CK_TILE] Epilogue chaining (Lwpck 3373) (#2773)
* Epilogue chainer

* epilogue chainer with context to share state in between epilogues
* chain-able epilogues for cshuffle

* clang-format

* rebase related changes

- Added separate chainer test
-  clang format

* comment resolutions

* clang-format

* Policy based chaining

- basic Policy structure to control blanket looping and barrier
placement.

- to be extended for fine grianed control

- to  be modified to move possible auto-compute values and SFC  access
count to policy

* Refactoring as per spec

- Introduced epilogue schedule, graph
- modified chainer to function with graph and schedule

* minor_changes

- made functions to overload in the epilogue_graph file

* clang-format

* Documentation and Comments

- Added comments to files
- Noted changes in changelog
- Added README to explain the chainer and current status, exact use
steps to be added

* Comment resolutions

- README modified with the suggested changes
- Comment fixed accordingly

* major refactoring

- modified the chainer files to match the new design
- updated comments
- updated readme
- multi-d example shocases use of the chainer

* minor cleanup

* tensor and rowcol quant chainer epilogue

- added scalarepilogue for tensor quant
- added schedule for tensorquant
- modified quant example to use chainer and appropriate schedules

* Refactor epilogue chainer: generalize ops and standardize context interface

Address review comments.

Changes:
- Rename CastToLdsOp to CastAndStoreToLdsOp for clarity
- Standardize context member names (working_tile, out_tile, aux_windows)
- Update README documentation with correct operation names
- Clean up parameter naming in epilogue_chainer.hpp (OutWindow, AccTile,
AuxWindows)
- common_epilogue_ops.hpp: General-purpose ops (ScaleScalarOp,
CastAndStoreToLdsOp,
  LoadFromLdsOp, ElementwiseOp, StoreOp, MoveWindowsOp)
- cshuffle_epilogue_chainer_ops.hpp: CShuffle-specific context and slice
operations
- epilogue_chainer.hpp: Cleaned up parameter naming for generality
- Removed test files that are no longer needed. These were added for
intermediate use

* update cshuffle chainer ops file w.r.t cshuffle_epilogue.hpp updates & add chainer to quant gemm example

* fix compile errors

- CI uses c++17 while the code had c++20 features

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-12-18 10:02:02 +01:00