Commit Graph

688 Commits

Author SHA1 Message Date
Sami Remes
c7298e57c0 remove some old files 2026-02-06 18:37:34 +00:00
Sami Remes
dc4366a876 add main include file 2026-02-06 18:12:54 +00:00
Sami Remes
06a8998254 clean up kernel and pipeline code 2026-02-06 18:11:17 +00:00
Sami Remes
c588a1fd42 use unpacked scales 2026-02-06 17:26:03 +00:00
Sami Remes
061c9f9374 save packing approach 2026-02-06 15:54:57 +00:00
Sami Remes
a8d48f9224 now offsetting with M/MPerXdl to get scales 2026-02-05 17:31:32 +00:00
Sami Remes
350022827f init=1 init=2 working, some scales are still wrong as init=0 failing 2026-02-05 10:28:49 +00:00
Sami Remes
6c61804665 try to enable scale loading in kernel and pipeline 2026-02-05 09:25:21 +00:00
Sami Remes
16fa73db63 use proper rtol/atol 2026-02-03 09:57:20 +00:00
Sami Remes
6b50755cd2 fix alignment calculation of lds tensor views 2026-02-03 08:24:03 +00:00
Sami Remes
b47853d3fe enable fp4 for universal gemm - without any scaling 2026-02-03 03:10:35 -05:00
Sami Remes
4d241289c9 use default scale (no scale) for 16x16x128 mfma scale 2026-01-30 12:55:46 -05:00
Sami Remes
407df88c02 enable 32 element for fp4 2026-01-30 12:47:45 -05:00
Sami Remes
771c46aa8b add initial version for scale block_gemm, not used yet 2026-01-30 12:42:45 -05:00
Sami Remes
b124a72ff5 revert mostly back to original comp_async 2026-01-30 12:40:48 -05:00
Sami Remes
2cc0e3d019 override base policys vector size with static_assert 4/12/16 bytes 2026-01-30 03:55:56 -05:00
Sami Remes
409a7d8edb Merge remote-tracking branch 'origin/develop' into samremes/ck_tile_mx_gemm 2026-01-30 03:30:11 -05:00
MHYangAMD
6ff0737843 Fix redundant cast in model sensitive rmsnorm (#3681)
* Fix redundant cast

* Fix linting
2026-01-30 10:52:19 +08:00
Khushbu Agarwal
9b168082b7 [CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629)
* initial commit

* preshuffleQuant support for ABQuant

* fix mxfp4 to use correct QuantGroupSize

* addressing review comments and seperated Preshufflequant for A and B

* updated grouped gemm example for updated traits definition

* fix for CI failure

* updated grouped_gemm_abquant test for updated traits definition

* updated grouped_gemm_abquant test for updated traits definition
2026-01-28 19:45:09 -08:00
Jeff Huang
e3556fed04 Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657)
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
  GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-29 07:18:41 +08:00
Sami Remes
0033748c62 revert custom ldstile, should be able to use the regular ones 2026-01-28 10:37:13 -05:00
Yi DING
8e3d84aba3 [CK_TILE] ABQuant New Preshuffle (#3638)
* Refactor

* Gemm quant improvement

* Change preshuffle

* Fix

* Fix grouped gemm ut

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-27 23:46:49 -08:00
damien-lejeune
91e32f305f [CK Tile] multi reduce improvements (#3607)
* WIP: refactoring

* Swap operation/data nested loops order

* Improve memory coalescing

* Add comments

* Enforce same identity element for the reduce operations

* Re-add compile time constant

* Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-01-27 12:56:09 -08:00
Sami Remes
30d4c25d5a use PackedSize in slicing 2026-01-27 13:01:54 -05:00
Illia Silin
b26cb596b0 fix some syntax errors (#3658) 2026-01-27 09:59:39 -08:00
Sami Remes
f62cc5415f current state of pipeline 2026-01-27 12:56:24 -05:00
Bartłomiej Kocot
3d67e6c492 [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624)
* [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err

* Update test_grouped_convnd_fwd_tile.cpp

* Update test_grouped_convnd_fwd_tile.cpp

* Update conv_tuning_params.hpp

* clang format fix

* Update CMakeLists.txt
2026-01-27 11:04:11 +02:00
Aviral Goel
b8751e505d feat: Add Interwave scheduler for aquant memory pipeline (#3540)
* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipline in aquant

* test: add unit test for aquant memory pipeline

* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipline in aquant

* fix: compilation error on gfx950

* chore: remove debug statements from the code

* test: resolve merge conflict

* test: remove non rcr unit tests from test suite
2026-01-26 11:27:42 -08:00
Thomas Ning
3900e1e7ce Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634)
* Finished the work

* Fix the pipeline
2026-01-26 10:29:28 -08:00
Sami Remes
70c7fcda43 WIP: debugging... 2026-01-26 11:33:45 -05:00
Emily Martins
f5c2f09036 [CK_TILE] Fix alignment in Stream-K workspace buffer (#3625)
* Fix alignment issue in Stream-K workspace buffer

In CK Tile Stream-K, the workspace buffer is used to hold flags and
partials, where the first i bytes holds the flags and the remaining
bytes hold partials. This change adds padding to the flags prefix of the
workspace buffer to ensure the number of bytes is 128B-aligned. Without
this alignment, since workgroups do not skip cache when reading from
partials, they may read stale partials data in cache, leading to
incorrect results. The added padding avoids the stale data reading.

This change also re-enables the test_ck_tile_streamk_reduction tests.

* Compute reference GEMM on GPU for test verification to decrease testing time
2026-01-23 16:14:22 -07:00
ltqin
67f0b74ec6 Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit de5a1d730d.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-23 09:03:22 -08:00
Sami Remes
d2a7c2f041 compiles again using get_y_sliced_thread_data in warpgemm loop 2026-01-23 11:01:43 -05:00
Po Yen Chen
de5a1d730d Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit dd0b4294af.
2026-01-22 21:21:19 -08:00
kensclin
31a35ecab4 GEMM Blockscale ABQuant Optimization (#3620)
* GEMM Blockscale ABQuant Optimization

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix precommit error

* clean

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2026-01-22 09:39:38 -08:00
Bartłomiej Kocot
44f481a45c [CK TILE] Fix basic gemm pipelines (#3611)
* [CK TILE] Fix basic pipelines

* fixes
2026-01-22 08:11:18 -06:00
Linjun-AMD
0b13697a88 [CK_TILE][FMHA]Add new tile size for async (#3623)
* Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)"

This reverts commit 8f75869408.

* Add new tile_size for async pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-22 16:07:14 +08:00
ltqin
dd0b4294af Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel

* add block scale to kernel

* add smoke test

* format

* Revert "format"

This reverts commit 356c3c9706.

* only format my code

* format py

* fix auto not allowd in function prototype

* change instance tttt to ttff

* fix structured binding issue

* change s_acc elementwise op

* async pipeline add block scale

* add quantation P using shift exp2

* precompute (m - shift) once per row

* change blk scale seqstrt ptr name

* fix some name

* fix for  deduction guide

* fix some comments

* add P scale to qr_ksvs_pipeline

* add comment to idx_identity

* change the method of calculating descale block index

* unify naming style: use block_scale_ as name prefix

* unify naming style

* update the CHANGELOG.md

* Add FP8 block scale quantization support for FMHA forward kernel

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-21 20:58:26 -08:00
Yi DING
fcc9372c00 [CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615) 2026-01-21 09:54:46 +08:00
Max Podkorytov
91b4102a59 Add persistent async input scheduler for GEMM kernels (#3520)
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
  chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks

Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
  chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
2026-01-20 10:37:09 -08:00
Linjun-AMD
8f75869408 Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)
This reverts commit f3aafb9555.
2026-01-20 09:40:54 -08:00
Bartłomiej Kocot
0727e85e52 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BULDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes
2026-01-19 22:29:01 -07:00
Cong Ma
0517d43d31 [CK TILE] remove dependency on std chrono (#3599)
* [CK TILE] remove dependency on std chrono

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-19 15:31:02 -08:00
Linjun-AMD
f3aafb9555 [CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-19 15:22:33 -08:00
Adam Osewski
1a6d1b59ef [CK_BUILDER] Convolution forward transfer concepts. (#3535)
* Rename member variable to better reflect its actuall meaning.

* Add transfer checks for conv fwd xdl.

* Validate tensor layouts & vector size conv fwd v3.

* Add combined transfer concepts.

* Add transfer concepts for conv fwd factories.

* Fix clang format

* Add helper instruction to get max mem vector instruction width.

* Apply review comments.

* Rename thread cluster access(->arrange) order concept

* FIx merge artifacts.

* Add generic access order limits into block transfer concept.
2026-01-19 10:54:10 +01:00
Cong Ma
f9104ef9b3 [CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584) 2026-01-16 15:27:39 -08:00
Sami Remes
f09e10936d fixed vector load siz for fp4 2026-01-16 12:04:34 -05:00
Sami Remes
16ca5cb532 WIP 2026-01-16 08:22:11 -05:00
Estevan Vedovelli
e30207985a Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-15 09:35:24 -05:00
Jeff Huang
993d3e2f0e [FMHA] Enable page size 16 for batch prefill kernel (#3568)
* [FMHA] Enable page size 16 for batch prefill kernel

* Refactor batch prefill KV offset logic to simplify template arguments
- Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args.
- Add static assert to forbid `page_size=1` with vectorized layout.
2026-01-15 22:11:44 +08:00