Commit Graph

504 Commits

Author SHA1 Message Date
Sami Remes
6f90564708 fix formatting 2025-10-31 20:16:52 +00:00
Sami Remes
fe92102baf add some documentation and 2d block scale example 2025-10-31 20:13:43 +00:00
Cong Ma
bcccafee40 Update tile distribution for 2D bquant 2025-10-30 22:55:43 -04:00
Sami Remes
68e41da5f2 fix formatting 2025-10-30 08:48:39 +00:00
Sami Remes
1290b1b28a simplify conditions that are needed for tile distributions 2025-10-29 17:33:44 +00:00
Sami Remes
5e0a356e19 Remove commented code
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-29 11:19:02 +02:00
Sami Remes
e1475d4a52 one more fix to tile dstr, and revert debug initialization 2025-10-28 19:13:16 +00:00
Sami Remes
7c93551878 fix formatting 2025-10-28 18:55:51 +00:00
Sami Remes
e12ab566e8 Fix some issues from the merge 2025-10-28 18:55:17 +00:00
Sami Remes
a449728fdd Merge remote-tracking branch 'origin/develop' into samremes/bmatrix_2d_blockscale 2025-10-28 17:49:53 +00:00
Mateusz Ozga
da4247a6df [CK_TILE] Fixed multi-abd GEMM test, NaN problem (#2979)
* Multi-ABD NaN problem

* Rollback tests

---------

Co-authored-by: root <root@splinter-126-008d.aus.dcgpu>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-28 15:53:36 +01:00
Aviral Goel
4368fd9f57 [CK_TILE] Add Bquant to Grouped Gemm (#3063)
* update test cases

* format codes

* use GTEST_FAIL

* add bquant to grouped_gemm

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* fix a bug in test_grouped_gemm_util

* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm

* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* feat: add bf8 support

* chore: remove unnecessary decltype usage

* chore: add default quant_mode to function signature as fallback

* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel

Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.

* chore: set default quant mode in function signature

* test: add additional test cases to cover edge case of no hotloop

* chore: clang formatting

---------

Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-28 10:20:24 -04:00
Khushbu Agarwal
b11f53a484 Fix quant scale matrix layout for block scale gemm (#3079)
* Adding support for TiledPermuteN

* Adding test

* moving shuffle functions to common place

* resolving commit hook

* fix formatting
2025-10-27 13:56:07 -07:00
Johannes Graner
5c1974065e [CK_TILE] Add conv fwd + bias + clamp example (#3012)
* Implement argument passing to element-wise functions for fwd convolution

* Add files for fwd + bias + clamp example

* Implement Bias

* Implement Clamp

* Elementwise function composition

* Composition unit test

* Implement fwd + bias + clamp example

* Simplify argument passing and composition

* elfunc -> bias_and_clamp

* Rename function to specify example

* Move element-wise function instantiation to kernel

* Make bias a runtime tensor

* No ugly namespace aliasing

* Initialize element-wise function on host

* Remove function initialization helper, simplify Compose initialization

* Remove unintended LSP compatibility patch

* Clean up includes and unused code

* Switch names in cshuffle epilogue

* Move CDElementwise to conv traits

* Re-add required include

* Initialize bias in same way as other tensors

* Better type specification for ds pointer

* Disable 1D convolution

* Add warning for non-group-constant bias
2025-10-27 18:43:09 +01:00
arai713
054fdb765c [CK_TILE] Stream-K operator() Reboot (#3064)
* Persistent Stream-K Kernel Implementation

This change implements an operator() function in the
reboot::StreamKKernel class that is enabled when the Persistent flag is
set to true. In this case, the data-parallel portion and the Stream-K
portion of the kernel are fully persistent.

The changes were made in the reboot namespace. A future PR will remove
the old Stream-K kernel class and remove the reboot namespace.

* Unit Tests for Persistent Stream-K Kernel

This change contains the inital test suite for the Persitent Stream-K
Kernel. The files contain "reboot" in the name; a future PR will remove
tests for the old Stream-K Kernel and remove the "reboot" naming.

A future commit will add tests for the non-persistent kernel.

Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase
class. This allows us to estimate the number of accumulations done per
macro tile in C to use during validation when computing relative and
absolute tolerance.

* Adding implementation for the Non-Persistent Stream-K kernel

This code is adding the operator() function for the Non-Persistent Stream-K
kernel. Persistency of the kernel is determined through a template argument.
The Non-Persistent kernel will allocate additional workgroups for the data
parallel section, leading to a different structure for processing the data
parallel and Stream-K sections.

There has been an addition to the TilePartitioner to get access to the whether
Persistent has been set to true or false in the StreamKKernel.

* Adding in the tests for the Non-Persistent Stream-K kernel

* Refactor Stream-K Reboot Unit Tests

This commit makes the following changes:
- Update test cases to determine M, N, and K based on the number of CUs.
  This ensures that each test case is one of Edge Case, SK Only, DP
Only, or DP + 2 Tile SK regardless of the architecture.
- Since the DP + 2 Tile SK test case takes long to run, this change
  moves this case into a separate .inc file and labels it as an extended
test.
- Since the extended test takes > 30 seconds to run, this test is added
  to the list of regression tests.

* Fix spelling errors in comments for test cases

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Changes based on review

Removed const volatile for typenames
Set up alias for is_tuple_t
Naming changes for clarity: GemmCommon -> BaseGemm
Moved std::enable_if_t out of template parameters and changed to a return type for operator()
Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-27 09:14:17 -07:00
Sami Remes
470d6e4df4 Merge remote-tracking branch 'origin/develop' into samremes/bmatrix_2d_blockscale 2025-10-27 15:17:14 +00:00
Sami Remes
2d86cd0081 fix formatting 2025-10-27 14:28:06 +00:00
Sami Remes
98deefac3e Enable NWarps replication for bquant tile dstr 2025-10-27 14:09:07 +00:00
Sami Remes
37738e4cb8 Add more specialized tile distributions 2025-10-27 13:43:02 +00:00
Adam Osewski
f53d857b25 [CK_Builder] Add name member to unary elementwise ops & update builder traits. (#3093)
* Add name member to unary elementwise ops.

* Update elementwise_op_name to check for name attribute.

* Require that the layout is derived from BaseTensorLayout struct.
2025-10-25 07:27:03 -07:00
Max Podkorytov
86d542f663 [CK-Tile][Async gemm] add missing sync and f8 inputs test cases (#3000)
* add missing sync and f8 test cases

* reformat test cases

* comment failing cases

* bump

* reintroduce compv4 shapes
2025-10-24 12:16:01 -07:00
Khushbu Agarwal
0584399571 [CK_TILE] Adding support for TiledPermuteN on preshuffle Block Scale Gemm (#3019)
* Adding support for TiledPermuteN

* Adding test

* resolving remod.py

---------

Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>
2025-10-24 11:06:51 -07:00
Max Podkorytov
fdcc1f75c3 limit the rotating count to prevent oom (#3087) 2025-10-24 08:55:34 -07:00
kyle-256
3c12a02827 [CK_TILE] add tensorwise quant in grouped gemm (#3007)
* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* change cmake

* change code based on comments

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-24 07:41:54 -07:00
Gino Lu
bedade2572 [CK_TILE] Add fp4 warp gemm 16x16x128 (#2738)
* first commit

* fix format error

* fix vec size error

* fix clang format

* fix type error

* add interface in warp_gemm_impl

* fix interface

* fix bug

* fix bug

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-23 10:55:51 -07:00
Qianfeng
fbd101b1ac [CK_TILE] Fix in set_slice_tile (#2232)
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-23 10:34:02 -07:00
Haocong WANG
0d3860dfdb [CKTILE] FMHA fwd trload lse fix (#3046)
* enable storelse for fmha_fwd_trload kernel

* fix lse in trload

* fix the mask related bug
2025-10-23 09:33:33 +08:00
lalala-sh
211d64e18a [CK_TILE] Update flatmm related kernels (#3022)
---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-10-22 22:36:11 +08:00
Johannes Graner
cbd1279ae6 [CK_TILE] Conv bwd splitN support (#3047)
* Conv bwd splitN support

* Adjust splitting calculations to lengths format

* Prepare indexing for future splitK support
2025-10-22 13:34:06 +02:00
Sami Remes
d100ab690a fix formatting 2025-10-22 11:12:39 +00:00
Sami Remes
f179a8a97b remove commented code and enable all tests again 2025-10-22 11:09:21 +00:00
MHYangAMD
5a27a97391 Introduce tree reduction for BlockReduce2dCrossWarpSync (#2588)
* Introduce tree reduction for BlockReduce2dCrossWarpSync

* Rename original impl to BlockReduce2dLinearCrossWarpSync

* Replace warp_size with get_warp_size()

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-22 14:41:35 +08:00
Sami Remes
bb52cd9889 Fix handling of n dim blocks in tile windows etc 2025-10-21 15:51:23 +00:00
Yashvardhan Agarwal
35754d2ec8 fix identity value of AbsMax (#3058)
* fix identity value of AbsMax

- Identity value of AbsMax should be 0 not numeric<T>::lowest()

* Update include/ck_tile/core/utility/reduce_operator.hpp

resolved comment

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-10-21 14:42:08 +02:00
Johannes Graner
4043401db1 Fix race conditions in ck_tile remod (#3061) 2025-10-21 09:35:04 +02:00
Max Podkorytov
2570462ecf [CK_TILE] Fix transpose_vectors for 2x2 8-bit tiles (#3042)
fix transpose_vectors logic for 2x2 8-bit tiles

    add a test which goes through this code path.

    factor out constexpr'd cases into smaller functions.

    add inline docs about the data movement

    impact: gemms with 8-bit non-rcr inputs on gfx942
2025-10-20 13:40:44 -07:00
Sami Remes
36b88c665c WIP 2025-10-20 15:42:39 +00:00
Gino Lu
fb1d090f3c [CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044)
* Patch for pk_fp4_raw_t buffer load and ref check
2025-10-20 14:47:04 +08:00
AviralGoelAMD
b03764ca5a docs: add inline comments about flush_cache and rotating buffer 2025-10-17 12:56:47 -04:00
Yashvardhan Agarwal
889ffc0b1d fix identity values in Max and AbsMax (#3048)
- The identity value method returned the minimum positive number while
we need the lowest number for Max and AbsMax operations
2025-10-17 09:49:21 -07:00
Emily Martins
352dee5225 Fix CK Tile Stream-K BF16 Validation Errors (#3039)
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.

This change ensures uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.
2025-10-17 09:33:38 -07:00
Johannes Graner
8a4cd32d86 Pre-commit in CI (#3029)
* Pre-commit in CI

* Specify python version, and install dos2unix for remod

* Refactor remod hook to correctly install dependencies

* Run pre-commit
2025-10-17 09:28:38 -07:00
Johannes Graner
d40b50b9d5 Update pre-commit to fixed versions, run remod for ck_tile (#2895)
* Fix ruff linter errors

* Fix remod dos2unix command

* Clang format

* Ignore utility in remod

* Run remod

* Specify clang-format version in pre-commit

* Specify ruff version

* Include PoolKernelArgs in reference_pool

* Add calculate_total_elements to reference batched contraction

* Fix calculate_total_elements declaration

* Refactor remod pre-commit hook

* Fix Aquant tests

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-16 15:29:17 -07:00
Sami Remes
9988a46af2 WIP: trying to figure out tile dstr and/or indexing for scale matrix 2025-10-16 15:53:32 +00:00
Emily Martins
cb83d52301 Style updates and cleanup
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused inlcudes
2025-10-16 08:47:06 -06:00
Astha
8f75d7cea6 Addition of the derived structs for the new Stream-K TilePartitioner
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent that means that both the data parallel and Stream-K sections
are data parallel. If it's non-persistent that means that only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.

The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.

Unit tests for the derived structs are included.
2025-10-16 08:47:06 -06:00
Emily Martins
f87f768d16 Stream-K Tile Partitioner Base Class with Tests
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.

This change also includes unit tests for the base tile partitioner.
2025-10-16 08:47:06 -06:00
Illia Silin
3348f01e6f re-enable clang-format by default (#3030)
* re-enable clang-format by default

* fix clang format
2025-10-15 07:43:11 -07:00
felix
4c826abfff Felix/opt sorting (#2902)
* merge felix/sorting
* opt moe sorting  (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-10-15 09:24:03 +08:00
joyeamd
b9d74e7746 update s_barrier's logic in gfx12 architecture (#3003)
change s_waitcnt's logic in gfx1250

change s_waitcnt's logic in gfx1250

update comment
2025-10-14 08:49:34 -07:00