5 Commits

Author SHA1 Message Date
Emily Martins
22b945e06e [CK_TILE] Stream-K Tree Reduction and Cache Skipping Integration (#3371)
* CK Tile Stream-K Tree Reduction

This change adds the first implementation of the Stream-K tree reduction
strategy into CK Tile. The tree reduction reduces the number of steps
for accumulating results for a tile from O(N) to O(log N), where N is
the number of workgroups contributing to a C tile.
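
As a rough illustration of the step count only, the following host-side sketch models a pairwise tree reduction. It is not the CK Tile kernel code: `tree_reduce` is a hypothetical name, and each vector entry merely stands in for one workgroup's partial result for a C tile.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side model: partials[i] stands in for the partial
// accumulation produced by workgroup i for a single C tile.
// Returns {reduced value, number of reduction rounds}.
std::pair<float, int> tree_reduce(std::vector<float> partials)
{
    assert(!partials.empty());
    int rounds = 0;
    // In each round, element i absorbs its partner at distance `stride`.
    // The distance doubles every round, so ceil(log2(N)) rounds suffice.
    for(std::size_t stride = 1; stride < partials.size(); stride *= 2)
    {
        for(std::size_t i = 0; i + stride < partials.size(); i += 2 * stride)
        {
            partials[i] += partials[i + stride];
        }
        ++rounds;
    }
    return {partials[0], rounds};
}
```

With 8 contributors this model takes 3 rounds, whereas a linear accumulation into one workgroup would take 7 sequential additions.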

Additionally, in the original non-atomic reduction strategy, atomics
were used to set the flags buffer and to read from the flags buffer.
However, investigation with the tree reduction showed that atomics with
default (relaxed) semantics were not enough to guarantee workgroups
would not read stale data, leading to incorrect results. Stronger
acquire/release memory orderings are too expensive. So, this change
also eliminates the use of atomics for setting the flags. Instead, we
leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby
avoiding the use of atomics.

Preliminary tests were also added for the normal reduction and tree
reduction. More will be added in a future PR via tile engine.

* Move Stream-K kernel files to a subdirectory

* Cleanup Code Style & Handle Unsupported Reductions

This change makes the following small changes:
- Add an explicit else block for unimplemented reduction strategies
- Clarify type of sk_flags_ptr via auto*
- Add description for extra_iters_before_me variable

* Run new copyright script on new files
2025-12-14 14:49:49 -07:00
Emily Martins
2e4b8a8fc4 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Removes the v2 suffix from the new CK Tile Tile Partitioner
  derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
arai713
715395bc86 [CK_TILE] Stream-K Gemm Example for fp8 and bf8 (#3041)
* Addition of streamk fp8 example for CK Tile

* Adding in bf8 streamk example in CK Tile

* Refactoring fp8/bf8 unit tests

Refactored the unit tests for fp8/bf8 to utilize the test harness.
Implemented smoke tests with layouts: CCR, CRR, RCR, RRR for fp8/bf8.
The tests are using 128x128x32 for the tile configuration, as other
configurations revealed implementation gaps that are currently being
documented.
2025-10-27 19:29:03 -07:00
arai713
054fdb765c [CK_TILE] Stream-K operator() Reboot (#3064)
* Persistent Stream-K Kernel Implementation

This change implements an operator() function in the
reboot::StreamKKernel class that is enabled when the Persistent flag is
set to true. In this case, the data-parallel portion and the Stream-K
portion of the kernel are fully persistent.

The changes were made in the reboot namespace. A future PR will remove
the old Stream-K kernel class and remove the reboot namespace.

* Unit Tests for Persistent Stream-K Kernel

This change contains the initial test suite for the Persistent Stream-K
Kernel. The files contain "reboot" in the name; a future PR will remove
tests for the old Stream-K Kernel and remove the "reboot" naming.

A future commit will add tests for the non-persistent kernel.

Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase
class. This allows us to estimate the number of accumulations done per
macro tile in C to use during validation when computing relative and
absolute tolerance.

* Adding implementation for the Non-Persistent Stream-K kernel

This change adds the operator() function for the Non-Persistent Stream-K
kernel. Persistency of the kernel is determined through a template argument.
The Non-Persistent kernel will allocate additional workgroups for the data
parallel section, leading to a different structure for processing the data
parallel and Stream-K sections.

The TilePartitioner now also exposes whether Persistent has been set to
true or false in the StreamKKernel.

* Adding in the tests for the Non-Persistent Stream-K kernel

* Refactor Stream-K Reboot Unit Tests

This commit makes the following changes:
- Update test cases to determine M, N, and K based on the number of CUs.
  This ensures that each test case is one of Edge Case, SK Only, DP
  Only, or DP + 2 Tile SK regardless of the architecture.
- Since the DP + 2 Tile SK test case takes long to run, this change
  moves this case into a separate .inc file and labels it as an extended
  test.
- Since the extended test takes > 30 seconds to run, this test is added
  to the list of regression tests.

* Fix spelling errors in comments for test cases

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Changes based on review

Removed const volatile for typenames
Set up alias for is_tuple_t
Naming changes for clarity: GemmCommon -> BaseGemm
Moved std::enable_if_t out of template parameters and changed to a return type for operator()
Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance
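
The std::enable_if_t change above can be pictured with the minimal sketch below. It only illustrates the idiom of selecting an operator() overload through its return type based on a boolean template argument. `KernelSketch` and `Path` are hypothetical names, not the StreamKKernel interface, and the bodies are placeholders.

```cpp
#include <type_traits>

// Hypothetical tag describing which code path an overload represents.
enum class Path
{
    Persistent,
    NonPersistent
};

// Hypothetical stand-in for a kernel class templated on persistency.
template <bool Persistent>
struct KernelSketch
{
    // Enabled only when Persistent is true. The dummy parameter P makes
    // the condition dependent, so SFINAE applies to the return type.
    template <bool P = Persistent>
    std::enable_if_t<P, Path> operator()() const
    {
        return Path::Persistent;
    }

    // Enabled only when Persistent is false.
    template <bool P = Persistent>
    std::enable_if_t<!P, Path> operator()() const
    {
        return Path::NonPersistent;
    }
};
```

Calling `KernelSketch<true>{}()` resolves to the first overload and `KernelSketch<false>{}()` to the second, with the choice made at compile time.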

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-27 09:14:17 -07:00
Christopher Millette
f5708882a3 Streamk functional tests (#2974)
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite

* Account for stride when computing K offsets for A and B tensor

This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correctly regardless of
whether A and B are row or column major.
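
A minimal sketch of the idea, assuming A is M x K, B is K x N, and each tensor has a single leading-dimension stride. The helper names `a_k_offset` and `b_k_offset` and the `Layout` enum are hypothetical and do not come from the kernel source.

```cpp
#include <cassert>
#include <cstdint>

enum class Layout
{
    RowMajor,
    ColumnMajor
};

// Element offset of column k in A (M x K), measured from the start of a row.
// Row-major A is contiguous along K; column-major A steps by its stride.
std::int64_t a_k_offset(Layout layout, std::int64_t k, std::int64_t stride_a)
{
    assert(k >= 0 && stride_a > 0);
    return layout == Layout::RowMajor ? k : k * stride_a;
}

// Element offset of row k in B (K x N), measured from the start of a column.
// Row-major B steps by its stride along K; column-major B is contiguous.
std::int64_t b_k_offset(Layout layout, std::int64_t k, std::int64_t stride_b)
{
    assert(k >= 0 && stride_b > 0);
    return layout == Layout::RowMajor ? k * stride_b : k;
}
```

Using the raw K index for both layouts would only be correct when K is the contiguous dimension, which is the kind of case this fix addresses.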

* Move helper code to test_gemm_streamk_util.hpp

* Separate tests into smoke/regression/extended. Add bf16 datatype

* Run clang-format

* Refactor combinatorial macro expansion and naming

* Adjust the initialization values to account for better tolerance on bf16

* Correct BF16 datatypes in comments

* Move the extended tests under the REGRESSION_TESTS label

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-11 07:53:40 -05:00