[CK][CK Tile] Fix kbatch check in grouped conv and gemm
kernels (#5555)
## Motivation
Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.
## Technical Details
Round up K / Kperxdl and divide it by Kbatch to allow tail for K.
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Ck/joye/revert oob check
## Motivation
fix ck_tile's oob check.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Rename Stream-K grid function
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().
## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.
## Test Plan
Validated using the test suite that is already present.
## Test Result
All tests passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Skip work if any of Grouped GEMM groups M/N/K are
zero. (#5050)
## Motivation
It's common in MoE workloads that some experts receive zero tokens,
which would result in some of the dimensions equal to zero. Currently we
handle such case only for non-persistent kernels where we have all GEMMs
information beforehand on host - we validate this during creation of
kernel arguments. However for the "dynamic" input path (persistent
kernel) this information is not available before kernel launch. Thus we
have to validate this during kernel execution. The goal is to add this
validation.
## Technical Details
Skip work if any of Grouped GEMM groups M/N/K are zero for persistent
kernel path.
## Test Plan
Add unit-tests which cover "dynamic" inputs with zero dims for
persistent kernel execution path.
## Test Result
All tests pass.
## Submission Checklist
- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Reduce Register Spills in Stream-K Reductions
(#4984)
## Motivation
In CK Tile Stream-K, kernels using one of two non-atomic reduction
strategies (i.e., linear, tree) have high register spill count, with the
tree reduction generally being worse. These changes act a first step to
help decrease the register spill count.
## Technical Details
### Problem 1: Unvectorized access to partials
In both the linear and tree reductions, workgroups write partials
results to a global buffer; another workgroup will later read this data.
When the initial logic to support reading and writing to the partials
buffer was added (see
https://github.com/ROCm/composable_kernel/pull/3107), the tile
distribution encoding used to read from and write to partials matches
the register layout for the accumulator of the mfma instruction used for
the kernel. Since we do not currently use the transposed register layout
for the accumulator, we end with an encoding that is not optimized for
writing to HBM.
For example: Consider the register layout of the
`v_mfma_f32_16x16x32_fp8_fp8` instruction.
```bash
./matrix_calculator.py --architecture gfx942 --instruction v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix
```
<img width="1113" height="537" alt="image"
src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02"
/>
The above shows that threads are responsible for consecutive elements
down a column of the C tile. If we use this distribution to read and
write to partials with C in row major, then threads are unable to
perform vectorized reads and writes. Note: thread 0 is shown in red and
thread 1 is shown in green.
Since the C-shuffle Epilogue only supports C in row major, reading and
writing to partials is highly unoptimized.
### Problem 2: Missed opportunity for SPGR use in tree reduction loop
Since the reduction occurs between workgroups, all threads in the
workgroup follow the same execution paths in the tree reduction logic,
hence various variables should be using SGPRs, but they are not.
### Implemented Solutions
1. Add a new tile distribution encoding that is optimized for accessing
partials in HBM. This encoding does not change the data assignment to
threads, it merely changes the addresses to which they write/read in the
partials buffer. For example, continuing with the
`v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result
in threads writing in the following layout:
<img width="517" height="342" alt="image"
src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202"
/>
This layout ensures that each thread writes along a row, enabling
`buffer_{store|load}_dwordx4` instructions (i.e., vectorized accesses).
This helps reduce register usage due to requiring fewer offset
calculations.
2. To force SGPR usage in the tree reduction loop, I make use of CK
Tile's `amd_wave_read_first_lane` which is a wrapper around
`__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the
tree reduction.
_These changes do not fully eliminate register spills. Future work will
aim to further reduce spills. But these changes make good progress._
## Test Plan
Added tests for different warp tile sizes to validate that the new
encoding works with different `WarpGemm` variants.
## Test Result
All tests pass locally on all gfx9 architectures.
Some results for decreases in register spills on gfx942: (BL = baseline)
| Kernel | SGPR Spill (BL) | SGPR Spill (new) | SGPR Delta | SGPR % |
VGPR Spill (BL) | VGPR Spill (new) | VGPR Delta | VGPR % |
|--------|------------------:|------------------:|-----------:|-------:|-------------------:|------------------:|-----------:|-------:|
| fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 | 223 | 0 | -223 |
-100.0% | 21 | 20 | -1 | -4.8% |
| fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 | 233 | 11 | -222 | -95.3%
| 443 | 23 | -420 | -94.8% |
| fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 | 221 | 3 | -218 | -98.6%
| 12 | 6 | -6 | -50.0% |
| fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 | 230 | 14 | -216 | -93.9%
| 396 | 12 | -384 | -97.0% |
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Fix batched gemm kernel 2 lds
## Motivation
Fix 2 lds batched gemm universal gemm call. Disable split k for not
valid atomic add instruction size.
## Technical Details
Fix 2 lds batched gemm universal gemm call. Disable split k for not
valid atomic add instruction size.
## Test Plan
CI overall
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to
allow caller to pass desciptors directly (#4295)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Proposed changes
Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create
tensor views from strides and sizes. This refactors the descriptor
creation out and add overloaded definitions, allowing descriptors to be
created separately by the caller instead of passing explicit strides,
with no functional changes.
This will enable further refactoring of `RunGemm` to do likewise,
enabling derived kernels like BatchedContractionKernel to avoid creating
separate versions (PR
[#3457](https://github.com/ROCm/composable_kernel/pull/3457)).
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
Since the logic within the MakeXBlockviews chains together operations on
tuples, and thus the descriptors are also passed as such, adding a
template parameter for the type of the input tuple was the simplest
option to enable the overload without too much verbiage. However, for
`MakeCBlockView` this adds a complications as the templated definitions
are prone to overlap. This for now is avoided by just moving the
arguments around for the descriptor version, which avoids the collision.
It's not a great solution, so feel free to suggest a better one.
[CK_TILE] Update Stream-K Reduction Strategy Enum
## Motivation
Currently, Stream-K has 3 reduction options: 1) atomics, 2) The
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulating partial
results of other contributing workgroups in the tile, which requires a
linear number of steps. Hence, for clarity, this works updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.
## Technical Details
Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Reduction = 1u,
TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Linear = 1u,
Tree = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.
## Test Plan
No new functionality was added, so no new tests were added; I just
validated existing tests and examples.
## Test Result
All tests passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* Fix alignment issue in Stream-K workspace buffer
In CK Tile Stream-K, the workspace buffer is used to hold flags and
partials, where the first i bytes holds the flags and the remaining
bytes hold partials. This change adds padding to the flags prefix of the
workspace buffer to ensure the number of bytes is 128B-aligned. Without
this alignment, since workgroups do not skip cache when reading from
partials, they may read stale partials data in cache, leading to
incorrect results. The added padding avoids the stale data reading.
This change also re-enables the test_ck_tile_streamk_reduction tests.
* Compute reference GEMM on GPU for test verification to decrease testing time
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
* refactor: remove Default scheduler implementation as it not used anymore
* refactor: remove dead code from gemm universal kernel
* chore: add descriptive comments about amd intrinsic hardware sync instructions
* fix: label existing memory pipeline for aquant as intrawave
* [CK_TILE] unify double and single lds implementation (#108)
Unify LDS buffer management API for single and double buffering modes
This change consolidates the Local Data Store (LDS) buffer management by:
Merging single and double LDS buffer APIs into a unified interface
Implementing ping-pong address calculation in pipeline when double LDS is enabled
Computing pong buffer addresses dynamically using base address offsets
---------
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* update wp_pipeline
* fix a c++17 issue
* update for ci errors
* fix ci issues
* include a header to fix ci errors
* fix some rebase issues
* update with rebase
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* hacky poc
* fix tile engine
* kill the lambda
* maybe fix test build
* more fixes
* clang-format
* save temp
* clang-format
* mostly fix examples
* clang-format
* remove dead code
* more cleanup
* fix fmha bwd build (default epilogue set/add appears to be broken)
* fix default epilogue tests but not correctness
* clang-format
* fix bquant
* clang-format
* cleanup dead code
* rearrange make windows for readability
* restore changes to IsSupportedArgument
* fix smoke-builder
* clang-format
* fixup rename class
* build fixes
* clang-format
* fix builder
* fixup
* remove set from builder tests
* fix test
* clang-format
* re-refactor the kernels
* clang-format
* fix header license
* remove memory operation from conv bwd test
* clang-format
* clang-format example,include
* clang-format test
* build fixes
* clang-format
* solve compilation error
* fix the CI
* solve compilation error
* clang format
* solve merge conflict
* solve merge conflict
* solve the gfx11 error
* solve test error
* moar build fixes
* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* fix for splitk if splitk < grid
* add different splitk implementation
* minor bugfix for streamk gemm
* Add test
---------
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* CK Tile Stream-K Tree Reduction
This change adds the first implementation of the Stream-K tree reduction
strategy into CK Tile. The tree reduction reduces the the number of
steps for accumulating results for a tile from O(N) to O(logN) where N
is the number of workgroups contributing to a C tile.
Additionally, in the original non-atomic reduction strategy, atomics
were used to set the flags buffer and to read from the flags buffer.
Howeover, through investigation with the tree reduciton, atomics with
default (relaxed) semantics were not enough to guarantee workgroups
would not read stale data, leading to incorrect results. Stronger
acquire/release memory orderings are too expensive. So, this change
also eliminates the use of atomics for setting the flags. Instead, we
leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby
avoiding the use of atomics.
Prelimiary tests were also added for the normal reduction and tree
reduction. More will be added in a future PR via tile engine.
* Move Stream-K kernel files to a subdirectory
* Cleanup Code Style & Handle Unsupported Reductions
This change makes the following small changes:
- Add an explicit else block for unimplemented reduction strategies
- Clarify type of sk_flags_ptr via auto*
- Add description for extra_iters_before_me variable
* Run new copyright script on new files
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.
- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
the streamk implementation reaches stability.
* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM
* [CK TILE STREAMK] Refactor: Extract common utility functions.
* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation
* Add pre-commit
* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK
* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK
* [CK TILE STREAMK] Update Jenkinsfile
* [CK TILE Engine] Update StreamK tile engine help message
Remove default value messages as they are automatically printed
* [CK TILE Engine] Update StreamK tile engine
- Remove namespace reboot
* [CK TILE Engine] Update StreamK tile engine
- Fix merge error
When there are multiple workgroups contributing to a tile, when using
atomics, there may be round off error in cases where the accumulator
type is not the same as the C type. To compute an error tolerance for
test validation, the Stream-K Tile Partitioner has a function called
estimate_num_wgs_per_tile to estimate the number of workgroups per tile.
That said, this function only provides an estimate. In some cases for
DP+2TSK, the function returns 1 rather than the more accurate value of
2.
Thus, this change updates the estimate_num_wgs_per_tile function to
explicitely return the value of 2 in cases for DP+2TSK to ensure that we
have a better error tolerance to avoid test failures due to round-off
error.
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
1. Enable grouped_gemm_quant and gemm_streamk on gfx12
- test_ck_tile_streamk_smoke is kept on gfx9, since it looks someone is still working on it.
2. Update warp tile size in grouped_gemm_quant and gemm_streamk unit test
3. Reduce gemm tile size to pass the build on gfx12 in test_gemm_streamk_reboot_types.hpp
* Introduces the new partitioner to implement the reduction StreamK kernel
* Add more doc text to functions
* Add persistent-dp option to streamk example
* Update example/ck_tile/40_streamk_gemm/README.md
* Persistent Stream-K Kernel Implementation
This change implements an operator() function in the
reboot::StreamKKernel class that is enabled when the Persistent flag is
set to true. In this case, the data-parallel portion and the Stream-K
portion of the kernel are fully persistent.
The changes were made in the reboot namespace. A future PR will remove
the old Stream-K kernel class and remove the reboot namespace.
* Unit Tests for Persistent Stream-K Kernel
This change contains the inital test suite for the Persitent Stream-K
Kernel. The files contain "reboot" in the name; a future PR will remove
tests for the old Stream-K Kernel and remove the "reboot" naming.
A future commit will add tests for the non-persistent kernel.
Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase
class. This allows us to estimate the number of accumulations done per
macro tile in C to use during validation when computing relative and
absolute tolerance.
* Adding implementation for the Non-Persistent Stream-K kernel
This code is adding the operator() function for the Non-Persistent Stream-K
kernel. Persistency of the kernel is determined through a template argument.
The Non-Persistent kernel will allocate additional workgroups for the data
parallel section, leading to a different structure for processing the data
parallel and Stream-K sections.
There has been an addition to the TilePartitioner to get access to the whether
Persistent has been set to true or false in the StreamKKernel.
* Adding in the tests for the Non-Persistent Stream-K kernel
* Refactor Stream-K Reboot Unit Tests
This commit makes the following changes:
- Update test cases to determine M, N, and K based on the number of CUs.
This ensures that each test case is one of Edge Case, SK Only, DP
Only, or DP + 2 Tile SK regardless of the architecture.
- Since the DP + 2 Tile SK test case takes long to run, this change
moves this case into a separate .inc file and labels it as an extended
test.
- Since the extended test takes > 30 seconds to run, this test is added
to the list of regression tests.
* Fix spelling errors in comments for test cases
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Changes based on review
Removed const volatile for typenames
Set up alias for is_tuple_t
Naming changes for clarity: GemmCommon -> BaseGemm
Moved std::enable_if_t out of template parameters and changed to a return type for operator()
Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused inlcudes
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent that means that both the data parallel and Stream-K sections
are data parallel. If it's non-persistent that means that only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.
The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.
Unit tests for the derived structs are included.
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.
This change also includes unit tests for the base tile partitioner.
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite
* Account for stride when computing K offsets for A and B tensor
This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correct regardless of
whether A and B are row or column major.
* Move helper code to test_gemm_streamk_util.hpp
* Separate tests into smoke/regression/extended. Add bf16 datatype
* Run clang-format
* Refactor combinatorial macro expansion and naming
* Adjust the initialization values to account for better tolerance on bf16
* Correct BF16 datatypes in comments
* Move the extended tests under the REGRESSION_TESTS label
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* style: clang format
* WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults
* feat(grouped_gemm_multi_d): add functionality to run persistant kernel
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* style: clang format
* fix: incorrect validation method and Dtensor layout in test suite
* docs: improved README text based on review comments
* fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint
* feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature
* feat: generalized grouped_gemm_kernel.hpp
* feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0
* refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel
* tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments
* fix: segfault fix by passing correct parameters for d tensors
* docs: add multi d info and trim down outdated content
* tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility
* style: clang format
* fix: incorrect validation method and Dtensor layout in test suite
* Update grouped_gemm example and pipeline
* find the root cause error in did not enable the transpose in gfx950 correctly
* Fix v3 pipeline, row and col major
* Disable f8 datatype tests, it fails on gfx950
* fix the abd test by clear the runtime argument unsupported
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
* Have a workable version for SGPR
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* substitute with the new sgpr read api
* update the CHANGELOG
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* change to static for logic
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num
* WIP: add temp config to stress test numerical error correction
* refactor: remove comments
* Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function
Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset. But, the only member
variable of the SplitKBatchOffset class used in the MakeGemmTensorViews
function was splitted_k (an int32_t). The splitted_k value was used as
part of defining the dimensions of the tensor view. That said, for
Stream K, we do not need to use the SplitKBatchOffset class since we are
not using Split K. Thus, this commit changes the splitk_batch_offset
parameter to a int32_t called k_size. This will avoid the constraint of
requiring a caller of MakeGemmTensorViews to use the SplitKBatchOffset
class while still providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.
* StreamK Kernel RunGemm Implementation
Stream K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:
1. The UniversalGemmKernel::RunGemm function computes num_loop based on
a static function of the TilePartitioner. That said, for Stream K,
num_loop must be computed using a member function (namely
GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires the use of a
SplitKBatchOffset object which is not used for Stream K since we are
not using Split K.
Thus, this change adds a RunGemm function in the StreamKKernel class.
* initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm
* Fix indexing and offset issues for StreamK
These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by
MPerblock or NPerBlock, respectively. This ensures tile window origins
are at the correct locations.
- Fix bug in the tile partitioner's GetTileIdxWithOffset. Now, we apply
divmod to the given references to ensure correct values are available
to the caller.
- Added documentation in the Stream-K operator()
* Initial gtests for Stream-K
These changes add an initial gtest suite for the CK Tile Stream-K
kernel. Currently, due to bugs in the StreamKTilePartitioner (which will
be handled in a future PR), there are validation issues for certain
cases which may differ on different architectures. Thus, we opted to run
cases that are only fully data-parallel (skipping others). A guard was
added to Stream-K's IsSupportedArgument method to ensure that callers
are aware of this constraint. Additionally, to ensure testing
reproducibility, options for setting the number of CUs and occupancy
were added to MakeKernelArgs.
* Use GemmPipeline operator() variant that takes hot loop and tail num
In Stream-K, the num_loop value varies per WG and per iteration of a
Stream-K loop. So instead, we use the version of the GemmPipeline's
operator() function that takes in has_hot_loop and tail_num. This is
similar to what is done in Grouped GEMM.
* changes from review: comments, move readfirstlane, remove ifndef
* Switch direction of C tensor traversal & add padding guard
Prior to this change, WGs travelled backwards through their assigned
macro tiles in the C tensor. For instance, if WG0 is responsible for C
tiles 0 and 1, it would first visit tile 1 then tile 0. This means that
the iter_end decrements in each iteration of the stream-K while loop.
Since we are working with unsigned integers, the subtraction operation
may not be safe. Thus, this change makes is such that WGs travel forward
so that their iter_start is incremented and their iter_end remains
fixed.
Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.
Together, these changes make is such that the algorithm is correct when
sk_blocks is greater than zero.
* Disable StreamK_M256_N256_K256_SKBlocks12 test case
This instance involves >=3 WGs contributing to each macro tile in C. Due
to the use of atomics, this is resulting in precision errors. These
errors will not persist once the reduction strategy is implemented. We
will re-enable this test then.
---------
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
* docs(README): update readme with new build instructions
* feat(grouped_gemm): add support back for non persistent kernel
* refactor(grouped_gemm): simplify tensor creation
* refactor(grouped_gemm): Persistance is now GemmConfig value for easier management
* chore(grouped_gemm): add print statements to ease debugging
* WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration
* fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity
* WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines
* chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle
* style: clang format work so far
* BUG!(grouped_gemm_kernel.hpp): figured out a potential bug in for numerical errors in preshuffle pipeline
* fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors
* refactor(gemm_presuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm
* chore(grouped_gemm): add/remove debug comments and debug print statements
* feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes
* chore(gemm_profile): add new argument combinations
* fix: branch cleanup, formatting, refactoring
* fix: branch cleanup, formatting, refactoring
* chore(changelog): update changelog to reflect new featuer
* address review comments & nit
* initial commit for skeleton code
* replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code
* fixed up code to match CK Tile convention: data type changes, naming changes, etc.
* change for num_sk_blocks data type
* formatting fix
* minor fixes
* moved reduction argument to template
* resolved comments from PR review: standardizing naming, pruning unneeded code
* resolve errors from merge of device op PR: moved enum to common file
* switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems
* unsigned type fix
* add const qualifier
* added documentation for template parameters
* documentation edit