Commit Graph

485 Commits

Author SHA1 Message Date
Ville Pietilä
abccb649d1 [CK_Tile] Merge multiple convolution groups into a single GEMM batch (#2986)
* Fix compilation of the grouped conv examples.

* Fix grouped conv bwd weight example output in CK Tile.

* Add number of groups to merge to ck tile grouped gemm example.

* Initial set of tests for TransformConvBwdWeightToGemm.

* Added unit tests for TransformConvBwdWeightToGemm conv groups are merged.

* WIP: Tensor transformations.

* Add unit tests for coordinate transforms.

* Fully working conv group merging for TransformConvBwdWeightToGemm.

* WIP: Merged conv groups offset calculation.

* Adde unit tests for tensor view.

* WIP: Merged conv groups epilogue.

* Enable running multiple conv groups per batch.

* Add tests for tile_distribution_encoding.

* Change example to match optimally depthwise convolution with merged groups.

* Add more tests for tensor view.

* Integration test for reading diagonal blocks from grouped distributed tensor.

* Improved integration test.

* Improve test for accessing diagonal blocks.

* Added integration test for cshuffle epilogue LDS tile distribution.

* Add more logging.

* Increase the max number of reported errors.

* WIP: merged conv groups GEMM epilogue changes.

* LDS to global memory copy.

* Fix tile window size for c block.

* Integration test for CShuffle epilogue.

* Improved CShuffle test.

* WIP: Separate epilogue for merged conv groups.

* Tile example parameters changes to match depthwise conv.

* Offset fixes.

* Epilogue fixes.

* Working baseline for depthwise covolution with merged conv groups.

* Fix build.

* Initial unit tests for tensor descriptor.

* Add one more unit test for tensor view.

* WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding.

* Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding.

* Add more comments, disable debug code.

* Remove debug and other dead code.

* Code clean-up for bwd tensor transformations.

* Enable running multiple GEMM batches of merged conv groups.

* Add compile check for assumed row-mjor layout.

* Fix strides in 1D conv to gemm transformation.

* WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases.

* Fix case k > 1 and c=1.

* Remove debug code.

* Make MPerGroup and NPerGroup template parameters.

* Add additional check for non-supported c > 1 case.

* WIP: Put back the generic tensor descriptors for convolutions.

* Fix tensor descriptors.

* Remove the obsolete template parameters.

* Add more instances.

* Fix bugs in merged conv groups tensor descriptors.

* Fix tensor descriptors for merged conv groups when K > 1.

* Remove debug output.

* Remove dead code.

* Fix merge conflicts.

* Code clean-up.

* Remove unused code.

* Run clang-formatting.

* Remove debug prints and obsolete tests.

* Check that number of convolution groups is multiple of merged groups.

* Fix build after removing obsolete functionality.

* Remove obsolete enumeration.

* Fix new unit projects.

* Remove unnecessary includes.

* Fix passing the number of merged groups.

* Remove unrelated tests.

* Fix IsSupportedArgument for bwd weight conv kernel.

* Fix clang formatting.

* Fix the bwd weight conv to gemm mapping for num merged groups > 1.

* GEMM config for conv group merging.

* Fix clang-formatting.

* Remove obsolete comment.

* Fix typos in comment strings.

* Increase the max number of reported errors when testing against reference implementation.

* Rename gemm_config to conv_config.

* Rename GemmConfig to ConvConfig and move NumGroupsToMerge into ConvConfig.

* Change num_groups_to_merge to a boolean flag in the ck tile grouped conv example.

* Run clang-format.

* Add number of merged groups into kernel name string.

* Remove group merging flag from CK Tile grouped conv example.

[ROCm/composable_kernel commit: 121bf0e1f3]
2025-10-29 16:49:28 +02:00
Yashvardhan Agarwal
edea16ce14 [CK_TILE] Add indexing to pooling operator (Lwpck 3892) (#3013)
* Add indexing support to pooling operator

- Add IndexDataType template parameter to pooling problem and kernel
definitions

- Enable pooling kernel to output indices of selected elements during
max/absmax pooling

- Add overloaded operators for Max and AbsMax that track when values
change using bool changed parameter

-  Support optional index buffer allocation and management in device
memory

- Modify BlockReduce2d classes to handle index tensors alongside value
tensors

-  Add separate shared memory allocation for index data in cross-warp
reductions

- Create validate_pool_indices function to verify index correctness

- Modify pool3d.cpp example to demonstrate index output functionality

- Add tests for index output

* fixes

* Refactor BlockReduce2D functions to get rid auxiliary private types.

* comment resolutions and some changes to block_reduce2d

- index reference implementation improved
- reduce_operator.hpp cleanedup
- updated the block_reduce2d.hpp to have index calculation for
BlockReduce2dLinearCrossWarpSync as well

* conditionally used variable declaration improvement

- the conditionally used vairbales are used only when indexing is
enabled. To inform the compiler that they may be unused and declare them
with least size possible. This may allow it to be optimized compared to
the previous declarations

* comment resolutions

* lexical ordering of the indicies

- introduced accumulate methods that handle the intermediate steps if
needed to order the indexes

* add reduce_operator_accumulate.hpp to core.hpp

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[ROCm/composable_kernel commit: 3052d7c9e6]
2025-10-29 09:58:04 +02:00
Jeff Huang
9ad15a658c [CK_TILE] fmha: Add query padding support to backward pass (#3097)
* [CK_TILE] fmha: Add query padding support to backward pass

Introduces support for query sequence padding (q_padding) in the FMHA backward pass kernels.
- Passing `seqlen_q_ptr` to the backward kernels to distinguish logical from physical sequence lengths.
- Updating `OGradDotO`, `ConvertQGrad`, and `DQDKDV` kernels to respect logical lengths and handle zero-length sequences.
- Aligning LSE indexing in the forward kernel with the padded layout for consistency.
- Adding a new GTest suite (`test_fmha_bwd_kernel_padding.cpp`) with comprehensive tests for various padding scenarios, including zero-length
  sequences and deterministic mode.

* fix clang format

* Adapt fmha_bwd_runner.cpp to new q, kv sequence padding
Add backward q/kv sequence padding unit tests.

* [CK_TILE] fmha: Unify sequence length and padding handling

Refactor the handling of sequence lengths and padding in the
FMHA forward and backward kernels to provide a more unified and flexible
interface.

- Replaced `seqstart_padded_*_ptr` with a more robust system that uses
  `seqstart_*_ptr` for physical sequence lengths and introduces
  `seqlen_*_ptr` and `cu_seqlen_*_ptr` for logical (unpadded) lengths.
- Established a clear order of precedence for determining sequence
  length: cumulative lengths (`cu_seqlen_*_ptr`) take priority,
  followed by per-sequence lengths (`seqlen_*_ptr`), and finally
  physical lengths derived from `seqstart_*_ptr`.
- Clarified the distinction between "group mode" and "batch mode" and
  how sequence lengths are handled in each case.
- Renamed `cu_seqlen_kv_ptr` to `cu_seqlen_k_ptr` for consistency.
- Updated comments and documentation to reflect the new argument
  structure and usage.

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 7c6430eca0]
2025-10-29 13:56:11 +08:00
Sami Remes
39e77ae650 [CK_TILE] Top-K with Sigmoid kernel (#3062)
* Add sigmoid option to topk_softmax

* fix formatting

* add to changelog

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Use else if

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 515e283091]
2025-10-28 10:54:06 -07:00
Mateusz Ozga
8d51d0ef4d [CK_TILE] Fixed multi-abd GEMM test, NaN problem (#2979)
* Multi-ABD NaN problem

* Rollback tests

---------

Co-authored-by: root <root@splinter-126-008d.aus.dcgpu>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: da4247a6df]
2025-10-28 15:53:36 +01:00
Aviral Goel
dfbc489a6b [CK_TILE] Add Bquant to Grouped Gemm (#3063)
* update test cases

* format codes

* use GTEST_FAIL

* add bquant to grouped_gemm

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* fix a bug in test_grouped_gemm_util

* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm

* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* feat: add bf8 support

* chore: remove unnecessary decltype usage

* chore: add default quant_mode to function signature as fallback

* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel

Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.

* chore: set default quant mode in function signature

* test: add additional test cases to cover edge case of no hotloop

* chore: clang formatting

---------

Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 4368fd9f57]
2025-10-28 10:20:24 -04:00
Khushbu Agarwal
e10a11323a Fix quant scale matrix layout for block scale gemm (#3079)
* Adding support for TiledPermuteN

* Adding test

* moving shuffle functions to common place

* resolving commit hook

* fix formatting

[ROCm/composable_kernel commit: b11f53a484]
2025-10-27 13:56:07 -07:00
Johannes Graner
3b8e9864c6 [CK_TILE] Add conv fwd + bias + clamp example (#3012)
* Implement argument passing to element-wise functions for fwd convolution

* Add files for fwd + bias + clamp example

* Implement Bias

* Implement Clamp

* Elementwise function composition

* Composition unit test

* Implement fwd + bias + clamp example

* Simplify argument passing and composition

* elfunc -> bias_and_clamp

* Rename function to specify example

* Move element-wise function instantiation to kernel

* Make bias a runtime tensor

* No ugly namespace aliasing

* Initialize element-wise function on host

* Remove function initialization helper, simplify Compose initialization

* Remove unintended LSP compatibility patch

* Clean up includes and unused code

* Switch names in cshuffle epilogue

* Move CDElementwise to conv traits

* Re-add required include

* Initialize bias in same way as other tensors

* Better type specification for ds pointer

* Disable 1D convolution

* Add warning for non-group-constant bias

[ROCm/composable_kernel commit: 5c1974065e]
2025-10-27 18:43:09 +01:00
arai713
cbf24c87c6 [CK_TILE] Stream-K operator() Reboot (#3064)
* Persistent Stream-K Kernel Implementation

This change implements an operator() function in the
reboot::StreamKKernel class that is enabled when the Persistent flag is
set to true. In this case, the data-parallel portion and the Stream-K
portion of the kernel are fully persistent.

The changes were made in the reboot namespace. A future PR will remove
the old Stream-K kernel class and remove the reboot namespace.

* Unit Tests for Persistent Stream-K Kernel

This change contains the inital test suite for the Persitent Stream-K
Kernel. The files contain "reboot" in the name; a future PR will remove
tests for the old Stream-K Kernel and remove the "reboot" naming.

A future commit will add tests for the non-persistent kernel.

Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase
class. This allows us to estimate the number of accumulations done per
macro tile in C to use during validation when computing relative and
absolute tolerance.

* Adding implementation for the Non-Persistent Stream-K kernel

This code is adding the operator() function for the Non-Persistent Stream-K
kernel. Persistency of the kernel is determined through a template argument.
The Non-Persistent kernel will allocate additional workgroups for the data
parallel section, leading to a different structure for processing the data
parallel and Stream-K sections.

There has been an addition to the TilePartitioner to get access to the whether
Persistent has been set to true or false in the StreamKKernel.

* Adding in the tests for the Non-Persistent Stream-K kernel

* Refactor Stream-K Reboot Unit Tests

This commit makes the following changes:
- Update test cases to determine M, N, and K based on the number of CUs.
  This ensures that each test case is one of Edge Case, SK Only, DP
Only, or DP + 2 Tile SK regardless of the architecture.
- Since the DP + 2 Tile SK test case takes long to run, this change
  moves this case into a separate .inc file and labels it as an extended
test.
- Since the extended test takes > 30 seconds to run, this test is added
  to the list of regression tests.

* Fix spelling errors in comments for test cases

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Changes based on review

Removed const volatile for typenames
Set up alias for is_tuple_t
Naming changes for clarity: GemmCommon -> BaseGemm
Moved std::enable_if_t out of template parameters and changed to a return type for operator()
Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 054fdb765c]
2025-10-27 09:14:17 -07:00
Adam Osewski
75a0f41bb0 [CK_Builder] Add name member to unary elementwise ops & update builder traits. (#3093)
* Add name member to unary elementwise ops.

* Update elementwise_op_name to check for name attribute.

* Require that the layout is derived from BaseTensorLayout struct.

[ROCm/composable_kernel commit: f53d857b25]
2025-10-25 07:27:03 -07:00
Max Podkorytov
26c4304c84 [CK-Tile][Async gemm] add missing sync and f8 inputs test cases (#3000)
* add missing sync and f8 test cases

* reformat test cases

* comment failing cases

* bump

* reintroduce compv4 shapes

[ROCm/composable_kernel commit: 86d542f663]
2025-10-24 12:16:01 -07:00
Khushbu Agarwal
eef9513fd3 [CK_TILE] Adding support for TiledPermuteN on preshuffle Block Scale Gemm (#3019)
* Adding support for TiledPermuteN

* Adding test

* resolving remod.py

---------

Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>

[ROCm/composable_kernel commit: 0584399571]
2025-10-24 11:06:51 -07:00
Max Podkorytov
77fc1e4c3f limit the rotating count to prevent oom (#3087)
[ROCm/composable_kernel commit: fdcc1f75c3]
2025-10-24 08:55:34 -07:00
kyle-256
b49f5d9de5 [CK_TILE] add tensorwise quant in grouped gemm (#3007)
* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* change cmake

* change code based on comments

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 3c12a02827]
2025-10-24 07:41:54 -07:00
Gino Lu
7e4c021e26 [CK_TILE] Add fp4 warp gemm 16x16x128 (#2738)
* first commit

* fix format error

* fix vec size error

* fix clang format

* fix type error

* add interface in warp_gemm_impl

* fix interface

* fix bug

* fix bug

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: bedade2572]
2025-10-23 10:55:51 -07:00
Qianfeng
cf31de9211 [CK_TILE] Fix in set_slice_tile (#2232)
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: fbd101b1ac]
2025-10-23 10:34:02 -07:00
Haocong WANG
2a01918313 [CKTILE] FMHA fwd trload lse fix (#3046)
* enable storelse for fmha_fwd_trload kernel

* fix lse in trload

* fix the mask related bug

[ROCm/composable_kernel commit: 0d3860dfdb]
2025-10-23 09:33:33 +08:00
lalala-sh
63e0a73bd3 [CK_TILE] Update flatmm related kernels (#3022)
---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: 211d64e18a]
2025-10-22 22:36:11 +08:00
Johannes Graner
b8882aae95 [CK_TILE] Conv bwd splitN support (#3047)
* Conv bwd splitN support

* Adjust splitting calculations to lengths format

* Prepare indexing for future splitK support

[ROCm/composable_kernel commit: cbd1279ae6]
2025-10-22 13:34:06 +02:00
MHYangAMD
6d802e7ba4 Introduce tree reduction for BlockReduce2dCrossWarpSync (#2588)
* Introduce tree reduction for BlockReduce2dCrossWarpSync

* Rename original impl to BlockReduce2dLinearCrossWarpSync

* Replace warp_size with get_warp_size()

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 5a27a97391]
2025-10-22 14:41:35 +08:00
Yashvardhan Agarwal
9072046e55 fix identity value of AbsMax (#3058)
* fix identity value of AbsMax

- Identity value of AbsMax should be 0 not numeric<T>::lowest()

* Update include/ck_tile/core/utility/reduce_operator.hpp

resolved comment

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

[ROCm/composable_kernel commit: 35754d2ec8]
2025-10-21 14:42:08 +02:00
Johannes Graner
671f2686c0 Fix race conditions in ck_tile remod (#3061)
[ROCm/composable_kernel commit: 4043401db1]
2025-10-21 09:35:04 +02:00
Max Podkorytov
1d7e4157c5 [CK_TILE] Fix transpose_vectors for 2x2 8-bit tiles (#3042)
fix transpose_vectors logic for 2x2 8-bit tiles

    add a test which goes through this code path.

    factor out constexpr'd cases into smaller functions.

    add inline docs about the data movement

    impact: gemms with 8-bit non-rcr inputs on gfx942


[ROCm/composable_kernel commit: 2570462ecf]
2025-10-20 13:40:44 -07:00
Gino Lu
b7e5da5e83 [CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044)
* Patch for pk_fp4_raw_t buffer load and ref check

[ROCm/composable_kernel commit: fb1d090f3c]
2025-10-20 14:47:04 +08:00
AviralGoelAMD
48b0e60e14 docs: add inline comments about flush_cache and rotating buffer
[ROCm/composable_kernel commit: b03764ca5a]
2025-10-17 12:56:47 -04:00
Yashvardhan Agarwal
c5eda13381 fix identity values in Max and AbsMax (#3048)
- The identity value method returned the minimum positive number while
we need the lowest number for Max and AbsMax operations

[ROCm/composable_kernel commit: 889ffc0b1d]
2025-10-17 09:49:21 -07:00
Emily Martins
6157673c39 Fix CK Tile Stream-K BF16 Validation Errors (#3039)
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.

This change ensures uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.

[ROCm/composable_kernel commit: 352dee5225]
2025-10-17 09:33:38 -07:00
Johannes Graner
f7ffb12123 Pre-commit in CI (#3029)
* Pre-commit in CI

* Specify python version, and install dos2unix for remod

* Refactor remod hook to correctly install dependencies

* Run pre-commit

[ROCm/composable_kernel commit: 8a4cd32d86]
2025-10-17 09:28:38 -07:00
Johannes Graner
8af66c65d0 Update pre-commit to fixed versions, run remod for ck_tile (#2895)
* Fix ruff linter errors

* Fix remod dos2unix command

* Clang format

* Ignore utility in remod

* Run remod

* Specify clang-format version in pre-commit

* Specify ruff version

* Include PoolKernelArgs in reference_pool

* Add calculate_total_elements to reference batched contraction

* Fix calculate_total_elements declaration

* Refactor remod pre-commit hook

* Fix Aquant tests

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: d40b50b9d5]
2025-10-16 15:29:17 -07:00
Emily Martins
95805e674a Style updates and cleanup
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused inlcudes


[ROCm/composable_kernel commit: cb83d52301]
2025-10-16 08:47:06 -06:00
Astha
47e002fc27 Addition of the derived structs for the new Stream-K TilePartitioner
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent that means that both the data parallel and Stream-K sections
are data parallel. If it's non-persistent that means that only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.

The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.

Unit tests for the derived structs are included.


[ROCm/composable_kernel commit: 8f75d7cea6]
2025-10-16 08:47:06 -06:00
Emily Martins
48ebcbe898 Stream-K Tile Partitioner Base Class with Tests
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.

This change also includes unit tests for the base tile partitioner.


[ROCm/composable_kernel commit: f87f768d16]
2025-10-16 08:47:06 -06:00
Illia Silin
0b8e15dc2e re-enable clang-format by default (#3030)
* re-enable clang-format by default

* fix clang format

[ROCm/composable_kernel commit: 3348f01e6f]
2025-10-15 07:43:11 -07:00
felix
7b584fd2d2 Felix/opt sorting (#2902)
* merge felix/sorting
* opt moe sorting  (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 4c826abfff]
2025-10-15 09:24:03 +08:00
joyeamd
2592957760 update s_barrier's logic in gfx12 architecture (#3003)
change s_waitcnt's logic in gfx1250

change s_waitcnt's logic in gfx1250

update comment

[ROCm/composable_kernel commit: b9d74e7746]
2025-10-14 08:49:34 -07:00
Illia Silin
fc0f6be56d Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)
This reverts commit 985b9f98a2.

[ROCm/composable_kernel commit: e4298e55c7]
2025-10-14 08:43:14 -07:00
jakpiase
4643cdd962 [CK_TILE] Switch into universal gemms for conv bwds (#2981)
* switch into universal gemms for conv bwds

* some fixes and support universal gemm in conv fwd

* add reviewer comments

[ROCm/composable_kernel commit: 6deaaa92cc]
2025-10-14 16:09:16 +02:00
ClementLinCF
6a423df526 [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540)
* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm

* Update rmsnorm host reference

* Update tree reduction of rmsnorm for reference host

* Fix cross warp for m > 1 cases

* Add RMSNorm model selectable option for host reference

* Fix save_unquant cases

* Update reference rmsnorm forward function to use enum for model sensitivity

* Update reference rmsnorm calculation for model sensitivity

* Fix m warp for layernorm

* Adjust parameter of reference for twoPass

* Fix clang format

* Run clang-format-overwrite.sh to fix formating issue

* fix clang format

---------

Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: e1b0bdfbfa]
2025-10-13 11:52:37 -07:00
Sami Remes
985b9f98a2 [CK_TILE] Non-K Major from old CK to CK-Tile (#2442)
* Enable the adapted LDS B layout for Row-Major

* fix formatting

* Implement specialized col-major A LDS block descriptor

* Fix formatting

* Use VecLoadSize for AK1/BK1

* Fix some thread access pattern values

* Use GetVectorSizeA for A

* Fix formatting

* Add extra condition to avoid division by zero

* disable layout for wave32

* remove extra else

* fix formatting

* Fix formatting

* Rename one remaining TileDistributionEncodingPattern2D

* Use integer ceil division

* revert remod.py changes

* also revert utility.hpp

* use getA/BTileAccessPattern everywhere

* use integer_divide_ceil for AK0 too

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[ROCm/composable_kernel commit: d2bbca3eca]
2025-10-13 14:27:02 +02:00
aledudek
ab7f67488c [CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955)
* First checkpoint

* Second checkpoint - hot loop scheduler

* Third checkpoint - init main operator

* Fourth checkpoint - main loop ready

* Fifth checkpoint - main loop fix

* Sixth checkpoint - ReadWritecompFunc

* Seventh checkpoint - Tail finished

* [CK_TILE] Blockwise gemm pipeline v5 complete

* Working

* Working fixes 2

* Rename v5 to v77 temporarily

* Data type adjustment

* Data type adjustment 2

* [CK_TILE] Blockwise Gemm pipeline v5 add tests

* [CK_TILE] Fix calculation error

* TEMP: check pipeline

* Fix name to V6

* naming and documentation changes

* WIP dump

* Try fixing v1

* Failing tests v5

* Debugging

* Changes v2

* F16 tests working great

* Working BlockwiseGemmPipelineV5 as V6

* Cleanup and format

* Merging changes part1

* [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6

* Remove commented code

* Fix gfx950 build issues

* Fix file formatting

* Review changes, more concat info, add bf16 bf8 tests

* Fix formatting

* Add bf16 and bf8 tests

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[ROCm/composable_kernel commit: 634634f5c0]
2025-10-13 13:57:37 +02:00
aledudek
e42f27e42e [CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860)
* Add valid check batched gemm part1

* [CK_TILE] Add batched gemm kernel IsSupported func checks

* revert broken pre-commit hook changes

* revert broken pre-commit hook changes v2

* Clarify error messages

[ROCm/composable_kernel commit: 3021604213]
2025-10-13 13:55:23 +02:00
damien-lejeune
b904c41e44 Update include path to break the remod's cyclic dep issue (#2978)
* Update include path to break the cyclic dep issue

* Use ck_tile::permute_vectors_i4x4_b in tile engine

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 46c10c316d]
2025-10-13 13:24:47 +02:00
msaffari-amd
b9f7381f95 [CK Tile] contraction multi d - kernel & example (#2901)
* Initial commit. create batched_contraction_kernel file

* initial problem definition

* implement initial example to launch kernel

* add universal gemm to contraction. initial phase

* complete implementation for special case all Dims are 1 and no Ds

* clean code

* initial changes to support multi dimensional G

* more progress in implementing multiple G

* tmp commit

* manage dynamic NumDimG in kernel

* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit

* implement the example for general Multi dimension G M N K and test different reference calculation algorithms

* 2 functions for reference using multi dimensional and flat indexing

* clean the code for muti dimentional G, M, N, K contraction and add some logs

* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E

* some cleaning on kernel

* clean the code for  calculating the offsets from flatten batch number

* Start adding MultiD support to kernel and example

* more changes to manage multi D in kernel and example

* manage passing multi d to kernel and testing.

* complete multi D support in kernel. modify example code to support it

* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning

* Minor fix

* Generalize example code for variable NumD tensors and apply cleanup based on review feedback

* Refactored code and addressed review feedback

* refactoring, cleaning, add documents, in kernel side and example codes

* Optimize batch offset calculation in kernel

* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: e9f0cc83a8]
2025-10-13 12:30:28 +02:00
Christopher Millette
7df03bedec Streamk functional tests (#2974)
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite

* Account for stride when computing K offsets for A and B tensor

This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correct regardless of
whether A and B are row or column major.

* Move helper code to test_gemm_streamk_util.hpp

* Separate tests into smoke/regression/extended. Add bf16 datatype

* Run clang-format

* Refactor combinatorial macro expansion and naming

* Adjust the initialization values to account for better tolerance on bf16

* Correct BF16 datatypes in comments

* Move the extended tests under the REGRESSION_TESTS label

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: f5708882a3]
2025-10-11 07:53:40 -05:00
Khushbu Agarwal
fdb397b2c9 supporting prefill shapes for preshuffle block scale gemm (#2975)
* debugging

* debugging for prefill shapes

* comment unused code

* fix for prefill shapes

* clearing up the code

* add int4 to universal gemm example

* clang formatted

* adding test for prefill shapes in block scale gemm

* lil improv on the block pipeline

* Address Review Comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 3c39d279ab]
2025-10-10 15:36:24 -07:00
Max Podkorytov
03d3c63948 [CK-Tile] functional support for transposed inputs in compute-bound double-lds-buffer pipeline with async loads from global memory to LDS (#2984)
* reuse local prefetch logic from compute v4 pipeline

add single-tile test

explicit lambda capture

reuse lds block descriptors from base policy for the transposed case

match the test case kernel configuration with compute v4

* add comments

[ROCm/composable_kernel commit: 9d060d3e3c]
2025-10-10 12:57:50 -07:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
8134f1d6f4 [CK_TILE] fix pk_fp4 compilation for non-gfx950 GPUs (#2983)
See build error log from
https://github.com/ROCm/composable_kernel/issues/2271#issuecomment-3150218542

This PR make vector element access constexpr-safe by avoiding operator[] on
ext_vector_type(2) and replace those sites in the pk_fp4 conversions so they
can be used in constant expressions, as The operator[] on ext_vector_type(2)
isn't allowed in constant expressions, which caused "constexpr function never
produces a constant expression" with a note at x[0]. Using `bit_cast` to a
trivial array representation keeps it constexpr-compatible.

Signed-off-by: Hollow Man <hollowman@opensuse.org>

[ROCm/composable_kernel commit: fb66b4f5e4]
2025-10-09 07:43:41 -07:00
Yashvardhan Agarwal
bc9e9df38f [CK_TILE] Pooling FWD (Lwpck 3683) (#2956)
* Pooling 2D/3D with refernce

* Tests & cleanup

- added test for ppoling
- cleanup
- removed 2d example

* Comment resolution

- README added
- example target name rectified
- appropriate arg description and comments added

* clang-format

* appropriate blocksize calc

* modifications for future indexing addition

- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future

* some basic fixes

* comment resolutions

* comment resolutions

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 7b6451b68e]
2025-10-09 16:13:26 +02:00
Aviral Goel
7146674c11 Add Memory pipeline for AQuant Block Scale GEMM (#2987)
* WIP: add memory pipeline boiler plate code that compiles and works for one block

* WIP: tail handling works for memory pipeline

* WIP: numerical errors appears to have gone by adding block_sync_lds()

* fix: numerical error with memory pipeline by adding block_sync_lds() and new tail handler

* refactror: remove debug print statements and lints

* fix: remove redundant sync barriars

* chore: remove lint

* fix: remove unused code from tile handler and remove redundant block_sync_lds()

* fix: correct parent struct name for memory pipeline

* fix: remove static assert check from parent struct and add it to child struct because not all child structs needs to static assert

* fix: defer block sync lds to just before prefill

[ROCm/composable_kernel commit: e99356dabc]
2025-10-08 17:22:30 -07:00
Cong Ma
6650feee3a [CK TILE GEMM] Refactor the code of transposeC and quantpreshuffle of AQuant Gemm (#2965)
Refactor the code of transposeC and quantpreshuffle of AQuant Gemm to make it easier to maintain.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 1d4db30af9]
2025-10-08 00:05:38 -07:00