Commit Graph

3949 Commits

Author SHA1 Message Date
ApoorvaKalyani
bac1ccbf8b Grouped convolution backward data WMMA v3 implementation (#3460)
* Added device level implementation for bwd_data_wmma_v3.

* Added first instance of bwd_data_wmma_v3(f16).

* Add support for bwd data in gridwise implementation

Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight

* Initial device implementation of bwd data

* Remove unused template parameters in device impl

* Add one instance for different layout

initial check of device implementation

* Add tests for splitk and for different layouts

* Appended more instances to wmma_v3_f16.

* Added conv_2d bf16 wmma_v3 instances.

* Added conv_3d_bf16 wmma_v3_instances.

* Added conv_3d_f16_wmma_v3_instances.

* Added SplitN test cases for wmma.

* Conv3d_bwd_data_scale_wmma_v3 instances.

* Conv3d_bwd_data_bilinear_wmma_v3_instances

* Renaming the device level instances file to a common name, since it is defined for different data types.

* Renaming the instances and fixing typo

* Added the test cases to regression test list

* NCHW support for wmma_v3

* Examples for bf16 and f16 bwd_data_wmma_v3

* Added transpose conditions for device impl

* fixing bugs

* Added the gemm_args array implementation

* WIP debug conv bwd

* fix splitk

* Grouped gemm fix

* Update CmakeLists with EOF

* Added more instances for tests

* Fixed the run time error in examples and removed 3d conv examples.

* Fixed a typo.

* Updated CMakeLists to remove the deleted 3D convolution files

* Added print error statements for unsupported arguments

* Added the merge conflict related changes

* Fixed compilation error

* Fixed the InstanceFactory duplication error.

* Removed the print statements and added logs to Arg function

* All the merge conflict related errors resolved

* Added d_tensor tests.

* Added the missing example types of wmma_v3

* Merge error fix

* Corrected the instance name

* Reverted the bias relu change

* Reverted the transpose load local change

* Updated the regression test list with bwd_data_scale

* Revert "Reverted the transpose load local change"

This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.

* Revert "Merge error fix"

This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.

* Reverting the local change

* Added merge error fix

* Build error fix due to merge conflicts

* Added bias_relu example for wmma_v3

* Modified the main method in dtensor tests

* Updated the dtensor tests to pick all the shapes

* Updated the dtensor test shapes.

* Updated the mem operations in tests.

* Added reference func

* Fixed typos in device impl

* Added new header file and modified the include file for 3d tests

* Renamed the test file and added reference func call.

* clang format fix

* Added ignore params

* Modified device impl and tests

* Removed debug print statements and updated dtensor test shapes

* Fixing merge conflicts

* Fixing more merge conflicts

* Fixed copyrights

* Updated the tuned instances to bilinear and scale.

* Adding tuned instances to vanilla wmma_v3

* Removed all unused instances and modified test layouts.

* Cleaned up all instances, reverted the fwd fp16 instances, and updated the tuned fp16 instances.

* Fix clang format

* Updated tuned f16/generic instances

* Formatting the instances file

* Fixed copyrights and clang issues

* Nonsense commit to force git to force

* Removed the transpose instances

* Added verified generic instances

* Fixing namespace errors

* Added todo for failing shapes

* Formatting instance file

* Fix instance list formatting

* Removing unnecessary formats

* Renamed the common file

* Unification of xdl and wmma bwd_data tests

* Updated Cmake

* Added all layout types and deleted code.

* Updated Cmake to add the condition to all tests.

---------

Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: 53a1e4f551]
2025-12-30 16:25:08 +01:00
assistant-librarian[bot]
6850e5e7bb Merge commit 'dae85ead64c16b34eaa643d09fb0d6da008ca814' into develop 2025-12-29 15:14:37 +00:00
yadaish
fc3ffa0d75 [CK_TILE] support split-k a16w4 gemm1 (#3389)
* initial version to support moe gemm1 split-k

* add missing args

* fix build warning

* update reference

* for split-k disable bias and weight

* remove debug log

* fix format

* fix div by zero errors

* fix cmake config

* update

* resolve conflicts

* remove useless changes

* reformat

* fix

* remove useless changes

* fix ci

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>

[ROCm/composable_kernel commit: dae85ead64]
2025-12-29 23:05:35 +08:00
assistant-librarian[bot]
a90c72d560 Merge commit 'a0acc83a72c84a8cdbbdef6f397e617ac040aa72' into develop 2025-12-29 14:13:26 +00:00
JH-Leon-KIM-AMD
3772cf9dd4 [CK_BUILDER] Add GPU Reference Algorithm to CK Builder (#3381)
* [CK_BUILDER] Integrate GPU reference as ConvAlgorithm

Add GPU reference as a ConvAlgorithm specialization, enabling:
- Unified Builder API for reference and optimized kernels
- Future ckProfiler integration for validation
- First step toward numerical validation in Builder tests

Changes:
- Add ConvAlgorithmSpecialization::REFERENCE enum
- Add ConvAlgorithm_Reference struct
- Add IsReferenceAlgorithm concept
- Create 3 reference factories (Forward, BwdData, BwdWeight)
- Wire into conv_dispatcher
- Add proof-of-concept test (passing)

Test result: Can instantiate reference through Builder API

* Add GPU reference execution tests

- Reference kernel executes through Builder (459ms)
- Both reference and optimized can instantiate
- Tests passing

Next: Implement utilities for comparison

* Optimized Builder kernel execution works

- MakeArgument pattern implemented
- Builder-generated kernel executes successfully
- Tests passing (451ms execution)

Next: Add comparison

* VALIDATION COMPLETE: Builder == Reference

Builder-generated kernel output matches GPU reference!

Test: Validate_Optimized_vs_Reference_Forward_2D_FP16
Result: PASS ✓

This proves CK Builder generates correct code!

* Update to new Builder API

All tests passing

* Rename test file for clarity

test_builder_kernel_execution -> test_builder_kernel_validation

* Add all 3 directions support

- Forward, Backward Data, Backward Weight
- All reference factories working
- Dispatcher wired for all directions
- 9 tests passing

Tests:
- test_reference_execution: 3 tests (all directions)
- test_optimized_execution: 3 tests (all directions)
- test_builder_kernel_validation: 3 tests (fwd validated, bwd placeholders)

* Add backward direction support

- Backward data and weight dispatcher wiring
- Fix factories for new API
- All 3 directions tested
- 9 tests passing

* Refactor: Change IsReferenceAlgorithm from concept to consteval function

Address review feedback: Use consteval function in dispatcher instead of
concept, matching the pattern for other algorithms (Tile, XDL, WMMA, DL).

- Remove IsReferenceAlgorithm concept from conv_algorithm_concepts.hpp
- Add IsReferenceAlgorithm() consteval function to conv_dispatcher.hpp
- Update dispatcher to use function call: IsReferenceAlgorithm<T>()
- Remove redundant algorithm checks from reference factory requires clauses

All tests passing (9/9).
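
As a rough illustration of the pattern this refactor describes, the sketch below selects a factory with a compile-time predicate function inside `if constexpr`. Every type here (`Specialization`, `ReferenceAlgo`, `XdlAlgo`, `ReferenceFactory`, `OptimizedFactory`, `select_factory`) is a hypothetical stand-in rather than the Builder's actual definition, and `constexpr` is used in place of the commit's `consteval` so that the sketch also builds as C++17.

```cpp
#include <type_traits>

// Hypothetical stand-in for ConvAlgorithmSpecialization; only the REFERENCE
// enumerator is named in the commit message.
enum class Specialization
{
    REFERENCE,
    OPTIMIZED
};

// Hypothetical algorithm descriptors that carry their specialization.
struct ReferenceAlgo
{
    static constexpr Specialization specialization = Specialization::REFERENCE;
};
struct XdlAlgo
{
    static constexpr Specialization specialization = Specialization::OPTIMIZED;
};

// Predicate written as a function instead of a concept.
// The commit uses consteval (C++20); constexpr keeps this sketch C++17-friendly.
template <typename Algo>
constexpr bool IsReferenceAlgorithm()
{
    return Algo::specialization == Specialization::REFERENCE;
}

// Hypothetical factory tags returned by the dispatcher.
struct ReferenceFactory
{
};
struct OptimizedFactory
{
};

// Dispatcher: the branch is chosen at compile time by calling the predicate.
template <typename Algo>
constexpr auto select_factory()
{
    if constexpr(IsReferenceAlgorithm<Algo>())
        return ReferenceFactory{};
    else
        return OptimizedFactory{};
}
```

With these stand-ins, `select_factory<ReferenceAlgo>()` has type `ReferenceFactory` and `select_factory<XdlAlgo>()` has type `OptimizedFactory`; the discarded branch is never instantiated.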

* Move Tile algorithm check outside direction block to support all directions

* Implement MakeInvokerPointer interface and add random input validation

- Implement full Argument/Invoker structs for old CK interface (not just nullptr)
- Refactor with reference_common.hpp to reduce code duplication
- Add random input validation tests: Builder vs direct GPU reference (all directions)
- Fix layout: GNHWC -> NHWGC to match reference kernel expectations
- All 12 tests pass with IDENTICAL results on random input

* Move ConvAlgorithm_Reference to test/impl/conv_algorithm_types.hpp

Keep types.hpp for data types only (enums), move algorithm descriptors
to conv_algorithm_types.hpp as suggested by review.

* Add static_assert to ensure reference factories only accept PassThrough operations

Reference implementation doesn't support fused elementwise operations.
Add compile-time validation to fail early with clear error message if
non-PassThrough operations are specified on input, weight, or output.
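
A minimal sketch of such a compile-time check, assuming the elementwise operations are passed as template type parameters. Only the name `PassThrough` comes from the commit; `Relu`, `AllPassThrough`, and `ReferenceConvFactory` are hypothetical names used for illustration.

```cpp
#include <type_traits>

// Hypothetical elementwise operation tags.
struct PassThrough
{
};
struct Relu
{
};

// True only when all three elementwise operations are PassThrough.
template <typename InOp, typename WeiOp, typename OutOp>
struct AllPassThrough
    : std::integral_constant<bool,
                             std::is_same<InOp, PassThrough>::value &&
                                 std::is_same<WeiOp, PassThrough>::value &&
                                 std::is_same<OutOp, PassThrough>::value>
{
};

// Hypothetical reference factory: instantiating it with a fused operation
// stops compilation early with a readable message.
template <typename InOp, typename WeiOp, typename OutOp>
struct ReferenceConvFactory
{
    static_assert(AllPassThrough<InOp, WeiOp, OutOp>::value,
                  "Reference kernels support only PassThrough elementwise operations");
};

// Accepted: every operation is PassThrough.
using SupportedFactory = ReferenceConvFactory<PassThrough, PassThrough, PassThrough>;

// Rejected at compile time if instantiated:
// ReferenceConvFactory<PassThrough, PassThrough, Relu> unsupported{};
```

Placing the `static_assert` inside the class body means the error fires at the point of instantiation, before any kernel code is generated.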

* Add InstanceTraits support for reference kernels

- Store SIGNATURE/ALGORITHM/VERSION in Instance for reflection
- Create shared ReferenceCommonTraits base for common properties
- Add 3 direction-specific InstanceTraits specializations in one file
- Include data type and layouts in instance_string output

* Remove optimized kernel validation tests from reference-only branch

* Use existing layout helper and organize reference tests

Use LayoutToCK from conv_tensor_layout.hpp and move reference InstanceTraits
test to validation folder.

* Merge develop branch

Fix DataType switch for new mixed precision types.

* Fix comment spacing for CI

* Convert IsReferenceAlgorithm from function to concept

* Add reference tests to CI smoke tests

* Consolidate 3 reference factories into single unified factory

---------

Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>

[ROCm/composable_kernel commit: a0acc83a72]
2025-12-29 16:11:08 +02:00
assistant-librarian[bot]
0b6dde06c3 Merge commit '88ae4455806efe2019bb0403606f7c4a1e3d9c3a' into develop 2025-12-29 12:22:38 +00:00
Kiefer van Teutem
04d4dd1ada Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470)
* Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests.

* Tentative fix for bwd weight bilinear bf16bf16bf16; it seems the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong.

[ROCm/composable_kernel commit: 88ae445580]
2025-12-29 12:58:29 +01:00
assistant-librarian[bot]
c3c8c20144 Merge commit 'b0ea67e37725c26860a3520dc31c1f7a01164db9' into develop 2025-12-29 01:43:07 +00:00
Yi DING
9045cafc8c [CK_TILE] MX FLATMM Fix M Padding (#3489)
* Fix M Padding

* Fix tensor desc ele space size

[ROCm/composable_kernel commit: b0ea67e377]
2025-12-29 09:09:12 +08:00
assistant-librarian[bot]
dd314aaa48 Merge commit 'a3916a8d16d6e8d676b890ea3f242a180aeef61b' into develop 2025-12-27 09:13:12 +00:00
joyeamd
38a547df56 enable f8 tests (#3488)
[ROCm/composable_kernel commit: a3916a8d16]
2025-12-27 00:21:56 -08:00
assistant-librarian[bot]
c9bf3fde79 Merge commit '7ce532eac7faab5041d472b7dabebf57e09fbaf6' into develop 2025-12-25 08:16:26 +00:00
Yi DING
d80a3f9c70 [CK_TILE] Align FMHA BWD Reference with Kernel Implementation (#3486)
[ROCm/composable_kernel commit: 7ce532eac7]
2025-12-25 16:12:36 +08:00
assistant-librarian[bot]
4f1df06484 Merge commit 'e08efa551ff260f0e55c839cfc0e2b64c929eb57' into develop 2025-12-25 07:15:36 +00:00
Erwin Terpstra
bd73699148 [CK_TILE] Grouped gemm quant tensor layouts (#3414)
* feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time

* chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times

* chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems

* fix: missing check for transposed load in bquant pipeline

* chore: lower unit test tensors dimensions a bit for faster tests

* chore: set grouped gemm example M/N warp tile to 16

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: e08efa551f]
2025-12-24 23:01:23 -08:00
assistant-librarian[bot]
199991cf05 Merge commit '14668a56e376550cd68d116aa64302a1df05b56f' into develop 2025-12-25 01:42:14 +00:00
Illia Silin
21d679acab remove the LLVM_MAIN_REVISION usage (#3487)
[ROCm/composable_kernel commit: 14668a56e3]
2025-12-24 16:49:35 -08:00
assistant-librarian[bot]
446db13a0f Merge commit '62a8ec155facd901232977b688d5225d72969709' into develop 2025-12-24 19:11:47 +00:00
Thrupti Raj Lakshmana Gowda
b17fa5656f [CK TILE ENGINE] CI configuration with basic cases (#3475)
* [CK TILE ENGINE] Adding GEMM BASIC TEST in Jenkins

* fix RUN_TILE_ENGINE_BASIC_TESTS name typo

* [CK Tile Engine] Updating basic CI

* Resolving merging issues

* Resolving merging issues

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 62a8ec155f]
2025-12-24 10:45:56 -08:00
assistant-librarian[bot]
a169d59e06 Merge commit '7f68f3c4fa5bf478313c2147610317b199f9e65b' into develop 2025-12-24 17:14:26 +00:00
kensclin
b29e16aa67 Enable padding blockscale for abquant (#3453)
* Enable padding blockscale for abquant

* run clang-format

* Reduce unnecessary testing

* remove cout

[ROCm/composable_kernel commit: 7f68f3c4fa]
2025-12-24 09:12:40 -08:00
assistant-librarian[bot]
e1039a7eeb Merge commit '1c3151963bd5abd30a5ced62f6859994a45f710e' into develop 2025-12-24 02:47:07 +00:00
Po Yen Chen
51b7e7d2d6 [CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425)
* Add fp8bf16 support for batch_prefill

* Fix wrong scale_s re-compute logic in batch_prefill

* Fix wrong scale_s re-compute logic in fmha fwd

* Fix batch_prefill codegen error

* Remove no-longer used GetName() function

* Add fp8 logits=True instances

* Update CHANGELOG.md

[ROCm/composable_kernel commit: 1c3151963b]
2025-12-24 10:34:06 +08:00
assistant-librarian[bot]
27c1ae2774 Merge commit 'c0797c167143aa750936c108caa0945640eeefd1' into develop 2025-12-23 23:13:10 +00:00
jakpiase
0d94859dca [CK_TILE] Minor splitk bugfix for gemms and conv (#3387)
* fix for splitk if splitk < grid

* add different splitk implementation

* minor bugfix for streamk gemm

* Add test

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: c0797c1671]
2025-12-24 00:10:13 +01:00
assistant-librarian[bot]
166fe9db60 Merge commit 'e1381d6a712ce5703cd9bc9e3ec351fa91b1d87d' into develop 2025-12-23 11:12:47 +00:00
Johannes Graner
8f9d91fe6c [CK grouped gemm] Fix grouped gemm two stage HasMainK0BlockLoop (#3466)
* Re-enable two stage kernel

* Only disable on HasMainKBlockLoop mismatch

* Address PR comments

[ROCm/composable_kernel commit: e1381d6a71]
2025-12-23 11:33:09 +01:00
assistant-librarian[bot]
3e31171d74 Merge commit '4ce7d4c511c7e98a9ac01580ed1e9112e59061a0' into develop 2025-12-23 10:13:44 +00:00
kabrahamAMD
34edd1d99d [ck_builder] add utility functions to convolution (#3459)
* reinstate conv_signature_utils.hpp

* added tests for elementwise operation getters

* add tests for getDataType functions

* added test for no data type specified

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>

[ROCm/composable_kernel commit: 4ce7d4c511]
2025-12-23 10:39:49 +01:00
assistant-librarian[bot]
b8269a8c17 Merge commit 'ead81d1b0bba57b86ac28f3e2994dc97279f8eb3' into develop 2025-12-23 09:20:57 +00:00
jakpiase
b4626d7093 [CK_TILE] Add splitk support to ck tile conv bwd data (#3353)
* add splitk support to ck tile conv bwd data

* add reviewers suggestions

* minor fix

* removed splitkbatchoffset struct

[ROCm/composable_kernel commit: ead81d1b0b]
2025-12-23 10:03:42 +01:00
assistant-librarian[bot]
e64347d747 Merge commit '8b73633e651822d90b66ffd7d174a21891a99611' into develop 2025-12-23 07:15:46 +00:00
Lyu, Xudong
677e3cd174 fix: handle void return type in TailHandler error path with ROCm6 compiler (clang++) (#3477)
Replace `decltype(TailHandler<>(...)){}` with direct function call
to fix compilation error when return type is void.
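
The problem can be sketched as follows, under the assumption that the error path used to return a value-initialized dummy of the handler's return type. `VoidHandler`, `IntHandler`, and `dispatch_tail` are hypothetical names, not the actual `TailHandler` code.

```cpp
#include <type_traits>

// Hypothetical handlers standing in for the callable behind TailHandler.
struct VoidHandler
{
    constexpr void operator()(bool) const {}
};
struct IntHandler
{
    constexpr int operator()(bool has_tail) const { return has_tail ? 1 : 0; }
};

// Sketch of a dispatcher with a fallback ("error") path.
template <typename Handler>
constexpr decltype(auto) dispatch_tail(bool has_tail, Handler handler)
{
    if(has_tail)
        return handler(true);

    // A fallback written as
    //     return decltype(handler(false)){};
    // turns into `void{}` for a void-returning handler, which some compilers
    // reject. Calling the handler directly is valid for any return type,
    // including void.
    return handler(false);
}
```

Returning the result of a void call from a function whose deduced return type is void is well-formed, so the same body serves both handlers.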

Co-authored-by: Yi DING <yi.ding@amd.com>

[ROCm/composable_kernel commit: 8b73633e65]
2025-12-23 15:03:18 +08:00
assistant-librarian[bot]
d0bc7ccc31 Merge commit '6864a618f47e5ba8d28ada30e2a59da7d051085d' into develop 2025-12-23 06:16:51 +00:00
Yi DING
b0959a72b9 [CK_TILE] FMHA Ignore BWD Failed Cases in Smoke Test (#3480)
[ROCm/composable_kernel commit: 6864a618f4]
2025-12-23 13:28:15 +08:00
assistant-librarian[bot]
9569b291ed Merge commit '2955d77f3cfb3515c6d36d54879ed65b854dafa6' into develop 2025-12-22 21:12:09 +00:00
Bartłomiej Kocot
83d15b7bb4 Fix grouped conv fwd wmma porting (#3479)
* Fix grouped conv fwd wmma porting

* add more limitations

[ROCm/composable_kernel commit: 2955d77f3c]
2025-12-22 21:32:48 +01:00
assistant-librarian[bot]
9d93dd9352 Merge commit 'a8aebb7a8efbd9860487a4bc563706cf7a71f988' into develop 2025-12-22 16:14:04 +00:00
Wojciech Laskowski
d8164b2632 Post-merge cleanup for WMMA grouped conv fwd (#3468)
* remove duplicate aliases

* Split scaleadd_ab instances for WMMA grouped conv fwd

* removed big shape from the test

[ROCm/composable_kernel commit: a8aebb7a8e]
2025-12-22 15:57:45 +01:00
assistant-librarian[bot]
7a55f53fcf Merge commit '44f1b5c5de8c85cbae1520fa054405d96df67304' into develop 2025-12-22 01:42:28 +00:00
Bartłomiej Kocot
2228960cc4 Fix jenkinsfile for large tensor conv test (#3478)
[ROCm/composable_kernel commit: 44f1b5c5de]
2025-12-21 17:39:30 -08:00
assistant-librarian[bot]
5be6381bcb Merge commit '9bd67c2cf2fe8e4479a433bcd6d467e2ea9aedb4' into develop 2025-12-20 01:40:48 +00:00
Jan Patrick Lehr
500d143fa8 [CK-TILE] Guard against compiler lexer diagnostic (#3444)
* [CK-TILE] Guard against compiler lexer diagnostic

A recent change to Clang added a lexer-level diagnostic about a C2y
language feature. Because the diagnostic is emitted by the lexer, the
`__extension__` compiler built-in does not suppress it: `__extension__`
is only respected *after* lexing, during parsing.

This change adds guarding pragmas that disable the diagnostic in the
lexer, so that the warning is no longer treated as an error.
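
A guard of this kind might look like the sketch below. The commit message names neither the feature nor the warning flag, so two things here are assumptions: that the feature is `__COUNTER__`, and that the diagnostic belongs to Clang's `-Wc2y-extensions` group. `first_id` and `second_id` are illustrative names.

```cpp
// Assumption: the diagnostic is in the -Wc2y-extensions group and is
// triggered by __COUNTER__; the commit message states neither.
#if defined(__clang__)
#pragma clang diagnostic push
// Older Clang versions do not know the group; silence the resulting
// "unknown warning group" warning first.
#pragma clang diagnostic ignored "-Wunknown-warning-option"
#pragma clang diagnostic ignored "-Wc2y-extensions"
#endif

// Each use of __COUNTER__ expands to the next integer in the translation unit.
constexpr int first_id  = __COUNTER__;
constexpr int second_id = __COUNTER__;

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```

The push/pop pair restores the previous diagnostic state afterwards, so the suppression applies only to the guarded lines.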

* Fixing still existing build issue

Once the first warning was removed, another one popped up. Both are
related to the same C2y feature, so both are now ignored.

* clang-format handling

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 9bd67c2cf2]
2025-12-19 17:32:20 -08:00
assistant-librarian[bot]
09019c1024 Merge commit 'cbc83359649b1b56cd745c4102e9556112f942c2' into develop 2025-12-19 23:13:41 +00:00
Bartłomiej Kocot
38ff45abf7 Improve XDL to WMMA porting for grouped conv fwd (#3456)
Refactors the way the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, especially to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles. 
The changes use MXdlPerWave instead of NXdlPerWave to increase number of waves per M dim.

[ROCm/composable_kernel commit: cbc8335964]
2025-12-19 15:58:51 -07:00
Illia Silin
34d26c63a0 get LLVM_MAIN_REVISION macro from compiler header (#3469)
[ROCm/composable_kernel commit: 2d9c962e2c]
2025-12-19 14:57:12 -08:00
Geo Min
f9b62a0e99 Revert "details from org var (#3431)" (#3473)
This reverts commit e43a252d19.

[ROCm/composable_kernel commit: f67a20b0be]
2025-12-19 14:10:58 -08:00
assistant-librarian[bot]
ac5610980f Merge commit 'e22622f0ec185bf9e717523c8734acfb13dad0a5' into develop 2025-12-19 16:14:44 +00:00
Thrupti Raj Lakshmana Gowda
2dacac9561 [TILE ENGINE] Restructure to Base class of GEMM (#3434)
[ROCm/composable_kernel commit: e22622f0ec]
2025-12-19 23:53:56 +08:00
assistant-librarian[bot]
44900da55a Merge commit '0fd2b2f0459b10570788b74bf1a794095a18fc96' into develop 2025-12-19 15:13:27 +00:00