Commit Graph

1988 Commits

Author SHA1 Message Date
apoorva
bdfdb0c11e Fixes applied according to review comments 2025-07-08 12:17:20 +00:00
apoorva
669befb25a Updated copyrights and added wrappers. 2025-07-08 12:06:38 +00:00
apoorva
9b64da2298 Added wrapper and renamed the wmma_v3 instances 2025-07-08 11:26:01 +00:00
apoorva
86ca6b827d Removed the old wmma instances. 2025-07-08 11:23:33 +00:00
apoorva
5c491e7a4b Fixing typo to resolve build errors. 2025-07-02 18:04:10 +00:00
apoorva
ba9c637c0b Added examples for gemm_add_relu 2025-07-02 14:12:52 +00:00
apoorva
6ec0ad2758 Added test for gemm_add_relu wmma instance 2025-07-01 13:44:18 +00:00
apoorva
f5843dd22b Added v3 instances for gemm_add_relu 2025-07-01 12:37:46 +00:00
apoorva
bb7f6650f7 Fixed typo in profiler 2025-07-01 12:02:28 +00:00
apoorva
6a116fa958 Modified the template parameters to make the instances work. 2025-07-01 11:23:44 +00:00
apoorva
cdaff7f210 Added instances to Cmake 2025-07-01 11:23:43 +00:00
apoorva
6f89183179 Added f16 wmma instance and corrected bf16 instance errors. 2025-07-01 11:23:17 +00:00
apoorva
35aab35d96 Added bf16 wmma instance for add_relu 2025-07-01 11:23:17 +00:00
Zoltán Lakatos
4f1910117b Merge branch '63-implement-device_gemm_bilinear-for-rdna4' into 'feature/multiple-d-gemms'
Resolve "Implement `device_gemm_bilinear` for RDNA4"

See merge request amd/ai/composable_kernel!26
2025-06-26 06:48:38 +00:00
Zoltán Lakatos
686df332e2 Resolve "Implement device_gemm_bilinear for RDNA4" 2025-06-26 06:48:38 +00:00
Zoltán Lakatos
055bc0227c Merge branch '10-implement-device_gemm_add_fastgelu-for-rdna4' into 'feature/multiple-d-gemms'
Resolve "Implement device_gemm_add_fastgelu for RDNA4"

See merge request amd/ai/composable_kernel!23
2025-06-18 14:12:31 +00:00
Zoltan Lakatos
7424b4a0f8 fixed ONLY_XDL_AND_WMMA_KERNELS tag 2025-06-17 19:50:07 +00:00
Zoltan Lakatos
61b6e9a606 splitk xdl test fixes 2025-06-13 09:37:37 +00:00
Zoltan Lakatos
b4d3e4112c added tests for all splitk fastgelus 2025-06-12 11:43:26 +00:00
Zoltan Lakatos
264e1b238a created fastegelu instances 2025-06-11 14:14:43 +00:00
Zoltan Lakatos
4c8ea9517d created gemm_add_add_fastgelu instances 2025-06-11 12:46:42 +00:00
Zoltan Lakatos
aeca8efdea revert unintended change in xdl add_fastgelu 2025-06-11 07:50:53 +00:00
Zoltan Lakatos
ea9805b580 added add_fastgelu instances 2025-06-10 17:23:44 +00:00
Anton Gorenko
881bc3fbb0 Merge branch '52-implement-multipled-in-gemm-universal' into 'feature/multiple-d-gemms'
DeviceGemmMultipleD_Wmma_CShuffleV3

See merge request amd/ai/composable_kernel!21
2025-06-04 16:18:14 +05:00
Anton Gorenko
9912e5f09c Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
2025-06-04 12:34:26 +05:00
Anton Gorenko
381c02d06a Add gemm_add_fastgelu instances and test 2025-06-04 12:34:26 +05:00
Anton Gorenko
bcf93e292c Prepare gemma_add tests for adding wmma 2025-06-04 12:29:34 +05:00
Anton Gorenko
e36a176e38 Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 2025-06-04 12:29:34 +05:00
Anton Gorenko
137efa743d Implement DeviceGemmMultipleD_Wmma_CShuffleV3 2025-06-04 12:29:34 +05:00
Anton Gorenko
89ac60de6d Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma 2025-06-04 12:29:34 +05:00
Anton Gorenko
7dff5fe4ff Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support 2025-06-04 12:29:34 +05:00
Anton Gorenko
deebe1ea13 Use ThreadGroupTensorSliceTransfer_v7r3 2025-06-04 12:29:34 +05:00
Anton Gorenko
ed047d08b4 Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
2025-06-04 12:29:33 +05:00
Anton Gorenko
52b4860a30 WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230)
* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Updates to support mixed precision

* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip

* Added support for F8xF16xF16 to gemm_wmma_universal

* Added support for F16xF8xF16 to gemm_wmma_universal

* Added support for BF16xI4xBF16 to gemm_wmma_universal

* Added support for F16xI4xF16 to gemm_wmma_universal

* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType

* Added missing test class for FP16_KM_NK

* Pre-commit hooks fixes

* Added padding instances for f16xf16xf16

* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Ammending changes for adding support for padding instances for f16xf16xf16

* Fixes for padding instances for f16xf16xf16

* Added padding instances for bf16xbf16, f8xf8

* Added packed instances for bf16xi4xbf16

* Added padding instances for f8xf16xf16

* Added padding instances for f16xf8xf16, f16xi4xf16

* Fixed typos for bf16xbf16xbf16 padding instances

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Fix int8 DTYPES check for gemm_bilinear

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
2025-06-04 12:22:33 +06:00
Khushbu Agarwal
59a85cb4bc [CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262)
* debugging issue

* debugging issue

* debugging

* debugging

* reverting debugging code

* clang formatted

* updating default_config.json

* fix ci failure

* clang formatted
2025-06-03 20:16:10 -07:00
Khushbu Agarwal
1037b21cfe Update changelog for Rotating buffer (#2283)
* Updating changelog for Rotating buffer

* address review comment.
2025-06-03 20:14:22 -07:00
Bartłomiej Kocot
6e5acee0f9 Add Clamp/Relu bf16/fp16 cast fixes (#2279)
* Add Clamp/Relu bf16/fp16 fixes

* fix
2025-06-03 18:31:46 +02:00
Xiaodong Wang
7f9eef40b0 Move pragma ahead (#2231) 2025-06-03 07:27:51 -07:00
Aviral Goel
11f6c14e03 Add 0 as an acceptable arguement for strides in CK GEMM example (Issue 2037) (#2268)
* add 0 as valid default arguement for strides

* add 0 as valid default arguement for strides

# Conflicts:
#	example/01_gemm/common.hpp
2025-06-03 07:26:58 -07:00
Illia Silin
b76fdbe47f Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274)
* upgrade to rocm6.4.1 and use gfx1x-generic targets

* add rocm version parsing

* fix the gfx10-3-generic syntax in cmake
2025-06-03 07:17:35 -07:00
Khushbu Agarwal
2e38eb4f1c Rotating buffer PR CI fix (#2257)
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"

This reverts commit bbdaf79a52.

* fix regression
2025-06-02 10:25:01 -07:00
dependabot[bot]
cffe8fa2a4 Bump rocm-docs-core[api_reference] from 1.19.1 to 1.20.0 in /docs/sphinx (#2272)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.19.1 to 1.20.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.19.1...v1.20.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-02 06:44:10 -07:00
valarLip
0fdbf6bcd1 extend buffer load for fp16/bf16x16 (#2270)
* extend buffer load for fp16/bf16x16

* format
2025-06-02 10:29:54 +08:00
Kiefer van Teutem
2215a9edf0 Explicitly set the LINKER_LANGUAGE for the gemm_template_instances target to avoid Ninja build config failure. (#2265)
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2025-05-30 13:32:28 -07:00
Illia Silin
654956bb02 Add a daily CI build on GFX950. (#2261)
* add CI build for gfx950

* make sure gfx950 CI always uses special docker and compiler

* enable codegen tests by default
2025-05-30 12:50:08 -07:00
Mirza Halilčević
fbce6c7bb6 Define CHAR_BIT during hipRTC (#2264)
* Fix failing codegen tests.

* fix clang format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-05-30 08:23:44 -07:00
dependabot[bot]
61e6c382c6 Bump rocm-docs-core[api_reference] from 1.19.0 to 1.19.1 in /docs/sphinx (#2263)
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.19.0 to 1.19.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.19.0...v1.19.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
  dependency-version: 1.19.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-30 05:56:59 -07:00
slippedJim
57f497452a remove restriction of group mode hd192 no lse (#2252)
Co-authored-by: Jim <jimguo12@amd.com>
2025-05-30 10:14:21 +08:00
Illia Silin
4e561af18c Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)" (#2260)
This reverts commit fd6a859b44.
2025-05-29 16:22:16 -07:00
Paul Fultz II
306f4c537e Export codegen targets (#2259) 2025-05-29 11:03:51 -07:00