Commit Graph

616 Commits

Author SHA1 Message Date
John Afaganis
42ea095d98 Remove usage of 'warpSize' variable as it has been deprecated (#2295)
* SWDEV-535598 - remove usage of the 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr, but this is just a workaround

* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo

---------

Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>

[ROCm/composable_kernel commit: 6635d1bb88]
2025-06-10 07:34:54 -07:00
Bartłomiej Kocot
dbec4063f1 Grouped conv bwd wei explicit GEMM for odd C/K (#2306)
[ROCm/composable_kernel commit: 7a83f1d510]
2025-06-10 11:17:12 +02:00
Bartłomiej Kocot
d9dd0aa254 Grouped Convolution Backward Weight Explicit GEMM (#2282)
* Grouped conv bwd weight explicit gemm

* 3d

* cmake fixes

* fix test

* fix

[ROCm/composable_kernel commit: 050cad09b5]
2025-06-06 10:30:08 +02:00
Andriy Roshchenko
ab0540c5db Optimized GEMMs for MX FP4/8 (#2294)
Adds V3 GEMM pipeline for MX FP4 and MX FP8 
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275)
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>

[ROCm/composable_kernel commit: 00247e3c29]
2025-06-05 13:54:15 -06:00
Anton Gorenko
780cb29a42 WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230)
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed if these flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Updates to support mixed precision

* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip

* Added support for F8xF16xF16 to gemm_wmma_universal

* Added support for F16xF8xF16 to gemm_wmma_universal

* Added support for BF16xI4xBF16 to gemm_wmma_universal

* Added support for F16xI4xF16 to gemm_wmma_universal

* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType

* Added missing test class for FP16_KM_NK

* Pre-commit hooks fixes

* Added padding instances for f16xf16xf16

* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed if these flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Amending changes for adding support for padding instances for f16xf16xf16

* Fixes for padding instances for f16xf16xf16

* Added padding instances for bf16xbf16, f8xf8

* Added packed instances for bf16xi4xbf16

* Added padding instances for f8xf16xf16

* Added padding instances for f16xf8xf16, f16xi4xf16

* Fixed typos for bf16xbf16xbf16 padding instances

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding is not supported when BDataType is pk_i4_t. Added a fix for the correct check and removed the padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build errors

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler doesn't allocate registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

The permutation required by the conversion from pk_i4_t to half_t does not
depend on PermuteB; they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Fix int8 DTYPES check for gemm_bilinear

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>

[ROCm/composable_kernel commit: 52b4860a30]
2025-06-04 12:22:33 +06:00
Bartłomiej Kocot
dbdf79d541 Add Clamp/Relu bf16/fp16 cast fixes (#2279)
* Add Clamp/Relu bf16/fp16 fixes

* fix

[ROCm/composable_kernel commit: 6e5acee0f9]
2025-06-03 18:31:46 +02:00
Xiaodong Wang
095f9dbb63 Move pragma ahead (#2231)
[ROCm/composable_kernel commit: 7f9eef40b0]
2025-06-03 07:27:51 -07:00
Mirza Halilčević
a16458864a Define CHAR_BIT during hipRTC (#2264)
* Fix failing codegen tests.

* fix clang format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: fbce6c7bb6]
2025-05-30 08:23:44 -07:00
Bartłomiej Kocot
f1ec9e0be5 Change relu to clamp for grouped conv fwd instances (#2249)
[ROCm/composable_kernel commit: e7906dd644]
2025-05-29 00:51:25 +02:00
Bartłomiej Kocot
d14948041b Fix grid size calc for bwd wei (#2226)
[ROCm/composable_kernel commit: 037764bbc6]
2025-05-26 16:51:09 +02:00
Illia Silin
9305126446 fix the buffer intrinsic names for clang >=20 (#2228)
[ROCm/composable_kernel commit: 8146e471f1]
2025-05-23 14:58:25 -07:00
Illia Silin
b3be024c8c Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 9553c67ab25cb25bf4b6e4d359937413e1f7fd6a.

[ROCm/composable_kernel commit: 1b846143c6]
2025-05-22 15:41:17 -07:00
Aviral Goel
45972932a7 Add Doxygen Documentation for HostTensor, HostTensorDescriptor, DeviceMem, FillUniformDistribution (#2160)
* added documentation for HostTensorDescriptor

* added documentation for DeviceMem and FillUniformDistribution

* fixed merging error

* fixed host_tensor_descriptor error

* clang format

[ROCm/composable_kernel commit: fa39c4e798]
2025-05-21 10:34:30 -07:00
Thomas Ning
f969f4d798 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format

[ROCm/composable_kernel commit: 1386924749]
2025-05-20 15:05:08 -07:00
jefyang1
1d9c3ecf4f Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

[ROCm/composable_kernel commit: f18170064d]
2025-05-19 17:29:51 -07:00
Andriy Roshchenko
9128d5e5cc MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199)
* Unify test interface for different layouts.

* WIP: Introducing FP4/FP6/FP8 abstractions

* WIP: Introducing packed storage abstraction

* WIP: Introducing packed storage abstraction

* WIP: Improved support for FP6 data type

* Refactor packed storage for f6_t

* WIP: FP6 MFMA test

* Test if we correctly represent all FP6/FP4 numbers

* Additional output for failed FP4 test.

* More failing conversion tests

* Even more failing conversion tests

* Working FP6 MFMA tests

* Expand MX MFMA testing to BF8/6

* Update and verify MX MFMA test for packed types

* Fix fp4 and fp6 conversions on host

* Working MX MFMA tests for FP8/6/4

* Cleanup

* Add missing type

* Cleanup

* Final cleanup

* Restrict FP6/4 values output to CK_LOGGING=1

* Use CHAR_BIT instead of number 8

* Fix typo

* Remove FP6 and FP4 from the list of native types

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>

[ROCm/composable_kernel commit: 57e0f5df29]
2025-05-19 16:52:51 -05:00
arai713
4418b3acce Narrowing error fix for codegen compilation (#2194)
* removed comment with special characters

* fix for arg/template change after merge from develop

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 5b3430b868]
2025-05-16 11:11:54 -07:00
Mateusz Ozga
adc48512fc Disable conv for Filter1x1Stride1Pad0 when K or C is even (#2186)
[ROCm/composable_kernel commit: fa3c6811d8]
2025-05-16 10:18:47 +02:00
Illia Silin
11b71b421e Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format

[ROCm/composable_kernel commit: 58f9e9ffbc]
2025-05-13 10:18:14 -07:00
Bartłomiej Kocot
7d74de4d7a Switch to v2 pipeline for grouped conv bwd data (#2181)
* Change to old pipeline for grouped conv bwd data

* fix

* fix

* fix

* fix

* fix

* fix

* Fix

[ROCm/composable_kernel commit: c53b7bd22e]
2025-05-13 10:14:30 +02:00
Thomas Ning
ab918009ed Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the code compile

* Solved the example code, but still have the profiler error

* Finished the feature

* Clang format and update the CHANGELOG

* solve the preshuffle v1 & v2 problem

* Comment Addressed

* Comment Addressed

[ROCm/composable_kernel commit: b49f7de81f]
2025-05-12 09:52:58 -07:00
Bartłomiej Kocot
fa024cca43 Add grouped conv fwd bias relu instances (#2179)
* Add grouped conv fwd bias relu instances

* fixes

* fix

[ROCm/composable_kernel commit: 6fddb5708c]
2025-05-09 22:52:34 +02:00
jefyang1
1089f78857 Fix grouped conv bwd data tests on gfx950 (#2173)
[ROCm/composable_kernel commit: 6b1a339b6f]
2025-05-09 09:01:06 -07:00
Andriy Roshchenko
abd554cfe6 Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171)
* Re-enable MX GEMM instances

* Fix compilation error when building MX GEMM for multiple architectures

[ROCm/composable_kernel commit: cb27e7c77f]
2025-05-08 13:26:03 -06:00
Rostyslav Geyyer
789ed57662 Add FP4 MX MFMA tests (#2151)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add ops

* Add tests

* Add missing utils

* Update reference mx gemm

* Add f4x2 init mode

* Update host tensor utils

* Update chunk size for f4x2

* Add non scaled ops

* Add a type utility

* Update non scaled reference kernel

* Add non scaled tests

* Debug mfma arguments

* Add more debug info

* Update chunk size

* Update data layout

* Add more debugging

* Fix B stride

* Fix reference gemm

* Fix build

* One more reference fix

* Add more debug info

* Disable some tests

* Enable tests

* Add fp4 dimensions

* Update reference kernels

* Temp edits

* Remove leftovers

* Fix conflicts

* Clean up

* More clean up

* Revert "More clean up"

This reverts commit d8d35a0846.

* Add layouts to tests

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

[ROCm/composable_kernel commit: 8a0d659f92]
2025-05-06 09:24:00 -05:00
Andriy Roshchenko
e70035cf5a MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152)
* Add gemm_mx_fp8_bf8 example with row-major B

* Add more overloads of MX MFMA instructions

* Add MK_KN (RRR) tests

* Add KM_NK (CCR) tests

* Add more problem sizes to Large tests

* Add test_gemm_mx to the list of regression tests

[ROCm/composable_kernel commit: 79b0bfeb41]
2025-05-01 11:55:48 -06:00
Bartłomiej Kocot
7942bb905b Integrate universal gemm with conv bwd data and add SplitK (#1315)
* Integrate universal gemm with conv bwd data

* Fix multi d kernel

* Add splitK support

* instances refactor

* instances refactor

* refactor

* fixes

* fixes

* 16x16 instances

* Fixes

* Fix

* Fix

* Fix

* Fix

* Fix

* Fixes

* fix

* fix

[ROCm/composable_kernel commit: 4094ad158a]
2025-04-28 23:54:49 +02:00
Anton Gorenko
420de0f22d DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096)
* Prepare files for DeviceGemm_Wmma_CShuffleV3

* Implement main part of CShuffleV3 with block pipeline v3 for WMMA

* Remove unused functions and template params for A/B descriptors

* Support both gfx11 and gfx12

* Enable SplitK for gfx12 and disable for gfx11

* Added RowColRow layout for DeviceGemmV2 fp16

* Added more instances for Row, Col, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout

* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout

* Fix formatting

* Add documentation

Based on cc666c6a19dabc2cce8141e7ae23bd460ceef331

* Enable gemm_universal profiling for gfx11/12

* Add WMMA intrinsics for F8/BF8

* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances

* Add BF16 instances and tests

* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

[ROCm/composable_kernel commit: edd92fc546]
2025-04-28 10:14:21 +05:00
Illia Silin
08f859b460 make code compliant with std=c++20 (#2123)
[ROCm/composable_kernel commit: 01cb8379cd]
2025-04-24 10:14:52 -07:00
lalala-sh
8f426b1216 Moe gemm activation (#2026)
* fix useless code and remove usless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* impl int64 but result not correct

* int64 index ok now

* input output all ok

* fix uint32

* revert v1 test

* use uint32

* work to support 130k tokens

* moe sorting fix moebuf

* fix merge

* update moe api fix aiter build

* fix build

* fuse silu

* silu ok

* scale ok

* add silu

* change code

* gemm2 ok

* gufusion compatible ok, fix warnings

* gu fusion for m32 m64 ok

* support bf16 cshuffle

* i4 gemm2 ok

* i4 gemm2 ok and i4 gemm1 build

* 16x16 run ok

* change flops; change cshuffle dtype

* fuse gelu silu act in moe gemm1

* fp8 with act ready

* int4 act ready

* remove useless changes

* remove useless code change

* fix clang format

* add the arch limit of int4 moe gemm

* fuse moe activation

* fix fp8 16x16

* fix no quant case

* fix bugs

* fix fp8 gufusion bug

* remove useless comments

* refine activation code & complete moe example

* fix int8 bugs

* merge tkw1

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: root <root@hjbog-srdc-51.amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 39ba03f25d]
2025-04-23 10:35:34 +08:00
Rostyslav Geyyer
dbcaae42bf Temporarily disable MX FP4 device tests (#2112)
[ROCm/composable_kernel commit: 416e851584]
2025-04-22 16:08:48 -05:00
Thomas Ning
005e61ce63 GEMM Multiply Multiply Fix (#2102)
* fix the type convert and increase the BF16 conversion + the profile comment

* fix the CI

[ROCm/composable_kernel commit: 0cca8fa28f]
2025-04-22 01:13:22 -07:00
Muhammed Emin Ozturk
eb8306b5f0 MI308 fix for streamk 1-Tile floating point exception (#2101)
[ROCm/composable_kernel commit: b092c18da7]
2025-04-21 11:44:07 -07:00
lalala-sh
dc58110b06 enable do top k weights in moe stage1 gemm (#2094)
* add switch for mul topk weights

* fix bf16/f16 bugs

* complete

[ROCm/composable_kernel commit: bcf5bb41be]
2025-04-18 10:45:49 +08:00
Andriy Roshchenko
7972a39081 MX GEMM - Parameterized Test Template (#2088)
* Tests for MX FP8 GEMM

* Improve documentation

[ROCm/composable_kernel commit: 213b203a3c]
2025-04-16 19:56:00 -06:00
Andriy Roshchenko
a96e96ab4e MX GEMM - Add MX BF8 example (#2071)
* Add MX GEMM example for MX BF8

* Verified MX FP8 with 16x16x128 scale builtin

* Verify MX BF8 GEMM with BF16 output

[ROCm/composable_kernel commit: da54464cce]
2025-04-16 15:25:02 -06:00
Andriy Roshchenko
5e2bd20672 MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods

[ROCm/composable_kernel commit: 7106976a72]
2025-04-15 17:17:07 -06:00
Mingtao Gu
3cbda9a11c CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* fix pk_i4_v3 test failures in Ubuntu env.

* fix pk_i4_t test failures on Ubuntu.

* some fixes.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 56378f810f]
2025-04-14 16:58:57 +08:00
Illia Silin
c91d046350 Fix build issues for multiple targets. (#2077)
* build for multiple targets on gfx942

* add missing ignore statements

[ROCm/composable_kernel commit: 0d4f145078]
2025-04-11 12:12:53 -07:00
Illia Silin
9cc561987f enable gfx115x support (#2065)
[ROCm/composable_kernel commit: 3e6d21adeb]
2025-04-09 10:06:42 -07:00
valarLip
aa6eeb7f66 add passthrough for int32->float32 (#2062)
[ROCm/composable_kernel commit: 2c563fecf7]
2025-04-08 15:16:30 -07:00
Max Podkorytov
1eec5cc073 simplify generate_tuple (#2043)
[ROCm/composable_kernel commit: 6ce0797dad]
2025-04-08 09:00:51 -07:00
Illia Silin
ad65c947c9 fix codegen issues (#2052)
[ROCm/composable_kernel commit: 1793228422]
2025-04-07 07:08:39 -07:00
Illia Silin
6d90b2eb50 Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic

[ROCm/composable_kernel commit: 572cd820ce]
2025-04-03 15:30:21 -07:00
Rostyslav Geyyer
1716380358 Add FP16/BF16<->FP8/BF8 conversions (#2035)
* Move conversion functions and add missing conversions

* Add tests

* Add missing conversions

* Add missing conversions

* Add bf8 tests

* Update clipping for vectors

* Add missing conversions

* Add bf16 fp8 tests

* Add bf16 bf8 tests

* Fix device conversion

* Fix conversions

* Fix vector use

* Minor fix

* Add a workaround flag

* Add a workaround flag for bf16 conversion

* Add another workaround

* Add a workaround for fp16 to bf8 conversion

* Update type alias

* Add docstrings and missing wrappers

* Fix if defined macros

* Fix more if defined macros

* Add comments

* Remove __host__ specifier

* Add a gfx950 guard

* Update function naming

[ROCm/composable_kernel commit: 265af71a71]
2025-04-03 12:42:03 -05:00
aledudek
7a78bc823a Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments

[ROCm/composable_kernel commit: 9329432f6c]
2025-04-03 13:35:43 +02:00
Bartłomiej Kocot
49565538fe Add support for GKCYX grouped conv weight (#2023)
* Grouped conv bwd weight GKCYX support

* fix and changelog

* fix

* fix

* fixes

* comments

* fix

[ROCm/composable_kernel commit: 2ccf914888]
2025-04-02 23:59:49 +02:00
Adam Osewski
0a607256cf Basic docs for universal gemm & ck-tile gemm. (#2014)
* Basic docs for universal gemm & ck-tile gemm.

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Reviewers suggestions.

* Align tparam names in doc with class tparams.

* More reviewers fine tuning ;)

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: e5ad48a784]
2025-04-02 11:03:40 +02:00
Bartłomiej Kocot
50fb390d6f Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances

[ROCm/composable_kernel commit: 8c0ab61ece]
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
b417137276 Grouped conv fwd v3 fix for SplitN and G > 1 (#2038)
* Grouped conv fwd v3 fix for SplitN and G > 1

* Remove int8 large test

* Restore int8 test

[ROCm/composable_kernel commit: ec742908bd]
2025-04-01 13:19:35 -07:00