Commit Graph

699 Commits

Author SHA1 Message Date
kiefer
b26f2c60c6 Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same. 2025-09-18 14:34:26 +00:00
kiefer
a6dbb39e90 After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts. 2025-09-18 08:06:09 +00:00
kiefer
58e7321098 Move calculation of rotating memory buffer sizes to Argument member functions. 2025-09-16 15:37:03 +00:00
kiefer
a8a5504f31 Don't use workspace memory in the case where A needs explicit transposition but B does not. 2025-09-16 15:34:35 +00:00
kiefer
a28d10253c Pass struct args to Gridwise Run() function by reference. 2025-09-16 09:08:49 +00:00
kiefer
e3fccf0115 Fixup comments 2025-09-16 09:07:38 +00:00
kiefer
b9986de7ae Remove old years for newly created files. 2025-09-15 08:48:35 +00:00
kiefer
b8d4b01267 Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations. 2025-09-04 14:22:26 +00:00
kiefer
52c42d5b47 Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check. 2025-09-02 09:26:24 +00:00
kiefer
382d6fea6b Add D layout tests to IsSupportedArgument() 2025-08-29 08:22:42 +00:00
kiefer
68f9e73b5e Implement multi-AB support for grouped conv fwd and add example. 2025-08-28 13:12:40 +00:00
kiefer
73521fe091 Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd 2025-08-27 08:02:51 +00:00
kiefer
ca7b3121cd Add int8 instances for 2D vanilla grouped conv fwd all layouts. 2025-08-26 12:16:41 +00:00
kiefer
e325dab094 Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types. 2025-08-26 09:20:38 +00:00
kiefer
6ad73cd0cd Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation. 2025-08-24 12:44:01 +00:00
kiefer
43f99d85bd Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future. 2025-08-20 10:43:49 +00:00
kiefer
16920dee0f Add support for fwd conv in gridwise implementation. Identical to run function for bwd data. 2025-08-20 08:56:53 +00:00
Enrico Degregori
9b7cf3ef81 Fix bug in device print function 2025-08-06 14:03:55 +00:00
Enrico Degregori
9ee5699e50 multi_abd wmma support:
- Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)
2025-08-06 13:54:56 +00:00
kiefer
c434378570 clang-format-18 2025-08-06 11:53:43 +00:00
Kiefer van Teutem
ec382804a9 Merge remote-tracking branch 'origin/develop' into 90-prepare-an-upstream-pr-for-multipled-based-gemms 2025-08-06 07:47:43 +00:00
Enrico Degregori
2203b0ddfe Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)
* Add padding 1x1Stride1Pad0 conv specialization

* Add gridwise checks for conv cshufflev3

* Merge padding with previous transforms

* Apply transform changes for padding to default specialization as well

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-08-05 15:23:19 +02:00
Enrico Degregori
02cb1f2b6a Fix bug from merge 2025-08-04 08:37:40 +00:00
Illia Silin
788e8a878e update the switch condition for buffer built-ins (#2602) 2025-08-01 14:30:07 -07:00
lalala-sh
bb5c478295 fix weight index out of range (#2414) 2025-08-01 17:50:02 +08:00
Ville Pietilä
e962a41638 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeducted split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>
2025-07-31 12:08:45 +02:00
Bartłomiej Kocot
5b244105d9 Enable multiple D for grouped conv fwd large tensors (#2572) 2025-07-28 22:39:07 +02:00
linqunAMD
0782ee8eb3 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-28 13:01:07 -07:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
Enrico Degregori
5dc21c5521 Merge branch 'develop' into feature/multiple-d-gemms 2025-07-28 17:18:18 +00:00
Bartłomiej Kocot
685771b875 Enable bf16 RNE on gfx950 (#2542)
* Enable bf16 RNE for gfx950

* test bhalf

* fix

* fix

* Comments fixes

* fixes

* clean

* fix
2025-07-28 00:47:17 +02:00
Adam Osewski
c8eb2f995c Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-07-25 10:34:31 +02:00
Enrico Degregori
b01a27ff22 Support b_scale: (#2350)
- extend pipeline v1 and v3
 - add instances
 - add tests
 - add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-07-24 18:49:58 -07:00
Illia Silin
9c04a55626 remove repetitive code (#2562) 2025-07-24 14:52:46 -07:00
Andriy Roshchenko
3421272f90 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8
2025-07-24 14:36:53 -04:00
Rostyslav Geyyer
c9886109b4 Update packed fp4 layout (#2523) 2025-07-21 16:58:59 -05:00
Mingtao Gu
0198257d79 [CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512)
* added MPerBlock=32 for MXFP4 GEMM decode

* added two instance for M>128 scenario.

* added 1 instance

* format

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-07-18 14:35:54 +08:00
linqunAMD
fbd9f32abe [CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459)
1. Port NCHW support from ConvFwd (#2375) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw

Co-authored-by: azhuang <anzhong.huang@amd.com>
2025-07-17 08:19:57 +08:00
linqunAMD
6e76b82059 Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-16 07:58:23 -07:00
Illia Silin
a4bf78ac0e replace obsolete warpSize system variable with the new one (#2496) 2025-07-16 07:39:15 -07:00
huaiguxu
c1badfd30c Handle moe_fp8 no-mainloop cases. Supprese no-mainloop check (#2438)
Co-authored-by: felix <felix.li@amd.com>
2025-07-16 15:44:34 +08:00
Apoorva Kalyani
036799d631 Merge branch '61-add-examples-for-bf16-and-fp16-instances-of-gemm_add-for-v1-and-v3-for-rdna4' into 'feature/multiple-d-gemms'
Resolve "Add Examples for bf16 and fp16 instances of gemm_add xdl and  wmma for v1 and v3 for RDNA4"

See merge request amd/ai/composable_kernel!24
2025-07-14 11:46:08 +00:00
Andriy Roshchenko
518dc21ae8 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8
2025-07-11 13:07:05 -06:00
Illia Silin
1b66f3f4a3 Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short

* revrt back to atomic add using casting
2025-07-10 07:18:56 -07:00
Zoltan Lakatos
a7993abd4b comments on the atomics workaround 2025-07-10 08:43:32 +00:00
Illia Silin
93420ecf89 Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474)
This reverts commit 112b47e885.
2025-07-08 19:01:26 -07:00
Illia Silin
112b47e885 Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short
2025-07-08 18:09:30 -04:00
Andriy Roshchenko
054f85ab7c MX GEMM - FP6 Example (#2419)
Adds support for MX FP6 data type in MX GEMM block pipeline version v1.
Provides an example of MX FP6 GEMM algorithm.

---------

Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
2025-07-07 10:33:26 -06:00
apoorva
8a5bb256f7 Fixes applied to fix the precision loss. 2025-07-07 09:41:00 +00:00
Mingtao Gu
7998ae8969 [CK] Mxfp4 moe blockscale buf2lds version support (#2455)
* change cshuffle size

* added mxfp4 moe async buffer loading without B preshuffle

* added mx moe B shuffling + scale shuffling (async loads)

* minor fix

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-07-06 15:42:00 +08:00