composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-03 05:37:34 +00:00

Author	SHA1	Message	Date
kiefer	b26f2c60c6	Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.	2025-09-18 14:34:26 +00:00
kiefer	a6dbb39e90	After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.	2025-09-18 08:06:09 +00:00
kiefer	58e7321098	Move calculation of rotating memory buffer sizes to Argument member functions.	2025-09-16 15:37:03 +00:00
kiefer	a8a5504f31	Don't use workspace memory in the case where A needs explicit transposition but B does not.	2025-09-16 15:34:35 +00:00
kiefer	a28d10253c	Pass struct args to Gridwise Run() function by reference.	2025-09-16 09:08:49 +00:00
kiefer	e3fccf0115	Fixup comments	2025-09-16 09:07:38 +00:00
kiefer	b9986de7ae	Remove old years for newly created files.	2025-09-15 08:48:35 +00:00
kiefer	b8d4b01267	Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.	2025-09-04 14:22:26 +00:00
kiefer	52c42d5b47	Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.	2025-09-02 09:26:24 +00:00
kiefer	382d6fea6b	Add D layout tests to IsSupportedArgument()	2025-08-29 08:22:42 +00:00
kiefer	68f9e73b5e	Implement multi-AB support for grouped conv fwd and add example.	2025-08-28 13:12:40 +00:00
kiefer	73521fe091	Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd	2025-08-27 08:02:51 +00:00
kiefer	ca7b3121cd	Add int8 instances for 2D vanilla grouped conv fwd all layouts.	2025-08-26 12:16:41 +00:00
kiefer	e325dab094	Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.	2025-08-26 09:20:38 +00:00
kiefer	6ad73cd0cd	Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.	2025-08-24 12:44:01 +00:00
kiefer	43f99d85bd	Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.	2025-08-20 10:43:49 +00:00
kiefer	16920dee0f	Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.	2025-08-20 08:56:53 +00:00
Enrico Degregori	9b7cf3ef81	Fix bug in device print function	2025-08-06 14:03:55 +00:00
Enrico Degregori	9ee5699e50	multi_abd wmma support: - Add multiple A and B support to multiple D implementation (gridwise level) - Add multi_abd GEMM (device level) - Add instances (xdl parity) - Add tests (both xdl and wmma) - Add examples - Add ckProfiler support (both xdl and wmma)	2025-08-06 13:54:56 +00:00
kiefer	c434378570	clang-format-18	2025-08-06 11:53:43 +00:00
Kiefer van Teutem	ec382804a9	Merge remote-tracking branch 'origin/develop' into 90-prepare-an-upstream-pr-for-multipled-based-gemms	2025-08-06 07:47:43 +00:00
Enrico Degregori	2203b0ddfe	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 ) * Add padding 1x1Stride1Pad0 conv specialization * Add gridwise checks for conv cshufflev3 * Merge padding with previous transforms * Apply transform changes for padding to default specialization as well --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-05 15:23:19 +02:00
Enrico Degregori	02cb1f2b6a	Fix bug from merge	2025-08-04 08:37:40 +00:00
Illia Silin	788e8a878e	update the switch condition for buffer built-ins (#2602 )	2025-08-01 14:30:07 -07:00
lalala-sh	bb5c478295	fix weight index out of range (#2414 )	2025-08-01 17:50:02 +08:00
Ville Pietilä	e962a41638	Automatic deduction of split-K value for grouped convolution (#2491 ) * Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Use simple best occupancy model to calculate the split-K. * Handle split-K autodeduction in explicit gemm conv. * Add unit tests for split-K autodeduction. * Remove oversubscription. * Small fixes. * Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle. * Run clang formatting. * Fix error handling in the conv profiler. * Add missing documentation for the autodeducted split-K values. * Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver. * Fix clang formatting and split-K profiler documentation. * Rename max_occupancy value variable. * Calculate grid size for split-K autodeduction directly from input array shapes and template params. --------- Co-authored-by: Ville Pietilä <>	2025-07-31 12:08:45 +02:00
Bartłomiej Kocot	5b244105d9	Enable multiple D for grouped conv fwd large tensors (#2572 )	2025-07-28 22:39:07 +02:00
linqunAMD	0782ee8eb3	Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 ) * Remove HIP_COMPILE_DEVICE * add missing files * fix clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-28 13:01:07 -07:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
Enrico Degregori	5dc21c5521	Merge branch 'develop' into feature/multiple-d-gemms	2025-07-28 17:18:18 +00:00
Bartłomiej Kocot	685771b875	Enable bf16 RNE on gfx950 (#2542 ) * Enable bf16 RNE for gfx950 * test bhalf * fix * fix * Comments fixes * fixes * clean * fix	2025-07-28 00:47:17 +02:00
Adam Osewski	c8eb2f995c	Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463 ) * Add logging to IsSupported. * Less casting in AddClamp * Conv+bias+clamp instances & profiler BF16 * Fix 3D instances & run just 1x for verification. * Run just once for verification conv fwd. * ckProfiler conv fwd clampwq * Remove exec bit & formatting * Add support for MultiD for grouped conv fwd v3. * Enable 2Lds. * clean * align instances * align instances * profiler fixes * Fixes * fix * fix --------- Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-07-25 10:34:31 +02:00
Enrico Degregori	b01a27ff22	Support b_scale: (#2350 ) - extend pipeline v1 and v3 - add instances - add tests - add example Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-07-24 18:49:58 -07:00
Illia Silin	9c04a55626	remove repetitive code (#2562 )	2025-07-24 14:52:46 -07:00
Andriy Roshchenko	3421272f90	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8	2025-07-24 14:36:53 -04:00
Rostyslav Geyyer	c9886109b4	Update packed fp4 layout (#2523 )	2025-07-21 16:58:59 -05:00
Mingtao Gu	0198257d79	[CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512 ) * added MPerBlock=32 for MXFP4 GEMM decode * added two instance for M>128 scenario. * added 1 instance * format --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-07-18 14:35:54 +08:00
linqunAMD	fbd9f32abe	[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459 ) 1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com>	2025-07-17 08:19:57 +08:00
linqunAMD	6e76b82059	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-16 07:58:23 -07:00
Illia Silin	a4bf78ac0e	replace obsolete warpSize system variable with the new one (#2496 )	2025-07-16 07:39:15 -07:00
huaiguxu	c1badfd30c	Handle moe_fp8 no-mainloop cases. Suppress no-mainloop check (#2438 ) Co-authored-by: felix <felix.li@amd.com>	2025-07-16 15:44:34 +08:00
Apoorva Kalyani	036799d631	Merge branch '61-add-examples-for-bf16-and-fp16-instances-of-gemm_add-for-v1-and-v3-for-rdna4' into 'feature/multiple-d-gemms' Resolve "Add Examples for bf16 and fp16 instances of gemm_add xdl and wmma for v1 and v3 for RDNA4" See merge request amd/ai/composable_kernel!24	2025-07-14 11:46:08 +00:00
Andriy Roshchenko	518dc21ae8	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8	2025-07-11 13:07:05 -06:00
Illia Silin	1b66f3f4a3	Add declarations for atomic add for fp16 and unsigned short. (#2483 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short * revert back to atomic add using casting	2025-07-10 07:18:56 -07:00
Zoltan Lakatos	a7993abd4b	comments on the atomics workaround	2025-07-10 08:43:32 +00:00
Illia Silin	93420ecf89	Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474 ) This reverts commit `112b47e885`.	2025-07-08 19:01:26 -07:00
Illia Silin	112b47e885	Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short	2025-07-08 18:09:30 -04:00
Andriy Roshchenko	054f85ab7c	MX GEMM - FP6 Example (#2419 ) Adds support for MX FP6 data type in MX GEMM block pipeline version v1. Provides an example of MX FP6 GEMM algorithm. --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com>	2025-07-07 10:33:26 -06:00
apoorva	8a5bb256f7	Fixes applied to fix the precision loss.	2025-07-07 09:41:00 +00:00
Mingtao Gu	7998ae8969	[CK] Mxfp4 moe blockscale buf2lds version support (#2455 ) * change cshuffle size * added mxfp4 moe async buffer loading without B preshuffle * added mx moe B shuffling + scale shuffling (async loads) * minor fix --------- Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-07-06 15:42:00 +08:00


699 Commits