composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-15 02:27:57 +00:00

Author	SHA1	Message	Date
jefyang1	8b37a2968f	Fix pk i4 v3 example test regression on gfx942 (#2706 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `6ba9289b26`]	2025-08-19 09:58:28 -07:00
mirchen-amd	020c9139ec	Mirchen/gemm blockscale wp segfault fix (#2638 ) * Add stride validation to prevent segfault in blockscale GEMM * run clang-format * Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * added stride length checking to more gemm examples in ckprofiler * ran clang format * added validation header and implement in core gemm operations * remove ck_tile transpose and gemm stages from CI (#2646) * update CK build instruction step 4 (#2563) Co-authored-by: Aviral Goel <aviral.goel@amd.com> * Fixes to "General 2D Reduction Kernel" (#2535) (#2656) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format * enable aiter test_mha in daily CI (#2659) * feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> 
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * [CK_TILE] FMHA BWD Optimization For GFX950 (#2628) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp * Fix num_byte calculations to use nhead_k for K & V size (#2653) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ``` * [CK_TILE] FMHA BWD Decode Pipeline (#2643) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 * fix (#2668) * Optimize fmha fwd decode & prefill for gfx950 (#2641) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all 
instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> * Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670) This reverts commit `747d127983`. 
* added batch stride checking to batched gemm ops in profiler * removed batch stride validation * removed batched stride validation again * Update include/ck/library/utility/profiler_validation_common.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * refactor function names * added gemm stride checking to more profiler gemm operations * run clang format * add stride checking to 01 gemm example * rename from profiler to validation common, used for examples and profiler * build of ckProfiler success * update file headers --------- Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> Co-authored-by: Cameron Shinn <camerontshinn@gmail.com> Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com> Co-authored-by: Haocong WANG <haocwang@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `60320e90c1`]	2025-08-19 01:19:17 -07:00
Sami Remes	13bfcba04c	Add other layouts for FP8 block scaled gemm (#2665 ) * Start adding other layouts for gemm_ab_scale * Add some instances * Create tensor descriptors for A/B scales depending on A/B layout * Fix formatting * Revert some comments * Revert commented instances in CMakeLists.txt * Add some more instances for col-row gemm * enable more row,row instances * Use occupancy=1 for col,row layout to avoid spills [ROCm/composable_kernel commit: `26d3300930`]	2025-08-18 01:46:10 -07:00
jefyang1	1a412578d8	Add gemm universal f8 f8 bf16 instances on gfx950 (#2662 ) [ROCm/composable_kernel commit: `d7c95dd491`]	2025-08-14 13:25:24 -07:00
Enrico Degregori	ec1c249b63	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2675 ) Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `a6f4029276`]	2025-08-14 00:21:09 +02:00
Enrico Degregori	bdafbd7ca1	Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 )" (#2637 ) This reverts commit `863c87b1e2`. Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `21e9983913`]	2025-08-07 12:30:08 +02:00
Bartłomiej Kocot	14eef94b28	Grouped Convolution Forward Infer Bias Bnorm Activ (#2621 ) * Grouped Convolution Forward Infer Bias Bnorm Activ * 3d [ROCm/composable_kernel commit: `5328b232b2`]	2025-08-07 08:36:47 +02:00
Enrico Degregori	863c87b1e2	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 ) * Add padding 1x1Stride1Pad0 conv specialization * Add gridwise checks for conv cshufflev3 * Merge padding with previous transforms * Apply transform changes for padding to default specialization as well --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `2203b0ddfe`]	2025-08-05 15:23:19 +02:00
Illia Silin	ee2bb3f20b	update the switch condition for buffer built-ins (#2602 ) [ROCm/composable_kernel commit: `788e8a878e`]	2025-08-01 14:30:07 -07:00
lalala-sh	3318b0e25d	fix weight index out of range (#2414 ) [ROCm/composable_kernel commit: `bb5c478295`]	2025-08-01 17:50:02 +08:00
Ville Pietilä	f6f07467f4	Automatic deduction of split-K value for grouped convolution (#2491 ) * Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Use simple best occupancy model to calculate the split-K. * Handle split-K autodeduction in explicit gemm conv. * Add unit tests for split-K autodeduction. * Remove oversubscription. * Small fixes. * Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle. * Run clang formatting. * Fix error handling in the conv profiler. * Add missing documentation for the autodeducted split-K values. * Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver. * Fix clang formatting and split-K profiler documentation. * Rename max_occupancy value variable. * Calculate grid size for split-K autodeduction directly from input array shapes and template params. --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `e962a41638`]	2025-07-31 12:08:45 +02:00
Bartłomiej Kocot	ccc74013b1	Enable multiple D for grouped conv fwd large tensors (#2572 ) [ROCm/composable_kernel commit: `5b244105d9`]	2025-07-28 22:39:07 +02:00
linqunAMD	943db65a21	Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 ) * Remove HIP_COMPILE_DEVICE * add missing files * fix clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `0782ee8eb3`]	2025-07-28 13:01:07 -07:00
Illia Silin	24f228df3b	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config [ROCm/composable_kernel commit: `504b101da3`]	2025-07-28 11:34:07 -07:00
Bartłomiej Kocot	85c03cf088	Enable bf16 RNE on gfx950 (#2542 ) * Enable bf16 RNE for gfx950 * test bhalf * fix * fix * Comments fixes * fixes * clean * fix [ROCm/composable_kernel commit: `685771b875`]	2025-07-28 00:47:17 +02:00
Adam Osewski	5ccd4c679f	Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463 ) * Add logging to IsSupported. * Less casting in AddClamp * Conv+bias+clamp instances & profiler BF16 * Fix 3D instances & run just 1x for verification. * Run just once for verification conv fwd. * ckProfiler conv fwd clampwq * Remove exec bit & formatting * Add support for MultiD for grouped conv fwd v3. * Enable 2Lds. * clean * align instances * align instances * profiler fixes * Fixes * fix * fix --------- Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `c8eb2f995c`]	2025-07-25 10:34:31 +02:00
Enrico Degregori	cde0c4d224	Support b_scale: (#2350 ) - extend pipeline v1 and v3 - add instances - add tests - add example Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b01a27ff22`]	2025-07-24 18:49:58 -07:00
Illia Silin	6d77f529c7	remove repetitive code (#2562 ) [ROCm/composable_kernel commit: `9c04a55626`]	2025-07-24 14:52:46 -07:00
Andriy Roshchenko	77b38a8e9d	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8 [ROCm/composable_kernel commit: `3421272f90`]	2025-07-24 14:36:53 -04:00
Rostyslav Geyyer	c58afb0565	Update packed fp4 layout (#2523 ) [ROCm/composable_kernel commit: `c9886109b4`]	2025-07-21 16:58:59 -05:00
Mingtao Gu	38b61f2f24	[CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512 ) * added MPerBlock=32 for MXFP4 GEMM decode * added two instance for M>128 scenario. * added 1 instance * format --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `0198257d79`]	2025-07-18 14:35:54 +08:00
linqunAMD	8319789fe1	[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459 ) 1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com> [ROCm/composable_kernel commit: `fbd9f32abe`]	2025-07-17 08:19:57 +08:00
linqunAMD	f77d704980	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `6e76b82059`]	2025-07-16 07:58:23 -07:00
Illia Silin	3bf86334c9	replace obsolete warpSize system variable with the new one (#2496 ) [ROCm/composable_kernel commit: `a4bf78ac0e`]	2025-07-16 07:39:15 -07:00
huaiguxu	73da271e03	Handle moe_fp8 no-mainloop cases. Suppress no-mainloop check (#2438 ) Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `c1badfd30c`]	2025-07-16 15:44:34 +08:00
Andriy Roshchenko	259055e53f	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8 [ROCm/composable_kernel commit: `518dc21ae8`]	2025-07-11 13:07:05 -06:00
Illia Silin	9e654191ad	Add declarations for atomic add for fp16 and unsigned short. (#2483 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short * revert back to atomic add using casting [ROCm/composable_kernel commit: `1b66f3f4a3`]	2025-07-10 07:18:56 -07:00
Illia Silin	0db55aad94	Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474 ) This reverts commit `b00bbe4b35`. [ROCm/composable_kernel commit: `93420ecf89`]	2025-07-08 19:01:26 -07:00
Illia Silin	b00bbe4b35	Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short [ROCm/composable_kernel commit: `112b47e885`]	2025-07-08 18:09:30 -04:00
Andriy Roshchenko	67545a9d22	MX GEMM - FP6 Example (#2419 ) Adds support for MX FP6 data type in MX GEMM block pipeline version v1. Provides an example of MX FP6 GEMM algorithm. --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com> [ROCm/composable_kernel commit: `054f85ab7c`]	2025-07-07 10:33:26 -06:00
Mingtao Gu	b0bf370fbd	[CK] Mxfp4 moe blockscale buf2lds version support (#2455 ) * change cshuffle size * added mxfp4 moe async buffer loading without B preshuffle * added mx moe B shuffling + scale shuffling (async loads) * minor fix --------- Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `7998ae8969`]	2025-07-06 15:42:00 +08:00
Adam Osewski	6575a782c6	Always force output clearing for grouped conv bwd data (#2446 ) * Always force output clearing * dont run set zero for residual --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `3d70c638d1`]	2025-07-04 07:49:52 -06:00
Vidyasagar Ananthan	fa57cec140	Removing reference to undefined parameter for ignore statement. (#2447 ) [ROCm/composable_kernel commit: `2e971eff90`]	2025-07-03 20:10:29 -07:00
damien-lejeune	2dbb2bcb33	Fix clang in ck develop branch (#2445 ) Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> [ROCm/composable_kernel commit: `1183824573`]	2025-07-02 10:07:47 -06:00
chenjun	577a80122f	fix KPerBlock = 64 a8w8 bpreshuffle gemm build fail in gfx950 (#2437 ) Co-authored-by: valarLip <340077269@qq.com> [ROCm/composable_kernel commit: `74a34e0f50`]	2025-07-02 19:12:07 +08:00
Gino Lu	c9d043a1d1	Fix return value bug that drops minus sign in some cases. (#2415 ) * fix return value bug. * refine change according to comment. [ROCm/composable_kernel commit: `60eb70f543`]	2025-07-02 14:53:00 +08:00
huaiguxu	0ac91713ae	Huaiguxu/moe fp8 pertoken scale fix (#2391 ) * fix pertoken_scale a_scale dimension * clang-format * Fix moe_gemm2_fp8 perTokenScale reference and example. [ROCm/composable_kernel commit: `e1c5172fdb`]	2025-06-27 10:24:34 +08:00
linqunAMD	c7c24bb10d	[CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375 ) 1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it. 2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance. 3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw. [ROCm/composable_kernel commit: `1749c0409e`]	2025-06-26 08:32:39 +08:00
Rostyslav Geyyer	9e0bfd3dbb	Enable fp4 tests (#2329 ) [ROCm/composable_kernel commit: `daf71fb8e4`]	2025-06-25 07:38:54 -05:00
Xiao Li	b3b4aa8d57	Fix amd_ck_fp8.hpp macro definitions (#2325 ) * Fix amd_ck_fp8.hpp macro definitions 1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before. 2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace fp8_impl to avoid redefined error when building with rocm 6.4+ (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h) Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `bac51b6ec0`]	2025-06-24 22:46:15 -06:00
Anton Gorenko	e156b5aebb	Improve fmha_bwd tests performance (#2376 ) * Avoid passing indices (std::vector) by value to host tensor's operator() Each access requires 2 allocations and copies of the vector. * Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification * Compute ds_hp_host_ref in parallel This sequential ForEach is the slowest part of validation and it benefits from parallel computation. * Do not use ForEach for simple copy and conversion of large tensors These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices. [ROCm/composable_kernel commit: `77123600ee`]	2025-06-24 07:45:24 -07:00
Kiefer van Teutem	eb4b7c65ff	Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319 ) * Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty. * Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good. * Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet. * Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3. In principle the whole thing works now, just need to add other data types and perhaps do some cleanup. * Add other layouts for batched gemm wmma chufflev3 f16 f16 f16. Now matching XDL (for f16). * Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts. * Fixup comments and TODOs * Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small. * Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests. * Take batching into account when using rotating memory or clearing the C tensor. * Implement small refactors / comments etc. from review. * Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists. * Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems. 
* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups. * Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `9e74ae7c89`]	2025-06-24 07:28:13 -07:00
lalala-sh	b6c780fc7f	fix moe i4 bug from aiter (#2339 ) [ROCm/composable_kernel commit: `bb571a0330`]	2025-06-24 14:51:29 +08:00
Rostyslav Geyyer	de3cfbab9a	Add accelerated stochastic rounding on gfx950 (#2355 ) * Add native prand generation support for gfx950 * Update seed calculation [ROCm/composable_kernel commit: `dbfe70e72a`]	2025-06-23 09:31:46 -05:00
John Shumway	7c57c4f045	Shard several of the most costly targets. (#2373 ) * Shard several of the most costly targets. Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance * fix clang format * Fix build errors in instantiation code. I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR. * Fix link errors from mismatched declarations. Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for instantiating the kernels and to generate the calling function. * Shard the longest 2D convolution builds. Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes. * Use PROJECT_SOURCE_DIR for submodule compatibility. I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. 
Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project. * Remove accidental copy of a file * Remove accidental copies of template files. --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `47ae4b0955`]	2025-06-23 07:24:36 -07:00
Illia Silin	073bb8d588	Revert "Shard several of the most costly targets. (#2266 )" (#2361 ) This reverts commit `c1285aaada`. [ROCm/composable_kernel commit: `cdfd7722bf`]	2025-06-17 13:56:30 -07:00
Bartłomiej Kocot	d9316dfbeb	Fix Add in dynamic buffer for fp32/i8 (#2351 ) * Fix Add in dynamic buffer for fp32/i8 * fixes * Fix [ROCm/composable_kernel commit: `cc98a41f46`]	2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi	bde406245a	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
Bartłomiej Kocot	4ae33b454f	Grouped convolution forward with clamp (#2334 ) * Grouped convolution forward with clamp * Optimize clamp * unary fixes * test gk bias * Revert "test gk bias" This reverts commit `8e42e29d7b`. * Revert "Revert "test gk bias"" This reverts commit `e73c0550ce`. * workaround comment [ROCm/composable_kernel commit: `f6c2ff9dce`]	2025-06-16 15:36:53 +02:00
Illia Silin	7eaa398458	Fix direct lds load for gfx950 and clang20 (#2346 ) * fix direct lds load for gfx950 and clang20 * Update include/ck/utility/amd_buffer_addressing_builtins.hpp * Fix format --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `2d8a804152`]	2025-06-15 15:22:34 -07:00


671 Commits