composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Author	SHA1	Message	Date
Sami Remes	64edaacebe	Another alternative, this is numerically correct, but lets compiler do more work to enable the double buffering	2025-09-11 13:38:29 +00:00
Sami Remes	180a436cca	WIP: Some numerical issues still, maybe from tail? It should allow decoupling the MFMA and the FMA-scaling with two c_thread_buf_per_scale buffers, and look ahead fetching of a/b thread bufs. The performance is still quite similar as without double buffering.	2025-09-11 12:22:51 +00:00
Sami Remes	5f5561a499	WIP	2025-09-11 07:51:40 +00:00
Sami Remes	691f2c99b7	use double buffer only for interleaving MFMA and scaling	2025-09-04 16:27:18 +00:00
Sami Remes	ab0f764a1c	WIP	2025-09-03 07:18:19 +00:00
Sami Remes	98d7fea950	Merge remote-tracking branch 'origin/develop' into samremes/double_buffer_fp8_ab_scale	2025-09-02 12:13:43 +00:00
Sami Remes	4419fc34a2	Fix formatting problem (#2768)	2025-09-02 14:14:10 +03:00
Michael Mcminn	022f369deb	Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGem… (#2751) * Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16 * Adding support for offload target gfx9-4-generic * This duplication here isn't ideal	2025-09-02 10:35:07 +02:00
Bartłomiej Kocot	cfe5e448db	Fix splitk autodeduce for grouped conv bwd weight (#2742)	2025-08-27 12:35:42 +02:00
linqunAMD	95e4a4efcb	Fix merge mfma_wmma (part 1) regression (#2749) root cause: a typo in GetGfx11InputBlkIdx, const is added by mistake.	2025-08-26 22:49:34 -07:00
linqunAMD	d6e49c5fde	Extend XDL kernel to Support RDNA3/4 - Part 1 (#2606)	2025-08-22 17:46:30 -04:00
Sami Remes	c82db3d178	WIP: breaks v1 pipeline	2025-08-21 11:35:00 +00:00
Sami Remes	3dc6b7c71a	WIP	2025-08-21 11:33:53 +00:00
jefyang1	6ba9289b26	Fix pk i4 v3 example test regression on gfx942 (#2706 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-08-19 09:58:28 -07:00
mirchen-amd	60320e90c1	Mirchen/gemm blockscale wp segfault fix (#2638 ) * Add stride validation to prevent segfault in blockscale GEMM * run clang-format * Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * added stride length checking to more gemm examples in ckprofiler * ran clang format * added validation header and implement in core gemm operations * remove ck_tile transpose and gemm stages from CI (#2646) * update CK build instruction step 4 (#2563) Co-authored-by: Aviral Goel <aviral.goel@amd.com> * Fixes to "General 2D Reduction Kernel" (#2535) (#2656) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format * enable aiter test_mha in daily CI (#2659) * feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> 
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * [CK_TILE] FMHA BWD Optimization For GFX950 (#2628) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp * Fix num_byte calculations to use nhead_k for K & V size (#2653) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ``` * [CK_TILE] FMHA BWD Decode Pipeline (#2643) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 * fix (#2668) * Optimize fmha fwd decode & prefill for gfx950 (#2641) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all 
instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> * Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670) This reverts commit `b7322a521a`. 
* added batch stride checking to batched gemm ops in profiler * removed batch stride validation * removed batched stride validation again * Update include/ck/library/utility/profiler_validation_common.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * refactor function names * added gemm stride checking to more profiler gemm operations * run clang format * add stride checkign to 01 gemm example * rename from profiler to validation common, used for examples and profiler * build of ckProfiler success * update file headers --------- Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> Co-authored-by: Cameron Shinn <camerontshinn@gmail.com> Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com> Co-authored-by: Haocong WANG <haocwang@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-08-19 01:19:17 -07:00
Sami Remes	abcf2f3c97	WIP	2025-08-19 08:01:33 +00:00
Sami Remes	26d3300930	Add other layouts for FP8 block scaled gemm (#2665 ) * Start adding other layouts for gemm_ab_scale * Add some instances * Create tensor descriptors for A/B scales depending on A/B layout * Fix formatting * Revert some comments * Revert commented instances in CMakeLists.txt * Add some more instances for col-row gemm * enable more row,row instances * Use occupancy=1 for col,row layout to avoid spills	2025-08-18 01:46:10 -07:00
jefyang1	d7c95dd491	Add gemm universal f8 f8 bf16 instances on gfx950 (#2662 )	2025-08-14 13:25:24 -07:00
Enrico Degregori	a6f4029276	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2675 ) Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-14 00:21:09 +02:00
Enrico Degregori	21e9983913	Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 )" (#2637 ) This reverts commit `2203b0ddfe`. Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-07 12:30:08 +02:00
Bartłomiej Kocot	5328b232b2	Grouped Convolution Forward Infer Bias Bnorm Activ (#2621 ) * Grouped Convolution Forward Infer Bias Bnorm Activ * 3d	2025-08-07 08:36:47 +02:00
Enrico Degregori	2203b0ddfe	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 ) * Add padding 1x1Stride1Pad0 conv specialization * Add gridwise checks for conv cshufflev3 * Merge padding with previous transforms * Apply transform changes for padding to default specialization as well --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-05 15:23:19 +02:00
Illia Silin	788e8a878e	update the switch condition for buffer built-ins (#2602 )	2025-08-01 14:30:07 -07:00
lalala-sh	bb5c478295	fix weight index out of range (#2414 )	2025-08-01 17:50:02 +08:00
Ville Pietilä	e962a41638	Automatic deduction of split-K value for grouped convolution (#2491 ) * Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Use simple best occupancy model to calculate the split-K. * Handle split-K autodeduction in explicit gemm conv. * Add unit tests for split-K autodeduction. * Remove oversubscription. * Small fixes. * Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle. * Run clang formatting. * Fix error handling in the conv profiler. * Add missing documentation for the autodeducted split-K values. * Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver. * Fix clang formatting and split-K profiler documentation. * Rename max_occupancy value variable. * Calculate grid size for split-K autodeduction directly from input array shapes and template params. --------- Co-authored-by: Ville Pietilä <>	2025-07-31 12:08:45 +02:00
Bartłomiej Kocot	5b244105d9	Enable multiple D for grouped conv fwd large tensors (#2572 )	2025-07-28 22:39:07 +02:00
linqunAMD	0782ee8eb3	Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 ) * Remove HIP_COMPILE_DEVICE * add missing files * fix clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-28 13:01:07 -07:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
Bartłomiej Kocot	685771b875	Enable bf16 RNE on gfx950 (#2542 ) * Enable bf16 RNE for gfx950 * test bhalf * fix * fix * Comments fixes * fixes * clean * fix	2025-07-28 00:47:17 +02:00
Adam Osewski	c8eb2f995c	Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463) * Add logging to IsSupported. * Less casting in AddClamp * Conv+bias+clamp instances & profiler BF16 * Fix 3D instances & run just 1x for verification. * Run just once for verification conv fwd. * ckProfiler conv fwd clampwq * Remove exec bit & formatting * Add support for MultiD for grouped conv fwd v3. * Enable 2Lds. * clean * align instances * align instances * profiler fixes * Fixes * fix * fix --------- Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-07-25 10:34:31 +02:00
Enrico Degregori	b01a27ff22	Support b_scale: (#2350 ) - extend pipeline v1 and v3 - add instances - add tests - add example Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-07-24 18:49:58 -07:00
Illia Silin	9c04a55626	remove repetitive code (#2562 )	2025-07-24 14:52:46 -07:00
Andriy Roshchenko	3421272f90	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8	2025-07-24 14:36:53 -04:00
Rostyslav Geyyer	c9886109b4	Update packed fp4 layout (#2523 )	2025-07-21 16:58:59 -05:00
Mingtao Gu	0198257d79	[CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512 ) * added MPerBlock=32 for MXFP4 GEMM decode * added two instance for M>128 scenario. * added 1 instance * format --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-07-18 14:35:54 +08:00
linqunAMD	fbd9f32abe	[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459 ) 1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com>	2025-07-17 08:19:57 +08:00
linqunAMD	6e76b82059	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-16 07:58:23 -07:00
Illia Silin	a4bf78ac0e	replace obsolete warpSize system variable with the new one (#2496)	2025-07-16 07:39:15 -07:00
huaiguxu	c1badfd30c	Handle moe_fp8 no-mainloop cases. Suppress no-mainloop check (#2438) Co-authored-by: felix <felix.li@amd.com>	2025-07-16 15:44:34 +08:00
Andriy Roshchenko	518dc21ae8	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8	2025-07-11 13:07:05 -06:00
Illia Silin	1b66f3f4a3	Add declarations for atomic add for fp16 and unsigned short. (#2483) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short * revert back to atomic add using casting	2025-07-10 07:18:56 -07:00
Illia Silin	93420ecf89	Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474) This reverts commit `112b47e885`.	2025-07-08 19:01:26 -07:00
Illia Silin	112b47e885	Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short	2025-07-08 18:09:30 -04:00
Andriy Roshchenko	054f85ab7c	MX GEMM - FP6 Example (#2419 ) Adds support for MX FP6 data type in MX GEMM block pipeline version v1. Provides an example of MX FP6 GEMM algorithm. --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com>	2025-07-07 10:33:26 -06:00
Mingtao Gu	7998ae8969	[CK] Mxfp4 moe blockscale buf2lds version support (#2455 ) * change cshuffle size * added mxfp4 moe async buffer loading without B preshuffle * added mx moe B shuffling + scale shuffling (async loads) * minor fix --------- Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-07-06 15:42:00 +08:00
Adam Osewski	3d70c638d1	Always force output clearing for grouped conv bwd data (#2446 ) * Always force output clearing * dont run set zero for residual --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-07-04 07:49:52 -06:00
Vidyasagar Ananthan	2e971eff90	Removing reference to undefined parameter for ignore statement. (#2447)	2025-07-03 20:10:29 -07:00
damien-lejeune	1183824573	Fix clang in ck develop branch (#2445) Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2025-07-02 10:07:47 -06:00
chenjun	74a34e0f50	fix KPerBlock = 64 a8w8 bpreshuffle gemm build fail in gfx950 (#2437) Co-authored-by: valarLip <340077269@qq.com>	2025-07-02 19:12:07 +08:00
Gino Lu	60eb70f543	Fix return value bug that drops minus sign in some cases. (#2415) * fix return value bug. * refine change according to comment.	2025-07-02 14:53:00 +08:00


685 Commits