Sami Remes
64edaacebe
Another alternative: this is numerically correct, but lets the compiler do more work to enable the double buffering
2025-09-11 13:38:29 +00:00
Sami Remes
180a436cca
WIP: Some numerical issues still, maybe from tail?
...
It should allow decoupling the MFMA and the FMA scaling with two
c_thread_buf_per_scale buffers, and look-ahead fetching of the
a/b thread bufs.
The performance is still about the same as without double buffering.
2025-09-11 12:22:51 +00:00
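The ping-pong idea described in commit 180a436cca can be sketched on the host as follows. This is only an illustration under stated assumptions, not the kernel code: `scaled_accumulate_double_buffered`, `partial`, and `scale` are hypothetical names, one float stands in for a whole `c_thread_buf_per_scale` buffer, and the "MFMA" and "FMA scaling" steps run sequentially here rather than overlapping on hardware. The explicit tail step is where the last buffer is drained, which is the kind of place the "maybe from tail?" note points at.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one block-scaled K loop on a single output element.
// partial[i] plays the role of the unscaled MFMA result for K-block i,
// scale[i] plays the role of the combined A/B scale for that block.
inline float scaled_accumulate_double_buffered(const std::vector<float>& partial,
                                               const std::vector<float>& scale)
{
    assert(partial.size() == scale.size());
    const std::size_t n = partial.size();
    if(n == 0)
        return 0.0f;

    std::array<float, 2> c_per_scale{}; // two ping-pong buffers
    float c_final = 0.0f;

    // Prologue: produce the first block before any scaling happens.
    c_per_scale[0] = partial[0];

    for(std::size_t i = 1; i < n; ++i)
    {
        const std::size_t cur  = i % 2;
        const std::size_t prev = (i - 1) % 2;
        c_per_scale[cur] = partial[i];               // "MFMA" for block i
        c_final += c_per_scale[prev] * scale[i - 1]; // "FMA scaling" of block i-1
    }

    // Tail: the last produced buffer has not been scaled yet.
    c_final += c_per_scale[(n - 1) % 2] * scale[n - 1];
    return c_final;
}
```

Because the two statements in the loop body touch different buffers, a scheduler is free to interleave them; forgetting the tail step silently drops the last K-block.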
Sami Remes
5f5561a499
WIP
2025-09-11 07:51:40 +00:00
Sami Remes
691f2c99b7
use double buffer only for interleaving MFMA and scaling
2025-09-04 16:27:18 +00:00
Sami Remes
ab0f764a1c
WIP
2025-09-03 07:18:19 +00:00
Sami Remes
98d7fea950
Merge remote-tracking branch 'origin/develop' into samremes/double_buffer_fp8_ab_scale
2025-09-02 12:13:43 +00:00
Sami Remes
4419fc34a2
Fix formatting problem ( #2768 )
2025-09-02 14:14:10 +03:00
Michael Mcminn
022f369deb
Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGem… ( #2751 )
...
* Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGemmMfmaBf16Bf16F32M4N64K16 and WarpGemmMfmaBf16Bf16F32M64N4K16
* Adding support for offload target gfx9-4-generic
* This duplication here isn't ideal
2025-09-02 10:35:07 +02:00
Bartłomiej Kocot
cfe5e448db
Fix splitk autodeduce for grouped conv bwd weight ( #2742 )
2025-08-27 12:35:42 +02:00
linqunAMD
95e4a4efcb
Fix merge mfma_wmma (part 1) regression ( #2749 )
...
Root cause: a typo in GetGfx11InputBlkIdx; const was added by mistake.
2025-08-26 22:49:34 -07:00
linqunAMD
d6e49c5fde
Extend XDL kernel to Support RDNA3/4 - Part 1 ( #2606 )
2025-08-22 17:46:30 -04:00
Sami Remes
c82db3d178
WIP: breaks v1 pipeline
2025-08-21 11:35:00 +00:00
Sami Remes
3dc6b7c71a
WIP
2025-08-21 11:33:53 +00:00
jefyang1
6ba9289b26
Fix pk i4 v3 example test regression on gfx942 ( #2706 )
...
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-08-19 09:58:28 -07:00
mirchen-amd
60320e90c1
Mirchen/gemm blockscale wp segfault fix ( #2638 )
...
* Add stride validation to prevent segfault in blockscale GEMM
* run clang-format
* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
* added stride length checking to more gemm examples in ckprofiler
* ran clang format
* added validation header and implement in core gemm operations
* remove ck_tile transpose and gemm stages from CI (#2646 )
* update CK build instruction step 4 (#2563 )
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
* Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 )
* fix reduce2d
- revert the combine_partial_results() changes
- remove auto from function def
* clang-format
* enable aiter test_mha in daily CI (#2659 )
* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582 )
* feat(copy_kernel): add basic copy kernel example with documentation
* docs(CHANGELOG): Updated changelog
* chore: performed clang format
* Update example/ck_tile/39_copy/copy_basic.cpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* fix(terminology): follow amd terms
* extract elementwise copy to a new kernel
* fix(copy_kernel): bug in verification
* add comments about vgpr usage
* lint and nits
* add notes and comments
* print hostTensor via stream
* print hostTensor via stream
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628 )
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
* Fix num_byte calculations to use nhead_k for K & V size (#2653 )
Simple fix to calculate the number of bytes correctly for the bandwidth reported in the output. I was getting 6200 GB/s, which is past the speed of light (SoL) of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
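The difference between the two reported bandwidths can be reproduced with a short calculation. The sketch below is an assumption about how the byte count is formed (Q and O sized by `h`, K and V sized by `h_k`, 2 bytes per bf16 element, equal head dims for Q and V); `FmhaShape`, `fmha_fwd_bytes`, and `gb_per_s` are hypothetical names and not functions from the example source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shape descriptor mirroring the command-line flags above.
struct FmhaShape
{
    std::int64_t batch, nhead_q, nhead_k, seqlen_q, seqlen_k, hdim;
};

// Bytes read and written by one forward call. When kv_use_nhead_k is false,
// K and V are (incorrectly) sized with nhead_q, as in the "Before" output.
inline std::int64_t fmha_fwd_bytes(const FmhaShape& s, std::int64_t elem_bytes, bool kv_use_nhead_k)
{
    assert(elem_bytes > 0);
    const std::int64_t kv_heads = kv_use_nhead_k ? s.nhead_k : s.nhead_q;
    const std::int64_t q = s.batch * s.nhead_q * s.seqlen_q * s.hdim;
    const std::int64_t k = s.batch * kv_heads * s.seqlen_k * s.hdim;
    const std::int64_t v = k; // assumes hdim_v == hdim_q
    const std::int64_t o = q;
    return (q + k + v + o) * elem_bytes;
}

// Bandwidth in GB/s for a given byte count and elapsed time in milliseconds.
inline double gb_per_s(std::int64_t bytes, double ms)
{
    return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}
```

For the run above (`b=2, h=32, h_k=8, s=1, s_k=32768, d=128`), sizing K and V with `h` gives 1,073,774,592 bytes, or about 6207 GB/s at 0.173 ms, while sizing them with `h_k` gives 268,468,224 bytes, or about 1647 GB/s at 0.163 ms. Both are close to the logged 6202.95 and 1644.53 GB/s; the small gaps are presumably due to the rounded timings.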
* [CK_TILE] FMHA BWD Decode Pipeline (#2643 )
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
* fix (#2668 )
* Optimize fmha fwd decode & prefill for gfx950 (#2641 )
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the performance regression caused by LDS alignment
* enable prefill overload operator().
* remove all LDS bank conflicts with XOR layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 )
This reverts commit b7322a521a .
* added batch stride checking to batched gemm ops in profiler
* removed batch stride validation
* removed batched stride validation again
* Update include/ck/library/utility/profiler_validation_common.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
* refactor function names
* added gemm stride checking to more profiler gemm operations
* run clang format
* add stride checking to 01 gemm example
* rename from profiler to validation common, used for examples and profiler
* build of ckProfiler success
* update file headers
---------
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com >
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
Co-authored-by: Yi DING <yi.ding@amd.com >
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com >
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com >
Co-authored-by: Haocong WANG <haocwang@amd.com >
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
2025-08-19 01:19:17 -07:00
Sami Remes
abcf2f3c97
WIP
2025-08-19 08:01:33 +00:00
Sami Remes
26d3300930
Add other layouts for FP8 block scaled gemm ( #2665 )
...
* Start adding other layouts for gemm_ab_scale
* Add some instances
* Create tensor descriptors for A/B scales depending on A/B layout
* Fix formatting
* Revert some comments
* Revert commented instances in CMakeLists.txt
* Add some more instances for col-row gemm
* enable more row,row instances
* Use occupancy=1 for col,row layout to avoid spills
2025-08-18 01:46:10 -07:00
jefyang1
d7c95dd491
Add gemm universal f8 f8 bf16 instances on gfx950 ( #2662 )
2025-08-14 13:25:24 -07:00
Enrico Degregori
a6f4029276
Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) ( #2675 )
...
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
2025-08-14 00:21:09 +02:00
Enrico Degregori
21e9983913
Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) ( #2610 )" ( #2637 )
...
This reverts commit 2203b0ddfe .
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
2025-08-07 12:30:08 +02:00
Bartłomiej Kocot
5328b232b2
Grouped Convolution Forward Infer Bias Bnorm Activ ( #2621 )
...
* Grouped Convolution Forward Infer Bias Bnorm Activ
* 3d
2025-08-07 08:36:47 +02:00
Enrico Degregori
2203b0ddfe
Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) ( #2610 )
...
* Add padding 1x1Stride1Pad0 conv specialization
* Add gridwise checks for conv cshufflev3
* Merge padding with previous transforms
* Apply transform changes for padding to default specialization as well
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
2025-08-05 15:23:19 +02:00
Illia Silin
788e8a878e
update the switch condition for buffer built-ins ( #2602 )
2025-08-01 14:30:07 -07:00
lalala-sh
bb5c478295
fix weight index out of range ( #2414 )
2025-08-01 17:50:02 +08:00
Ville Pietilä
e962a41638
Automatic deduction of split-K value for grouped convolution ( #2491 )
...
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.
* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.
* Use simple best occupancy model to calculate the split-K.
* Handle split-K autodeduction in explicit gemm conv.
* Add unit tests for split-K autodeduction.
* Remove oversubscription.
* Small fixes.
* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.
* Run clang formatting.
* Fix error handling in the conv profiler.
* Add missing documentation for the autodeducted split-K values.
* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.
* Fix clang formatting and split-K profiler documentation.
* Rename max_occupancy value variable.
* Calculate grid size for split-K autodeduction directly from input array shapes and template params.
---------
Co-authored-by: Ville Pietilä <>
2025-07-31 12:08:45 +02:00
Bartłomiej Kocot
5b244105d9
Enable multiple D for grouped conv fwd large tensors ( #2572 )
2025-07-28 22:39:07 +02:00
linqunAMD
0782ee8eb3
Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel ( #2564 )
...
* Remove HIP_COMPILE_DEVICE
* add missing files
* fix clang format
---------
Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com >
2025-07-28 13:01:07 -07:00
Illia Silin
504b101da3
upgrade from clang-format-12 to clang-format-18 ( #2568 )
...
* upgrade to clang-format-18
* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
Bartłomiej Kocot
685771b875
Enable bf16 RNE on gfx950 ( #2542 )
...
* Enable bf16 RNE for gfx950
* test bhalf
* fix
* fix
* Comments fixes
* fixes
* clean
* fix
2025-07-28 00:47:17 +02:00
Adam Osewski
c8eb2f995c
Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler ( #2463 )
...
* Add logging to IsSupported.
* Less casting in AddClamp
* Conv+bias+clamp instances & profiler BF16
* Fix 3D instances & run just 1x for verification.
* Run just once for verification conv fwd.
* ckProfiler conv fwd clampwq
* Remove exec bit & formatting
* Add support for MultiD for grouped conv fwd v3.
* Enable 2Lds.
* clean
* align instances
* align instances
* profiler fixes
* Fixes
* fix
* fix
---------
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu >
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
2025-07-25 10:34:31 +02:00
Enrico Degregori
b01a27ff22
Support b_scale: ( #2350 )
...
- extend pipeline v1 and v3
- add instances
- add tests
- add example
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2025-07-24 18:49:58 -07:00
Illia Silin
9c04a55626
remove repetitive code ( #2562 )
2025-07-24 14:52:46 -07:00
Andriy Roshchenko
3421272f90
MX GEMM - FP6 Support in GEMM MX v3 Pipeline ( #2481 )
...
* Add GEMM MX BF6 example
* Fix BF6 type_convert
* Add type_convert for bf16x6
* Add compare operator to f4x2_pk_t
* Update README for 67_gemm_microscaling
* Fix host tensor initialization with integer values for FP8
2025-07-24 14:36:53 -04:00
Rostyslav Geyyer
c9886109b4
Update packed fp4 layout ( #2523 )
2025-07-21 16:58:59 -05:00
Mingtao Gu
0198257d79
[CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode ( #2512 )
...
* added MPerBlock=32 for MXFP4 GEMM decode
* added two instances for M>128 scenario.
* added 1 instance
* format
---------
Co-authored-by: mtgu0705 <mtgu@amd.com >
Co-authored-by: felix <felix.li@amd.com >
2025-07-18 14:35:54 +08:00
linqunAMD
fbd9f32abe
[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 ( #2459 )
...
1. Port NCHW support from ConvFwd (#2375 ) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw
Co-authored-by: azhuang <anzhong.huang@amd.com >
2025-07-17 08:19:57 +08:00
linqunAMD
6e76b82059
Fix build errors on windows ( #2456 )
...
* Fix build errors on windows
* correct clang format
---------
Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com >
2025-07-16 07:58:23 -07:00
Illia Silin
a4bf78ac0e
replace obsolete warpSize system variable with the new one ( #2496 )
2025-07-16 07:39:15 -07:00
huaiguxu
c1badfd30c
Handle moe_fp8 no-mainloop cases. Suppress no-mainloop check ( #2438 )
...
Co-authored-by: felix <felix.li@amd.com >
2025-07-16 15:44:34 +08:00
Andriy Roshchenko
518dc21ae8
MX GEMM - FP6 Support in GEMM MX v3 Pipeline ( #2481 )
...
* Add GEMM MX BF6 example
* Fix BF6 type_convert
* Add type_convert for bf16x6
* Add compare operator to f4x2_pk_t
* Update README for 67_gemm_microscaling
* Fix host tensor initialization with integer values for FP8
2025-07-11 13:07:05 -06:00
Illia Silin
1b66f3f4a3
Add declarations for atomic add for fp16 and unsigned short. ( #2483 )
...
* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
* revert back to atomic add using casting
2025-07-10 07:18:56 -07:00
Illia Silin
93420ecf89
Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" ( #2474 )
...
This reverts commit 112b47e885 .
2025-07-08 19:01:26 -07:00
Illia Silin
112b47e885
Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. ( #2471 )
...
* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
2025-07-08 18:09:30 -04:00
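The "use atomicCAS in atomic add" step in commit 112b47e885 refers to a common pattern: hardware compare-and-swap works on 32-bit words, so a 16-bit add is built as a retry loop that rewrites one half of the containing word. The sketch below is a host-side analogue using `std::atomic` rather than HIP's `atomicCAS`, and covers only the unsigned short case; `atomic_add_u16` and its `lane` parameter are hypothetical and not the CK declarations.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Adds `value` to one 16-bit half of a 32-bit word (lane 0 = low, lane 1 = high)
// and returns the previous 16-bit value. The other half is left untouched.
inline std::uint16_t atomic_add_u16(std::atomic<std::uint32_t>& word, int lane, std::uint16_t value)
{
    assert(lane == 0 || lane == 1);
    const unsigned shift   = static_cast<unsigned>(lane) * 16u;
    std::uint32_t expected = word.load();
    for(;;)
    {
        const auto old16 = static_cast<std::uint16_t>(expected >> shift);
        const auto new16 = static_cast<std::uint16_t>(old16 + value); // wraps modulo 2^16
        const std::uint32_t desired =
            (expected & ~(0xFFFFu << shift)) | (static_cast<std::uint32_t>(new16) << shift);
        // On failure, compare_exchange_weak reloads `expected`, so the loop
        // recomputes `desired` from the freshly observed word.
        if(word.compare_exchange_weak(expected, desired))
            return old16;
    }
}
```

An fp16 variant would follow the same loop, converting the 16-bit pattern to a floating-point value before the add and back afterwards.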
Andriy Roshchenko
054f85ab7c
MX GEMM - FP6 Example ( #2419 )
...
Adds support for MX FP6 data type in MX GEMM block pipeline version v1.
Provides an example of MX FP6 GEMM algorithm.
---------
Co-authored-by: OscarXu <huaiguxu@amd.com >
Co-authored-by: aska-0096 <haocwang@amd.com >
Co-authored-by: mtgu0705 <mtgu@amd.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com >
Co-authored-by: valarLip <340077269@qq.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
Co-authored-by: feifei14119 <feiw@amd.com >
Co-authored-by: Lin, Qun <qlin@amd.com >
Co-authored-by: joye <joye@amd.com >
2025-07-07 10:33:26 -06:00
Mingtao Gu
7998ae8969
[CK] Mxfp4 moe blockscale buf2lds version support ( #2455 )
...
* change cshuffle size
* added mxfp4 moe async buffer loading without B preshuffle
* added mx moe B shuffling + scale shuffling (async loads)
* minor fix
---------
Co-authored-by: mtgu0705 <mtgu@amd.com >
2025-07-06 15:42:00 +08:00
Adam Osewski
3d70c638d1
Always force output clearing for grouped conv bwd data ( #2446 )
...
* Always force output clearing
* don't run set zero for residual
---------
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com >
2025-07-04 07:49:52 -06:00
Vidyasagar Ananthan
2e971eff90
Removing reference to undefined parameter for ignore statement. ( #2447 )
2025-07-03 20:10:29 -07:00
damien-lejeune
1183824573
Fix clang in ck develop branch ( #2445 )
...
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com >
2025-07-02 10:07:47 -06:00
chenjun
74a34e0f50
fix KPerBlock = 64 a8w8 bpreshuffle gemm build failure on gfx950 ( #2437 )
...
Co-authored-by: valarLip <340077269@qq.com >
2025-07-02 19:12:07 +08:00
Gino Lu
60eb70f543
Fix return value bug that drops minus sign in some cases. ( #2415 )
...
* fix return value bug.
* refine change according to comment.
2025-07-02 14:53:00 +08:00