Commit Graph

671 Commits

Author SHA1 Message Date
jefyang1
8b37a2968f Fix pk i4 v3 example test regression on gfx942 (#2706)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 6ba9289b26]
2025-08-19 09:58:28 -07:00
mirchen-amd
020c9139ec Mirchen/gemm blockscale wp segfault fix (#2638)
* Add stride validation to prevent segfault in blockscale GEMM

* run clang-format

* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* added stride length checking to more gemm examples in ckprofiler

* ran clang format

* added validation header and implement in core gemm operations

* remove ck_tile transpose and gemm stages from CI (#2646)

* update CK build instruction step 4 (#2563)

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

* Fixes to  "General 2D Reduction Kernel" (#2535) (#2656)

* fix reduce2d

- revret the combine_partial_results() chnages
- remove auto from function def

* clang-format

* enable aiter test_mha in daily CI (#2659)

* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582)

* feat(copy_kernel): add basic copy kernel example with documentation

* docs(CHANGELOG): Updated changelog

* chore: performed clang format

* Update example/ck_tile/39_copy/copy_basic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* fix(terminology): follow amd terms

* extract elementwise copy to a new kernel

* fix(copy_kernel): bug in verification

* add comments about vgpr usage

* lint and nits

* add notes and comments

* print hostTensor via stream

* print hostTensor via stream

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)

* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp

* Fix num_byte calculations to use nhead_k for K & V size (#2653)

Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.

Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```

After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```

* [CK_TILE] FMHA BWD Decode Pipeline (#2643)

* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2

* fix (#2668)

* Optimize fmha fwd decode & prefill for gfx950 (#2641)

* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)

This reverts commit 747d127983.

* added batch stride checking to batched gemm ops in profiler

* removed batch stride validation

* removed batched stride validation again

* Update include/ck/library/utility/profiler_validation_common.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* refactor function names

* added gemm stride checking to more profiler gemm operations

* run clang format

* add stride checkign to 01 gemm example

* rename from profiler to validation common, used for examples and profiler

* build of ckProfiler success

* update file headers

---------

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com>
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 60320e90c1]
2025-08-19 01:19:17 -07:00
Sami Remes
13bfcba04c Add other layouts for FP8 block scaled gemm (#2665)
* Start adding other layouts for gemm_ab_scale

* Add some instances

* Create tensor descriptors for A/B scales depending on A/B layout

* Fix formatting

* Revert some comments

* Revert commented instances in CMakeLists.txt

* Add some more instances for col-row gemm

* enable more row,row instances

* Use occupancy=1 for col,row layout to avoid spills

[ROCm/composable_kernel commit: 26d3300930]
2025-08-18 01:46:10 -07:00
jefyang1
1a412578d8 Add gemm universal f8 f8 bf16 instances on gfx950 (#2662)
[ROCm/composable_kernel commit: d7c95dd491]
2025-08-14 13:25:24 -07:00
Enrico Degregori
ec1c249b63 Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2675)
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: a6f4029276]
2025-08-14 00:21:09 +02:00
Enrico Degregori
bdafbd7ca1 Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)" (#2637)
This reverts commit 863c87b1e2.

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 21e9983913]
2025-08-07 12:30:08 +02:00
Bartłomiej Kocot
14eef94b28 Grouped Convolution Forward Infer Bias Bnorm Activ (#2621)
* Grouped Convolution Forward Infer Bias Bnorm Activ

* 3d

[ROCm/composable_kernel commit: 5328b232b2]
2025-08-07 08:36:47 +02:00
Enrico Degregori
863c87b1e2 Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)
* Add padding 1x1Stride1Pad0 conv specialization

* Add gridwise checks for conv cshufflev3

* Merge padding with previous transforms

* Apply transform changes for padding to default specialization as well

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 2203b0ddfe]
2025-08-05 15:23:19 +02:00
Illia Silin
ee2bb3f20b update the switch condition for buffer built-ins (#2602)
[ROCm/composable_kernel commit: 788e8a878e]
2025-08-01 14:30:07 -07:00
lalala-sh
3318b0e25d fix weight index out of range (#2414)
[ROCm/composable_kernel commit: bb5c478295]
2025-08-01 17:50:02 +08:00
Ville Pietilä
f6f07467f4 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeducted split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>

[ROCm/composable_kernel commit: e962a41638]
2025-07-31 12:08:45 +02:00
Bartłomiej Kocot
ccc74013b1 Enable multiple D for grouped conv fwd large tensors (#2572)
[ROCm/composable_kernel commit: 5b244105d9]
2025-07-28 22:39:07 +02:00
linqunAMD
943db65a21 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: 0782ee8eb3]
2025-07-28 13:01:07 -07:00
Illia Silin
24f228df3b upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config

[ROCm/composable_kernel commit: 504b101da3]
2025-07-28 11:34:07 -07:00
Bartłomiej Kocot
85c03cf088 Enable bf16 RNE on gfx950 (#2542)
* Enable bf16 RNE for gfx950

* test bhalf

* fix

* fix

* Comments fixes

* fixes

* clean

* fix

[ROCm/composable_kernel commit: 685771b875]
2025-07-28 00:47:17 +02:00
Adam Osewski
5ccd4c679f Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: c8eb2f995c]
2025-07-25 10:34:31 +02:00
Enrico Degregori
cde0c4d224 Support b_scale: (#2350)
- extend pipeline v1 and v3
 - add instances
 - add tests
 - add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: b01a27ff22]
2025-07-24 18:49:58 -07:00
Illia Silin
6d77f529c7 remove repetitive code (#2562)
[ROCm/composable_kernel commit: 9c04a55626]
2025-07-24 14:52:46 -07:00
Andriy Roshchenko
77b38a8e9d MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8



[ROCm/composable_kernel commit: 3421272f90]
2025-07-24 14:36:53 -04:00
Rostyslav Geyyer
c58afb0565 Update packed fp4 layout (#2523)
[ROCm/composable_kernel commit: c9886109b4]
2025-07-21 16:58:59 -05:00
Mingtao Gu
38b61f2f24 [CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512)
* added MPerBlock=32 for MXFP4 GEMM decode

* added two instance for M>128 scenario.

* added 1 instance

* format

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: 0198257d79]
2025-07-18 14:35:54 +08:00
linqunAMD
8319789fe1 [CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459)
1. Port NCHW support from ConvFwd (#2375) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw

Co-authored-by: azhuang <anzhong.huang@amd.com>

[ROCm/composable_kernel commit: fbd9f32abe]
2025-07-17 08:19:57 +08:00
linqunAMD
f77d704980 Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: 6e76b82059]
2025-07-16 07:58:23 -07:00
Illia Silin
3bf86334c9 replace obsolete warpSize system variable with the new one (#2496)
[ROCm/composable_kernel commit: a4bf78ac0e]
2025-07-16 07:39:15 -07:00
huaiguxu
73da271e03 Handle moe_fp8 no-mainloop cases. Supprese no-mainloop check (#2438)
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: c1badfd30c]
2025-07-16 15:44:34 +08:00
Andriy Roshchenko
259055e53f MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8



[ROCm/composable_kernel commit: 518dc21ae8]
2025-07-11 13:07:05 -06:00
Illia Silin
9e654191ad Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short

* revrt back to atomic add using casting

[ROCm/composable_kernel commit: 1b66f3f4a3]
2025-07-10 07:18:56 -07:00
Illia Silin
0db55aad94 Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474)
This reverts commit b00bbe4b35.

[ROCm/composable_kernel commit: 93420ecf89]
2025-07-08 19:01:26 -07:00
Illia Silin
b00bbe4b35 Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short

[ROCm/composable_kernel commit: 112b47e885]
2025-07-08 18:09:30 -04:00
Andriy Roshchenko
67545a9d22 MX GEMM - FP6 Example (#2419)
Adds support for MX FP6 data type in MX GEMM block pipeline version v1.
Provides an example of MX FP6 GEMM algorithm.

---------

Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>

[ROCm/composable_kernel commit: 054f85ab7c]
2025-07-07 10:33:26 -06:00
Mingtao Gu
b0bf370fbd [CK] Mxfp4 moe blockscale buf2lds version support (#2455)
* change cshuffle size

* added mxfp4 moe async buffer loading without B preshuffle

* added mx moe B shuffling + scale shuffling (async loads)

* minor fix

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 7998ae8969]
2025-07-06 15:42:00 +08:00
Adam Osewski
6575a782c6 Always force output clearing for grouped conv bwd data (#2446)
* Always force output clearing

* dont run set zero for residual

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 3d70c638d1]
2025-07-04 07:49:52 -06:00
Vidyasagar Ananthan
fa57cec140 Removing reference to undefined parameter for ignore statement. (#2447)
[ROCm/composable_kernel commit: 2e971eff90]
2025-07-03 20:10:29 -07:00
damien-lejeune
2dbb2bcb33 Fix clang in ck develop branch (#2445)
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>

[ROCm/composable_kernel commit: 1183824573]
2025-07-02 10:07:47 -06:00
chenjun
577a80122f fix KPerBlock = 64 a8w8 bpreshulle gemm build fail in gfx950 (#2437)
Co-authored-by: valarLip <340077269@qq.com>

[ROCm/composable_kernel commit: 74a34e0f50]
2025-07-02 19:12:07 +08:00
Gino Lu
c9d043a1d1 Fix return value bug that drops minus sign in some cases. (#2415)
* fix return value bug.

* refine change according to comment.

[ROCm/composable_kernel commit: 60eb70f543]
2025-07-02 14:53:00 +08:00
huaiguxu
0ac91713ae Huaiguxu/moe fp8 pertoken scale fix (#2391)
* fix pertoken_scale a_scale dimension

* clang-format

* Fix moe_gemm2_fp8 perTokenScale reference and example.

[ROCm/composable_kernel commit: e1c5172fdb]
2025-06-27 10:24:34 +08:00
linqunAMD
c7c24bb10d [CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375)
1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it.
2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance.
3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw.


[ROCm/composable_kernel commit: 1749c0409e]
2025-06-26 08:32:39 +08:00
Rostyslav Geyyer
9e0bfd3dbb Enable fp4 tests (#2329)
[ROCm/composable_kernel commit: daf71fb8e4]
2025-06-25 07:38:54 -05:00
Xiao Li
b3b4aa8d57 Fix amd_ck_fp8.hpp macro definitions (#2325)
* Fix amd_ck_fp8.hpp macro definitions

1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before.
2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace
   fp8_impl to avoid redefined error when building with rocm 6.4+
   (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h)


Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>

[ROCm/composable_kernel commit: bac51b6ec0]
2025-06-24 22:46:15 -06:00
Anton Gorenko
e156b5aebb Improve fmha_bwd tests performance (#2376)
* Avoid passing indices (std::vector) by value to host tensor's operator()

Each access requires 2 allocations and copies of the vector.

* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification

* Compute ds_hp_host_ref in parallel

This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.

* Do not use ForEach for simple copy and conversion of large tensors

These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.

[ROCm/composable_kernel commit: 77123600ee]
2025-06-24 07:45:24 -07:00
Kiefer van Teutem
eb4b7c65ff Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319)
* Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty.

* Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good.

* Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet.

* Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3.  In principle the whole thing works now, just need to add other data types and perhaps do some cleanup.

* Add other layouts for batched gemm wmma chufflev3 f16 f16 f16. Now matching XDL (for f16).

* Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts.

* Fixup comments and TODOs

* Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small.

* Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests.

* Take batching into account when using rotating memory or clearing the C tensor.

* Implement small refactors / comments etc. from review.

* Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists.

* Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems.

* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups.

* Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function.

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: 9e74ae7c89]
2025-06-24 07:28:13 -07:00
lalala-sh
b6c780fc7f fix moe i4 bug from aiter (#2339)
[ROCm/composable_kernel commit: bb571a0330]
2025-06-24 14:51:29 +08:00
Rostyslav Geyyer
de3cfbab9a Add accelerated stochastic rounding on gfx950 (#2355)
* Add native prand generation support for gfx950

* Update seed calculation

[ROCm/composable_kernel commit: dbfe70e72a]
2025-06-23 09:31:46 -05:00
John Shumway
7c57c4f045 Shard several of the most costly targets. (#2373)
* Shard several of the most costly targets.

Introduces a filter_tuple_by_modulo to break up tuples.

Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.

time ninja -j 64 device_grouped_conv3d_fwd_instance

* fix clang format

* Fix build errors in instantiation code.

I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:

ninja -j 128 test_grouped_convnd_fwd

This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.

I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.

* Fix link errors from mismatched declarations.

Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Shard the longest 2D convolution builds

Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.

* Use PROJECT_SOURCE_DIR for submodule compatibility

I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules.  Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Remove accidental copy of a file

* Remove accidental copies of template files.

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 47ae4b0955]
2025-06-23 07:24:36 -07:00
Illia Silin
073bb8d588 Revert "Shard several of the most costly targets. (#2266)" (#2361)
This reverts commit c1285aaada.

[ROCm/composable_kernel commit: cdfd7722bf]
2025-06-17 13:56:30 -07:00
Bartłomiej Kocot
d9316dfbeb Fix Add in dynamic buffer for fp32/i8 (#2351)
* Fix Add in dynamic buffer for fp32/i8

* fixes

* Fix

[ROCm/composable_kernel commit: cc98a41f46]
2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi
bde406245a Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 4c57157d50]
2025-06-17 11:54:30 -07:00
Bartłomiej Kocot
4ae33b454f Grouped convolution forward with clamp (#2334)
* Grouped convolution forward with clamp

* Optimize clamp

* unary fixes

* test gk bias

* Revert "test gk bias"

This reverts commit 8e42e29d7b.

* Revert "Revert "test gk bias""

This reverts commit e73c0550ce.

* workaround comment

[ROCm/composable_kernel commit: f6c2ff9dce]
2025-06-16 15:36:53 +02:00
Illia Silin
7eaa398458 Fix direct lds load for gfx950 and clang20 (#2346)
* fix direct lds load for gfx950 and clang20

* Update include/ck/utility/amd_buffer_addressing_builtins.hpp

* Fix format

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>

[ROCm/composable_kernel commit: 2d8a804152]
2025-06-15 15:22:34 -07:00