Commit Graph

821 Commits

Author SHA1 Message Date
jefyang1
6ba9289b26 Fix pk i4 v3 example test regression on gfx942 (#2706)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-08-19 09:58:28 -07:00
mirchen-amd
60320e90c1 Mirchen/gemm blockscale wp segfault fix (#2638)
* Add stride validation to prevent segfault in blockscale GEMM

* run clang-format

* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* added stride length checking to more gemm examples in ckprofiler

* ran clang format

* added validation header and implement in core gemm operations

* remove ck_tile transpose and gemm stages from CI (#2646)

* update CK build instruction step 4 (#2563)

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

* Fixes to  "General 2D Reduction Kernel" (#2535) (#2656)

* fix reduce2d

- revret the combine_partial_results() chnages
- remove auto from function def

* clang-format

* enable aiter test_mha in daily CI (#2659)

* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582)

* feat(copy_kernel): add basic copy kernel example with documentation

* docs(CHANGELOG): Updated changelog

* chore: performed clang format

* Update example/ck_tile/39_copy/copy_basic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* fix(terminology): follow amd terms

* extract elementwise copy to a new kernel

* fix(copy_kernel): bug in verification

* add comments about vgpr usage

* lint and nits

* add notes and comments

* print hostTensor via stream

* print hostTensor via stream

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)

* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp

* Fix num_byte calculations to use nhead_k for K & V size (#2653)

Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.

Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```

After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```

* [CK_TILE] FMHA BWD Decode Pipeline (#2643)

* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2

* fix (#2668)

* Optimize fmha fwd decode & prefill for gfx950 (#2641)

* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)

This reverts commit b7322a521a.

* added batch stride checking to batched gemm ops in profiler

* removed batch stride validation

* removed batched stride validation again

* Update include/ck/library/utility/profiler_validation_common.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* refactor function names

* added gemm stride checking to more profiler gemm operations

* run clang format

* add stride checkign to 01 gemm example

* rename from profiler to validation common, used for examples and profiler

* build of ckProfiler success

* update file headers

---------

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com>
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-08-19 01:19:17 -07:00
Sami Remes
26d3300930 Add other layouts for FP8 block scaled gemm (#2665)
* Start adding other layouts for gemm_ab_scale

* Add some instances

* Create tensor descriptors for A/B scales depending on A/B layout

* Fix formatting

* Revert some comments

* Revert commented instances in CMakeLists.txt

* Add some more instances for col-row gemm

* enable more row,row instances

* Use occupancy=1 for col,row layout to avoid spills
2025-08-18 01:46:10 -07:00
jefyang1
d7c95dd491 Add gemm universal f8 f8 bf16 instances on gfx950 (#2662) 2025-08-14 13:25:24 -07:00
Enrico Degregori
a6f4029276 Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2675)
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-08-14 00:21:09 +02:00
Enrico Degregori
21e9983913 Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)" (#2637)
This reverts commit 2203b0ddfe.

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-08-07 12:30:08 +02:00
Bartłomiej Kocot
5328b232b2 Grouped Convolution Forward Infer Bias Bnorm Activ (#2621)
* Grouped Convolution Forward Infer Bias Bnorm Activ

* 3d
2025-08-07 08:36:47 +02:00
Enrico Degregori
2203b0ddfe Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)
* Add padding 1x1Stride1Pad0 conv specialization

* Add gridwise checks for conv cshufflev3

* Merge padding with previous transforms

* Apply transform changes for padding to default specialization as well

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-08-05 15:23:19 +02:00
Illia Silin
788e8a878e update the switch condition for buffer built-ins (#2602) 2025-08-01 14:30:07 -07:00
lalala-sh
bb5c478295 fix weight index out of range (#2414) 2025-08-01 17:50:02 +08:00
Ville Pietilä
e962a41638 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeducted split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>
2025-07-31 12:08:45 +02:00
Bartłomiej Kocot
5b244105d9 Enable multiple D for grouped conv fwd large tensors (#2572) 2025-07-28 22:39:07 +02:00
linqunAMD
0782ee8eb3 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-28 13:01:07 -07:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
Bartłomiej Kocot
685771b875 Enable bf16 RNE on gfx950 (#2542)
* Enable bf16 RNE for gfx950

* test bhalf

* fix

* fix

* Comments fixes

* fixes

* clean

* fix
2025-07-28 00:47:17 +02:00
Adam Osewski
c8eb2f995c Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-07-25 10:34:31 +02:00
Enrico Degregori
b01a27ff22 Support b_scale: (#2350)
- extend pipeline v1 and v3
 - add instances
 - add tests
 - add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-07-24 18:49:58 -07:00
Illia Silin
9c04a55626 remove repetitive code (#2562) 2025-07-24 14:52:46 -07:00
Andriy Roshchenko
3421272f90 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8
2025-07-24 14:36:53 -04:00
Rostyslav Geyyer
c9886109b4 Update packed fp4 layout (#2523) 2025-07-21 16:58:59 -05:00
Mingtao Gu
0198257d79 [CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512)
* added MPerBlock=32 for MXFP4 GEMM decode

* added two instance for M>128 scenario.

* added 1 instance

* format

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-07-18 14:35:54 +08:00
linqunAMD
fbd9f32abe [CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459)
1. Port NCHW support from ConvFwd (#2375) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw

Co-authored-by: azhuang <anzhong.huang@amd.com>
2025-07-17 08:19:57 +08:00
linqunAMD
6e76b82059 Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-16 07:58:23 -07:00
Illia Silin
a4bf78ac0e replace obsolete warpSize system variable with the new one (#2496) 2025-07-16 07:39:15 -07:00
huaiguxu
c1badfd30c Handle moe_fp8 no-mainloop cases. Supprese no-mainloop check (#2438)
Co-authored-by: felix <felix.li@amd.com>
2025-07-16 15:44:34 +08:00
Andriy Roshchenko
518dc21ae8 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8
2025-07-11 13:07:05 -06:00
Illia Silin
1b66f3f4a3 Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short

* revrt back to atomic add using casting
2025-07-10 07:18:56 -07:00
Illia Silin
93420ecf89 Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474)
This reverts commit 112b47e885.
2025-07-08 19:01:26 -07:00
Illia Silin
112b47e885 Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short
2025-07-08 18:09:30 -04:00
Andriy Roshchenko
054f85ab7c MX GEMM - FP6 Example (#2419)
Adds support for MX FP6 data type in MX GEMM block pipeline version v1.
Provides an example of MX FP6 GEMM algorithm.

---------

Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
2025-07-07 10:33:26 -06:00
Mingtao Gu
7998ae8969 [CK] Mxfp4 moe blockscale buf2lds version support (#2455)
* change cshuffle size

* added mxfp4 moe async buffer loading without B preshuffle

* added mx moe B shuffling + scale shuffling (async loads)

* minor fix

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-07-06 15:42:00 +08:00
Adam Osewski
3d70c638d1 Always force output clearing for grouped conv bwd data (#2446)
* Always force output clearing

* dont run set zero for residual

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-07-04 07:49:52 -06:00
Vidyasagar Ananthan
2e971eff90 Removing reference to undefined parameter for ignore statement. (#2447) 2025-07-03 20:10:29 -07:00
damien-lejeune
1183824573 Fix clang in ck develop branch (#2445)
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2025-07-02 10:07:47 -06:00
chenjun
74a34e0f50 fix KPerBlock = 64 a8w8 bpreshulle gemm build fail in gfx950 (#2437)
Co-authored-by: valarLip <340077269@qq.com>
2025-07-02 19:12:07 +08:00
Gino Lu
60eb70f543 Fix return value bug that drops minus sign in some cases. (#2415)
* fix return value bug.

* refine change according to comment.
2025-07-02 14:53:00 +08:00
huaiguxu
e1c5172fdb Huaiguxu/moe fp8 pertoken scale fix (#2391)
* fix pertoken_scale a_scale dimension

* clang-format

* Fix moe_gemm2_fp8 perTokenScale reference and example.
2025-06-27 10:24:34 +08:00
linqunAMD
1749c0409e [CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375)
1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it.
2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance.
3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw.
2025-06-26 08:32:39 +08:00
Rostyslav Geyyer
daf71fb8e4 Enable fp4 tests (#2329) 2025-06-25 07:38:54 -05:00
Xiao Li
bac51b6ec0 Fix amd_ck_fp8.hpp macro definitions (#2325)
* Fix amd_ck_fp8.hpp macro definitions

1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before.
2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace
   fp8_impl to avoid redefined error when building with rocm 6.4+
   (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h)


Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
2025-06-24 22:46:15 -06:00
Anton Gorenko
77123600ee Improve fmha_bwd tests performance (#2376)
* Avoid passing indices (std::vector) by value to host tensor's operator()

Each access requires 2 allocations and copies of the vector.

* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification

* Compute ds_hp_host_ref in parallel

This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.

* Do not use ForEach for simple copy and conversion of large tensors

These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
2025-06-24 07:45:24 -07:00
Kiefer van Teutem
9e74ae7c89 Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319)
* Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty.

* Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good.

* Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet.

* Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3.  In principle the whole thing works now, just need to add other data types and perhaps do some cleanup.

* Add other layouts for batched gemm wmma chufflev3 f16 f16 f16. Now matching XDL (for f16).

* Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts.

* Fixup comments and TODOs

* Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small.

* Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests.

* Take batching into account when using rotating memory or clearing the C tensor.

* Implement small refactors / comments etc. from review.

* Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists.

* Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems.

* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups.

* Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function.

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2025-06-24 07:28:13 -07:00
lalala-sh
bb571a0330 fix moe i4 bug from aiter (#2339) 2025-06-24 14:51:29 +08:00
Rostyslav Geyyer
dbfe70e72a Add accelerated stochastic rounding on gfx950 (#2355)
* Add native prand generation support for gfx950

* Update seed calculation
2025-06-23 09:31:46 -05:00
John Shumway
47ae4b0955 Shard several of the most costly targets. (#2373)
* Shard several of the most costly targets.

Introduces a filter_tuple_by_modulo to break up tuples.

Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.

time ninja -j 64 device_grouped_conv3d_fwd_instance

* fix clang format

* Fix build errors in instantiation code.

I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:

ninja -j 128 test_grouped_convnd_fwd

This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.

I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.

* Fix link errors from mismatched declarations.

Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Shard the longest 2D convolution builds

Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.

* Use PROJECT_SOURCE_DIR for submodule compatibility

I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules.  Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Remove accidental copy of a file

* Remove accidental copies of template files.

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-06-23 07:24:36 -07:00
Illia Silin
cdfd7722bf Revert "Shard several of the most costly targets. (#2266)" (#2361)
This reverts commit 3a0cb27966.
2025-06-17 13:56:30 -07:00
Bartłomiej Kocot
cc98a41f46 Fix Add in dynamic buffer for fp32/i8 (#2351)
* Fix Add in dynamic buffer for fp32/i8

* fixes

* Fix
2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi
4c57157d50 Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-06-17 11:54:30 -07:00
Bartłomiej Kocot
f6c2ff9dce Grouped convolution forward with clamp (#2334)
* Grouped convolution forward with clamp

* Optimize clamp

* unary fixes

* test gk bias

* Revert "test gk bias"

This reverts commit 8e42e29d7b.

* Revert "Revert "test gk bias""

This reverts commit e73c0550ce.

* workaround comment
2025-06-16 15:36:53 +02:00
Illia Silin
2d8a804152 Fix direct lds load for gfx950 and clang20 (#2346)
* fix direct lds load for gfx950 and clang20

* Update include/ck/utility/amd_buffer_addressing_builtins.hpp

* Fix format

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
2025-06-15 15:22:34 -07:00