Commit Graph

2153 Commits

Jinchao Xu
d2525548c6 Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611)
Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances
by splitting debug information into separate .dwo files.

Add explicit cast to avoid signed/unsigned comparison warning.
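
The `-gsplit-dwarf` flag moves debug info into separate `.dwo` files so the `.debug` sections in the objects stay small enough for 32-bit relocations. The cast mentioned above is the usual `-Wsign-compare` fix; a minimal sketch with hypothetical names (the actual change is in the grouped conv2d instances):

```cpp
#include <vector>

// Comparing a signed loop index against an unsigned container size triggers
// -Wsign-compare; an explicit cast makes the intended comparison unambiguous.
bool all_positive(const std::vector<int>& v)
{
    for(int i = 0; i < static_cast<int>(v.size()); ++i)
    {
        if(v[i] <= 0)
            return false;
    }
    return true;
}
```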

[ROCm/composable_kernel commit: 15eb493152]
2025-08-04 11:26:08 -07:00
Bartłomiej Kocot
c903850b65 Mark non-grouped convolutions instances as deprecated (#2595)
* Mark non-grouped convolutions instances as deprecated

* Update CHANGELOG.md

Co-authored-by: John Afaganis <john.afaganis@amd.com>

* Update library/src/tensor_operation_instance/gpu/conv1d_bwd_data/device_conv1d_bwd_data_xdl_nwc_kxc_nwk_bf16_instance.cpp

Co-authored-by: John Afaganis <john.afaganis@amd.com>

---------

Co-authored-by: John Afaganis <john.afaganis@amd.com>
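
A hedged sketch of how an instance factory can be flagged as deprecated in standard C++ (the names are illustrative; the exact mechanism used in #2595 may differ):

```cpp
#include <vector>

struct DeviceConvInstance; // placeholder for an instance base type

// [[deprecated]] emits a compiler warning at every call site, steering users
// toward the grouped-convolution replacements without breaking existing builds.
[[deprecated("Non-grouped convolution instances are deprecated; "
             "use the grouped convolution instances instead.")]]
void add_device_conv1d_bwd_data_instances(std::vector<DeviceConvInstance*>& instances);
```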

[ROCm/composable_kernel commit: 8655ba989c]
2025-08-04 16:49:55 +02:00
Max Podkorytov
b13f01345d remove std::format (#2604)
[ROCm/composable_kernel commit: 0d9439760f]
2025-08-01 19:22:07 -07:00
Illia Silin
8703165b46 remove std=c++17 compiler flag (#2603)
[ROCm/composable_kernel commit: b786d12e56]
2025-08-01 16:18:16 -07:00
Max Podkorytov
dd4a259904 [CK-tile] remove old ck-tile transpose test (#2591)
* remove old ck-tile transpose test

* rename test exe for consistency

* replace batched transpose regression test

[ROCm/composable_kernel commit: f36cb5b2aa]
2025-08-01 14:50:09 -07:00
Thomas Ning
aa2f9b4c73 Reduce build time tile engine (#2579)
* Modify CMakeLists to allow for splitting.

* Modify CMakeLists for data and layout logic.

* Run tests and get build artifact.

* Test new CMakeLists for speedup.

* Further improvements for speedup.

* turn off the FMHA

* turn off the automatic tile engine gemm

* minor fix

* disable the transpose test first

* Address the comment

* Jenkinsfile

* change the make thread count to 64

* change the compile thread count to 32

* Try to use less OS memory

* Set the Unity build batch size to 2

* reduce the chunk size

---------

Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>

[ROCm/composable_kernel commit: e5b79b26fa]
2025-08-01 14:42:33 -07:00
Illia Silin
6fc0a709dc update the switch condition for buffer built-ins (#2602)
[ROCm/composable_kernel commit: 788e8a878e]
2025-08-01 14:30:07 -07:00
Thomas Ning
9325ae50e8 Fix the GFX 950 Universal GEMM (#2597)
* solve the gfx950 error

* clang format

* fix a typo error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: 7c44a763fa]
2025-08-01 09:32:24 -07:00
Illia Silin
6c5d1d39b8 Add a daily CI stage to test AITER with latest CK. (#2598)
* add a CI stage for AITER testing

[ROCm/composable_kernel commit: e6104daecc]
2025-08-01 07:55:51 -07:00
lalala-sh
def677fbc7 fix weight index out of range (#2414)
[ROCm/composable_kernel commit: bb5c478295]
2025-08-01 17:50:02 +08:00
Aviral Goel
4f9a78fbf0 Integration of a new pipeline for weight preshuffle into gemm examples (#2516)
* something khushbu can help with

* v1 v2 works with flatmm develop

* v0 v1 v2 numerical error gone

* Fixing numerical error, and interchange preshuffle configs to match with flatmm

* Refactor GEMM pipeline configurations and integrate preshuffle support

- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments

* clang format

* fix vectorloadsize bug

* add the Preshuffle3

* update kwarp calculation in gemm utils

* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch

* fix: add CK_GFX950_SUPPORT macro for gfx950 detection

* default disable rotating buffer

* docs(CHANGELOG): update changelog for rocm 7.0

* Revert "docs(CHANGELOG): update changelog for rocm 7.0"

This reverts commit 2bc16fff84.

* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.

* revert example/ck_tile/flatmm to its original state

* remove comment added by second author

* switch to xor ALDSDescriptor

* modify the MakeALdsDescriptor()

* temporary profiling script

* getting rid of line marker compiler error

* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy

* add a minor fix for the config

* typo fix

* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2

* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp

* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp

* reenable the GemmSpatiallyLocalTilePartitioner

* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline

* remove hardcoded true for preshuffle bool template argument

* rename script

* remove gemm_profilie.sh script

* merge conflict resolve

* clang formatted

* typo fix

* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp

* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy

* Fix missing newline at end of file in run_gemm_example.inc

* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1

* addressing review comments

* removing debug code

* addressing review comments

* Revert "addressing review comments"

This reverts commit 29c45192ba.

* updating tile_engine code

* addressing review comments

---------

Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
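
Weight preshuffle rearranges the B (weight) matrix offline into the tile order the GEMM pipeline consumes, so the kernel can load it with wide contiguous reads. A simplified host-side sketch of the idea (tile sizes and layout are illustrative, not the pipeline's actual layout):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Repack a row-major K x N weight matrix into [N/tileN][K/tileK][tileK][tileN]
// blocks so each K-tile of a column block is contiguous in memory.
std::vector<float> preshuffle_b(const std::vector<float>& b,
                                int64_t K, int64_t N,
                                int64_t tileK, int64_t tileN)
{
    std::vector<float> out(static_cast<std::size_t>(K * N));
    int64_t idx = 0;
    for(int64_t n0 = 0; n0 < N; n0 += tileN)
        for(int64_t k0 = 0; k0 < K; k0 += tileK)
            for(int64_t k = k0; k < k0 + tileK; ++k)
                for(int64_t n = n0; n < n0 + tileN; ++n)
                    out[idx++] = b[k * N + n]; // assumes tileK | K and tileN | N
    return out;
}
```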

[ROCm/composable_kernel commit: 1441a0a7ee]
2025-08-01 00:04:54 -07:00
Khushbu Agarwal
464c6f459e [CK_Tile] Updating gpu timer when doing flush cache (#2593)
* Missed updating function names in example

* updating timer

* code cleanup

* addressing review comments

* updating tile_engine code

* addressing review comments

[ROCm/composable_kernel commit: 88d72178d6]
2025-07-31 16:43:33 -07:00
Aviral Goel
8bea96ab32 Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt

* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt

* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

---------

Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>

[ROCm/composable_kernel commit: 546ef78d1d]
2025-07-31 12:18:02 -07:00
Ville Pietilä
e91e8e7908 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeduced split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>
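
The idea behind the auto-deduction, sketched as a rough "best occupancy" heuristic (not the exact formula used in #2491): if the base grid does not fill the device, split the K dimension until every compute unit has work.

```cpp
#include <algorithm>
#include <cstdint>

// Pick a split-K factor that raises the workgroup count to roughly the number
// of compute units; hypothetical helper, illustrative only.
int64_t deduce_split_k(int64_t grid_size, int64_t num_compute_units, int64_t k_iters)
{
    if(grid_size >= num_compute_units)
        return 1; // the grid already saturates the device

    const int64_t split_k = num_compute_units / grid_size;
    // Never split finer than the number of K iterations actually available.
    return std::clamp<int64_t>(split_k, 1, k_iters);
}
```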

[ROCm/composable_kernel commit: e962a41638]
2025-07-31 12:08:45 +02:00
Anton Gorenko
2ef590ab43 [CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571)
* Add tests for host conversion f32/f16 to f8

* Add tests for host conversion from f8 to f32/f16

* Fix UB and corner cases in f32/f16 to/from f8 conversion

* There are UBs when very small values are converted to f8: bit shifts
  can be larger than the type width. Using unsigned long long does not help
  because exponent_diff >= 64 in such cases. As a result, values like
  2.117582368e-22 are converted to non-zero f8 in host validation of FMHA
  tests, and test_f8 crashes with a segfault in completely unrelated code
  (e.g. GTest internals) or produces non-deterministic results.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
  conversion.

* Replace some magic numbers with values from numeric_traits

* Build tests only on devices supporting the type
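
The UB in question is `mantissa >> exponent_diff` where `exponent_diff` can exceed the width of the integer type, which C++ leaves undefined. A minimal sketch of the guard pattern (simplified; the real conversion also handles rounding and denormals):

```cpp
#include <cstdint>

// Shifting by >= the bit width of the type is undefined behavior in C++.
// Guarding the shift amount makes tiny inputs flush deterministically to zero.
uint32_t shift_mantissa(uint32_t mantissa, int exponent_diff)
{
    if(exponent_diff >= 32) // would be UB: shift count >= type width
        return 0;
    return mantissa >> exponent_diff;
}
```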

[ROCm/composable_kernel commit: 7b074249f4]
2025-07-31 09:54:17 +05:00
Illia Silin
64f8c28b42 upgrade clang-format version in install_precommit.sh (#2589)
[ROCm/composable_kernel commit: e8709c24f4]
2025-07-30 08:02:25 -07:00
Max Podkorytov
e18a5f6859 [CK-tile] add gtest for ck-tile batched transpose kernels (#2585)
* add a dummy test file

* add kernel launch logic to the test

* transfer all test cases into gtest params

* factor kernel out into test config

* add load transpose pipeline tests

* add padded tests and skip invalid kernels at runtime

* enum class for pipeline type

* add multiwarp test cases

* fix type

* try to solve the problem

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: de0cdb4c31]
2025-07-30 07:31:05 -07:00
Gino Lu
80a9d720ca add constexpr to pk_fp4::pack/unpack() (#2586)
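
Marking pack/unpack `constexpr` lets packed-fp4 constants be formed at compile time. A hedged sketch of the nibble-packing idea (the actual pk_fp4 API may differ):

```cpp
#include <cstdint>

// Two 4-bit values share one byte: `lo` in bits 0-3, `hi` in bits 4-7.
constexpr uint8_t pack_fp4(uint8_t lo, uint8_t hi)
{
    return static_cast<uint8_t>((lo & 0xF) | ((hi & 0xF) << 4));
}

constexpr uint8_t unpack_fp4(uint8_t packed, int index) // index 0 = lo, 1 = hi
{
    return static_cast<uint8_t>((packed >> (index * 4)) & 0xF);
}

static_assert(unpack_fp4(pack_fp4(0x3, 0xA), 1) == 0xA, "round-trips at compile time");
```
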
[ROCm/composable_kernel commit: b25d512e8a]
2025-07-30 10:29:04 -04:00
Khushbu Agarwal
7af5e043b1 Update to gpu_timer for rotating_buffer (#2524)
* update gpu_timer for rotating buffer as hipblasLt's implementation

* timing fix

* Updating gpu timer for old ck as well

* Revert "Updating gpu timer for old ck as well"

This reverts commit 958cd1bc99.

* code clean up with runtime argument; function rename

* code cleanup

* general timer fixes

* bug fix

* clang formatted

* addressing review comments

* clang formatted

* Addressing review comments

* CI fix

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
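
The rotating-buffer technique behind this change: cycle through several copies of the input between timed launches so each launch reads cold memory rather than a warm cache, while hipEvent timestamps bracket only the kernel work. A condensed HIP sketch (buffer count and launcher are placeholders):

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Time `launches` kernel runs, rotating through pre-allocated input copies so
// no launch benefits from data cached by the previous one.
float time_with_rotating_buffers(const std::vector<float*>& input_copies,
                                 float* output, int launches,
                                 void (*launch_kernel)(const float*, float*))
{
    hipEvent_t start, stop;
    (void)hipEventCreate(&start);
    (void)hipEventCreate(&stop);

    (void)hipEventRecord(start, nullptr);
    for(int i = 0; i < launches; ++i)
        launch_kernel(input_copies[i % static_cast<int>(input_copies.size())], output);
    (void)hipEventRecord(stop, nullptr);
    (void)hipEventSynchronize(stop);

    float ms = 0.f;
    (void)hipEventElapsedTime(&ms, start, stop);
    (void)hipEventDestroy(start);
    (void)hipEventDestroy(stop);
    return ms / launches; // average per-launch time
}
```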

[ROCm/composable_kernel commit: 61e21f5567]
2025-07-29 15:21:05 -07:00
Illia Silin
c810dda70e Revert "Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558)" (#2584)
This reverts commit a111f65475de7974fe89070963d21a2741341ef7.

[ROCm/composable_kernel commit: b80099cc5f]
2025-07-29 13:04:51 -07:00
Thomas Ning
69b117ae73 Expand the bandwidth of direct_global_to_lds for gfx950 (#2576)
* Expand the bandwidth of direct_global_to_lds for gfx950

* clang-format

* fix the remod.py and script for clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: 9d4b494f07]
2025-07-28 23:56:53 -07:00
rocking
0fb487ad63 set default optdim (#2580)
[ROCm/composable_kernel commit: 01642ca8b1]
2025-07-29 13:44:10 +08:00
Illia Silin
6704e5303f fix the clang-format (#2578)
[ROCm/composable_kernel commit: 49723e94bb]
2025-07-28 20:49:55 -07:00
Yi DING
a592107cb9 [CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
* Fix shuffle_tile

* Add fmha bwd d160

* CHANGELOG

* Use static_cast

* Update

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 1926cd0cb8]
2025-07-29 09:31:14 +08:00
Andres Lugo
53b36035f2 Remove filter for only batch on receipt 4 (#2574)
Re-enable group mode instances for the PyTorch receipt and resolve linker errors for torch SDPA

[ROCm/composable_kernel commit: 7fe50dc3da]
2025-07-28 14:53:24 -07:00
Bartłomiej Kocot
f25da17c36 Enable multiple D for grouped conv fwd large tensors (#2572)
[ROCm/composable_kernel commit: 5b244105d9]
2025-07-28 22:39:07 +02:00
linqunAMD
d3a1842669 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: 0782ee8eb3]
2025-07-28 13:01:07 -07:00
Illia Silin
3345f5f417 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config

[ROCm/composable_kernel commit: 504b101da3]
2025-07-28 11:34:07 -07:00
Illia Silin
61ff984dcd use ninja to build packages (#2575)
[ROCm/composable_kernel commit: 9786087010]
2025-07-28 11:04:12 -07:00
jefyang1
8d7d85be00 Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558)
[ROCm/composable_kernel commit: c64a0c65b9]
2025-07-28 09:03:54 -07:00
rocking
4c9de4fbb4 [CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539)
* uncomment all the headdim, use optdim to control

* change default back to -1

* uncomment splitkv instance

* Fix typo in receipt 4 for appendkv

* support optdim for bwd, splitkv and appendkv

* Fix 192 key error

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>

[ROCm/composable_kernel commit: b36e0b029f]
2025-07-28 17:16:32 +08:00
shay-li77
d143f1466f fix mha bwd dbias random mismatch (#2570)
* fix mha bwd dbias random mismatch

* formatting code

[ROCm/composable_kernel commit: 8ae528a1b4]
2025-07-28 14:39:31 +08:00
Bartłomiej Kocot
23ee2ddb5c Enable bf16 RNE on gfx950 (#2542)
* Enable bf16 RNE for gfx950

* test bhalf

* fix

* fix

* Comments fixes

* fixes

* clean

* fix
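
Round-to-nearest-even (RNE) f32→bf16 in software is the well-known bias trick; a generic sketch of what this PR enables on gfx950 (the in-tree conversion may use hardware instructions instead):

```cpp
#include <cstdint>
#include <cstring>

// Truncating the low 16 bits rounds toward zero; adding a bias of 0x7FFF plus
// the lowest kept bit implements round-to-nearest-even before truncation.
uint16_t float_to_bf16_rne(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    if((u & 0x7FFFFFFFu) > 0x7F800000u) // NaN: truncate but keep it a quiet NaN
        return static_cast<uint16_t>((u >> 16) | 0x0040u);
    const uint32_t rounding_bias = 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<uint16_t>((u + rounding_bias) >> 16);
}
```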

[ROCm/composable_kernel commit: 685771b875]
2025-07-28 00:47:17 +02:00
Gheorghe-Teodor Bercea
457e3c8b08 Refactor async loads to work on all GPUs (#2545)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: cbfa62e4b6]
2025-07-26 22:04:59 -07:00
Max Podkorytov
98bad686a9 [CK-Tile] Merge transpose examples (#2450)
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clamp

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

Issue:
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

Solution:
======
The problem occurred because config.time_kernel is false by default.
If it is false, there is no need to calculate perf; just print a proper message:

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

[ROCm/composable_kernel commit: 821cd26c13]
2025-07-26 21:51:54 -07:00
liang
d055927b83 reorder grid dim schedule (#2533)
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: d2459878cf]
2025-07-26 02:46:55 +08:00
Bartłomiej Kocot
69c8415c05 Fix clang format (#2567)
* clean

* clang format fix

[ROCm/composable_kernel commit: 5741edf761]
2025-07-25 09:54:34 -07:00
rahjain-amd
1e3243ff59 Fixing 0ms and inf GB/s issue in img2col (#2565)
Issue:
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

Solution:
======
The problem occurred because config.time_kernel is false by default.
If it is false, there is no need to calculate perf; just print a proper message:

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
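
The shape of the fix, sketched (names follow the ck_tile example convention but are illustrative; the output string is kept verbatim from the example):

```cpp
#include <cstdio>

// Reporting bandwidth from a 0 ms measurement divides by zero; only compute
// perf numbers when the kernel was actually timed.
void report(float avg_time_ms, double num_bytes, bool time_kernel)
{
    if(!time_kernel)
    {
        std::printf("image_to_coloumn: pass, No Perf generated due to config.time_kernel=0\n");
        return;
    }
    const double gb_per_sec = num_bytes / 1.0e6 / avg_time_ms; // bytes/ms -> GB/s
    std::printf("Perf: %f ms, %f GB/s\n", avg_time_ms, gb_per_sec);
}
```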

[ROCm/composable_kernel commit: 78082855d8]
2025-07-25 21:15:50 +05:30
Adam Osewski
63d239d406 Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clamp

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: c8eb2f995c]
2025-07-25 10:34:31 +02:00
Enrico Degregori
2d68b3f9c0 Support b_scale: (#2350)
- extend pipeline v1 and v3
 - add instances
 - add tests
 - add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
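
Functionally, a B-scale GEMM dequantizes blocks of the quantized B operand with a per-block scale during accumulation. A scalar reference sketch (data types, layout, and scale granularity are illustrative):

```cpp
#include <cstdint>
#include <vector>

// C[m][n] = sum_k A[m][k] * (B_q[k][n] * b_scale[k / scale_block_k][n]),
// with B quantized per (scale_block_k x 1) block along K; all row-major.
void gemm_b_scale_ref(const std::vector<float>& a, const std::vector<int8_t>& b_q,
                      const std::vector<float>& b_scale, std::vector<float>& c,
                      int M, int N, int K, int scale_block_k)
{
    for(int m = 0; m < M; ++m)
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(int k = 0; k < K; ++k)
                acc += a[m * K + k] * b_q[k * N + n] *
                       b_scale[(k / scale_block_k) * N + n];
            c[m * N + n] = acc;
        }
}
```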

[ROCm/composable_kernel commit: b01a27ff22]
2025-07-24 18:49:58 -07:00
Cong Ma
c721559117 [CK TILE] Apply CK_GFX950_SUPPORT macro on ck tile GEMM unit tests (#2560)
Cherry-pick c68687e30 and apply the CK_GFX950_SUPPORT macro to the ck_tile GEMM unit tests

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
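
A sketch of the guarding pattern (the GTest registration is illustrative; the macro name comes from the PR):

```cpp
#include <gtest/gtest.h>

// Compile the gfx950-only GEMM tests only when the build targets allow it, so
// builds for older architectures do not pull in unsupported instructions.
#if defined(CK_GFX950_SUPPORT)
TEST(CkTileGemm, Gfx950UniversalGemm)
{
    // ... run the gfx950-specific GEMM test body ...
    SUCCEED();
}
#endif
```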

[ROCm/composable_kernel commit: 2addf05b91]
2025-07-24 16:06:32 -07:00
Illia Silin
12f5978e20 remove repetitive code (#2562)
[ROCm/composable_kernel commit: 9c04a55626]
2025-07-24 14:52:46 -07:00
Aviral Goel
9097832e16 Revamp TERMINOLOGY.md (#2522)
* Add comprehensive terminology reference for Composable Kernel, including glossary, hardware and memory hierarchy, execution model, programming model, memory access, tile-based computing, kernel operations, and optimizations.

* Refine terminology in documentation for clarity and consistency.

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Enhance TERMINOLOGY.md by adding definitions for CUDA and HIP

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* address review comments

* refine pipeline and tile partitioner

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: 963dfa680b]
2025-07-24 12:38:24 -07:00
Mateusz Ozga
c3568357ca [CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520)
* Init commit new API

* apply clang-format

* PreShuffle preparing

* Apply Preshuffle condition to universal_gemm

* Fix: convert size_t to index_t

* Review changes

* Mode 100755 -> 100644

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: b507d889c1]
2025-07-24 20:39:56 +02:00
AviralGoelAMD
8d0acd764f docs(CHANGELOG): update changelog for rocm 7.0
[ROCm/composable_kernel commit: 1e84fdaca7]
2025-07-24 14:36:53 -04:00
Andriy Roshchenko
9395318666 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8

[ROCm/composable_kernel commit: 3421272f90]
2025-07-24 14:36:53 -04:00
Cong Ma
87dae7bde1 [CK_TILE] Disable moe_sorting unit test on gfx908 (#2555)
* [CK_TILE] Disable moe_sorting unit test on gfx908

- gfx908 does not support an instruction used in moe_sorting

* Update CMakeLists.txt

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: adeaf61ee5]
2025-07-24 09:32:06 -07:00
Yi DING
e397140640 Use filename rather than path to filter compilation (#2556)
[ROCm/composable_kernel commit: 4338346b10]
2025-07-24 17:38:14 +08:00
Yashvardhan Agarwal
094e5bad50 [CK_TILE] Support for elementwise kernel (#2246)
* Elementwise kernel implementation

Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>

* Elementwise with generalized nDims

* Adding the n-ary input tensor feature

* Generalize dimensions on top of inputs

* Add TFLOPS + remove std usage for tuples

* 1D basecase optimization

* Cleanup code + refactoring to a common interface

* Generalize to unary and add an example

* Cleanup, refactoring and commenting

* Suggestions for LWPCK-3170: elementwise kernel improvements

* Clang-format: remod.py

* Replace InputTensorType with XDataType as the type of input_tensors

* Add Tuple::apply and use it in ElementWiseKernel::operator() to call the operation with the exact number of arguments in xs

* Move examples to folder 19_elementwise

* Add missing copyright headers and fix some existing ones

* Replace an assert with throw std::runtime_error in elementwise example

* Avoid reading the output by using make_static_distributed_tensor for y_tile

* Removed two unused includes

* No need to move windows to the next block when each workgroup processes a single tile

* Only copy input tensors to the device

* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example

* Adding output strides to the kernel, transposition example and update the other examples

* Changes made by remod.py

* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp

* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference

* Fix comments in elementwise_example.cpp

- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors

* Simplify CMakeLists.txt and remove the unused variables this uncovers

* Rename a file and fix some copyright statements

* Changes made by script/clang-format-overwrite.sh

* Add basic unit test for ElementWiseKernel

* Remove left-over uninformative comment in apply unit test

* Changes made by clang-format-overwrite.sh

* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp

* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array

* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally

* Move examples to folder 20_elementwise

* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able to benchmark against old CK

* Fix Clang formatting

* Bump up the elementwise example folder number

* Elementwise: add padding + minor cleanup

* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view

* Add isSupportedArg to Elementwise kernel + adapt example and unit tests

* Fix clang-format on the unit test file

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
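
The Tuple::apply mechanism mentioned above expands a tuple of loaded inputs into exactly the arguments the elementwise operation expects. A standard index_sequence sketch of the idea (ck_tile's own Tuple differs in detail):

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Invoke `op` with each element of `xs` as a separate argument, so an n-ary
// elementwise functor receives n scalars rather than one tuple.
template <typename Op, typename Tuple, std::size_t... Is>
decltype(auto) tuple_apply_impl(Op&& op, Tuple&& xs, std::index_sequence<Is...>)
{
    return std::forward<Op>(op)(std::get<Is>(std::forward<Tuple>(xs))...);
}

template <typename Op, typename... Ts>
decltype(auto) tuple_apply(Op&& op, const std::tuple<Ts...>& xs)
{
    return tuple_apply_impl(std::forward<Op>(op), xs, std::index_sequence_for<Ts...>{});
}

// Usage: tuple_apply([](float a, float b) { return a + b; },
//                    std::make_tuple(1.f, 2.f)) returns 3.f
```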

[ROCm/composable_kernel commit: 606b0cc947]
2025-07-24 11:21:45 +02:00
jakpiase
bdb86fee78 [CK_TILE] Grouped Convolution Backward Weight Kernel (#2357)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* resolved conflicts

* rebase fixes

* fixes

* tmp

* add working support for splitk

* minor fix

* fixes

* fixes

* minor fix

* small fix

* Split K and preprocessing fixes

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 6681593864]
2025-07-24 10:41:35 +02:00