composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
chenjun	577a80122f	fix KPerBlock = 64 a8w8 bpreshulle gemm build fail in gfx950 (#2437 ) Co-authored-by: valarLip <340077269@qq.com> [ROCm/composable_kernel commit: `74a34e0f50`]	2025-07-02 19:12:07 +08:00
Gino Lu	c9d043a1d1	Fix return value bug that drops minus sign in some cases. (#2415 ) * fix return value bug. * refine change according to comment. [ROCm/composable_kernel commit: `60eb70f543`]	2025-07-02 14:53:00 +08:00
Aviral Goel	9f87368b4a	[ckProfiler] Add infrastructure and instances to profile gemm_universal with B preshuffle (#2427 ) * works on mi300 * fix(profiler): add error message for unsupported type/layout * refactor(preshuffle.inc): add type aliases for code readability [ROCm/composable_kernel commit: `36df1cbd0a`]	2025-07-01 18:34:52 -07:00
Thrupti Raj Lakshmana Gowda	783dc82c5e	Updating Runtime log for CK Tile Engine (#2431 ) * Updating runtime log message for CK TILE ENGINE * Fixing Clang Format * Update tile_engine/ops/gemm/README.md Co-authored-by: Aviral Goel <aviral.goel@amd.com> --------- Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> [ROCm/composable_kernel commit: `a03682cb80`]	2025-07-01 10:59:49 -07:00
Aviral Goel	26fd170c5b	Enhancements in precommit_install.sh for Python and CK Tile code (#2400 ) * fix(precommit_install): script now installs packages in virtual env * fix(precommit_install): installs packages in virtual env * feat(precommit): added ruff for python linting and formatting * feat(precommit): added ruff for python linting and formatting * feat(precommit): run ruff when py files are commited * feat(precommit): remod.py is run when ck_tile modified * add empty line at the end * style(precommit.yaml): remove empty line --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `e9036a8fc2`]	2025-07-01 01:11:10 -07:00
Vidyasagar Ananthan	ae7f6accfc	Fix an earlier static check error due to assignment of variable in Jenkinsfile (#2420 ) * Testing assignment of param fix * Removing redundant changes * Adding back unit test runs * Ensuring Jenkins changes work on develop - to be reverted * Revert "Ensuring Jenkins changes work on develop - to be reverted" This reverts commit `cf1cab4a43`. [ROCm/composable_kernel commit: `2fa9270a25`]	2025-06-28 07:07:14 -07:00
Thomas Ning	57fb6b0cc6	Revert "Enable builds on gfx942 by default and run all tests on develop branc…" (#2418 ) This reverts commit `b719fea21b`. [ROCm/composable_kernel commit: `28a63d7dcb`]	2025-06-27 16:40:10 -07:00
huaiguxu	0ac91713ae	Huaiguxu/moe fp8 pertoken scale fix (#2391 ) * fix pertoken_scale a_scale dimension * clang-format * Fix moe_gemm2_fp8 perTokenScale reference and example. [ROCm/composable_kernel commit: `e1c5172fdb`]	2025-06-27 10:24:34 +08:00
linqunAMD	c7c24bb10d	[CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375 ) 1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it. 2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance. 3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw. [ROCm/composable_kernel commit: `1749c0409e`]	2025-06-26 08:32:39 +08:00
Khushbu Agarwal	207baa02bb	Enabling diff datatypes for tile_engine and build with more granularity (#2392 ) * merging recent changes to universal gemm to tile_engine * Reducing Linking time by generating less intermediate files * make small libs to build faster * Reducing the instances * reducing instances * Restoring default config * Restoring default config * warp_n reverted in default config * Adding diff json files for fp8 and fp16, cmake changes for fp8 * Restructure the CMake File * Added more granularity for build and some debugging code * removed some of debugging statements * added fp8 instances * tahe datatype from command line to enable both type of json files * updated README file * code cleanup * code cleanup * updated jenkinsfile * enable tile_engine daily builds * updating cmake file * updated CMakeLists.txt * Updating CMake code fixing gfx12 build * Updating CMake code fixing gfx12 build * Fix CMake file null checks * fixed traces of rebase * Update tile_engine/ops/gemm/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fixing rebase issue --------- Co-authored-by: khushbu <khuagarw@gmail.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `a14753b86f`]	2025-06-25 15:18:24 -07:00
Thomas Ning	753232ea70	[CK Tile] Int8 Support on CK Tile GEMM (#2267 ) * updates to support int8 in 03_gemm example * added comments, using aliases, helper functions * test(gemm_universal): add test cases for int8 gemm pipeline * fix(test_gemm): fix for failing test unit test for int8 * test(ck_tile): add int8 unit test for gemm universal * refactor(gemm_universal): GPU reference verification for GEMM code improved * style(gemm_universal): removed extra comments and did clang format * merging recent changes to universal gemm to tile_engine * ck tile engine integration work * feat(tile_engine): add int8 support to tile engine ops/gemm * feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8 * style: Format code with clang-format-12 * refactor(tile_engine): address review comments * style: removed unhelpful comments & unused variables. * build: tile engine uses default config * feat: add int8 support for CK_TILE GEMM * style: added trailing commas to codegen_utils.py * refactor: tile engine * refactor: formatting and code review * refactor: code formatting for python files * fix: suppress build warning * add support for gfx950 * refactor:KWarpTile size in gemms util * Fix the branch and wrap up the k warp tile * Add bf8 integration * refactor: clang format and rebase --------- Co-authored-by: zjli2013 <leezhengjiang@gmail.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Khushbu Agarwal <khuagarw@amd.com> [ROCm/composable_kernel commit: `e03293ebce`]	2025-06-25 08:20:35 -07:00
Illia Silin	b719fea21b	Enable builds on gfx942 by default and run all tests on develop branch. (#2408 ) * add switches for architectures and force develop to run all tests * move the test condition inside the function * enable build on gfx942 by default [ROCm/composable_kernel commit: `6d6f4c76c1`]	2025-06-25 08:01:50 -07:00
Rostyslav Geyyer	9e0bfd3dbb	Enable fp4 tests (#2329 ) [ROCm/composable_kernel commit: `daf71fb8e4`]	2025-06-25 07:38:54 -05:00
linqunAMD	511f170dab	[CK_TILE] Refine fp8 support in flatmm (#2239 ) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp defintion to a separate struct * fix clang format * fix clang format * keep default bhavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `37e1a27537`]	2025-06-25 01:07:45 -07:00
Po Yen Chen	b86c92c84e	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403 ) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs() [ROCm/composable_kernel commit: `50fad03524`]	2025-06-25 15:19:21 +08:00
Xiao Li	b3b4aa8d57	Fix amd_ck_fp8.hpp macro definitions (#2325 ) * Fix amd_ck_fp8.hpp macro definitions 1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before. 2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace fp8_impl to avoid redefined error when building with rocm 6.4+ (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h) Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `bac51b6ec0`]	2025-06-24 22:46:15 -06:00
Yi DING	820ba182a0	Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393 ) [ROCm/composable_kernel commit: `c5d9181e1b`]	2025-06-24 16:35:54 -07:00
JiaLuo-CAN	d07dd533b3	add a mx_fp8 client example (#2380 ) * add a mx_fp8 client example * remove verify code and fix date * remove verify code and fix date, type --------- Co-authored-by: root <root@bg-1w300-e1-2a.mkm.dcgpu> Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `778ac24376`]	2025-06-24 12:13:18 -04:00
Anton Gorenko	e156b5aebb	Improve fmha_bwd tests performance (#2376 ) * Avoid passing indices (std::vector) by value to host tensor's operator() Each access requires 2 allocations and copies of the vector. * Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification * Compute ds_hp_host_ref in parallel This sequntial ForEach is the slowest part of validation and it benefits from parallel computation. * Do not use ForEach for simple copy and conversion of large tensors These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices. [ROCm/composable_kernel commit: `77123600ee`]	2025-06-24 07:45:24 -07:00
JonathanLichtnerAMD	1c8b1cee57	Do not build "other" library for MIOpen (#2382 ) MIOpen only needs the static CK library for convolutions. [ROCm/composable_kernel commit: `87fdb368a7`]	2025-06-24 07:32:16 -07:00
JonathanLichtnerAMD	79c30fbb3b	Fix build error when building with MIOPEN_REQ_LIBS_ONLY=ON (#2383 ) Co-authored-by: John Shumway <john.shumwayjr@gmail.com> [ROCm/composable_kernel commit: `42e246e90f`]	2025-06-24 07:30:42 -07:00
Kiefer van Teutem	eb4b7c65ff	Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319 ) * Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty. * Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good. * Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet. * Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3. In principle the whole thing works now, just need to add other data types and perhaps do some cleanup. * Add other layouts for batched gemm wmma chufflev3 f16 f16 f16. Now matching XDL (for f16). * Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts. * Fixup comments and TODOs * Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small. * Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests. * Take batching into account when using rotating memory or clearing the C tensor. * Implement small refactors / comments etc. from review. * Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists. * Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems. 
* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups. * Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `9e74ae7c89`]	2025-06-24 07:28:13 -07:00
lalala-sh	b6c780fc7f	fix moe i4 bug from aiter (#2339 ) [ROCm/composable_kernel commit: `bb571a0330`]	2025-06-24 14:51:29 +08:00
Yi DING	9f0d3497c3	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `b8212864cf`]	2025-06-24 01:33:31 +08:00
Rostyslav Geyyer	de3cfbab9a	Add accelerated stochastic rounding on gfx950 (#2355 ) * Add native prand generation support for gfx950 * Update seed calculation [ROCm/composable_kernel commit: `dbfe70e72a`]	2025-06-23 09:31:46 -05:00
John Shumway	7c57c4f045	Shard several of the most costly targets. (#2373 ) * Shard several of the most costly targets. Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance * fix clang format * Fix build errors in instantiation code. I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments on I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR. * Fix link errors from mismatched declarations. Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for the intantiating the kerenels and to generate the calling function. * Shard the longest 2D convolution builds Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes. * Use PROJECT_SOURCE_DIR for submodule compatibility I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. 
Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for the intantiating the kerenels and to generate the calling function. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for the intantiating the kerenels and to generate the calling function. * Remove accidental copy of a file * Remove accidental copies of template files. --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `47ae4b0955`]	2025-06-23 07:24:36 -07:00
Linjun-AMD	17346f2c91	update the way to compute fmha fwd tflop, include mask type (#2386 ) * update the way to compute fwd tflop, include mask type Signed-off-by: JL-underdog <Jun.Lin@amd.com> * remove unneccessary comment * add necessary comment * remove some comment --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu> [ROCm/composable_kernel commit: `61eb622e85`]	2025-06-23 15:53:58 +08:00
Po Yen Chen	7001322416	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors [ROCm/composable_kernel commit: `7d669440a6`]	2025-06-23 12:29:15 +08:00
Max Podkorytov	0bb4daa71b	Update for xformers (#2372 ) * update api * update kernel api * clang-format [ROCm/composable_kernel commit: `0366fb2abc`]	2025-06-22 00:28:30 -07:00
Bartłomiej Kocot	29cfe38b42	[CK TILE] Grouped Convolution Forward Kernel (#2188 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes [ROCm/composable_kernel commit: `cebdee4d9e`]	2025-06-20 15:44:36 -07:00
Illia Silin	bc61ff620d	update code owners list (#2381 ) [ROCm/composable_kernel commit: `7378a51b4c`]	2025-06-20 14:03:20 -07:00
Thomas Ning	5c2009c852	fix the mi350 error (#2378 ) [ROCm/composable_kernel commit: `df6023e305`]	2025-06-20 12:50:13 -07:00
Illia Silin	3d10c98abe	Introduce dependency-based CI test selection. (#2377 ) * Selective test filter initial commit. * Expanded folder paths for parsing ninja dependencies. * Fixing default branch name in the test evaluation script. * Fixing paths for robustness and adding ctest command to the launch script. * change jenkins file and few tests to upgrade CI * Setting ninja build path. * Fixing typo in Jenkinsfile, and wrong paths. * Fixing typo in launch script. * add few more tests to check CI logic * Fixing header for shell script. * turn off performance test by default, add option to run all unit tests * revert dummy changes in source code to trigger tests * make sure develop branch runs all unit tests --------- Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com> [ROCm/composable_kernel commit: `c3c8c6a10f`]	2025-06-20 12:48:00 -07:00
Thomas Ning	3414888f92	Transpose builtin macro defense (#2374 ) * add the macro defense * add the static assert check [ROCm/composable_kernel commit: `107e3623c7`]	2025-06-20 11:24:54 -07:00
Bartłomiej Kocot	9e27236fb7	Grouped conv bias clamp fp32/fp16 support (#2366 ) [ROCm/composable_kernel commit: `663992e99b`]	2025-06-20 11:41:04 +02:00
Max Podkorytov	7c10189a27	Reland fix default epilogue (#2367 ) * Revert "Revert "Fix default epilogue (#2358)" (#2364)" This reverts commit `f85c70b31e`. * add operator() with old signature [ROCm/composable_kernel commit: `11eb9f1c77`]	2025-06-19 10:39:30 -07:00
dependabot[bot]	83fa5f32c6	Bump sphinxcontrib-bibtex from 2.6.3 to 2.6.4 in /docs/sphinx (#2365 ) Bumps [sphinxcontrib-bibtex](https://github.com/mcmtroffaes/sphinxcontrib-bibtex) from 2.6.3 to 2.6.4. - [Changelog](https://github.com/mcmtroffaes/sphinxcontrib-bibtex/blob/develop/CHANGELOG.rst) - [Commits](https://github.com/mcmtroffaes/sphinxcontrib-bibtex/compare/2.6.3...2.6.4) --- updated-dependencies: - dependency-name: sphinxcontrib-bibtex dependency-version: 2.6.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `c8b247c55c`]	2025-06-18 08:15:59 -07:00
Muhammed Emin Ozturk	9c035fb203	Stream-K CkProfiler Update ( Replace CPU Validation with GPU Validation and Add Dynamic Grid Size Calculation for Stream-K GEMM Profiler) (#2333 ) * Stream-K Ckprofiler Update * new grid list based on sm number * clang * update for review * Update profile_gemm_universal_streamk.cpp --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> [ROCm/composable_kernel commit: `bfb33bc1e9`]	2025-06-18 07:49:22 -07:00
joyeamd	3cb0dd8506	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * 
update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `a2f01141aa`]	2025-06-18 01:28:34 -07:00
Thomas Ning	f85c70b31e	Revert "Fix default epilogue (#2358 )" (#2364 ) This reverts commit `b29e3830a6`. [ROCm/composable_kernel commit: `64a2fda713`]	2025-06-17 22:43:05 -07:00
linqunAMD	cd0bf60645	[CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd (#2243 ) * [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd * fix error with the latest develop code. [ROCm/composable_kernel commit: `7aeec9a901`]	2025-06-17 21:37:59 -07:00
carlushuang	f540c6ccb4	[CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335 ) * support local_token for hipgraph * update README * fix comment * fix fmoe example [ROCm/composable_kernel commit: `a4e1248dba`]	2025-06-18 10:49:43 +08:00
Kiefer van Teutem	609cb2c3ad	Fix argument order for calls to profile_batched_gemm_impl() (#2277 ) * Fix argument order for calls to profile_batched_gemm_impl() * Revert previous and swap the order of the profile_batched_gemm_impl() function arguments instead. * Revert copyright years for unchanged files. * Remove test_batched_gemm from REGRESSION_TESTS since it no longer takes more than 30 seconds to run. --------- Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `c7c6a0ccb3`]	2025-06-17 19:29:09 -07:00
Max Podkorytov	b29e3830a6	Fix default epilogue (#2358 ) * [ck-tile] fix default epilogue in gemm universal * argument validation needs vector size D * operator() needs to specify dram windows * copy/paste from cshuffle epilogue * clang-format * mark unused argument --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cd606f72c1`]	2025-06-17 17:30:21 -07:00
linqunAMD	af00674037	[CK_TILE] Support multi-config in tile_example_gemm_universal (#2240 ) * [CK_TILE] Support multi-config in tile_example_gemm_universal Add GemmConfig in run_gemm_example to support multiple tile config. - It is useful when use you need compare gemm perf with different tile/pipeline config - we also can use it simplify the code for wmma support in the furture. * [CK_TILE] Support multi-config in tile_example_gemm_universal Address review comments * rebase code and fix clang format. * fix clang format * support pipeline v5. * fix merge conflict * address review comment * add missing file * address review comment v2 * fix build error [ROCm/composable_kernel commit: `0eb8974502`]	2025-06-17 17:27:46 -07:00
John Afaganis	3ef7712ee3	Add missing copyright headers (#2359 ) * Add missing copyright headers * empty commit [ROCm/composable_kernel commit: `df54667102`]	2025-06-17 14:29:45 -07:00
Illia Silin	073bb8d588	Revert "Shard several of the most costly targets. (#2266 )" (#2361 ) This reverts commit `c1285aaada`. [ROCm/composable_kernel commit: `cdfd7722bf`]	2025-06-17 13:56:30 -07:00
Bartłomiej Kocot	d9316dfbeb	Fix Add in dynamic buffer for fp32/i8 (#2351 ) * Fix Add in dynamic buffer for fp32/i8 * fixes * Fix [ROCm/composable_kernel commit: `cc98a41f46`]	2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi	bde406245a	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
Aviral Goel	66afddf431	add script to pre commit hooks for checking file permissions (#2322 ) [ROCm/composable_kernel commit: `3af66e99ab`]	2025-06-17 07:07:08 -07:00

2047 Commits