composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-19 02:01:01 +00:00

Author	SHA1	Message	Date
Enrico Degregori	86e0049300	Wmma support for grouped convolution bwd weight (#2947 ) * Convolution bwd weight device implementation * Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma' Convolution bwd weight device implementation See merge request amd/ai/composable_kernel!38 * Fix bug and disable splitK=-1 tests for wmma * Add generic instances for bf16 f32 bf16 * check gridwise level validity in device impl for 1 stage D0 * Fix bugs in device implementation: - rdna3 compilation error - gridwise layouts (need to be correct to ensure that CheckValidaity() works correctly) * Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization * Remove workaround for 1x1Stride1Pad0 conv specialization * Add instances for xdl parity (for pipeline v1) * Add two stage instances (xdl parity) * Add multiple Ds instances * Add examples * Uncomment scale instances * Fix copyright * Fix examples compilation * Add atomic add float4 * Fix compilation error * Fix instances * Compute tolerances in examples instead of using default ones * Compute tolerances instead of using default ones in bilinear and scale tests * Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma' Grouped conv: Instances and example bwd weight See merge request amd/ai/composable_kernel!47 * Device implementation of explicit gemm for grouped conv bwd weight Based on batched gemm multiple D * Add instances for pipeline v1 and v3 * Add support for occupancy-based splitk * Fix ckProfiler dependencies * Review fixes * Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma' Device implementation of explicit gemm for grouped conv bwd weight See merge request amd/ai/composable_kernel!52 * Fix cmake file for tests * fix clang format * fix instance factory error * Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). 
Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test. * Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test." This reverts commit da8e4cfb7917d45d46339ec74eb72e2f585f14cf. * Disable splitk for 2stage xdl on rdna (bug to be fixed) * Fix add_test_executable * Always ForceThreadTileTransfer for now, WaveTileTransfer does not work for convolution yet. * Grab device and gridwise files from bkp branch, this should enable splitK support for convolution and also we no longer ForceThreadTileTransfer for explicit gemm. Also grab some updates from 7e7243783008b11e904f127ecf1df55ef95e9af2 to fix building on clang20. * Fix bug in various bwd wei device implementations / profiler where the occupancy based split_k value could not be found because the Argument did not derive from ArgumentSplitK, leading to incorrect error tolerances. * Actually print the reason when a device implementation is not supported. * Print number of valid instances in profiler and tests. * Fix clang format for Two Stage implementation * Fix copyright * Address review comments * Fix explicit conv bwd weight struct * Fix gridwise common * Fix gridwise ab scale * Remove autodeduce 1 stage * Restore example tolerance calculation * Fix compilation error * Fix gridwise common * Fix gridwise gemm * Fix typo * Fix splitk * Fix splitk ab scale * Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test. * Reduce instances to only the tuned wmma V3 ones for implicit v1 intra and explicit v1 intra pad/nopad. 
* Add explicit oddMN support with custom tuned instances * Add two stage instances based on the parameters from the tuned cshuffle V3 instances. CShuffleBlockTranserScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists. * Replace cshuffle non-v3 lists with v3 lists, making sure to not have duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases. * Remove some instances that give incorrect results (f16 NHWGC) * Add bf16 f32 bf16 instances based on tuned b16 NHWGC GKYXC instances. * Add back some generic instances to make sure we have the same shape / layout / datatype support as before the instance selection process. * Add instances for scale and bilinear based on the bf16 NHWGC GKYXC tuning. Keep generic instances for support. * Disable two stage f16 instances which produce incorrect results. * Remove more instances which fail verification, for bf16_f32_bf16 and for f16 scale / bilinear. * Disable all non-generic two-stage instances in the instance lists for NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit. * Remove unused instance lists and related add_x_instance() functions, fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3. * Re-enable all xdl instances (un-16x16-adapted) and dl instances. Remove custom ckProfiler target. * Remove straggler comments * Remove [[maybe_unused]] * Fix clang format * Remove unwanted instances. This includes all instances which are not NHWGCxGKYXC and F16 or BF16 (no mixed in-out types). * Add comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com> [ROCm/composable_kernel commit: `87dd073887`]	2025-12-17 15:58:58 -08:00
Geo Min	7ad4a687fc	details from org var (#3431 ) [ROCm/composable_kernel commit: `f4729de395`]	2025-12-17 11:54:13 -08:00
Yashvardhan Agarwal	83dc6ad263	[ck_tile] refactor reduce kernel (#3257 ) * refactor reduce kernel - Rename Reduce kernel as per convention - Move kept_dim and reduce_dims from runtime to compile-time parameters - Update Reduce2dProblem template to include KeptDim, ReduceDims, and Rank - Remove IsSupportedArgument validation function as it's unnecessary. Not using the GuaranteedLastDimensionVectorStride while making tensor view or descriptor which removes the bounds enforced earlier. We still calculate and use vector size. - Update reduce example to demonstrate NCHW->NHW reduction with non-contiguous support - Update tests Kernel now handles both contiguous and non-contiguous memory layout. * fix compile errors [ROCm/composable_kernel commit: `ea10a78203`]	2025-12-17 21:46:08 +02:00
ltqin	c8397e8ef2	flashattention fwd add (80, 96) instance (#3415 ) * add hdim (96,96) instance * change to (80,96) * format py * remove 96 in optdim * when N=6 change to llvm_amdgcn_raw_buffer_load_i32x3 [ROCm/composable_kernel commit: `92653168c2`]	2025-12-17 09:16:11 -08:00
Matti Eskelinen	e404594325	Fix minor issues in cmake-ck-dev script (#3438 ) * Remove extra slash from cmake-ck-dev.sh * Add quoting around string variables [ROCm/composable_kernel commit: `fe3d52d9b0`]	2025-12-17 08:57:21 -08:00
music-dino	76d5fb93fe	Add rocm to prefix path for codegen (#3404 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Mirza Halilčević <109971222+mirza-halilcevic@users.noreply.github.com> [ROCm/composable_kernel commit: `55c2886b17`]	2025-12-17 08:51:13 -08:00
spolifroni-amd	c92c3ac29d	[CK] Evened out the wording in the changelog (#3418 ) [ROCm/composable_kernel commit: `871c2ece2d`]	2025-12-17 08:48:56 -08:00
rocking	97b2015929	Fix FMHA fp8 hdim=64 incorrect result in MI200 (#3423 ) * Fix incorrect result in hdim=64 * Add change log [ROCm/composable_kernel commit: `292f87aa03`]	2025-12-17 08:16:54 -08:00
andrew clark	2de39368c2	Adding sscache stats monitoring (#3428 ) * Adding additional sccache and redis logging to each build * Removing custom workspace * Removing script reference * Logging complete sccache stats * Ensuring monitor is stopped if build fails * Including additional sccache logging * Removing build duration log * Fixing groovy syntax error * Fixing syntax * Modifying logging statements * Fixing syntax * Modifying logging * Modifying logging * Including additional logging * Fixing logging message * Logging build path * Testing * Testing workspace path logs * Adding additonal logging to monitor * Modifying comments * Adding copyright info * Cleaning unnecessary logs * Removing build time logs * Merge branch 'develop' into aick-457 [ROCm/composable_kernel commit: `e67cd7edeb`]	2025-12-17 09:15:27 -07:00
kensclin	9b63a65886	Support A/B Quantization in Blockscale GEMM (#3343 ) * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Implement review suggested changes * Implement review suggested changes * Sync with develop * fix pre-commit error * Add unit tests for blockscale AB-Quantization * fix pre-commit error * fix pre-commit error * fix compile error * fix compile error * fix clang-format * fix clang-format * fix enumeration values not handled in switch * rebase file * Add missing enums to data_type_sizeof (#3430) Fixes broken build on gfx942. This was some test code that got merged at the same time. * [CK_BUILDER] CK Tile header installation for builder, algorithm concept improvements (#3419) * Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen * Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts --------- Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com> * Add build trace diagnostics to CI. 
(#3432) * generate and visualize build traces for all archs * generate build traces in all cases * fix jenkins logic * fix typo * use more threads for parsing dependency map * add script to parse ninja traces and issue warnings * fix python script syntax and header * fix python syntax one more time * fix python syntax * Support A/B Quantization in Blockscale GEMM * Implement review suggested changes * Sync with develop * Add unit tests for blockscale AB-Quantization * fix enumeration values not handled in switch * rebase file * rebase file --------- Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: DarylHawkinsAMD <Daryl.Hawkins@amd.com> Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `0500fcc017`]	2025-12-17 07:13:47 -08:00
KateJu	c3d078376b	fix some minor error (#3409 ) ReduceWithNoIndexTesBtHalfFloat_AMAX: fix typo error to ReduceWithNoIndexTesBHalfFloat_AMAX reduce_blockwise_test<int8_t, float to reduce_blockwise_test<int8_t, int32_t to solve error message "The reduction setting is invalid, exiting!" [ROCm/composable_kernel commit: `292df2719f`]	2025-12-16 19:50:49 -08:00
Yi DING	af1927262c	[CK_TILE] Add FP8xF4 Flatmm (#3401 ) * Refactor policy * fix a bank conflict * Enable mixed mx flatmm * Update [ROCm/composable_kernel commit: `57e1e4a848`]	2025-12-17 10:01:48 +08:00
Illia Silin	f35e7b59cc	Add build trace diagnostics to CI. (#3432 ) * generate and visualize build traces for all archs * generate build traces in all cases * fix jenkins logic * fix typo * use more threads for parsing dependency map * add script to parse ninja traces and issue warnings * fix python script syntax and header * fix python syntax one more time * fix python syntax [ROCm/composable_kernel commit: `3dfa794fab`]	2025-12-16 08:22:52 -08:00
DarylHawkinsAMD	29ed00bbd1	[CK_BUILDER] CK Tile header installation for builder, algorithm concept improvements (#3419 ) * Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen * Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts --------- Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com> [ROCm/composable_kernel commit: `1e6bbed1fb`]	2025-12-15 16:24:36 -07:00
John Shumway	ec9afcfe8d	Add missing enums to data_type_sizeof (#3430 ) Fixes broken build on gfx942. This was some test code that got merged at the same time. [ROCm/composable_kernel commit: `2544e394cf`]	2025-12-15 11:49:36 -08:00
Aviral Goel	389e797a9b	build: reduce build time for bquant tests by splitting into multiple cpp & support on other gfx10 case (#3395 ) * build: reduce build time for bqaunt unit tests by splitting into multiple cpp * reduce the test case & add the gfx10 support * fix: copyright header for new file * chore: add copyright to pass the CI * build: Hot fix to reduce massive build time by just disabling the instances * Update include/ck_tile/core/config.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: khushbu <khuagarw@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `5e2d25e20f`]	2025-12-15 07:19:29 -08:00
Sami Remes	4a29a8f84d	[CK_TILE] Fix some inconsistencies with OverrideBDatatype in BQuant GEMM (#3394 ) * Fix some inconsistencies with OverrideBDatatype * fix formatting * Fix BGlobalPrefetch, no static --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `a0cdb0b493`]	2025-12-15 07:18:38 -08:00
linqunAMD	7cdba74e97	[ck][gfx12] support contraction on gfx12 (#3421 ) * support contraction on gfx12 * increase tolerance for gfx11 in example contraction the precsion of gfx11 wmma is less than others. [ROCm/composable_kernel commit: `7e93eed878`]	2025-12-15 07:16:01 -08:00
linqunAMD	8811c57d44	[ck_tile] remove duplicate functions in ck_tile (#3311 ) * [ck_tile] remove duplicated shuffle_b and shuffle_b_permuteN * [ck_tile] move get_k_warp to gemm_shape * resolve code rebase error [ROCm/composable_kernel commit: `6d7299ff78`]	2025-12-15 07:13:00 -08:00
Johannes Graner	2fe4c8acec	Add grouped convnd dataset tests for bwd_data, bwd_weight and make them parallel (#3380 ) * Parallelization in dataset generation * Parallelizable tests for fwd, bwd data, bwd weight with datasets * .gitignore generated datasets * Test parallelization script with round-robin GPU scheduling * Parallelization updates to test generation and running * Dataset paths relative to executable * Update output from test generation * Default to one GPU in test generation * Add small dataset tests to Jenkins * Update copyright lines * Update test_data/generate_test_dataset.sh Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Move trap disable * Common get path function --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `fe35ba5dac`]	2025-12-15 13:38:25 +01:00
Bartłomiej Kocot	a45c051ac9	[CK TILE][AICK-439] Fix cshuffle epilogue wave per shuffle (#3364 ) * [CK TILE] Fix cshufle epligoue wave per shuffle * Align shuffle per tile with smem * fixes * Fixes for double smem * fix [ROCm/composable_kernel commit: `3b773109e5`]	2025-12-15 12:59:48 +01:00
Johannes Graner	6238fe6d0d	[CK Grouped Gemm] Disable split-k kernel for split-k > 1 with non-contiguous strides (#3405 ) * Disable kernel for split-k > 1 with non-contiguous strides * Update device_grouped_gemm_xdl_splitk_cshuffle.hpp --------- AICK-441 (partial) Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3143a5a480`]	2025-12-15 08:03:00 +01:00
Linjun-AMD	51886bf22b	Add attention sink support for FMHA FWD (#3368 ) * Revert "Revert "Add attn sink (#2892)" (#3250)" This reverts commit e3be392d13e6ee107d823af32aca2d3ff03ca69d. * fix conflict Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Add F_sink parameter to FmhaFwdPipeline * Update tile_fmha_traits.hpp * Refactor pipeline creation in fmha_fwd.py Updated the pipeline creation logic to include 'sink' parameter in product combinations and adjusted the FmhaFwdPipeline calls accordingly. * Update fmha_fwd.py * Update fmha_fwd.py * Update example/ck_tile/01_fmha/script/correct_test_fwd_sink.sh Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update CHANGELOG.md Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update CHANGELOG with new features and support * Update fmha_fwd.hpp * Update CHANGELOG.md * Update smoke_test_fwd_sink.sh * Update correct_test_fwd_sink.sh * Update smoke_test_fwd_sink.sh --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f5573f56d9`]	2025-12-15 12:21:59 +08:00
Emily Martins	eeb78c46a4	[CK_TILE] Stream-K Tree Reduction and Cache Skipping Integration (#3371 ) * CK Tile Stream-K Tree Reduction This change adds the first implementation of the Stream-K tree reduction strategy into CK Tile. The tree reduction reduces the the number of steps for accumulating results for a tile from O(N) to O(logN) where N is the number of workgroups contributing to a C tile. Additionally, in the original non-atomic reduction strategy, atomics were used to set the flags buffer and to read from the flags buffer. Howeover, through investigation with the tree reduciton, atomics with default (relaxed) semantics were not enough to guarantee workgroups would not read stale data, leading to incorrect results. Stronger acquire/release memory orderings are too expensive. So, this change also eliminates the use of atomics for setting the flags. Instead, we leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby avoiding the use of atomics. Prelimiary tests were also added for the normal reduction and tree reduction. More will be added in a future PR via tile engine. * Move Stream-K kernel files to a subdirectory * Cleanup Code Style & Handle Unsupported Reductions This change makes the following small changes: - Add an explicit else block for unimplemented reduction strategies - Clarify type of sk_flags_ptr via auto* - Add description for extra_iters_before_me variable * Run new copyright script on new files [ROCm/composable_kernel commit: `22b945e06e`]	2025-12-14 14:49:49 -07:00
John Shumway	a3270d2eb0	Add describe() method to device ops for runtime introspection (#3375 ) Introduces a polymorphic describe() method to BaseOperator that enables runtime introspection of kernel configurations through a unified interface. Key changes: * Add virtual describe() method to BaseOperator returning Description objects * Implement describe() in 6 device operation classes (conv fwd/bwd variants) * Create conv_describe.hpp with factory function for ConvDescription * Extract type definitions to conv_types.hpp to resolve circular dependencies * Add InstanceStringDescription for kernels without full ConvDescription support Other Improvements: * Update tests to use describe() instead of GetInstanceString() * Remove circular dependency include from conv_traits.hpp * Add ODD_C to ConvFwdSpecialization enum and fix OddC mapping * Replace silent fallback in conv_layout() with compile-time error This provides a foundation for runtime kernel introspection and better tooling support for analyzing and debugging kernel configurations. [ROCm/composable_kernel commit: `9ac51aa0f4`]	2025-12-14 12:49:12 -08:00
Enrico Degregori	5c81464568	CK Tile: Enable padding blockscale example (#3417 ) * Fix host code padding * restructure the ref code * clean up * Fix compilation error --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `21f06aa47d`]	2025-12-14 10:25:47 -08:00
Robin Voetter	417ed79412	[CK_BUILDER] convolution testing (#3267 ) * Add README.md for testing * Add tensor_memory_manager. * ck-builder: tensor memory manager rebase fixes This fixes some issues caused by the API being changed recently. Also, this streamlines the ckt namespace to always be ck_tile::builder::test, as this is already being used by other tests Really, this commit should be squashed into the previous, but I'm keeping it separate for brevity. * ck-builder: test arguments initial prototype * ck-builder: test system initial prototype * ck-builder: fix non-standardized copyright comments * ck-builder: new prototype * ck-builder: group testing inputs/outputs into a separate structure This is basically the return of the tensor memory manager after all, except that the design is more closely tied to the actual operation. Using a struct allows us to add additional input/output tensors without breaking code (by defaulting those new parameters). Note that the tensors are split into a separate inputs/outputs because we usually want to allocate the output _twice_: once for the real computation and once for the reference computation. * ck-builder: simplify prototype naming; start docs * ck-builder: update testing readme * ck-builder: testing documentation * ck-builder: HipStatusMatcher This matcher can be used to check HIP status codes and provide nice and readable error messages. * ck-builder: tensor_buffer.hpp tests * ck-builder: conv_fwd.hpp tests * ck-builder: add example end-to-end test in conv fwd 2d fp16 * ck-builder: simplify extent usage * ck-builder: update testing doc * ck-builder: skip end to end test on non-gfx9 * fix check_copyright_year interpreter /bin/bash is not guaranteed to exist on Linux. Signed, a NixOS user * ck-builder: fix copyrights * ck-builder: reduce conv fwd testing size This test allocated 24GB of memory, too much for 16GB cards. 
--------- Co-authored-by: John Shumway <jshumway@amd.com> [ROCm/composable_kernel commit: `6219b12730`]	2025-12-13 15:33:41 +01:00
Cong Ma	d287385933	[CK TILE GEMM STREAMK] update identifier names according to the new code style (#3348 ) * [CK TILE GEMM STREAMK] update identifier names according to the new code style [ROCm/composable_kernel commit: `9707ddb444`]	2025-12-12 17:08:26 -07:00
Enrico Degregori	7cbd8b75a0	Fix compilation ab scale multi target (#3413 ) [ROCm/composable_kernel commit: `b4a34371a6`]	2025-12-12 10:26:47 -08:00
linqunAMD	245c274287	[CK_TILE] Port hw independent changes from internal repo to develop branch (#3301 ) * [CK_TILE] Port hw independent changes from internal repo to develop branch It includes PR#96, #114, #120, #121. * correct rebase error [ROCm/composable_kernel commit: `fc7bf0ab1c`]	2025-12-12 09:28:37 -08:00
Illia Silin	f9bf419b01	disable test_tile_gemm_quant_bquant_preshuffle (#3420 ) [ROCm/composable_kernel commit: `9869641324`]	2025-12-12 09:27:12 -08:00
dependabot[bot]	b4d5a50216	Bump rocm-docs-core[api_reference] from 1.31.0 to 1.31.1 in /docs/sphinx (#3410 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.0 to 1.31.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.0...v1.31.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.31.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `8d7a4e0c73`]	2025-12-11 21:09:40 -08:00
Max Podkorytov	2ac57c22c1	[CK-Tile] fixup codegen for tile engine ops gemm multid and gemm preshuffle (#3383 ) * fixup gemm multi-d and preshuffle in tile engine codegen --------- Co-authored-by: Thrupti Raj Lakshmana Gowda <thruptiraj.lakshmanagowda@amd.com> [ROCm/composable_kernel commit: `4011dbfec3`]	2025-12-11 14:23:43 -08:00
Aviral Goel	5d5dbdfb0d	build: Hot fix to reduce massive build time by just disabling the instances (#3408 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `ff194a4271`]	2025-12-11 10:39:20 -08:00
Aviral Goel	32faf7b8e3	chore: add copyright to pass the CI (#3407 ) [ROCm/composable_kernel commit: `45c4ea510c`]	2025-12-11 10:34:15 -08:00
Aviral Goel	f2a25da322	chore: update copyright header for misc files (#3402 ) * chore: update copyright header for misc files * fix: typo in kernel resulting in ci failure [ROCm/composable_kernel commit: `4dcc3e59c1`]	2025-12-11 08:25:29 -08:00
Illia Silin	f55ff25622	Fix compilation errors with latest clang22 version. (#3396 ) * remove target attributes from deduction guides * switch CK_TILE_HOST_DEVICE_EXTERN based on clang version [ROCm/composable_kernel commit: `b2925ee207`]	2025-12-11 08:09:29 -08:00
eliotwang	d5645ff481	Bf16fp4 gemm (#2801 ) support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * rebase to new develop * fix clang format * update code according to reviewer's comment * Update README.md * update code according to reviewer's comment * update code according to reviewer's comment * Update CMakeLists.txt * Update README.md * Update CMakeLists.txt * Delete files * Delete files * Add unit tests * Update test_gemm_quant_base.hpp * merge bf16fp4 example to develop branch fix clang format * fix clang format * Update CMakeLists.txt * fix ci test * fix clang format * resolve conflicts --------- Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu> Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `715671e419`]	2025-12-11 07:20:29 -08:00
Enrico Degregori	53dc636c6e	Wmma support for gemm_ab_scale (#3314 ) * Support gemm_ab_scale: - Add tests - Integrate scaling implementation in multiple D - Generalize existing b_scale for ab_scale - Add instances - Generalize implementation for ScaleBlockM, ScaleBlockN, ScaleBlockK - Add support for all layouts supported by xdl - Fix splitk xdl * Fix copyright * Wmma support for gemm_blockscale_wp (#3315) * Support for preshuffle with ab scale - add support for b preshuffle in GridwiseGemm_wmma_cshuffle_v3_ab_scale - add support for AScaleLayout amnd BScaleLayout (can be different from ALayout and BLayout, respectively) - add Run method in v1 pipeline to support preshuffle + scaling - add support for preshuffle gemms in common invoker - Add splitk support * Fix copyright header [ROCm/composable_kernel commit: `ce99cab605`]	2025-12-11 09:06:20 +01:00
Ville Pietilä	fe0fe6f4ad	[CK_BUILDER] Improve CK Builder and CK Builder tests (#3382 ) * Remove stale documentation. * Add placeholder for conv algorithm design description. Add link to conv factory description. * Improve testing transfer parameters. * Python script to check the block tilings. * Improve tests and conv types serialization. * Change representation of boolean values from 1/0 to true/false in instance strings. * Change representation of boolean values from 1/0 to true/false in conv algorithm types. * Test code improvements. * Improve covn descriptions tests. * Improve conv signature definition in conv fwd builder tests. * clang-format. * Remove obsolete script. * Revert StaticAssertTypeEq changes in conv layout tests. * Remove obsolete using declaration. --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `d66e5f667c`]	2025-12-11 09:50:00 +02:00
Aviral Goel	d810876d63	feat(precommit-hooks): add check for correct copyright header (#3302 ) * chore(copyright): update copyright header for left files * feat(copyright): add copyright check to precommit hooks * chore(copyright): update copyright header for include/ck_tile directory * chore(copyright): update copyright header for example directory * chore(copyright): update copyright header for .github directory * refactor: copyright_check script with better if else handling * chore(copyright): update compyright header for remaining files * feat: add script to automate copyright addition [ROCm/composable_kernel commit: `6d25525adc`]	2025-12-10 22:50:43 -08:00
Aviral Goel	f38b64ae67	docs: add notes on tile distribution and inline comments (#3297 ) * docs: add notes on tile distribution and inline comments * Apply suggestions from code review Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `fbbdd36ea8`]	2025-12-10 22:47:19 -08:00
Geo Min	f2a77cf0bd	[ci] Bumping TheRock commit hash (#3385 ) * Bumping TheRock commit hash * new docker hash * Using new runner name [ROCm/composable_kernel commit: `8270900d60`]	2025-12-10 17:34:41 -08:00
John Shumway	c868964f6a	Improve sequence sorting and add unit tests (#3376 ) Old sequence sort code was showing up on build profiles. Convert it to constexpr functions for much more efficient build-time execution. The sorting is still O(N^2), but our sequences are small enough it executes quickly. This reduced compilation time of a small convolution by more than 10% and time overall time spent in the compiler on a narrow build by %6. [ROCm/composable_kernel commit: `15ed65db35`]	2025-12-10 12:25:23 -08:00
Po Yen Chen	737c80d47d	fix: python 3.8 compatibility in fmha codegen (#3388 ) [ROCm/composable_kernel commit: `b15df37255`]	2025-12-10 07:08:41 -08:00
Ville Pietilä	d719c09343	[CK_TILE] Split-K autodeduction (#3351 ) * First version of split-K autodeduction. * Fix circular dependency and kernel construction. * Fix tolerance calculation for bwd weight example. * Simplify kernel construction. * Fix kernel launching bug for split-K autodeduce. * Add split-K autodeduction support for the two stage example. * Fix a corner case. * Fix clang-format. * Fix clang-format for inc files. * Add missing header. * Prevent too large split-K values. * Fix formatting. * Add unit tests for IsSupportedArgument in grouped bwd conv. * clang-format. * Fix merge conflicts. * Address feedback from code review. * clang-format * Fix new tests after merge. --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `fc22320d78`]	2025-12-10 09:30:30 +02:00
Zzz9990	822da5d3a7	[CK_TILE MOE] add NT & preshuffle permute to cktile MOE (#3377 ) * update coherence --------- Co-authored-by: Zzz9990 <Zzz9990> [ROCm/composable_kernel commit: `1aa93ef551`]	2025-12-10 10:03:28 +08:00
Illia Silin	ee0d92f8fc	use hipTensor from monorepo for daily builds (#3386 ) [ROCm/composable_kernel commit: `934ba1208a`]	2025-12-09 14:39:08 -08:00
Illia Silin	5f4c14b336	temporarily disable daily builds on gfx1010 and gfx908 (#3384 ) [ROCm/composable_kernel commit: `0d8259affd`]	2025-12-09 10:37:13 -08:00
Illia Silin	cdacf1d5f5	Upgrade to ROCm7.1.1 as default compiler. (#3370 ) * upgrade to rocm7.1.1 as new default compiler * fix jenkinsfile [ROCm/composable_kernel commit: `7582c9e73f`]	2025-12-09 07:35:32 -08:00


2790 Commits