composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-16 16:51:26 +00:00

Author	SHA1	Message	Date
Illia Silin	f55ff25622	Fix compilation errors with latest clang22 version. (#3396 ) * remove target attributes from deduction guides * switch CK_TILE_HOST_DEVICE_EXTERN based on clang version [ROCm/composable_kernel commit: `b2925ee207`]	2025-12-11 08:09:29 -08:00
eliotwang	d5645ff481	Bf16fp4 gemm (#2801 ) support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * rebase to new develop * fix clang format * update code according to reviewer's comment * Update README.md * update code according to reviewer's comment * update code according to reviewer's comment * Update CMakeLists.txt * Update README.md * Update CMakeLists.txt * Delete files * Delete files * Add unit tests * Update test_gemm_quant_base.hpp * merge bf16fp4 example to develop branch fix clang format * fix clang format * Update CMakeLists.txt * fix ci test * fix clang format * resolve conflicts --------- Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu> Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `715671e419`]	2025-12-11 07:20:29 -08:00
Enrico Degregori	53dc636c6e	Wmma support for gemm_ab_scale (#3314 ) * Support gemm_ab_scale: - Add tests - Integrate scaling implementation in multiple D - Generalize existing b_scale for ab_scale - Add instances - Generalize implementation for ScaleBlockM, ScaleBlockN, ScaleBlockK - Add support for all layouts supported by xdl - Fix splitk xdl * Fix copyright * Wmma support for gemm_blockscale_wp (#3315) * Support for preshuffle with ab scale - add support for b preshuffle in GridwiseGemm_wmma_cshuffle_v3_ab_scale - add support for AScaleLayout and BScaleLayout (can be different from ALayout and BLayout, respectively) - add Run method in v1 pipeline to support preshuffle + scaling - add support for preshuffle gemms in common invoker - Add splitk support * Fix copyright header [ROCm/composable_kernel commit: `ce99cab605`]	2025-12-11 09:06:20 +01:00
Ville Pietilä	fe0fe6f4ad	[CK_BUILDER] Improve CK Builder and CK Builder tests (#3382 ) * Remove stale documentation. * Add placeholder for conv algorithm design description. Add link to conv factory description. * Improve testing transfer parameters. * Python script to check the block tilings. * Improve tests and conv types serialization. * Change representation of boolean values from 1/0 to true/false in instance strings. * Change representation of boolean values from 1/0 to true/false in conv algorithm types. * Test code improvements. * Improve conv descriptions tests. * Improve conv signature definition in conv fwd builder tests. * clang-format. * Remove obsolete script. * Revert StaticAssertTypeEq changes in conv layout tests. * Remove obsolete using declaration. --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `d66e5f667c`]	2025-12-11 09:50:00 +02:00
Aviral Goel	d810876d63	feat(precommit-hooks): add check for correct copyright header (#3302 ) * chore(copyright): update copyright header for left files * feat(copyright): add copyright check to precommit hooks * chore(copyright): update copyright header for include/ck_tile directory * chore(copyright): update copyright header for example directory * chore(copyright): update copyright header for .github directory * refactor: copyright_check script with better if else handling * chore(copyright): update copyright header for remaining files * feat: add script to automate copyright addition [ROCm/composable_kernel commit: `6d25525adc`]	2025-12-10 22:50:43 -08:00
Aviral Goel	f38b64ae67	docs: add notes on tile distribution and inline comments (#3297 ) * docs: add notes on tile distribution and inline comments * Apply suggestions from code review Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `fbbdd36ea8`]	2025-12-10 22:47:19 -08:00
Geo Min	f2a77cf0bd	[ci] Bumping TheRock commit hash (#3385 ) * Bumping TheRock commit hash * new docker hash * Using new runner name [ROCm/composable_kernel commit: `8270900d60`]	2025-12-10 17:34:41 -08:00
John Shumway	c868964f6a	Improve sequence sorting and add unit tests (#3376 ) Old sequence sort code was showing up on build profiles. Convert it to constexpr functions for much more efficient build-time execution. The sorting is still O(N^2), but our sequences are small enough that it executes quickly. This reduced compilation time of a small convolution by more than 10% and overall time spent in the compiler on a narrow build by 6%. [ROCm/composable_kernel commit: `15ed65db35`]	2025-12-10 12:25:23 -08:00
Po Yen Chen	737c80d47d	fix: python 3.8 compatibility in fmha codegen (#3388 ) [ROCm/composable_kernel commit: `b15df37255`]	2025-12-10 07:08:41 -08:00
Ville Pietilä	d719c09343	[CK_TILE] Split-K autodeduction (#3351 ) * First version of split-K autodeduction. * Fix circular dependency and kernel construction. * Fix tolerance calculation for bwd weight example. * Simplify kernel construction. * Fix kernel launching bug for split-K autodeduce. * Add split-K autodeduction support for the two stage example. * Fix a corner case. * Fix clang-format. * Fix clang-format for inc files. * Add missing header. * Prevent too large split-K values. * Fix formatting. * Add unit tests for IsSupportedArgument in grouped bwd conv. * clang-format. * Fix merge conflicts. * Address feedback from code review. * clang-format * Fix new tests after merge. --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `fc22320d78`]	2025-12-10 09:30:30 +02:00
Zzz9990	822da5d3a7	[CK_TILE MOE] add NT & preshuffle permute to cktile MOE (#3377 ) * update coherence --------- Co-authored-by: Zzz9990 <Zzz9990> [ROCm/composable_kernel commit: `1aa93ef551`]	2025-12-10 10:03:28 +08:00
Illia Silin	ee0d92f8fc	use hipTensor from monorepo for daily builds (#3386 ) [ROCm/composable_kernel commit: `934ba1208a`]	2025-12-09 14:39:08 -08:00
Illia Silin	5f4c14b336	temporarily disable daily builds on gfx1010 and gfx908 (#3384 ) [ROCm/composable_kernel commit: `0d8259affd`]	2025-12-09 10:37:13 -08:00
Illia Silin	cdacf1d5f5	Upgrade to ROCm7.1.1 as default compiler. (#3370 ) * upgrade to rocm7.1.1 as new default compiler * fix jenkinsfile [ROCm/composable_kernel commit: `7582c9e73f`]	2025-12-09 07:35:32 -08:00
dependabot[bot]	821b976ead	Bump rocm-docs-core[api_reference] from 1.20.1 to 1.31.0 in /docs/sphinx (#3374 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.20.1 to 1.31.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.31.0/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.20.1...v1.31.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.31.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `50ca3f83eb`]	2025-12-09 07:10:34 -08:00
lalala-sh	77f9a0a615	fix a16w4 moe bugs (#3373 ) * fix valid mask bug * update format [ROCm/composable_kernel commit: `6f0966e1e9`]	2025-12-09 17:54:55 +08:00
Yi DING	b726f9606c	[CK_TILE] Generate random tensor values with multiple threads (#3324 ) [ROCm/composable_kernel commit: `c1c2e41a03`]	2025-12-09 11:02:33 +08:00
Sami Remes	b85cf9d37c	[CK_TILE] Support more layouts for BQuant GEMM (#3349 ) * WIP: preparing to add transpose bq support * WIP: handle both row/col layout for BQ windows/tile dstr * Fix build * WIP: adding some test, debugging numerical errors * Fix all but pkint4 tests * Remove test_gemm_quant_typed.cpp again * update disabled tests * add conversion from pkint4 for b matrix * fix formatting * fix formatting * Fix tr_load and use override b datatype for clarity * fix formatting * make bquant preshuffle tests bqlayout column-major [ROCm/composable_kernel commit: `c363a98d41`]	2025-12-08 13:05:56 -08:00
Erwin Terpstra	7e54399be4	[CK Tile] Grouped GEMM aquant mode and non-persistent kernel (#3337 ) * wip: add aquant to grouped gemm quant example * fix: properly handle hot loop count in aquant pipeline * fix: add separate GemmConfig structs for AQuant, automatically select the correct one * feat: finish support for a non-persistent kernel invocation for grouped gemm quant, and add support code to example * refactor: cleaned up grouped gemm quant example a bit by reusing pipeline selection logic * chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants * feat: add quant grouped gemm tests cases for aquant (regular and transpose C) and non-persistent kernel * fix: update base pipeline classes according to changes in develop branch * Revert "chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants" This reverts commit `b3fd4d326d`. * feat: remove aquant config from grouped gemm quant example, update to add persistency as runtime parameter * chore: removed work-around for aquant bug that has been fixed * chore: fix typo in command-line parameters * fix: correct K warp tile size for gfx950 * chore: incorrect warp tile configuration on gfx942 [ROCm/composable_kernel commit: `fe07b5a1bf`]	2025-12-08 12:19:22 -08:00
Anton Gorenko	9cb42b092a	Add a workaround for a compiler issue for bwd on gfx90a and ROCm 7.1.1 (#3369 ) Sometimes there are not enough wait-states between v_mfma_f32... and v_accvgpr_read_b32 instructions if they are separated by s_cbranch. The workaround is to read accvgprs to vgpr before branching. [ROCm/composable_kernel commit: `ca6143f0b2`]	2025-12-08 07:44:17 -08:00
Yi DING	e63ba15ae2	[CK_TILE] Optimize Flatmm MXFP4 by Eliminating Runtime Division by 2 (#3287 ) * [CK_TILE] Optimize Flatmm MXFP4 by Eliminating Runtime Division by 2 * typo [ROCm/composable_kernel commit: `878b4e7f46`]	2025-12-08 19:20:44 +08:00
Bartłomiej Kocot	13c9c8580f	[CK_BUILDER] Ck Tile Grouped convolution factory (#3352 ) * [BUILDER] Ck Tile Grouped convolution factory * Part 2 * Fixes after rebase * Remove leftovers [ROCm/composable_kernel commit: `04612c30ce`]	2025-12-08 10:32:56 +01:00
yinglu	fc7547a552	ck: add tf32 in `DTYPES` to control instances build(#3317 ) [ROCm/composable_kernel commit: `8fec8054b2`]	2025-12-08 16:24:20 +08:00
Thomas Ning	771f37e4aa	Add the gfx1011 support on CK Tile with the SGPR builtin reading protection (#3350 ) * Finish the fixes * add the gfx1010 support macro * Fix the compilation error [ROCm/composable_kernel commit: `86a84ae611`]	2025-12-05 14:18:30 -08:00
Khushbu Agarwal	5ab9a6cfe4	[CK_Tile] Enable PreshuffleB for 2d block scale Gemm (#3298 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and corresponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * fixing bq tensor calculation --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `6b1bceca7b`]	2025-12-05 09:57:52 -08:00
Illia Silin	12738d2e45	do not build hipblaslt for gfx90a to save time and disk space (#3362 ) [ROCm/composable_kernel commit: `608232ce82`]	2025-12-05 08:39:18 -08:00
Cong Ma	70a8425dfb	Congma/ck tile/aquant mem pipeline (#3346 ) * [CK TILE GEMM QUANT] Fix the bug in HotLoopTail of memory pipeline [ROCm/composable_kernel commit: `ed080f5a56`]	2025-12-05 09:35:27 -07:00
John Shumway	99a748498a	Ignore .cmake-format.yaml (#3356 ) We don't want to add cmake formatting until we are in the super repo, but it's handy if developers want to experiment with formatting. For now we should ignore .cmake-format.yaml. [ROCm/composable_kernel commit: `7541d9b5b0`]	2025-12-05 08:26:00 -08:00
Bartłomiej Kocot	17e2c816c3	Profile resnet layout fixes (#3360 ) [ROCm/composable_kernel commit: `82f796a1f0`]	2025-12-05 08:20:46 -08:00
John Shumway	a157e33311	Simplify includes for CK builder reflection (#3357 ) We only want to import enums and types into the builder reflection code. But, some of the enums are included in much larger files or even big trees of include files. This leads to unintended mixing of code and very confusing interactions and symbol conflicts. We organize the includes and extract two new enum-only headers to help with decoupling in CK. This refactoring is critical if we want to include reflection in a device-operator "describe" method. * Remove a few unnecessary includes from headers in builder/reflect/. * Extract enums scheduler and pipeline to their own headers so they can be used without importing other code. * Order includes alphabetically for better organization. The immediate goal is to unblock reflection integration, and this type of cleanup helps the flexibility and robustness of the CK header library. [ROCm/composable_kernel commit: `f5b0af2272`]	2025-12-05 07:44:10 -08:00
Bartłomiej Kocot	157d2c87db	Add new section to changelog (#3295 ) * Add new section to changelog * Update CHANGELOG.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `35fc7c9e4f`]	2025-12-05 07:14:52 -08:00
jakpiase	54f903def5	fix enforcing fixedvectorsizes for ck tile conv (#3344 ) [ROCm/composable_kernel commit: `f7650ee82b`]	2025-12-05 09:30:22 +01:00
John Shumway	62e5b29702	Clean up conv_traits.hpp (#3354 ) When I asked for a description of operators that didn't have ConvTraits, I was getting very long confusing errors about ConvTraits not being defined. Now we get specific errors explaining which concepts are violated, making it easier to know which code to generalize or update. * Add concepts to conv_traits.hpp to get better error message. * Put the correct requires clauses in the right places to get descriptive error messages. * General cleanup of functions in conv_traits.hpp to make functions easier to read. [ROCm/composable_kernel commit: `13f6d63565`]	2025-12-04 19:12:36 -08:00
Po Yen Chen	d96f632fa1	[CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153 ) * Let fmha_fwd_v3() compatible with fmha_fwd() * Decouple get_fwd_blobs() and FmhaFwdKernel * Decouple compatibility checks from get_fwd_blobs() * Extract product feature checks out from get_fwd_blobs() * Remove duplicated code in factories and redundant checks * Remove FmhaFwdKernel<>::GetName() * Let FmhaFwdApiPool support pipelines with different mask_impl * Add tile setting for fmha fwd v3 pipeline * Add fwd v3 instances to tile_example_fmha_fwd manually * Remove unused function import * Undo irrelevant changes * Remove fwd v3 instances from tile_example_fmha_fwd * Finish fmha fwd v3 kernel instance codegen * Fix formatting * Remove unused F_idx attribute * Add is_generic_attention_mask<> traits * Add constraints to the fmha fwd v3 pipeline * Unify traits & problem used for fmha fwd v3 * Unify kernel launch code for fmha fwd v2 & v3 * Unify kernel template selection logic * Use same kernel codegen template for both v2 & v3 * Rename api() property as render() method * Allow specifying filter for fmha fwd api pool * Allow specifying function name when rendering api pool items * Separate fmha fwd v3 kernel dispatching logic from v2 * Remove lambda assignment * Add simple v2/v3 dispatch logic * Stop generating empty if-clauses Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them. 
* Use "".join() to concatenate fmha fwd api string content * Add more feature checks for fmha fwd v3 pipeline * Check features before dispatch to fmha_fwd_v3() * Add more feature checks for fmha_fwd_v3() * Add missing filter call * Use Tuple to reserve the dtype orders * Fix wrong pipeline matching logic * Add fmha fwd v3 group mode instances * Add functor_transform<> * Add type constraints to make_tile_window() * Remove fmha fwd v3 example * Fix wrong product(aiter mha_fwd()) config * Fix wrong fmha fwd v2/v3 selection logic * Fix formatting * Add comment to warning v3 kernel users * Fix wrong codegen logics * Remove unnecessary param * Fix format --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `05292b3604`]	2025-12-05 10:31:12 +08:00
Illia Silin	96ff482d8d	fix hipblaslt build for different archs (#3358 ) [ROCm/composable_kernel commit: `d1193e8637`]	2025-12-04 18:29:14 -08:00
Max Podkorytov	1f7aa130eb	[CK-Tile] Refactor base pipeline usage (#3251 ) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder [ROCm/composable_kernel commit: `d184eed823`]	2025-12-04 11:45:49 -08:00
spolifroni-amd	c25f2909d0	[composable_kernel] initial draft of the ck tile conceptual doc (#3242 ) * Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc * Fixes for ruff check. * Fixes for ruff check 2. * Fixes for ruff check 3. --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com> [ROCm/composable_kernel commit: `d9d4c9c3df`]	2025-12-04 11:09:21 -08:00
Illia Silin	38076077ab	build latest hipblaslt in ck_pytorch docker (#3347 ) [ROCm/composable_kernel commit: `cd21e20ae7`]	2025-12-04 06:58:42 -08:00
Ville Pietilä	419aa4e420	[CK_BUILDER] Refactor convolution signature to provide data type/layout/elementwise op per tensor (#3331 ) * Separate layouts into separate entities for input, weight, and output tensors. * Add test for handling bias tensor layouts. * Use instance string in builder tests. * Add handling of output bias data types and layouts. * Generalize handling of the elementwise ops. * Test fix. * Create builder for layouts. * Layout builder improvements. * Improve layout builder. * Simplify bias layout handling. * Code clean-up. * Move layout utils into separate file. * Remove hard-coded layout combinations. * Small code clean-up. * Move data type utils into a separate file. * Add data types, layouts, and elementwise ops per conv tensor. * Builder bug fixes after refactoring. * Working baseline. * Make signature definition look nice in the test code. * Move TensorConfig into test implementations. * Fix all fwd conv builder tests. * Fix conv traits and descriptors tests. * More factory assets under a separate directory. * Fix building conv traits. * Fix clang-format. * Add Readme doc to describe the design. * Add link to main Readme. Fix links in the builder design doc. * Clean-up data type/layout/elementwise op conversions. * Switch from dimension and tensor type specific layouts to a flat list of tensor layouts. * Fix clang-formatting. * Fix clang-format for test code. * Simplify fwd conv signature definitions in the test code. * Remove accidental edits. * Fix comment string. * Fix instance factory after rebase. * Fix tests after rebase. * Unify layout handling. * Add more conv layout unit tests. * Clang-format. * Fix merge conflicts. * Improve elementwise op handling. --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `9cb1f421bc`]	2025-12-04 12:58:31 +02:00
arai713	a8f5d21fb8	[CK_TILE] Fix for Moving DataTypeTraits into a Common File (#3335 ) This PR fixes a mismatch caused when PR #3146 was merged out of sync with develop, which made its intended changes ineffective. This PR reapplies those changes to move DataTypeTraits into a common file to mitigate code duplication. Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `583fafc803`]	2025-12-03 22:46:22 -08:00
kensclin	a6f43cf9de	Ck tile/gemm blockscale opt (#3227 ) * GEMM block scale optimization kernel * GEMM block scale optimization kernel * Fix: Apply clang-format for style consistency * Fix: Apply clang-format for style consistency --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `ffc3120f63`]	2025-12-03 22:07:23 -08:00
rocking	228b1e8d87	fp8 fmha async pipeline (#3339 ) * replace qr with async pipeline * Add fp8fp32 to DTYPE_BITS * Add kAlignmentRandVal to avoid compile fail * format --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `eb7f617713`]	2025-12-04 12:18:25 +08:00
JH-Leon-KIM-AMD	250deafb9e	[CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216 ) * LWPCK-4043: Add GPU reference implementations for CK Tile convolution This commit implements GPU-based reference kernels for CK Tile convolution operations to enable faster verification of optimized kernels, especially for large tensors (>2GB). Changes: - Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution - Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data - Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight - Integrate GPU references with test infrastructure (replace -v=2 error) - Support for 1D, 2D, and 3D convolutions - Generic data type support (FP16, BF16, FP32) - Grid-stride loop pattern for scalability The GPU references use a simple, readable implementation that prioritizes correctness over performance. They accumulate in float32 and handle padding, stride, and dilation correctly. * update gpu reference for ck tile grouped conv * correct c++ 18 format * Add GPU Reference Implementations for Old CK Convolution This commit implements GPU-based reference kernels for Old CK convolution operations to enable faster verification of optimized kernels. 
Changes: - Fixed old CK forward GPU reference (naive_conv_fwd.hpp) * Fixed BF16 NaN issue (use type_convert instead of static_cast) * Fixed FP8/BF8 arithmetic (accumulate in float) * Fixed uninitialized variables * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed) - Created backward data GPU reference (naive_conv_bwd_data.hpp) * Implements input gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Created backward weight GPU reference (naive_conv_bwd_weight.hpp) * Implements weight gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Integrated with old CK examples * Forward: 10 XDL examples now support do_verification=2 * Backward data: Integrated with example/17_convnd_bwd_data/ * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only) * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU) Testing: - 50 comprehensive tests created - 42/42 tests passing (100% success rate) - CPU and GPU verification produce identical results - Verified across multiple dimensions, sizes, and data types Limitations: - GPU references support standard convolution only (G=1) - Fused operations (DL variants) not supported - Some tests blocked by optimized kernel size constraints Result: Old CK GPU references can replace CPU references for verification with 50-100x performance improvement for large tensors. * Apply clang-format to old CK GPU reference files * Fix C++17 compatibility: use brace initialization for aggregate types * add get_rtol, get_atl and consistency cout message * Use triple bracket syntax for kernel launch per review feedback Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski. This is more idiomatic HIP/CUDA style and equally correct. All tests still passing after this change. 
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode - Replace manual error checking with HIP_CHECK_ERROR macro - Add v=3 verification mode (GPU ref vs CPU ref direct comparison) - Consistent output format across all examples - All tests passing (7/7 v=3 tests pass for FP16) * Use ConvDims structure to simplify GPU reference kernels Replace 24 individual parameters with ConvDims structure per review feedback. - Add conv_common.hpp with ConvDims and helper function - Update kernel signatures: 24 params → 1 structure - Remove duplicate extraction code from host files * Use get_block_id() and get_thread_id() helpers in CK Tile Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions. Updated 3 CK Tile GPU reference kernels per review feedback. * Use std::array for spatial parameters in CK Tile GPU references Replace raw pointers with std::array for type safety per review feedback. - Add conv_common.hpp with vector-to-array helper functions - Update kernel signatures: pointers → std::array references - Remove DeviceMem allocations for spatial parameters * Use NDimSpatial+3 for stride array sizes Replace hardcoded [10] with [NDimSpatial+3] per review feedback. Array sizes now correctly reflect actual dimensions needed. * Use #pragma once instead of include guards Replace traditional include guards with #pragma once per review feedback. Updated 3 Old CK GPU reference headers. * Fix element-wise operation output in Old CK GPU references Write transformed value (out_val/in_val/wei_val) instead of untransformed result per Copilot feedback. This ensures element-wise operations are correctly applied to output. * Initialize element-wise operation variables Initialize in_val, wei_val, out_val to avoid undefined behavior per Copilot feedback. Updated backward data and backward weight kernels. * Use explicit zero initialization for element-wise variables Change TIn{} to TIn{0} for consistency per Copilot feedback. 
All 3 kernels now use consistent zero initialization. * Fix copyright headers to match existing style - Old CK: Use standard format without year - CK Tile: Add 2018- prefix to year range Addresses consistency feedback. * Rename GPU reference files: add _gpu suffix * Refactor index calculations: use std::array and extract to helper functions * Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `4baa4c9fae`]	2025-12-03 21:14:21 +02:00
Enrico Degregori	faa7f9ae07	Wmma support for gemm_multiply_multiply_wp (#3278 ) * Initial implementation with splitK support * Add gfx11 support * Fix compilation error * Add instances * Add irregular instances * Fix GetBuffer arguments * Minor changes * Address review comments * Fix compilation errors * Fix copyright header [ROCm/composable_kernel commit: `161835533b`]	2025-12-03 07:38:23 -08:00
John Shumway	6a4a1962b8	[CK_BUILDER] Add Description::instance_string() method and update tests (#3340 ) * Create Description::instance_string() function To expose more reflection capabilities in MIOpen, we add the instance_string functionality to the ckr::Description class. This PR introduces a base class, adds the instance_string method, and implements the method by injecting the Traits::instance_string method through the ConvDescription constructor. This will enable us to replace the specialized get_instance_string() method on device operations with a describe() method in a subsequent PR. * Test describe().instance_string() Update the instance string tests to also call `ckr::describe<Instance>().instance_string()`. This documents that the XDL kernels are supported with describe(), but WMMA and DL kernels are not yet supported. Also update namespace and add a HasConvTraits concept. [ROCm/composable_kernel commit: `f29b67cf9b`]	2025-12-03 06:36:09 -08:00
jakpiase	67c2664625	[CK TILE] Add index optimizations for conv bwd weight (#3321 ) [ROCm/composable_kernel commit: `e6a583416b`]	2025-12-03 10:53:46 +01:00
Aviral Goel	5cb0da15ef	feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193 ) * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * feat(gemm_quant): add RRR and CRR layout support for aquant gemm * test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm * fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests * fix: test cases compilation failure due to PR# 2095 * fix: make condition to filter out tests for gfx950 more explicit * need to support the gfx950 * fix: add layout support for gfx950 * Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277) * WIP: add support for pk_int4_t for aquant mode layouts RR and CR * test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant * fix: compile time error for gfx950 * fix: minor bug where is_a_load_tr_v() was missing * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318) * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant * test: add unit tests for new layout support CCRC for aquant block scale gemm * docs: update changelog with new layout support info * Update CHANGELOG.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * refactor: break test instances into multiple cpp files to reduce build time (#3319) * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant * test: add unit tests for new layout support CCRC for aquant block scale gemm * refactor: break test instances into multiple cpp files to reduce build time * chore: rename file for better code readability * fix: merge conflict resolution * fix: remove memory pipeline 
because new layout is not compatible * build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr() * refactor: address review comments * solve the conflict --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `6cb0bc2d11`]	2025-12-02 14:59:07 -08:00
Illia Silin	97fe55be4b	Disable gemm_blockscale_f8 on gfx90a by default. (#3338 ) * disable gemm_blockscale_f8 instances on gfx90a by default * fix cmake logic, disable some cmake output * fix cmake logic [ROCm/composable_kernel commit: `2c284a1780`]	2025-12-02 11:33:33 -08:00
John Shumway	fac57abc38	[CK_BUILDER] Refactor builder factory code. (#3276 ) Refactor the builder factory code into multiple files and subdirectories and a ck_tile::builder::factory namespace. The factory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations. Major changes in this PR: Dispatch logic is explicit in the function make_conv_instance instead of implicit in template specialization selection. Helper code is moved to a subdirectory builder/factory/helpers. Helpers now have unit tests. Factories are moved to their own files. Code moved to namespaces ck_tile::builder::factory and ck_tile::builder::factory::internal. This does not yet fix the problem of bad error messages, but the make_conv_instance function makes the poor error messages clear. The choice of algorithm must be much more robust (perhaps with explicit enumeration in the algorithm descriptor), so that the dispatch doesn't fail. Quality changes: Making dispatch explicit rather than implicit will improve robustness, readability, maintainability, testability, and extensibility. Separating code into separate files and subdirectories helps readability and extensibility. Adding unit tests for helpers documents behavior and will enable more complex logic and functionality. Separating files (especially unit tests) helps clarify includes and dependencies and makes code easier to refactor. [ROCm/composable_kernel commit: `280bc42191`]	2025-12-02 07:40:14 -08:00
Thomas Ning	5b559e7409	disable the gfx90a (#3336 ) [ROCm/composable_kernel commit: `8459d389ad`]	2025-12-02 07:27:37 -08:00

2754 Commits