composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-05 22:22:27 +00:00

Author	SHA1	Message	Date
Johannes Graner	bb8445dca8	[CK] Integrate GPU reference into ckProfiler for convolutions (#3379 ) Refactor and integrate CK GPU references into ckProfiler. - All convolution layouts and groupings supported for all three directions - Unit tests verifying GPU and CPU reference is the same - Support added to profiler (do_verification = 2 enables GPU reference) - One profiler-based test per direction changed to GPU reference to demonstrate usag Closes AICK-427	2025-12-18 07:59:45 +01:00
Enrico Degregori	87dd073887	Wmma support for grouped convolution bwd weight (#2947 ) * Convolution bwd weight device implementation * Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma' Convolution bwd weight device implementation See merge request amd/ai/composable_kernel!38 * Fix bug and disable splitK=-1 tests for wmma * Add generic instances for bf16 f32 bf16 * check gridwise level validity in device impl for 1 stage D0 * Fix bugs in device implementation: - rdna3 compilation error - gridwise layouts (need to be correct to ensure that CheckValidaity() works correctly) * Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization * Remove workaround for 1x1Stride1Pad0 conv specialization * Add instances for xdl parity (for pipeline v1) * Add two stage instances (xdl parity) * Add multiple Ds instances * Add examples * Uncomment scale instances * Fix copyright * Fix examples compilation * Add atomic add float4 * Fix compilation error * Fix instances * Compute tolerances in examples instead of using default ones * Compute tolerances instead of using default ones in bilinear and scale tests * Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma' Grouped conv: Instances and example bwd weight See merge request amd/ai/composable_kernel!47 * Device implementation of explicit gemm for grouped conv bwd weight Based on batched gemm multiple D * Add instances for pipeline v1 and v3 * Add support for occupancy-based splitk * Fix ckProfiler dependencies * Review fixes * Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma' Device implementation of explicit gemm for grouped conv bwd weight See merge request amd/ai/composable_kernel!52 * Fix cmake file for tests * fix clang format * fix instance factory error * Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test. * Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test." This reverts commit `d20c869d3d`. * Disable splitk for 2stage xdl on rdna (bug to be fixed) * Fix add_test_executable * Always ForceThreadTileTransfer for now, WaveTileTransfer does not work for convolution yet. * Grab device and gridwise files from bkp branch, this should enable splitK support for convolution and also we no longer ForceThreadTileTransfer for explicit gemm. Also grab some updates from 7e7243783008b11e904f127ecf1df55ef95e9af2 to fix building on clang20. * Fix bug in various bwd wei device implementations / profiler where the occupancy based split_k value could not be found because the Argument did not derive from ArgumentSplitK, leading to incorrect error tolerances. * Actually print the reason when a device implementation is not supported. * Print number of valid instances in profiler and tests. * Fix clang format for Two Stage implementation * Fix copyright * Address review comments * Fix explicit conv bwd weight struct * Fix gridwise common * Fix gridwise ab scale * Remove autodeduce 1 stage * Restore example tolerance calculation * Fix compilation error * Fix gridwise common * Fix gridwise gemm * Fix typo * Fix splitk * Fix splitk ab scale * Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test. * Reduce instances to only the tuned wmma V3 ones for implicit v1 intra and explicit v1 intra pad/nopad. * Add explicit oddMN support with custom tuned instances * Add two stage instances based on the parameters from the tuned cshuffle V3 instances. CShuffleBlockTranserScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists. * Replace cshuffle non-v3 lists with v3 lists, making sure to not have duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases. * Remove some instances that give incorrect results (f16 NHWGC) * Add bf16 f32 bf16 instances based on tuned b16 NHWGC GKYXC instances. * Add back some generic instances to make sure we have the same shape / layout / datatype support as before the instance selection process. * Add instances for scale and bilinear based on the bf16 NHWGC GKYXC tuning. Keep generic instances for support. * Disable two stage f16 instances which produce incorrect results. * Remove more instances which fail verification, for bf16_f32_bf16 and for f16 scale / bilinear. * Disable all non-generic two-stage instances in the instance lists for NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit. * Remove unused instance lists and related add_x_instance() functions, fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3. * Re-enable all xdl instances (un-16x16-adapted) and dl instances. Remove custom ckProfiler target. * Remove straggler comments * Remove [[maybe_unused]] * Fix clang format * Remove unwanted instances. This includes all instances which are not NHWGCxGKYXC and F16 or BF16 (no mixed in-out types). * Add comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>	2025-12-17 15:58:58 -08:00
linqunAMD	7e93eed878	[ck][gfx12] support contraction on gfx12 (#3421 ) * support contraction on gfx12 * increase tolerance for gfx11 in example contraction the precsion of gfx11 wmma is less than others.	2025-12-15 07:16:01 -08:00
Johannes Graner	3143a5a480	[CK Grouped Gemm] Disable split-k kernel for split-k > 1 with non-contiguous strides (#3405 ) * Disable kernel for split-k > 1 with non-contiguous strides * Update device_grouped_gemm_xdl_splitk_cshuffle.hpp --------- AICK-441 (partial) Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-15 08:03:00 +01:00
John Shumway	9ac51aa0f4	Add describe() method to device ops for runtime introspection (#3375 ) Introduces a polymorphic describe() method to BaseOperator that enables runtime introspection of kernel configurations through a unified interface. Key changes: * Add virtual describe() method to BaseOperator returning Description objects * Implement describe() in 6 device operation classes (conv fwd/bwd variants) * Create conv_describe.hpp with factory function for ConvDescription * Extract type definitions to conv_types.hpp to resolve circular dependencies * Add InstanceStringDescription for kernels without full ConvDescription support Other Improvements: * Update tests to use describe() instead of GetInstanceString() * Remove circular dependency include from conv_traits.hpp * Add ODD_C to ConvFwdSpecialization enum and fix OddC mapping * Replace silent fallback in conv_layout() with compile-time error This provides a foundation for runtime kernel introspection and better tooling support for analyzing and debugging kernel configurations.	2025-12-14 12:49:12 -08:00
Enrico Degregori	b4a34371a6	Fix compilation ab scale multi target (#3413 )	2025-12-12 10:26:47 -08:00
Enrico Degregori	ce99cab605	Wmma support for gemm_ab_scale (#3314 ) * Support gemm_ab_scale: - Add tests - Integrate scaling implementation in multiple D - Generalize existing b_scale for ab_scale - Add instances - Generalize implementation for ScaleBlockM, ScaleBlockN, ScaleBlockK - Add support for all layouts supported by xdl - Fix splitk xdl * Fix copyright * Wmma support for gemm_blockscale_wp (#3315) * Support for preshuffle with ab scale - add support for b preshuffle in GridwiseGemm_wmma_cshuffle_v3_ab_scale - add support for AScaleLayout amnd BScaleLayout (can be different from ALayout and BLayout, respectively) - add Run method in v1 pipeline to support preshuffle + scaling - add support for preshuffle gemms in common invoker - Add splitk support * Fix copyright header	2025-12-11 09:06:20 +01:00
John Shumway	f5b0af2272	Simplify includes for CK builder reflection (#3357 ) We only want to import enums and types into the builder reflection code. But, some of the enums are included in much larger files or even big trees of include files. This leads to unintended mixing of code and very confusing interactions and symbol conflicts. We organize the includes and extract two new enum-only headers to help with decoupling in CK. This refactoring is critical if we want to include reflection in a device-operator "describe" method. * Remove a few unnecessary includes from headers in builder/reflect/. * Extract enums scheduler and pipeline to their own headers so they can be used without importing other code. * Order includes alphabetically for better organization. The immediate goal is to unblock reflection integration, and this type of cleanup helps the flexibility and robustness of the CK header library.	2025-12-05 07:44:10 -08:00
JH-Leon-KIM-AMD	4baa4c9fae	[CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216 ) * LWPCK-4043: Add GPU reference implementations for CK Tile convolution This commit implements GPU-based reference kernels for CK Tile convolution operations to enable faster verification of optimized kernels, especially for large tensors (>2GB). Changes: - Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution - Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data - Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight - Integrate GPU references with test infrastructure (replace -v=2 error) - Support for 1D, 2D, and 3D convolutions - Generic data type support (FP16, BF16, FP32) - Grid-stride loop pattern for scalability The GPU references use a simple, readable implementation that prioritizes correctness over performance. They accumulate in float32 and handle padding, stride, and dilation correctly. * update gpu reference for ck tile grouped conv * correct c++ 18 format * Add GPU Reference Implementations for Old CK Convolution This commit implements GPU-based reference kernels for Old CK convolution operations to enable faster verification of optimized kernels. Changes: - Fixed old CK forward GPU reference (naive_conv_fwd.hpp) * Fixed BF16 NaN issue (use type_convert instead of static_cast) * Fixed FP8/BF8 arithmetic (accumulate in float) * Fixed uninitialized variables * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed) - Created backward data GPU reference (naive_conv_bwd_data.hpp) * Implements input gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Created backward weight GPU reference (naive_conv_bwd_weight.hpp) * Implements weight gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Integrated with old CK examples * Forward: 10 XDL examples now support do_verification=2 * Backward data: Integrated with example/17_convnd_bwd_data/ * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only) * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU) Testing: - 50 comprehensive tests created - 42/42 tests passing (100% success rate) - CPU and GPU verification produce identical results - Verified across multiple dimensions, sizes, and data types Limitations: - GPU references support standard convolution only (G=1) - Fused operations (DL variants) not supported - Some tests blocked by optimized kernel size constraints Result: Old CK GPU references can replace CPU references for verification with 50-100x performance improvement for large tensors. * Apply clang-format to old CK GPU reference files * Fix C++17 compatibility: use brace initialization for aggregate types * add get_rtol, get_atl and consistency cout message * Use triple bracket syntax for kernel launch per review feedback Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski. This is more idiomatic HIP/CUDA style and equally correct. All tests still passing after this change. * Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode - Replace manual error checking with HIP_CHECK_ERROR macro - Add v=3 verification mode (GPU ref vs CPU ref direct comparison) - Consistent output format across all examples - All tests passing (7/7 v=3 tests pass for FP16) * Use ConvDims structure to simplify GPU reference kernels Replace 24 individual parameters with ConvDims structure per review feedback. - Add conv_common.hpp with ConvDims and helper function - Update kernel signatures: 24 params → 1 structure - Remove duplicate extraction code from host files * Use get_block_id() and get_thread_id() helpers in CK Tile Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions. Updated 3 CK Tile GPU reference kernels per review feedback. * Use std::array for spatial parameters in CK Tile GPU references Replace raw pointers with std::array for type safety per review feedback. - Add conv_common.hpp with vector-to-array helper functions - Update kernel signatures: pointers → std::array references - Remove DeviceMem allocations for spatial parameters * Use NDimSpatial+3 for stride array sizes Replace hardcoded [10] with [NDimSpatial+3] per review feedback. Array sizes now correctly reflect actual dimensions needed. * Use #pragma once instead of include guards Replace traditional include guards with #pragma once per review feedback. Updated 3 Old CK GPU reference headers. * Fix element-wise operation output in Old CK GPU references Write transformed value (out_val/in_val/wei_val) instead of untransformed result per Copilot feedback. This ensures element-wise operations are correctly applied to output. * Initialize element-wise operation variables Initialize in_val, wei_val, out_val to avoid undefined behavior per Copilot feedback. Updated backward data and backward weight kernels. * Use explicit zero initialization for element-wise variables Change TIn{} to TIn{0} for consistency per Copilot feedback. All 3 kernels now use consistent zero initialization. * Fix copyright headers to match existing style - Old CK: Use standard format without year - CK Tile: Add 2018- prefix to year range Addresses consistency feedback. * Rename GPU reference files: add _gpu suffix * Refactor index calculations: use std::array and extract to helper functions * Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-03 21:14:21 +02:00
Enrico Degregori	161835533b	Wmma support for gemm_multiply_multiply_wp (#3278 ) * Initial implementation with splitK support * Add gfx11 support * Fix compilation error * Add instances * Add irregular instances * Fix GetBuffer arguments * Minor changes * Address review comments * Fix compilation errors * Fix copyright header	2025-12-03 07:38:23 -08:00
Erwin Terpstra	46f1d740f0	Add grouped gemm instances for RDNA4 (#3237 ) * wip: grouped_gemm implementation based on wmma kernel + example for fp16 * chore: clean up grouped_gem_wmma_splitk_fp16 example * chore: add cmake options to fully disable XDL or WMMA kernels * feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts) * chore: add grouped gemm wmma bf16 example * refactor: reuse more code between instance factory functions * chore: turn test failure if not all batch sizes are supported into a warning * chore: made failing of test on unsupported instances conditional to not break old tests * chore: add log message to failure case where AK1/BK1/KBatch is too high for K value * fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run() * fix: stray comma after parameter list * fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran * chore: update copyright in header comments * nit: minor feebdack * refactor: unified XDL / wma tests * fix: properly disable FP8 instances when ONLY targeting gfx11 * refactor: add v3 suffix to grouped_gemm device struct name * fix: small typos in example code * fix: fully exclude xdl/wmma instances when using the corresponding cmake flags * chore: remove unused destructor and added pipeline support checks to remove unnecessary paths * fix: make sure to not add instance library to group if library was skipped * fix: make sure xdl grouped gemm doesnt fail the new test * fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case * fix: examples not working since dependent types and functions were moved to ck namespace in develop * fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances * chore: replace/add copyright headers with new format	2025-12-01 15:32:10 -08:00
Aviral Goel	de6466481f	chore(copyright): update copyright header for include directory (#3293 )	2025-11-26 11:00:05 -07:00
John Shumway	10a782d846	Fix template parameter macros (#3305 ) Some of the device implementation templates have macros like GridwiseGemmMultiABDTemplateParameters that can cause build errors if multiple files are included together. This error comes up with our builder code. To clean up the macros and make them safer, we follow these follow rules: * Use more specific names to avoid duplication. * Undefine the macro after it is used to avoid leaking out of the file scope. * Use a prefix CK_ on the macro to avoid conflicting with other libraries. * Use all caps with underscores for preprocessor macro names.	2025-11-26 09:48:17 -08:00
lalala-sh	f58bd56e6b	fix static assert (#3178 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-11-20 17:27:05 -08:00
Gavin Zhao	07314ac543	Add support for RDNA1 GPUs (#3220 ) * Allow compilation for RDNA1 (__gfx101__) Signed-off-by: Gavin Zhao <git@gzgz.dev> * More RDNA1 changes Signed-off-by: Gavin Zhao <git@gzgz.dev> * Even more RDNA1 changes Signed-off-by: Gavin Zhao <git@gzgz.dev> * cmake: skip build quantization for unsupported arches * add gfx10-1-generic support as well * add gfx1013 and complete gfx10-1-generic * fix clang format * enable DL kernels on gfx101x --------- Signed-off-by: Gavin Zhao <git@gzgz.dev> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-11-20 10:45:57 -08:00
Aviral Goel	f5ac3ee359	chore(copyright): update copyright header for include directory (#3224 ) * chore(copyright): update copyright header for tile_engine directory * chore(copyright): update copyright header for script directory * chore(copyright): update copyright header for test_data directory * chore(copyright): update copyright header for python directory * chore(copyright): update copyright header for profiler directory * chore(copyright): update copyright header for library directory * chore(copyright): update copyright header for include directory	2025-11-18 10:17:18 -08:00
jefyang1	d30babbd00	Add new gemm multiply multiply instances on gfx950 (#3213 )	2025-11-14 08:20:41 -08:00
yinglu	2a73eb3bc0	Simulate TF32 with BF16x3 (#3142 ) * tf32:bf16x3:use bf16x3 emulate tf32 gemm * change blockwiseGemm to demo bf16x3 * temp push * self review * self review * fix multi-device compile error * bug fix * code refactor * limit to gfx950 * enhance gemm gfx942 threshold * lower change from blockwise to warpwise * refact codes * refact codes * error fix * change threshold * bug fix * fix threshold error * change host reference implement to same as device * bug fix * bug fix * code refact * fix clang-format fail * code refine	2025-11-13 16:21:09 -08:00
Enrico Degregori	7414a0f4d4	Wmma support for gemm_reduce (#3145 ) * Initial implementation GEMM+Reduce: - device struct - epilogue struct * Fix tests, improve profiler and add initial instances * Add instances * Fix compilation error * Address review comments * Fix logging --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-11-12 11:23:54 -08:00
Enrico Degregori	1c544abf57	Extend support for ak1 / bk1 WMMA (#3073 ) * Extend AK1 / BK1 support: - Add support for AK1 != BK1 - Add support for AK1, BK1 > 8 - Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction * fix clang format	2025-11-11 07:38:15 -08:00
Gino Lu	e31a7a4f29	fix MX bpreshuffle gemm B grid descriptor dimension error. (#3170 )	2025-11-06 19:42:39 -08:00
Xudong Yuan	d04eba4ae3	Ck moe mxfp4 blockm32 (#3098 ) * block_m = 32 * ck block_m = 32 * aiter/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_b_preshuffle_mx_moe_v3.hpp format * mxfp4_moe v1 pipe * update format --------- Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-11-07 08:45:41 +08:00
Adam Osewski	b8527a9236	[CK_BUILDER] Convolution traits. (#3152 ) Added: 1. Convolution traits & unit tests 2. Update builder enumerators to have representation of Convolution Kernels properties. 3. Unified builder pipeline version & scheduler enumerators	2025-11-05 08:53:06 -08:00
Illia Silin	930423ab3b	Initialize new variable to prevent c++17 compiler error (#3156 ) * initialize new variable to prevent c++17 compiler error * build for gfx90a using -std=c++17 flag	2025-11-04 18:54:14 -08:00
John Shumway	6dbee64886	[CK_BUILDER] Add backward weight instance traits for xdl cshuffle. (#3143 ) * Add backward weight instance traits for xdl cshuffle. To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file. * Fix copyright notices. * Remove (c) symbol, replace with (C). Having UTF-8 in source caused an error with code generation.	2025-11-04 15:34:00 +01:00
Enrico Degregori	507d81c3af	Fix splitk preshuffle (#3137 ) * Fix splitK multiply_multiply_wp * Add tests for gemm_multiply_multiply_wp * Add tests for gemm_universal_preshuffle (KBatch = 1) * Add tests gemm_blockscale_wp * Fix splitk gemm universal preshuffle * Run new tests on arch supporting fp8 * Restore example * Fix strides profiler * Fix tests * Fix clang format * Finalize profiler preshuffle with tolerances * Minor improvements to splitk related changes * Address review comments: clang format and ckProfiler typo * Remove b_k_split_offset from SplitKBatchOffset struct	2025-11-03 11:59:01 -08:00
Bartłomiej Kocot	ab1a8356b6	Add 2GB limitation for grouped conv bwd weight (#3054 )	2025-11-01 14:16:45 +01:00
Enrico Degregori	4ebc48a3cd	WMMA gemm_add_relu_add_layernorm (#2989 ) * Summary: - Refactor epilogue (with CShuffle) to support fused operations: - EpilogueCShuffleBase holds common parts - EpilogueCShuffle: runs CShuffle and write out - EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out - Extend thread transfer v7r3: - Support for intermediate data type different from src and dst type - New functionality to write to dst buffer and keep data (to be able to use them for additional operations) * Adress review comments	2025-10-31 11:19:26 -07:00
John Shumway	5ed2046bee	Add the last two forward instance traits. (#3134 ) * Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle * Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d * A few small changes to fix broken instance traits.	2025-10-31 07:52:42 -07:00
kabrahamAMD	a7c52e8afa	Kabraham/fix block gemm v1 b scale (#3129 ) * fixed synchronization issue in block gemm pipeline v1 that caused b_scale to fail * run clang-format --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>	2025-10-31 07:19:01 -07:00
John Shumway	cafaeb6b7b	Add instance traits for two more grouped forward convolutions (#3112 )	2025-10-29 16:04:13 +01:00
Bartłomiej Kocot	66bae4306c	Grouped conv fwd with direct load (#3082 ) * Grouped conv fwd with direct load * fix * fix * Add IsSupported check * Fix * fix inductor	2025-10-29 09:54:42 +01:00
Ville Pietilä	1c17bae816	Add name member to CK elementwise operations. (#3102 )	2025-10-27 22:19:29 -07:00
John Shumway	54746e9329	[CK_BUILDER] Test and fix instance traits utils. (#3096 ) * Refactor instance_traits_util and add unit tests tests * Address reviewer comments. Just adds some TODOs to indicate deprecated layouts in our reflection. Our strategy is to leave the reflection code broad (covering deprecated features), but keep the builder concepts narrow. Once we've removed deprecated features from all instances, we can remove them from reflection. Also add a comment to the cmake to explain the unit test target test_conv_builder. * Addressed more reviewer comments. * Remove duplicate PassThrough::name Accidentally added this field to the end of the struct, too. The `name` field should be a the start of the struct for consistency.	2025-10-27 22:14:08 -07:00
Ville Pietilä	6c2ca1211a	[CK_BUILDER] First fwd convolution builder implementation (#3070 ) * Add experimental builder infrastructure for composable_kernel - Add experimental/builder directory with README documentation. - Create initial test infrastructure with CMakeLists.txt and placeholder test. - Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option. - Update .gitignore to not treat `experimental/builder` as a CMake build directory. This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration. * Fix clang formatting. * Fix CMake build infrastructure for experimental builder - Add experimental/builder CMakeLists.txt with proper subdirectory structure - Add placeholder include/ck_tile/builder CMakeLists.txt for header installation - Fix gtest.cmake to use include_guard to prevent multiple inclusions - Update root CMakeLists.txt to include full builder directory instead of just tests * Scope C++20 settingto the test code Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Remove redundant GTest::gtest linkage Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Introduce basic types, and convolution algorithm concepts and limits. * Add convolution signature concepts. * Add convolution factory. * Finalize conv factory implementation for fwd convolutions. * Add type definitions for testing. * Add placeholder test. * Add convolution builder definition. * Fully functional fwd conv builder. * Test improvements. * Clean-up include headers. * Enable the limit checks for the convolution algorithm parameters. * Remove dead code. * clang formatting. * Add more tests and missing conv specialization argument. * clang formatting. * Add explicit handling of the tensor layouts. * Add complete 2D/3D layout support to CK Builder - Add missing 2D layouts: GNHWC_GKYXC_GNHWK, NGCHW_GKCYX_NGKHW - Add missing 3D layout: GNDHWC_GKZYXC_GNDHWK - Add 1D layouts (NWGC, NGCW, GNWC, NGCW_GKCX) for future support - Add 3 tests for new 2D/3D layouts - All tests pass (5/5) * Add tests for remaining 2D/3D layouts - Add test for 2D NGCHW_GKYXC_NGKHW (channels-first) with Filter1x1Stride1Pad0 - Add test for 3D NDHWGC_GKZYXC_NDHWGK (channels-last) - All 7 tests pass (complete coverage for all 2D/3D forward layouts) * Change enum converters to consteval. * 7 tests with pipeline and specialization\| Test # \| Dim \| Type \| Layout \| Pipeline \| Specialization \| \|--------\|-----\|------\|----------------------\|----------\|-------------------------\| \| 1 \| 2D \| BF16 \| NHWGC_GKYXC_NHWGK \| V1 \| DEFAULT \| \| 2 \| 2D \| FP16 \| GNHWC_GKYXC_GNHWK \| V3 \| FILTER_1X1_PAD0 \| \| 3 \| 2D \| FP32 \| NGCHW_GKCYX_NGKHW \| V4 \| FILTER_1X1_STRIDE1_PAD0 \| \| 4 \| 2D \| BF16 \| NHWGC_GKYXC_NHWGK \| V5 \| FILTER_3x3 \| \| 5 \| 3D \| FP32 \| NGCDHW_GKCZYX_NGKDHW \| V1 \| FILTER_1X1_PAD0 \| \| 6 \| 3D \| BF16 \| GNDHWC_GKZYXC_GNDHWK \| V3 \| DEFAULT \| \| 7 \| 3D \| FP16 \| NDHWGC_GKZYXC_NDHWGK \| V4 \| FILTER_1X1_PAD0 \| * Add missing convolution layouts and provide better compile-time error in instance traits. * Fix clang formatting. * Changed I8 -> S8. * Fix signature. * Rename concepts and corresponding members. * Rename LDS related parameters. * Remove ODD_C specialization. Add V2 pipeline. * Add missing types. * Add elementwise operation to the conv signature. * Improve compile-time error message for unsupported elementwise ops. * Separate different fwd conv builder tests into separate compilation units. * Fix layout to string and add name to old CK PassThrough elementwise op. * Enable both CK and CK Tile tensor layouts in instance traits. * Fix clang-format. --------- Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: John Shumway <john.shumwayjr@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: JH-Leon-KIM-AMD <jeonghyun.kim@amd.com>	2025-10-27 20:09:24 +02:00
yinglu	6bbc05e1bd	conv:tf32:add missed instances (#3081 ) * conv:tf32:add missed instances	2025-10-24 16:28:36 +08:00
John Shumway	37dff024c1	[CK_BUILDER] Add compile-time reflection for a convolution instance (#3065 ) * [CK_BILDER] Add compile-time reflection for a convolution instance Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template pattern matching. In particular, the GetInstanceString method returns a string that uniquely idenitfies the kernel, by explicitly serializing all template paramter values. This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous, and only captures some of the template paramters. The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface. Key architectural decisions: - Forward-declare device kernels in instance_traits.hpp to avoid circular dependencies, since device implementation headers will include the reflection headers - Use compile-time constants and type aliases to expose kernel parameters, enabling zero-overhead introspection - Provide a templated instance_string() function that generates human-readable kernel configuration strings by serializing all template parameters in order, useful for debugging and kernel identification - Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep it opt-in until the API stabilizes - Add GetInstanceString() virtual method to BaseOperator, allowing runtime polymorphic access to compile-time kernel information This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically. Includes unit tests validating both the trait extraction accuracy and the string generation format.	2025-10-21 21:10:19 -07:00
Bartłomiej Kocot	3a28632b20	Gridwise gemm conv v3 force padded layout on gfx950 (#2961 ) * Gridwise gemm conv v3 force padded layout on gfx950 * fix bug in other gridwise * fix * Update gridwise_gemm_wmma_cshuffle_v3_common.hpp	2025-10-21 15:41:02 +02:00
Ville Pietilä	7e44b845b5	Fixed handling of split-K autodeduce argument for grouped convolution (#3024 ) * Fix handling of split-K autodeduce argument. * Fix clang formatting. * Test fix. * Fix clang formatting.	2025-10-17 15:36:39 +03:00
Enrico Degregori	440358c168	Wave Tile Transfer supporting global load with transpose (#3027 ) * Initial implementation: - add new thread group transfer supporting transpose instruction - refactor AB transfer to switch between thread and wave tiles methods * Add some comments and remove explicit wave and lane calculations * Remove compiler option for performance * fp16 example: use tuned instance * Missing cleanup * Integrate wave transfer in existing gemm and batched gemm instances * Add fast instances * extend implementation for 8 bit datatypes packed types not supported * Address review comments * Optimize pipeline v1 and re-introduce compiler option * Disable wave tile approach for b scale gemm * Fix for clang20 * Avoid code duplication of amd_global_load_transpose_to_vgpr function	2025-10-16 11:33:56 -07:00
kabrahamAMD	c4b2da9cbd	implement device batched gemm b scale for wmma (#2825 ) * rebased on top of develop * fixed missing shuffeling and wrong indexing * added tests for batched_b_scale * added missing files * fixed wrong stride computation and removed k batching (for now) due to precision issues * reinstated k-batching with PRNG constrained to -1..1 * added specialization of GeneratorTensor_3 for int4 and fixed internal overflow * added k-batching to reference and increased tolerances for test * changed gemm_b_scale and gemm_universal tests to use correct parameters * adressed review commentsd * ported fixes back to non-batched version of b_scale * adressed review comments * run clang-format on older commits * add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior * added newline at end of file * reflected changes from muitl-abd branch in batched b_scale * fixed gfx11 issue * changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed * run clang format * set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested. * reduced range for pk_i4 even further to 0..0 * removed failing xld instances. Failure now uncovered now that tests were fixed * removed generation of int4 values entierly * divide B buffer by BPackedSize --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>	2025-10-16 11:00:42 -07:00
yinglu	fada1a3cae	Conv:TF32: add more instances - 2 (#2879 ) * add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * tf32:conv:add instances for base class DeviceConvFwd * tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD * tf32:conv:add instances for base class DeviceGroupedConvBwdWeight * add tf32 in profiler * remove gnhwc/ngchw/ngcdhw instances * remove non-ndhwgc/nhwgc/nhwc instances * add check in IsSupportedArgument()	2025-10-10 15:28:17 +08:00
Sami Remes	9d4bfe3932	Add KBatch support for gemm_ab_scale (#2740 ) * Add KBatch support for gemm_ab_scale * Revert kernel parameters change * Remove printing * fix formatting * fix check * Use {} in if --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-10-09 08:33:16 +02:00
Illia Silin	4c98535456	fix compilation errors on RHEL8 and SLES15 (#2967 )	2025-10-03 07:08:49 -07:00
Thomas Ning	cadafde722	add the check of granularity for atomic add (#2959 )	2025-10-02 11:15:24 -07:00
yinglu	0f04f020d9	fix:tf32:fix build fail for all supported targets (#2942 ) * fix:tf32:fix build fail for all supported targets * new fix code	2025-09-29 08:04:11 -07:00
linqunAMD	769c58f133	[CK] Fix example_grouped_conv_bwd_data_xdl_fp16 with ksplit = 2 (#2943 ) root cause: AK1 and BK1 may different in class template. so we need calculate k0 per block separately when ksplit is not 1.	2025-09-29 07:56:33 -07:00
Bartłomiej Kocot	5477811670	Grouped Conv Bwd Data out index calculation optimizations (#2917 ) * Grouped Conv Bwd Data index calculation optimizations * fixes * refactor instances * gfx12 fixes * temporary disable splitK for gfx12	2025-09-29 15:59:11 +02:00
emezh	db2524be2d	Verify `HostTensorDescriptor` when it is created (#2829 ) * add proper GEMM layout verification * Handle "auto" strides. CalculateStrides only called when tensor's strides are empty or all of them are <=0 (auto strides). CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims. ValidateStrides throws if any of the tensor's strides is <=0. profile_gemm_multiply_add updated to support "auto" strides for tensors. Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes) auto-strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1 Note, -1 should be deprecated (use 0 instead) explicit strides (same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128 explicit strides (not the same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 mix of explicit and auto strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0 invalid stride bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64 terminate called after throwing an instance of 'std::runtime_error' what(): Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1 Aborted (core dumped) * - add more names to ck::tensor_layout for easier namespace hierarchy checking - updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp * added handling of partially initialized strides for GEMM. fixed more tests. * clang-format and more fixes * replace long dash by a simple hyphen - causes build failure in CK codegen. * increase sizeof input, otherwise output size becomes zero or negative with large filter size * select stride based on layout * specify layout explicitly to avoid errors in HostTensorDescriptor creation * add validation for higher GEMM tensor dimensions.; Add docstring to `HostTensorDescriptor` * Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors * fix test (incl removing invalid config) * fix moe examples: - (in .cpp) add layout argument to non-2D tensors - (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index) * fix moe_gemm2 example. * fix profile and wmma examples * clean-up early mods for ckprofile. verified with: ``` ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 # ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128 # ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0 # ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0 # not implemented ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128 # ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138 # example_gemm_add_multiply_dl_fp16 example_gemm_add_multiply_xdl_fp16 # ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0 ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128 ``` * temporary skip first 8 test configs - they throw error * temporary skip first 8 test configs in wmma too - they throw error --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-09-25 18:22:13 -07:00
yinglu	df97a286d5	Conv:TF32: add more instances - 1 (#2867 ) * conv:tf32:add more instances * add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * remove gnhwc/ngchw/ngcdhw instances	2025-09-25 09:27:18 +08:00

1 2 3 4 5 ...

610 Commits