composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-11 09:40:51 +00:00

Author	SHA1	Message	Date
Erwin Terpstra	90bebdb065	Add grouped gemm instances for RDNA4 (#3237 ) * wip: grouped_gemm implementation based on wmma kernel + example for fp16 * chore: clean up grouped_gem_wmma_splitk_fp16 example * chore: add cmake options to fully disable XDL or WMMA kernels * feat: add tests for grouped gemm wmma instances for f16 and bf16 (all layouts) * chore: add grouped gemm wmma bf16 example * refactor: reuse more code between instance factory functions * chore: turn test failure if not all batch sizes are supported into a warning * chore: made failing of test on unsupported instances conditional to not break old tests * chore: add log message to failure case where AK1/BK1/KBatch is too high for K value * fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run() * fix: stray comma after parameter list * fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being run * chore: update copyright in header comments * nit: minor feedback * refactor: unified XDL / wmma tests * fix: properly disable FP8 instances when ONLY targeting gfx11 * refactor: add v3 suffix to grouped_gemm device struct name * fix: small typos in example code * fix: fully exclude xdl/wmma instances when using the corresponding cmake flags * chore: remove unused destructor and added pipeline support checks to remove unnecessary paths * fix: make sure to not add instance library to group if library was skipped * fix: make sure xdl grouped gemm doesnt fail the new test * fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case * fix: examples not working since dependent types and functions were moved to ck namespace in develop * fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances * chore: replace/add copyright headers with new format [ROCm/composable_kernel commit: `46f1d740f0`]	2025-12-01 15:32:10 -08:00
Aviral Goel	216c23b945	chore(copyright): update copyright header for include directory (#3293 ) [ROCm/composable_kernel commit: `de6466481f`]	2025-11-26 11:00:05 -07:00
John Shumway	90e0eb4dfc	Fix template parameter macros (#3305 ) Some of the device implementation templates have macros like GridwiseGemmMultiABDTemplateParameters that can cause build errors if multiple files are included together. This error comes up with our builder code. To clean up the macros and make them safer, we follow these rules: * Use more specific names to avoid duplication. * Undefine the macro after it is used to avoid leaking out of the file scope. * Use a prefix CK_ on the macro to avoid conflicting with other libraries. * Use all caps with underscores for preprocessor macro names. [ROCm/composable_kernel commit: `10a782d846`]	2025-11-26 09:48:17 -08:00
lalala-sh	ba44e7b7a4	fix static assert (#3178 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `f58bd56e6b`]	2025-11-20 17:27:05 -08:00
Gavin Zhao	d80f38f77f	Add support for RDNA1 GPUs (#3220 ) * Allow compilation for RDNA1 (__gfx101__) Signed-off-by: Gavin Zhao <git@gzgz.dev> * More RDNA1 changes Signed-off-by: Gavin Zhao <git@gzgz.dev> * Even more RDNA1 changes Signed-off-by: Gavin Zhao <git@gzgz.dev> * cmake: skip build quantization for unsupported arches * add gfx10-1-generic support as well * add gfx1013 and complete gfx10-1-generic * fix clang format * enable DL kernels on gfx101x --------- Signed-off-by: Gavin Zhao <git@gzgz.dev> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `07314ac543`]	2025-11-20 10:45:57 -08:00
Aviral Goel	902250eab3	chore(copyright): update copyright header for include directory (#3224 ) * chore(copyright): update copyright header for tile_engine directory * chore(copyright): update copyright header for script directory * chore(copyright): update copyright header for test_data directory * chore(copyright): update copyright header for python directory * chore(copyright): update copyright header for profiler directory * chore(copyright): update copyright header for library directory * chore(copyright): update copyright header for include directory [ROCm/composable_kernel commit: `f5ac3ee359`]	2025-11-18 10:17:18 -08:00
jefyang1	72dbbc7d77	Add new gemm multiply multiply instances on gfx950 (#3213 ) [ROCm/composable_kernel commit: `d30babbd00`]	2025-11-14 08:20:41 -08:00
yinglu	bdbe3e4eb9	Simulate TF32 with BF16x3 (#3142 ) * tf32:bf16x3:use bf16x3 emulate tf32 gemm * change blockwiseGemm to demo bf16x3 * temp push * self review * self review * fix multi-device compile error * bug fix * code refactor * limit to gfx950 * enhance gemm gfx942 threshold * lower change from blockwise to warpwise * refact codes * refact codes * error fix * change threshold * bug fix * fix threshold error * change host reference implement to same as device * bug fix * bug fix * code refact * fix clang-format fail * code refine [ROCm/composable_kernel commit: `2a73eb3bc0`]	2025-11-13 16:21:09 -08:00
Enrico Degregori	e00db44d0c	Wmma support for gemm_reduce (#3145 ) * Initial implementation GEMM+Reduce: - device struct - epilogue struct * Fix tests, improve profiler and add initial instances * Add instances * Fix compilation error * Address review comments * Fix logging --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `7414a0f4d4`]	2025-11-12 11:23:54 -08:00
Enrico Degregori	8e23284922	Extend support for ak1 / bk1 WMMA (#3073 ) * Extend AK1 / BK1 support: - Add support for AK1 != BK1 - Add support for AK1, BK1 > 8 - Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction * fix clang format [ROCm/composable_kernel commit: `1c544abf57`]	2025-11-11 07:38:15 -08:00
Gino Lu	0344170dac	fix MX bpreshuffle gemm B grid descriptor dimension error. (#3170 ) [ROCm/composable_kernel commit: `e31a7a4f29`]	2025-11-06 19:42:39 -08:00
Xudong Yuan	6e40562dff	Ck moe mxfp4 blockm32 (#3098 ) * block_m = 32 * ck block_m = 32 * aiter/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_b_preshuffle_mx_moe_v3.hpp format * mxfp4_moe v1 pipe * update format --------- Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `d04eba4ae3`]	2025-11-07 08:45:41 +08:00
Adam Osewski	54409e7fb5	[CK_BUILDER] Convolution traits. (#3152 ) Added: 1. Convolution traits & unit tests 2. Update builder enumerators to have representation of Convolution Kernels properties. 3. Unified builder pipeline version & scheduler enumerators [ROCm/composable_kernel commit: `b8527a9236`]	2025-11-05 08:53:06 -08:00
Illia Silin	bb4b6e5961	Initialize new variable to prevent c++17 compiler error (#3156 ) * initialize new variable to prevent c++17 compiler error * build for gfx90a using -std=c++17 flag [ROCm/composable_kernel commit: `930423ab3b`]	2025-11-04 18:54:14 -08:00
John Shumway	52204ff4e5	[CK_BUILDER] Add backward weight instance traits for xdl cshuffle. (#3143 ) * Add backward weight instance traits for xdl cshuffle. To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file. * Fix copyright notices. * Remove (c) symbol, replace with (C). Having UTF-8 in source caused an error with code generation. [ROCm/composable_kernel commit: `6dbee64886`]	2025-11-04 15:34:00 +01:00
Enrico Degregori	9575bcd099	Fix splitk preshuffle (#3137 ) * Fix splitK multiply_multiply_wp * Add tests for gemm_multiply_multiply_wp * Add tests for gemm_universal_preshuffle (KBatch = 1) * Add tests gemm_blockscale_wp * Fix splitk gemm universal preshuffle * Run new tests on arch supporting fp8 * Restore example * Fix strides profiler * Fix tests * Fix clang format * Finalize profiler preshuffle with tolerances * Minor improvements to splitk related changes * Address review comments: clang format and ckProfiler typo * Remove b_k_split_offset from SplitKBatchOffset struct [ROCm/composable_kernel commit: `507d81c3af`]	2025-11-03 11:59:01 -08:00
Bartłomiej Kocot	b2aa37f3f5	Add 2GB limitation for grouped conv bwd weight (#3054 ) [ROCm/composable_kernel commit: `ab1a8356b6`]	2025-11-01 14:16:45 +01:00
Enrico Degregori	e6be7bcc2a	WMMA gemm_add_relu_add_layernorm (#2989 ) * Summary: - Refactor epilogue (with CShuffle) to support fused operations: - EpilogueCShuffleBase holds common parts - EpilogueCShuffle: runs CShuffle and write out - EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out - Extend thread transfer v7r3: - Support for intermediate data type different from src and dst type - New functionality to write to dst buffer and keep data (to be able to use them for additional operations) * Address review comments [ROCm/composable_kernel commit: `4ebc48a3cd`]	2025-10-31 11:19:26 -07:00
John Shumway	a8a377ca53	Add the last two forward instance traits. (#3134 ) * Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle * Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d * A few small changes to fix broken instance traits. [ROCm/composable_kernel commit: `5ed2046bee`]	2025-10-31 07:52:42 -07:00
kabrahamAMD	b7429e620c	Kabraham/fix block gemm v1 b scale (#3129 ) * fixed synchronization issue in block gemm pipeline v1 that caused b_scale to fail * run clang-format --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `a7c52e8afa`]	2025-10-31 07:19:01 -07:00
John Shumway	361a4c6e23	Add instance traits for two more grouped forward convolutions (#3112 ) [ROCm/composable_kernel commit: `cafaeb6b7b`]	2025-10-29 16:04:13 +01:00
Bartłomiej Kocot	cd30313161	Grouped conv fwd with direct load (#3082 ) * Grouped conv fwd with direct load * fix * fix * Add IsSupported check * Fix * fix inductor [ROCm/composable_kernel commit: `66bae4306c`]	2025-10-29 09:54:42 +01:00
Ville Pietilä	f0b6fdcadb	Add name member to CK elementwise operations. (#3102 ) [ROCm/composable_kernel commit: `1c17bae816`]	2025-10-27 22:19:29 -07:00
John Shumway	03c97c9524	[CK_BUILDER] Test and fix instance traits utils. (#3096 ) * Refactor instance_traits_util and add unit tests * Address reviewer comments. Just adds some TODOs to indicate deprecated layouts in our reflection. Our strategy is to leave the reflection code broad (covering deprecated features), but keep the builder concepts narrow. Once we've removed deprecated features from all instances, we can remove them from reflection. Also add a comment to the cmake to explain the unit test target test_conv_builder. * Addressed more reviewer comments. * Remove duplicate PassThrough::name Accidentally added this field to the end of the struct, too. The `name` field should be at the start of the struct for consistency. [ROCm/composable_kernel commit: `54746e9329`]	2025-10-27 22:14:08 -07:00
Ville Pietilä	e1e96b89fa	[CK_BUILDER] First fwd convolution builder implementation (#3070 ) * Add experimental builder infrastructure for composable_kernel - Add experimental/builder directory with README documentation. - Create initial test infrastructure with CMakeLists.txt and placeholder test. - Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option. - Update .gitignore to not treat `experimental/builder` as a CMake build directory. This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration. * Fix clang formatting. * Fix CMake build infrastructure for experimental builder - Add experimental/builder CMakeLists.txt with proper subdirectory structure - Add placeholder include/ck_tile/builder CMakeLists.txt for header installation - Fix gtest.cmake to use include_guard to prevent multiple inclusions - Update root CMakeLists.txt to include full builder directory instead of just tests * Scope C++20 settingto the test code Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Remove redundant GTest::gtest linkage Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Introduce basic types, and convolution algorithm concepts and limits. * Add convolution signature concepts. * Add convolution factory. * Finalize conv factory implementation for fwd convolutions. * Add type definitions for testing. * Add placeholder test. * Add convolution builder definition. * Fully functional fwd conv builder. * Test improvements. * Clean-up include headers. * Enable the limit checks for the convolution algorithm parameters. * Remove dead code. * clang formatting. * Add more tests and missing conv specialization argument. * clang formatting. * Add explicit handling of the tensor layouts. 
* Add complete 2D/3D layout support to CK Builder - Add missing 2D layouts: GNHWC_GKYXC_GNHWK, NGCHW_GKCYX_NGKHW - Add missing 3D layout: GNDHWC_GKZYXC_GNDHWK - Add 1D layouts (NWGC, NGCW, GNWC, NGCW_GKCX) for future support - Add 3 tests for new 2D/3D layouts - All tests pass (5/5) * Add tests for remaining 2D/3D layouts - Add test for 2D NGCHW_GKYXC_NGKHW (channels-first) with Filter1x1Stride1Pad0 - Add test for 3D NDHWGC_GKZYXC_NDHWGK (channels-last) - All 7 tests pass (complete coverage for all 2D/3D forward layouts) * Change enum converters to consteval. * 7 tests with pipeline and specialization\| Test # \| Dim \| Type \| Layout \| Pipeline \| Specialization \| \|--------\|-----\|------\|----------------------\|----------\|-------------------------\| \| 1 \| 2D \| BF16 \| NHWGC_GKYXC_NHWGK \| V1 \| DEFAULT \| \| 2 \| 2D \| FP16 \| GNHWC_GKYXC_GNHWK \| V3 \| FILTER_1X1_PAD0 \| \| 3 \| 2D \| FP32 \| NGCHW_GKCYX_NGKHW \| V4 \| FILTER_1X1_STRIDE1_PAD0 \| \| 4 \| 2D \| BF16 \| NHWGC_GKYXC_NHWGK \| V5 \| FILTER_3x3 \| \| 5 \| 3D \| FP32 \| NGCDHW_GKCZYX_NGKDHW \| V1 \| FILTER_1X1_PAD0 \| \| 6 \| 3D \| BF16 \| GNDHWC_GKZYXC_GNDHWK \| V3 \| DEFAULT \| \| 7 \| 3D \| FP16 \| NDHWGC_GKZYXC_NDHWGK \| V4 \| FILTER_1X1_PAD0 \| * Add missing convolution layouts and provide better compile-time error in instance traits. * Fix clang formatting. * Changed I8 -> S8. * Fix signature. * Rename concepts and corresponding members. * Rename LDS related parameters. * Remove ODD_C specialization. Add V2 pipeline. * Add missing types. * Add elementwise operation to the conv signature. * Improve compile-time error message for unsupported elementwise ops. * Separate different fwd conv builder tests into separate compilation units. * Fix layout to string and add name to old CK PassThrough elementwise op. * Enable both CK and CK Tile tensor layouts in instance traits. * Fix clang-format. 
--------- Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: John Shumway <john.shumwayjr@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: JH-Leon-KIM-AMD <jeonghyun.kim@amd.com> [ROCm/composable_kernel commit: `6c2ca1211a`]	2025-10-27 20:09:24 +02:00
yinglu	6a7861bbec	conv:tf32:add missed instances (#3081 ) * conv:tf32:add missed instances [ROCm/composable_kernel commit: `6bbc05e1bd`]	2025-10-24 16:28:36 +08:00
John Shumway	a488126d3e	[CK_BUILDER] Add compile-time reflection for a convolution instance (#3065 ) * [CK_BUILDER] Add compile-time reflection for a convolution instance Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template pattern matching. In particular, the GetInstanceString method returns a string that uniquely identifies the kernel, by explicitly serializing all template parameter values. This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous, and only captures some of the template parameters. The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface. 
Key architectural decisions: - Forward-declare device kernels in instance_traits.hpp to avoid circular dependencies, since device implementation headers will include the reflection headers - Use compile-time constants and type aliases to expose kernel parameters, enabling zero-overhead introspection - Provide a templated instance_string() function that generates human-readable kernel configuration strings by serializing all template parameters in order, useful for debugging and kernel identification - Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep it opt-in until the API stabilizes - Add GetInstanceString() virtual method to BaseOperator, allowing runtime polymorphic access to compile-time kernel information This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically. Includes unit tests validating both the trait extraction accuracy and the string generation format. [ROCm/composable_kernel commit: `37dff024c1`]	2025-10-21 21:10:19 -07:00
Bartłomiej Kocot	ebd8495721	Gridwise gemm conv v3 force padded layout on gfx950 (#2961 ) * Gridwise gemm conv v3 force padded layout on gfx950 * fix bug in other gridwise * fix * Update gridwise_gemm_wmma_cshuffle_v3_common.hpp [ROCm/composable_kernel commit: `3a28632b20`]	2025-10-21 15:41:02 +02:00
Ville Pietilä	bc3a91d23f	Fixed handling of split-K autodeduce argument for grouped convolution (#3024 ) * Fix handling of split-K autodeduce argument. * Fix clang formatting. * Test fix. * Fix clang formatting. [ROCm/composable_kernel commit: `7e44b845b5`]	2025-10-17 15:36:39 +03:00
Enrico Degregori	6066662785	Wave Tile Transfer supporting global load with transpose (#3027 ) * Initial implementation: - add new thread group transfer supporting transpose instruction - refactor AB transfer to switch between thread and wave tiles methods * Add some comments and remove explicit wave and lane calculations * Remove compiler option for performance * fp16 example: use tuned instance * Missing cleanup * Integrate wave transfer in existing gemm and batched gemm instances * Add fast instances * extend implementation for 8 bit datatypes packed types not supported * Address review comments * Optimize pipeline v1 and re-introduce compiler option * Disable wave tile approach for b scale gemm * Fix for clang20 * Avoid code duplication of amd_global_load_transpose_to_vgpr function [ROCm/composable_kernel commit: `440358c168`]	2025-10-16 11:33:56 -07:00
kabrahamAMD	b085a51b44	implement device batched gemm b scale for wmma (#2825 ) * rebased on top of develop * fixed missing shuffling and wrong indexing * added tests for batched_b_scale * added missing files * fixed wrong stride computation and removed k batching (for now) due to precision issues * reinstated k-batching with PRNG constrained to -1..1 * added specialization of GeneratorTensor_3 for int4 and fixed internal overflow * added k-batching to reference and increased tolerances for test * changed gemm_b_scale and gemm_universal tests to use correct parameters * addressed review comments * ported fixes back to non-batched version of b_scale * addressed review comments * run clang-format on older commits * add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior * added newline at end of file * reflected changes from multi-abd branch in batched b_scale * fixed gfx11 issue * changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed) * run clang format * set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested. * reduced range for pk_i4 even further to 0..0 * removed failing xdl instances. Failures are uncovered now that tests were fixed * removed generation of int4 values entirely * divide B buffer by BPackedSize --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `c4b2da9cbd`]	2025-10-16 11:00:42 -07:00
yinglu	c1780cfebe	Conv:TF32: add more instances - 2 (#2879 ) * add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * tf32:conv:add instances for base class DeviceConvFwd * tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD * tf32:conv:add instances for base class DeviceGroupedConvBwdWeight * add tf32 in profiler * remove gnhwc/ngchw/ngcdhw instances * remove non-ndhwgc/nhwgc/nhwc instances * add check in IsSupportedArgument() [ROCm/composable_kernel commit: `fada1a3cae`]	2025-10-10 15:28:17 +08:00
Sami Remes	e7ef841a68	Add KBatch support for gemm_ab_scale (#2740 ) * Add KBatch support for gemm_ab_scale * Revert kernel parameters change * Remove printing * fix formatting * fix check * Use {} in if --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `9d4bfe3932`]	2025-10-09 08:33:16 +02:00
Illia Silin	594b16ea3c	fix compilation errors on RHEL8 and SLES15 (#2967 ) [ROCm/composable_kernel commit: `4c98535456`]	2025-10-03 07:08:49 -07:00
Thomas Ning	1f65936567	add the check of granularity for atomic add (#2959 ) [ROCm/composable_kernel commit: `cadafde722`]	2025-10-02 11:15:24 -07:00
yinglu	75888c151d	fix:tf32:fix build fail for all supported targets (#2942 ) * fix:tf32:fix build fail for all supported targets * new fix code [ROCm/composable_kernel commit: `0f04f020d9`]	2025-09-29 08:04:11 -07:00
linqunAMD	dde91b60fb	[CK] Fix example_grouped_conv_bwd_data_xdl_fp16 with ksplit = 2 (#2943 ) Root cause: AK1 and BK1 may differ in the class template, so k0 per block must be calculated separately when ksplit is not 1. [ROCm/composable_kernel commit: `769c58f133`]	2025-09-29 07:56:33 -07:00
Bartłomiej Kocot	1d9ec09cf2	Grouped Conv Bwd Data out index calculation optimizations (#2917 ) * Grouped Conv Bwd Data index calculation optimizations * fixes * refactor instances * gfx12 fixes * temporary disable splitK for gfx12 [ROCm/composable_kernel commit: `5477811670`]	2025-09-29 15:59:11 +02:00
emezh	f628be2ed1	Verify `HostTensorDescriptor` when it is created (#2829 ) * add proper GEMM layout verification * Handle "auto" strides. CalculateStrides only called when tensor's strides are empty or all of them are <=0 (auto strides). CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims. ValidateStrides throws if any of the tensor's strides is <=0. profile_gemm_multiply_add updated to support "auto" strides for tensors. Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes) auto-strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1 Note, -1 should be deprecated (use 0 instead) explicit strides (same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128 explicit strides (not the same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 mix of explicit and auto strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0 invalid stride bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64 terminate called after throwing an instance of 'std::runtime_error' what(): Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1 Aborted (core dumped) * - add more names to ck::tensor_layout for easier namespace hierarchy checking - updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp * added handling of partially initialized strides for GEMM. fixed more tests. 
* clang-format and more fixes * replace long dash by a simple hyphen - causes build failure in CK codegen. * increase sizeof input, otherwise output size becomes zero or negative with large filter size * select stride based on layout * specify layout explicitly to avoid errors in HostTensorDescriptor creation * add validation for higher GEMM tensor dimensions.; Add docstring to `HostTensorDescriptor` * Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors * fix test (incl removing invalid config) * fix moe examples: - (in .cpp) add layout argument to non-2D tensors - (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index) * fix moe_gemm2 example. * fix profile and wmma examples * clean-up early mods for ckprofile. verified with: ``` ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 # ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128 # ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0 # ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0 # not implemented ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128 # ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 
0 ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138 # example_gemm_add_multiply_dl_fp16 example_gemm_add_multiply_xdl_fp16 # ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0 ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128 ``` * temporary skip first 8 test configs - they throw error * temporary skip first 8 test configs in wmma too - they throw error --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `db2524be2d`]	2025-09-25 18:22:13 -07:00
yinglu	9cb95d4bc2	Conv:TF32: add more instances - 1 (#2867 ) * conv:tf32:add more instances * add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * remove gnhwc/ngchw/ngcdhw instances [ROCm/composable_kernel commit: `df97a286d5`]	2025-09-25 09:27:18 +08:00
linqunAMD	e338ee5004	[CK] Fix misc issues in CK examples (#2890 ) * [CK] Fix misc CK issues * revert fp8 change, it causes CI fail. * resubmit fp8 change [ROCm/composable_kernel commit: `f076f207ce`]	2025-09-24 11:28:20 -07:00
Illia Silin	c143f0305c	Upgrade to ROCm7.0.1 compiler. (#2909 ) * upgrade default docker to rocm7.0.1 * turn on build and test on gfx950 by default * use rocm-dev instead of rocm * link libhiprtc for codegen targets * resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards --------- Co-authored-by: Astha Rai <astha.rai713@gmail.com> [ROCm/composable_kernel commit: `8fe3838c65`]	2025-09-24 10:00:53 -07:00
Enrico Degregori	0b149c8695	Wmma support for multiple ABD GEMM (#2803 ) * multi_abd wmma support: - Add multiple A and B support to multiple D implementation (gridwise level) - Add multi_abd GEMM (device level) - Add instances (xdl parity) - Add tests (both xdl and wmma) - Add examples - Add ckProfiler support (both xdl and wmma) * Fix bug in device print function * Fix unused template parameter * Fix batched gemm for multiABD gridwise implementation * Fix gemm_universal_reduce with multiABDs gridwise implementation --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3d29bff2f0`]	2025-09-22 18:49:06 -07:00
Bartłomiej Kocot	67d51ed019	Disable bwd weight split-k autodeduce for single stage kernels (#2856 ) * Disable bwd weight split-k autodeduce for single stage kernels * update interface tests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `29446da1d5`]	2025-09-19 16:27:50 +02:00
yinglu	3f44e675e4	TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 ) * Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)" This reverts commit 82da15ffa430a297fb072d0a15b3ada5753f69b1. * fix compile error on gf12x * only run tf32 example on gfx942 * only build tf32 instance on gfx942 * ckProfiler:only support tf32 in gfx942 * delete unuseful messages [ROCm/composable_kernel commit: `dd7af118d7`]	2025-09-17 14:50:15 -07:00
Wojciech Laskowski	1f1d11e933	Added wmma support for gemm quantization: (#2841 ) - profiler for gemm quantization for DL/XDL - tests for gemm quantization for DL/XDL - implementation for gemm quantization for WMMA - profiler/tests for gemm quantization for WMMA Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `f97b2a3f5d`]	2025-09-16 16:23:29 -07:00
Bartłomiej Kocot	e0b4bf70ca	Disable GridwiseOp prints if env var is off (#2843 ) * Disable GridwiseOp prints if env var is off * Fixes [ROCm/composable_kernel commit: `671adb59c5`]	2025-09-16 17:47:28 +02:00
Illia Silin	8cbf571d53	Revert "feature:tf32:add initial conv3d fwd kernel support (#2763 )" (#2848 ) This reverts commit 1a97bde100db0b7b5def711082bd2ea0e0aafc03. [ROCm/composable_kernel commit: `03b59f8c76`]	2025-09-15 08:27:04 -07:00
lym	5c712f856f	feature:tf32:add initial conv3d fwd kernel support (#2763 ) [ROCm/composable_kernel commit: `c51102144f`]	2025-09-15 21:03:00 +08:00
Wojciech Laskowski	5e10274417	WMMA support for GEMM reduce (#2823 ) Added gemm + reduce instance library for RDNA4. This includes: - New device implementation running GEMM and reduction kernel - instances for wmma (xdl parity) - examples for wmma (xdl parity) - tests for existing xdl and wmma [ROCm/composable_kernel commit: `b25d4d684a`]	2025-09-12 21:36:43 +02:00
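
The four macro rules listed in commit 90e0eb4dfc (#3305) can be illustrated with a minimal C++ sketch. Everything named here (ExampleGridwiseGemm, ExampleDeviceGemm, CK_EXAMPLE_GRIDWISE_GEMM_TEMPLATE_PARAMETERS, example_block_size) is hypothetical and chosen only for illustration; none of it is taken from the composable_kernel sources, and the real macros may be structured differently.

```cpp
// Hypothetical sketch of the macro hygiene rules from #3305.
// All names below are illustrative assumptions, not composable_kernel identifiers.

// Stand-in for a gridwise kernel template with a long parameter list.
template <typename ADataType, typename BDataType, int BlockSize>
struct ExampleGridwiseGemm
{
    static constexpr int block_size = BlockSize;
};

// Rules 1, 3 and 4: a specific name, a CK_ prefix, and ALL_CAPS with underscores.
#define CK_EXAMPLE_GRIDWISE_GEMM_TEMPLATE_PARAMETERS ADataType, BDataType, BlockSize

// Stand-in for a device-level template that forwards its parameters.
template <typename ADataType, typename BDataType, int BlockSize>
struct ExampleDeviceGemm
{
    // The macro expands here to: ADataType, BDataType, BlockSize
    using Gridwise = ExampleGridwiseGemm<CK_EXAMPLE_GRIDWISE_GEMM_TEMPLATE_PARAMETERS>;
};

// Rule 2: undefine the macro right after its last use so it cannot leak
// into other headers that are included in the same translation unit.
#undef CK_EXAMPLE_GRIDWISE_GEMM_TEMPLATE_PARAMETERS

#ifdef CK_EXAMPLE_GRIDWISE_GEMM_TEMPLATE_PARAMETERS
#error "macro leaked past its intended scope"
#endif

// Small helper showing that the forwarded parameters reach the gridwise type.
constexpr int example_block_size()
{
    return ExampleDeviceGemm<float, float, 256>::Gridwise::block_size;
}

static_assert(example_block_size() == 256, "parameters are forwarded unchanged");
```

Because the macro is undefined immediately after use, a second header could define its own parameter-list macro without a redefinition conflict, which is the multi-include build error the commit message describes.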


600 Commits