composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 02:02:46 +00:00

Author	SHA1	Message	Date
Joseph Macaranas	cc1392a405	Update TheRock CI SHA 20260102 (#3506 ) - TheRock CI compilation passed with the changes.	2026-01-02 14:23:43 -05:00
Ville Pietilä	6e8c401e33	[CK_BUILDER] Instance traits for conv bwd weight algorithms (#3498 ) Added instance traits for the following bwd weight conv algorithms DeviceGroupedConvBwdWeight_Xdl_CShuffleV3 DeviceGroupedConvBwdWeight_Wmma_CShuffleV3 DeviceGroupedConvBwdWeight_Wmma_CShuffle DeviceGroupedConvBwdWeight_TwoStage_Xdl_CShuffle DeviceGroupedConvBwdWeight_TwoStage_Wmma_CShuffleV3 DeviceGroupedConvBwdWeight_DL DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 Added also unit tests for instance traits of those bwd weigth algorithms that are currently exposed by the narrow CK build for MIOpen. --------- Co-authored-by: Ville Pietilä <>	2025-12-31 15:41:15 -08:00
DarylHawkinsAMD	f3e4d46faa	Temporarily disable kernel instances that won't build on gfx1101 on Windows (#3499 ) ## Proposed changes This source file won't build for gfx1101 on Windows. It builds successfully on other gfx110X architectures, and also builds successfully on gfx1101 on Linux. This is the compile error: ``` [composable_kernel] FAILED: library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj [composable_kernel] ccache B:\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_WMMA -DCK_USE_XDL -DDPP_KERNELS -DLLVM_MAIN_REVISION=524190 -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -IC:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/library/include -IC:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/include -IB:/build/ml-libs/composable_kernel/build/include -IB:/build/base/half/stage/include -isystem B:/build/core/clr/dist/include -DWIN32 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_WARNINGS -DNOMINMAX -fms-extensions -fms-compatibility -D_ENABLE_EXTENDED_ALIGNED_STORAGE -Wno-documentation-unknown-command -Wno-documentation-pedantic -Wno-unused-command-line-argument -Wno-explicit-specialization-storage-class -Wno-ignored-attributes -Wno-unknown-attributes -Wno-duplicate-decl-specifier --hip-path=B:/build/core/clr/dist --hip-device-lib-path=B:/build/core/clr/dist/lib/llvm/amdgcn/bitcode -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=c++20 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -x hip --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx1103 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx1103 -MD -MT library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj -MF library\src\tensor_operation_instance\gpu\grouped_conv3d_bwd_weight_bilinear\CMakeFiles\device_grouped_conv3d_bwd_weight_bilinear_instance.dir\wmma\device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj.d -o library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj -c C:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp [composable_kernel] error: Illegal instruction detected: Operand has incorrect register class. [composable_kernel] V_CMP_NE_U32_e32 0, $src_private_base, implicit-def $vcc, implicit $exec [composable_kernel] 1 error generated when compiling for gfx1101. ``` This appears to be a compiler bug and we'll follow up to get a proper fix landed, but for the purposes of landing some work to enable gfx1151 support in TheRock we'd like to disable building of these kernels on this architecture temporarily. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [X] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [X] I have removed the stale documentation which is no longer relevant after this pull request - [X] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [X] Any dependent changes have been merged	2025-12-31 13:12:45 -07:00
kabrahamAMD	f86bbb1aef	[CK_Builder] [testing] Integrate device random generators (#3427 ) Implemented device random number generators for ck tensors. Includes tests and integration to ck builder testing interface.	2025-12-30 10:03:05 -08:00
Bartłomiej Kocot	2b8302eb6d	Fix grouped conv wrw kernels names (#3494 )	2025-12-30 16:45:39 +01:00
ApoorvaKalyani	53a1e4f551	Grouped convolution backward data WMMA v3 implementation (#3460 ) * Added device level implementation for bwd_data_wmma_v3. * Added first instance of bwd_data_wmma_v3(f16). * Add support for bwd data in gridwise implementation Some changes are general for convolution and some are specific for bwd data. We need to generalize them once we have fwd, bwd data and bwd weight * Initial device implementation of bwd data * Remove unused template parameters in device impl * Add one instance for different layout initial check of device implementation * Add tests for splitk and for different layouts * Appended more instances to wmma_v3_f16. * Added conv_2d bf16 wmma_v3 instances. * Added conv_3d_bf16 wmma_v3_instances. * Added conv_3d_f16_wmma_v3_instances. * Added SplitN test cases for wmma. * Conv3d_bwd_data_scale_wmma_v3 instances. * Conv3d_bwd_data_bilinear_wmma_v3_instances * Renaming the device level instances file to common name , since it is defined for different DataTypes. * Renaming the instances and fixing typo * Added the test cases to regression test list * NCHW support for wmma_v3 * Examples for bf16 and f16 bwd_data_wmma_v3 * Added transpose conditons for device impl * fixing bugs * Added the gemm_args array implmentation * WIP debug conv bwd * fix splitk * Grouped gemm fix * Update CmakeLists with EOF * Added more instances for tests * Fixed the run time error in examples and removed 3d conv examples. * Fixed a typo. * Updated CmakeLists to removed the 3d convultion deleted files * Added print error statements for unsupoorted argument * Added the merge conflict related changes * Fixed compilation error * Fixed the InstanceFactory duplication error. * Removed the print statements and added logs to Arg function * All the merge conflict related errors resolved * Added d_tensor tests. * Added the missing example types of wmm_v3 * Merge error fix * Corrected the instance name * Reverted the bias relu change * Revereted the transpose load local change * Updated the regression test list with bwd_data_scale * Revert "Revereted the transpose load local change" This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b. * Revert "Merge error fix" This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b. * Reverting the local change * Added merge error fix * Build error fix due to merge conflicts * Added bias_relu example for wmma_v3 * Modified the main method in dtensor tests * Updated the dtensor tests to pick all the shapes * Updated the dtensor test shapes. * Updated the mem operations in tests. * Added reference func * Fixed typos in device impl * Added new header file and modified the include file for 3d tests * Renamed the test file and added reference func call. * clang format fix * Added ignore params * Modified device impl and tests * Removed debug print statements and updated dtensor test shapes * Fixing merge conflicts * Fixing more merge conflicts * Fixed copyrights * Updated the tuned instances to bilinear and scale. * Adding tuned instances to vanilla wmma_v3 * Removed all unused instances and modified test layouts. * Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances. * Fix clang format * Updated tuned f16/-genric instances * Formatting the instances file * Fixed copyrights and clang issues * Nonsense commit to force git to force * Removed the transpose instances * Added verified genric instances * Fixing namespace errors * Added todo for failing shapes * Formatting instance file * Fix instance list formatting * Removing unnecessary formats * Renamed the common file * Unification of xdl and wmma bwd_data tests * Updated Cmake * Added all layout types and deleted code. * Updated Cmake to add the condition to all tests. --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2025-12-30 16:25:08 +01:00
yadaish	dae85ead64	[CK_TILE] support split-k a16w4 gemm1 (#3389 ) * initial version to support moe gemm1 split-k * add missing args * fix build warning * update reference * for split-k disable bias and weight * remove debug log * fix format * fix div by zero errors * fix cmake config * update * resolve conflicts * remove useless changes * reformat * fix * remove useless changes * fix ci --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>	2025-12-29 23:05:35 +08:00
JH-Leon-KIM-AMD	a0acc83a72	[CK_BUILDER] Add GPU Reference Algorithm to CK Builder (#3381 ) * [CK_BUILDER] Integrate GPU reference as ConvAlgorithm Add GPU reference as a ConvAlgorithm specialization, enabling: - Unified Builder API for reference and optimized kernels - Future ckProfiler integration for validation - First step toward numerical validation in Builder tests Changes: - Add ConvAlgorithmSpecialization::REFERENCE enum - Add ConvAlgorithm_Reference struct - Add IsReferenceAlgorithm concept - Create 3 reference factories (Forward, BwdData, BwdWeight) - Wire into conv_dispatcher - Add proof-of-concept test (passing) Test result: Can instantiate reference through Builder API * Add GPU reference execution tests - Reference kernel executes through Builder (459ms) - Both reference and optimized can instantiate - Tests passing Next: Implement utilities for comparison * Optimized Builder kernel execution works - MakeArgument pattern implemented - Builder-generated kernel executes successfully - Tests passing (451ms execution) Next: Add comparison * VALIDATION COMPLETE: Builder == Reference Builder-generated kernel output matches GPU reference! Test: Validate_Optimized_vs_Reference_Forward_2D_FP16 Result: PASS ✓ This proves CK Builder generates correct code! * Update to new Builder API All tests passing * Rename test file for clarity test_builder_kernel_execution -> test_builder_kernel_validation * Add all 3 directions support - Forward, Backward Data, Backward Weight - All reference factories working - Dispatcher wired for all directions - 9 tests passing Tests: - test_reference_execution: 3 tests (all directions) - test_optimized_execution: 3 tests (all directions) - test_builder_kernel_validation: 3 tests (fwd validated, bwd placeholders) * Add backward direction support - Backward data and weight dispatcher wiring - Fix factories for new API - All 3 directions tested - 9 tests passing * Refactor: Change IsReferenceAlgorithm from concept to consteval function Address review feedback: Use consteval function in dispatcher instead of concept, matching the pattern for other algorithms (Tile, XDL, WMMA, DL). - Remove IsReferenceAlgorithm concept from conv_algorithm_concepts.hpp - Add IsReferenceAlgorithm() consteval function to conv_dispatcher.hpp - Update dispatcher to use function call: IsReferenceAlgorithm<T>() - Remove redundant algorithm checks from reference factory requires clauses All tests passing (9/9). * Move Tile algorithm check outside direction block to support all directions * Implement MakeInvokerPointer interface and add random input validation - Implement full Argument/Invoker structs for old CK interface (not just nullptr) - Refactor with reference_common.hpp to reduce code duplication - Add random input validation tests: Builder vs direct GPU reference (all directions) - Fix layout: GNHWC -> NHWGC to match reference kernel expectations - All 12 tests pass with IDENTICAL results on random input * Move ConvAlgorithm_Reference to test/impl/conv_algorithm_types.hpp Keep types.hpp for data types only (enums), move algorithm descriptors to conv_algorithm_types.hpp as suggested by review. * Add static_assert to ensure reference factories only accept PassThrough operations Reference implementation doesn't support fused elementwise operations. Add compile-time validation to fail early with clear error message if non-PassThrough operations are specified on input, weight, or output. * Add InstanceTraits support for reference kernels - Store SIGNATURE/ALGORITHM/VERSION in Instance for reflection - Create shared ReferenceCommonTraits base for common properties - Add 3 direction-specific InstanceTraits specializations in one file - Include data type and layouts in instance_string output * Remove optimized kernel validation tests from reference-only branch * Use existing layout helper and organize reference tests Use LayoutToCK from conv_tensor_layout.hpp and move reference InstanceTraits test to validation folder. * Merge develop branch Fix DataType switch for new mixed precision types. * Fix comment spacing for CI * Convert IsReferenceAlgorithm from function to concept * Add reference tests to CI smoke tests * Consolidate 3 reference factories into single unified factory --------- Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>	2025-12-29 16:11:08 +02:00
Kiefer van Teutem	88ae445580	Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470 ) * Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests. * Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong.	2025-12-29 12:58:29 +01:00
Yi DING	b0ea67e377	[CK_TILE] MX FLATMM Fix M Padding (#3489 ) * Fix M Padding * Fix tensor desc ele space size	2025-12-29 09:09:12 +08:00
joyeamd	a3916a8d16	enable f8 tests (#3488 )	2025-12-27 00:21:56 -08:00
Yi DING	7ce532eac7	[CK_TILE] Align FMHA BWD Reference with Kernel Implementation (#3486 )	2025-12-25 16:12:36 +08:00
Erwin Terpstra	e08efa551f	[CK_TILE] Grouped gemm quant tensor layouts (#3414 ) * feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time * chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times * chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems * fix: missing check for transposed load in bquant pipeline * chore: lower unit test tensors dimensions a bit for faster tests * chore: set grouped gemm example M/N warp tile to 16 --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-24 23:01:23 -08:00
Illia Silin	14668a56e3	remove the LLVM_MAIN_REVISION usage (#3487 )	2025-12-24 16:49:35 -08:00
Thrupti Raj Lakshmana Gowda	62a8ec155f	[CK TILE ENGINE] CI configuration with basic cases (#3475 ) * [CK TILE ENGINE] Adding GEMM BASIC TEST in Kenkins * fix RUN_TILE_ENGINE_BASIC_TESTS name typo * [CK Tile Engine] Updating basic CI * Resolving merging issues * Resolving merging issues --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-12-24 10:45:56 -08:00
kensclin	7f68f3c4fa	Enable padding blockscale for abquant (#3453 ) * Enable padding blockscale for abquant * run clang-format * Reduce unnecessary testing * remove cout	2025-12-24 09:12:40 -08:00
Po Yen Chen	1c3151963b	[CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425 ) * Add fp8bf16 support for batch_prefill * Fix wrong scale_s re-compute logic in batch_prefill * Fix wrong scale_s re-compute logic in fmha fwd * Fix batch_prefill codegen error * Remove no-longer used GetName() function * Add fp8 logits=True instances * Update CHANGELOG.md	2025-12-24 10:34:06 +08:00
jakpiase	c0797c1671	[CK_TILE] Minor splitk bugfix for gemms and conv (#3387 ) * fix for splitk if splitk < grid * add different splitk implementation * minor bugfix for streamk gemm * Add test --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-12-24 00:10:13 +01:00
Johannes Graner	e1381d6a71	[CK grouped gemm] Fix grouped gemm two stage HasMainK0BlockLoop (#3466 ) * Re-enable two stage kernel * Only disable on HasMainKBlockLoop mismatch * Address PR comments	2025-12-23 11:33:09 +01:00
kabrahamAMD	4ce7d4c511	[ck_builder] add utility functions to convolution (#3459 ) * reinstate conv_signature_utils.hpp * added tests for elementwise operation getters * add tests for getDataType functions * added test for no data type specified --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>	2025-12-23 10:39:49 +01:00
jakpiase	ead81d1b0b	[CK_TILE] Add splitk support to ck tile conv bwd data (#3353 ) * add splitk support to ck tile conv bwd data * add reviewers suggestions * minor fix * removed splitkbatchoffset struct	2025-12-23 10:03:42 +01:00
Lyu, Xudong	8b73633e65	fix: handle void return type in TailHandler error path with ROCm6 compiler (clang++) (#3477 ) Replace `decltype(TailHandler<>(...)){}` with direct function call to fix compilation error when return type is void. Co-authored-by: Yi DING <yi.ding@amd.com>	2025-12-23 15:03:18 +08:00
Yi DING	6864a618f4	[CK_TILE] FMHA Ignore BWD Failed Cases in Smoke Test (#3480 )	2025-12-23 13:28:15 +08:00
Bartłomiej Kocot	2955d77f3c	Fix grouped conv fwd wmma porting (#3479 ) * Fix grouped conv fwd wmma porting * add more limitations	2025-12-22 21:32:48 +01:00
Wojciech Laskowski	a8aebb7a8e	Post-merge cleanup for WMMA grouped conv fwd (#3468 ) * remove duplicate aliases * Split scaleadd_ab instances for WMMA grouped conv fwd * removed big shape from the test	2025-12-22 15:57:45 +01:00
Bartłomiej Kocot	44f1b5c5de	Fix jenkinsfile for large tensor conv test (#3478 )	2025-12-21 17:39:30 -08:00
Jan Patrick Lehr	9bd67c2cf2	[CK-TILE] Guard against compiler lexer diagnostic (#3444 ) * [CK-TILE] Guard against compiler lexer diagnostic A recent change to Clang added a lexer-level diagnostic about that C2y language feature. Since that is lexer level, the `__extension__` compiler built-in does not work as it is only respected after the lexer when parsing. This change adds guarding pragmas to disable the diagnostic in the lexer and not lead to warnings being treated as errors. * Fixing still existing build issue Once the one warning was removed, another one poppoed up. Both are related to the same c2y feature. Thus, ignoring both. * clang-format handling --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-19 17:32:20 -08:00
Bartłomiej Kocot	cbc8335964	Improve XDL to WMMA porting for grouped conv fwd (#3456 ) Refactors the way the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, especially to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles. The changes use MXdlPerWave instead of NXdlPerWave to increase number of waves per M dim.	2025-12-19 15:58:51 -07:00
Illia Silin	2d9c962e2c	get LLVM_MAIN_REVISION macro from compiler header (#3469 )	2025-12-19 14:57:12 -08:00
Geo Min	f67a20b0be	Revert "details from org var (#3431 )" (#3473 ) This reverts commit `f4729de395`.	2025-12-19 14:10:58 -08:00
Thrupti Raj Lakshmana Gowda	e22622f0ec	[TILE ENGINE] Restructure to Base class of GEMM (#3434 )	2025-12-19 23:53:56 +08:00
Wojciech Laskowski	0fd2b2f045	Adding support for scale and bilinear ops for WMMA grouped conv fwd (#3450 ) * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * removed unnecessary ck parts from compilation * initial gemm_add_multiply instance implementations * fixed profiler help message for gemm_add_multiply * improved multiply_add profiler layout help * fixed template arguments for test instances * added test for gemm_add_multiply * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * switched to splitK interface * log print added to splitk benchmarks * revert main cmake comments * newline change reverted * added add_fastgelu instances * revert unintended change in xdl add_fastgelu * created gemm_add_add_fastgelu instances * created fastegelu instances * added tests for all splitk fastgelus * Added tests. * multiply_add instances created * updates to add_multiply splitk instances * splitk xdl test fixes * added wmma multiply_multiply instances * fixed ONLY_XDL_AND_WMMA_KERNELS tag * Added gemm_add examples for wmma v1 and v3 * fixed / workarounded i8 instances * Modified the v3 code to added one fp16 bxdl instance. * added bf16 xdl instance. * adding gemm_add wmma_cshuffle and other support (cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * add instances into camkelists (cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * This is work in progress, edited the template parameters in order to build (cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype (cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * added datatype and use clang-format-12 (cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * Fixing build errors * Added instances for v3 * Adding instances and executables * Code update of template parameters modified. * Renamed file. * Added tests. * resolved error tests. * Fixing build errors * Updated comments * removed the changes as per the MR review comment. * Updated tests. * fp8 instances - not tested * Restored the Cmake file that was reverted by mistake during rebase. * fixed wmma_op test * Updated comments. * Updated the template parameter description * fixed rdna4 instances * fixed back compatibility on gfx11 * cleanups * fix ckProfiler * one more cmake fix * added fp8 instances * Updated tests to ad BF16 instances as per review comment * Added include file and cleaned up(as per review comment) * Updated and optimized the example code for all types. * Fixed clang format * Resolve "Implement `device_gemm_bilinear` for RDNA4" * test generalization to handle FP16 shuffle better * added missing changes * Added bf16 wmma instance for add_relu * Added f16 wmma instance and corrected bf16 instance errors. * Added instances to Cmake * Modified the template parameters to make the instances work. * Fixed typo in profiler * Added v3 instances for gemm_add_relu * addressed core review comments * Added test for gemm_add_relu wmma instance * Cleaned up the code. * Added examples for gemm_add_relu * Fixing typo to resolve build errors. * Fixes applied to fix the precision loss. * fix billinear test after merge * Removed the old wmma instances. * Added wrapper and renamed the wmma_v3 instances * Updated copyrights and added wrappers. * Fixes applied according to review comments * Apply 1 suggestion(s) to 1 file(s) Co-authored-by: Robin Voetter <robin@streamhpc.com> * Removed the old wmma instances. * Updated wrapper for the v3 instances * removed the old wmma examples * Renamed the v3 instances * Deleted the gtest file added by mistake. * Updated thge profiler with wrapper * Fixed test errors. * Fixed the review comments * Fixed the if condition MACROS. * REVERTED THE PROFILER CHANGES * Revert "REVERTED THE PROFILER CHANGES" This reverts commit `21cb98546c`. * Revert "Fixed test errors." This reverts commit `13efcc6fe1`. * Revert "Updated thge profiler with wrapper" This reverts commit `536f86661d`. * Added missing wrapper instances * Updated copyrights. * Fixed typo. * Fixed copyrights. * Updated copyrights. * updated copyrights. * comments on the atomics workaround * fixed cmake comment * Fix bug from merge * clang-format-18 * Fix compilation error * multi_abd wmma support: - Add multiple A and B support to multiple D implementation (gridwise level) - Add multi_abd GEMM (device level) - Add instances (xdl parity) - Add tests (both xdl and wmma) - Add examples - Add ckProfiler support (both xdl and wmma) * Fix bug in device print function * Fix unused template parameter * Add support for fwd conv in gridwise implementation. Identical to run function for bwd data. * Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future. * Make relevant profilers print the number of valid instances to aid testing. * Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now. Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working. * Reset output buffer after each run in profile_grouped_conv_fwd_impl(). * Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds. * Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation. * Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types. * Add int8 instances for 2D vanilla grouped conv fwd all layouts. * Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd * Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts. * Implement multi-AB support for grouped conv fwd and add example. * Add 1D instances * Add D layout tests to IsSupportedArgument() * Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary. * Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists. * Add some more tests for vanilla grouped conv fwd * Add 2D bias clamp instances and tests * Add 3D bias clamp instances and tests * Add 2D and 3D clamp instances and tests * Unify problem sizes across vanilla and clamp flavor tests * Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check. * Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations. * Remove wmma fp8 and bf8 instances when not targetting gfx12 * Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build * Remove old years for newly created files. * No need to time kernels for now. * Fixup comments * Pass struct args to Gridwise Run() function by reference. * Don't use workspace memory in the case where A needs explicit transposition but B does not. * Move calculation of rotating memory buffer sizes to Argument member functions. * After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts. * Unify xdl and wmma example code for grouped conv fwd scaleadd ab * Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same. * Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3. * Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge. * Small post-merge fixup, everything seems to work. * Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming. * Extend scaleadd_ab instance lists * Extend merged groups instance lists, including adaptations of xdl "2x" instances. * Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile. * Extend "mem" instance lists. * Extend regular instance lists. * Fixup comments and ignored kernel arg name * Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test. * Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code. * Re-enable sharding for wmma cshufflev3 instances * Post merge fix to vanilla test * Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general. * Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop. * Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument. * Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed. * Re-enable old wmma instances * Re-enable Linqun's Xdl Wmma instances * Small post-merge fixes * Fix copyright headers * Remove commented code snippet in gridwise Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance. * Adding tuned instace list for groupoed conv fwd (#3288) Following flavors are updated with tuned instance list: - grouped_conv2d_fwd - grouped_conv2d_fwd_bias_clamp - grouped_conv2d_fwd_clamp - grouped_conv3d_fwd - grouped_conv3d_fwd_bias_clamp - grouped_conv3d_fwd_clamp - grouped_conv3d_fwd_scaleadd_ab Re-factored instance selection: - removed all the unnecessary instance tuples (comp/mem/16x16/generic) - removed all unnecessary layouts and data types * Do not use std::remove_cvref_t, does not exist in C++17, use custom one. * Splitting grouped conv fwd instances (#3449) * Disable unnecessary and failing tests related to experimental CK builder * Disable unnecessary ck builder experimental tests fully * Adding extra flavors for grouped conv fwd As titled. Following variants are added: - grouped_conv3d_fwd_bilinear - grouped_conv3d_fwd_scale * fix cmake error * Fix failing int8 test for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: apoorva <apoorva@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Cenxuan <cenxuan@streamhpc.com> Co-authored-by: Robin Voetter <robin@streamhpc.com> Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com> Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-12-19 15:15:02 +01:00
Johannes Graner	323e014799	[CK Grouped Gemm] Fix workspace stride in two stage kernel (#3412 ) * Use correct workspace stride * Use correct stride in elementwise kernel * Fix test by adding padder * No UTF-8 in comments * Remove unnecessary changes * Remove more unnecessary changes * Use non-padded strides for workspace * Disable two stage kernel for RRR+MNKPadding+kbatch>2 Partially fixes AICK-441	2025-12-19 10:04:48 +01:00
John Afaganis	b188a2a896	Minor CHANGELOG.md correction (#3451 ) Address a minor issue where the changelog entry for #3423 was accidentally added to the wrong section.	2025-12-19 14:02:42 +08:00
Wojciech Laskowski	7795e73b47	Added large tensor support for grouped conv fwd wmma (#3437 ) * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * removed unnecessary ck parts from compilation * initial gemm_add_multiply instance implementations * fixed profiler help message for gemm_add_multiply * improved multiply_add profiler layout help * fixed template arguments for test instances * added test for gemm_add_multiply * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * switched to splitK interface * log print added to splitk benchmarks * revert main cmake comments * newline change reverted * added add_fastgelu instances * revert unintended change in xdl add_fastgelu * created gemm_add_add_fastgelu instances * created fastegelu instances * added tests for all splitk fastgelus * Added tests. * multiply_add instances created * updates to add_multiply splitk instances * splitk xdl test fixes * added wmma multiply_multiply instances * fixed ONLY_XDL_AND_WMMA_KERNELS tag * Added gemm_add examples for wmma v1 and v3 * fixed / workarounded i8 instances * Modified the v3 code to added one fp16 bxdl instance. * added bf16 xdl instance. * adding gemm_add wmma_cshuffle and other support (cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * add instances into camkelists (cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * This is work in progress, edited the template parameters in order to build (cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype (cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * added datatype and use clang-format-12 (cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * Fixing build errors * Added instances for v3 * Adding instances and executables * Code update of template parameters modified. * Renamed file. * Added tests. * resolved error tests. * Fixing build errors * Updated comments * removed the changes as per the MR review comment. * Updated tests. * fp8 instances - not tested * Restored the Cmake file that was reverted by mistake during rebase. * fixed wmma_op test * Updated comments. * Updated the template parameter description * fixed rdna4 instances * fixed back compatibility on gfx11 * cleanups * fix ckProfiler * one more cmake fix * added fp8 instances * Updated tests to ad BF16 instances as per review comment * Added include file and cleaned up(as per review comment) * Updated and optimized the example code for all types. * Fixed clang format * Resolve "Implement `device_gemm_bilinear` for RDNA4" * test generalization to handle FP16 shuffle better * added missing changes * Added bf16 wmma instance for add_relu * Added f16 wmma instance and corrected bf16 instance errors. * Added instances to Cmake * Modified the template parameters to make the instances work. * Fixed typo in profiler * Added v3 instances for gemm_add_relu * addressed core review comments * Added test for gemm_add_relu wmma instance * Cleaned up the code. * Added examples for gemm_add_relu * Fixing typo to resolve build errors. * Fixes applied to fix the precision loss. * fix billinear test after merge * Removed the old wmma instances. * Added wrapper and renamed the wmma_v3 instances * Updated copyrights and added wrappers. * Fixes applied according to review comments * Apply 1 suggestion(s) to 1 file(s) Co-authored-by: Robin Voetter <robin@streamhpc.com> * Removed the old wmma instances. * Updated wrapper for the v3 instances * removed the old wmma examples * Renamed the v3 instances * Deleted the gtest file added by mistake. * Updated thge profiler with wrapper * Fixed test errors. * Fixed the review comments * Fixed the if condition MACROS. * REVERTED THE PROFILER CHANGES * Revert "REVERTED THE PROFILER CHANGES" This reverts commit `21cb98546c`. * Revert "Fixed test errors." This reverts commit `13efcc6fe1`. * Revert "Updated thge profiler with wrapper" This reverts commit `536f86661d`. * Added missing wrapper instances * Updated copyrights. * Fixed typo. * Fixed copyrights. * Updated copyrights. * updated copyrights. * comments on the atomics workaround * fixed cmake comment * Fix bug from merge * clang-format-18 * Fix compilation error * multi_abd wmma support: - Add multiple A and B support to multiple D implementation (gridwise level) - Add multi_abd GEMM (device level) - Add instances (xdl parity) - Add tests (both xdl and wmma) - Add examples - Add ckProfiler support (both xdl and wmma) * Fix bug in device print function * Fix unused template parameter * Add support for fwd conv in gridwise implementation. Identical to run function for bwd data. * Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future. * Make relevant profilers print the number of valid instances to aid testing. * Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now. Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working. * Reset output buffer after each run in profile_grouped_conv_fwd_impl(). * Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds. * Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation. * Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types. * Add int8 instances for 2D vanilla grouped conv fwd all layouts. * Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd * Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts. * Implement multi-AB support for grouped conv fwd and add example. * Add 1D instances * Add D layout tests to IsSupportedArgument() * Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary. * Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists. * Add some more tests for vanilla grouped conv fwd * Add 2D bias clamp instances and tests * Add 3D bias clamp instances and tests * Add 2D and 3D clamp instances and tests * Unify problem sizes across vanilla and clamp flavor tests * Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check. * Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations. * Remove wmma fp8 and bf8 instances when not targetting gfx12 * Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build * Remove old years for newly created files. * No need to time kernels for now. * Fixup comments * Pass struct args to Gridwise Run() function by reference. * Don't use workspace memory in the case where A needs explicit transposition but B does not. * Move calculation of rotating memory buffer sizes to Argument member functions. * After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts. * Unify xdl and wmma example code for grouped conv fwd scaleadd ab * Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same. * Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3. * Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge. * Small post-merge fixup, everything seems to work. * Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming. * Extend scaleadd_ab instance lists * Extend merged groups instance lists, including adaptations of xdl "2x" instances. * Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile. * Extend "mem" instance lists. * Extend regular instance lists. * Fixup comments and ignored kernel arg name * Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test. * Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code. * Re-enable sharding for wmma cshufflev3 instances * Post merge fix to vanilla test * Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general. * Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop. * Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument. * Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed. * Re-enable old wmma instances * Re-enable Linqun's Xdl Wmma instances * Small post-merge fixes * Fix copyright headers * Remove commented code snippet in gridwise Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance. * Adding tuned instace list for groupoed conv fwd (#3288) Following flavors are updated with tuned instance list: - grouped_conv2d_fwd - grouped_conv2d_fwd_bias_clamp - grouped_conv2d_fwd_clamp - grouped_conv3d_fwd - grouped_conv3d_fwd_bias_clamp - grouped_conv3d_fwd_clamp - grouped_conv3d_fwd_scaleadd_ab Re-factored instance selection: - removed all the unnecessary instance tuples (comp/mem/16x16/generic) - removed all unnecessary layouts and data types * Do not use std::remove_cvref_t, does not exist in C++17, use custom one. * Splitting grouped conv fwd instances (#3449) * Disable unnecessary and failing tests related to experimental CK builder * Disable unnecessary ck builder experimental tests fully * Added large tensor support for grouped conv fwd wmma --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: apoorva <apoorva@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Cenxuan <cenxuan@streamhpc.com> Co-authored-by: Robin Voetter <robin@streamhpc.com> Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com> Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-18 21:55:50 -07:00
John Shumway	9a6e61de97	[CK_BUILDER] Add noreturn to consteval void functions (#3461 ) We have some metaprogramming helper functions that only exist to throw an error at build time. These should have the [[noreturn]] attribute, which is now required in our CI builds.	2025-12-18 19:07:30 -08:00
Yi DING	2220cbaba7	[CK_TILE] MX Flatmm Use Byte Pointer Arithmetic for A Tensor (#3446 ) * A as bytes * Reformat with static_for_product	2025-12-19 10:28:13 +08:00
yadaish	c0ee71d735	Dev/a8w4 and a8w8splitk (#3447 ) * Ck moe bs splitk pr (#3440) * splitk kick-off. Compilation fail * splitk hack pass * fix scale offset calc. * clang-format for a8w8_moe_blk_gemm1 splitk change * fix testcase error --------- Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> * Zan/moe a8w4 (#3441) * update * update * update ck moe a8w4 * update * update * update * compile pass * update * update * python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready * support new a8w4 kernel * update * update ck_tile * re format * update * update * fix conflict * fix build * update ck_tile moe * fix clang format * fix the problem * fix accruacy issue * fix --------- Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> Co-authored-by: Zzz9990 <zanzhang@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-12-19 09:26:52 +08:00
yinglu	ba897f8435	ck:tf32:complement CK_ENABLE_TF32 controls (#3426 )	2025-12-19 09:17:29 +08:00
andrew clark	e77a7ca2bc	Supporting Custom Build Trace File Names (#3443 ) * Removing hard-coded trace filename * Including stage name in notification * Simplifying capture setup and tagging file names with arch * Removed test property from notification message * Fixing regex to get arch name * Fixing error in notification and modified regex	2025-12-18 12:15:33 -08:00
Kiefer van Teutem	2ea710e88b	Grouped convolution forward device implementation and base flavors for RDNA3/4 (#2964 ) * Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * removed unnecessary ck parts from compilation * initial gemm_add_multiply instance implementations * fixed profiler help message for gemm_add_multiply * improved multiply_add profiler layout help * fixed template arguments for test instances * added test for gemm_add_multiply * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * switched to splitK interface * log print added to splitk benchmarks * revert main cmake comments * newline change reverted * added add_fastgelu instances * revert unintended change in xdl add_fastgelu * created gemm_add_add_fastgelu instances * created fastegelu instances * added tests for all splitk fastgelus * Added tests. * multiply_add instances created * updates to add_multiply splitk instances * splitk xdl test fixes * added wmma multiply_multiply instances * fixed ONLY_XDL_AND_WMMA_KERNELS tag * Added gemm_add examples for wmma v1 and v3 * fixed / workarounded i8 instances * Modified the v3 code to added one fp16 bxdl instance. * added bf16 xdl instance. * adding gemm_add wmma_cshuffle and other support (cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * add instances into camkelists (cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * This is work in progress, edited the template parameters in order to build (cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype (cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * added datatype and use clang-format-12 (cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * Fixing build errors * Added instances for v3 * Adding instances and executables * Code update of template parameters modified. * Renamed file. * Added tests. * resolved error tests. * Fixing build errors * Updated comments * removed the changes as per the MR review comment. * Updated tests. * fp8 instances - not tested * Restored the Cmake file that was reverted by mistake during rebase. * fixed wmma_op test * Updated comments. * Updated the template parameter description * fixed rdna4 instances * fixed back compatibility on gfx11 * cleanups * fix ckProfiler * one more cmake fix * added fp8 instances * Updated tests to ad BF16 instances as per review comment * Added include file and cleaned up(as per review comment) * Updated and optimized the example code for all types. * Fixed clang format * Resolve "Implement `device_gemm_bilinear` for RDNA4" * test generalization to handle FP16 shuffle better * added missing changes * Added bf16 wmma instance for add_relu * Added f16 wmma instance and corrected bf16 instance errors. * Added instances to Cmake * Modified the template parameters to make the instances work. * Fixed typo in profiler * Added v3 instances for gemm_add_relu * addressed core review comments * Added test for gemm_add_relu wmma instance * Cleaned up the code. * Added examples for gemm_add_relu * Fixing typo to resolve build errors. * Fixes applied to fix the precision loss. * fix billinear test after merge * Removed the old wmma instances. * Added wrapper and renamed the wmma_v3 instances * Updated copyrights and added wrappers. * Fixes applied according to review comments * Apply 1 suggestion(s) to 1 file(s) Co-authored-by: Robin Voetter <robin@streamhpc.com> * Removed the old wmma instances. * Updated wrapper for the v3 instances * removed the old wmma examples * Renamed the v3 instances * Deleted the gtest file added by mistake. * Updated thge profiler with wrapper * Fixed test errors. * Fixed the review comments * Fixed the if condition MACROS. * REVERTED THE PROFILER CHANGES * Revert "REVERTED THE PROFILER CHANGES" This reverts commit `21cb98546c`. * Revert "Fixed test errors." This reverts commit `13efcc6fe1`. * Revert "Updated thge profiler with wrapper" This reverts commit `536f86661d`. * Added missing wrapper instances * Updated copyrights. * Fixed typo. * Fixed copyrights. * Updated copyrights. * updated copyrights. * comments on the atomics workaround * fixed cmake comment * Fix bug from merge * clang-format-18 * Fix compilation error * multi_abd wmma support: - Add multiple A and B support to multiple D implementation (gridwise level) - Add multi_abd GEMM (device level) - Add instances (xdl parity) - Add tests (both xdl and wmma) - Add examples - Add ckProfiler support (both xdl and wmma) * Fix bug in device print function * Fix unused template parameter * Add support for fwd conv in gridwise implementation. Identical to run function for bwd data. * Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future. * Make relevant profilers print the number of valid instances to aid testing. * Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now. Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working. * Reset output buffer after each run in profile_grouped_conv_fwd_impl(). * Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds. * Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation. * Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types. * Add int8 instances for 2D vanilla grouped conv fwd all layouts. * Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd * Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts. * Implement multi-AB support for grouped conv fwd and add example. * Add 1D instances * Add D layout tests to IsSupportedArgument() * Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary. * Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists. * Add some more tests for vanilla grouped conv fwd * Add 2D bias clamp instances and tests * Add 3D bias clamp instances and tests * Add 2D and 3D clamp instances and tests * Unify problem sizes across vanilla and clamp flavor tests * Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check. * Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations. * Remove wmma fp8 and bf8 instances when not targetting gfx12 * Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build * Remove old years for newly created files. * No need to time kernels for now. * Fixup comments * Pass struct args to Gridwise Run() function by reference. * Don't use workspace memory in the case where A needs explicit transposition but B does not. * Move calculation of rotating memory buffer sizes to Argument member functions. * After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts. * Unify xdl and wmma example code for grouped conv fwd scaleadd ab * Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same. * Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3. * Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge. * Small post-merge fixup, everything seems to work. * Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming. * Extend scaleadd_ab instance lists * Extend merged groups instance lists, including adaptations of xdl "2x" instances. * Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile. * Extend "mem" instance lists. * Extend regular instance lists. * Fixup comments and ignored kernel arg name * Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test. * Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code. * Re-enable sharding for wmma cshufflev3 instances * Post merge fix to vanilla test * Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general. * Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop. * Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument. * Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed. * Re-enable old wmma instances * Re-enable Linqun's Xdl Wmma instances * Small post-merge fixes * Fix copyright headers * Remove commented code snippet in gridwise Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance. * Adding tuned instace list for groupoed conv fwd (#3288) Following flavors are updated with tuned instance list: - grouped_conv2d_fwd - grouped_conv2d_fwd_bias_clamp - grouped_conv2d_fwd_clamp - grouped_conv3d_fwd - grouped_conv3d_fwd_bias_clamp - grouped_conv3d_fwd_clamp - grouped_conv3d_fwd_scaleadd_ab Re-factored instance selection: - removed all the unnecessary instance tuples (comp/mem/16x16/generic) - removed all unnecessary layouts and data types * Do not use std::remove_cvref_t, does not exist in C++17, use custom one. * Splitting grouped conv fwd instances (#3449) * Disable unnecessary and failing tests related to experimental CK builder * Disable unnecessary ck builder experimental tests fully --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: apoorva <apoorva@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Cenxuan <cenxuan@streamhpc.com> Co-authored-by: Robin Voetter <robin@streamhpc.com> Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Wojciech Laskowski <77888887+wj-laskowski@users.noreply.github.com>	2025-12-18 13:12:15 -07:00
Bartłomiej Kocot	700b2ec9c0	Update AMD buffer coherency (#3403 ) * Update AMD buffer coherency [AICK-421] * fixes * fix * fixes * fixes * Add backward compatilibity * fix * fixes * fix * fix * fix * Update grouped_convolution_backward_weight_kernel.hpp	2025-12-18 10:16:22 +01:00
Yashvardhan Agarwal	15e81397a4	[CK_TILE] Epilogue chaining (Lwpck 3373) (#2773 ) * Epilogue chainer * epilogue chainer with context to share state in between epilogues * chain-able epilogues for cshuffle * clang-format * rebase related changes - Added separate chainer test - clang format * comment resolutions * clang-format * Policy based chaining - basic Policy structure to control blanket looping and barrier placement. - to be extended for fine grianed control - to be modified to move possible auto-compute values and SFC access count to policy * Refactoring as per spec - Introduced epilogue schedule, graph - modified chainer to function with graph and schedule * minor_changes - made functions to overload in the epilogue_graph file * clang-format * Documentation and Comments - Added comments to files - Noted changes in changelog - Added README to explain the chainer and current status, exact use steps to be added * Comment resolutions - README modified with the suggested changes - Comment fixed accordingly * major refactoring - modified the chainer files to match the new design - updated comments - updated readme - multi-d example shocases use of the chainer * minor cleanup * tensor and rowcol quant chainer epilogue - added scalarepilogue for tensor quant - added schedule for tensorquant - modified quant example to use chainer and appropriate schedules * Refactor epilogue chainer: generalize ops and standardize context interface Address review comments. Changes: - Rename CastToLdsOp to CastAndStoreToLdsOp for clarity - Standardize context member names (working_tile, out_tile, aux_windows) - Update README documentation with correct operation names - Clean up parameter naming in epilogue_chainer.hpp (OutWindow, AccTile, AuxWindows) - common_epilogue_ops.hpp: General-purpose ops (ScaleScalarOp, CastAndStoreToLdsOp, LoadFromLdsOp, ElementwiseOp, StoreOp, MoveWindowsOp) - cshuffle_epilogue_chainer_ops.hpp: CShuffle-specific context and slice operations - epilogue_chainer.hpp: Cleaned up parameter naming for generality - Removed test files that are no longer needed. These were added for intermediate use * update cshuffle chainer ops file w.r.t cshuffle_epilogue.hpp updates & add chainer to quant gemm example * fix compile errors - CI uses c++17 while the code had c++20 features --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-12-18 10:02:02 +01:00
Po Yen Chen	bfac64953f	[CK_TILE][FMHA] Add logits soft-capping support for FAv3 (WIP) (#3355 ) * Let fmha_fwd_v3() compatible with fmha_fwd() * Decouple get_fwd_blobs() and FmhaFwdKernel * Decouple compatibility checks from get_fwd_blobs() * Extract product feature checks out from get_fwd_blobs() * Remove duplicated code in factories and redundant checks * Remove FmhaFwdKernel<>::GetName() * Let FmhaFwdApiPool support pipelines with different mask_impl * Add tile setting for fmha fwd v3 pipeline * Add fwd v3 instances to tile_example_fmha_fwd manually * Remove unused function import * Undo irrelevant changes * Remove fwd v3 instances from tile_example_fmha_fwd * Finish fmha fwd v3 kernel instance codegen * Fix formatting * Remove unused F_idx attribute * Add is_generic_attention_mask<> traits * Add constraints to the fmha fwd v3 pipeline * Unify traits & problem used for fmha fwd v3 * Unify kernel launch code for fmha fwd v2 & v3 * Unify kernel template selection logic * Use same kernel codegen template for both v2 & v3 * Rename api() property as render() method * Allow specifying filter for fmha fwd api pool * Allow specifying function name when rendering api pool items * Separate fmha fwd v3 kernel dispatching logic from v2 * Remove lambda assignment * Add simple v2/v3 dispatch logic * Stop generating empty if-clauses Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them. * Use "".join() to concatenate fmha fwd api string content * Add more feature checks for fmha fwd v3 pipeline * Check features before dispatch to fmha_fwd_v3() * Add more feature checks for fmha_fwd_v3() * Add missing filter call * Use Tuple to reserve the dtype orders * Fix wrong pipeline matching logic * Add fmha fwd v3 group mode instances * Add functor_transform<> * Add type constraints to make_tile_window() * Remove fmha fwd v3 example * Fix wrong product(aiter mha_fwd()) config * Fix wrong fmha fwd v2/v3 selection logic * Fix formatting * Add comment to warning v3 kernel users * Fix wrong codegen logics * Remove unnecessary param * Fix format * Add logits soft-capping support for fmha fwd v3 pipeline (WIP) * Add missing Kargs base type --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-18 16:08:45 +08:00
Johannes Graner	bb8445dca8	[CK] Integrate GPU reference into ckProfiler for convolutions (#3379 ) Refactor and integrate CK GPU references into ckProfiler. - All convolution layouts and groupings supported for all three directions - Unit tests verifying GPU and CPU reference is the same - Support added to profiler (do_verification = 2 enables GPU reference) - One profiler-based test per direction changed to GPU reference to demonstrate usag Closes AICK-427	2025-12-18 07:59:45 +01:00
Enrico Degregori	87dd073887	Wmma support for grouped convolution bwd weight (#2947 ) * Convolution bwd weight device implementation * Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma' Convolution bwd weight device implementation See merge request amd/ai/composable_kernel!38 * Fix bug and disable splitK=-1 tests for wmma * Add generic instances for bf16 f32 bf16 * check gridwise level validity in device impl for 1 stage D0 * Fix bugs in device implementation: - rdna3 compilation error - gridwise layouts (need to be correct to ensure that CheckValidaity() works correctly) * Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization * Remove workaround for 1x1Stride1Pad0 conv specialization * Add instances for xdl parity (for pipeline v1) * Add two stage instances (xdl parity) * Add multiple Ds instances * Add examples * Uncomment scale instances * Fix copyright * Fix examples compilation * Add atomic add float4 * Fix compilation error * Fix instances * Compute tolerances in examples instead of using default ones * Compute tolerances instead of using default ones in bilinear and scale tests * Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma' Grouped conv: Instances and example bwd weight See merge request amd/ai/composable_kernel!47 * Device implementation of explicit gemm for grouped conv bwd weight Based on batched gemm multiple D * Add instances for pipeline v1 and v3 * Add support for occupancy-based splitk * Fix ckProfiler dependencies * Review fixes * Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma' Device implementation of explicit gemm for grouped conv bwd weight See merge request amd/ai/composable_kernel!52 * Fix cmake file for tests * fix clang format * fix instance factory error * Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test. * Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test." This reverts commit `d20c869d3d`. * Disable splitk for 2stage xdl on rdna (bug to be fixed) * Fix add_test_executable * Always ForceThreadTileTransfer for now, WaveTileTransfer does not work for convolution yet. * Grab device and gridwise files from bkp branch, this should enable splitK support for convolution and also we no longer ForceThreadTileTransfer for explicit gemm. Also grab some updates from 7e7243783008b11e904f127ecf1df55ef95e9af2 to fix building on clang20. * Fix bug in various bwd wei device implementations / profiler where the occupancy based split_k value could not be found because the Argument did not derive from ArgumentSplitK, leading to incorrect error tolerances. * Actually print the reason when a device implementation is not supported. * Print number of valid instances in profiler and tests. * Fix clang format for Two Stage implementation * Fix copyright * Address review comments * Fix explicit conv bwd weight struct * Fix gridwise common * Fix gridwise ab scale * Remove autodeduce 1 stage * Restore example tolerance calculation * Fix compilation error * Fix gridwise common * Fix gridwise gemm * Fix typo * Fix splitk * Fix splitk ab scale * Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test. * Reduce instances to only the tuned wmma V3 ones for implicit v1 intra and explicit v1 intra pad/nopad. * Add explicit oddMN support with custom tuned instances * Add two stage instances based on the parameters from the tuned cshuffle V3 instances. CShuffleBlockTranserScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists. * Replace cshuffle non-v3 lists with v3 lists, making sure to not have duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases. * Remove some instances that give incorrect results (f16 NHWGC) * Add bf16 f32 bf16 instances based on tuned b16 NHWGC GKYXC instances. * Add back some generic instances to make sure we have the same shape / layout / datatype support as before the instance selection process. * Add instances for scale and bilinear based on the bf16 NHWGC GKYXC tuning. Keep generic instances for support. * Disable two stage f16 instances which produce incorrect results. * Remove more instances which fail verification, for bf16_f32_bf16 and for f16 scale / bilinear. * Disable all non-generic two-stage instances in the instance lists for NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit. * Remove unused instance lists and related add_x_instance() functions, fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3. * Re-enable all xdl instances (un-16x16-adapted) and dl instances. Remove custom ckProfiler target. * Remove straggler comments * Remove [[maybe_unused]] * Fix clang format * Remove unwanted instances. This includes all instances which are not NHWGCxGKYXC and F16 or BF16 (no mixed in-out types). * Add comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>	2025-12-17 15:58:58 -08:00
Geo Min	f4729de395	details from org var (#3431 )	2025-12-17 11:54:13 -08:00
Yashvardhan Agarwal	ea10a78203	[ck_tile] refactor reduce kernel (#3257 ) * refactor reduce kernel - Rename Reduce kernel as per convention - Move kept_dim and reduce_dims from runtime to compile-time parameters - Update Reduce2dProblem template to include KeptDim, ReduceDims, and Rank - Remove IsSupportedArgument validation function as it's unnecessary. Not using the GuaranteedLastDimensionVectorStride while making tensor view or descriptor which removes the bounds enforced earlier. We still calculate and use vector size. - Update reduce example to demonstrate NCHW->NHW reduction with non-contiguous support - Update tests Kernel now handles both contiguous and non-contiguous memory layout. * fix compile errors	2025-12-17 21:46:08 +02:00
ltqin	92653168c2	flashattention fwd add (80, 96) instance (#3415 ) * add hdim (96,96) instance * change to (80,96) * format py * remove 96 in optdim * when N=6 change to llvm_amdgcn_raw_buffer_load_i32x3	2025-12-17 09:16:11 -08:00
Matti Eskelinen	fe3d52d9b0	Fix minor issues in cmake-ck-dev script (#3438 ) * Remove extra slash from cmake-ck-dev.sh * Add quoting around string variables	2025-12-17 08:57:21 -08:00

1 2 3 4 5 ...

2845 Commits