composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-20 04:49:54 +00:00

Author	SHA1	Message	Date
ApoorvaKalyani	bac1ccbf8b	Grouped convolution backward data WMMA v3 implementation (#3460) * Added device level implementation for bwd_data_wmma_v3. * Added first instance of bwd_data_wmma_v3(f16). * Add support for bwd data in gridwise implementation Some changes are general for convolution and some are specific for bwd data. We need to generalize them once we have fwd, bwd data and bwd weight * Initial device implementation of bwd data * Remove unused template parameters in device impl * Add one instance for different layout initial check of device implementation * Add tests for splitk and for different layouts * Appended more instances to wmma_v3_f16. * Added conv_2d bf16 wmma_v3 instances. * Added conv_3d_bf16 wmma_v3_instances. * Added conv_3d_f16_wmma_v3_instances. * Added SplitN test cases for wmma. * Conv3d_bwd_data_scale_wmma_v3 instances. * Conv3d_bwd_data_bilinear_wmma_v3_instances * Renaming the device level instances file to common name , since it is defined for different DataTypes. * Renaming the instances and fixing typo * Added the test cases to regression test list * NCHW support for wmma_v3 * Examples for bf16 and f16 bwd_data_wmma_v3 * Added transpose conditons for device impl * fixing bugs * Added the gemm_args array implmentation * WIP debug conv bwd * fix splitk * Grouped gemm fix * Update CmakeLists with EOF * Added more instances for tests * Fixed the run time error in examples and removed 3d conv examples. * Fixed a typo. * Updated CmakeLists to removed the 3d convultion deleted files * Added print error statements for unsupoorted argument * Added the merge conflict related changes * Fixed compilation error * Fixed the InstanceFactory duplication error. * Removed the print statements and added logs to Arg function * All the merge conflict related errors resolved * Added d_tensor tests.
* Added the missing example types of wmm_v3 * Merge error fix * Corrected the instance name * Reverted the bias relu change * Revereted the transpose load local change * Updated the regression test list with bwd_data_scale * Revert "Revereted the transpose load local change" This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b. * Revert "Merge error fix" This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b. * Reverting the local change * Added merge error fix * Build error fix due to merge conflicts * Added bias_relu example for wmma_v3 * Modified the main method in dtensor tests * Updated the dtensor tests to pick all the shapes * Updated the dtensor test shapes. * Updated the mem operations in tests. * Added reference func * Fixed typos in device impl * Added new header file and modified the include file for 3d tests * Renamed the test file and added reference func call. * clang format fix * Added ignore params * Modified device impl and tests * Removed debug print statements and updated dtensor test shapes * Fixing merge conflicts * Fixing more merge conflicts * Fixed copyrights * Updated the tuned instances to bilinear and scale. * Adding tuned instances to vanilla wmma_v3 * Removed all unused instances and modified test layouts. * Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances. * Fix clang format * Updated tuned f16/-genric instances * Formatting the instances file * Fixed copyrights and clang issues * Nonsense commit to force git to force * Removed the transpose instances * Added verified genric instances * Fixing namespace errors * Added todo for failing shapes * Formatting instance file * Fix instance list formatting * Removing unnecessary formats * Renamed the common file * Unification of xdl and wmma bwd_data tests * Updated Cmake * Added all layout types and deleted code. * Updated Cmake to add the condition to all tests. 
--------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `53a1e4f551`]	2025-12-30 16:25:08 +01:00
assistant-librarian[bot]	6850e5e7bb	Merge commit 'dae85ead64c16b34eaa643d09fb0d6da008ca814' into develop	2025-12-29 15:14:37 +00:00
yadaish	fc3ffa0d75	[CK_TILE] support split-k a16w4 gemm1 (#3389) * initial version to support moe gemm1 split-k * add missing args * fix build warning * update reference * for split-k disable bias and weight * remove debug log * fix format * fix div by zero errors * fix cmake config * update * resolve conflicts * remove useless changes * reformat * fix * remove useless changes * fix ci --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu> [ROCm/composable_kernel commit: `dae85ead64`]	2025-12-29 23:05:35 +08:00
assistant-librarian[bot]	a90c72d560	Merge commit 'a0acc83a72c84a8cdbbdef6f397e617ac040aa72' into develop	2025-12-29 14:13:26 +00:00
JH-Leon-KIM-AMD	3772cf9dd4	[CK_BUILDER] Add GPU Reference Algorithm to CK Builder (#3381) * [CK_BUILDER] Integrate GPU reference as ConvAlgorithm Add GPU reference as a ConvAlgorithm specialization, enabling: - Unified Builder API for reference and optimized kernels - Future ckProfiler integration for validation - First step toward numerical validation in Builder tests Changes: - Add ConvAlgorithmSpecialization::REFERENCE enum - Add ConvAlgorithm_Reference struct - Add IsReferenceAlgorithm concept - Create 3 reference factories (Forward, BwdData, BwdWeight) - Wire into conv_dispatcher - Add proof-of-concept test (passing) Test result: Can instantiate reference through Builder API * Add GPU reference execution tests - Reference kernel executes through Builder (459ms) - Both reference and optimized can instantiate - Tests passing Next: Implement utilities for comparison * Optimized Builder kernel execution works - MakeArgument pattern implemented - Builder-generated kernel executes successfully - Tests passing (451ms execution) Next: Add comparison * VALIDATION COMPLETE: Builder == Reference Builder-generated kernel output matches GPU reference! Test: Validate_Optimized_vs_Reference_Forward_2D_FP16 Result: PASS ✓ This proves CK Builder generates correct code!
* Update to new Builder API All tests passing * Rename test file for clarity test_builder_kernel_execution -> test_builder_kernel_validation * Add all 3 directions support - Forward, Backward Data, Backward Weight - All reference factories working - Dispatcher wired for all directions - 9 tests passing Tests: - test_reference_execution: 3 tests (all directions) - test_optimized_execution: 3 tests (all directions) - test_builder_kernel_validation: 3 tests (fwd validated, bwd placeholders) * Add backward direction support - Backward data and weight dispatcher wiring - Fix factories for new API - All 3 directions tested - 9 tests passing * Refactor: Change IsReferenceAlgorithm from concept to consteval function Address review feedback: Use consteval function in dispatcher instead of concept, matching the pattern for other algorithms (Tile, XDL, WMMA, DL). - Remove IsReferenceAlgorithm concept from conv_algorithm_concepts.hpp - Add IsReferenceAlgorithm() consteval function to conv_dispatcher.hpp - Update dispatcher to use function call: IsReferenceAlgorithm<T>() - Remove redundant algorithm checks from reference factory requires clauses All tests passing (9/9). * Move Tile algorithm check outside direction block to support all directions * Implement MakeInvokerPointer interface and add random input validation - Implement full Argument/Invoker structs for old CK interface (not just nullptr) - Refactor with reference_common.hpp to reduce code duplication - Add random input validation tests: Builder vs direct GPU reference (all directions) - Fix layout: GNHWC -> NHWGC to match reference kernel expectations - All 12 tests pass with IDENTICAL results on random input * Move ConvAlgorithm_Reference to test/impl/conv_algorithm_types.hpp Keep types.hpp for data types only (enums), move algorithm descriptors to conv_algorithm_types.hpp as suggested by review. 
* Add static_assert to ensure reference factories only accept PassThrough operations Reference implementation doesn't support fused elementwise operations. Add compile-time validation to fail early with clear error message if non-PassThrough operations are specified on input, weight, or output. * Add InstanceTraits support for reference kernels - Store SIGNATURE/ALGORITHM/VERSION in Instance for reflection - Create shared ReferenceCommonTraits base for common properties - Add 3 direction-specific InstanceTraits specializations in one file - Include data type and layouts in instance_string output * Remove optimized kernel validation tests from reference-only branch * Use existing layout helper and organize reference tests Use LayoutToCK from conv_tensor_layout.hpp and move reference InstanceTraits test to validation folder. * Merge develop branch Fix DataType switch for new mixed precision types. * Fix comment spacing for CI * Convert IsReferenceAlgorithm from function to concept * Add reference tests to CI smoke tests * Consolidate 3 reference factories into single unified factory --------- Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com> [ROCm/composable_kernel commit: `a0acc83a72`]	2025-12-29 16:11:08 +02:00
assistant-librarian[bot]	0b6dde06c3	Merge commit '88ae4455806efe2019bb0403606f7c4a1e3d9c3a' into develop	2025-12-29 12:22:38 +00:00
Kiefer van Teutem	04d4dd1ada	Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470) * Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests. * Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong. [ROCm/composable_kernel commit: `88ae445580`]	2025-12-29 12:58:29 +01:00
assistant-librarian[bot]	c3c8c20144	Merge commit 'b0ea67e37725c26860a3520dc31c1f7a01164db9' into develop	2025-12-29 01:43:07 +00:00
Yi DING	9045cafc8c	[CK_TILE] MX FLATMM Fix M Padding (#3489) * Fix M Padding * Fix tensor desc ele space size [ROCm/composable_kernel commit: `b0ea67e377`]	2025-12-29 09:09:12 +08:00
assistant-librarian[bot]	dd314aaa48	Merge commit 'a3916a8d16d6e8d676b890ea3f242a180aeef61b' into develop	2025-12-27 09:13:12 +00:00
joyeamd	38a547df56	enable f8 tests (#3488) [ROCm/composable_kernel commit: `a3916a8d16`]	2025-12-27 00:21:56 -08:00
assistant-librarian[bot]	c9bf3fde79	Merge commit '7ce532eac7faab5041d472b7dabebf57e09fbaf6' into develop	2025-12-25 08:16:26 +00:00
Yi DING	d80a3f9c70	[CK_TILE] Align FMHA BWD Reference with Kernel Implementation (#3486) [ROCm/composable_kernel commit: `7ce532eac7`]	2025-12-25 16:12:36 +08:00
assistant-librarian[bot]	4f1df06484	Merge commit 'e08efa551ff260f0e55c839cfc0e2b64c929eb57' into develop	2025-12-25 07:15:36 +00:00
Erwin Terpstra	bd73699148	[CK_TILE] Grouped gemm quant tensor layouts (#3414) * feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time * chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times * chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems * fix: missing check for transposed load in bquant pipeline * chore: lower unit test tensors dimensions a bit for faster tests * chore: set grouped gemm example M/N warp tile to 16 --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `e08efa551f`]	2025-12-24 23:01:23 -08:00
assistant-librarian[bot]	199991cf05	Merge commit '14668a56e376550cd68d116aa64302a1df05b56f' into develop	2025-12-25 01:42:14 +00:00
Illia Silin	21d679acab	remove the LLVM_MAIN_REVISION usage (#3487) [ROCm/composable_kernel commit: `14668a56e3`]	2025-12-24 16:49:35 -08:00
assistant-librarian[bot]	446db13a0f	Merge commit '62a8ec155facd901232977b688d5225d72969709' into develop	2025-12-24 19:11:47 +00:00
Thrupti Raj Lakshmana Gowda	b17fa5656f	[CK TILE ENGINE] CI configuration with basic cases (#3475) * [CK TILE ENGINE] Adding GEMM BASIC TEST in Kenkins * fix RUN_TILE_ENGINE_BASIC_TESTS name typo * [CK Tile Engine] Updating basic CI * Resolving merging issues * Resolving merging issues --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `62a8ec155f`]	2025-12-24 10:45:56 -08:00
assistant-librarian[bot]	a169d59e06	Merge commit '7f68f3c4fa5bf478313c2147610317b199f9e65b' into develop	2025-12-24 17:14:26 +00:00
kensclin	b29e16aa67	Enable padding blockscale for abquant (#3453) * Enable padding blockscale for abquant * run clang-format * Reduce unnecessary testing * remove cout [ROCm/composable_kernel commit: `7f68f3c4fa`]	2025-12-24 09:12:40 -08:00
assistant-librarian[bot]	e1039a7eeb	Merge commit '1c3151963bd5abd30a5ced62f6859994a45f710e' into develop	2025-12-24 02:47:07 +00:00
Po Yen Chen	51b7e7d2d6	[CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425) * Add fp8bf16 support for batch_prefill * Fix wrong scale_s re-compute logic in batch_prefill * Fix wrong scale_s re-compute logic in fmha fwd * Fix batch_prefill codegen error * Remove no-longer used GetName() function * Add fp8 logits=True instances * Update CHANGELOG.md [ROCm/composable_kernel commit: `1c3151963b`]	2025-12-24 10:34:06 +08:00
assistant-librarian[bot]	27c1ae2774	Merge commit 'c0797c167143aa750936c108caa0945640eeefd1' into develop	2025-12-23 23:13:10 +00:00
jakpiase	0d94859dca	[CK_TILE] Minor splitk bugfix for gemms and conv (#3387) * fix for splitk if splitk < grid * add different splitk implementation * minor bugfix for streamk gemm * Add test --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `c0797c1671`]	2025-12-24 00:10:13 +01:00
assistant-librarian[bot]	166fe9db60	Merge commit 'e1381d6a712ce5703cd9bc9e3ec351fa91b1d87d' into develop	2025-12-23 11:12:47 +00:00
Johannes Graner	8f9d91fe6c	[CK grouped gemm] Fix grouped gemm two stage HasMainK0BlockLoop (#3466) * Re-enable two stage kernel * Only disable on HasMainKBlockLoop mismatch * Address PR comments [ROCm/composable_kernel commit: `e1381d6a71`]	2025-12-23 11:33:09 +01:00
assistant-librarian[bot]	3e31171d74	Merge commit '4ce7d4c511c7e98a9ac01580ed1e9112e59061a0' into develop	2025-12-23 10:13:44 +00:00
kabrahamAMD	34edd1d99d	[ck_builder] add utility functions to convolution (#3459) * reinstate conv_signature_utils.hpp * added tests for elementwise operation getters * add tests for getDataType functions * added test for no data type specified --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `4ce7d4c511`]	2025-12-23 10:39:49 +01:00
assistant-librarian[bot]	b8269a8c17	Merge commit 'ead81d1b0bba57b86ac28f3e2994dc97279f8eb3' into develop	2025-12-23 09:20:57 +00:00
jakpiase	b4626d7093	[CK_TILE] Add splitk support to ck tile conv bwd data (#3353) * add splitk support to ck tile conv bwd data * add reviewers suggestions * minor fix * removed splitkbatchoffset struct [ROCm/composable_kernel commit: `ead81d1b0b`]	2025-12-23 10:03:42 +01:00
assistant-librarian[bot]	e64347d747	Merge commit '8b73633e651822d90b66ffd7d174a21891a99611' into develop	2025-12-23 07:15:46 +00:00
Lyu, Xudong	677e3cd174	fix: handle void return type in TailHandler error path with ROCm6 compiler (clang++) (#3477) Replace `decltype(TailHandler<>(...)){}` with direct function call to fix compilation error when return type is void. Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `8b73633e65`]	2025-12-23 15:03:18 +08:00
assistant-librarian[bot]	d0bc7ccc31	Merge commit '6864a618f47e5ba8d28ada30e2a59da7d051085d' into develop	2025-12-23 06:16:51 +00:00
Yi DING	b0959a72b9	[CK_TILE] FMHA Ignore BWD Failed Cases in Smoke Test (#3480) [ROCm/composable_kernel commit: `6864a618f4`]	2025-12-23 13:28:15 +08:00
assistant-librarian[bot]	9569b291ed	Merge commit '2955d77f3cfb3515c6d36d54879ed65b854dafa6' into develop	2025-12-22 21:12:09 +00:00
Bartłomiej Kocot	83d15b7bb4	Fix grouped conv fwd wmma porting (#3479) * Fix grouped conv fwd wmma porting * add more limitations [ROCm/composable_kernel commit: `2955d77f3c`]	2025-12-22 21:32:48 +01:00
assistant-librarian[bot]	9d93dd9352	Merge commit 'a8aebb7a8efbd9860487a4bc563706cf7a71f988' into develop	2025-12-22 16:14:04 +00:00
Wojciech Laskowski	d8164b2632	Post-merge cleanup for WMMA grouped conv fwd (#3468) * remove duplicate aliases * Split scaleadd_ab instances for WMMA grouped conv fwd * removed big shape from the test [ROCm/composable_kernel commit: `a8aebb7a8e`]	2025-12-22 15:57:45 +01:00
assistant-librarian[bot]	7a55f53fcf	Merge commit '44f1b5c5de8c85cbae1520fa054405d96df67304' into develop	2025-12-22 01:42:28 +00:00
Bartłomiej Kocot	2228960cc4	Fix jenkinsfile for large tensor conv test (#3478) [ROCm/composable_kernel commit: `44f1b5c5de`]	2025-12-21 17:39:30 -08:00
assistant-librarian[bot]	5be6381bcb	Merge commit '9bd67c2cf2fe8e4479a433bcd6d467e2ea9aedb4' into develop	2025-12-20 01:40:48 +00:00
Jan Patrick Lehr	500d143fa8	[CK-TILE] Guard against compiler lexer diagnostic (#3444) * [CK-TILE] Guard against compiler lexer diagnostic A recent change to Clang added a lexer-level diagnostic about that C2y language feature. Since that is lexer level, the `__extension__` compiler built-in does not work as it is only respected after the lexer when parsing. This change adds guarding pragmas to disable the diagnostic in the lexer and not lead to warnings being treated as errors. * Fixing still existing build issue Once the one warning was removed, another one poppoed up. Both are related to the same c2y feature. Thus, ignoring both. * clang-format handling --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `9bd67c2cf2`]	2025-12-19 17:32:20 -08:00
assistant-librarian[bot]	09019c1024	Merge commit 'cbc83359649b1b56cd745c4102e9556112f942c2' into develop	2025-12-19 23:13:41 +00:00
Bartłomiej Kocot	38ff45abf7	Improve XDL to WMMA porting for grouped conv fwd (#3456) Refactors the way the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, especially to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles. The changes use MXdlPerWave instead of NXdlPerWave to increase number of waves per M dim. [ROCm/composable_kernel commit: `cbc8335964`]	2025-12-19 15:58:51 -07:00
Illia Silin	34d26c63a0	get LLVM_MAIN_REVISION macro from compiler header (#3469) [ROCm/composable_kernel commit: `2d9c962e2c`]	2025-12-19 14:57:12 -08:00
Geo Min	f9b62a0e99	Revert "details from org var (#3431)" (#3473) This reverts commit `e43a252d19`. [ROCm/composable_kernel commit: `f67a20b0be`]	2025-12-19 14:10:58 -08:00
assistant-librarian[bot]	ac5610980f	Merge commit 'e22622f0ec185bf9e717523c8734acfb13dad0a5' into develop	2025-12-19 16:14:44 +00:00
Thrupti Raj Lakshmana Gowda	2dacac9561	[TILE ENGINE] Restructure to Base class of GEMM (#3434) [ROCm/composable_kernel commit: `e22622f0ec`]	2025-12-19 23:53:56 +08:00
assistant-librarian[bot]	44900da55a	Merge commit '0fd2b2f0459b10570788b74bf1a794095a18fc96' into develop	2025-12-19 15:13:27 +00:00

3949 Commits