composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 17:26:00 +00:00

Author	SHA1	Message	Date
Illia Silin	6bfef63414	enable aiter test_mha in daily CI (#2659 )	2025-08-11 09:50:33 -07:00
Yashvardhan Agarwal	191c62967b	Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 ) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format	2025-08-11 15:01:33 +02:00
geozhai	1e1ee758fa	update CK build instruction step 4 (#2563 ) Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-11 00:26:13 -04:00
Illia Silin	8613aa1e40	remove ck_tile transpose and gemm stages from CI (#2646 )	2025-08-08 10:48:44 -07:00
Illia Silin	7ac850ac72	Add daily AITER tests on gfx942. (#2639 ) * add option to select aiter branch, add tests on gfx942	2025-08-08 09:30:46 -07:00
Max Podkorytov	ab26026835	[CK-tile] add more tests for batched transpose testing the rectangular block tile sizes (#2634 ) * add failing tests * swap out and reference * add constraint assert to transpose input distribution * test both pipelines with rectangular block tile * print mismatched indices * add a smaller failing test for old pipeline * print grid and block * fill output before operating on it * swap m/n tile sizes and make one test pass * add device syncs * add one more flipped test case * flip block tile at host arg init * fix tiles for lds pipeline * clang-format * rename tests * roll back error check * remove device syncs * reduce large test case's size	2025-08-07 16:51:53 -07:00
Sami Remes	3c9400471d	[CK_TILE] Enable persistent kernel and tail handler in tile_engine (#2300 ) * Enable persistent kernel in tile_engine and use tail handler * Fix formatting * Add persistent to default_config.json * Remove extra newlines and add persistent also to user config * Reduce instances from default_config.json * add persistent to benchmark.json and custom_ci_config.json * changed the config file to have few instances --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-07 16:03:49 -07:00
Gino Lu	5d6d236b25	Add e8m0 scaled convert into CK_TILE (#2617 ) * first commit * remove redundent code * modify according to comments. * fix type_convert error with scaled_type_convert	2025-08-07 21:37:28 +08:00
Yi DING	b0a97498b0	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad	2025-08-07 21:24:43 +08:00
Sami Remes	ffdee5e774	[CK_TILE] Enable printing more structures in CK-Tile (#2443 ) * Add more printing to core cktile * Revert other changes in static encoding pattern * Refactor to using a free print() function * Remove loops and print just the containers * Print tuple with better formatting, fix sequence compilation * Add some tests for print utility * Add print utility header * Print for static_encoding_pattern * add buffer_view printing * Align vector_traits * Fix formatting * Lower-case enum strings Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Remove empty comment lines * Fix test with lower-case too * Reduce repeated code in print tests, move helper function closer to type definition, test X&Y * Add test_print_common.hpp * add print.hpp in core.hpp --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-08-07 15:45:27 +03:00
Enrico Degregori	21e9983913	Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 )" (#2637 ) This reverts commit `2203b0ddfe`. Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-07 12:30:08 +02:00
Bartłomiej Kocot	54c7e08a2f	Fix clang format after conv changes (#2636 )	2025-08-07 10:00:09 +02:00
Bartłomiej Kocot	5328b232b2	Grouped Convolution Forward Infer Bias Bnorm Activ (#2621 ) * Grouped Convolution Forward Infer Bias Bnorm Activ * 3d	2025-08-07 08:36:47 +02:00
Max Podkorytov	1824d65758	modernize scripts for running cmake and clang-format (#2503 ) Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-06 10:15:44 -07:00
Yashvardhan Agarwal	4750b293fe	General 2D Reduction Kernel (#2535 ) * General 2D Reduction Kernel * Move the reduction kernel from the example * Split the code and add the necessary policy, problem, shape files as per ck_tile convention * Add/modify the headers * Modified the example to work with the 'new' kernel * Added tests for the kernel * N-D refernce reduce * Added support for N-D input with transform to 2D * Added padding to support various input sized tensors * Bug fix in the thread buffer constructor * Some comments to explain the reduce2d block kernel * comments resolution * clang-format * comments resolution * clang-format * clang-format * comments resolution * clang-format	2025-08-06 15:36:59 +02:00
Adam Osewski	2622ff06cb	Remove unused lds direct load instruction. (#2573 ) This functionality is replaced by amd_async_buffer_load Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-06 15:16:12 +02:00
Yi DING	15e8b6ccf7	[CK_TILE] Fix FMHA qr_async causing errors in FA (#2627 )	2025-08-06 20:04:23 +08:00
Thomas Ning	07469142cb	delete all slp compilation flag in CK Tile (#2625 )	2025-08-06 00:34:39 -07:00
Illia Silin	833ae1d051	Revert "Reduce build time tile engine (#2579 )" (#2623 ) This reverts commit `e5b79b26fa`. rocm-7.1.0	2025-08-05 09:27:55 -07:00
Enrico Degregori	2203b0ddfe	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 ) * Add padding 1x1Stride1Pad0 conv specialization * Add gridwise checks for conv cshufflev3 * Merge padding with previous transforms * Apply transform changes for padding to default specialization as well --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-05 15:23:19 +02:00
Thomas Ning	cbfecf8d7a	Persistent grouped gemm CompV4 Enablement & Polish (#2605 ) * enable the persistent kernel for CompV4 * polish the example and clang format * fix the non-persistent kernel error --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-04 23:43:01 -07:00
Max Podkorytov	2a78da4708	fix build for test_ck_tile_fp8 on rhel8 (#2615 )	2025-08-04 17:43:15 -07:00
Illia Silin	fb96b49666	fix test_mx_mfma errors (#2614 )	2025-08-04 11:43:47 -07:00
rahjain-amd	59245df46d	Fix Debug Build for ckProfiler (#2609 ) Problem ======= relocation R_X86_64_32 out of range: 5405348154 is not in [0, 4294967295] Solution ======== The problem was caused due the limitation comes from the 32 bit offsets used in original DWARF standard. We have the option to switch to 64bit offset for your libs which free us from 4G size boundary. add -gdwarf64 and -Og to avoid this limit.	2025-08-04 11:28:09 -07:00
Jinchao Xu	15eb493152	Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611 ) Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances by splitting debug information into separate .dwo files. Add explicit cast to avoid signed/unsigned comparison warning.	2025-08-04 11:26:08 -07:00
Bartłomiej Kocot	8655ba989c	Mark non-grouped convolutions instances as deprecated (#2595 ) * Mark non-grouped convolutions instances as deprecated * Update CHANGELOG.md Co-authored-by: John Afaganis <john.afaganis@amd.com> * Update library/src/tensor_operation_instance/gpu/conv1d_bwd_data/device_conv1d_bwd_data_xdl_nwc_kxc_nwk_bf16_instance.cpp Co-authored-by: John Afaganis <john.afaganis@amd.com> --------- Co-authored-by: John Afaganis <john.afaganis@amd.com>	2025-08-04 16:49:55 +02:00
Max Podkorytov	0d9439760f	remove std::format (#2604 )	2025-08-01 19:22:07 -07:00
Illia Silin	b786d12e56	remove std=c++17 compiler flag (#2603 )	2025-08-01 16:18:16 -07:00
Max Podkorytov	f36cb5b2aa	[CK-tile] remove old ck-tile transpose test (#2591 ) * remove old ck-tile transpose test * rename test exe for consistency * replace batched transpose regression test	2025-08-01 14:50:09 -07:00
Thomas Ning	e5b79b26fa	Reduce build time tile engine (#2579 ) * Modify CMakeLists to allow for splitting. * Modify CMakeLists for data and layout logic. * Run tests and get build artifact. * Test new Cmakelists for speedup. * Further improvements for speedup. * turn off the FMHA * turn off the automatic tile engine gemm * minor fix * disable the transpose test first * Address the comment * Jenkinsfile * change the make thread to 64 * change the compile thread to 32 * Try to use with less OS memory space * Have the Unity build batch size to 2 * reduce the chunk size --------- Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>	2025-08-01 14:42:33 -07:00
Illia Silin	788e8a878e	update the switch condition for buffer built-ins (#2602 )	2025-08-01 14:30:07 -07:00
Thomas Ning	7c44a763fa	Fix the GFX 950 Universal GEMM (#2597 ) * solve the gfx950 error * clang format * fix a typo error --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-01 09:32:24 -07:00
Illia Silin	e6104daecc	Add a daily CI stage to test AITER with latest CK. (#2598 ) * add a CI stage for AITER testing	2025-08-01 07:55:51 -07:00
lalala-sh	bb5c478295	fix weight index out of range (#2414 )	2025-08-01 17:50:02 +08:00
Aviral Goel	1441a0a7ee	Integration of a new pipeline for weight preshuffle into gemm examples (#2516 ) * something khushbu can help with * v1 v2 works with flatmm develop * v0 v1 v2 numerical error gone * Fixing numerical error, and interchange preshuffle configs to match with flatmm * Refactor GEMM pipeline configurations and integrate preshuffle support - Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3). - Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations. - Removed obsolete code and comments * clang format * fix vectorloadsize bug * add the Preshuffle3 * update kwarp calculation in gemm utils * update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch * fix: add CK_GFX950_SUPPORT macro for gfx950 detection * default disable rotating buffer * docs(CHANGELOG): update changelog for rocm 7.0 * Revert "docs(CHANGELOG): update changelog for rocm 7.0" This reverts commit `2bc16fff84`. * Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files. * revert example/ck_tile/flatmm to its original state * remove comment added by second author * switch to xor ALDSDescriptor * modify the MakeALdsDescriptor() * temporary profiling script * getting rid of line marker compiler error * UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy * add a minor fix for the config * typo fix * Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2 * revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp * revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp * reenable the GemmSpatiallyLocalTilePartitioner * make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline * remove hardcoded true for preshuffle bool template argument * rename script * remove gemm_profilie.sh script * merge conflict resolve * clang formatted * typo fix * Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp * Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy * Fix missing newline at end of file in run_gemm_example.inc * Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1 * addressing review comments * removing debug code * addressing review comments * Revert "addressing review comments" This reverts commit `29c45192ba`. * updating tile_engine code * addressing review comments --------- Co-authored-by: amd-khushbu <khuagarw@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-08-01 00:04:54 -07:00
Khushbu Agarwal	88d72178d6	[CK_Tile] Updating gpu timer when doing flush cache (#2593 ) * Missed updating function names in example * updating timer * code cleanup * addressing review comments * updating tile_engine code * addressing review comments	2025-07-31 16:43:33 -07:00
Aviral Goel	546ef78d1d	Disable fp8 instances on unsupported targets (#2592 ) * Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt * Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt * Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt * Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt --------- Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>	2025-07-31 12:18:02 -07:00
Ville Pietilä	e962a41638	Automatic deduction of split-K value for grouped convolution (#2491 ) * Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Use simple best occupancy model to calculate the split-K. * Handle split-K autodeduction in explicit gemm conv. * Add unit tests for split-K autodeduction. * Remove oversubscription. * Small fixes. * Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle. * Run clang formatting. * Fix error handling in the conv profiler. * Add missing documentation for the autodeducted split-K values. * Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver. * Fix clang formatting and split-K profiler documentation. * Rename max_occupancy value variable. * Calculate grid size for split-K autodeduction directly from input array shapes and template params. --------- Co-authored-by: Ville Pietilä <>	2025-07-31 12:08:45 +02:00
Anton Gorenko	7b074249f4	[CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571 ) * Add tests for host convesion f32/f16 to f8 * Add tests for host convesion from f8 to f32/f16 * Fix UB and corner cases in f32/f16 to/from f8 conversion * There are UBs when very small values are converted to f8: bitshifts can be larger that type width. Using unsigned long long does not help because exponent_diff >= 64 in such cases. This causes that values like 2.117582368e-22 are converted to non-zero f8 in host validation of FMHA tests, test_f8 crashes with segfault in completely irrelevant code like GTest internals or produces non-deterministic results etc. * Fix FNUZ conversion to return NaN for NaN inputs. * Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16 conversion. * Replace some magic numbers with values from numeric_traits * Build tests only on devices supporting the type	2025-07-31 09:54:17 +05:00
Illia Silin	e8709c24f4	upgrade clang-format version in install_precommit.sh (#2589 )	2025-07-30 08:02:25 -07:00
Max Podkorytov	de0cdb4c31	[CK-tile] add gtest for ck-tile batched transpose kernels (#2585 ) * add a dummy test file * add kernel launch logic to the test * transfer all test cases into gtest params * factor kernel out into test config * add load transpose pipeline tests * add padded tests and skip invalid kernels at runtime * enum class for pipeline type * add multiwarp test cases * fix type * try to solve the problem --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-07-30 07:31:05 -07:00
Gino Lu	b25d512e8a	add constexpr to pk_fp4::pack/unpack() (#2586 )	2025-07-30 10:29:04 -04:00
Khushbu Agarwal	61e21f5567	Update to gpu_timer for rotating_buffer (#2524 ) * update gpu_timer for rotating buffer as hipblasLt's implementation * timing fix * Updating gpu timer for old ck as well * Revert "Updating gpu timer for old ck as well" This reverts commit `958cd1bc99`. * code clean up with runtime argument; function rename * code cleanup * general timer fixes * bug fix * clang formatted * addressing reveiew comments * clang formatted * Addressing review comments * CI fix --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-29 15:21:05 -07:00
Illia Silin	b80099cc5f	Revert "Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558 )" (#2584 ) This reverts commit `c64a0c65b9`.	2025-07-29 13:04:51 -07:00
Thomas Ning	9d4b494f07	Expand the bandwidth of direct_global_to_lds for gfx950 (#2576 ) * Expand the bandwidth of direct_global_to_lds for gfx950 * clang-format * fix the remod.py and script for clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-28 23:56:53 -07:00
rocking	01642ca8b1	set default optdim (#2580 )	2025-07-29 13:44:10 +08:00
Illia Silin	49723e94bb	fix the clang-format (#2578 )	2025-07-28 20:49:55 -07:00
Yi DING	1926cd0cb8	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130 ) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-29 09:31:14 +08:00
Andres Lugo	7fe50dc3da	Remove filter for only batch on receipt 4 (#2574 ) Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA	2025-07-28 14:53:24 -07:00
Bartłomiej Kocot	5b244105d9	Enable multiple D for grouped conv fwd large tensors (#2572 )	2025-07-28 22:39:07 +02:00

1 2 3 4 5 ...

2186 Commits