composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-12 18:17:58 +00:00

Author	SHA1	Message	Date
aska-0096	96d24497f5	fix conflict. disable all v-col instance for fmha fwd	2025-08-12 04:02:41 +00:00
aska-0096	1716171be4	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-12 03:52:34 +00:00
Yi DING	4fde1646e5	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628 ) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp	2025-08-12 11:11:55 +08:00
aska-0096	f43e903b1d	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-12 01:52:52 +00:00
Aviral Goel	a7badc6ec5	feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582 ) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-08-11 10:54:37 -07:00
aska-0096	498d234ab8	change the warp setting for hdim32 fmha fwd	2025-08-11 15:37:37 +00:00
aska-0096	8c101ccb88	fix bug on non-gfx950	2025-08-08 18:35:53 +00:00
aska-0096	106edeecd9	remove non-necessary change	2025-08-08 09:07:40 +00:00
aska-0096	78edd7303b	bug fix, clang format;	2025-08-08 09:04:02 +00:00
aska-0096	3b9fb6af38	Remove unnecessary changes	2025-08-08 08:08:03 +00:00
aska-0096	6bb57c2c57	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-08 07:50:12 +00:00
aska-0096	1ecee378d5	remove unnecessary files; rename some files	2025-08-08 06:19:31 +00:00
aska-0096	b4640a9de6	merge fa_decode pipeline into fmha_fwd api	2025-08-08 05:46:18 +00:00
Yi DING	b0a97498b0	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad	2025-08-07 21:24:43 +08:00
Yashvardhan Agarwal	4750b293fe	General 2D Reduction Kernel (#2535 ) * General 2D Reduction Kernel * Move the reduction kernel from the example * Split the code and add the necessary policy, problem, shape files as per ck_tile convention * Add/modify the headers * Modified the example to work with the 'new' kernel * Added tests for the kernel * N-D refernce reduce * Added support for N-D input with transform to 2D * Added padding to support various input sized tensors * Bug fix in the thread buffer constructor * Some comments to explain the reduce2d block kernel * comments resolution * clang-format * comments resolution * clang-format * clang-format * comments resolution * clang-format	2025-08-06 15:36:59 +02:00
Yi DING	15e8b6ccf7	[CK_TILE] Fix FMHA qr_async causing errors in FA (#2627 )	2025-08-06 20:04:23 +08:00
Thomas Ning	07469142cb	delete all slp compilation flag in CK Tile (#2625 )	2025-08-06 00:34:39 -07:00
Thomas Ning	cbfecf8d7a	Persistent grouped gemm CompV4 Enablement & Polish (#2605 ) * enable the persistent kernel for CompV4 * polish the example and clang format * fix the non-persistent kernel error --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-04 23:43:01 -07:00
aska-0096	0d12fc944f	Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA	2025-08-04 10:27:42 +00:00
aska-0096	746f4ccb99	Load Q through lds, implement xor;	2025-08-04 06:49:01 +00:00
aska-0096	2d4e73d2b4	small refactor	2025-08-01 10:44:54 +00:00
Aviral Goel	1441a0a7ee	Integration of a new pipeline for weight preshuffle into gemm examples (#2516 ) * something khushbu can help with * v1 v2 works with flatmm develop * v0 v1 v2 numerical error gone * Fixing numerical error, and interchange preshuffle configs to match with flatmm * Refactor GEMM pipeline configurations and integrate preshuffle support - Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3). - Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations. - Removed obsolete code and comments * clang format * fix vectorloadsize bug * add the Preshuffle3 * update kwarp calculation in gemm utils * update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch * fix: add CK_GFX950_SUPPORT macro for gfx950 detection * default disable rotating buffer * docs(CHANGELOG): update changelog for rocm 7.0 * Revert "docs(CHANGELOG): update changelog for rocm 7.0" This reverts commit `2bc16fff84`. * Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files. * revert example/ck_tile/flatmm to its original state * remove comment added by second author * switch to xor ALDSDescriptor * modify the MakeALdsDescriptor() * temporary profiling script * getting rid of line marker compiler error * UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy * add a minor fix for the config * typo fix * Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2 * revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp * revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp * reenable the GemmSpatiallyLocalTilePartitioner * make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline * remove hardcoded true for preshuffle bool template argument * rename script * remove gemm_profilie.sh script * merge conflict resolve * clang formatted * typo fix * Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp * Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy * Fix missing newline at end of file in run_gemm_example.inc * Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1 * addressing review comments * removing debug code * addressing review comments * Revert "addressing review comments" This reverts commit `29c45192ba`. * updating tile_engine code * addressing review comments --------- Co-authored-by: amd-khushbu <khuagarw@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-08-01 00:04:54 -07:00
Khushbu Agarwal	88d72178d6	[CK_Tile] Updating gpu timer when doing flush cache (#2593 ) * Missed updating function names in example * updating timer * code cleanup * addressing review comments * updating tile_engine code * addressing review comments	2025-07-31 16:43:33 -07:00
aska-0096	a28b6e67fe	upgrade prefill pipeline; simple iglp; consistent data produce and consume order	2025-07-31 10:25:37 +00:00
aska-0096	75cba48682	enable larger tile size; upgrade xor pattern	2025-07-31 05:13:27 +00:00
aska-0096	69890afc98	remove all lds bankconflict with xor layouts	2025-07-30 12:25:33 +00:00
aska-0096	8dacc35c4c	enable prefill overload operator().	2025-07-30 03:51:06 +00:00
Khushbu Agarwal	61e21f5567	Update to gpu_timer for rotating_buffer (#2524 ) * update gpu_timer for rotating buffer as hipblasLt's implementation * timing fix * Updating gpu timer for old ck as well * Revert "Updating gpu timer for old ck as well" This reverts commit `958cd1bc99`. * code clean up with runtime argument; function rename * code cleanup * general timer fixes * bug fix * clang formatted * addressing reveiew comments * clang formatted * Addressing review comments * CI fix --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-29 15:21:05 -07:00
rocking	01642ca8b1	set default optdim (#2580 )	2025-07-29 13:44:10 +08:00
Yi DING	1926cd0cb8	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130 ) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-29 09:31:14 +08:00
Andres Lugo	7fe50dc3da	Remove filter for only batch on receipt 4 (#2574 ) Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA	2025-07-28 14:53:24 -07:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
rocking	b36e0b029f	[CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539 ) * uncomment all the headdim, use optdim to control * change default back to -1 * uncomment splitkv instance * Fix typo in receipt 4 for appendkv * support optdim for bwd, splitkv and appendkv * Fix 192 key error --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>	2025-07-28 17:16:32 +08:00
Max Podkorytov	821cd26c13	[CK-Tile] Merge transpose examples (#2450 ) * unify pipeline signature with existing example * iwyu * move stuff around in load-tile-transpose * cleanups in batched transpose pipeline * comments * use same inputs size * cleaner printf * print host args * use 64 block sides in the 37_transpose example * roll back grid dimension size adjustment for 37_transpose example * transpose grid for 37_transpose to unify with 35_batched_transpose * unify grid computation logic * make policy methods device only (since they are used only on device from the pipeline) * more host/device attribute cleanups * copy over problem * move over pipeline and policy * add switch to batched transpose api * make the lds problem more similar to original problem * factor out logic into traits * factor out conditional compilation into trait parameter * propagate pipeline to args * unhardcode pipeline dispatch parameter * refactor vector size * put warp tile out of dispatch * rename template parameter for trait * rewrite vector size in terms of problem * mark policy-internal struct variable as device * factor out input distribution and thread access pattern from policies * reword vector size * use datatype across batched transpose pipelines, problems and kernel * remove transpose traits from lds pipeline * add padding to the lds pipeline interface * add comment * remove ck_tile example #37 * update cmakelists * add test for new pipeline * update batched transpose test * roll back load_tile_transpose changes * remove comments * pack dispatch parameters into a config * padM can be enabled * adjust lds vector size to enable padding along N * update test * clean up logic * swap m/n input vector size * adjust perf test script * sweep over C/W in perf test * count both read and written bytes into bandwidth (x2 the number) * clang-format * widen size range for perf test * remove 64k x 64k case; it's too large for index * remove thread tile from dispatch * Solve merge conflict * fix compile * modify the transpose * solve the test error and clang format * Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463) * Add logging to IsSupported. * Less casting in AddClamp * Conv+bias+clamp instances & profiler BF16 * Fix 3D instances & run just 1x for verification. * :Run just once for verification conv fwd. * ckProfiler conv fwd clampwq * Remove exec bit & formatting * Add support for MultiD for grouped conv fwd v3. * Enable 2Lds. * clean * align instances * align instances * profiler fixes * Fixes * fix * fix --------- Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Fixing 0ms and inf GB/s issue in img2col (#2565) issue : ==== ``` sh $ bin/tile_example_img2col Perf: 0 ms, inf GB/s ``` solution : ====== Problem occured because config.time_kernel is false by default. if false, then no need to calculate perf, just print proper message `image_to_coloumn: pass, No Perf generated due to config.time_kernel=0` * merge with develop * solve clang format --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>	2025-07-26 21:51:54 -07:00
Bartłomiej Kocot	5741edf761	Fix clang format (#2567 ) * clean * clang format fix	2025-07-25 09:54:34 -07:00
rahjain-amd	78082855d8	Fixing 0ms and inf GB/s issue in img2col (#2565 ) issue : ==== ``` sh $ bin/tile_example_img2col Perf: 0 ms, inf GB/s ``` solution : ====== Problem occured because config.time_kernel is false by default. if false, then no need to calculate perf, just print proper message `image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`	2025-07-25 21:15:50 +05:30
aska-0096	13bcc913de	fix the lds alignment caused performance regression	2025-07-25 07:10:01 +00:00
Mateusz Ozga	b507d889c1	[CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520 ) * Init commit new API * apply clang-format * PreShuffle preapring * Apply Preshuffle condition to universal_gemm * Fix: convert size_t to index_t * Review changes * Mode 100755 -> 100644 --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-07-24 20:39:56 +02:00
Yashvardhan Agarwal	606b0cc947	[CK_TILE] Support for elementwise kernel (#2246 ) * Elementwise kernel implementation Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: yashagar <yashagar@amd.com> * Elementwise with generalized nDims * Adding the n-ary input tensor feature * Generalize dimensions on top of inputs * Add TFLOPS + remove std usage for tuples * 1D basecase optimization * Cleanup code + refactoring to a common interface * Generalize to unary and add an example * Cleanup, refactoring and commenting * Suggestions for LWPCK-3170: elementwise kernel improvements * Clang-format: remod.py * Replace InputTensorType with XDataType as the type of input_tensors * Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs * Move examples to folder 19_elementwise * Add missing copyright headers and fix some existing ones * Replace an assert with throw std::runtime_error in elementwise example * Avoid reading the output by using make_static_distributed_tensor for y_tile * Removed two unused includes * No need to move windows to the next block when each workgroup processes a single tile * Only copy input tensors to the device * Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example * Adding output strides to the kernel, transposition example and update the other examples * Changes made by remod.py * Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp * Reuse generic reference binary/unary operation in examples + refactoring the transpose reference * Fix comments in elementwise_example.cpp - Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses - ElementWiseTraits was renamed to ElementWiseShape - Adopt suggestions made by Copilot when prompted to check for factual or typographical errors * Simplify CMakeLists.txt and remove the unused variables this uncovers * Rename a file and fix some copyright statements * Changes made by script/clang-format-overwrite.sh * Add basic unit test for ElementWiseKernel * Remove left-over uninformative comment in apply unit test * Changes made by clang-format-overwrite.sh * fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp * Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array * Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally * Move examples to folder 20_elementwise * Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK * Fix CLang formating * Bump up the elementwise example folder number * Elementwise: add padding + minor cleanup * Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view * Add isSupportedArg to Elementwise kernel + addapt example and unit tests * Fix clang-format on the unit test file --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-07-24 11:21:45 +02:00
jakpiase	6681593864	[CK_TILE] Grouped Convolution Backward Weight Kernel (#2357 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * resolved conflicts * rebase fixes * fixes * tmp * add working support for splitk * minor fix * fixes * fixes * minor fix * small fix * Split K and preprocessing fixes --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-07-24 10:41:35 +02:00
Illia Silin	1b6f024836	refactor fmha_bwd.py (#2546 )	2025-07-23 09:09:56 -07:00
Haocong WANG	a5fdc663c8	fix async copytest bug (#2509 ) * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * change bit width check * refactor macros into constexpr functions which still get inlined * wrap s_waitcnt api * parameterize test * cleanup * cleanup fp8 stub * add fp8 test cases; todo which input parameters are valid? * replace n for fp8 in test cases * add large shapes; fp8 fails again * change input init * test sync/async * time the test * clang-format test * use float instead of bfloat to cover a 4-byte type * fix logic - arg sections should be 'or'd * make block_sync_lds_direct_load interface similar to old ck * fix a few comment typos * name common shapes * revert the example to original logic of not waiting lds * clang-format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-23 00:14:02 -07:00
Cong Ma	e62710e461	ck_tile kernel for gemm with groupwise quantized A tensor (#2473 ) * ck_tile kernel for gemm with groupwise quantized A or B tensor. This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. 1. fp8, fp8 -> f32 2. bf8, bf8 -> f32 3. i4, fp8 -> f32 4. i4, bf8 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers. * [Block Scale Gemm] Only run gemm quant examples on __gfx94__ - Only run gemm quant examples on __gfx94__ for usage of `v_cvt_pk_fp8_f32` - Format the code * [Block Scale Gemm] Remove Bquant Gemm BlockScale This cleanup is in preparation for future development of bquant. By isolating Aquant-related code, we can streamline the codebase and make it easier to add and maintain bquant functionality in subsequent updates. * [Block Scale Gemm] Format code with clang-format-12 The latest clang-format (v19) in ROCm 7.0 generate different result than clang-format-12 which is used in CK CI. Format code with clang-format-12 for consistency. * [Block Scale Gemm] Split the k direction loop - Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp to make the logic clearer. - Disable C transposition. * [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm * [Block Scale Gemm] Update copyright * test * Add TailHandler * Move TileDistributionEncodingPatternAQ * Refactor * refactor * fix bug * fix bug * help solve the PR comment * Format the code * [Block Scale Gemm] Add unit tests * [Block Scale Gemm] Add support to 16x16x32 MFMA - Add support to 16x16x32 MFMA - Fix a bug when exchange data crossing lanes --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com> Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-07-23 00:10:16 -07:00
aska-0096	14e0ab70c6	tempsave. asynccopy+trload sanity checked	2025-07-22 08:04:05 +00:00
aska-0096	1b468bac0b	tempsave, trload+asyncload done	2025-07-21 05:55:55 +00:00
aska-0096	afd96d8180	compile pass	2025-07-18 10:04:34 +00:00
aska-0096	5616551115	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-07-18 05:17:27 +00:00
Linjun-AMD	095393276a	h_dim256 fmha use async_qr pipeline (#2510 )	2025-07-18 09:59:38 +08:00
aska-0096	94b6430489	temp save	2025-07-17 10:06:09 +00:00
aska-0096	7e330553dc	Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline	2025-07-17 07:24:32 +00:00

1 2 3 4 5 ...

260 Commits