composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 11:47:48 +00:00

Author	SHA1	Message	Date
YC Lin	ae275aa105	[GEMM] Refactor block gemm, pipeline, and policy of instruction schedule opt	2025-07-28 14:54:51 -04:00
YC Lin	6113ca8062	[Add] Add build option for generating assembly	2025-07-28 14:54:51 -04:00
YC Lin	97a960042b	[GEMM] Refactor block gemm and pipeline policy of instruction schedule	2025-07-28 14:54:51 -04:00
Clement Lin	8785e6599e	Add flash_attention_fwd toy_example	2025-07-28 14:54:51 -04:00
mhYang	a949b82c9f	Update tile size and use slc	2025-07-28 14:54:51 -04:00
mhYang	9158612a9f	Fix add flops calculation	2025-07-28 14:54:51 -04:00
ClementLinCF	88a4c7414f	Create README.md	2025-07-28 14:54:51 -04:00
mhYang	ac972bfd11	Use mfma 16x16x32	2025-07-28 14:54:51 -04:00
mhYang	5326d403e4	Fix KERNEL_D config	2025-07-28 14:54:51 -04:00
YC Lin	fe319b97ae	[GEMM] Add pragma message for different MFMA options	2025-07-28 14:54:51 -04:00
YC Lin	76751567b5	[GEMM] Fix print typos	2025-07-28 14:54:51 -04:00
Clement Lin	4c526ab140	Fix indentation typo	2025-07-28 14:54:51 -04:00
Clement Lin	5b10e9f3dd	[GEMM] Fix MFMA condition checks	2025-07-28 14:54:51 -04:00
Clement Lin	a95665a6af	[GEMM] Add new macro options check	2025-07-28 14:54:51 -04:00
Clement Lin	1099762267	[GEMM] Add macros for multiple optimization options	2025-07-28 14:54:51 -04:00
YC Lin	890a159877	[GEMM] default MFMA config	2025-07-28 14:54:51 -04:00
YC Lin	8d75ae7c96	git push test	2025-07-28 14:54:51 -04:00
root	a36d246cc0	[GEMM] fix MFMA configurations	2025-07-28 14:54:51 -04:00
mhYang	15e6f36f66	Adjust mfma schedule order	2025-07-28 14:54:51 -04:00
Clement Lin	e9f7c9bf42	[GEMM] Replace const auto with constexpr index_t	2025-07-28 14:54:51 -04:00
Clement Lin	cef77c1dcb	[GEMM] Update cache-aware wg schedule	2025-07-28 14:54:51 -04:00
bobofang	127e742e96	Add MFMA M16N16K16 and M16N16K32 methods these two methods are default off	2025-07-28 14:54:51 -04:00
YC Lin	e866f814f9	[GEMM] remove a_col_major/b_row_major case	2025-07-28 14:54:51 -04:00
root	bf69235cfb	[GEMM] modify if-else locations	2025-07-28 14:54:51 -04:00
mhYang	ba8b5112c4	Fix AccDataType and CDataType 1. Fix AccDataType and CDataType 2. Remove indent 3. Align merge_transform for tutorial	2025-07-28 14:54:51 -04:00
mhYang	d6fd468603	Fix build error	2025-07-28 14:54:51 -04:00
root	b3986c32a6	[GEMM] disable/enable instruction scheduling	2025-07-28 14:54:51 -04:00
mhYang	42f2e21865	Fix missing message	2025-07-28 14:54:51 -04:00
mhYang	38ce4dd8c3	Fix xor transform dim.	2025-07-28 14:54:51 -04:00
Clement Lin	b03668fe8a	[GEMM] Add cache-aware WG schedule and adjust block tile 113 -> 121.7 TFLOPS	2025-07-28 14:54:51 -04:00
mhYang	39ca852330	Add LDS bank conflict solutions	2025-07-28 14:54:51 -04:00
bobofang	22147ace51	Fix add accuracy issue 2673 GB/s -> 3271 GB/s Perf: 0.0512898 ms, 3271.06 GB/s	2025-07-28 14:54:51 -04:00
root	d7d9fdaf1b	[GEMM] use mfma k8 warp gemm	2025-07-28 14:54:51 -04:00
root	1b8d7cd1b9	[GEMM] disable/enable prefetch	2025-07-28 14:54:50 -04:00
Clement Lin	6a2036015e	[CK TILE] Toy example - basic gemm	2025-07-28 14:54:50 -04:00
Clement Lin	077056b32d	Adjust block shape 2673 GB/s -> 3647 GB/s	2025-07-28 14:54:50 -04:00
Clement Lin	2ff691f3f2	Utilize vectorized memory access 1998.24 GB/s -> 2673 GB/s	2025-07-28 14:54:50 -04:00
Clement Lin	078b5c68a0	Adjust the size of thread block 1968.42 GB/s -> 1998.24 GB/s	2025-07-28 14:54:50 -04:00
Clement Lin	8d205a9298	[CK TILE] Toy example - basic add	2025-07-28 14:54:50 -04:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
rocking	b36e0b029f	[CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539) * uncomment all the headdim, use optdim to control * change default back to -1 * uncomment splitkv instance * Fix typo in receipt 4 for appendkv * support optdim for bwd, splitkv and appendkv * Fix 192 key error --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>	2025-07-28 17:16:32 +08:00
Max Podkorytov	821cd26c13	[CK-Tile] Merge transpose examples (#2450 ) * unify pipeline signature with existing example * iwyu * move stuff around in load-tile-transpose * cleanups in batched transpose pipeline * comments * use same inputs size * cleaner printf * print host args * use 64 block sides in the 37_transpose example * roll back grid dimension size adjustment for 37_transpose example * transpose grid for 37_transpose to unify with 35_batched_transpose * unify grid computation logic * make policy methods device only (since they are used only on device from the pipeline) * more host/device attribute cleanups * copy over problem * move over pipeline and policy * add switch to batched transpose api * make the lds problem more similar to original problem * factor out logic into traits * factor out conditional compilation into trait parameter * propagate pipeline to args * unhardcode pipeline dispatch parameter * refactor vector size * put warp tile out of dispatch * rename template parameter for trait * rewrite vector size in terms of problem * mark policy-internal struct variable as device * factor out input distribution and thread access pattern from policies * reword vector size * use datatype across batched transpose pipelines, problems and kernel * remove transpose traits from lds pipeline * add padding to the lds pipeline interface * add comment * remove ck_tile example #37 * update cmakelists * add test for new pipeline * update batched transpose test * roll back load_tile_transpose changes * remove comments * pack dispatch parameters into a config * padM can be enabled * adjust lds vector size to enable padding along N * update test * clean up logic * swap m/n input vector size * adjust perf test script * sweep over C/W in perf test * count both read and written bytes into bandwidth (x2 the number) * clang-format * widen size range for perf test * remove 64k x 64k case; it's too large for index * remove thread tile from dispatch * Solve merge conflict * 
fix compile * modify the transpose * solve the test error and clang format * Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463) * Add logging to IsSupported. * Less casting in AddClamp * Conv+bias+clamp instances & profiler BF16 * Fix 3D instances & run just 1x for verification. * :Run just once for verification conv fwd. * ckProfiler conv fwd clampwq * Remove exec bit & formatting * Add support for MultiD for grouped conv fwd v3. * Enable 2Lds. * clean * align instances * align instances * profiler fixes * Fixes * fix * fix --------- Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Fixing 0ms and inf GB/s issue in img2col (#2565) issue : ==== ``` sh $ bin/tile_example_img2col Perf: 0 ms, inf GB/s ``` solution : ====== Problem occured because config.time_kernel is false by default. if false, then no need to calculate perf, just print proper message `image_to_coloumn: pass, No Perf generated due to config.time_kernel=0` * merge with develop * solve clang format --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>	2025-07-26 21:51:54 -07:00
Bartłomiej Kocot	5741edf761	Fix clang format (#2567) * clean * clang format fix	2025-07-25 09:54:34 -07:00
rahjain-amd	78082855d8	Fixing 0ms and inf GB/s issue in img2col (#2565) issue : ==== ``` sh $ bin/tile_example_img2col Perf: 0 ms, inf GB/s ``` solution : ====== Problem occurred because config.time_kernel is false by default. if false, then no need to calculate perf, just print proper message `image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`	2025-07-25 21:15:50 +05:30
Enrico Degregori	b01a27ff22	Support b_scale: (#2350) - extend pipeline v1 and v3 - add instances - add tests - add example Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-07-24 18:49:58 -07:00
Mateusz Ozga	b507d889c1	[CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520) * Init commit new API * apply clang-format * PreShuffle preparing * Apply Preshuffle condition to universal_gemm * Fix: convert size_t to index_t * Review changes * Mode 100755 -> 100644 --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-07-24 20:39:56 +02:00
Yi DING	4338346b10	Use filename but not path to filter compilation (#2556)	2025-07-24 17:38:14 +08:00
Yashvardhan Agarwal	606b0cc947	[CK_TILE] Support for elementwise kernel (#2246 ) * Elementwise kernel implementation Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: yashagar <yashagar@amd.com> * Elementwise with generalized nDims * Adding the n-ary input tensor feature * Generalize dimensions on top of inputs * Add TFLOPS + remove std usage for tuples * 1D basecase optimization * Cleanup code + refactoring to a common interface * Generalize to unary and add an example * Cleanup, refactoring and commenting * Suggestions for LWPCK-3170: elementwise kernel improvements * Clang-format: remod.py * Replace InputTensorType with XDataType as the type of input_tensors * Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs * Move examples to folder 19_elementwise * Add missing copyright headers and fix some existing ones * Replace an assert with throw std::runtime_error in elementwise example * Avoid reading the output by using make_static_distributed_tensor for y_tile * Removed two unused includes * No need to move windows to the next block when each workgroup processes a single tile * Only copy input tensors to the device * Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example * Adding output strides to the kernel, transposition example and update the other examples * Changes made by remod.py * Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp * Reuse generic reference binary/unary operation in examples + refactoring the transpose reference * Fix comments in elementwise_example.cpp - Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses - ElementWiseTraits was renamed to ElementWiseShape - Adopt suggestions made 
by Copilot when prompted to check for factual or typographical errors * Simplify CMakeLists.txt and remove the unused variables this uncovers * Rename a file and fix some copyright statements * Changes made by script/clang-format-overwrite.sh * Add basic unit test for ElementWiseKernel * Remove left-over uninformative comment in apply unit test * Changes made by clang-format-overwrite.sh * fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp * Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array * Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally * Move examples to folder 20_elementwise * Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK * Fix CLang formating * Bump up the elementwise example folder number * Elementwise: add padding + minor cleanup * Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view * Add isSupportedArg to Elementwise kernel + addapt example and unit tests * Fix clang-format on the unit test file --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-07-24 11:21:45 +02:00
jakpiase	6681593864	[CK_TILE] Grouped Convolution Backward Weight Kernel (#2357) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * resolved conflicts * rebase fixes * fixes * tmp * add working support for splitk * minor fix * fixes * fixes * minor fix * small fix * Split K and preprocessing fixes --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-07-24 10:41:35 +02:00
Illia Silin	1b6f024836	refactor fmha_bwd.py (#2546)	2025-07-23 09:09:56 -07:00

667 Commits