composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-18 20:09:25 +00:00

Author	SHA1	Message	Date
Aviral Goel	a535de0f75	chore(copyright): update copyright header for example directory (#3273 ) * chore(copyright): update copyright header for codegen directory * chore(copyright): update copyright header for example directory [ROCm/composable_kernel commit: `d85f065b15`]	2025-11-24 18:02:41 -08:00
Johannes Graner	580a54b400	Update pre-commit to fixed versions, run remod for ck_tile (#2895 ) * Fix ruff linter errors * Fix remod dos2unix command * Clang format * Ignore utility in remod * Run remod * Specify clang-format version in pre-commit * Specify ruff version * Include PoolKernelArgs in reference_pool * Add calculate_total_elements to reference batched contraction * Fix calculate_total_elements declaration * Refactor remod pre-commit hook * Fix Aquant tests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `d40b50b9d5`]	2025-10-16 15:29:17 -07:00
ClementLinCF	7907a466de	[CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540 ) * [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm * Update rmsnorm host reference * Update tree reduction of rmsnorm for reference host * Fix cross warp for m > 1 cases * Add RMSNorm model selectable option for host reference * Fix save_unquant cases * Update reference rmsnorm forward function to use enum for model sensitivity * Update reference rmsnorm calculation for model sensitivity * Fix m warp for layernorm * Adjust parameter of reference for twoPass * Fix clang format * Run clang-format-overwrite.sh to fix formating issue * fix clang format --------- Co-authored-by: MHYang <mengyang@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `e1b0bdfbfa`]	2025-10-13 11:52:37 -07:00
linqunAMD	01b8e6a80f	[CK_TILE] Refine Generic2dBlockShape to fix ck_tile example 2,10,11,14 on rdna3 and 4 (#2795 ) BlockWarps, WarpTile in Generic2dBlockShape are wave size dependent, it causes mangled name mismatch between host and device side. Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape [ROCm/composable_kernel commit: `c254f3d7b4`]	2025-09-10 08:29:20 +08:00
linqunAMD	807f7510b5	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `9fcc1ee9fd`]	2025-08-18 10:08:31 -07:00
Po Yen Chen	4456552543	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors [ROCm/composable_kernel commit: `7d669440a6`]	2025-06-23 12:29:15 +08:00
Satyanvesh Dittakavi	a4517b0a9d	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
ruanjm	fcbf9630fe	[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861 ) * 50ms -> 28ms * Fix bug in non fuse_add_store cases * Fine tuned setting for 2 pass pipeline * adjust workload * remove unnecessary change * add layernorm * Adding output quant and unquant results at the same time. * fix test * fix format * tune for cases 128x640 and 128x1024 * bug ifx [ROCm/composable_kernel commit: `d49abdaa87`]	2025-03-25 20:09:45 +08:00
ruanjm	dce7207ece	Implement fp8 quant for layernorm and rmsnorm (#1814 ) [ROCm/composable_kernel commit: `64d5c4d6cb`]	2025-01-24 16:40:43 +08:00
ruanjm	9f9eddd0cf	[CK_TILE] Add Various Fusion Functions to RMSNorm (#1802 ) * Add shortcut to RMSNorm * Modify test for adding shortcut for RMSNorm * Add fused parameter into tests * 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp * 1. Supports various stride and percisions. * Add support of Epilogue * Add fuse and epilogue support to rmsnorm ref * Modify rmsnorm example * Refactor tests/examples * Bug fix for newly added tests/examples * Bug fix for new tests 2 * Modify smoke test scripts remove dbg code * Supports non-smooth dyanmic quant * Update Rmsnorm2dFwd::GetName() * rename xscale and prec_sx to smoothscale and prec_sm Bug fix after rename Remove files * change example_rmsnorm2d_fwd.cpp * update performance calculator * Fix issue in two-pass when fuse add is enabled * Remove comment of beta --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `04dd314883`]	2025-01-15 10:23:48 +08:00
AMD-dteng	12103d0f17	enable bias feature that add bias before adding residual (for rtpllm project) (#1741 ) * 1. enable bias feature that add bias before adding residual; 2. change block size from 128->64 when m<64 in fp16 * delete comment * 1.remove fmha change 2.change buffer name from bias to xbias * Now bias can be used independently from fadd * change kbias to kxbias --------- Co-authored-by: feli <felix.li@amd.com> [ROCm/composable_kernel commit: `d5c8a334ca`]	2025-01-08 17:51:06 +08:00
feli	5ce28a1d13	Ck tile/layernorm: implement naive reduce, opt performance (#1784 ) * add no welford * enable output raw * raw of int8 * fix build * fix smoke test err * [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate * [cktile]layernorm, fix err commit files and remove uselss * fix quant 8192 err & change norm_reduce class and file name --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: carlushuang <carlus.huang@amd.com> [ROCm/composable_kernel commit: `4bc610416a`]	2025-01-03 14:28:59 +08:00
dummycoderfe	561c221342	[Ck tile] layernorm2d fwd optimize (#1637 ) * optimze small N case using vec io and using rcp div * [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass * [Ck_tile] fix blockSize compute in Generic2dBlockShape * [Ck_tile]fix kfastfdiv template style * [Ck_tile] layernorm, fix stype in review --------- Co-authored-by: dummycoderfe <noplydummmycoder@163.com> [ROCm/composable_kernel commit: `686a58a912`]	2024-11-08 12:28:23 +08:00
Juan Manuel Martinez Caamaño	6e74da9b87	[generate.py] Override blob list if it already exists (#1635 ) Before, generate.py appended the list at the end of the output file. When running the cmake configuration steps multiple times on the examples, the blob list (such as fwd_blob_list.txt) would grow at every configuration. `library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around this issue by removing the output file if it exists. Now, generate.py overrides the content of the output file. There is no need for the workaround in the CMakeLists.txt; and the issue is solved for the example projects too. [ROCm/composable_kernel commit: `464abd235e`]	2024-11-05 10:09:52 +01:00
carlushuang	537ff25c21	[CK_TILE] layernorm have more accurate residual (#1623 ) * more accurate residual * modify comment * Fix literal case in README.md --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `cb6c5d39dc`]	2024-11-02 13:30:16 +08:00
carlushuang	776c87ea7e	[CK_TILE] layernorm support fused-quant/fused-add (#1604 ) * add prenorm/postnorm support, refactor using generate.py * update README * update README * fix format * update some description and fix format * update format * format * use non-raw for loading * format and update n4096 * dynamic-quant ready * update readme * support fused dynamic-quant * update fused-quant, with smooth * update README * update args * update some based on comment [ROCm/composable_kernel commit: `c3a4800c5f`]	2024-10-31 14:54:53 +08:00

16 Commits