composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 17:26:00 +00:00

Author	SHA1	Message	Date
aska-0096	1c98007901	clang format	2025-08-12 01:53:31 +00:00
aska-0096	f43e903b1d	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-12 01:52:52 +00:00
Aviral Goel	a7badc6ec5	feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582 ) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-08-11 10:54:37 -07:00
Illia Silin	6bfef63414	enable aiter test_mha in daily CI (#2659 )	2025-08-11 09:50:33 -07:00
aska-0096	3868ddd708	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-11 15:59:40 +00:00
aska-0096	498d234ab8	change the warp setting for hdim32 fmha fwd	2025-08-11 15:37:37 +00:00
aska-0096	b86f7786e2	tempsave, update the blocksync functions	2025-08-11 14:21:09 +00:00
Yashvardhan Agarwal	191c62967b	Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 ) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format	2025-08-11 15:01:33 +02:00
geozhai	1e1ee758fa	update CK build instruction step 4 (#2563 ) Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-11 00:26:13 -04:00
aska-0096	7b8052d7ca	fix bug in pki4	2025-08-10 06:00:51 +00:00
aska-0096	76cbbb84a2	fix bugs in gemm	2025-08-09 03:25:12 +00:00
aska-0096	8c101ccb88	fix bug on non-gfx950	2025-08-08 18:35:53 +00:00
aska-0096	efb8549279	fix bug	2025-08-08 17:53:19 +00:00
Illia Silin	8613aa1e40	remove ck_tile transpose and gemm stages from CI (#2646 )	2025-08-08 10:48:44 -07:00
Illia Silin	7ac850ac72	Add daily AITER tests on gfx942. (#2639 ) * add option to select aiter branch, add tests on gfx942	2025-08-08 09:30:46 -07:00
aska-0096	729e8785fb	fix bugs	2025-08-08 15:42:15 +00:00
aska-0096	250dc13c75	fix clangformat with 18.1.3	2025-08-08 09:31:01 +00:00
aska-0096	106edeecd9	remove non-necessary change	2025-08-08 09:07:40 +00:00
aska-0096	78edd7303b	bug fix, clang format;	2025-08-08 09:04:02 +00:00
aska-0096	3b9fb6af38	Remove unnecessary changes	2025-08-08 08:08:03 +00:00
aska-0096	6bb57c2c57	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-08 07:50:12 +00:00
aska-0096	1ecee378d5	remove unnecessary files; rename some files	2025-08-08 06:19:31 +00:00
aska-0096	b4640a9de6	merge fa_decode pipeline into fmha_fwd api	2025-08-08 05:46:18 +00:00
Max Podkorytov	ab26026835	[CK-tile] add more tests for batched transpose testing the rectangular block tile sizes (#2634 ) * add failing tests * swap out and reference * add constraint assert to transpose input distribution * test both pipelines with rectangular block tile * print mismatched indices * add a smaller failing test for old pipeline * print grid and block * fill output before operating on it * swap m/n tile sizes and make one test pass * add device syncs * add one more flipped test case * flip block tile at host arg init * fix tiles for lds pipeline * clang-format * rename tests * roll back error check * remove device syncs * reduce large test case's size	2025-08-07 16:51:53 -07:00
Sami Remes	3c9400471d	[CK_TILE] Enable persistent kernel and tail handler in tile_engine (#2300 ) * Enable persistent kernel in tile_engine and use tail handler * Fix formatting * Add persistent to default_config.json * Remove extra newlines and add persistent also to user config * Reduce instances from default_config.json * add persistent to benchmark.json and custom_ci_config.json * changed the config file to have few instances --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-07 16:03:49 -07:00
Gino Lu	5d6d236b25	Add e8m0 scaled convert into CK_TILE (#2617 ) * first commit * remove redundent code * modify according to comments. * fix type_convert error with scaled_type_convert	2025-08-07 21:37:28 +08:00
Yi DING	b0a97498b0	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad	2025-08-07 21:24:43 +08:00
Sami Remes	ffdee5e774	[CK_TILE] Enable printing more structures in CK-Tile (#2443 ) * Add more printing to core cktile * Revert other changes in static encoding pattern * Refactor to using a free print() function * Remove loops and print just the containers * Print tuple with better formatting, fix sequence compilation * Add some tests for print utility * Add print utility header * Print for static_encoding_pattern * add buffer_view printing * Align vector_traits * Fix formatting * Lower-case enum strings Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Remove empty comment lines * Fix test with lower-case too * Reduce repeated code in print tests, move helper function closer to type definition, test X&Y * Add test_print_common.hpp * add print.hpp in core.hpp --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-08-07 15:45:27 +03:00
Enrico Degregori	21e9983913	Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 )" (#2637 ) This reverts commit `2203b0ddfe`. Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-07 12:30:08 +02:00
Bartłomiej Kocot	54c7e08a2f	Fix clang format after conv changes (#2636 )	2025-08-07 10:00:09 +02:00
Bartłomiej Kocot	5328b232b2	Grouped Convolution Forward Infer Bias Bnorm Activ (#2621 ) * Grouped Convolution Forward Infer Bias Bnorm Activ * 3d	2025-08-07 08:36:47 +02:00
Max Podkorytov	1824d65758	modernize scripts for running cmake and clang-format (#2503 ) Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-06 10:15:44 -07:00
Yashvardhan Agarwal	4750b293fe	General 2D Reduction Kernel (#2535 ) * General 2D Reduction Kernel * Move the reduction kernel from the example * Split the code and add the necessary policy, problem, shape files as per ck_tile convention * Add/modify the headers * Modified the example to work with the 'new' kernel * Added tests for the kernel * N-D refernce reduce * Added support for N-D input with transform to 2D * Added padding to support various input sized tensors * Bug fix in the thread buffer constructor * Some comments to explain the reduce2d block kernel * comments resolution * clang-format * comments resolution * clang-format * clang-format * comments resolution * clang-format	2025-08-06 15:36:59 +02:00
Adam Osewski	2622ff06cb	Remove unused lds direct load instruction. (#2573 ) This functionality is replaced by amd_async_buffer_load Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-06 15:16:12 +02:00
Yi DING	15e8b6ccf7	[CK_TILE] Fix FMHA qr_async causing errors in FA (#2627 )	2025-08-06 20:04:23 +08:00
Thomas Ning	07469142cb	delete all slp compilation flag in CK Tile (#2625 )	2025-08-06 00:34:39 -07:00
aska-0096	fe63a646a4	add __restrict__ to tr load	2025-08-06 05:58:43 +00:00
Illia Silin	833ae1d051	Revert "Reduce build time tile engine (#2579 )" (#2623 ) This reverts commit `e5b79b26fa`. (tag: rocm-7.1.0)	2025-08-05 09:27:55 -07:00
Enrico Degregori	2203b0ddfe	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 ) * Add padding 1x1Stride1Pad0 conv specialization * Add gridwise checks for conv cshufflev3 * Merge padding with previous transforms * Apply transform changes for padding to default specialization as well --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-05 15:23:19 +02:00
aska-0096	414cad667b	Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug	2025-08-05 07:23:51 +00:00
Thomas Ning	cbfecf8d7a	Persistent grouped gemm CompV4 Enablement & Polish (#2605 ) * enable the persistent kernel for CompV4 * polish the example and clang format * fix the non-persistent kernel error --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-04 23:43:01 -07:00
Max Podkorytov	2a78da4708	fix build for test_ck_tile_fp8 on rhel8 (#2615 )	2025-08-04 17:43:15 -07:00
Illia Silin	fb96b49666	fix test_mx_mfma errors (#2614 )	2025-08-04 11:43:47 -07:00
rahjain-amd	59245df46d	Fix Debug Build for ckProfiler (#2609 ) Problem ======= relocation R_X86_64_32 out of range: 5405348154 is not in [0, 4294967295] Solution ======== The problem was caused due the limitation comes from the 32 bit offsets used in original DWARF standard. We have the option to switch to 64bit offset for your libs which free us from 4G size boundary. add -gdwarf64 and -Og to avoid this limit.	2025-08-04 11:28:09 -07:00
Jinchao Xu	15eb493152	Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611 ) Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances by splitting debug information into separate .dwo files. Add explicit cast to avoid signed/unsigned comparison warning.	2025-08-04 11:26:08 -07:00
Bartłomiej Kocot	8655ba989c	Mark non-grouped convolutions instances as deprecated (#2595 ) * Mark non-grouped convolutions instances as deprecated * Update CHANGELOG.md Co-authored-by: John Afaganis <john.afaganis@amd.com> * Update library/src/tensor_operation_instance/gpu/conv1d_bwd_data/device_conv1d_bwd_data_xdl_nwc_kxc_nwk_bf16_instance.cpp Co-authored-by: John Afaganis <john.afaganis@amd.com> --------- Co-authored-by: John Afaganis <john.afaganis@amd.com>	2025-08-04 16:49:55 +02:00
aska-0096	0d12fc944f	Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA	2025-08-04 10:27:42 +00:00
aska-0096	4f31847de1	add vmcnt guard before load ktile	2025-08-04 10:02:17 +00:00
aska-0096	746f4ccb99	Load Q through lds, implement xor;	2025-08-04 06:49:01 +00:00
Max Podkorytov	0d9439760f	remove std::format (#2604 )	2025-08-01 19:22:07 -07:00

2240 Commits