composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-15 10:37:44 +00:00

Author	SHA1	Message	Date
Max Podkorytov	de41af5d2e	Add a new gemm pipeline based on ComputeV4 which utilizes async copy API (#2949 ) * check in pipeline and policy for async load in mi350, need to make sure TileAccessPattern is warp_raked or block_raked solve merge conflicts * fix cmakelists * make it build * fix? buffer async fence * relax fences; it appears it only is needed between pairs of ping-pongs * remove fences * remove fences * cleanup and reformat * add steps annotations * comment all pipeline steps / remove unexplainable syncs * clang-format * add comment * cleanup kernel types for test * fix comment * fix hardcoded warp size * faithfully copy block gemm from compute v4 policy to async policy * make async test gfx950 only * fix cmake logic * set separate compile options for async * refine comment in policy * try update hotloop scheduler * cleanup comments * test more K block sizes * unhardcode Ks, sort of * add large odd test case * fix build for quant * add comment to hot loop scheduler and rename enum * reformat * reword the pipeline description * reformat * address review / add static asserts / typo fix * update changelog [ROCm/composable_kernel commit: `a7da3c68b9`]	2025-10-01 15:38:07 -07:00
Sami Remes	93ba707be4	Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe (#2893 ) * Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe * also do the same for amd_buffer_addressing_builtins.hpp * merge with develop * fix clang format --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `ef43078788`]	2025-09-30 15:12:30 -07:00
Anton Gorenko	8118d84f77	[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836 ) * Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout Add comments with dropout implementation details Fix performance regression of fwd+dropout * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox; * "scalarize" seed and offset, they may come either from kernel args or from device memory (presumably loaded with vector loads). These changes help the compiler to procude more optimal code and reduce register spilling. Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding Use code based on BlockDropout in BlockDropoutBwd Refactor BlockDropout (fwd) Implement BlockDropout (fwd) for WMMA Originally BlockDropout only supported 32x32 tiles (IsWG32 = true), this version supports 16x16 tiles. If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly to BlockDropoutBwd. Implement BlockDropoutBwd for WMMA Remove MakeRandValLds* functions unused in BlockDropoutBwd Remove unused Run overload from BlockDropoutBwd * Fix regression with philox seed and offset when they exceed 32-bit int __builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset are 64-bit so they get truncated. * Add F32 MFMA warp gemms * Support f32 in fwd FMHA * Implement transpose_vectors for 4-byte types (float) * Fix unexpected implicit f32->uint32 cast in buffer_store<4> __builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint). mbuf_t types in other buffer_store<> are changed for consistency. * Support F32 in bwd FMHA hdim = 256 is disabled for now because it uses too much memory on gfx90a * Support Headdim = 48 (divisible by 16) in fwd * Add fp32-specific receipts (800 and 801) * Tune fwd tiles * Tune bwd tiles * Use small tiles only for small seqlen_q * Fix after rebasing * Fix selection of a fallback tile based on bm0 The assumption that the largest bm0 == 128 is not always true for current fp32 tiles. * Remove constraints and adjust filtering for fp32 Custom constraints are no longer needed because now the smallest tile is selected automtically based on seqlen_q. Filters related to qr_async_trload disabled valid fp32 tiles. * Add fp32 tests * Make splitkv and appendkv compile for fp32 only There are no instances yet, but API still must compile when only fp32 is requested. * Remove unimportant f32 instances * Add test_ck_tile_fmha__fp32 to REGRESSION_TESTS Replace magic numbers with a constant, improve comments for dropout * Update changelog * Fix condition that dq_acc must be set to zero when mask is used The change was introduced in #2799 * Replace warp_uniform with recently added amd_wave_read_first_lane * Add hdim = 96 and 192 to fwd [ROCm/composable_kernel commit: `1edd250115`]	2025-09-27 18:03:48 +05:00
Khushbu Agarwal	bb5eeef2af	Fix for Add the API to load SGPR (#2913 ) * Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)" This reverts commit `5cc40c160f`. * Fix: sgpr minor issue * cyclic dependency resolved * clang formatted * removing unused variable * clang formatted --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b56e5d1d79`]	2025-09-25 10:32:42 -07:00
Sami Remes	aac547782b	[CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903 ) * Fix issue with constexpr checks in scaling/cshuffle * Remove IsLoadableTile * Move amd_wave_read_first_lane before first usage [ROCm/composable_kernel commit: `dcd33a6ecc`]	2025-09-23 23:08:18 -07:00
asleepzzz	5cc40c160f	Revert "[CK-Tile] Add the API to load SGPR (#2878 )" (#2904 ) This reverts commit `fb5e953a05`. [ROCm/composable_kernel commit: `f161b5b738`]	2025-09-23 14:33:51 -07:00
Thomas Ning	fb5e953a05	[CK-Tile] Add the API to load SGPR (#2878 ) * Have a workable version for SGPR * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * substitute with the new sgpr read api * update the CHANGELOG * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * change to static for logic * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. [ROCm/composable_kernel commit: `2cbbf5dcb3`]	2025-09-23 01:23:56 -07:00
linqunAMD	71dc8a9d4d	Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725 ) * Enable xdl in gfx11 & gfx12 * update cmake file * fix all instance build (cmake) * fix batched_gemm_gemm(cmake) * rebase cmake files * fix cmake build error * remve CK_ENABLE_DYNAMIC_WARP_SIZE * update cmake build error2 * fix gfx11 build CK_USE_XDL is enabled on gfx11 and gfx12 * fix gfx10 build * fix gfx11 error --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `f22740df82`]	2025-09-15 10:59:25 -07:00
Thomas Ning	cb3bbd3881	Fix the vector load & fix the gfx950 compv4 error (#2831 ) [ROCm/composable_kernel commit: `1894a0dbc3`]	2025-09-12 11:48:45 -07:00
Illia Silin	7382914d59	Revert "add vector load 16/32 for bf16/fp16 (#2779 )" (#2818 ) This reverts commit `985d1319a1`. [ROCm/composable_kernel commit: `b4207c01c7`]	2025-09-10 13:35:15 -07:00
zjing14	985d1319a1	add vector load 16/32 for bf16/fp16 (#2779 ) * add vector load 16/32 for bf16/fp16 * comment addressed * clang format --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `7ecdba878f`]	2025-09-09 23:15:19 -07:00
Yi DING	1f35e4c5b7	[CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715 ) * FMHA avoid unnecessary vmcnt0 Squashed commit of the following: commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:51 2025 +0000 merge develop and solve conflicts commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c Merge: 2dac61a4f `5d56dde0e` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:21 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140 Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 02:17:43 2025 +0000 update bwd commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682 Author: Kevin Choi <kevin.choi@amd.com> Date: Mon Aug 18 19:36:38 2025 +0000 add restrict to applicable functions commit 45534dee5bcbe532da46fc5cd6601cde10d84387 Author: Ding, Yi <yi.ding@amd.com> Date: Mon Aug 18 02:07:03 2025 +0000 bwd filter commit 7abd7b372b82cba94a457238b6b4a81d093e7280 Author: Kevin Choi <kevin.choi@amd.com> Date: Sat Aug 16 08:15:23 2025 +0000 remove noinline attr as it causes a lot more s_waitcnt's commit 89c29746a09255c1d26038171157e91d1b68d14a Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 12:11:17 2025 +0000 remove innerloop, move restrict parameters to mainloop and add noinline attribute. commit 6f61b3a5c80011411aa3aebf7983602f7c117566 Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 07:06:51 2025 +0000 Create inner lambda with restrict parameters, add restrict to some parameters commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a Author: aska-0096 <haocwang@amd.com> Date: Thu Aug 14 03:43:54 2025 +0000 save for debug commit 5f2c3cfa86c6951208a1cc227fa556704a885a88 Merge: 25f067b4f 165a2723c Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:15:22 2025 +0000 Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 25f067b4f09d6909a05e252c7621124046dfda57 Merge: 447c1c5d6 `bd3b4afb9` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:14:26 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 165a2723c557420b48891cc1ce3434e3675aef5d Merge: 447c1c5d6 `be39c00a8` Author: asleepzzz <hanwen.chang@amd.com> Date: Wed Aug 13 00:34:11 2025 +0800 Merge branch 'develop' into wip-async-tr-fa commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 14:25:50 2025 +0000 refactor blockgemm change, isolate to v2; commit 8f67083511ff77d31c880f4427d3bdf53a179568 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:26:13 2025 +0000 clang format commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965 Merge: c84662c34 `19ef22e56` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:04:41 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit c84662c345755ec5f3d524fdde4aa951c8f86298 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 08:46:06 2025 +0000 Fix the bug commit e0647ffa5646f8132529b152af02750c4010013d Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 04:02:41 2025 +0000 fix conflict. disable all v-col instance for fmha fwd commit 781f98236c376f57591a6d481cc2ee04b36a148b Merge: 241f3d7dc `8cb8da53c` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 03:52:34 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:53:31 2025 +0000 clang format commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f Merge: 1a629c098 `eda4a5e80` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:52:52 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 1a629c09876cc05f0750db7eade1d527dc32a1d3 Merge: f65874e5b `92c0435e2` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:59:40 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit f65874e5b07579d5b734b4c68877679a3ee04dac Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:37:37 2025 +0000 change the warp setting for hdim32 fmha fwd commit 7c5f5e65e97486c074ef9a138900ed9aafea547e Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 14:21:09 2025 +0000 tempsave, update the blocksync functions commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915 Author: aska-0096 <haocwang@amd.com> Date: Sun Aug 10 06:00:51 2025 +0000 fix bug in pki4 commit 073db2e18af21f1ed1fb3d1f1c15830838df986f Author: aska-0096 <haocwang@amd.com> Date: Sat Aug 9 03:25:12 2025 +0000 fix bugs in gemm commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 18:35:53 2025 +0000 fix bug on non-gfx950 commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 17:53:19 2025 +0000 fix bug commit 30de97f473685e0bd5b82f15eee2493d9a05cffd Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 15:42:15 2025 +0000 fix bugs commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:31:01 2025 +0000 fix clangformat with 18.1.3 commit e4cb185c41586d018771a5413efd909d8d53a8c5 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:07:40 2025 +0000 remove non-necessary change commit 498f0d44cfba17287cce8d10855cce5c5de263db Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:04:02 2025 +0000 bug fix, clang format; commit 3cb648cbc4883e6889340d85f48d803a21b9c805 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 08:08:03 2025 +0000 Remove unnecessary changes commit 9e7ff3b611b7933b65973907a0cae312a15d31c6 Merge: a3c1bfe6d `f4247d199` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 07:50:12 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 06:19:31 2025 +0000 remove unnecessary files; rename some files commit 6c257fa27729c005d539b5b71deeba3703031089 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 05:46:18 2025 +0000 merge fa_decode pipeline into fmha_fwd api commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449 Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 6 05:58:43 2025 +0000 add __restrict__ to tr load commit bbad2b979b701533b74f43452ffe0f775e019139 Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 5 07:23:51 2025 +0000 Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug commit d7fabd5f765e2a573ddbaf0857ce6f691407e562 Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:27:42 2025 +0000 Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9 Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:02:17 2025 +0000 add vmcnt guard before load ktile commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 06:49:01 2025 +0000 Load Q through lds, implement xor; commit 62bb9f05177dfb8280d6c2be67a88492d6be4838 Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 1 10:44:54 2025 +0000 small refactor commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 10:25:37 2025 +0000 upgrade prefill pipeline; simple iglp; consistent data produce and consume order commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 05:13:27 2025 +0000 enable larger tile size; upgrade xor pattern commit a468e59a01d6dd85c105ca30ac249491256c5915 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 12:25:33 2025 +0000 remove all lds bankconflict with xor layouts commit 39ff55cdc377311112100fb24bc013adfd8960c0 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 03:51:06 2025 +0000 enable prefill overload operator(). commit a7b152a788e8035c93f8e4cbf317863182665d8f Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 25 07:10:01 2025 +0000 fix the lds alignment caused performance regression commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 23 09:05:57 2025 +0000 remove unnecessary features commit 9758750801c7fd5a80f654eb982f43b87d674fa3 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 22 08:04:05 2025 +0000 tempsave. asynccopy+trload sanity checked commit 1c4c04d725047357224ebf8a2b94d9010a5651a6 Author: aska-0096 <haocwang@amd.com> Date: Mon Jul 21 05:55:55 2025 +0000 tempsave, trload+asyncload done commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 10:04:34 2025 +0000 compile pass commit d41b5eace939909084d32281710fb81142ad5fec Merge: 3f86a81ee `8c3766f0d` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:17:27 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit 3f86a81eee75256a78df02032d50814aaa42b038 Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:16:39 2025 +0000 tempsave commit 7d43f7446a9a20773f70e08462393f6c9afb7280 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 10:06:09 2025 +0000 temp save commit 727629cd9115f1be9c1800bb65a8ea84ff06c250 Merge: aa5da19c9 94bceebc9 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 07:24:32 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline commit 94bceebc96ef4885e0ac861b7793e2e2897481bd Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 03:10:46 2025 +0000 move test_copy into test commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:41:31 2025 +0000 remove unnecessary output commit b1dbcacb1832560c6cc967a079dffce558228f0b Merge: 5b0d311e6 0eaf3325a Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:13 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix commit 5b0d311e649257557a7014c28fcfac0c327b77b5 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:10 2025 +0000 add input validation and bug fix commit 0eaf3325a8e019402ff12a2402f446f8471f584f Merge: a66e1d29a `f77d70498` Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Date: Wed Jul 16 11:23:57 2025 -0700 Merge branch 'develop' into test_copy_fix commit a66e1d29a8cccc17cc8958d970ec7b1281ec8291 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:55:50 2025 +0000 fix vmcnt shift commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:37:07 2025 +0000 Improve s_waitcnt_imm calculation commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 05:39:50 2025 +0000 fix the s_waitcnt_imm calculation commit 1c0870089a0e7c78ed71a278bf52d98fc780e482 Merge: d6ee05e36 `92ada43ba` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:57:57 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32 Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:54:33 2025 +0000 Add block_sync_lds_direct_load utility commit c037a72040217471f52ee76bed9c07bf5b22aef4 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 15 09:39:03 2025 +0000 fix async copytest bug commit aa5da19c94022449b027e7a57668f2e219f0f171 Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 10 04:29:33 2025 +0000 temp save, change all instance to 1wave commit ddd172feb9eb2cb783420a8db6f44d51b350c370 Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 8 08:37:20 2025 +0000 tempsave, fmha_decode commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33 Author: aska-0096 <haocwang@amd.com> Date: Sat Jun 21 15:02:57 2025 +0000 temp save, waiting for debug commit 71dd31f15bca01995c8cb0be9e903103f4657181 Author: aska-0096 <haocwang@amd.com> Date: Thu Jun 19 05:11:52 2025 +0000 save an example for __bf16 type commit cdf33e079fa7d7d5b03b06550df2356b02041d7b Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 07:27:24 2025 +0000 fix bwd code commit d630998dc6751f44097b1e9a239bb5063a793736 Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 06:37:16 2025 +0000 Fix for fwd/bwd kernel build filter commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 08:13:18 2025 +0000 Add restrict to avoid unnecessary vmcnt --------- Co-authored-by: aska-0096 <haocwang@amd.com> * Add comments for c-stype cast * Better comments --------- Co-authored-by: aska-0096 <haocwang@amd.com> [ROCm/composable_kernel commit: `de61e55493`]	2025-08-25 20:55:12 +08:00
Po Yen Chen	5d56dde0e0	[CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values (#2669 ) * Allow return VGPR get_warp_id() value * Avoid using SALU in async_load_raw() [ROCm/composable_kernel commit: `0db21053e6`]	2025-08-22 10:17:05 +08:00
Po Yen Chen	b87e256c42	[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549 ) * Support async load dwordx4 * Enlarge load size on gfx950 [ROCm/composable_kernel commit: `4a7ecce096`]	2025-08-22 10:13:47 +08:00
linqunAMD	615ca9842d	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `9fcc1ee9fd`]	2025-08-18 10:08:31 -07:00
Tianyuan Wu	abb90422b4	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5. * Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `68134b60e4`]	2025-08-15 16:22:27 -07:00
Haocong WANG	e3e8de3477	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `05a6e92705`]	2025-08-13 14:57:43 +08:00
asleepzzz	4730050182	Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 ) This reverts commit `747d127983`. [ROCm/composable_kernel commit: `5b39de4bb6`]	2025-08-12 20:27:10 +08:00
Haocong WANG	747d127983	Optimize fmha fwd decode & prefill for gfx950 (#2641 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `b7322a521a`]	2025-08-12 19:43:14 +08:00
Sami Remes	1777ce3229	[CK_TILE] Enable printing more structures in CK-Tile (#2443 ) * Add more printing to core cktile * Revert other changes in static encoding pattern * Refactor to using a free print() function * Remove loops and print just the containers * Print tuple with better formatting, fix sequence compilation * Add some tests for print utility * Add print utility header * Print for static_encoding_pattern * add buffer_view printing * Align vector_traits * Fix formatting * Lower-case enum strings Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Remove empty comment lines * Fix test with lower-case too * Reduce repeated code in print tests, move helper function closer to type definition, test X&Y * Add test_print_common.hpp * add print.hpp in core.hpp --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `ffdee5e774`]	2025-08-07 15:45:27 +03:00
Adam Osewski	4816a695a4	Remove unused lds direct load instruction. (#2573 ) This functionality is replaced by amd_async_buffer_load Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> [ROCm/composable_kernel commit: `2622ff06cb`]	2025-08-06 15:16:12 +02:00
Thomas Ning	35c62cb1bb	Expand the bandwidth of direct_global_to_lds for gfx950 (#2576 ) * Expand the bandwidth of direct_global_to_lds for gfx950 * clang-format * fix the remod.py and script for clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `9d4b494f07`]	2025-07-28 23:56:53 -07:00
Illia Silin	24f228df3b	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config [ROCm/composable_kernel commit: `504b101da3`]	2025-07-28 11:34:07 -07:00
Gheorghe-Teodor Bercea	a8b68103e0	Refactor async loads to work on all GPUs (#2545 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cbfa62e4b6`]	2025-07-26 22:04:59 -07:00
Haocong WANG	60ff19fb4e	fix async copytest bug (#2509 ) * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * change bit width check * refactor macros into constexpr functions which still get inlined * wrap s_waitcnt api * parameterize test * cleanup * cleanup fp8 stub * add fp8 test cases; todo which input parameters are valid? * replace n for fp8 in test cases * add large shapes; fp8 fails again * change input init * test sync/async * time the test * clang-format test * use float instead of bfloat to cover a 4-byte type * fix logic - arg sections should be 'or'd * make block_sync_lds_direct_load interface similar to old ck * fix a few comment typos * name common shapes * revert the example to original logic of not waiting lds * clang-format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `a5fdc663c8`]	2025-07-23 00:14:02 -07:00
Yi DING	3bb687d502	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> [ROCm/composable_kernel commit: `f1d8ad2818`]	2025-07-16 23:56:22 -07:00
Illia Silin	305be94f42	fix compilation errors with clang20 (#2464 ) [ROCm/composable_kernel commit: `e033a1b4bf`]	2025-07-07 19:40:30 -07:00
Thomas Ning	a01042c3cf	Enable Async Copy for MI355 (#2425 ) * add for async load builtin * add async load api * fix some compiling errors * fix a compiling error * fix some compiling errors * add a pipeline which copies from v4 * add a new pipeline for async load * fix some compiling errors * add async load tests * fix some issues in async load * fix * fix async inline assembly * fix async inline assembly * add ignore header file * comment some not gfx950 codes * comment some not gfx950 codes * fix a error * update async load apis * fix lds descriptor * fix a compiling error * fix some compiling errors * fix a descriptor issue * update lds descriptor * change async pipeline's tile distribution pattern from thread to warp * fix clang format * update async policy * fix a CRTP issue * fix a typo error * change lds layout * fix some sync issues * improve codes * delete the async test * fix a commented format issue * avoid compiling device functions when compile host * make gemm run * add the copy kernel support * finish the feature * Address comment * add the support for buffer_builtin * solved the merging problem * Comment Addressed --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: joyeamd <John.Ye@amd.com> [ROCm/composable_kernel commit: `f240ae3248`]	2025-07-07 10:08:49 -07:00
Thomas Ning	3414888f92	Transpose builtin macro defense (#2374 ) * add the macro defense * add the static assert check [ROCm/composable_kernel commit: `107e3623c7`]	2025-06-20 11:24:54 -07:00
joyeamd	3cb0dd8506	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `a2f01141aa`]	2025-06-18 01:28:34 -07:00
Satyanvesh Dittakavi	bde406245a	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
John Afaganis	2fdbade459	Remove usage of 'warpSize' variable as it has been deprecated (#2295 ) * SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround * SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo --------- Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com> [ROCm/composable_kernel commit: `6635d1bb88`]	2025-06-10 07:34:54 -07:00
valarLip	06f87e3fb9	extend buffer load to support load 32 bf16/fp16 at same time (#2291 ) [ROCm/composable_kernel commit: `8482977a37`]	2025-06-06 17:21:19 +08:00
valarLip	671f579903	extend buffer load for fp16/bf16x16 (#2270 ) * extend buffer load for fp16/bf16x16 * format [ROCm/composable_kernel commit: `0fdbf6bcd1`]	2025-06-02 10:29:54 +08:00
Illia Silin	24057b3662	fix the buffer intrinsic names for clang >=20 (#2228 ) [ROCm/composable_kernel commit: `8146e471f1`]	2025-05-23 14:58:25 -07:00
Illia Silin	9c7b0a65f9	Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192 )" (#2227 ) This reverts commit `7d92e48278`. [ROCm/composable_kernel commit: `1b846143c6`]	2025-05-22 15:41:17 -07:00
Illia Silin	7d92e48278	Update the buffer load/store intrinsic names for clang>=20. (#2192 ) * fix the buffer load/store intrinsic names * fix clang format [ROCm/composable_kernel commit: `58f9e9ffbc`]	2025-05-13 10:18:14 -07:00
carlushuang	652263d073	[CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153 ) * combine 2-3 as single stage * support zeroing * improve long tokens * update specialization * b16 ws * 8bit topk optimize * update 15 example [ROCm/composable_kernel commit: `4e9b76f88c`]	2025-05-06 17:32:07 +08:00
Juan Manuel Martinez Caamaño	7a42b06988	Replace inline assembly with builtins in FHMA (#2067 ) * Replace inline assembly with builtins in FHMA --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `f14e648e7c`]	2025-04-10 09:48:37 +02:00
carlushuang	f2dd57b76f	Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" … (#1971 ) * Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit `b92caa3d84`. * fix codegen problem * Update config.hpp --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3e81279d26`]	2025-03-13 11:41:39 +08:00
Illia Silin	b92caa3d84	Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 ) This reverts commit `45fbd9210a`. [ROCm/composable_kernel commit: `8cbcd3e0d0`]	2025-03-11 10:40:18 -07:00
carlushuang	45fbd9210a	[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 ) * support hdim=192/128 pair * remove useless print * update [ROCm/composable_kernel commit: `7a93b16ff6`]	2025-03-11 21:07:40 +08:00
Juan Manuel Martinez Caamaño	31b75f6fdd	Fix typo: v_offset used in initialization of v_offset (#1951 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `0f62dd9928`]	2025-03-06 17:37:29 -08:00
kylasa	676d236a5e	Addressing (Post Merge) code review comments for PR 1845 (#1883 ) * Addressing code review comments. * Addressing code review comments. * Reorganized code for better readability. * add ck_tile gemms for new types in CI * fix jenkins syntax * fix script syntax * Add the test cases back * Address the review comments * Address review comments * clang format * Solve the merging issues * Addressed the comments * clang format --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `66c5f5b0b6`]	2025-03-06 11:40:30 -08:00
Illia Silin	9d24409070	Replace buffer load/store intrinsics with builtins (#1876 ) * replace buffer load/store intrinsics with builtins * fix clang format * replace buffer load/store intrinsics with built-ins in ck_tile * fix clang format * add switch between buffer intrinsics and built-ins * change the builtins threshold to clang20 * fix clang format * fix some compilation errors * revert changes in ck_tile * revert changes in ck_tile * delete all root files and folders when CI completes * try changing the username in CI * fix groovy syntax * add user and group id info to ci dockers * change ownership of all files in CI to jenkins at the end * update changelog [ROCm/composable_kernel commit: `a88bf76ecc`]	2025-03-05 14:33:28 -08:00
Bartłomiej Kocot	5705af0aad	[CK TILE] GEMM with packed i4 (#1885 ) * [CK TILE] GEMM with packed i4 * Fixes * fixes * fixes * fixes [ROCm/composable_kernel commit: `4d9973ec8e`]	2025-02-20 09:59:49 +01:00
kylasa	0aee5c2d16	Support for dtypes (fp8, bf8, bf16 and fp16) for the ck_tile/03_gemm example. (#1845 ) * Support bf16/fb8/bf8 datatypes for ck_tile/gemm * remove commented out code. * Addressing code review comments and enabling universal_gemm for all the supported data types. * Merge conflict resolution. * Solve the memory pipeline compilation error. Merge with the new change of CShuffle * finish the feature, pass the tests * Fix the pipeline and add the benchmark script for other data types --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `ab5d027866`]	2025-02-07 15:07:06 -07:00
Mateusz Ozga	91b54b683e	CK-Tile Grouped GEMM refactor and post PR fixes (#1756 ) * Grouped gemm simple code refactor * Offset invoker * Invoke generic Run, and replace name of parrtitioner variable * Tests fix type * Removed namespaces * Add template param to avoid implicit cast * Remove generic function * Constant value * underline enum to int16_t * Generalize partitioner function * Remove whitespaces * Rename function * Using support * Clang-format * Clang-format * Fn-partitioner description fn * Typo * Typo 2 * Better description * Better description * Refactor after review * Use ctr instead of set fn * Inovke ctr and typo * Comments * Remove unnecessary comment * Review, remove modulo [ROCm/composable_kernel commit: `3c93d3c444`]	2025-01-21 21:06:10 +01:00
Po Yen Chen	08975b5a8f	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com> [ROCm/composable_kernel commit: `37cdbf4f0e`]	2024-12-20 14:41:01 +08:00
carlushuang	74b0db75f7	[CK_TILE] fused-moe first version (#1634 ) * moe pipeline * update code * compile OK * update * update cpu reference * update pipeline_gemm0 * compiler ok * update pipeline * rename to ex pipeline * block-asm * update * update * update first gemm ok * compute correct * update file structure * update README * update * update * update code * update API * return unsupport case * add comment * update readme * update * uncomment * update * fix build err --------- Co-authored-by: valarLip <340077269@qq.com> [ROCm/composable_kernel commit: `440e28b08f`]	2024-11-26 11:14:56 +08:00

1 2

59 Commits