composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 17:26:00 +00:00

Author	SHA1	Message	Date
carlushuang	a8742f7e31	[CK_TILE][CORE] enhance slice_tile api (#2430 ) * support slice cross p * fix some bug in y_len * more case * fix a bug when R exist * support -1 to hint end of current length * format * change commit	2025-07-06 20:13:12 -07:00
Mingtao Gu	7998ae8969	[CK] Mxfp4 moe blockscale buf2lds version support (#2455 ) * change cshuffle size * added mxfp4 moe async buffer loading without B preshuffle * added mx moe B shuffling + scale shuffling (async loads) * minor fix --------- Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-07-06 15:42:00 +08:00
Adam Osewski	3d70c638d1	Always force output clearing for grouped conv bwd data (#2446 ) * Always force output clearing * dont run set zero for residual --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-07-04 07:49:52 -06:00
Max Podkorytov	158ddeb8ce	[CK-TILE] File-level documentation for static encoding pattern (#2433 ) * add file-level comment * Finished the write-up --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-07-04 02:26:18 -07:00
Vidyasagar Ananthan	2e971eff90	Removing reference to undefined parameter for ignore statement. (#2447 )	2025-07-03 20:10:29 -07:00
damien-lejeune	1183824573	Fix clang in ck develop branch (#2445 ) Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2025-07-02 10:07:47 -06:00
chenjun	74a34e0f50	fix KPerBlock = 64 a8w8 bpreshuffle gemm build fail in gfx950 (#2437 ) Co-authored-by: valarLip <340077269@qq.com>	2025-07-02 19:12:07 +08:00
Gino Lu	60eb70f543	Fix return value bug that drops minus sign in some cases. (#2415 ) * fix return value bug. * refine change according to comment.	2025-07-02 14:53:00 +08:00
huaiguxu	e1c5172fdb	Huaiguxu/moe fp8 pertoken scale fix (#2391 ) * fix pertoken_scale a_scale dimension * clang-format * Fix moe_gemm2_fp8 perTokenScale reference and example.	2025-06-27 10:24:34 +08:00
linqunAMD	1749c0409e	[CK][CONV] Support NCHW in class DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (#2375 ) 1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it. 2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance. 3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw.	2025-06-26 08:32:39 +08:00
Thomas Ning	e03293ebce	[CK Tile] Int8 Support on CK Tile GEMM (#2267 ) * updates to support int8 in 03_gemm example * added comments, using aliases, helper functions * test(gemm_universal): add test cases for int8 gemm pipeline * fix(test_gemm): fix for failing test unit test for int8 * test(ck_tile): add int8 unit test for gemm universal * refactor(gemm_universal): GPU reference verification for GEMM code improved * style(gemm_universal): removed extra comments and did clang format * merging recent changes to universal gemm to tile_engine * ck tile engine integration work * feat(tile_engine): add int8 support to tile engine ops/gemm * feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8 * style: Format code with clang-format-12 * refactor(tile_engine): address review comments * style: removed unhelpful comments & unused variables. * build: tile engine uses default config * feat: add int8 support for CK_TILE GEMM * style: added trailing commas to codegen_utils.py * refactor: tile engine * refactor: formatting and code review * refactor: code formatting for python files * fix: suppress build warning * add support for gfx950 * refactor:KWarpTile size in gemms util * Fix the branch and wrap up the k warp tile * Add bf8 integration * refactor: clang format and rebase --------- Co-authored-by: zjli2013 <leezhengjiang@gmail.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>	2025-06-25 08:20:35 -07:00
Rostyslav Geyyer	daf71fb8e4	Enable fp4 tests (#2329 )	2025-06-25 07:38:54 -05:00
linqunAMD	37e1a27537	[CK_TILE] Refine fp8 support in flatmm (#2239 ) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp definition to a separate struct * fix clang format * fix clang format * keep default behavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-25 01:07:45 -07:00
Po Yen Chen	50fad03524	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403 ) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs()	2025-06-25 15:19:21 +08:00
Xiao Li	bac51b6ec0	Fix amd_ck_fp8.hpp macro definitions (#2325 ) * Fix amd_ck_fp8.hpp macro definitions 1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before. 2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace fp8_impl to avoid redefined error when building with rocm 6.4+ (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h) Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>	2025-06-24 22:46:15 -06:00
Yi DING	c5d9181e1b	Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393 )	2025-06-24 16:35:54 -07:00
Anton Gorenko	77123600ee	Improve fmha_bwd tests performance (#2376 ) * Avoid passing indices (std::vector) by value to host tensor's operator() Each access requires 2 allocations and copies of the vector. * Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification * Compute ds_hp_host_ref in parallel This sequential ForEach is the slowest part of validation and it benefits from parallel computation. * Do not use ForEach for simple copy and conversion of large tensors These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices.	2025-06-24 07:45:24 -07:00
Kiefer van Teutem	9e74ae7c89	Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319 ) * Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty. * Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good. * Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet. * Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3. In principle the whole thing works now, just need to add other data types and perhaps do some cleanup. * Add other layouts for batched gemm wmma chufflev3 f16 f16 f16. Now matching XDL (for f16). * Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts. * Fixup comments and TODOs * Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small. * Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests. * Take batching into account when using rotating memory or clearing the C tensor. * Implement small refactors / comments etc. from review. * Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists. * Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems. 
* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups. * Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2025-06-24 07:28:13 -07:00
lalala-sh	bb571a0330	fix moe i4 bug from aiter (#2339 )	2025-06-24 14:51:29 +08:00
Yi DING	b8212864cf	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-06-24 01:33:31 +08:00
Rostyslav Geyyer	dbfe70e72a	Add accelerated stochastic rounding on gfx950 (#2355 ) * Add native prand generation support for gfx950 * Update seed calculation	2025-06-23 09:31:46 -05:00
John Shumway	47ae4b0955	Shard several of the most costly targets. (#2373 ) * Shard several of the most costly targets. Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance * fix clang format * Fix build errors in instantiation code. I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR. * Fix link errors from mismatched declarations. Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for instantiating the kernels and to generate the calling function. * Shard the longest 2D convolution builds Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes. * Use PROJECT_SOURCE_DIR for submodule compatibility I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. 
Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project. * Remove accidental copy of a file * Remove accidental copies of template files. --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2025-06-23 07:24:36 -07:00
Po Yen Chen	7d669440a6	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors	2025-06-23 12:29:15 +08:00
Max Podkorytov	0366fb2abc	Update for xformers (#2372 ) * update api * update kernel api * clang-format	2025-06-22 00:28:30 -07:00
Bartłomiej Kocot	cebdee4d9e	[CK TILE] Grouped Convolution Forward Kernel (#2188 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes	2025-06-20 15:44:36 -07:00
Thomas Ning	107e3623c7	Transpose builtin macro defense (#2374 ) * add the macro defense * add the static assert check	2025-06-20 11:24:54 -07:00
Max Podkorytov	11eb9f1c77	Reland fix default epilogue (#2367 ) * Revert "Revert "Fix default epilogue (#2358)" (#2364)" This reverts commit `64a2fda713`. * add operator() with old signature	2025-06-19 10:39:30 -07:00
joyeamd	a2f01141aa	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-06-18 01:28:34 -07:00
Thomas Ning	64a2fda713	Revert "Fix default epilogue (#2358 )" (#2364 ) This reverts commit `cd606f72c1`.	2025-06-17 22:43:05 -07:00
carlushuang	a4e1248dba	[CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335 ) * support local_token for hipgraph * update README * fix comment * fix fmoe example	2025-06-18 10:49:43 +08:00
Max Podkorytov	cd606f72c1	Fix default epilogue (#2358 ) * [ck-tile] fix default epilogue in gemm universal * argument validation needs vector size D * operator() needs to specify dram windows * copy/paste from cshuffle epilogue * clang-format * mark unused argument --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-17 17:30:21 -07:00
linqunAMD	0eb8974502	[CK_TILE] Support multi-config in tile_example_gemm_universal (#2240 ) * [CK_TILE] Support multi-config in tile_example_gemm_universal Add GemmConfig in run_gemm_example to support multiple tile config. - It is useful when you need to compare gemm perf with different tile/pipeline config - we can also use it to simplify the code for wmma support in the future. * [CK_TILE] Support multi-config in tile_example_gemm_universal Address review comments * rebase code and fix clang format. * fix clang format * support pipeline v5. * fix merge conflict * address review comment * add missing file * address review comment v2 * fix build error	2025-06-17 17:27:46 -07:00
Illia Silin	cdfd7722bf	Revert "Shard several of the most costly targets. (#2266 )" (#2361 ) This reverts commit `3a0cb27966`.	2025-06-17 13:56:30 -07:00
Bartłomiej Kocot	cc98a41f46	Fix Add in dynamic buffer for fp32/i8 (#2351 ) * Fix Add in dynamic buffer for fp32/i8 * fixes * Fix	2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi	4c57157d50	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-06-17 11:54:30 -07:00
Thomas Ning	3c4cdfac4f	Fix the CK Tile related operators (#2356 ) * fix the flatmm * Fix the pipeline * address the comment	2025-06-16 17:38:52 -07:00
Illia Silin	5523df4b2d	Revert "fix the flatmm (#2349 )" (#2352 ) This reverts commit `d996bc78be`.	2025-06-16 07:54:55 -07:00
Bartłomiej Kocot	f6c2ff9dce	Grouped convolution forward with clamp (#2334 ) * Grouped convolution forward with clamp * Optimize clamp * unary fixes * test gk bias * Revert "test gk bias" This reverts commit `8e42e29d7b`. * Revert "Revert "test gk bias"" This reverts commit `e73c0550ce`. * workaround comment	2025-06-16 15:36:53 +02:00
Thomas Ning	d996bc78be	fix the flatmm (#2349 )	2025-06-16 02:17:53 -07:00
ruanjm	b34c234f51	Add support for specifying valid flag when fetching elements for tile_scatter_gather (#2332 ) * Add support for specifying valid flag when fetching elements for tile_scatter_gather Add constexpr for operator[] of TrueGenerator * Use different path when valid is enabled	2025-06-16 17:17:03 +08:00
carlushuang	fb97f75099	hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348 )	2025-06-15 22:49:04 -07:00
Illia Silin	2d8a804152	Fix direct lds load for gfx950 and clang20 (#2346 ) * fix direct lds load for gfx950 and clang20 * Update include/ck/utility/amd_buffer_addressing_builtins.hpp * Fix format --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>	2025-06-15 15:22:34 -07:00
Mateusz Ozga	bd96ac9742	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a reference function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails	2025-06-13 19:39:11 +02:00
John Shumway	3a0cb27966	Shard several of the most costly targets. (#2266 ) * Shard several of the most costly targets. Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance * fix clang format * Fix build errors in instantiation code. I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR. * Fix link errors from mismatched declarations. Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for instantiating the kernels and to generate the calling function. * Shard the longest 2D convolution builds Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes. * Use PROJECT_SOURCE_DIR for submodule compatibility I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. 
Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project. --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2025-06-13 03:58:50 -07:00
kylasa	5f1ad09b61	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build.	2025-06-12 18:24:02 -07:00
Thomas Ning	f59b8c7d3d	OCP FP8 Macro restructure (#2331 ) * solved the problem	2025-06-12 09:46:33 -07:00
Bartłomiej Kocot	bb4f471b09	Grouped conv bwd weight with grouped gemm (#2304 ) * Grouped conv bwd weight with grouped gemm * fixes * fix * Fixes * test comments * restore atol * fix	2025-06-12 10:15:07 +02:00
carlushuang	8aff45a8af	[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel (#2327 ) * refactor subtoken logic to let more kernel pickup mp kernel * typo	2025-06-12 11:44:22 +08:00
Yi DING	37554c31e8	Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 ) * [fix] align v3 gufusion pipeline * fix device kernel selection. * Add .co direct asm support by CK_USE_ASM_MOE_STAGE2_BLOCKSCALE * experimental optimization for scale load in blkscale gemm * Add asm for no-loop v3_128x128x128 * fix bugs * tune fp8 example * Update v1_128x128x128 to 2x2 instead of 4x1 * wip * add warmup to asm launch * wip2 * 16x16 function merged to moe * temp save, a performant version. * wip3 * Update .co binary to 16x16 * 16x16x128 correct; 64x64x128 failed * update * use mem_op::set when topk=1 * add mx fp8 b_preshuffle support, function not yet tested. * Spilt the fp4 target. Fix the known bugs. 128x128x128 sanity checked; remove prints * some fixes * fix update * remove some unnecessary hacky; enable 256x256x256 tilesize * update for function debug * Add pipeline v3. Have some runtime issue and register spill * Fix pipe v3 correctness issue * remove unnecessary hacky * clang format * fix a bug * fix the bug, functional test passed * tempsave; buggy at passed 4 e8m0 to scaled mfma * added fp4_bpreshuffle example, build failures * fixed some bugs * implement shuffled scale mxfp4gemm, blocker: opsel not effect * hotfix * fix bugs, build passed * (M, N, K)=(128, 128, 128) function failed. * temp save for gemm1. Function not ready * fix compile error. Gemm2 pass. Gemm1 WIP * fix bug for a lds read * update moe * Compile pass. Gemm1 function WIP * update moe * fix fp8; fix even/odd * tempsave * update moe * Revert "update" This reverts commit `960b2bce1c`. * Revert "use mem_op::set when topk=1" This reverts commit `def952a178`. * Add v3 128x128x128_4x4_16x16.co for gfx950 * temp cmake flag suppression for aiter test * add code for mxfp4 gemm, blockscale not supported yet * gemm1 up-only pass. GU WIP * function pass with inline asm hacky * revert unexpected file change * updated and build passed * update CE elementOP * added code for debug * Gemm1 GUFusion function pass. 
Perf WIP * Fix fp8/bf8; remove duplicated code * disable the scheduler in v3; bring it back when compiler feature ready. * update moe v1 pipeline * Add gemm1 v1 32x128x128 * remove schedule barrier * updated * Fix fp8/bf8 B-row * mfma using asm, device result correct, host result need to check * gemm1 v3 64x128x128 debug * fix cpu ref * a/b thread_desc stride fix * Use random scale for init1 * 16x16x128 input size blockscale function passed * fix blockscale gemm bug * tempsave. Almost all instances passed. * v1 fix for mi350. * temp save * debug save * update debug * fix the bug, 128x128x256 tile function passed * v3 * rename moe block selector and pipeline * Add gemm1 v1 * Add gemm1 v1 to selector * added mx moe block v3 support, function passed * compile error fix * Improve the pipeline * Pack e8m0 as int32_t * v1 compile pass. Function not ready * debug synchronize issue over different GPU/ROCm * minor fix * Add profiler filter * Add f4 ckProfiler * Fix example compile error * Add f4 profiler examples * tempsave * v1 function pass. * v3 function pass * align file and function name * mx_moe_fp4 ready for aiter with clang-format. * modify the way we represent fp4 * generalize the pipeline scheduling. * init moe mx f4 scale shuffle * Cmakelist diable compiler-bound flags * mx_fp4 default parameter change * Moe blockscale gemm1&gemm2 asm support for aiter. Suppression cmkae flag til new compler. * update code * tempsave; modify the way we represent fp4 * generalize the pipeline scheduling. * Add gemm1 gfx942 .co support * updated code, build passed. * Update gemm2 asm with latest compiler flag * Fix mx f4 ckProfiler * Fix blockwise gemm mx v1 * lds conflict free + buffer load lds * Add gemm2 v3 64x128x128 * fix a, b scale loading bugs, a, b scale loading now correctly * Add gemm2 v3 64x128x128 * commit with debug info * fix fp4 profiler * Add mx fp4 pileline v1 instances * Fix v2 topk_weight cal. Add silu asm. 
* v2 tok_weight WIP * init mx fp4 B no preshuffle version * tempsave. compile pass, function wrong * enable fp4 moe no weigth preshuffle, function pass * update the TFlops calculation in the example * Add gemm2 64x128x128 asm. Fix BF16 ref. * fix 2 typos in fp4_preshuffle * Better kernel selection in device classes * correct preShuffleBuffer we should used packed k to do shuffle. * lds conflict free + buffer load lds * optimize offset math in dma * Fix fp4 ckProfiler * Fix MX MFMA tests * fix f4 pipeline issues * gemm1 func pass * update mx moe gemm1_bns tile size to 64x128x256 * update mx moe gemm1 gemm2 TF and BW calculation * fix typo * temp save * Fix example_gemm_mx build * rename the block pipeline * correct a typo in tail * Add rotating to mx examples * fix the correctness issue * Fix v1; use M padding * Add NT flag to B/BScale buffer * Merge gemm_mx_common.hpp * temp save, 4.4~4.5 * Fix 'Merge gemm_mx_common.hpp' * refactor the pipeline * Pad the M for scale buffer unconditionaly * update MX moe GEMM1 hotloopscheduling * change the gemm1 tile from 64x128x128 to 128x64x128 * Unconditional Ascale padding * Pad shuffled a scale only * pad ascale * add vmcnt guard for async copy * Profiler add f4 wp * Merge preshuffle device * Add more fp4 wp instances * Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format * Clang-format after 2 merges * Remove rocm6.3 workaround flags and macro * Fix fp8 config * Fix bf8 config * flag and barrier fix for copmiler branch MainOpSelV3 * Add fp8 profiler instances * Remove debug infos; Enable flags for blockscale f8 * No asm ver. 
for merging moe blocksale fp8 into mainline * update the flag name for f8blockscale * recover example * fix performance bug of bpreshuffle f8 gemm * clang format, remove single rate mfma restriction for f8 * remove single rate mfma restriction for f8 blockscale gemm * Fix moe blockscale gemm1 barrier 0x800 for new compiler * add pipeline v1 for MOE Gemm2 * Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns * Fix OOB; add MB96 instances * remove unnecessary files * fix the cmake issue * Enable splitk for mxfp4; clang format; * Generate random tensor values with multiple threads * Use packed_size_v for A/BPackedSize * Fix warning * Fix target_compile_options for disabled target on gfx942 * fix moe pki4 on gfx950 * doc the kGroup definition * Fix ThreadwiseTensorSliceTransfer_v4::Run (Fuse scale) * Refactor thread_copy_lds_direct_load; fix gfx942 direct lds load example; fix f16_pki4 example * Fix unknown compiler flag * fix two failed examples. * fix some failure tile size in gfx950 universal gemm. 
fix test_gemm_fp16 * workaround fix for test_gemm_f32; * We have very limited support for lds direct load if input matrix is not K major * fix test_gemm_splitk; * Fix compile for mx_mfma_op * add mfma selection logic for multipled_v3 * Clean up * Fix device gemm mx link error * improve the global atomic pattern * Revert unnecessary copyright updates * restore minimum_occupancy logic * Avoid data race in moe gemm2 ref * Build fp8 gemm_multiply_multiply and moe only on gfx94/95 * update the instance in device_mx_gemm * Resolve comments * Copyright 2025 * Remove unused code * fix library linking issue --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-06-12 09:25:59 +08:00
Bartłomiej Kocot	8c1ed6f4c1	Move SetZero functions inside the kernels for Grouped Conv (#2255 ) * Disable SetZero before launch kernel for grouped conv fwd * Move set zero to kernel * wmma fix * fix --------- Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>	2025-06-11 23:41:03 +02:00
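
The sharding commits above (#2266, relanded as #2373) mention a filter_tuple_by_modulo helper that breaks a tuple of kernel instances into shards so that each shard can be compiled in its own translation unit. The following is a minimal sketch of that idea only, not the code from the repository: the signature, the filter_by_modulo_impl helper, and the use of integer values in place of instance types are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical helper: builds a tuple from the elements at indices
// Shard, Shard + NumShards, Shard + 2 * NumShards, ...
template <std::size_t NumShards, std::size_t Shard, typename Tuple, std::size_t... Is>
constexpr auto filter_by_modulo_impl(const Tuple& t, std::index_sequence<Is...>)
{
    return std::make_tuple(std::get<Shard + Is * NumShards>(t)...);
}

// Name borrowed from the commit message; this signature is an assumption.
// Keeps every element whose index i satisfies i % NumShards == Shard.
template <std::size_t NumShards, std::size_t Shard, typename... Ts>
constexpr auto filter_tuple_by_modulo(const std::tuple<Ts...>& t)
{
    static_assert(NumShards > 0 && Shard < NumShards, "invalid shard index");
    constexpr std::size_t size = sizeof...(Ts);
    // Number of indices i in [0, size) with i % NumShards == Shard.
    constexpr std::size_t count =
        size > Shard ? (size - Shard + NumShards - 1) / NumShards : 0;
    return filter_by_modulo_impl<NumShards, Shard>(t, std::make_index_sequence<count>{});
}

// Usage: five stand-in "instances" split across two shards.
constexpr auto all_instances = std::make_tuple(0, 1, 2, 3, 4);
constexpr auto shard0 = filter_tuple_by_modulo<2, 0>(all_instances); // indices 0, 2, 4
constexpr auto shard1 = filter_tuple_by_modulo<2, 1>(all_instances); // indices 1, 3

static_assert(std::tuple_size<decltype(shard0)>::value == 3, "shard 0 holds three elements");
static_assert(std::tuple_size<decltype(shard1)>::value == 2, "shard 1 holds two elements");
static_assert(std::get<1>(shard1) == 3, "second element of shard 1 is index 3");
```

Striding by index, instead of cutting the tuple into contiguous halves, tends to spread expensive neighbouring instances evenly over the shards, which is one plausible reason for a modulo-based split.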


1131 Commits