composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 00:58:44 +00:00

Author	SHA1	Message	Date
Jeff Huang	7ede589f4b	[CK_TILE] Add sequence padding and variable length support in fmha (a… (#2851 ) * [CK_TILE] Add sequence padding and variable length support in fmha (and v3) - Group Mode Padding: Introduces the `-s_qpad` argument to support physically padded layouts. Kernels now use padded start pointers (`seqstart_padded__ptr`) for memory addressing. - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens` arguments for efficient processing of variable-length sequences by passing cumulative effective lengths (`cu_seqlen__ptr`) to the kernel. - FMHA examples: Support padding and variable length both in group and batch mode. Dispatcher is updated as well (dispatch to kPadSeqLenK enabled pipeline). - New padding test cases: Add padding test cases to `smoke_test_fwd.sh`, and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that specifically validate/benchmark the new padding and variable-length functionalities in both group and batch modes. * [CK_TILE] Fix build error in fmha unit tests --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `86dd59cd01`]	2025-09-19 17:36:49 +08:00
Anton Gorenko	a64deec3ba	[CK_TILE] FMHA Fix synchronization issues in BWD pipelines (#2876 ) * Run ctest with --output-on-failure * Fix synchronization issues in bwd pipelines The bwd kernel reuses the same area of LDS for ds (SGrad), bias and dbias (BiasGrad). This means that there must be block_sync_lds between loading one tensor and storing another to the same area. Heavy instructions like MFMA/WMMA and global loads are executed between reuses of the same memory so in MOST cases loading is finished by all warps before storing is started. However, sometimes warps progress at different speeds. Running the tests multiple times and, preferably, with multiple processes on the same GPU helps to trigger this issue: bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure [ROCm/composable_kernel commit: `2aec38f9ec`]	2025-09-19 11:34:45 +05:00
ltqin	fd80c78f50	Add input fp8 and output bf16 attention (#2726 ) * change host using fp16 to check * fp8 to fp8 compare * rewrite input parameters * add not squant * remove some output code * for scale = 1 * format * saturates only for fp8 * add fp8bf16 data type * add fp8bf16 data type * fix test fp8 code * add run_fp8bf16_tests * change fmha fwd example parameter(adding fp8bf16) * Support fp8bf16 for Aiter * Support aiter fp8bf16 in c++ * fix comment about fp8 in readme.md * add fp8fp32 * add fp8fp32 test * remove range_q etc. * format * fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type * add fp8bf16 to data_type function * change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8 * format * reset atol for fp8 * fix bug for atol --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `dd249f1cd6`]	2025-09-19 14:26:43 +08:00
Max Podkorytov	a00705b4fd	poc convert fnuz fp8 to non-native dtype similar to ocp (#2871 ) [ROCm/composable_kernel commit: `e469fee046`]	2025-09-18 22:51:01 -07:00
SamiAario-AMD	30b63f4c04	Add gemm weight preshuffle pk_int_t support (#2858 ) * Factor out the three separate copies of load_interleaved_pk_type into a common utility class * Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the weight preshuffle GEMM example * Remove a duplicate function * Add support for B tensor type pk_int4_t for the weight preshuffle GEMM, with tests included * I4 support introduced more failing test cases that mirror the existing ones for F8 * Simplify the check for which tests to skip (they all have F8 as A tensor type) * Add a changelog entry * add the test for v2 wp pipeline, polish the code, add the support of int4 for v2 wp pipeline * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `47cd0d5cff`]	2025-09-18 21:26:10 -07:00
Mateusz Ozga	64e1f86daf	[CK_TILE] Multiple-ABD GEMM example (#2788 ) * Multi ABD - initial commit * Clang-format fix * block gemm, unify the name of CDataType * Apply changes to mem-pipeline * Rollback prefix for DType and Layout * Gemm Kernel Basic, rename * WMMA config * Grouped GEMM * Clang-format * Dropout, name * Review v2 * Move element_wise fn to unary, remove old ones fn * clang-format * Fix issue review * WP operator adjust to universal gemm * v2 prepare * Remove unused comment * Remove vectorsize * Rollback * Adjust pipeline for abd * Shuffle argument * CI-fail fix quant * Fix ag_br pipeline * Failing tests * Typo * Single argument support [ROCm/composable_kernel commit: `30ab1d6a71`]	2025-09-19 01:14:11 +02:00
Rostyslav Geyyer	83e2403545	Fix UB caused by reinterpret_cast (#2849 ) * Use bit_cast instead of reinterpret_cast to avoid UB * Apply same fix in ck_tile [ROCm/composable_kernel commit: `14bbc545ea`]	2025-09-18 07:12:37 -07:00
Yi DING	8bc9d6226d	[CK_TILE] FMHA Test Ignore Known Errors (#2872 ) [ROCm/composable_kernel commit: `7ee7915e94`]	2025-09-18 16:51:21 +08:00
aledudek	a9d74c3208	[CK_TILE] Fix batched_gemm tests for gfx950 (#2869 ) [ROCm/composable_kernel commit: `427dca076b`]	2025-09-17 16:43:41 -07:00
yinglu	19463895a8	TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 ) * Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)" This reverts commit `954db22b39`. * fix compile error on gf12x * only run tf32 example on gfx942 * only build tf32 instance on gfx942 * ckProfiler:only support tf32 in gfx942 * delete unuseful messages [ROCm/composable_kernel commit: `dd7af118d7`]	2025-09-17 14:50:15 -07:00
Aviral Goel	a7a7fa13bb	build(grouped_gemm): added appropriate compiler flag to resolve numerical error for fp8 on gfx950 (#2868 ) [ROCm/composable_kernel commit: `7c934b72ab`]	2025-09-17 11:04:21 -07:00
Michał Kulikowski	5334a45c0e	[CK][Examples] - fixing grouped_conv_bwd_weight command parser. (#2840 ) -added parameter to change group count for grouped_gemm examples. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> [ROCm/composable_kernel commit: `5c4f52a02a`]	2025-09-17 10:39:48 -07:00
pmaybank	377a3da125	[CK_TILE] Add support for gfx12 in tile_engine for GEMM benchmarking (#2802 ) * initial work on adding support of gfx12 in tile_engine for GEMM benchmarking * add stage("Run TILE_ENGINE_GEMM Tests on gfx1201") to Jenkins config * make tile_[m/n/k] validation arch dependent [ROCm/composable_kernel commit: `592d73ad73`]	2025-09-17 17:59:01 +01:00
Gino Lu	f9660c00dc	[CK_TILE] Refine pk_fp4's fill, pack, and unpack (#2845 ) * fix bug * let pack/unpack return pk_fp4_t * fix clang-format [ROCm/composable_kernel commit: `c2997f2b7f`]	2025-09-17 10:54:06 +08:00
Aviral Goel	0fb1cfa4b7	fix(grouped_gemm): pipeline selection when tail_num varies per group and leads to numerical error (#2863 ) * fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num * WIP: add temp config to stress test numerical error correction * refactor: remove comments [ROCm/composable_kernel commit: `db79fad16f`]	2025-09-16 18:43:19 -07:00
Wojciech Laskowski	302398f3fd	Added wmma support for gemm quantization: (#2841 ) - profiler for gemm quantization for DL/XDL - tests for gemm quantization for DL/XDL - implementation for gemm quantization for WMMA - profiler/tests for gemm quantization for WMMA Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `f97b2a3f5d`]	2025-09-16 16:23:29 -07:00
Aviral Goel	f8f7e8497a	feat(tile_window): print content of tile window for easier debugging (#2827 ) * feat(tile_window): add function to print content of tile window of static length, given a 2D range * chore: make documentation less verbose [ROCm/composable_kernel commit: `2723dbd332`]	2025-09-16 15:47:21 -07:00
Aviral Goel	ce0feff1af	test(grouped_gemm): add gtests for the example/grouped_gemm_preshuffle to ensure its integrity (#2811 ) * test(grouped_gemm): add gtests for the example to maintain its integrity * test(grouped_gemm_preshuffle): add prefill variant to testbed to cover wider range * fix: removed residue code to make b_shuffle() work again * test(grouped_gemm_preshuffle): limit the test suite to gfx942 arch as it fails on gfx90a * build: add gfx950 as build target for gtests * test(grouped_gemm_preshuffle): temporarily disable fp8 prec tests due to numerical errors * fix(grouped_gemm_preshuffle): resolved fp8 tests failure on gfx950 by adding correct compiler flag [ROCm/composable_kernel commit: `48e08c6429`]	2025-09-16 15:43:30 -07:00
Emily Martins	4adef6618c	[CK_TILE] Stream-K GEMM Implementation (#2781 ) * Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function Prior to this change, the splitk_batch_offset parameter of MakeGemmTensorViews had type SplitKBatchOffset. But, the only member variable of the SplitKBatchOffset class used in the MakeGemmTensorViews function was splitted_k (an int32_t). The splitted_k value was used as part of defining the dimensions of the tensor view. That said, for Stream K, we do not need to use the SplitKBatchOffset class since we are not using Split K. Thus, this commit changes the splitk_batch_offset parameter to a int32_t called k_size. This will avoid the constraint of requiring a caller of MakeGemmTensorViews to use the SplitKBatchOffset class while still providing the same functionality. Calls to UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly. * StreamK Kernel RunGemm Implementation Stream K cannot simply use UniversalGemmKernel's RunGemm for the following reasons: 1. The UniversalGemmKernel::RunGemm function computes num_loop based on a static function of the TilePartitioner. That said, for Stream K, num_loop must be computed using a member function (namely GetCurrentIterLength from PR #2708). 2. The UniversalGemmKernel::RunGemm function requires the use of a SplitKBatchOffset object which is not used for Stream K since we are not using Split K. Thus, this change adds a RunGemm function in the StreamKKernel class. * initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm * Fix indexing and offset issues for StreamK These changes do the following: - Ensure offsets along the M and N dimensions are multiplied by MPerblock or NPerBlock, respectively. This ensures tile window origins are at the correct locations. - Fix bug in the tile partitioner's GetTileIdxWithOffset. 
Now, we apply divmod to the given references to ensure correct values are available to the caller. - Added documentation in the Stream-K operator() * Initial gtests for Stream-K These changes add an initial gtest suite for the CK Tile Stream-K kernel. Currently, due to bugs in the StreamKTilePartitioner (which will be handled in a future PR), there are validation issues for certain cases which may differ on different architectures. Thus, we opted to run cases that are only fully data-parallel (skipping others). A guard was added to Stream-K's IsSupportedArgument method to ensure that callers are aware of this constraint. Additionally, to ensure testing reproducibility, options for setting the number of CUs and occupancy were added to MakeKernelArgs. * Use GemmPipeline operator() variant that takes hot loop and tail num In Stream-K, the num_loop value varies per WG and per iteration of a Stream-K loop. So instead, we use the version of the GemmPipeline's operator() function that takes in has_hot_loop and tail_num. This is similar to what is done in Grouped GEMM. * changes from review: comments, move readfirstlane, remove ifndef * Switch direction of C tensor traversal & add padding guard Prior to this change, WGs travelled backwards through their assigned macro tiles in the C tensor. For instance, if WG0 is responsible for C tiles 0 and 1, it would first visit tile 1 then tile 0. This means that the iter_end decrements in each iteration of the stream-K while loop. Since we are working with unsigned integers, the subtraction operation may not be safe. Thus, this change makes it such that WGs travel forward so that their iter_start is incremented and their iter_end remains fixed. Additionally, we added a guard against WGs that are neither sk_blocks nor dp_blocks to ensure such WGs do not participate in the GEMM. Together, these changes make it such that the algorithm is correct when sk_blocks is greater than zero. 
* Disable StreamK_M256_N256_K256_SKBlocks12 test case This instance involves >=3 WGs contributing to each macro tile in C. Due to the use of atomics, this is resulting in precision errors. These errors will not persist once the reduction strategy is implemented. We will re-enable this test then. --------- Co-authored-by: Astha Rai <astha.rai713@gmail.com> [ROCm/composable_kernel commit: `dee185d80c`]	2025-09-16 16:21:47 -06:00
linqunAMD	1e9b1826b5	[CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (c254f… (#2837 ) * [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (`5b17f135b7` ) WarpPerBlock_M * WarpPerBlock_N is not equal to ThreadPerBlock_M * ThreadPerBlock_N / warpSize, so BlockSize should be calculated from WarpPerBlock_M * WarpPerBlock_N. To be compatible with wave32, the function GetBlockSize is added to calculate the correct size on the host side. * fix blocksize for all kernels related to Generic2dBlockShape * remove constexpr for blocks [ROCm/composable_kernel commit: `b7a806f244`]	2025-09-16 08:47:55 -07:00
Bartłomiej Kocot	576fd484c2	Disable GridwiseOp prints if env var is off (#2843 ) * Disable GridwiseOp prints if env var is off * Fixes [ROCm/composable_kernel commit: `671adb59c5`]	2025-09-16 17:47:28 +02:00
Cong Ma	e167f8843f	[CK TILE GEMM] Add support to convert i4 to OCP FP8/BF8 (#2853 ) [ROCm/composable_kernel commit: `78a9823cb4`]	2025-09-16 07:18:51 -07:00
JH-Leon-KIM-AMD	6980efa6fe	[CK Tile] Grouped conv fwd splitn support (#2776 ) ## What's New Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension. ## Bug Fix Fixed 32-bit integer overflow that caused crashes with 6+ splits: - Use `long_index_t` for batch offset calculations - Remove redundant GemmM initialization in constructors ## How It Works - Automatically splits batch dimension when tensor exceeds 2GB - Uses grid.z dimension for parallel processing of splits - Each split processes a subset of batches independently ## Testing Verified with tile_example_grouped_conv_fwd: - n=3000 (6 splits) ✓ - n=3500 (7 splits) ✓ - n=10480 (40 splits) ✓ [ROCm/composable_kernel commit: `804065a36b`]	2025-09-16 16:56:11 +03:00
Haocong WANG	4db9e47cd5	[CK_TILE] fix bug when iperm =0 in fmha fwd (#2820 ) * fix bug when iperm =0 in fmha fwd * Disable f8 fmha smoke test until fix pr merged --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `59cb906482`]	2025-09-16 15:07:10 +08:00
Po Yen Chen	9c6cca5bc5	[CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833 ) * Re-mapping thread block indices for causal=True kernels * Use more intuitive remap_opt value * Fallback to origin remapping if seqlen_q >= 64K * Use GenericAttentionMask to reduce mask computation * Avoid unnecessary boundary check for IsMasking=false case * Fix wrong kernel entry specifier * Add s_nop to prevent delay wave0-3 * Refine scheduling * Remove unnecessary sched_group_barrier() * Move sched_group_barrier() call to scheduler * Replace inline asm s_setprio with intrinsics * Rephrase comments * Expend some o_acc rescaling insts to avoid SIMD idle * Fix block idx special mapping logic * Tune block index mapping for causal=False cases * Tune block index mapping for causal=True cases * Fix wrong vmcnt() * Remove parameter name * Use boolean option for turn on/off causal mask * Update benchmark_fwd_v3.sh option usages * Add option if compiler support it [ROCm/composable_kernel commit: `7fbc9d6c97`]	2025-09-16 11:32:38 +08:00
Thrupti Raj Lakshmana Gowda	fe557311d4	[CK Tile Engine] k_block_per_cu changes in Preshuffle (#2842 ) * kperblock changes in CK Tile Engine Preshuffle * Config file formatting changes [ROCm/composable_kernel commit: `7d7ded62d3`]	2025-09-15 13:22:11 -07:00
linqunAMD	71dc8a9d4d	Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725 ) * Enable xdl in gfx11 & gfx12 * update cmake file * fix all instance build (cmake) * fix batched_gemm_gemm(cmake) * rebase cmake files * fix cmake build error * remove CK_ENABLE_DYNAMIC_WARP_SIZE * update cmake build error2 * fix gfx11 build CK_USE_XDL is enabled on gfx11 and gfx12 * fix gfx10 build * fix gfx11 error --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `f22740df82`]	2025-09-15 10:59:25 -07:00
Illia Silin	954db22b39	Revert "feature:tf32:add initial conv3d fwd kernel support (#2763 )" (#2848 ) This reverts commit `d4dbf93119`. [ROCm/composable_kernel commit: `03b59f8c76`]	2025-09-15 08:27:04 -07:00
lym	d4dbf93119	feature:tf32:add initial conv3d fwd kernel support (#2763 ) [ROCm/composable_kernel commit: `c51102144f`]	2025-09-15 21:03:00 +08:00
Cong Ma	9b65e9ec43	[CK TILE GEMM] set correct value to TiledMMAPermuteN_ (#2839 ) - TiledMMAPermuteN_ should be set to true when config if GemmConfigPreshufflePrefill [ROCm/composable_kernel commit: `e5d73da2da`]	2025-09-13 20:54:08 -07:00
John Afaganis	deca9e73f0	Add Chris Millette as approver (#2844 ) [ROCm/composable_kernel commit: `3a51dbba85`]	2025-09-12 16:16:17 -07:00
linqunAMD	930f95d4a6	[CK_TILE] Enable ck_tile tests on gfx11 and gfx12 (#2821 ) * [CK_TILE] Enable ck_tile test on gfx11 & gfx12 * revert an unnecessary change * enable pk_int4 on gfx11 & gfx12 * revert .pre-commit-config.yaml [ROCm/composable_kernel commit: `b0ee317d83`]	2025-09-12 12:45:14 -07:00
Anton Gorenko	e76f294f85	[CK_TILE] FMHA Reduce build time by disabling instances that are not tested (#2834 ) * Use lse = false for PagedKV tests There are no instances with lse = true so splitkv is actually launched as a fallback. * Reduce build time by disabling instances that are not tested [ROCm/composable_kernel commit: `847834a408`]	2025-09-12 12:44:25 -07:00
Wojciech Laskowski	f2edb06bb0	WMMA support for GEMM reduce (#2823 ) Added gemm + reduce instance library for RDNA4. This includes: - New device implementation running GEMM and reduction kernel - instances for wmma (xdl parity) - examples for wmma (xdl parity) - tests for existing xdl and wmma [ROCm/composable_kernel commit: `b25d4d684a`]	2025-09-12 21:36:43 +02:00
Illia Silin	8c0cdebe63	Enable FMHA and AITER tests on gfx950. (#2812 ) * enable aiter and fmha test stages on gfx950 * use newer compiler for gfx950 * make sure gfx950 runs correct docker * fix typo * upgrade base docker for aiter * change base docker for aiter tests * do not add group render to ck_aiter image * add group irc in ck_aiter docker * do not fix the irc group id to 39 * do not set jenkins uid and gid * skip group irc for aiter tests * fix syntax error in dockerfile * change the base docker for aiter tests * add irc group back to ck_aiter docker [ROCm/composable_kernel commit: `b9d69d32a8`]	2025-09-12 12:20:32 -07:00
Thrupti Raj Lakshmana Gowda	f1c582209f	[CK TILE ENGINE] Adding GEMM Preshuffle to CK Tile Engine (#2712 ) * Partial Progress : Completed ListBlob * Additional changes in Listbob * Partial Progress : Generate Blobs Completed * Partial Progress : Added Host side code for Preshuffle * Working code for Preshuffle before Cleanup * Partial Progress : Cleanup * Partial Progress : Datatype Validation * Partial Progress : Warptiles for preshuffle changed from hardcoding to take from config * Partial Progress : Cleanup * Partial Progress : Code Cleanup * Partial Progress : Passing all valid tiles failing for unsupported tiles * Partial Progress : Working code, testing pending for edge cases * Partial Progress for testing * Completed Code * kBlockPerCu as tunable parameter from config * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update 
tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Partial Progress : Working listkernels * Partial Progress : Cleanup Working listkernels * Partial Progress : Single instance * Partial Progress : Working single instance code * Partial Progress : Working generate individual instance code * Partial Progress : Working rewamped code for given config file needed validation and edge case testing * Partial Progress : Working Code, testing pending * Removing LOGS file * Working code * Minor changes to GEMM Preshuffle : Restructured * Minor Changes in Preshuffle * Changes to Jenkins File * Changes to Jenkins file to consider new architecture * Changes to Jenkins file for fixing CI --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `f6ba94fb5c`]	2025-09-12 11:50:19 -07:00
Thomas Ning	cb3bbd3881	Fix the vector load & fix the gfx950 compv4 error (#2831 ) [ROCm/composable_kernel commit: `1894a0dbc3`]	2025-09-12 11:48:45 -07:00
linqunAMD	07def6b13d	Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 ) * Fix example * fix build error * update pk_i4 & moe test case * fix all instance build (examples) * fix batched_gemm_gemm (example) * disable example_gemm_bias_softmax_gemm_permute on gfx11 * remove unnecessary disable gfx11 * update tests * update tests2 [ROCm/composable_kernel commit: `321627aec5`]	2025-09-12 08:17:07 -07:00
Illia Silin	adc66b9b0e	build and test on gfx942 by default (#2830 ) [ROCm/composable_kernel commit: `bca99a499d`]	2025-09-11 14:02:21 -07:00
Michał Kulikowski	064eb037db	[CK][EXAMPLES] (#2826 ) -Added parameter to enable/disable verification and timing of kernel in various examples that missed it. -Added parameter to change number of groups to execute in grouped_gemm_examples. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> [ROCm/composable_kernel commit: `ffe9775e70`]	2025-09-11 12:33:00 -07:00
Aviral Goel	6e774b512a	fix(copyright header): add header to missing files (#2807 ) [ROCm/composable_kernel commit: `f3239395dc`]	2025-09-11 12:27:08 -07:00
Cong Ma	741ddfe584	[CK TILE GEMM] Fixed the regression issue with transpose C in Quant Gemm (#2819 ) The numerical error was introduced after merging row/col quant. And it is fixed. [ROCm/composable_kernel commit: `2ed39f8d91`]	2025-09-11 11:38:16 -07:00
John Afaganis	b7a55a6224	Add Haocong as PR approver (#2828 ) [ROCm/composable_kernel commit: `28e12c62a2`]	2025-09-11 09:09:22 -07:00
linqunAMD	a303edcfb0	[CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 (#2808 ) * [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 * fix gemm_splitk_two_stage * revert .pre-commit-config.yaml [ROCm/composable_kernel commit: `60d3e8f504`]	2025-09-11 07:27:33 -07:00
linqunAMD	eaf1fa7edb	[CK_TILE] fix example reduces, permute and elementwise on gfx11 & gfx12 (#2810 ) 1. Refine Reduce2dShape to support both wave32 and wave64 2. Fix example reduce, permute and elementwise on gfx11 and gfx12 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `0b9a638f26`]	2025-09-11 12:41:20 +08:00
Khushbu Agarwal	2eb6cbb6a8	[CK-Tile] Fix quant example code (#2813 ) * initial commit * remove extra files * fixing errors * updated ReadMe file for mapping of diff quants with diff configs * addressing review comments * addressing review comments * Resolved merge conflicts * [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled The get_preshuffle_or was not working as expected, which led to incorrect behavior in the quantization preshuffle process. This change replaces it with the more reliable is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied. --------- Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `80a61afb9b`]	2025-09-10 17:15:39 -07:00
Illia Silin	7382914d59	Revert "add vector load 16/32 for bf16/fp16 (#2779 )" (#2818 ) This reverts commit `985d1319a1`. [ROCm/composable_kernel commit: `b4207c01c7`]	2025-09-10 13:35:15 -07:00
Enrico Degregori	4935bf3fcb	Fix merge bug: add DeviceMoEGemmMXBPreShuffle again (#2816 ) [ROCm/composable_kernel commit: `bbc8c7d999`]	2025-09-10 08:03:29 -07:00
zjing14	985d1319a1	add vector load 16/32 for bf16/fp16 (#2779 ) * add vector load 16/32 for bf16/fp16 * comment addressed * clang format --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `7ecdba878f`]	2025-09-09 23:15:19 -07:00
Anton Gorenko	57d63b3e70	[CK_TILE] Add gtests for FMHA (#2744 ) * Improve random number generation * use different seed for each input (Q, K, V...); * use deterministic generation of: * seqstart_q/k (for group mode); * block_table (for paged-kvcache); * cache_batch_idx (for kvcache); * Extract arg_parser-related code from run functions to use them as tests * Split examples into main programs and fmha runners, build instances separately * Add dummy tests that use instances and runners * Fix a missed corner case of f32->f8 conversion When value is < min f8 denormal but > min f8 denormal / 2, it must be rounded to min f8 denormal (i.e. 0b1), not to 0. * Fix incorrect fp8 scales for P and O in validation code DataTypeConfig was incorrectly compared with fp8_t. * Add host generation of dropout random values and use it for validation Previously host validation (reference_batched_dropout) used random numbers generated by BlockDropout of the kernel, meaning that incorrect generation on device (bad distribution, repeated numbers, too many zeros, etc.) would not trigger any validation errors. * Implement tests from smoke_test_bwd.sh * Return result as enum to distinguish failure and missing instance * Add tests for bwd features: bias, alibi, dropout * Implement tests from smoke_test_fwd.sh * Pass seqlen_q/k as vectors to fwd and bwd runners * Add tests for fwd features: bias, alibi, dropout * Add tests for pagedkv and splitkv * Fix conditions when to use splitkv and pagedkv kernels splitkv was executed only when use_kvcache which == (need_append_kvcache \|\| use_cache_batch_idx \|\| 0 < page_block_size). In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1. In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance. 
* Add tests for appendkv * Use is_v_rowmajor = true because there are no instances with column layout anymore * Split public and private compile options for instances Tests and examples need to know only about CK_TILE_FMHA_FWD__API. Improve parsing validation in bias and mask * Pass bias as string for consistency with mask * Catch parsing and other exceptions * Add bwd test for deterministic flag * Initialize fp8 tensors (-init=ufq) similarly to uf * Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null seqlen_k cannot be used to determine padding when seqlen_k_ptr is provided. The actual seqlen_k is taken from seqlen_k_ptr[b]. Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr may contain arbitrary values. In the example or tests this produces incorrect results with appendkv (for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8). * Fix use_pagedkv value when kvcache = true but page_block_size = 0 In this case block_table_ptr is nullptr which is accessed in the kernel. * Clean up bwd tests * Unify fwd tests for f16/bf16 and fp8 * Use better explicit instantiation declaration for fmha_bwd<2> * Use the same seed for all tests, allow to override it with env variable * Undo clang-format of one irrelevant file For some reason my local clang-format-18 and the one in CI work differently. * Do not build instances and tests on unsupported archs * Build instance libraries as OBJECT library * CI: Enable sccache for HIP There are source files with LANGUAGE HIP, they need -DCMAKE_HIP_COMPILER_LAUNCHER=sccache * Add tests to REGRESSION_TESTS * Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0 The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf. * Pass CK_TILE_FMHA_FWD__API as INTERFACE compile options The instances don't actually depend on them, only examples and tests do. 
Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS without recompiling instances that are already in ccache. Fix formatting and names [ROCm/composable_kernel commit: `ec006bb8e0`]	2025-09-10 08:06:14 +05:00

2357 Commits