composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
Sami Remes	9a2bedaded	use TEST_F	2025-09-18 11:36:37 +00:00
Sami Remes	0673b7f19d	remove aquant utils	2025-09-18 11:27:59 +00:00
Sami Remes	f1d22a52b0	refactor quant tests	2025-09-18 11:26:10 +00:00
Sami Remes	df5ba96229	Merge branch 'develop' into samremes/tensorwise_quant	2025-09-16 17:25:51 +03:00
Cong Ma	78a9823cb4	[CK TILE GEMM] Add support to convert i4 to OCP FP8/BF8 (#2853 )	2025-09-16 07:18:51 -07:00
JH-Leon-KIM-AMD	804065a36b	[CK Tile] Grouped conv fwd splitn support (#2776 ) ## What's New Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension. ## Bug Fix Fixed 32-bit integer overflow that caused crashes with 6+ splits: - Use `long_index_t` for batch offset calculations - Remove redundant GemmM initialization in constructors ## How It Works - Automatically splits batch dimension when tensor exceeds 2GB - Uses grid.z dimension for parallel processing of splits - Each split processes a subset of batches independently ## Testing Verified with tile_example_grouped_conv_fwd: - n=3000 (6 splits) ✓ - n=3500 (7 splits) ✓ - n=10480 (40 splits) ✓	2025-09-16 16:56:11 +03:00
Sami Remes	98f478d24a	Merge branch 'develop' into samremes/tensorwise_quant	2025-09-16 10:54:58 +03:00
Haocong WANG	59cb906482	[CK_TILE] fix bug when iperm =0 in fmha fwd (#2820 ) * fix bug when iperm =0 in fmha fwd * Disable f8 fmha smoke test until fix pr merged --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-09-16 15:07:10 +08:00
Po Yen Chen	7fbc9d6c97	[CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833 ) * Re-mapping thread block indices for causal=True kernels * Use more intuitive remap_opt value * Fallback to origin remapping if seqlen_q >= 64K * Use GenericAttentionMask to reduce mask computation * Avoid unnecessary boundary check for IsMasking=false case * Fix wrong kernel entry specifier * Add s_nop to prevent delay wave0-3 * Refine scheduling * Remove unnecessary sched_group_barrier() * Move sched_group_barrier() call to scheduler * Replace inline asm s_setprio with intrinsics * Rephrase comments * Expend some o_acc rescaling insts to avoid SIMD idle * Fix block idx special mapping logic * Tune block index mapping for causal=False cases * Tune block index mapping for causal=True cases * Fix wrong vmcnt() * Remove parameter name * Use boolean option for turn on/off causal mask * Update benchmark_fwd_v3.sh option usages * Add option if compiler support it	2025-09-16 11:32:38 +08:00
Thrupti Raj Lakshmana Gowda	7d7ded62d3	[CK Tile Engine] k_block_per_cu changes in Preshuffle (#2842 ) * kperblock changes in CK Tile Engine Preshuffle * Config file formatting changes	2025-09-15 13:22:11 -07:00
linqunAMD	f22740df82	Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725 ) * Enable xdl in gfx11 & gfx12 * update cmake file * fix all instance build (cmake) * fix batched_gemm_gemm(cmake) * rebase cmake files * fix cmake build error * remve CK_ENABLE_DYNAMIC_WARP_SIZE * update cmake build error2 * fix gfx11 build CK_USE_XDL is enabled on gfx11 and gfx12 * fix gfx10 build * fix gfx11 error --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-09-15 10:59:25 -07:00
Illia Silin	03b59f8c76	Revert "feature:tf32:add initial conv3d fwd kernel support (#2763 )" (#2848 ) This reverts commit `c51102144f`.	2025-09-15 08:27:04 -07:00
Sami Remes	dc97be711d	rename quant pipeline problem	2025-09-15 13:47:01 +00:00
Sami Remes	fc4dbd8b7b	remove unused utility function	2025-09-15 13:41:59 +00:00
Sami Remes	5c53da40e4	revert host.hpp include	2025-09-15 13:39:52 +00:00
Sami Remes	f532de59d4	revert common.hpp include	2025-09-15 13:39:05 +00:00
Sami Remes	8f67d1ec14	revert include - from a merge problem?	2025-09-15 13:34:54 +00:00
lym	c51102144f	feature:tf32:add initial conv3d fwd kernel support (#2763 )	2025-09-15 21:03:00 +08:00
Sami Remes	92750c6b53	Merge remote-tracking branch 'origin/develop' into samremes/tensorwise_quant	2025-09-15 12:21:40 +00:00
Sami Remes	3a29035046	Add to changelog	2025-09-15 12:21:36 +00:00
Sami Remes	c899adbda4	Don't use readfirstlane for reading scales - doesn't work for some reason	2025-09-15 12:18:02 +00:00
Cong Ma	e5d73da2da	[CK TILE GEMM] set correct value to TiledMMAPermuteN_ (#2839 ) - TiledMMAPermuteN_ should be set to true when config if GemmConfigPreshufflePrefill	2025-09-13 20:54:08 -07:00
John Afaganis	3a51dbba85	Add Chris Millette as approver (#2844 )	2025-09-12 16:16:17 -07:00
linqunAMD	b0ee317d83	[CK_TILE] Enable ck_tile tests on gfx11 and gfx12 (#2821 ) * [CK_TILE] Enable ck_tile test on gfx11 & gfx12 * revert an unnecessary change * enable pk_int4 on gfx11 & gfx12 * revert .pre-commit-config.yaml	2025-09-12 12:45:14 -07:00
Anton Gorenko	847834a408	[CK_TILE] FMHA Reduce build time by disabling instances that are not tested (#2834 ) * Use lse = false for PagedKV tests There are no instances with lse = true so splitkv is actually launched as a fallback. * Reduce build time by disabling instances that are not tested	2025-09-12 12:44:25 -07:00
Wojciech Laskowski	b25d4d684a	WMMA support for GEMM reduce (#2823 ) Added gemm + reduce instance library for RDNA4. This includes: - New device implementation running GEMM and reduction kernel - instances for wmma (xdl parity) - examples for wmma (xdl parity) - tests for existing xdl and wmma	2025-09-12 21:36:43 +02:00
Illia Silin	b9d69d32a8	Enable FMHA and AITER tests on gfx950. (#2812 ) * enable aiter and fmha test stages on gfx950 * use newer compiler for gfx950 * make sure gfx950 runs correct docker * fix typo * upgrade base docker for aiter * change base docker for aiter tests * do not add group render to ck_aiter image * add group irc in ck_aiter docker * do not fix the irc group id to 39 * do not set jenkins uid and gid * skip group irc for aiter tests * fix syntax error in dockerfile * change the base docker for aiter tests * add irc group back to ck_aiter docker	2025-09-12 12:20:32 -07:00
Thrupti Raj Lakshmana Gowda	f6ba94fb5c	[CK TILE ENGINE] Adding GEMM Preshuffle to CK Tile Engine (#2712 ) * Partial Progress : Completed ListBlob * Additional changes in Listbob * Partial Progress : Generate Blobs Completed * Partial Progress : Added Host side code for Preshuffle * Working code for Preshuffle before Cleanup * Partial Progress : Cleanup * Partial Progress : Datatype Validation * Partial Progress : Warptiles for preshuffle changed from hardcoding to take from config * Partial Progress : Cleanup * Partial Progress : Code Cleanup * Partial Progress : Passing all valid tiles failing for unsupported tiles * Partial Progress : Working code, testing pending for edge cases * Partial Progress for testing * Completed Code * kBlockPerCu as tunable parameter from config * Update tile_engine/ops/gemm_preshuffle/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Partial Progress : Working listkernels * Partial Progress : Cleanup Working listkernels * Partial Progress : Single instance * Partial Progress : Working single instance code * Partial Progress : Working generate individual instance code * Partial Progress : Working rewamped code for given config file needed validation and edge case testing * Partial Progress : Working Code, testing pending * Removing LOGS file * Working code * Minor changes to GEMM Preshuffle : Restructured * Minor Changes in Preshuffle * Changes to Jenkins File * Changes to Jenkins file to consider new architecture * Changes to Jenkins file for fixing CI --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-09-12 11:50:19 -07:00
Thomas Ning	1894a0dbc3	Fix the vector load & fix the gfx950 compv4 error (#2831 )	2025-09-12 11:48:45 -07:00
linqunAMD	321627aec5	Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724 ) * Fix example * fix build error * update pk_i4 & moe test case * fix all instance build (examples) * fix batched_gemm_gemm (example) * disable example_gemm_bias_softmax_gemm_permute on gfx11 * remove unnecessary disable gfx11 * update tests * update tests2	2025-09-12 08:17:07 -07:00
Sami Remes	f5775480cd	Add tensor quant to example	2025-09-12 15:02:02 +00:00
Sami Remes	0038808195	Cshuffle epilogue tests with tensor scaling	2025-09-12 07:09:12 +00:00
Illia Silin	bca99a499d	build and test on gfx942 by default (#2830 )	2025-09-11 14:02:21 -07:00
Michał Kulikowski	ffe9775e70	[CK][EXAMPLES] (#2826 ) -Added parameter to enable/disable verification and timing of kernel in various examples that missed it. -Added parameter to change number of groups to execute in grouped_gemm_examples. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2025-09-11 12:33:00 -07:00
Aviral Goel	f3239395dc	fix(copyright header): add header to missing files (#2807 )	2025-09-11 12:27:08 -07:00
Cong Ma	2ed39f8d91	[CK TILE GEMM] Fixed the regression issue with transpose C in Quant Gemm (#2819 ) The numerical error was introduced after merging row/col quant. And it is fixed.	2025-09-11 11:38:16 -07:00
Sami Remes	bed6e9e59d	Add TensorWise quant mode	2025-09-11 16:19:39 +00:00
John Afaganis	28e12c62a2	Add Haocong as PR approver (#2828 )	2025-09-11 09:09:22 -07:00
Sami Remes	652d2e49a0	rename gemm_group_quant to gemm_quant	2025-09-11 14:40:23 +00:00
linqunAMD	60d3e8f504	[CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 (#2808 ) * [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 * fix gemm_splitk_two_stage * revert .pre-commit-config.yaml	2025-09-11 07:27:33 -07:00
linqunAMD	0b9a638f26	[CK_TILE] fix example reduces, permute and elementwise on gfx11 & gfx12 (#2810 ) 1. Refine Reduce2dShape to support both wave32 and wave64 2. Fix example reduce, permute and elementwise on gfx11 and gfx12 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-09-11 12:41:20 +08:00
Khushbu Agarwal	80a61afb9b	[CK-Tile] Fix quant example code (#2813 ) * initial commit * remove extra files * fixing errors * updated ReadMe file for mapping of diff quants with diff configs * addressing review comments * addressing review comments * Resolved merge conflicts * [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled The get_preshuffle_or was not working as expected, which led to incorrect behavior in the quantization preshuffle process. This change replaces it with the more reliable is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied. --------- Co-authored-by: Cong Ma <congma13@amd.com>	2025-09-10 17:15:39 -07:00
Illia Silin	b4207c01c7	Revert "add vector load 16/32 for bf16/fp16 (#2779 )" (#2818 ) This reverts commit `7ecdba878f`.	2025-09-10 13:35:15 -07:00
Enrico Degregori	bbc8c7d999	Fix merge bug: add DeviceMoEGemmMXBPreShuffle again (#2816 )	2025-09-10 08:03:29 -07:00
zjing14	7ecdba878f	add vector load 16/32 for bf16/fp16 (#2779 ) * add vector load 16/32 for bf16/fp16 * comment addressed * clang format --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-09-09 23:15:19 -07:00
Anton Gorenko	ec006bb8e0	[CK_TILE] Add gtests for FMHA (#2744 ) * Improve random number generation * use different seed for each input (Q, K, V...); * use deterministic generation of: * seqstart_q/k (for group mode); * block_table (for paged-kvcache); * cache_batch_idx (for kvcache); * Extract arg_parser-related code from run functions to use them as tests * Split examples into main programs and fmha runners, build instances separately * Add dummy tests that use instances and runners * Fix a missed corner case of f32->f8 conversion When value is < min f8 denormal but > min f8 denormal / 2, it must be rounded to min f8 denormal (i.e. 0b1), not to 0. * Fix incorrect fp8 scales for P and O in validation code DataTypeConfig was incorrectly compared with fp8_t. * Add host generation of dropout random values and use it for validation Previously host validation (reference_batched_dropout) used random numbers generated by BlockDropout of the kernel, meaning that incorrect generation on device (bad distribution, repeated numbers, too many zeros, etc.) would not trigger any validation errors. * Implement tests from smoke_test_bwd.sh * Return result as enum to distinguish failure and missing instance * Add tests for bwd features: bias, alibi, dropout * Implement tests from smoke_test_fwd.sh * Pass seqlen_q/k as vectors to fwd and bwd runners * Add tests for fwd features: bias, alibi, dropout * Add tests for pagedkv and splitkv * Fix conditions when to use splitkv and pagedkv kernels splitkv was executed only when use_kvcache which == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size). In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1. In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance. * Add tests for appendkv * Use is_v_rowmajor = true because there are no instances with column layout anymore * Split public and private compile options for instances Tests and examples need to know only about CK_TILE_FMHA_FWD__API. Improve parsing validation in bias and mask * Pass bias as string for consistency with mask * Catch parsing and other exceptions * Add bwd test for deterministic flag * Initialize fp8 tensors (-init=ufq) similarly to uf * Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null seqlen_k cannot be used to determine padding when seqlen_k_ptr is provided. The actual seqlen_k is taken from seqlen_k_ptr[b]. Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr may contain arbitrary values. In the example or tests this produces incorrect results with appendkv (for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8). * Fix use_pagedkv value when kvcache = true but page_block_size = 0 In this case block_table_ptr is nullptr which is accessed in the kernel. * Clean up bwd tests * Unify fwd tests for f16/bf16 and fp8 * Use better explicit instantiation declaration for fmha_bwd<2> * Use the same seed for all tests, allow to override it with env variable * Undo clang-format of one irrelevant file For some reason my local clang-format-18 and the one in CI work differently. * Do not build instances and tests on unsupported archs * Build instance libraries as OBJECT library * CI: Enable sccache for HIP There are source files with LANGUAGE HIP, they need -DCMAKE_HIP_COMPILER_LAUNCHER=sccache * Add tests to REGRESSION_TESTS * Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0 The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf. * Pass CK_TILE_FMHA_FWD__API as INTERFACE compile options The instances don't actually depend on them, only examples and tests do. Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS without recompiling instances that are already in ccache. Fix formatting and names	2025-09-10 08:06:14 +05:00
linqunAMD	c254f3d7b4	[CK_TILE] Refine Generic2dBlockShape to fix ck_tile example 2,10,11,14 on rdna3 and 4 (#2795 ) BlockWarps, WarpTile in Generic2dBlockShape are wave size dependent, it causes mangled name mismatch between host and device side. Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape	2025-09-10 08:29:20 +08:00
linqunAMD	df4ee556d6	[CK_TILE] Fix flatmm on gfx11 and gfx12 (#2790 ) 1. Correct shuffle_b and MakeBFlatDramTileDistribution according to WMMA warp layout 2. Add FlatmmConfig16_Wmma for gfx11 and gfx12	2025-09-10 08:28:00 +08:00
Cong Ma	82890192dd	[CK TILE] Support fp8/fp16 with pk_int4_t as data types for tensors A and B (#2805 ) - Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats - Implement A(bf8) B(i4) support in universal GEMM - Use new implementation for i4 to fp8 conversion in Block Scale	2025-09-09 16:40:52 -07:00
Yi DING	91178b4011	[CK_TILE] Fix kname & typo in FMHA BWD (#2809 )	2025-09-09 15:08:00 -07:00
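
Note on 804065a36b (Split-N, #2776): the commit message attributes the crash with 6+ splits to 32-bit overflow in the batch offset and fixes it by computing the offset with `long_index_t`. The sketch below only illustrates that idea. It is not the kernel's code: `long_index_t` is aliased here to `std::int64_t` as an assumption, and `num_splits`, `batch_offset`, and `fits_in_int32` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: a 64-bit signed index type standing in for long_index_t.
using long_index_t = std::int64_t;

// Hypothetical helper: number of splits needed so that each split stays
// within limit_bytes (ceiling division). Both arguments must be positive.
inline long_index_t num_splits(long_index_t total_bytes, long_index_t limit_bytes)
{
    assert(total_bytes > 0 && limit_bytes > 0);
    return (total_bytes + limit_bytes - 1) / limit_bytes;
}

// Hypothetical helper: element offset of the first batch handled by split
// `split_idx`. All operands are 64-bit, so the product is not truncated.
inline long_index_t batch_offset(long_index_t split_idx,
                                 long_index_t batches_per_split,
                                 long_index_t elems_per_batch)
{
    assert(split_idx >= 0 && batches_per_split >= 0 && elems_per_batch >= 0);
    return split_idx * batches_per_split * elems_per_batch;
}

// Hypothetical helper: true if the value is representable as a 32-bit
// signed integer, i.e. a 32-bit offset would have been safe.
inline bool fits_in_int32(long_index_t v)
{
    return v >= std::numeric_limits<std::int32_t>::min() &&
           v <= std::numeric_limits<std::int32_t>::max();
}
```

With made-up sizes of 500 batches per split and 1,000,000 elements per batch, `batch_offset(4, 500, 1000000)` still fits in 32 bits, while `batch_offset(5, 500, 1000000)` is 2,500,000,000 and does not. These numbers are illustrative only and are not taken from the commit.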


2353 Commits