composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 19:57:40 +00:00

Author	SHA1	Message	Date
Sami Aario	d75d38bf05	Add DstDataType as a template parameter to load_tile_with_elementwise, and use it for type conversion	2026-01-08 07:53:33 +00:00
Sami Aario	2e798d15e1	Add functionality and tests for fp16 x fp8 and fp8 x fp16	2026-01-08 07:53:33 +00:00
Sami Aario	7fdf8222c2	Add functionality and tests for bf16 x fp8 and fp8 x bf16	2026-01-08 07:53:33 +00:00
Sami Aario	0b0ddf1a38	Add MFMA warp gemm for float, float, float, 32, 32, 16	2026-01-08 07:53:33 +00:00
Sami Aario	7bb452d9b8	Refactor type conversions out of MakeBLdsBlockDescriptor, WIP!	2026-01-08 07:53:33 +00:00
Sami Aario	fc82ebc174	Introduce DetermineWarpPrecType for determining warp GEMM precision types	2026-01-08 07:53:33 +00:00
SamiAario-AMD	0a4388d4cc	Merge branch 'develop' into LWPCK-3549-cleanups	2026-01-08 09:08:21 +02:00
Bartłomiej Kocot	bc497beffb	[CK TILE] Fix grouped conv kernels splitk and double lds (#3527)	2026-01-08 07:59:38 +01:00
Sami Aario	2edd077b50	Adjust whitespace with clang-format	2026-01-07 19:44:00 +00:00
Sami Aario	ca17ac3358	When possible, use the overload of load_tile_transpose that does not require assignment	2026-01-07 19:44:00 +00:00
Sami Aario	321611081f	Remove an unused overload of load_tile_transpose_with_offset	2026-01-07 19:44:00 +00:00
Sami Aario	8fc4030a57	Add an instance of load_tile_transpose that takes a reference to the output tensor as an input	2026-01-07 19:44:00 +00:00
Sami Aario	63a455952a	No need to specify DstDataType in load_and_convert_tile as WarpTile knows its DataType	2026-01-07 19:44:00 +00:00
Sami Aario	3d55a1e682	No need to specify SrcDataType in load_and_convert_tile as WarpWindow knows its DataType	2026-01-07 19:44:00 +00:00
Sami Aario	514035e6cf	In BQuantGemmPipelineAgBgCrCompV3, always convert BDatatype pk_int4_t to ADataType regardless of BLayout	2026-01-07 19:44:00 +00:00
Sami Aario	9af4498194	Remove the defaults for SrcDataType and DstDataType in GemmPipelineAgBgCrImplBase::GlobalPrefetch	2026-01-07 16:21:32 +00:00
Sami Aario	9633d3f5bb	In GetAWindows and GetBWindows, use DataType from LDS tensor view	2026-01-07 16:21:32 +00:00
Sami Aario	9559a93432	Make explicit that the tile window argument to load_tile_with_elementwise and the two load methods it uses are tuples	2026-01-07 16:21:32 +00:00
Sami Aario	cfa11f2d1f	Rename InterleavedPKTypeLoader to ConverterLoader, and load_int4_tile to load_and_convert_tile	2026-01-07 16:21:32 +00:00
Sami Aario	3a094e2f8b	Include ck_tile/core.hpp in load_interleaved_pk_type.hpp for better IDE integration	2026-01-07 16:21:32 +00:00
Sami Aario	74533b4755	Rename load_interleaved_pk_type to load_and_convert_tile	2026-01-07 16:21:32 +00:00
Sami Aario	994b8f4c22	Minor refactoring of load_interleaved_pk_type	2026-01-07 16:21:32 +00:00
Sami Aario	ca71cd75fc	Reduce the scope of KPack in MakeALdsBlockDescriptor	2026-01-07 16:21:32 +00:00
Sami Aario	825d17c3d7	Fix a comment	2026-01-07 16:21:32 +00:00
Sami Aario	bda5a7aa2d	Add braces	2026-01-07 16:21:31 +00:00
Sami Aario	969156985b	Use decltype for consistency in Interwave variant of BlockGemmImpl	2026-01-07 16:21:31 +00:00
Sami Aario	4d77856be5	Make some functions return void explicitly instead of auto	2026-01-07 16:21:31 +00:00
Cong Ma	d7497d2694	[CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512) Refactor function amd_buffer_load_invalid_element_return_zero to avoid the inefficient ASM code generated by the compiler. The compiler generates suboptimal assembly for the ternary operator, causing excessive VGPR usage. Tested compilers: ROCm 7.0.1, ROCm 7.1.1. Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-07 00:05:56 -08:00
Khushbu Agarwal	aaa35f0bbf	[CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and corresponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * initial commit with prints * debugging * fine-grained working * debugging medium grained * fixing the tile window * formatting * enabling prefill shapes * working prefill shapes * formatted * clean up * code cleanup * bug fix after merging with develop * clean up after merging with develop * added comments for the tile window and tile distribution encoding --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>	2026-01-06 12:46:59 -08:00
kyle-256	76696ace44	[CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452) * update grouped_gemm blockwise kernel * update config * update kernel * update examples * remove test code for now * sync test files with origin/develop * update example * fix code lint * fix code-lint * update test code * run clang format * run pre-commit * update api	2026-01-06 12:36:04 -08:00
kensclin	2309c86054	[CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495) * [CK_TILE] add preshuffleB mode for ABQuant GEMM * fix precommit error * use template method call for cvt_scale_to_fp32 * fix precommit error * add test code * fix precommit error * switch abquant gemmconfig to default * Add changelog.md * fix precommit error * fix conflict	2026-01-06 12:35:01 -08:00
joyeamd	b78563b3d3	Merge some updates for ck_tile headers (#3342) * fix some issues from internal branch * update cshuffle_epilogue * update cshuffle_epilogue * update cshuffle * update warp_gemm	2026-01-05 23:39:00 -08:00
joyeamd	2b563ad048	Joye/revise wp pipeline (#3493) * [CK_TILE] unify double and single lds implementation (#108) Unify LDS buffer management API for single and double buffering modes. This change consolidates the Local Data Store (LDS) buffer management by: - Merging single and double LDS buffer APIs into a unified interface - Implementing ping-pong address calculation in pipeline when double LDS is enabled - Computing pong buffer addresses dynamically using base address offsets --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update wp_pipeline * fix a c++17 issue * update for ci errors * fix ci issues * include a header to fix ci errors * fix some rebase issues * update with rebase --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-05 13:49:26 -08:00
Estevan Vedovelli	1224bc0a82	Add support to gfx1153 and fix gfx115X WMMA config (#3496) * Support for gfx115X * Changes for gfx115X * Add gfx1153 * Update changelog --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-01-05 10:03:30 -08:00
Jeff Huang	cc75a1dc5f	[FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442) * add page_block_size parameter * add is_sglang_layout to parameters * add kv_offset_array_transform to batch async for page size 16 * add kv_last_page_lens to kernel * change kv layout to [num_total_pages, page_block_size, hdim] * format * - enable codegen of batch_prefill kernels - create new problem struct BlockFmhaBatchPrefillPipelineProblem for batch prefill kernels - generate different page sizes of batch prefill kernels (1, 16) * 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950 2. support page size 1024 * fix python format * change kv cache layout to [num_blocks, num_kv_heads, head_size/x, block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X] * 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcoded values 2. Makes batch prefill kernel traits structures inherit from fmha fwd traits 3. Add some static check for Page size, vector size, hdim, ..., etc. * [Refactor] Replace is_sglang_layout with Enums for KV cache configuration Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single boolean. Changes: * Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`. * Updated Kernel, Pipeline, and Traits to template on these Enums. * Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`. * Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`. * Updated CodeGen scripts to support new parameters. This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations. * 1. remove batch prefill pipeline with sk_pad=false 2. correct some comments 3. add static assert to make sure v offsets is in same page within a tile. * fix vgpr spill count * remove unnecessary t2s functions * add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py * support linear kv cache layout * Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse kv_page_indices as a pointer of the lookup table. * 1. merge multiple transforms into single transform. 2. add static check to make sure vlayout is row-major. * move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs. * update changelog --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>	2026-01-05 18:41:47 +08:00
Max Podkorytov	e339101e9c	[CK-Tile] move out memory operation from cshuffle epilogue class (#3359) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder * hacky poc * fix tile engine * kill the lambda * maybe fix test build * more fixes * clang-format * save temp * clang-format * mostly fix examples * clang-format * remove dead code * more cleanup * fix fmha bwd build (default epilogue set/add appears to be broken) * fix default epilogue tests but not correctness * clang-format * fix bquant * clang-format * cleanup dead code * rearrange make windows for readability * restore changes to IsSupportedArgument * fix smoke-builder * clang-format * fixup rename class * build fixes * clang-format * fix builder * fixup * remove set from builder tests * fix test * clang-format * re-refactor the kernels * clang-format * fix header license * remove memory operation from conv bwd test * clang-format * clang-format example,include * clang-format test * build fixes * clang-format * solve compilation error * fix the CI * solve compilation error * clang format * solve merge conflict * solve merge conflict * solve the gfx11 error * solve test error * moar build fixes * remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-04 03:28:14 -08:00
John Afaganis	ec23be0b9d	Update unsigned long literals and format specifiers to work correctly in Windows (#3483) Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows. The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit. This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.	2026-01-02 22:16:41 -07:00
yadaish	dae85ead64	[CK_TILE] support split-k a16w4 gemm1 (#3389) * initial version to support moe gemm1 split-k * add missing args * fix build warning * update reference * for split-k disable bias and weight * remove debug log * fix format * fix div by zero errors * fix cmake config * update * resolve conflicts * remove useless changes * reformat * fix * remove useless changes * fix ci --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>	2025-12-29 23:05:35 +08:00
Yi DING	b0ea67e377	[CK_TILE] MX FLATMM Fix M Padding (#3489) * Fix M Padding * Fix tensor desc ele space size	2025-12-29 09:09:12 +08:00
Erwin Terpstra	e08efa551f	[CK_TILE] Grouped gemm quant tensor layouts (#3414) * feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time * chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times * chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems * fix: missing check for transposed load in bquant pipeline * chore: lower unit test tensors dimensions a bit for faster tests * chore: set grouped gemm example M/N warp tile to 16 --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-24 23:01:23 -08:00
Illia Silin	14668a56e3	remove the LLVM_MAIN_REVISION usage (#3487)	2025-12-24 16:49:35 -08:00
kensclin	7f68f3c4fa	Enable padding blockscale for abquant (#3453) * Enable padding blockscale for abquant * run clang-format * Reduce unnecessary testing * remove cout	2025-12-24 09:12:40 -08:00
Po Yen Chen	1c3151963b	[CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425) * Add fp8bf16 support for batch_prefill * Fix wrong scale_s re-compute logic in batch_prefill * Fix wrong scale_s re-compute logic in fmha fwd * Fix batch_prefill codegen error * Remove no-longer used GetName() function * Add fp8 logits=True instances * Update CHANGELOG.md	2025-12-24 10:34:06 +08:00
jakpiase	c0797c1671	[CK_TILE] Minor splitk bugfix for gemms and conv (#3387) * fix for splitk if splitk < grid * add different splitk implementation * minor bugfix for streamk gemm * Add test --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-12-24 00:10:13 +01:00
jakpiase	ead81d1b0b	[CK_TILE] Add splitk support to ck tile conv bwd data (#3353) * add splitk support to ck tile conv bwd data * add reviewers suggestions * minor fix * removed splitkbatchoffset struct	2025-12-23 10:03:42 +01:00
Lyu, Xudong	8b73633e65	fix: handle void return type in TailHandler error path with ROCm6 compiler (clang++) (#3477) Replace `decltype(TailHandler<>(...)){}` with direct function call to fix compilation error when return type is void. Co-authored-by: Yi DING <yi.ding@amd.com>	2025-12-23 15:03:18 +08:00
Jan Patrick Lehr	9bd67c2cf2	[CK-TILE] Guard against compiler lexer diagnostic (#3444) * [CK-TILE] Guard against compiler lexer diagnostic A recent change to Clang added a lexer-level diagnostic about a C2y language feature. Since it is lexer level, the `__extension__` compiler built-in does not work, as it is only respected after the lexer, when parsing. This change adds guarding pragmas to disable the diagnostic in the lexer so that it does not lead to warnings being treated as errors. * Fixing still existing build issue Once the one warning was removed, another one popped up. Both are related to the same C2y feature. Thus, ignoring both. * clang-format handling --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-19 17:32:20 -08:00
Illia Silin	2d9c962e2c	get LLVM_MAIN_REVISION macro from compiler header (#3469)	2025-12-19 14:57:12 -08:00
Yi DING	2220cbaba7	[CK_TILE] MX Flatmm Use Byte Pointer Arithmetic for A Tensor (#3446) * A as bytes * Reformat with static_for_product	2025-12-19 10:28:13 +08:00
yadaish	c0ee71d735	Dev/a8w4 and a8w8splitk (#3447) * Ck moe bs splitk pr (#3440) * splitk kick-off. Compilation fail * splitk hack pass * fix scale offset calc. * clang-format for a8w8_moe_blk_gemm1 splitk change * fix testcase error --------- Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> * Zan/moe a8w4 (#3441) * update * update * update ck moe a8w4 * update * update * update * compile pass * update * update * python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready * support new a8w4 kernel * update * update ck_tile * re format * update * update * fix conflict * fix build * update ck_tile moe * fix clang format * fix the problem * fix accuracy issue * fix --------- Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> Co-authored-by: Zzz9990 <zanzhang@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-12-19 09:26:52 +08:00
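
Commit ec23be0b9d describes replacing `unsigned long` literals and format specifiers with explicitly 64-bit ones, because `long` is 64 bits on LP64 systems but 32 bits on LLP64 systems. The following is a minimal sketch of that general technique using only standard C++ headers. It is not taken from the commit: `kFourGiB` and `format_u64` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical constant: a value that does not fit in 32 bits.
// UINT64_C produces a literal wide enough for 64 bits on every platform,
// whereas an `UL` suffix only yields 32 bits where unsigned long is 32-bit (LLP64).
constexpr std::uint64_t kFourGiB = UINT64_C(4294967296);

static_assert(sizeof(std::uint64_t) == 8, "std::uint64_t is exactly 64 bits wide");

// Hypothetical helper: formats a 64-bit unsigned value portably.
// PRIu64 expands to the correct conversion specifier for std::uint64_t,
// so the format string does not depend on the width of long.
inline std::string format_u64(std::uint64_t value)
{
    char buffer[32]; // 20 digits plus the terminator fit comfortably
    const int written = std::snprintf(buffer, sizeof(buffer), "%" PRIu64, value);
    assert(written > 0 && written < static_cast<int>(sizeof(buffer)));
    (void)written; // avoid an unused-variable warning when NDEBUG is defined
    return std::string(buffer);
}
```

Calling `format_u64(kFourGiB)` yields the same decimal string on both data models, because neither the literal nor the format specifier relies on `unsigned long`.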


646 Commits
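
Commit 8b73633e65 describes replacing `decltype(TailHandler<>(...)){}` with a direct function call so that the code also compiles when the return type is `void`. The sketch below illustrates that general idea only; it does not reproduce the repository's code. `VoidHandler`, `ValueHandler`, `dispatch`, and `demo` are hypothetical names, and the commented-out line shows the assumed problematic shape.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical handler that returns void (stands in for a tail-handler-like callable).
struct VoidHandler
{
    int* counter;
    void operator()(int n) const { *counter += n; }
};

// Hypothetical handler that returns a value.
struct ValueHandler
{
    int operator()(int n) const { return n * 2; }
};

// Assumed problematic shape: value-initializing the deduced return type.
//     return decltype(handler(n)){};   // some compilers reject this when the type is void
//
// Returning the call expression directly is valid for both void and
// non-void results, since `return f();` is allowed when f() yields void.
template <typename Handler>
decltype(auto) dispatch(Handler&& handler, int n)
{
    return std::forward<Handler>(handler)(n);
}

static_assert(std::is_void<decltype(dispatch(VoidHandler{nullptr}, 0))>::value,
              "dispatch deduces void for a void-returning handler");

// Small usage example exercising both kinds of handler.
inline void demo()
{
    int calls = 0;
    dispatch(VoidHandler{&calls}, 3);          // void result, side effect only
    assert(calls == 3);
    assert(dispatch(ValueHandler{}, 4) == 8);  // value result passes through
}
```

Note that the two shapes differ in behavior: the value-initializing form never invokes the handler, while the direct call does, so the substitution is only appropriate where calling the handler on that path is acceptable.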