composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 11:47:48 +00:00

Author	SHA1	Message	Date
Sami Remes	c7298e57c0	remove some old files	2026-02-06 18:37:34 +00:00
Sami Remes	dc4366a876	add main include file	2026-02-06 18:12:54 +00:00
Sami Remes	06a8998254	clean up kernel and pipeline code	2026-02-06 18:11:17 +00:00
Sami Remes	c588a1fd42	use unpacked scales	2026-02-06 17:26:03 +00:00
Sami Remes	061c9f9374	save packing approach	2026-02-06 15:54:57 +00:00
Sami Remes	a8d48f9224	now offsetting with M/MPerXdl to get scales	2026-02-05 17:31:32 +00:00
Sami Remes	350022827f	init=1 init=2 working, some scales are still wrong as init=0 failing	2026-02-05 10:28:49 +00:00
Sami Remes	6c61804665	try to enable scale loading in kernel and pipeline	2026-02-05 09:25:21 +00:00
Sami Remes	16fa73db63	use proper rtol/atol	2026-02-03 09:57:20 +00:00
Sami Remes	6b50755cd2	fix alignment calculation of lds tensor views	2026-02-03 08:24:03 +00:00
Sami Remes	b47853d3fe	enable fp4 for universal gemm - without any scaling	2026-02-03 03:10:35 -05:00
Sami Remes	4d241289c9	use default scale (no scale) for 16x16x128 mfma scale	2026-01-30 12:55:46 -05:00
Sami Remes	407df88c02	enable 32 element for fp4	2026-01-30 12:47:45 -05:00
Sami Remes	771c46aa8b	add initial version for scale block_gemm, not used yet	2026-01-30 12:42:45 -05:00
Sami Remes	b124a72ff5	revert mostly back to original comp_async	2026-01-30 12:40:48 -05:00
Sami Remes	2cc0e3d019	override base policys vector size with static_assert 4/12/16 bytes	2026-01-30 03:55:56 -05:00
Sami Remes	409a7d8edb	Merge remote-tracking branch 'origin/develop' into samremes/ck_tile_mx_gemm	2026-01-30 03:30:11 -05:00
MHYangAMD	6ff0737843	Fix redundant cast in model sensitive rmsnorm (#3681) * Fix redundant cast * Fix linting	2026-01-30 10:52:19 +08:00
Khushbu Agarwal	9b168082b7	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and seperated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition	2026-01-28 19:45:09 -08:00
Jeff Huang	e3556fed04	Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657) - Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather - Add is_gather_dim() and get_gather_index() for multi-dim page lookup - Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1}) - Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread - Add static_assert for RowMajor VLayout requirement in batch prefill Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-01-29 07:18:41 +08:00
Sami Remes	0033748c62	revert custom ldstile, should be able to use the regular ones	2026-01-28 10:37:13 -05:00
Yi DING	8e3d84aba3	[CK_TILE] ABQuant New Preshuffle (#3638) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-27 23:46:49 -08:00
damien-lejeune	91e32f305f	[CK Tile] multi reduce improvements (#3607) * WIP: refactoring * Swap operation/data nested loops order * Improve memory coalescing * Add comments * Enforce same identity element for the reduce operations * Re-add compile time constant * Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-01-27 12:56:09 -08:00
Sami Remes	30d4c25d5a	use PackedSize in slicing	2026-01-27 13:01:54 -05:00
Illia Silin	b26cb596b0	fix some syntax errors (#3658)	2026-01-27 09:59:39 -08:00
Sami Remes	f62cc5415f	current state of pipeline	2026-01-27 12:56:24 -05:00
Bartłomiej Kocot	3d67e6c492	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt	2026-01-27 11:04:11 +02:00
Aviral Goel	b8751e505d	feat: Add Interwave scheduler for aquant memory pipeline (#3540) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite	2026-01-26 11:27:42 -08:00
Thomas Ning	3900e1e7ce	Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634) * Finished the work * Fix the pipeline	2026-01-26 10:29:28 -08:00
Sami Remes	70c7fcda43	WIP: debugging...	2026-01-26 11:33:45 -05:00
Emily Martins	f5c2f09036	[CK_TILE] Fix alignment in Stream-K workspace buffer (#3625) * Fix alignment issue in Stream-K workspace buffer In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests. * Compute reference GEMM on GPU for test verification to decrease testing time	2026-01-23 16:14:22 -07:00
ltqin	67f0b74ec6	Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635) This reverts commit `de5a1d730d`. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-01-23 09:03:22 -08:00
Sami Remes	d2a7c2f041	compiles again using get_y_sliced_thread_data in warpgemm loop	2026-01-23 11:01:43 -05:00
Po Yen Chen	de5a1d730d	Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633) This reverts commit `dd0b4294af`.	2026-01-22 21:21:19 -08:00
kensclin	31a35ecab4	GEMM Blockscale ABQuant Optimization (#3620) * GEMM Blockscale ABQuant Optimization * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix precommit error * clean * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2026-01-22 09:39:38 -08:00
Bartłomiej Kocot	44f481a45c	[CK TILE] Fix basic gemm pipelines (#3611) * [CK TILE] Fix basic pipelines * fixes	2026-01-22 08:11:18 -06:00
Linjun-AMD	0b13697a88	[CK_TILE][FMHA]Add new tile size for async (#3623) * Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)" This reverts commit `8f75869408`. * Add new tile_size for async pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-22 16:07:14 +08:00
ltqin	dd0b4294af	Fp8 block scale quantization for fmha fwd (#3330) * add block scale parameters to kernel * add block scale to kernel * add smoke test * format * Revert "format" This reverts commit `356c3c9706`. * only format my code * format py * fix auto not allowd in function prototype * change instance tttt to ttff * fix structured binding issue * change s_acc elementwise op * async pipeline add block scale * add quantation P using shift exp2 * precompute (m - shift) once per row * change blk scale seqstrt ptr name * fix some name * fix for deduction guide * fix some comments * add P scale to qr_ksvs_pipeline * add comment to idx_identity * change the method of calculating descale block index * unify naming style: use block_scale_ as name prefix * unify naming style * update the CHANGELOG.md * Add FP8 block scale quantization support for FMHA forward kernel --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-01-21 20:58:26 -08:00
Yi DING	fcc9372c00	[CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615)	2026-01-21 09:54:46 +08:00
Max Podkorytov	91b4102a59	Add persistent async input scheduler for GEMM kernels (#3520) Add signal-based synchronization for persistent GEMM kernels where input data becomes available incrementally. Uses modulo wraparound (like PyTorch's AsyncMM) for chunk index calculation: chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks Key components: - PersistentAsyncInputScheduler struct with tiles_per_chunk_m, chunk_signals, tile_idx_pivot_m, and num_chunks fields - wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency - IsSupportedArgument validation for scheduler parameters - Example demonstrating async input scheduling with simulated producer - GTest unit tests covering all layout combinations	2026-01-20 10:37:09 -08:00
Linjun-AMD	8f75869408	Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613) This reverts commit `f3aafb9555`.	2026-01-20 09:40:54 -08:00
Bartłomiej Kocot	0727e85e52	[CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518) * [BULDER] Add grouped conv fwd ck tile profiler * [CK TILE] Fix grouped conv kernels splitk and double lds * Updates * Fixes * Move to ckProfiler * Fixes * fix * fix * Change instances to empty list by default * fix * fix * Update grouped_convolution_signatures.hpp * Update grouped_convolution_forward_tile_algs.hpp * [CK TILE] Add grouped convolution forward tests (#3556) * [CK TILE] Add grouped convolution forward tests * fix jenkins * fixes * comments fixes * unit test * unit test fix * Move instances outside builder * fix includes * clang format fix * readme fix * fix includes * fixes	2026-01-19 22:29:01 -07:00
Cong Ma	0517d43d31	[CK TILE] remove dependency on std chrono (#3599) * [CK TILE] remove dependency on std chrono * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-19 15:31:02 -08:00
Linjun-AMD	f3aafb9555	[CK_TILE][FMHA] Add new tile size for async (#3586) * add new tile size for async Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-19 15:22:33 -08:00
Adam Osewski	1a6d1b59ef	[CK_BUILDER] Convolution forward transfer concepts. (#3535) * Rename member variable to better reflect its actuall meaning. * Add transfer checks for conv fwd xdl. * Validate tensor layouts & vector size conv fwd v3. * Add combined transfer concepts. * Add transfer concepts for conv fwd factories. * Fix clang format * Add helper instruction to get max mem vector instruction width. * Apply review comments. * Rename thread cluster access(->arrange) order concept * FIx merge artifacts. * Add generic access order limits into block transfer concept.	2026-01-19 10:54:10 +01:00
Cong Ma	f9104ef9b3	[CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584)	2026-01-16 15:27:39 -08:00
Sami Remes	f09e10936d	fixed vector load siz for fp4	2026-01-16 12:04:34 -05:00
Sami Remes	16ca5cb532	WIP	2026-01-16 08:22:11 -05:00
Estevan Vedovelli	e30207985a	Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-01-15 09:35:24 -05:00
Jeff Huang	993d3e2f0e	[FMHA] Enable page size 16 for batch prefill kernel (#3568) * [FMHA] Enable page size 16 for batch prefill kernel * Refactor batch prefill KV offset logic to simplify template arguments - Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args. - Add static assert to forbid `page_size=1` with vectorized layout.	2026-01-15 22:11:44 +08:00

688 Commits