composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-02 21:27:45 +00:00

Author	SHA1	Message	Date
Tianxing Wu	0d2a9badba	fixed example	2025-10-23 11:17:46 +00:00
Juuso Korhonen	3c0e6d37bf	fixing bugs	2025-10-23 09:47:30 +00:00
Juuso Korhonen	e144872308	change to BLOCK_M in shape definitions	2025-10-23 08:11:55 +00:00
Tianxing Wu	f72b994b00	More compilation fixes	2025-10-20 15:53:35 +00:00
Juuso Korhonen	d68a541c19	fixing compile errors...	2025-10-20 15:04:47 +00:00
Juuso Korhonen	97e7527eb1	fixing compile errors...	2025-10-20 14:03:15 +00:00
Tianxing Wu	9fda954253	Compiling fix	2025-10-20 13:16:19 +00:00
Tianxing Wu	995c6701d3	Merge branch 'tianxing/unified-attention' of https://github.com/ROCm/composable_kernel into tianxing/unified-attention	2025-10-17 09:05:12 +00:00
Tianxing Wu	af9167abad	example	2025-10-17 09:05:10 +00:00
Juuso Korhonen	9940bd07f6	fix order in mask caller	2025-10-16 11:23:46 +00:00
Juuso Korhonen	072de3842f	comment	2025-10-16 09:23:39 +00:00
Juuso Korhonen	aa4908ac14	fix mask	2025-10-16 09:18:38 +00:00
Juuso Korhonen	62932576c4	use correct mask in kernel	2025-10-16 09:02:08 +00:00
Juuso Korhonen	498a97aa1d	merge	2025-10-16 08:57:14 +00:00
Juuso Korhonen	63c17b7236	correct masking by transforming y_idx = y_idx / num_queries_per_kv	2025-10-16 08:54:07 +00:00
Tianxing Wu	853fa21566	Example boostrap	2025-10-15 11:58:44 +00:00
Juuso Korhonen	72fe8b311c	merge	2025-10-14 12:35:33 +00:00
Juuso Korhonen	4d232d59cc	fix seq_len -> cur_batch_query_len	2025-10-14 12:34:33 +00:00
Tianxing Wu	b940a75328	Comments	2025-10-14 12:19:20 +00:00
Tianxing Wu	ec29289bb1	kv paging	2025-10-14 12:04:11 +00:00
Tianxing Wu	c87f2e3ca9	o window change	2025-10-14 09:59:47 +00:00
Tianxing Wu	96b208f6c7	Merge branch 'tianxing/unified-attention' of https://github.com/ROCm/composable_kernel into tianxing/unified-attention	2025-10-14 09:58:30 +00:00
Tianxing Wu	e1120fffb0	pipeline api	2025-10-14 09:58:27 +00:00
Juuso Korhonen	c3d27abfb8	fix q window	2025-10-14 09:49:54 +00:00
Juuso Korhonen	b37c356090	fix q window origin	2025-10-14 09:36:28 +00:00
Tianxing Wu	6a7fa959b7	kv tensor view and initial window	2025-10-13 12:53:43 +00:00
Tianxing Wu	cd354286c1	Merge branch 'tianxing/unified-attention' of https://github.com/ROCm/composable_kernel into tianxing/unified-attention	2025-10-13 11:32:30 +00:00
Tianxing Wu	be58d51d36	o ptr and window	2025-10-13 11:32:28 +00:00
Juuso Korhonen	6ba25b7e84	add commenting	2025-10-13 10:34:55 +00:00
Juuso Korhonen	81a02ffb40	Merge branch 'tianxing/unified-attention' of https://github.com/ROCm/composable_kernel into tianxing/unified-attention	2025-10-13 10:30:22 +00:00
Juuso Korhonen	b721f79f99	fix	2025-10-13 10:30:11 +00:00
Tianxing Wu	16129a794a	stride fix	2025-10-13 10:30:08 +00:00
Tianxing Wu	96fde33ec4	Merge branch 'tianxing/unified-attention' of https://github.com/ROCm/composable_kernel into tianxing/unified-attention	2025-10-13 10:29:07 +00:00
Tianxing Wu	55fc6d7151	kv tensor view	2025-10-13 10:28:02 +00:00
Juuso Korhonen	af94aaf1cb	refactor the q tensor view transformation	2025-10-13 10:22:52 +00:00
Juuso Korhonen	49ce980c67	Merge branch 'tianxing/unified-attention' of https://github.com/ROCm/composable_kernel into tianxing/unified-attention	2025-10-13 10:21:27 +00:00
Juuso Korhonen	2d6dab29eb	refactor the q tensor view transformation	2025-10-13 10:18:23 +00:00
Tianxing Wu	36a65b1968	refactor	2025-10-13 10:05:23 +00:00
Tianxing Wu	bc6385f389	Some refactor	2025-10-13 10:01:38 +00:00
Tianxing Wu	1f4648dab5	refactor. and fixed q transformation	2025-10-10 15:27:36 +00:00
Tianxing Wu	df60493219	refactor	2025-10-10 13:25:19 +00:00
Juuso Korhonen	436eb3a4f8	transform q tensor view	2025-10-10 12:08:16 +00:00
Tianxing Wu	191f179038	unified attention rename	2025-10-09 08:47:19 +00:00
Tianxing Wu	e54cb5a713	intial commit	2025-10-06 13:02:38 +00:00
Sami Remes	ef43078788	Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe (#2893 ) * Use __builtin_amdgcn_readfirstlane for buffer resource in fused_moe * also do the same for amd_buffer_addressing_builtins.hpp * merge with develop * fix clang format --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-09-30 15:12:30 -07:00
joyeamd	b60af5bde9	[CK_TILE]enhance elementwise test (#2683 ) * enhance elementwise * fix ci issues	2025-09-30 08:29:37 -07:00
Aviral Goel	bebf0e9d15	Extend Grouped GEMM with MultiD (Single & Double Shared Memory) feature to use persistent kernel option (#2933 ) * feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature * refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel * tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments * fix: segfault fix by passing correct parameters for d tensors * style: clang format * WIP: host code for grouped_gemm_multi_d persistent kernel compiles but segfaults * feat(grouped_gemm_multi_d): add functionality to run persistant kernel * feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature * refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel * tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments * fix: segfault fix by passing correct parameters for d tensors * style: clang format * fix: incorrect validation method and Dtensor layout in test suite * docs: improved README text based on review comments * fix: parameterize NumDTensor in GroupedGemmHostArgs and remove lint	2025-09-29 15:03:56 -07:00
Khushbu Agarwal	81458a6681	Weight Preshuffle Block Scale gemm support (#2877 ) * initial commit * remove extra files * fixing errors * updated ReadMe file for mapping of diff quants with diff configs * addressing review comments * addressing review comments * Resolved merge conflicts * [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled The get_preshuffle_or was not working as expected, which led to incorrect behavior in the quantization preshuffle process. This change replaces it with the more reliable is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied. * initial commit * debugging * working fp8 for init constant * fp8 working with all inits * updated block level code with comments * changing the loop iter * debugging * debugging * debugging * code fix * code clean up * clang formatted * Add comment * code cleanup * clang formatted * merge conflicts fixes * applying the latest int4 changes to the piepline * fixing test code for updated traits * Adding gtest * review comments addressed * addressing review comments * remove c++20 code * added flush cache changes --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>	2025-09-29 12:46:37 -07:00
carlushuang	2e9428eb63	hot fix check eid range (#2924 ) * hot fix check eid range * fix clang format --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-09-29 09:38:38 -07:00
SamiAario-AMD	0f10e6d921	[CK_TILE] Fixing Type Conversions in PassThroughPack8 (#2769 ) * Change the return type of run_gemm_combinations in the basic tests * Change the return type of run_gemm_combinations in the universal tests * Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4 * Add universal GEMM test for fp8 x pk_i4 * Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4. * Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t> * Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t> * No need for utility in test_ck_tile_elementwise_1d * Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8 * Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant * For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union * Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced * Convert from float to bf16 during compilation rather than using magic values * Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8 * Comment out the basic test for fp16 x pk_i4 as it does not pass * Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t> * Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8 * Add basic and universal GEMM tests for bf8 x pk_i4 * Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now * Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now * Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp * Use explicit macros for enabling and disabling the the constexpr lookup based converters * Fix two failing tests * Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant * Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16 * On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t	2025-09-29 13:34:47 +03:00

475 Commits