composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-12 02:05:50 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	8dcde8d10f	Fix in generate_instances.py and re-generated the instances	2025-04-23 10:30:40 +00:00
Qianfeng Zhang	022ed3fd8a	Back to use exp() instead of exp2() since exp() in ck_tile using fast __builtin_amdgcn_exp2f()	2025-04-22 14:47:24 +00:00
Qianfeng Zhang	26db7e0b7c	Use kN0=64 to save vgprs	2025-04-22 14:45:27 +00:00
Qianfeng Zhang	65ddb1a863	Fix the script name	2025-04-22 13:46:48 +00:00
Qianfeng Zhang	58ab5533a6	Fix in GetTileRangeAlongX	2025-04-22 13:46:23 +00:00
Qianfeng Zhang	677fd60d10	Add script compare_with_triton_2.sh for measuring the jagged cases of seqlen 1024/2048/4096/8192/16384/32768	2025-04-22 13:45:48 +00:00
Qianfeng Zhang	2546e905ce	Change gemm0 to iterate along kN0 so that BlockGemm can overlap with maksing and siLu	2025-04-20 13:23:15 +00:00
Qianfeng Zhang	ee259a8924	Fix the GetTileRangeAlongX() to align with the hstu masking definition when both causal=true and local=true	2025-04-18 15:37:49 +00:00
Qianfeng Zhang	efc786f6a3	Remove un-needed __builtin_amdgcn_sched_barrier(0)	2025-04-18 10:05:57 +00:00
Qianfeng Zhang	88e54a8989	Use shared ring Lds buffers for K/V to avoid over-lapping between first-K/last-V or last-K/first-V	2025-04-18 09:47:43 +00:00
Qianfeng Zhang	f12a47218f	Tiny codes simplification in pipeline	2025-04-18 08:22:11 +00:00
Qianfeng Zhang	ca1ae84fc6	Remove one line of __builtin_amdgcn_sched_barrier(0)	2025-04-17 14:21:14 +00:00
Qianfeng Zhang	b0ae27046f	Fix the integer overflow in total_flops calculation	2025-04-17 10:34:13 +00:00
Qianfeng Zhang	6086ead2f9	Add scripts for comparing with triton	2025-04-17 10:33:44 +00:00
Qianfeng Zhang	1351d9cd1b	Use exp2() to calculate exp() for better performance	2025-04-16 06:54:06 +00:00
Qianfeng Zhang	226a254723	Remove the comparing of row/col to max_uih_len in masking	2025-04-16 04:35:42 +00:00
Qianfeng Zhang	d1749b3aae	Use kM0=128 kN0=64 to completely remove the vgprs spilling	2025-04-15 15:08:46 +00:00
Qianfeng Zhang	3cd1b13e46	Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations	2025-04-15 14:40:55 +00:00
Qianfeng Zhang	cad1356170	Use packed cast_tile for fp16	2025-04-15 14:29:30 +00:00
Qianfeng Zhang	fff13b6c76	Update to partially reduce the register spilling	2025-04-15 07:44:33 +00:00
Qianfeng Zhang	c2e6ab8516	Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce call of s_barrier()	2025-04-13 11:00:22 +00:00
Qianfeng Zhang	238e78d82e	Update the in pipeline codes	2025-04-13 09:43:58 +00:00
Qianfeng Zhang	53e567977e	Fix in calculation of total_flops and update benchmark scripts	2025-04-13 08:50:00 +00:00
Qianfeng Zhang	71697d9cb9	Add output of estimated TFLOPS	2025-04-09 14:50:18 +00:00
Qianfeng Zhang	1766e6d3be	Update to the scripts and error thresholds	2025-04-09 10:34:37 +00:00
Qianfeng Zhang	dd2cd2cbcb	Tune the input initialization to avoid over-flow in silu	2025-04-09 10:03:32 +00:00
Qianfeng Zhang	86c0e45987	Add benchmark_hstu_attention.sh	2025-04-09 08:28:05 +00:00
Qianfeng Zhang	9cb2dca958	Add several verification test cases	2025-04-08 16:38:35 +00:00
Qianfeng Zhang	561d490990	Fix in kernel and forward dispatch for jagged mode	2025-04-08 16:37:52 +00:00
Qianfeng Zhang	dc2f72a09f	Fix in hstu-attention pipeline (which makes some testing cases passed)	2025-04-08 15:53:08 +00:00
Qianfeng Zhang	dbcf38aae9	Fixes and updates	2025-04-07 15:29:23 +00:00
Qianfeng Zhang	10e72d3362	Change in HstBlockMasking and kernel/reference codes for using masking	2025-04-03 14:46:12 +00:00
Qianfeng Zhang	733734553b	Fix and change in example	2025-04-03 14:44:36 +00:00
Qianfeng Zhang	121a950df5	Add hstu attention kernel implementation, instances and interfaces (building succeeded)	2025-04-03 08:20:54 +00:00
Qianfeng Zhang	83f29243df	fix the jagged mode tensor access in reference_hstu_attention	2025-03-29 12:55:40 +00:00
Qianfeng Zhang	4a0fc292d0	Initial reference implementation of hstu attention	2025-03-28 16:26:43 +00:00
Illia Silin	23a949706c	Disable all pk_i4 tests for all targets except gfx942/950. (#2022 ) * only build gemm_fp8_pk_i4 examples for gfx942/950 * fix cmake logic * moved the architecture check to IsSupported function * Revert "moved the architecture check to IsSupported function" This reverts commit `056d2a08b3`. * disable all pk_i4 tests for targets other than gfx942/950 * fix cmake logic	2025-03-26 15:15:57 -07:00
Illia Silin	99b2bbc1d6	Make sure gemm_fp8_pk_i4 examples only build and run on gfx942/950. (#2010 ) * only build gemm_fp8_pk_i4 examples for gfx942/950 * fix cmake logic * moved the architecture check to IsSupported function * Revert "moved the architecture check to IsSupported function" This reverts commit `056d2a08b3`.	2025-03-25 14:43:38 -07:00
Andriy Roshchenko	72d888821c	MX GEMM examples with FP8, FP16, and E8M0 scales (#2016 ) * Add `scalar_type` specification for E8M0 exponent * Specialize `nnvb_data_t_selector` for E8M0 exponent * Remove partial specializations for `scalar_type` of `non_native_vector_base` template * Reword command line helper string * Create MX GEMM examples for different scales	2025-03-25 15:33:03 -06:00
ruanjm	d49abdaa87	[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861 ) * 50ms -> 28ms * Fix bug in non fuse_add_store cases * Fine tuned setting for 2 pass pipeline * adjust workload * remove unnecessary change * add layernorm * Adding output quant and unquant results at the same time. * fix test * fix format * tune for cases 128x640 and 128x1024 * bug ifx	2025-03-25 20:09:45 +08:00
Andriy Roshchenko	6660dc6b8e	Introduce MX GEMM for FP8 data type (#2000 )	2025-03-24 15:41:07 -06:00
carlushuang	6c08c5c46d	add mask support in hdim=192/128 (#1999 )	2025-03-21 18:28:43 +08:00
BingYuan.Zhou	5a0d693b86	fix ck_tile/basic_gemm build error (#1988 )	2025-03-20 22:01:14 -07:00
felix	902dbe89ad	change cmake (#2006 ) Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-20 19:25:11 -07:00
carlushuang	e3c9886cdf	[CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982 ) * return value with macro and revert the return value * [CK-TILE] no-macro launch api solution (#1992) * no-macro solution * address -Wcomma --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-03-20 11:00:29 -07:00
jakpiase	0e91d32c61	[CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919 ) * switch to universal gemm for batched and grouped gemms * added reviewer comments * fixed grouped gemm tests	2025-03-20 11:17:04 +01:00
rocking	b819c217e4	Sync the kname with instance name (#1989 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-03-20 00:06:45 +08:00
Illia Silin	1342ecf7fb	Add a daily CI build on gfx908. (#1987 ) * add one daily ci build on gfx908 * add redis invocation tag for gfx908 * make ci build for gfx908 conditional * fix groovy logic * add option to run perf tests for gfx908 * disable a few tests on mi100	2025-03-17 18:08:53 -07:00
aledudek	5095906975	Async grouped gemm v3 (#1940 ) * Fully async grouped gemm * Remove commented code * Remvoe maybe_unused * host kernel args * Checkpoint segfault debugging... * Working part1 * Working part2 * Remvoe comments... * Use void ptr for gemm kernel host args * Fix device_grouped_gemm_multiple_d_dl build issue * Fix device_grouped_gemm_xdl build issue	2025-03-17 16:42:43 +01:00
valarLip	52b1cd7780	hotfix fmoe build issue (#1976 )	2025-03-13 15:11:59 +08:00

539 Commits