Qianfeng Zhang
cea919aefb
Use 16x16x16 WarpGemm
2025-04-24 08:14:09 +00:00
Qianfeng Zhang
7848d15d39
Use __builtin_amdgcn_rcpf in the SiLU function
2025-04-24 06:28:16 +00:00
Qianfeng Zhang
aec19176d4
Combine minus with scale_s
2025-04-24 05:47:24 +00:00
Qianfeng Zhang
ce4665262b
Move the SiLU calculation into the gemm1 iteration and try to interleave gemm1 and SiLU
2025-04-24 04:49:58 +00:00
Qianfeng Zhang
2d2e1941a8
Update masking for the case where kMasking is false and kPadSeqLenK is true
2025-04-23 10:47:27 +00:00
Qianfeng Zhang
8dcde8d10f
Fix generate_instances.py and regenerate the instances
2025-04-23 10:30:40 +00:00
Qianfeng Zhang
022ed3fd8a
Go back to exp() instead of exp2(), since exp() in ck_tile uses the fast __builtin_amdgcn_exp2f()
2025-04-22 14:47:24 +00:00
Qianfeng Zhang
7316a44ff3
Update exp() in ck_tile/core/numeric/math.hpp to use __expf
2025-04-22 14:46:53 +00:00
Qianfeng Zhang
26db7e0b7c
Use kN0=64 to save VGPRs
2025-04-22 14:45:27 +00:00
Qianfeng Zhang
65ddb1a863
Fix the script name
2025-04-22 13:46:48 +00:00
Qianfeng Zhang
58ab5533a6
Fix GetTileRangeAlongX
2025-04-22 13:46:23 +00:00
Qianfeng Zhang
677fd60d10
Add script compare_with_triton_2.sh for measuring the jagged cases of seqlen 1024/2048/4096/8192/16384/32768
2025-04-22 13:45:48 +00:00
Qianfeng Zhang
2546e905ce
Change gemm0 to iterate along kN0 so that BlockGemm can overlap with masking and SiLU
2025-04-20 13:23:15 +00:00
Qianfeng Zhang
ee259a8924
Fix the GetTileRangeAlongX() to align with the hstu masking definition when both causal=true and local=true
2025-04-18 15:37:49 +00:00
Qianfeng Zhang
efc786f6a3
Remove unneeded __builtin_amdgcn_sched_barrier(0)
2025-04-18 10:05:57 +00:00
Qianfeng Zhang
88e54a8989
Use shared ring LDS buffers for K/V to avoid overlap between first-K/last-V or last-K/first-V
2025-04-18 09:47:43 +00:00
Qianfeng Zhang
f12a47218f
Minor code simplification in the pipeline
2025-04-18 08:22:11 +00:00
Qianfeng Zhang
ca1ae84fc6
Remove one line of __builtin_amdgcn_sched_barrier(0)
2025-04-17 14:21:14 +00:00
Qianfeng Zhang
b0ae27046f
Fix the integer overflow in total_flops calculation
2025-04-17 10:34:13 +00:00
Qianfeng Zhang
6086ead2f9
Add scripts for comparing with triton
2025-04-17 10:33:44 +00:00
Qianfeng Zhang
1351d9cd1b
Use exp2() to calculate exp() for better performance
2025-04-16 06:54:06 +00:00
Qianfeng Zhang
226a254723
Remove the comparison of row/col to max_uih_len in masking
2025-04-16 04:35:42 +00:00
Qianfeng Zhang
d1749b3aae
Use kM0=128 kN0=64 to completely remove VGPR spilling
2025-04-15 15:08:46 +00:00
Qianfeng Zhang
3cd1b13e46
Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save VGPRs for non-local situations
2025-04-15 14:40:55 +00:00
Qianfeng Zhang
cad1356170
Use packed cast_tile for fp16
2025-04-15 14:29:30 +00:00
Qianfeng Zhang
fff13b6c76
Update to partially reduce the register spilling
2025-04-15 07:44:33 +00:00
Qianfeng Zhang
c2e6ab8516
Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce calls to s_barrier()
2025-04-13 11:00:22 +00:00
Qianfeng Zhang
238e78d82e
Update the pipeline code
2025-04-13 09:43:58 +00:00
Qianfeng Zhang
53e567977e
Fix the total_flops calculation and update benchmark scripts
2025-04-13 08:50:00 +00:00
Qianfeng Zhang
71697d9cb9
Add output of estimated TFLOPS
2025-04-09 14:50:18 +00:00
Qianfeng Zhang
1766e6d3be
Update the scripts and error thresholds
2025-04-09 10:34:37 +00:00
Qianfeng Zhang
dd2cd2cbcb
Tune the input initialization to avoid overflow in SiLU
2025-04-09 10:03:32 +00:00
Qianfeng Zhang
86c0e45987
Add benchmark_hstu_attention.sh
2025-04-09 08:28:05 +00:00
Qianfeng Zhang
9cb2dca958
Add several verification test cases
2025-04-08 16:38:35 +00:00
Qianfeng Zhang
561d490990
Fix kernel and forward dispatch for jagged mode
2025-04-08 16:37:52 +00:00
Qianfeng Zhang
dc2f72a09f
Fix the hstu-attention pipeline (makes some test cases pass)
2025-04-08 15:53:08 +00:00
Qianfeng Zhang
dbcf38aae9
Fixes and updates
2025-04-07 15:29:23 +00:00
Qianfeng Zhang
10e72d3362
Changes in HstuBlockMasking and kernel/reference code for using masking
2025-04-03 14:46:12 +00:00
Qianfeng Zhang
733734553b
Fixes and changes in the example
2025-04-03 14:44:36 +00:00
Qianfeng Zhang
121a950df5
Add hstu attention kernel implementation, instances and interfaces (builds successfully)
2025-04-03 08:20:54 +00:00
Qianfeng Zhang
83f29243df
Fix the jagged-mode tensor access in reference_hstu_attention
2025-03-29 12:55:40 +00:00
Qianfeng Zhang
4a0fc292d0
Initial reference implementation of hstu attention
2025-03-28 16:26:43 +00:00
Rostyslav Geyyer
441343a23d
Add MX FP4 device conversion tests (#1889)
* Add conversion tests
* Fix ctor
* Fix nan logic
* Fix conversion logic
* Permute packed f4_t values
* Fix conversion to float, repack vector elements
* Fix device tests
* Permute elements in a vector
* Add a repro test
* Add a conversion for a repro test
* Update test vectors
* Update conversion
* Fix the test
* Update test vector generator
* Fix vector sr conversion
* Permute conversion args
* Update conversion
* Test
* Fix packing
* Simplify conversion function
* Pack conversion in a loop
* Pack conversion in a loop
* Pack another conversion in a loop
* Pack one more conversion in a loop
* Pack the last conversion in a loop
* Clean up
* Add printf to fix intrinsic
* Add a sw-based workaround
2025-03-26 19:23:01 -05:00
Illia Silin
23a949706c
Disable all pk_i4 tests for all targets except gfx942/950. (#2022)
* only build gemm_fp8_pk_i4 examples for gfx942/950
* fix cmake logic
* moved the architecture check to IsSupported function
* Revert "moved the architecture check to IsSupported function"
This reverts commit 056d2a08b3.
* disable all pk_i4 tests for targets other than gfx942/950
* fix cmake logic
2025-03-26 15:15:57 -07:00
Bartłomiej Kocot
54c81a1fcf
Add support for GKCYX grouped conv fwd (#2015)
* Add support for GKCYX grouped conv fwd
* fixes
* fix
* changelog
* Fixes
2025-03-26 21:13:38 +01:00
Illia Silin
fd915b83f7
Fix clang format (#2021)
2025-03-26 09:42:10 -07:00
Mirza Halilčević
21e0ca197d
Add default arguments for prologue and epilogue. (#2020)
2025-03-26 09:28:40 -07:00
Illia Silin
99b2bbc1d6
Make sure gemm_fp8_pk_i4 examples only build and run on gfx942/950. (#2010)
* only build gemm_fp8_pk_i4 examples for gfx942/950
* fix cmake logic
* moved the architecture check to IsSupported function
* Revert "moved the architecture check to IsSupported function"
This reverts commit 056d2a08b3.
2025-03-25 14:43:38 -07:00
Andriy Roshchenko
72d888821c
MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent
* Specialize `nnvb_data_t_selector` for E8M0 exponent
* Remove partial specializations for `scalar_type` of `non_native_vector_base` template
* Reword command line helper string
* Create MX GEMM examples for different scales
2025-03-25 15:33:03 -06:00
Illia Silin
44c093ba0c
Enable ClangBuildAnalyzer when doing ninja build traces. (#2009)
* enable ClangBuildAnalyzer when doing ninja traces
* add branch and date to clang build log name
* fix jenkins syntax
* fix jenkins syntax once more
* fix jenkins syntax once more
* simplify the clang_build log name
* simplify the clang_build log name further
2025-03-25 12:27:04 -07:00