composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 01:10:17 +00:00

Author	SHA1	Message	Date
aska-0096	1b468bac0b	tempsave, trload+asyncload done	2025-07-21 05:55:55 +00:00
aska-0096	afd96d8180	compile pass	2025-07-18 10:04:34 +00:00
aska-0096	5616551115	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-07-18 05:17:27 +00:00
aska-0096	ae39c84f55	tempsave	2025-07-18 05:16:39 +00:00
Linjun-AMD	095393276a	h_dim256 fmha use async_qr pipeline (#2510 )	2025-07-18 09:59:38 +08:00
Thrupti Raj Lakshmana Gowda	0f3083ab5c	[CKTILE] Layout Support for CK Tile engine (#2482 ) * Updating runtime log message for CK TILE ENGINE * CKTile layout from config * CKTile custom config for CI * Documentation for Layout Changes * CKTile Layout changes to Jenkins * Fixing Clang Format * Changes to Jenkins file to fix error * fix(cmake-ck-dev): no longer sets invalid values as gpu arch * style(py files): ruff formatting * fix(cmake-ck-release): no longer sets invalid values as gpu arch * chore(cmake-tile_engine): add reminder to uncomment user config json * Changes to jenkin file to address more cases * Changes to Jenkins to fix Error * Changes to Jenkins file for fixing an error * Update Jenkinsfile (#2517) * Update Jenkinsfile --------- Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-17 12:19:41 -07:00
Emily Martins	c08986b026	Tests for CK Tile Batched Transpose and Smoothquant (#2453 ) * Create tests for ck tile batched transpose using example * Create ck tile tests for smoothquant using examples * fix precision input strings and convert batched transpose to regression tests * Code cleanup and fix asserts * add missing licenses * update copyright and licensing in files * Update smoothquant tests to use example's smoothquant.cpp * Add custom target for batched transpose tests * Add missing new lines at end of files for CMakelists * fix typo in batched transpose CMakeList target_compile_options --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com>	2025-07-17 09:53:34 -06:00
Mateusz Ozga	7fc000d7b3	Fix CI clang-format (#2521 )	2025-07-17 14:41:29 +02:00
aska-0096	94b6430489	temp save	2025-07-17 10:06:09 +00:00
aska-0096	7e330553dc	Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline	2025-07-17 07:24:32 +00:00
slippedJim	05b65d0c7c	update (#2519 )	2025-07-17 15:24:19 +08:00
Haocong WANG	28072adc3a	fix mfma32x32 dispatch (#2490 )	2025-07-17 15:24:12 +08:00
Yi DING	f1d8ad2818	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>	2025-07-16 23:56:22 -07:00
Khushbu Agarwal	579bd73435	Fixing numerical error, and interchange preshuffle configs to match with flatmm (#2515 )	2025-07-16 22:33:03 -07:00
aska-0096	804f77dce5	move test_copy into test	2025-07-17 03:10:46 +00:00
aska-0096	21627d7ca7	remove unnecessary output	2025-07-17 02:41:31 +00:00
aska-0096	287792c44a	Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix	2025-07-17 02:26:13 +00:00
aska-0096	a4221db304	add input validation and bug fix	2025-07-17 02:26:10 +00:00
Po Yen Chen	722c22fb15	Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389 )" (#2514 ) This reverts commit `b2dea90116`.	2025-07-17 10:09:01 +08:00
linqunAMD	fbd9f32abe	[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459 ) 1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com>	2025-07-17 08:19:57 +08:00
Max Podkorytov	21fd7e9538	Merge branch 'develop' into test_copy_fix	2025-07-16 11:23:57 -07:00
linqunAMD	6e76b82059	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-16 07:58:23 -07:00
Illia Silin	a4bf78ac0e	replace obsolete warpSize system variable with the new one (#2496 )	2025-07-16 07:39:15 -07:00
Illia Silin	f5d1e3fa48	Use a clang20 compiler for gfx950 builds. (#2504 ) * update docker tag for gfx950 ci build * update compiler path for gfx950 ci build * suppress compiler path override for gfx950 * clean up	2025-07-16 07:37:53 -07:00
aska-0096	d6df7bf851	fix vmcnt shift	2025-07-16 08:55:50 +00:00
aska-0096	40e039e4e4	Improve s_waitcnt_imm calculation	2025-07-16 08:37:07 +00:00
huaiguxu	c1badfd30c	Handle moe_fp8 no-mainloop cases. Supprese no-mainloop check (#2438 ) Co-authored-by: felix <felix.li@amd.com>	2025-07-16 15:44:34 +08:00
MHYangAMD	3499fe67ff	[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409 ) * Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass * Update rmsnorm2d_fwd_pipeline_model_sensitive_pass 1. Add BlockReduce2dTreeCrossWarpSync * Add Rmsnorm2dFusedModelSensitiveEnum * Update patch 1. Reverse generate.py 2. Remove comment in generate.py 3. Update tree cross warp reduce * Refactor RMSNorm model enum and introduce T5-like option * Update the n stage for cross warp reduce * Add new cmdline option in RMSNorm for new pipeline testing --------- Co-authored-by: Clement Lin <clement.lin@amd.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>	2025-07-16 14:05:26 +08:00
aska-0096	c30f8b709b	fix the s_waitcnt_imm calculation	2025-07-16 05:39:50 +00:00
aska-0096	ec0a45b29f	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix	2025-07-16 03:57:57 +00:00
aska-0096	e5cc4af808	Add block_sync_lds_direct_load utility	2025-07-16 03:54:33 +00:00
rahjain-amd	6b09f0823e	add missing condition for bf16 (#2502) Without this change, DataType = unknown: `Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s,` After this change: `Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s,`	2025-07-15 21:25:56 +05:30
aska-0096	eea58629cf	fix async copytest bug	2025-07-15 09:39:03 +00:00
carlushuang	cfe211cc60	[CK_TILE] moe sorting optimize local_token (#2469 ) * fix bug in loops that need use local tokens to compute * support extra chain local_token * update * update * refine some main * update * support dispatch_policy * fix 15 example	2025-07-15 09:42:18 +08:00
Gino Lu	141bf2d54d	[CK_TILE] Add pk_fp4 data type (#2422 ) * [draft] Add pk_fp4 and test * Add hw conversion for fp4 * Refine test code and pk_fp4 constructor. * fix test indent * modify according to comment. * fix clang-format * modify according comments. --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-14 20:35:06 +08:00
Andriy Roshchenko	25b359d630	MX GEMM - Add FP6 GEMM Test (#2488 ) * Add F6 GEMM MX Test * Add BF6 GEMM MX Test	2025-07-11 15:32:12 -06:00
Andriy Roshchenko	518dc21ae8	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8	2025-07-11 13:07:05 -06:00
Khushbu Agarwal	d239b91fd5	Merge flatmm Operator with universal gemm (#2434 ) * Initial commit * Adding new tile partitioner to flatmm * intermediate changes * debugging kernels * Updating flatmm example to universal gemm example * updated flatmm kernel to run via gemmKernel * update universal gemm to incorporate flatmm * debug * Fix flatmm call * Fixing other kernels and tests for API changes * clang formatted * fixing gemm tests * added test for flatmm and simplify kernel arguments * adding flatmm test * fix test for flatmm * simplify gemm kernel with flatmm * remove flatmm related files * addressing review comments and code clean up * resolving empty file * resolving empty file * clang formatted * addressing review comments * enable persistent kernel for flatmm * reverted the removed files for flatmm * reverted the removed files for flatmm * changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example * some more renames * clang formatted	2025-07-11 08:27:55 -07:00
Qianfeng	45904b8fd7	Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487 ) * Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline * i_nhead_ conversion type to prevent overflow --------- Co-authored-by: ltqin <letaoqin@amd.com>	2025-07-11 18:14:47 +08:00
Aviral Goel	a26ba690fd	fix(precommit_install): fix bug for bare metal machines (#2448 ) Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-10 11:00:47 -06:00
Andres Lugo	aadeffde18	Update FMHA recipe for Pytorch SDPA integration (#2480 ) * Add receipts in splitk and appendk * remove grouped * Remove logits --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-10 09:00:23 -07:00
Illia Silin	1b66f3f4a3	Add declarations for atomic add for fp16 and unsigned short. (#2483 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short * revrt back to atomic add using casting	2025-07-10 07:18:56 -07:00
Illia Silin	d9b37c7121	Fix blockscale fp8 gemm examples (#2476 ) * fix blockscale fp8 gemm examples * refactor the compiler flags * fix hip version calculation	2025-07-10 07:12:13 -07:00
aska-0096	18669925cc	temp save, change all instance to 1wave	2025-07-10 04:29:33 +00:00
shay-li77	d814fefe18	support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338 ) * mask support ratio for y axis * format code * add notes for param y_ratio * fix comments error * support template and mdiv for ratio mask * refactor y-ratio mask constructor * optimize coordinate calculation * add SimplifiedRatioAttentionMask	2025-07-09 23:18:55 +08:00
Yi DING	032ca60015	[CK_TILE] Avoid compile kernel in host pass (#2475 )	2025-07-09 22:27:54 +08:00
Po Yen Chen	ad9863fe05	[CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402 ) * Wrap tile size mapping as class method * Warp pipeline generating as class method * Add constraint as kernel dispatching criteria * Support mutltiple tile size for a (hdim, hdim_v) combination * Use smaller tile size if CU utilization is low * Use integar as the key of the tile size map * Fix type error * Simply override parent class method return value * Add attribute to eliminate warnging * Allow using environment variables to turn on/off custom factory * Unify param naming style * Add missing HIP runtime include directive * Fix os.environ.get() usage	2025-07-09 22:01:33 +08:00
Vidyasagar Ananthan	e391b025a0	New ninja tracing script (#2472 ) * Adding ninja log json convertion utility * Updating to match old ninjatracing * Updating Jenkins to use new ninjatracing * Ensuring v7 works * Removing old ninjatracing from dockerfile	2025-07-08 22:36:50 -07:00
Illia Silin	93420ecf89	Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474 ) This reverts commit `112b47e885`.	2025-07-08 19:01:26 -07:00
Illia Silin	112b47e885	Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short	2025-07-08 18:09:30 -04:00

2122 Commits