composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-08 15:30:23 +00:00

Author	SHA1	Message	Date
John Shumway	67b2821623	Switch to C++20 standard for all CMake targets. (#2536 ) All our platforms support C++20 now, so update to C++20 standard for language features such as concepts, designated initializers, range-based for initializers, and consteval. This PR only switches the compiler flags to C++20, no other changes.	2025-07-22 10:52:10 -07:00
Cong Ma	f102eedfb3	[CK_TILE] Migrate CK Tile examples to Tests to autorun on CI (#2421 ) [CK_TILE] Add new ck tile unit test * Add new ck tile unit test smoke-gemm-universal * Add new ck tile unit test smoke-gemm-basic * Add new ck tile unit test topk_softmax * Add new ck tile unit test add_rmsnorm2d_rdquant_fwd	2025-07-22 08:15:18 -06:00
Rostyslav Geyyer	c9886109b4	Update packed fp4 layout (#2523 )	2025-07-21 16:58:59 -05:00
Emily Martins	1fa1c34b7e	Tests for CK tile Permute and MOE Sorting (#2417 ) * Convert ck-tile 06_permute smoke test to unit tests for fp16, fp8, and fp32 * Apply clang format and update copyright year * Convert ck tile moe sorting example smoke tests to unit tests * fix CMakelists to ensure that permute and moe_sorting are built for gfx9 only. * Remove number prefix from permute and moe_sorting directory names * code cleanup * add missing test cases for fp16 permute * remove unnecessary parentheses * Cleanup * Remove unnecessary final nullptr * update copyright and licensing statement in files * Add custom target for permute tests * Add missing new line at end of file for moe sorting CMakelist. * Update MOE sorting tests to account for MOE sorting example updates The ck_tile/13_moe_sorting example was updated to include different cases depending on whether MOE_SORTING_FMOE_2D_BUF is set. So, the ck_tile tests for MOE sorting were updated to account for these changes. --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-07-21 11:20:28 -07:00
Aviral Goel	84a7600bdc	fix(cmake-dev): cmake dev script works with non bash shells (#2530 )	2025-07-19 23:15:50 -07:00
Emily Martins	20306db651	Tests for CK Tile Flatmm and MOE Smoothquant (#2458 ) * CK tile tests for flatmm using example * MOE smoothquant draft tests * fix create_arg default index to zero for MOE smoothquant * revert MOE smoothquant changes * code clean up * Add back MOE smoothquant changes * Add MOE smoothquant cases for different precisions and update cmake * clean up comments * Update flatmm cmake * revert change made to moe_smoothquant smoke_test.sh EXE path * remove unnecessary comment in MOE smoothquant cmakelist * comment out adding moe_smoothquant subdirectory for now due to bugs with GPU core dump issue on gfx942 and gfx90a * Clean up run_test_case function in MOE smoothquant tests * update copyright and licensing on files * Remove flatmm test dir since tests should be done as weighted preshuffle gemm * Add flatmm smoke test cases to weighted preshuffle gemm gtests * remove blank line from CMakeLists --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-19 23:13:36 -07:00
Illia Silin	ead17e6265	disable building CI for gfx942 by default (#2529 )	2025-07-18 12:25:24 -07:00
Mingtao Gu	0198257d79	[CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512 ) * added MPerBlock=32 for MXFP4 GEMM decode * added two instance for M>128 scenario. * added 1 instance * format --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-07-18 14:35:54 +08:00
Yi DING	f0a8c18017	[CK_TILE] Fix tile_example_moe_sorting broke in #2436 (#2525 )	2025-07-17 22:50:58 -07:00
Linjun-AMD	095393276a	h_dim256 fmha use async_qr pipeline (#2510 )	2025-07-18 09:59:38 +08:00
Thrupti Raj Lakshmana Gowda	0f3083ab5c	[CKTILE] Layout Support for CK Tile engine (#2482 ) * Updating runtime log message for CK TILE ENGINE * CKTile layout from config * CKTile custom config for CI * Documentation for Layout Changes * CKTile Layout changes to Jenkins * Fixing Clang Format * Changes to Jenkins file to fix error * fix(cmake-ck-dev): no longer sets invalid values as gpu arch * style(py files): ruff formatting * fix(cmake-ck-release): no longer sets invalid values as gpu arch * chore(cmake-tile_engine): add reminder to uncomment user config json * Changes to jenkin file to address more cases * Changes to Jenkins to fix Error * Changes to Jenkins file for fixing an error * Update Jenkinsfile (#2517) * Update Jenkinsfile --------- Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-17 12:19:41 -07:00
Emily Martins	c08986b026	Tests for CK Tile Batched Transpose and Smoothquant (#2453 ) * Create tests for ck tile batched transpose using example * Create ck tile tests for smoothquant using examples * fix precision input strings and convert batched transpose to regression tests * Code cleanup and fix asserts * add missing licenses * update copyright and licensing in files * Update smoothquant tests to use example's smoothquant.cpp * Add custom target for batched transpose tests * Add missing new lines at end of files for CMakelists * fix typo in batched transpose CMakeList target_compile_options --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com>	2025-07-17 09:53:34 -06:00
Mateusz Ozga	7fc000d7b3	Fix CI clang-format (#2521 )	2025-07-17 14:41:29 +02:00
slippedJim	05b65d0c7c	update (#2519 )	2025-07-17 15:24:19 +08:00
Haocong WANG	28072adc3a	fix mfma32x32 dispatch (#2490 )	2025-07-17 15:24:12 +08:00
Yi DING	f1d8ad2818	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>	2025-07-16 23:56:22 -07:00
Khushbu Agarwal	579bd73435	Fixing numerical error, and interchange preshuffle configs to match with flatmm (#2515 )	2025-07-16 22:33:03 -07:00
Po Yen Chen	722c22fb15	Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389 )" (#2514 ) This reverts commit `b2dea90116`.	2025-07-17 10:09:01 +08:00
linqunAMD	fbd9f32abe	[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459 ) 1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com>	2025-07-17 08:19:57 +08:00
linqunAMD	6e76b82059	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-16 07:58:23 -07:00
Illia Silin	a4bf78ac0e	replace obsolete warpSize system variable with the new one (#2496 )	2025-07-16 07:39:15 -07:00
Illia Silin	f5d1e3fa48	Use a clang20 compiler for gfx950 builds. (#2504 ) * update docker tag for gfx950 ci build * update compiler path for gfx950 ci build * suppress compiler path override for gfx950 * clean up	2025-07-16 07:37:53 -07:00
huaiguxu	c1badfd30c	Handle moe_fp8 no-mainloop cases. Suppress no-mainloop check (#2438 ) Co-authored-by: felix <felix.li@amd.com>	2025-07-16 15:44:34 +08:00
MHYangAMD	3499fe67ff	[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409 ) * Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass * Update rmsnorm2d_fwd_pipeline_model_sensitive_pass 1. Add BlockReduce2dTreeCrossWarpSync * Add Rmsnorm2dFusedModelSensitiveEnum * Update patch 1. Reverse generate.py 2. Remove comment in generate.py 3. Update tree cross warp reduce * Refactor RMSNorm model enum and introduce T5-like option * Update the n stage for cross warp reduce * Add new cmdline option in RMSNorm for new pipeline testing --------- Co-authored-by: Clement Lin <clement.lin@amd.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>	2025-07-16 14:05:26 +08:00
rahjain-amd	6b09f0823e	add missing condition for bf16 (#2502 ) Without this DataType = unknown - ``` sh Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s, ``` after this change ```sh Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s, ```	2025-07-15 21:25:56 +05:30
carlushuang	cfe211cc60	[CK_TILE] moe sorting optimize local_token (#2469 ) * fix bug in loops that need use local tokens to compute * support extra chain local_token * update * update * refine some main * update * support dispatch_policy * fix 15 example	2025-07-15 09:42:18 +08:00
Gino Lu	141bf2d54d	[CK_TILE] Add pk_fp4 data type (#2422 ) * [draft] Add pk_fp4 and test * Add hw conversion for fp4 * Refine test code and pk_fp4 constructor. * fix test indent * modify according to comment. * fix clang-format * modify according comments. --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-14 20:35:06 +08:00
Andriy Roshchenko	25b359d630	MX GEMM - Add FP6 GEMM Test (#2488 ) * Add F6 GEMM MX Test * Add BF6 GEMM MX Test	2025-07-11 15:32:12 -06:00
Andriy Roshchenko	518dc21ae8	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8	2025-07-11 13:07:05 -06:00
Khushbu Agarwal	d239b91fd5	Merge flatmm Operator with universal gemm (#2434 ) * Initial commit * Adding new tile partitioner to flatmm * intermediate changes * debugging kernels * Updating flatmm example to universal gemm example * updated flatmm kernel to run via gemmKernel * update universal gemm to incorporate flatmm * debug * Fix flatmm call * Fixing other kernels and tests for API changes * clang formatted * fixing gemm tests * added test for flatmm and simplify kernel arguments * adding flatmm test * fix test for flatmm * simplify gemm kernel with flatmm * remove flatmm related files * addressing review comments and code clean up * resolving empty file * resolving empty file * clang formatted * addressing review comments * enable persistent kernel for flatmm * reverted the removed files for flatmm * reverted the removed files for flatmm * changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example * some more renames * clang formatted	2025-07-11 08:27:55 -07:00
Qianfeng	45904b8fd7	Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487 ) * Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline * i_nhead_ conversion type to prevent overflow --------- Co-authored-by: ltqin <letaoqin@amd.com>	2025-07-11 18:14:47 +08:00
Aviral Goel	a26ba690fd	fix(precommit_install): fix bug for bare metal machines (#2448 ) Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-10 11:00:47 -06:00
Andres Lugo	aadeffde18	Update FMHA recipe for Pytorch SDPA integration (#2480 ) * Add receipts in splitk and appendk * remove grouped * Remove logits --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-10 09:00:23 -07:00
Illia Silin	1b66f3f4a3	Add declarations for atomic add for fp16 and unsigned short. (#2483 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short * revert back to atomic add using casting	2025-07-10 07:18:56 -07:00
Illia Silin	d9b37c7121	Fix blockscale fp8 gemm examples (#2476 ) * fix blockscale fp8 gemm examples * refactor the compiler flags * fix hip version calculation	2025-07-10 07:12:13 -07:00
shay-li77	d814fefe18	support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338 ) * mask support ratio for y axis * format code * add notes for param y_ratio * fix comments error * support template and mdiv for ratio mask * refactor y-ratio mask constructor * optimize coordinate calculation * add SimplifiedRatioAttentionMask	2025-07-09 23:18:55 +08:00
Yi DING	032ca60015	[CK_TILE] Avoid compile kernel in host pass (#2475 )	2025-07-09 22:27:54 +08:00
Po Yen Chen	ad9863fe05	[CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402 ) * Wrap tile size mapping as class method * Warp pipeline generating as class method * Add constraint as kernel dispatching criteria * Support multiple tile size for a (hdim, hdim_v) combination * Use smaller tile size if CU utilization is low * Use integer as the key of the tile size map * Fix type error * Simply override parent class method return value * Add attribute to eliminate warning * Allow using environment variables to turn on/off custom factory * Unify param naming style * Add missing HIP runtime include directive * Fix os.environ.get() usage	2025-07-09 22:01:33 +08:00
Vidyasagar Ananthan	e391b025a0	New ninja tracing script (#2472 ) * Adding ninja log json conversion utility * Updating to match old ninjatracing * Updating Jenkins to use new ninjatracing * Ensuring v7 works * Removing old ninjatracing from dockerfile	2025-07-08 22:36:50 -07:00
Illia Silin	93420ecf89	Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474 ) This reverts commit `112b47e885`.	2025-07-08 19:01:26 -07:00
Illia Silin	112b47e885	Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471 ) * add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short	2025-07-08 18:09:30 -04:00
Vidyasagar Ananthan	33d704a6f9	Separating ninja build tracing and setting flag to false (#2470 ) * Separating ninja build tracing and setting flag to false * Add ftime-tracing flag * Fix conditional issue * Try adding a script block * Embed Clang analysis in ftime trace block	2025-07-08 10:52:00 -07:00
Haocong WANG	5557eadce6	[CK TILE] Fix FA build filter (#2369 ) * Fix for fwd/bwd kernel build filter * fix bwd code * cmake depends & bwd filter order fix * revert unexpected reformat * Avoid change fmha bwd filter order for downstream compatibility * Revert unexpected changes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-07-08 10:42:07 +08:00
Illia Silin	e033a1b4bf	fix compilation errors with clang20 (#2464 )	2025-07-07 19:40:30 -07:00
Po Yen Chen	b2dea90116	Eliminate warning caused by failed to meet occupancy requirement (#2389 ) Co-authored-by: felix <felix.li@amd.com>	2025-07-08 09:17:25 +08:00
Thomas Ning	f240ae3248	Enable Async Copy for MI355 (#2425 ) * add for async load builtin * add async load api * fix some compiling errors * fix a compiling error * fix some compiling errors * add a pipeline which copies from v4 * add a new pipeline for async load * fix some compiling errors * add async load tests * fix some issues in async load * fix * fix async inline assembly * fix async inline assembly * add ignore header file * comment some not gfx950 codes * comment some not gfx950 codes * fix a error * update async load apis * fix lds descriptor * fix a compiling error * fix some compiling errors * fix a descriptor issue * update lds descriptor * change async pipeline's tile distribution pattern from thread to warp * fix clang format * update async policy * fix a CRTP issue * fix a typo error * change lds layout * fix some sync issues * improve codes * delete the async test * fix a commented format issue * avoid compiling device functions when compile host * make gemm run * add the copy kernel support * finish the feature * Address comment * add the support for buffer_builtin * solved the merging problem * Comment Addressed --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: joyeamd <John.Ye@amd.com>	2025-07-07 10:08:49 -07:00
Andriy Roshchenko	054f85ab7c	MX GEMM - FP6 Example (#2419 ) Adds support for MX FP6 data type in MX GEMM block pipeline version v1. Provides an example of MX FP6 GEMM algorithm. --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com>	2025-07-07 10:33:26 -06:00
dependabot[bot]	bfe573d3ba	Bump sphinxcontrib-bibtex from 2.6.4 to 2.6.5 in /docs/sphinx (#2424 ) --- updated-dependencies: - dependency-name: sphinxcontrib-bibtex dependency-version: 2.6.5 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-07-07 07:30:49 -07:00
spolifroni-amd	096bf2de41	updating the doxyfile and the index.rst so that it gets the full API (#2416 ) * updating the doxyfile and the index.rst so that it gets the full API * added recommended doxygen values	2025-07-07 07:29:36 -07:00
rahjain-amd	ad593c286f	Fixing Debug build (#2404 ) Failed to build `tile_example_fmha_bwd` due to below error ``` /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare] 358 | assert(slopes.size() == nhead); | ~~~~~~~~~~~~~ ^ ~~~~~ /usr/include/assert.h:103:27: note: expanded from macro 'assert' 103 | (static_cast <bool> (expr) \ | ^~~~ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:989:16: note: in instantiation of function template specialization 'run<FmhaBwdFp16>' requested here 989 | return run<FmhaBwdFp16>(arg_parser) ? 0 : -2; | ^ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare] 358 | assert(slopes.size() == nhead); | ~~~~~~~~~~~~~ ^ ~~~~~ /usr/include/assert.h:103:27: note: expanded from macro 'assert' 103 | (static_cast <bool> (expr) \ | ^~~~ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:993:16: note: in instantiation of function template specialization 'run<FmhaBwdBf16>' requested here 993 | return run<FmhaBwdBf16>(arg_parser) ? 0 : -2; | ^ 2 errors generated when compiling for gfx942. ``` Fixed with proper cast	2025-07-07 14:46:22 +05:30


2108 Commits