Commit Graph

2153 Commits

Jinchao Xu
d2525548c6 Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611)
Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances
by splitting debug information into separate .dwo files.

Add explicit cast to avoid signed/unsigned comparison warning.
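
The `-gsplit-dwarf` flag moves debug info into separate `.dwo` files so the `.debug` sections in the objects stay small enough for 32-bit relocations. The cast mentioned above is the usual `-Wsign-compare` fix; a minimal sketch with hypothetical names (the actual change is in the grouped conv2d instances):

```cpp
#include <vector>

// Comparing a signed loop index against an unsigned container size triggers
// -Wsign-compare; an explicit cast makes the intended comparison unambiguous.
bool all_positive(const std::vector<int>& v)
{
    for(int i = 0; i < static_cast<int>(v.size()); ++i)
    {
        if(v[i] <= 0)
            return false;
    }
    return true;
}
```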

[ROCm/composable_kernel commit: 15eb493152]
2025-08-04 11:26:08 -07:00
Bartłomiej Kocot
c903850b65 Mark non-grouped convolutions instances as deprecated (#2595)
* Mark non-grouped convolutions instances as deprecated

* Update CHANGELOG.md

Co-authored-by: John Afaganis <john.afaganis@amd.com>

* Update library/src/tensor_operation_instance/gpu/conv1d_bwd_data/device_conv1d_bwd_data_xdl_nwc_kxc_nwk_bf16_instance.cpp

Co-authored-by: John Afaganis <john.afaganis@amd.com>

---------

Co-authored-by: John Afaganis <john.afaganis@amd.com>
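
A hedged sketch of how an instance factory can be flagged as deprecated in standard C++ (the names are illustrative; the exact mechanism used in #2595 may differ):

```cpp
#include <vector>

struct DeviceConvInstance; // placeholder for an instance base type

// [[deprecated]] emits a compiler warning at every call site, steering users
// toward the grouped-convolution replacements without breaking existing builds.
[[deprecated("Non-grouped convolution instances are deprecated; "
             "use the grouped convolution instances instead.")]]
void add_device_conv1d_bwd_data_instances(std::vector<DeviceConvInstance*>& instances);
```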

[ROCm/composable_kernel commit: 8655ba989c]
2025-08-04 16:49:55 +02:00
Max Podkorytov
b13f01345d remove std::format (#2604)
[ROCm/composable_kernel commit: 0d9439760f]
2025-08-01 19:22:07 -07:00
Illia Silin
8703165b46 remove std=c++17 compiler flag (#2603)
[ROCm/composable_kernel commit: b786d12e56]
2025-08-01 16:18:16 -07:00
Max Podkorytov
dd4a259904 [CK-tile] remove old ck-tile transpose test (#2591)
* remove old ck-tile transpose test

* rename test exe for consistency

* replace batched transpose regression test

[ROCm/composable_kernel commit: f36cb5b2aa]
2025-08-01 14:50:09 -07:00
Thomas Ning
aa2f9b4c73 Reduce build time tile engine (#2579)
* Modify CMakeLists to allow for splitting.

* Modify CMakeLists for data and layout logic.

* Run tests and get build artifact.

* Test new CMakeLists for speedup.

* Further improvements for speedup.

* turn off the FMHA

* turn off the automatic tile engine gemm

* minor fix

* disable the transpose test first

* Address the comment

* Jenkinsfile

* change the make thread count to 64

* change the compile thread count to 32

* Try to use less OS memory

* Set the Unity build batch size to 2

* reduce the chunk size

---------

Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>

[ROCm/composable_kernel commit: e5b79b26fa]
2025-08-01 14:42:33 -07:00
Illia Silin
6fc0a709dc update the switch condition for buffer built-ins (#2602)
[ROCm/composable_kernel commit: 788e8a878e]
2025-08-01 14:30:07 -07:00
Thomas Ning
9325ae50e8 Fix the GFX 950 Universal GEMM (#2597)
* solve the gfx950 error

* clang format

* fix a typo error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: 7c44a763fa]
2025-08-01 09:32:24 -07:00
Illia Silin
6c5d1d39b8 Add a daily CI stage to test AITER with latest CK. (#2598)
* add a CI stage for AITER testing

[ROCm/composable_kernel commit: e6104daecc]
2025-08-01 07:55:51 -07:00
lalala-sh
def677fbc7 fix weight index out of range (#2414)
[ROCm/composable_kernel commit: bb5c478295]
2025-08-01 17:50:02 +08:00
Aviral Goel
4f9a78fbf0 Integration of a new pipeline for weight preshuffle into gemm examples (#2516)
* something khushbu can help with

* v1 v2 works with flatmm develop

* v0 v1 v2 numerical error gone

* Fixing numerical error, and interchange preshuffle configs to match with flatmm

* Refactor GEMM pipeline configurations and integrate preshuffle support

- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments

* clang format

* fix vectorloadsize bug

* add the Preshuffle3

* update kwarp calculation in gemm utils

* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch

* fix: add CK_GFX950_SUPPORT macro for gfx950 detection

* default disable rotating buffer

* docs(CHANGELOG): update changelog for rocm 7.0

* Revert "docs(CHANGELOG): update changelog for rocm 7.0"

This reverts commit 2bc16fff84.

* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.

* revert example/ck_tile/flatmm to its original state

* remove comment added by second author

* switch to xor ALDSDescriptor

* modify the MakeALdsDescriptor()

* temporary profiling script

* getting rid of line marker compiler error

* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy

* add a minor fix for the config

* typo fix

* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2

* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp

* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp

* reenable the GemmSpatiallyLocalTilePartitioner

* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline

* remove hardcoded true for preshuffle bool template argument

* rename script

* remove gemm_profilie.sh script

* merge conflict resolve

* clang formatted

* typo fix

* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp

* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy

* Fix missing newline at end of file in run_gemm_example.inc

* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1

* addressing review comments

* removing debug code

* addressing review comments

* Revert "addressing review comments"

This reverts commit 29c45192ba.

* updating tile_engine code

* addressing review comments

---------

Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
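
Weight preshuffle rearranges the B (weight) matrix offline into the tile order the GEMM pipeline consumes, so the kernel can load it with wide contiguous reads. A simplified host-side sketch of the idea (tile sizes and layout are illustrative, not the pipeline's actual layout):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Repack a row-major K x N weight matrix into [N/tileN][K/tileK][tileK][tileN]
// blocks so each K-tile of a column block is contiguous in memory.
std::vector<float> preshuffle_b(const std::vector<float>& b,
                                int64_t K, int64_t N,
                                int64_t tileK, int64_t tileN)
{
    std::vector<float> out(static_cast<std::size_t>(K * N));
    int64_t idx = 0;
    for(int64_t n0 = 0; n0 < N; n0 += tileN)
        for(int64_t k0 = 0; k0 < K; k0 += tileK)
            for(int64_t k = k0; k < k0 + tileK; ++k)
                for(int64_t n = n0; n < n0 + tileN; ++n)
                    out[idx++] = b[k * N + n]; // assumes tileK | K and tileN | N
    return out;
}
```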

[ROCm/composable_kernel commit: 1441a0a7ee]
2025-08-01 00:04:54 -07:00
Khushbu Agarwal
464c6f459e [CK_Tile] Updating gpu timer when doing flush cache (#2593)
* Missed updating function names in example

* updating timer

* code cleanup

* addressing review comments

* updating tile_engine code

* addressing review comments

[ROCm/composable_kernel commit: 88d72178d6]
2025-07-31 16:43:33 -07:00
Aviral Goel
8bea96ab32 Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt

* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt

* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

---------

Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>

[ROCm/composable_kernel commit: 546ef78d1d]
2025-07-31 12:18:02 -07:00
Ville Pietilä
e91e8e7908 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeduced split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>
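
The idea behind the auto-deduction, sketched as a rough "best occupancy" heuristic (not the exact formula used in #2491): if the base grid does not fill the device, split the K dimension until every compute unit has work.

```cpp
#include <algorithm>
#include <cstdint>

// Pick a split-K factor that raises the workgroup count to roughly the number
// of compute units; hypothetical helper, illustrative only.
int64_t deduce_split_k(int64_t grid_size, int64_t num_compute_units, int64_t k_iters)
{
    if(grid_size >= num_compute_units)
        return 1; // the grid already saturates the device

    const int64_t split_k = num_compute_units / grid_size;
    // Never split finer than the number of K iterations actually available.
    return std::clamp<int64_t>(split_k, 1, k_iters);
}
```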

[ROCm/composable_kernel commit: e962a41638]
2025-07-31 12:08:45 +02:00
Anton Gorenko
2ef590ab43 [CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571)
* Add tests for host conversion f32/f16 to f8

* Add tests for host conversion from f8 to f32/f16

* Fix UB and corner cases in f32/f16 to/from f8 conversion

* There are UBs when very small values are converted to f8: bit shifts
  can be larger than the type width. Using unsigned long long does not help
  because exponent_diff >= 64 in such cases. As a result, values like
  2.117582368e-22 are converted to non-zero f8 in host validation of FMHA
  tests, and test_f8 crashes with a segfault in completely unrelated code
  (e.g. GTest internals) or produces non-deterministic results.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
  conversion.

* Replace some magic numbers with values from numeric_traits

* Build tests only on devices supporting the type
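
The UB in question is `mantissa >> exponent_diff` where `exponent_diff` can exceed the width of the integer type, which C++ leaves undefined. A minimal sketch of the guard pattern (simplified; the real conversion also handles rounding and denormals):

```cpp
#include <cstdint>

// Shifting by >= the bit width of the type is undefined behavior in C++.
// Guarding the shift amount makes tiny inputs flush deterministically to zero.
uint32_t shift_mantissa(uint32_t mantissa, int exponent_diff)
{
    if(exponent_diff >= 32) // would be UB: shift count >= type width
        return 0;
    return mantissa >> exponent_diff;
}
```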

[ROCm/composable_kernel commit: 7b074249f4]
2025-07-31 09:54:17 +05:00
Illia Silin
64f8c28b42 upgrade clang-format version in install_precommit.sh (#2589)
[ROCm/composable_kernel commit: e8709c24f4]
2025-07-30 08:02:25 -07:00
Max Podkorytov
e18a5f6859 [CK-tile] add gtest for ck-tile batched transpose kernels (#2585)
* add a dummy test file

* add kernel launch logic to the test

* transfer all test cases into gtest params

* factor kernel out into test config

* add load transpose pipeline tests

* add padded tests and skip invalid kernels at runtime

* enum class for pipeline type

* add multiwarp test cases

* fix type

* try to solve the problem

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: de0cdb4c31]
2025-07-30 07:31:05 -07:00
Gino Lu
80a9d720ca add constexpr to pk_fp4::pack/unpack() (#2586)
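
Marking pack/unpack `constexpr` lets packed-fp4 constants be formed at compile time. A hedged sketch of the nibble-packing idea (the actual pk_fp4 API may differ):

```cpp
#include <cstdint>

// Two 4-bit values share one byte: `lo` in bits 0-3, `hi` in bits 4-7.
constexpr uint8_t pack_fp4(uint8_t lo, uint8_t hi)
{
    return static_cast<uint8_t>((lo & 0xF) | ((hi & 0xF) << 4));
}

constexpr uint8_t unpack_fp4(uint8_t packed, int index) // index 0 = lo, 1 = hi
{
    return static_cast<uint8_t>((packed >> (index * 4)) & 0xF);
}

static_assert(unpack_fp4(pack_fp4(0x3, 0xA), 1) == 0xA, "round-trips at compile time");
```
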
[ROCm/composable_kernel commit: b25d512e8a]
2025-07-30 10:29:04 -04:00
Khushbu Agarwal
7af5e043b1 Update to gpu_timer for rotating_buffer (#2524)
* update gpu_timer for rotating buffer as hipblasLt's implementation

* timing fix

* Updating gpu timer for old ck as well

* Revert "Updating gpu timer for old ck as well"

This reverts commit 958cd1bc99.

* code clean up with runtime argument; function rename

* code cleanup

* general timer fixes

* bug fix

* clang formatted

* addressing review comments

* clang formatted

* Addressing review comments

* CI fix

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
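
The rotating-buffer technique behind this change: cycle through several copies of the input between timed launches so each launch reads cold memory rather than a warm cache, while hipEvent timestamps bracket only the kernel work. A condensed HIP sketch (buffer count and launcher are placeholders):

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Time `launches` kernel runs, rotating through pre-allocated input copies so
// no launch benefits from data cached by the previous one.
float time_with_rotating_buffers(const std::vector<float*>& input_copies,
                                 float* output, int launches,
                                 void (*launch_kernel)(const float*, float*))
{
    hipEvent_t start, stop;
    (void)hipEventCreate(&start);
    (void)hipEventCreate(&stop);

    (void)hipEventRecord(start, nullptr);
    for(int i = 0; i < launches; ++i)
        launch_kernel(input_copies[i % static_cast<int>(input_copies.size())], output);
    (void)hipEventRecord(stop, nullptr);
    (void)hipEventSynchronize(stop);

    float ms = 0.f;
    (void)hipEventElapsedTime(&ms, start, stop);
    (void)hipEventDestroy(start);
    (void)hipEventDestroy(stop);
    return ms / launches; // average per-launch time
}
```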

[ROCm/composable_kernel commit: 61e21f5567]
2025-07-29 15:21:05 -07:00
Illia Silin
c810dda70e Revert "Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558)" (#2584)
This reverts commit a111f65475de7974fe89070963d21a2741341ef7.

[ROCm/composable_kernel commit: b80099cc5f]
2025-07-29 13:04:51 -07:00
Thomas Ning
69b117ae73 Expand the bandwidth of direct_global_to_lds for gfx950 (#2576)
* Expand the bandwidth of direct_global_to_lds for gfx950

* clang-format

* fix the remod.py and script for clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: 9d4b494f07]
2025-07-28 23:56:53 -07:00
rocking
0fb487ad63 set default optdim (#2580)
[ROCm/composable_kernel commit: 01642ca8b1]
2025-07-29 13:44:10 +08:00
Illia Silin
6704e5303f fix the clang-format (#2578)
[ROCm/composable_kernel commit: 49723e94bb]
2025-07-28 20:49:55 -07:00
Yi DING
a592107cb9 [CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
* Fix shuffle_tile

* Add fmha bwd d160

* CHANGELOG

* Use static_cast

* Update

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 1926cd0cb8]
2025-07-29 09:31:14 +08:00
Andres Lugo
53b36035f2 Remove filter for only batch on receipt 4 (#2574)
Re-enable group mode instances for the PyTorch receipt and resolve linker errors for torch SDPA

[ROCm/composable_kernel commit: 7fe50dc3da]
2025-07-28 14:53:24 -07:00
Bartłomiej Kocot
f25da17c36 Enable multiple D for grouped conv fwd large tensors (#2572)
[ROCm/composable_kernel commit: 5b244105d9]
2025-07-28 22:39:07 +02:00
linqunAMD
d3a1842669 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: 0782ee8eb3]
2025-07-28 13:01:07 -07:00
Illia Silin
3345f5f417 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config

[ROCm/composable_kernel commit: 504b101da3]
2025-07-28 11:34:07 -07:00
Illia Silin
61ff984dcd use ninja to build packages (#2575)
[ROCm/composable_kernel commit: 9786087010]
2025-07-28 11:04:12 -07:00
jefyang1
8d7d85be00 Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558)
[ROCm/composable_kernel commit: c64a0c65b9]
2025-07-28 09:03:54 -07:00
rocking
4c9de4fbb4 [CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539)
* uncomment all the headdim, use optdim to control

* change default back to -1

* uncomment splitkv instance

* Fix typo in receipt 4 for appendkv

* support optdim for bwd, splitkv and appendkv

* Fix 192 key error

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>

[ROCm/composable_kernel commit: b36e0b029f]
2025-07-28 17:16:32 +08:00
shay-li77
d143f1466f fix mha bwd dbias random mismatch (#2570)
* fix mha bwd dbias random mismatch

* formatting code

[ROCm/composable_kernel commit: 8ae528a1b4]
2025-07-28 14:39:31 +08:00
Bartłomiej Kocot
23ee2ddb5c Enable bf16 RNE on gfx950 (#2542)
* Enable bf16 RNE for gfx950

* test bhalf

* fix

* fix

* Comments fixes

* fixes

* clean

* fix
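
Round-to-nearest-even (RNE) f32→bf16 in software is the well-known bias trick; a generic sketch of what this PR enables on gfx950 (the in-tree conversion may use hardware instructions instead):

```cpp
#include <cstdint>
#include <cstring>

// Truncating the low 16 bits rounds toward zero; adding a bias of 0x7FFF plus
// the lowest kept bit implements round-to-nearest-even before truncation.
uint16_t float_to_bf16_rne(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    if((u & 0x7FFFFFFFu) > 0x7F800000u) // NaN: truncate but keep it a quiet NaN
        return static_cast<uint16_t>((u >> 16) | 0x0040u);
    const uint32_t rounding_bias = 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<uint16_t>((u + rounding_bias) >> 16);
}
```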

[ROCm/composable_kernel commit: 685771b875]
2025-07-28 00:47:17 +02:00
Gheorghe-Teodor Bercea
457e3c8b08 Refactor async loads to work on all GPUs (#2545)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: cbfa62e4b6]
2025-07-26 22:04:59 -07:00
Max Podkorytov
98bad686a9 [CK-Tile] Merge transpose examples (#2450)
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clamp

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

Issue:
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

Solution:
======
The problem occurred because config.time_kernel is false by default.
If it is false, there is no need to calculate perf; just print a proper message:

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

[ROCm/composable_kernel commit: 821cd26c13]
2025-07-26 21:51:54 -07:00
liang
d055927b83 reorder grid dim schedule (#2533)
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: d2459878cf]
2025-07-26 02:46:55 +08:00
Bartłomiej Kocot
69c8415c05 Fix clang format (#2567)
* clean

* clang format fix

[ROCm/composable_kernel commit: 5741edf761]
2025-07-25 09:54:34 -07:00
rahjain-amd
1e3243ff59 Fixing 0ms and inf GB/s issue in img2col (#2565)
Issue:
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

Solution:
======
The problem occurred because config.time_kernel is false by default.
If it is false, there is no need to calculate perf; just print a proper message:

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
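
The shape of the fix, sketched (names follow the ck_tile example convention but are illustrative; the output string is kept verbatim from the example):

```cpp
#include <cstdio>

// Reporting bandwidth from a 0 ms measurement divides by zero; only compute
// perf numbers when the kernel was actually timed.
void report(float avg_time_ms, double num_bytes, bool time_kernel)
{
    if(!time_kernel)
    {
        std::printf("image_to_coloumn: pass, No Perf generated due to config.time_kernel=0\n");
        return;
    }
    const double gb_per_sec = num_bytes / 1.0e6 / avg_time_ms; // bytes/ms -> GB/s
    std::printf("Perf: %f ms, %f GB/s\n", avg_time_ms, gb_per_sec);
}
```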

[ROCm/composable_kernel commit: 78082855d8]
2025-07-25 21:15:50 +05:30
Adam Osewski
63d239d406 Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clamp

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: c8eb2f995c]
2025-07-25 10:34:31 +02:00
Enrico Degregori
2d68b3f9c0 Support b_scale: (#2350)
- extend pipeline v1 and v3
 - add instances
 - add tests
 - add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
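
Functionally, a B-scale GEMM dequantizes blocks of the quantized B operand with a per-block scale during accumulation. A scalar reference sketch (data types, layout, and scale granularity are illustrative):

```cpp
#include <cstdint>
#include <vector>

// C[m][n] = sum_k A[m][k] * (B_q[k][n] * b_scale[k / scale_block_k][n]),
// with B quantized per (scale_block_k x 1) block along K; all row-major.
void gemm_b_scale_ref(const std::vector<float>& a, const std::vector<int8_t>& b_q,
                      const std::vector<float>& b_scale, std::vector<float>& c,
                      int M, int N, int K, int scale_block_k)
{
    for(int m = 0; m < M; ++m)
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(int k = 0; k < K; ++k)
                acc += a[m * K + k] * b_q[k * N + n] *
                       b_scale[(k / scale_block_k) * N + n];
            c[m * N + n] = acc;
        }
}
```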

[ROCm/composable_kernel commit: b01a27ff22]
2025-07-24 18:49:58 -07:00
Cong Ma
c721559117 [CK TILE] Apply CK_GFX950_SUPPORT macro on ck tile GEMM unit tests (#2560)
Cherry-pick c68687e30 and apply the CK_GFX950_SUPPORT macro to the ck_tile GEMM unit tests

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
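
A sketch of the guarding pattern (the GTest registration is illustrative; the macro name comes from the PR):

```cpp
#include <gtest/gtest.h>

// Compile the gfx950-only GEMM tests only when the build targets allow it, so
// builds for older architectures do not pull in unsupported instructions.
#if defined(CK_GFX950_SUPPORT)
TEST(CkTileGemm, Gfx950UniversalGemm)
{
    // ... run the gfx950-specific GEMM test body ...
    SUCCEED();
}
#endif
```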

[ROCm/composable_kernel commit: 2addf05b91]
2025-07-24 16:06:32 -07:00
Illia Silin
12f5978e20 remove repetitive code (#2562)
[ROCm/composable_kernel commit: 9c04a55626]
2025-07-24 14:52:46 -07:00
Aviral Goel
9097832e16 Revamp TERMINOLOGY.md (#2522)
* Add comprehensive terminology reference for Composable Kernel, including glossary, hardware and memory hierarchy, execution model, programming model, memory access, tile-based computing, kernel operations, and optimizations.

* Refine terminology in documentation for clarity and consistency.

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Enhance TERMINOLOGY.md by adding definitions for CUDA and HIP

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* address review comments

* refine pipeline and tile partitioner

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: 963dfa680b]
2025-07-24 12:38:24 -07:00
Mateusz Ozga
c3568357ca [CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520)
* Init commit new API

* apply clang-format

* PreShuffle preparing

* Apply Preshuffle condition to universal_gemm

* Fix: convert size_t to index_t

* Review changes

* Mode 100755 -> 100644

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: b507d889c1]
2025-07-24 20:39:56 +02:00
AviralGoelAMD
8d0acd764f docs(CHANGELOG): update changelog for rocm 7.0
[ROCm/composable_kernel commit: 1e84fdaca7]
2025-07-24 14:36:53 -04:00
Andriy Roshchenko
9395318666 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8

[ROCm/composable_kernel commit: 3421272f90]
2025-07-24 14:36:53 -04:00
Cong Ma
87dae7bde1 [CK_TILE] Disable moe_sorting unit test on gfx908 (#2555)
* [CK_TILE] Disable moe_sorting unit test on gfx908

- gfx908 does not support an instruction used in moe_sorting

* Update CMakeLists.txt

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: adeaf61ee5]
2025-07-24 09:32:06 -07:00
Yi DING
e397140640 Use filename rather than path to filter compilation (#2556)
[ROCm/composable_kernel commit: 4338346b10]
2025-07-24 17:38:14 +08:00
Yashvardhan Agarwal
094e5bad50 [CK_TILE] Support for elementwise kernel (#2246)
* Elementwise kernel implementation

Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>

* Elementwise with generalized nDims

* Adding the n-ary input tensor feature

* Generalize dimensions on top of inputs

* Add TFLOPS + remove std usage for tuples

* 1D basecase optimization

* Cleanup code + refactoring to a common interface

* Generalize to unary and add an example

* Cleanup, refactoring and commenting

* Suggestions for LWPCK-3170: elementwise kernel improvements

* Clang-format: remod.py

* Replace InputTensorType with XDataType as the type of input_tensors

* Add Tuple::apply and use it in ElementWiseKernel::operator() to call the operation with the exact number of arguments in xs

* Move examples to folder 19_elementwise

* Add missing copyright headers and fix some existing ones

* Replace an assert with throw std::runtime_error in elementwise example

* Avoid reading the output by using make_static_distributed_tensor for y_tile

* Removed two unused includes

* No need to move windows to the next block when each workgroup processes a single tile

* Only copy input tensors to the device

* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example

* Adding output strides to the kernel, transposition example and update the other examples

* Changes made by remod.py

* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp

* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference

* Fix comments in elementwise_example.cpp

- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors

* Simplify CMakeLists.txt and remove the unused variables this uncovers

* Rename a file and fix some copyright statements

* Changes made by script/clang-format-overwrite.sh

* Add basic unit test for ElementWiseKernel

* Remove left-over uninformative comment in apply unit test

* Changes made by clang-format-overwrite.sh

* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp

* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array

* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally

* Move examples to folder 20_elementwise

* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able to benchmark against old CK

* Fix Clang formatting

* Bump up the elementwise example folder number

* Elementwise: add padding + minor cleanup

* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view

* Add isSupportedArg to Elementwise kernel + adapt example and unit tests

* Fix clang-format on the unit test file

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
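
The Tuple::apply mechanism mentioned above expands a tuple of loaded inputs into exactly the arguments the elementwise operation expects. A standard index_sequence sketch of the idea (ck_tile's own Tuple differs in detail):

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Invoke `op` with each element of `xs` as a separate argument, so an n-ary
// elementwise functor receives n scalars rather than one tuple.
template <typename Op, typename Tuple, std::size_t... Is>
decltype(auto) tuple_apply_impl(Op&& op, Tuple&& xs, std::index_sequence<Is...>)
{
    return std::forward<Op>(op)(std::get<Is>(std::forward<Tuple>(xs))...);
}

template <typename Op, typename... Ts>
decltype(auto) tuple_apply(Op&& op, const std::tuple<Ts...>& xs)
{
    return tuple_apply_impl(std::forward<Op>(op), xs, std::index_sequence_for<Ts...>{});
}

// Usage: tuple_apply([](float a, float b) { return a + b; },
//                    std::make_tuple(1.f, 2.f)) returns 3.f
```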

[ROCm/composable_kernel commit: 606b0cc947]
2025-07-24 11:21:45 +02:00
jakpiase
bdb86fee78 [CK_TILE] Grouped Convolution Backward Weight Kernel (#2357)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* resolved conflicts

* rebase fixes

* fixes

* tmp

* add working support for splitk

* minor fix

* fixes

* fixes

* minor fix

* small fix

* Split K and preprocessing fixes

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 6681593864]
2025-07-24 10:41:35 +02:00