composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-12 02:05:50 +00:00

Author	SHA1	Message	Date
Yi DING	a592107cb9	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130 ) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `1926cd0cb8`]	2025-07-29 09:31:14 +08:00
Andres Lugo	53b36035f2	Remove filter for only batch on receipt 4 (#2574 ) Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA [ROCm/composable_kernel commit: `7fe50dc3da`]	2025-07-28 14:53:24 -07:00
Bartłomiej Kocot	f25da17c36	Enable multiple D for grouped conv fwd large tensors (#2572 ) [ROCm/composable_kernel commit: `5b244105d9`]	2025-07-28 22:39:07 +02:00
linqunAMD	d3a1842669	Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 ) * Remove HIP_COMPILE_DEVICE * add missing files * fix clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `0782ee8eb3`]	2025-07-28 13:01:07 -07:00
Illia Silin	3345f5f417	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config [ROCm/composable_kernel commit: `504b101da3`]	2025-07-28 11:34:07 -07:00
Illia Silin	61ff984dcd	use ninja to build packages (#2575 ) [ROCm/composable_kernel commit: `9786087010`]	2025-07-28 11:04:12 -07:00
jefyang1	8d7d85be00	Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558 ) [ROCm/composable_kernel commit: `c64a0c65b9`]	2025-07-28 09:03:54 -07:00
rocking	4c9de4fbb4	[CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539 ) * uncomment all the headdim, use optdim to control * change default back to -1 * uncomment splitkv instance * Fix typo in receipt 4 for appendkv * support optdim for bwd, splitkv and appendkv * Fix 192 key error --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com> [ROCm/composable_kernel commit: `b36e0b029f`]	2025-07-28 17:16:32 +08:00
shay-li77	d143f1466f	fix mha bwd dbias random mismatch (#2570 ) * fix mha bwd dbias random mismatch * formatting code [ROCm/composable_kernel commit: `8ae528a1b4`]	2025-07-28 14:39:31 +08:00
Bartłomiej Kocot	23ee2ddb5c	Enable bf16 RNE on gfx950 (#2542 ) * Enable bf16 RNE for gfx950 * test bhalf * fix * fix * Comments fixes * fixes * clean * fix [ROCm/composable_kernel commit: `685771b875`]	2025-07-28 00:47:17 +02:00
Gheorghe-Teodor Bercea	457e3c8b08	Refactor async loads to work on all GPUs (#2545 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cbfa62e4b6`]	2025-07-26 22:04:59 -07:00
Max Podkorytov	98bad686a9	[CK-Tile] Merge transpose examples (#2450 ) * unify pipeline signature with existing example * iwyu * move stuff around in load-tile-transpose * cleanups in batched transpose pipeline * comments * use same inputs size * cleaner printf * print host args * use 64 block sides in the 37_transpose example * roll back grid dimension size adjustment for 37_transpose example * transpose grid for 37_transpose to unify with 35_batched_transpose * unify grid computation logic * make policy methods device only (since they are used only on device from the pipeline) * more host/device attribute cleanups * copy over problem * move over pipeline and policy * add switch to batched transpose api * make the lds problem more similar to original problem * factor out logic into traits * factor out conditional compilation into trait parameter * propagate pipeline to args * unhardcode pipeline dispatch parameter * refactor vector size * put warp tile out of dispatch * rename template parameter for trait * rewrite vector size in terms of problem * mark policy-internal struct variable as device * factor out input distribution and thread access pattern from policies * reword vector size * use datatype across batched transpose pipelines, problems and kernel * remove transpose traits from lds pipeline * add padding to the lds pipeline interface * add comment * remove ck_tile example #37 * update cmakelists * add test for new pipeline * update batched transpose test * roll back load_tile_transpose changes * remove comments * pack dispatch parameters into a config * padM can be enabled * adjust lds vector size to enable padding along N * update test * clean up logic * swap m/n input vector size * adjust perf test script * sweep over C/W in perf test * count both read and written bytes into bandwidth (x2 the number) * clang-format * widen size range for perf test * remove 64k x 64k case; it's too large for index * remove thread tile from dispatch * Solve merge conflict * 
fix compile * modify the transpose * solve the test error and clang format * Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463) * Add logging to IsSupported. * Less casting in AddClamp * Conv+bias+clamp instances & profiler BF16 * Fix 3D instances & run just 1x for verification. * :Run just once for verification conv fwd. * ckProfiler conv fwd clampwq * Remove exec bit & formatting * Add support for MultiD for grouped conv fwd v3. * Enable 2Lds. * clean * align instances * align instances * profiler fixes * Fixes * fix * fix --------- Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Fixing 0ms and inf GB/s issue in img2col (#2565) issue : ==== ``` sh $ bin/tile_example_img2col Perf: 0 ms, inf GB/s ``` solution : ====== Problem occured because config.time_kernel is false by default. if false, then no need to calculate perf, just print proper message `image_to_coloumn: pass, No Perf generated due to config.time_kernel=0` * merge with develop * solve clang format --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> [ROCm/composable_kernel commit: `821cd26c13`]	2025-07-26 21:51:54 -07:00
liang	d055927b83	reorder grid dim schedule (#2533 ) Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `d2459878cf`]	2025-07-26 02:46:55 +08:00
Bartłomiej Kocot	69c8415c05	Fix clang format (#2567 ) * clean * clang format fix [ROCm/composable_kernel commit: `5741edf761`]	2025-07-25 09:54:34 -07:00
rahjain-amd	1e3243ff59	Fixing 0ms and inf GB/s issue in img2col (#2565 ) Issue: running `bin/tile_example_img2col` printed `Perf: 0 ms, inf GB/s`. Solution: the problem occurred because config.time_kernel is false by default. When it is false there is no need to calculate perf, so the example just prints the message `image_to_coloumn: pass, No Perf generated due to config.time_kernel=0` [ROCm/composable_kernel commit: `78082855d8`]	2025-07-25 21:15:50 +05:30
Adam Osewski	63d239d406	Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463 ) * Add logging to IsSupported. * Less casting in AddClamp * Conv+bias+clamp instances & profiler BF16 * Fix 3D instances & run just 1x for verification. * :Run just once for verification conv fwd. * ckProfiler conv fwd clampwq * Remove exec bit & formatting * Add support for MultiD for grouped conv fwd v3. * Enable 2Lds. * clean * align instances * align instances * profiler fixes * Fixes * fix * fix --------- Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `c8eb2f995c`]	2025-07-25 10:34:31 +02:00
Enrico Degregori	2d68b3f9c0	Support b_scale: (#2350 ) - extend pipeline v1 and v3 - add instances - add tests - add example Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b01a27ff22`]	2025-07-24 18:49:58 -07:00
Cong Ma	c721559117	[CK TILE] Apply CK_GFX950_SUPPORT macro on ck tile GEMM unit tests (#2560 ) cherry-pick c68687e30 and apply CK_GFX950_SUPPORT macro on ck tile GEMM unit tests Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `2addf05b91`]	2025-07-24 16:06:32 -07:00
Illia Silin	12f5978e20	remove repetitive code (#2562 ) [ROCm/composable_kernel commit: `9c04a55626`]	2025-07-24 14:52:46 -07:00
Aviral Goel	9097832e16	Revamp TERMINOLOGY.md (#2522 ) * Add comprehensive terminology reference for Composable Kernel, including glossary, hardware and memory hierarchy, execution model, programming model, memory access, tile-based computing, kernel operations, and optimizations. * Refine terminology in documentation for clarity and consistency. * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Enhance TERMINOLOGY.md by adding definitions for CUDA and HIP * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update TERMINOLOGY.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * address review comments * refine pipeline and tile partitioner --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `963dfa680b`]	2025-07-24 12:38:24 -07:00
Mateusz Ozga	c3568357ca	[CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520 ) * Init commit new API * apply clang-format * PreShuffle preapring * Apply Preshuffle condition to universal_gemm * Fix: convert size_t to index_t * Review changes * Mode 100755 -> 100644 --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `b507d889c1`]	2025-07-24 20:39:56 +02:00
AviralGoelAMD	8d0acd764f	docs(CHANGELOG): update changelog for rocm 7.0 [ROCm/composable_kernel commit: `1e84fdaca7`]	2025-07-24 14:36:53 -04:00
Andriy Roshchenko	9395318666	MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481 ) * Add GEMM MX BF6 example * Fix BF6 type_convert * Add type_convert for bf16x6 * Add compare operator to f4x2_pk_t * Update README for 67_gemm_microscaling * Fix host tensor initialization with integer values for FP8 [ROCm/composable_kernel commit: `3421272f90`]	2025-07-24 14:36:53 -04:00
Cong Ma	87dae7bde1	[CK_TILE] Disable moe_sorting unit test on gfx908 (#2555 ) * [CK_TILE] Disable moe_sorting unit test on gfx908 - gfx908 does not support instruction used in moe_sorting * Update CMakeLists.txt --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `adeaf61ee5`]	2025-07-24 09:32:06 -07:00
Yi DING	e397140640	Use filename but not path to filter compilation (#2556 ) [ROCm/composable_kernel commit: `4338346b10`]	2025-07-24 17:38:14 +08:00
Yashvardhan Agarwal	094e5bad50	[CK_TILE] Support for elementwise kernel (#2246 ) * Elementwise kernel implementation Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: yashagar <yashagar@amd.com> * Elementwise with generalized nDims * Adding the n-ary input tensor feature * Generalize dimensions on top of inputs * Add TFLOPS + remove std usage for tuples * 1D basecase optimization * Cleanup code + refactoring to a common interface * Generalize to unary and add an example * Cleanup, refactoring and commenting * Suggestions for LWPCK-3170: elementwise kernel improvements * Clang-format: remod.py * Replace InputTensorType with XDataType as the type of input_tensors * Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs * Move examples to folder 19_elementwise * Add missing copyright headers and fix some existing ones * Replace an assert with throw std::runtime_error in elementwise example * Avoid reading the output by using make_static_distributed_tensor for y_tile * Removed two unused includes * No need to move windows to the next block when each workgroup processes a single tile * Only copy input tensors to the device * Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example * Adding output strides to the kernel, transposition example and update the other examples * Changes made by remod.py * Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp * Reuse generic reference binary/unary operation in examples + refactoring the transpose reference * Fix comments in elementwise_example.cpp - Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses - ElementWiseTraits was renamed to ElementWiseShape - Adopt suggestions made 
by Copilot when prompted to check for factual or typographical errors * Simplify CMakeLists.txt and remove the unused variables this uncovers * Rename a file and fix some copyright statements * Changes made by script/clang-format-overwrite.sh * Add basic unit test for ElementWiseKernel * Remove left-over uninformative comment in apply unit test * Changes made by clang-format-overwrite.sh * fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp * Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array * Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally * Move examples to folder 20_elementwise * Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK * Fix CLang formating * Bump up the elementwise example folder number * Elementwise: add padding + minor cleanup * Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view * Add isSupportedArg to Elementwise kernel + addapt example and unit tests * Fix clang-format on the unit test file --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> [ROCm/composable_kernel commit: `606b0cc947`]	2025-07-24 11:21:45 +02:00
jakpiase	bdb86fee78	[CK_TILE] Grouped Convolution Backward Weight Kernel (#2357 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * resolved conflicts * rebase fixes * fixes * tmp * add working support for splitk * minor fix * fixes * fixes * minor fix * small fix * Split K and preprocessing fixes --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `6681593864`]	2025-07-24 10:41:35 +02:00
Cong Ma	54bcfe99a7	[CK Tile] Fix building issue on RHEL8 (#2554 ) `#include <bit>` led to a build failure on RHEL8. `<bit>` is a C++20 header file and is not supported on RHEL8. [ROCm/composable_kernel commit: `1d8941554e`]	2025-07-23 15:47:57 -07:00
Illia Silin	e1bfc7d751	refactor fmha_bwd.py (#2546 ) [ROCm/composable_kernel commit: `1b6f024836`]	2025-07-23 09:09:56 -07:00
Haocong WANG	4ed2dda658	fix async copytest bug (#2509 ) * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * change bit width check * refactor macros into constexpr functions which still get inlined * wrap s_waitcnt api * parameterize test * cleanup * cleanup fp8 stub * add fp8 test cases; todo which input parameters are valid? * replace n for fp8 in test cases * add large shapes; fp8 fails again * change input init * test sync/async * time the test * clang-format test * use float instead of bfloat to cover a 4-byte type * fix logic - arg sections should be 'or'd * make block_sync_lds_direct_load interface similar to old ck * fix a few comment typos * name common shapes * revert the example to original logic of not waiting lds * clang-format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `a5fdc663c8`]	2025-07-23 00:14:02 -07:00
Cong Ma	baf244000e	ck_tile kernel for gemm with groupwise quantized A tensor (#2473 ) * ck_tile kernel for gemm with groupwise quantized A or B tensor. This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. 1. fp8, fp8 -> f32 2. bf8, bf8 -> f32 3. i4, fp8 -> f32 4. i4, bf8 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers. * [Block Scale Gemm] Only run gemm quant examples on __gfx94__ - Only run gemm quant examples on __gfx94__ for usage of `v_cvt_pk_fp8_f32` - Format the code * [Block Scale Gemm] Remove Bquant Gemm BlockScale This cleanup is in preparation for future development of bquant. By isolating Aquant-related code, we can streamline the codebase and make it easier to add and maintain bquant functionality in subsequent updates. * [Block Scale Gemm] Format code with clang-format-12 The latest clang-format (v19) in ROCm 7.0 generate different result than clang-format-12 which is used in CK CI. Format code with clang-format-12 for consistency. * [Block Scale Gemm] Split the k direction loop - Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp to make the logic clearer. - Disable C transposition. 
* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm * [Block Scale Gemm] Update copyright * test * Add TailHandler * Move TileDistributionEncodingPatternAQ * Refactor * refactor * fix bug * fix bug * help solve the PR comment * Format the code * [Block Scale Gemm] Add unit tests * [Block Scale Gemm] Add support to 16x16x32 MFMA - Add support to 16x16x32 MFMA - Fix a bug when exchange data crossing lanes --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com> Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `e62710e461`]	2025-07-23 00:10:16 -07:00
John Shumway	23e4294747	Switch to C++20 standard for all CMake targets. (#2536 ) All our platforms support C++20 now, so update to C++20 standard for language features such as concepts, designated initializers, range-based for initializers, and consteval. This PR only switches the compiler flags to C++20, no other changes. [ROCm/composable_kernel commit: `67b2821623`]	2025-07-22 10:52:10 -07:00
Cong Ma	545819c362	[CK_TILE] Migrate CK Tile examples to Tests to autorun on CI (#2421 ) [CK_TILE] Add new ck tile unit test * Add new ck tile unit test smoke-gemm-universal * Add new ck tile unit test smoke-gemm-basic * Add new ck tile unit test topk_softmax * Add new ck tile unit test add_rmsnorm2d_rdquant_fwd [ROCm/composable_kernel commit: `f102eedfb3`]	2025-07-22 08:15:18 -06:00
Rostyslav Geyyer	3a1ea22cce	Update packed fp4 layout (#2523 ) [ROCm/composable_kernel commit: `c9886109b4`]	2025-07-21 16:58:59 -05:00
Emily Martins	d5fcf10b29	Tests for CK tile Permute and MOE Sorting (#2417 ) * Convert ck-tile 06_permute smoke test to unit tests for fp16, fp8, and fp32 * Apply clang format and update copy right year * Convert ck tile moe sorting example smoke tests to unit tests * fix CMakelists to ensure that permute and moe_sorting are built for gfx9 only. * Remove number prefix from permute and moe_sorting directory names * code cleanup * add missing test cases for fp16 permute * remove unecessary parentheses * Cleanup * Remove uneccessary final nullptr * update copyright and licensing statement in files * Add custom target for permute tests * Add missing new line at end of file for moe sorting CMakelist. * Update MOE sorting tests to account for MOE sorting example updates The ck_tile/13_moe_sorting example was updated to include different cases dependending on whether MOE_SORTING_FMOE_2D_BUF is set. So, the ck_tile tests for MOE sorting were updated to account for these changes. --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> [ROCm/composable_kernel commit: `1fa1c34b7e`]	2025-07-21 11:20:28 -07:00
Aviral Goel	682c158b28	fix(cmake-dev): cmake dev script works with non bash shells (#2530 ) [ROCm/composable_kernel commit: `84a7600bdc`]	2025-07-19 23:15:50 -07:00
Emily Martins	69287c3086	Tests for CK Tile Flatmm and MOE Smoothquant (#2458 ) * CK tile tests for flatmm using example * MOE smoothquant draft tests * fix create_arg default index to zero for MOE smoothquant * revert MOE smoothquant changes * code clean up * Add back MOE smoothquant changes * Add MOE smoothquant cases for different precisions and update cmake * clean up comments * Update flamm cmake * revert change made to moe_smoothquant smoke_test.sh EXE path * remove unecessary comment in MOE smoothquant cmakelist * comment out adding moe_smoothquant subdirectory for now due to bugs with GPU core dump issue on gfx942 and gfx90a * Clean up run_test_case function in MOE smootquant tests * update copyright and licensing on files * Remove flatmm test dir since tests should be done as weighted preshuffle gemm * Add flamm smoke test cases to weighted preshuffle gemm gtests * remove blank line from CMakeLists --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `20306db651`]	2025-07-19 23:13:36 -07:00
Illia Silin	ff763142f1	disable building CI for gfx942 by default (#2529 ) [ROCm/composable_kernel commit: `ead17e6265`]	2025-07-18 12:25:24 -07:00
Mingtao Gu	93ff979b13	[CK] Fixed MPerBlock=32 build issue for MXFP4 GEMM decode (#2512 ) * added MPerBlock=32 for MXFP4 GEMM decode * added two instance for M>128 scenario. * added 1 instance * format --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `0198257d79`]	2025-07-18 14:35:54 +08:00
Yi DING	bb1a05cc87	[CK_TILE] Fix tile_example_moe_sorting broke in #2436 (#2525 ) [ROCm/composable_kernel commit: `f0a8c18017`]	2025-07-17 22:50:58 -07:00
Linjun-AMD	fdcae9410a	h_dim256 fmha use async_qr pipeline (#2510 ) [ROCm/composable_kernel commit: `095393276a`]	2025-07-18 09:59:38 +08:00
Thrupti Raj Lakshmana Gowda	f3f2716ebb	[CKTILE] Layout Support for CK Tile engine (#2482 ) * Updating runtime log message for CK TILE ENGINE * CKTile layout from config * CKTile custom config for CI * Documentation for Layout Changes * CKTile Layout changes to Jenkins * Fixing Clang Format * Changes to Jenkins file to fix error * fix(cmake-ck-dev): no longer sets invalid values as gpu arch * style(py files): ruff formatting * fix(cmake-ck-release): no longer sets invalid values as gpu arch * chore(cmake-tile_engine): add reminder to uncomment user config json * Changes to jenkin file to address more cases * Changes to Jenkins to fix Error * Changes to Jenkins file for fixing an error * Update Jenkinsfile (#2517) * Update Jenkinsfile --------- Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `0f3083ab5c`]	2025-07-17 12:19:41 -07:00
Emily Martins	356180d6a3	Tests for CK Tile Batched Transpose and Smoothquant (#2453 ) * Create tests for ck tile batched transpose using example * Create ck tile tests for smoothquant using examples * fix precision input strings and convert batched transpose to regression tests * Code cleanup and fix asserts * add missing licenses * update copyright and licensing in files * Update smoothquant tests to use example's smoothquant.cpp * Add custom target for batched transpose tests * Add missing new lines at end of files for CMakelists * fix typo in batched transpose CMakeList target_compile_options --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> [ROCm/composable_kernel commit: `c08986b026`]	2025-07-17 09:53:34 -06:00
Mateusz Ozga	96e2d5f86a	Fix CI clang-format (#2521 ) [ROCm/composable_kernel commit: `7fc000d7b3`]	2025-07-17 14:41:29 +02:00
slippedJim	616e233794	update (#2519 ) [ROCm/composable_kernel commit: `05b65d0c7c`]	2025-07-17 15:24:19 +08:00
Haocong WANG	894750fd59	fix mfma32x32 dispatch (#2490 ) [ROCm/composable_kernel commit: `28072adc3a`]	2025-07-17 15:24:12 +08:00
Yi DING	cee5776046	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> [ROCm/composable_kernel commit: `f1d8ad2818`]	2025-07-16 23:56:22 -07:00
Khushbu Agarwal	f1f9b9635c	Fixing numerical error, and interchange preshuffle configs to match with flatmm (#2515 ) [ROCm/composable_kernel commit: `579bd73435`]	2025-07-16 22:33:03 -07:00
Po Yen Chen	e75bbb6a13	Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389 )" (#2514 ) This reverts commit c5014a86765c598215bf2b58f89c3a7c70f92bac. [ROCm/composable_kernel commit: `722c22fb15`]	2025-07-17 10:09:01 +08:00
linqunAMD	7b5207d652	[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459 ) 1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com> [ROCm/composable_kernel commit: `fbd9f32abe`]	2025-07-17 08:19:57 +08:00


2130 Commits