composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Author	SHA1	Message	Date
lalala-sh	10a288c3a2	opt moe sorting (#2822) * opt moe storing for 2k * rm duplicated clear --------- Co-authored-by: root <root@hjbog-srdc-39.amd.com>	2025-09-18 08:57:20 +08:00
coderfeli	6a600f0cff	fix unittest	2025-09-10 10:14:22 +00:00
lalala-sh	72d13f3f55	merge felix/sorting	2025-09-10 09:31:50 +00:00
coderfeli	e6ffff0fbf	Merge remote-tracking branch 'origin/feat-mixed_input_flatmm' into wip_355	2025-09-09 11:02:07 +00:00
coderfeli	41bc59f5b1	merge nkpad	2025-09-09 10:59:14 +00:00
Feng Shijie	4b49839c63	fix the tensor_view shape of C in gemm1-gate-up	2025-09-09 07:41:32 +00:00
Feng Shijie	08bae35abd	fix backward compatibility	2025-09-09 04:27:33 +00:00
Feng Shijie	b09b6cdce9	fix scale_m gather load for a8w8 moe	2025-09-09 04:18:48 +00:00
Feng Shijie	027f5311c6	support persist-mode for moe-gemm	2025-09-09 03:52:03 +00:00
Feng Shijie	a725bc08a7	fix gate-up mismatch when OutputNRepeat > 1	2025-09-09 03:52:03 +00:00
Feng Shijie	9efde19691	Refactor scale and bias function, encapsulate scale/bias pointer to a tensor view and load data by tilewise operation	2025-09-09 03:52:03 +00:00
coderfeli	946be930dd	update	2025-09-07 08:22:07 +00:00
lalala-sh	adc8db73e0	only padding N in epilogue	2025-09-05 02:50:47 +00:00
Feng Shijie	5c484a5672	Add bias for f16xf4 moe_flatmm	2025-08-28 08:02:50 +00:00
Feng Shijie	65b702454c	support swiglu activation and use rcpf to accelerate silu	2025-08-26 12:32:29 +00:00
root	d05eed931d	add line to last	2025-08-22 04:01:59 -05:00
root	d69cab7f0c	adjust A_LDS descriptor to avoid bankconflict	2025-08-22 03:20:46 -05:00
root	65989e940c	enable hotloop	2025-08-21 09:46:52 -05:00
Feng Shijie	85976b0b87	use int64_t as expert stride to avoid overflow	2025-08-21 06:58:55 +00:00
Feng Shijie	9fbcc8f8a4	use v4i32 as the storage type for B to avoid repack operation	2025-08-20 13:53:32 +00:00
Feng Shijie	c27eb0771a	optimize cvt_pkf4_to_f16 implementation	2025-08-20 04:39:14 +00:00
Feng Shijie	3ca0bd500a	optimize A_LDS descriptor to avoid bankconflict	2025-08-19 14:56:46 +00:00
Feng Shijie	f7f0306eea	fix gate-up when GU_NRepeat > 1	2025-08-18 18:43:37 +00:00
Feng Shijie	be55c0f9cb	add fp16xf4 moe	2025-08-18 17:28:11 +00:00
Feng Shijie	599e1f5b32	rename example	2025-08-17 17:51:18 +00:00
Feng Shijie	7899fb4a8d	remove additional check when e8m0->float	2025-08-15 06:20:46 +00:00
Feng Shijie	714b341797	eliminate repeat dequant	2025-08-14 09:34:12 +00:00
Feng Shijie	53e8c0c533	Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm	2025-08-13 16:51:49 +00:00
Feng Shijie	5de6208952	update f16xMXF4	2025-08-13 16:16:48 +00:00
Feng Shijie	732ebdee8b	update scale-preshuffle for MXF4	2025-08-13 10:48:53 +00:00
Feng Shijie	edb58d0680	update	2025-08-11 11:24:34 +00:00
Feng Shijie	cc9c7b9e58	optimize gemm2 atomic_add pattern	2025-08-11 08:38:23 +00:00
Feng Shijie	200a11afc8	update scale for mxfp4	2025-08-11 07:59:47 +00:00
Feng Shijie	f788d3d629	add mixed_prec fp16xfp4	2025-08-08 20:19:16 +00:00
Feng Shijie	3dea10a277	debug mixed_prec flatmm	2025-08-07 09:22:04 +00:00
Feng Shijie	6d3cbc7c0e	add moe_flatmm	2025-08-06 08:33:33 +00:00
coderfeli	c0cb4d036d	fix split k	2025-08-06 02:45:31 +00:00
Feng Shijie	90e910f3a7	fix flatmm with scaling when WarpTileM == 32	2025-08-04 07:16:36 +00:00
Feng Shijie	aa5e008fa5	optimize scaling epilogue	2025-08-01 11:01:23 +00:00
Feng Shijie	3f43b841d4	prune debug message	2025-07-30 06:37:26 +00:00
Feng Shijie	2e5d4c74cd	fix compile error	2025-07-30 04:52:08 +00:00
Feng Shijie	c117a1986a	Add persistent option on flatmm for tuning	2025-07-29 15:42:58 +00:00
AMD-dteng	a587701117	update pipeline v1: add atomic IGLP schedule	2025-07-29 14:59:32 +00:00
Feng Shijie	1b6d7cf407	crz idea	2025-07-28 08:24:51 +00:00
Feng Shijie	5473f06461	Add permuteN optimization when NRepeat % 2 == 0 on flatmm	2025-07-27 11:57:38 +00:00
sjfeng	bfb9f4002f	try to remove c_shuffle_lds	2025-07-27 17:24:08 +08:00
Feng Shijie	1264f4d2ab	fix loop-dim mismatch and improve c_shuffle alu parallelism	2025-07-25 07:41:48 +00:00
Mateusz Ozga	b507d889c1	[CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520) * Init commit new API * apply clang-format * PreShuffle preparing * Apply Preshuffle condition to universal_gemm * Fix: convert size_t to index_t * Review changes * Mode 100755 -> 100644 --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-07-24 20:39:56 +02:00
Yashvardhan Agarwal	606b0cc947	[CK_TILE] Support for elementwise kernel (#2246) * Elementwise kernel implementation Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: yashagar <yashagar@amd.com> * Elementwise with generalized nDims * Adding the n-ary input tensor feature * Generalize dimensions on top of inputs * Add TFLOPS + remove std usage for tuples * 1D basecase optimization * Cleanup code + refactoring to a common interface * Generalize to unary and add an example * Cleanup, refactoring and commenting * Suggestions for LWPCK-3170: elementwise kernel improvements * Clang-format: remod.py * Replace InputTensorType with XDataType as the type of input_tensors * Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs * Move examples to folder 19_elementwise * Add missing copyright headers and fix some existing ones * Replace an assert with throw std::runtime_error in elementwise example * Avoid reading the output by using make_static_distributed_tensor for y_tile * Removed two unused includes * No need to move windows to the next block when each workgroup processes a single tile * Only copy input tensors to the device * Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example * Adding output strides to the kernel, transposition example and update the other examples * Changes made by remod.py * Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp * Reuse generic reference binary/unary operation in examples + refactoring the transpose reference * Fix comments in elementwise_example.cpp - Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses - ElementWiseTraits was renamed to ElementWiseShape - Adopt suggestions made by Copilot when prompted to check for factual or typographical errors * Simplify CMakeLists.txt and remove the unused variables this uncovers * Rename a file and fix some copyright statements * Changes made by script/clang-format-overwrite.sh * Add basic unit test for ElementWiseKernel * Remove left-over uninformative comment in apply unit test * Changes made by clang-format-overwrite.sh * fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp * Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array * Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally * Move examples to folder 20_elementwise * Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able to benchmark against old CK * Fix Clang formatting * Bump up the elementwise example folder number * Elementwise: add padding + minor cleanup * Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view * Add isSupportedArg to Elementwise kernel + adapt example and unit tests * Fix clang-format on the unit test file --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-07-24 11:21:45 +02:00
lalala-sh	1239d8a546	merge flatmm -scale	2025-07-24 08:46:51 +00:00


279 Commits