composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-05 14:47:26 +00:00

Author	SHA1	Message	Date
Ville Pietilä	9d5e5f7188	Remove obsolete packed cast tensor slice transfers.	2025-09-03 11:00:41 +00:00
Ville Pietilä	70d57ca8b9	Remove separate packed cast step.	2025-09-03 10:57:16 +00:00
Ville Pietilä	d56d7bc821	Use fused packed casts.	2025-09-02 12:00:21 +00:00
Ville Pietilä	d22ec6633a	Fused packed cast improvements.	2025-09-01 14:57:42 +00:00
Ville Pietilä	c69539fe3c	Optimize LDS write order for packed cast.	2025-08-28 15:01:55 +00:00
Ville Pietilä	9f66d9fbca	Packed cast improvement.	2025-08-27 14:32:00 +00:00
Ville Pietilä	b82e68c45f	Bug fix.	2025-08-27 14:30:13 +00:00
Ville Pietilä	f1a7cfba26	Packed cast improvement.	2025-08-27 14:23:05 +00:00
Ville Pietilä	90b90dff08	Vectorized packed cast.	2025-08-27 13:29:38 +00:00
Ville Pietilä	54302c6f77	Remove obsolete version of the packed cast.	2025-08-27 11:28:55 +00:00
Ville Pietilä	481df169f2	Add packed cast to gridwise gemm multi d.	2025-08-27 11:27:03 +00:00
Ville Pietilä	6092643e9b	Performant packed cast implementation.	2025-08-27 10:10:18 +00:00
Ville Pietilä	cfbd669455	WIP: Vectorized access.	2025-08-26 12:43:38 +00:00
Ville Pietilä	2302ea9bc6	Add the vectorized option for packed cast.	2025-08-26 12:31:01 +00:00
Ville Pietilä	905cfb6623	Use thread scratch buffer in bf16 conversion.	2025-08-26 09:33:50 +00:00
Ville Pietilä	1b858936cb	WIP: Packed cast using threadwise scratch.	2025-08-25 15:31:50 +00:00
Ville Pietilä	4e7f9f7908	Fix packed cast tensor slice transfer.	2025-08-22 10:26:59 +00:00
Ville Pietilä	de93a48b04	Add back the separate packed cast step.	2025-08-20 11:31:07 +00:00
Ville Pietilä	6fbe1895f1	Code clean-up.	2025-08-20 08:19:13 +00:00
Ville Pietilä	0d34572f20	Code clean-up.	2025-08-19 18:37:01 +00:00
Ville Pietilä	b48ae7e447	Add perf test. Fix packed bf16 cast implementation.	2025-08-18 11:37:14 +00:00
Ville Pietilä	6b2b5e7c7c	Small optimization to the packed cast.	2025-08-18 06:39:26 +00:00
Ville Pietilä	c0b8f66674	Add packed cast pipeline into gridwise gemm xdlops bwd weight.	2025-08-18 06:10:14 +00:00
Ville Pietilä	00a3ce734a	Integrate new packed cast threadwise tensor slice transfer into gridwise gemm pipelines.	2025-08-15 12:06:44 +00:00
Ville Pietilä	51af3d7bac	Fix a bug in the packed cast threadwise transfer.	2025-08-15 10:41:06 +00:00
Ville Pietilä	62c66a7d9c	WIP: packed bf16 cast v3.	2025-08-14 12:39:18 +00:00
Ville Pietilä	938ff298b4	Add more unit tests.	2025-08-14 11:33:02 +00:00
Ville Pietilä	3ecc8aae74	Add unit test for vectorized packed cast.	2025-08-14 08:41:35 +00:00
Ville Pietilä	ade741dd45	WIP: PackedCast v3.	2025-08-13 15:13:35 +00:00
Ville Pietilä	50e318e072	Fix logging.	2025-08-12 15:53:00 +00:00
Ville Pietilä	ae4c727bc5	Add packed bf16 cast for universal GEMM.	2025-08-12 15:52:49 +00:00
Ville Pietilä	cee7644c85	Working version 2 of the packed cast.	2025-08-12 12:46:01 +00:00
Ville Pietilä	6148d1c75f	WIP: Packed cast v2.	2025-08-11 15:18:30 +00:00
Ville Pietilä	c675563468	Add logging and specific unit tests for bf16 and gfx950.	2025-08-08 08:59:12 +00:00
Ville Pietilä	c47b80580d	Fix build issues when __gfx950__ macro is enabled.	2025-08-08 08:01:42 +00:00
Ville Pietilä	4b8a559da9	Fixed packed_cast implementation for slice access.	2025-08-06 11:04:29 +00:00
Ville Pietilä	44202b9d32	WIP: Integration of packed cast into gridwise_gemm_xdl_cshuffle_conv_v3.	2025-08-05 15:12:36 +00:00
Ville Pietilä	e92c0bf68e	Initial integration of packed cast.	2025-08-04 15:34:35 +00:00
Ville Pietilä	e06548675f	Fix a bug in packed cast asm. Add more unit tests.	2025-08-04 09:16:12 +00:00
Ville Pietilä	9769fa68a7	Packed BF16 cast with asm volatile.	2025-08-01 13:22:41 +00:00
Ville Pietilä	b2b991d431	Added conversion of two floats into a packed bf16 value.	2025-07-31 15:13:37 +00:00
Ville Pietilä	e962a41638	Automatic deduction of split-K value for grouped convolution (#2491) * Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Use simple best occupancy model to calculate the split-K. * Handle split-K autodeduction in explicit gemm conv. * Add unit tests for split-K autodeduction. * Remove oversubscription. * Small fixes. * Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle. * Run clang formatting. * Fix error handling in the conv profiler. * Add missing documentation for the autodeducted split-K values. * Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver. * Fix clang formatting and split-K profiler documentation. * Rename max_occupancy value variable. * Calculate grid size for split-K autodeduction directly from input array shapes and template params. --------- Co-authored-by: Ville Pietilä <>	2025-07-31 12:08:45 +02:00
Anton Gorenko	7b074249f4	[CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571) * Add tests for host conversion f32/f16 to f8 * Add tests for host conversion from f8 to f32/f16 * Fix UB and corner cases in f32/f16 to/from f8 conversion * There are UBs when very small values are converted to f8: bitshifts can be larger than the type width. Using unsigned long long does not help because exponent_diff >= 64 in such cases. As a result, values like 2.117582368e-22 are converted to non-zero f8 in host validation of FMHA tests, and test_f8 crashes with a segfault in completely unrelated code such as GTest internals, or produces non-deterministic results. * Fix FNUZ conversion to return NaN for NaN inputs. * Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16 conversion. * Replace some magic numbers with values from numeric_traits * Build tests only on devices supporting the type	2025-07-31 09:54:17 +05:00
Gino Lu	b25d512e8a	add constexpr to pk_fp4::pack/unpack() (#2586)	2025-07-30 10:29:04 -04:00
Khushbu Agarwal	61e21f5567	Update to gpu_timer for rotating_buffer (#2524) * update gpu_timer for rotating buffer as hipblasLt's implementation * timing fix * Updating gpu timer for old ck as well * Revert "Updating gpu timer for old ck as well" This reverts commit `958cd1bc99`. * code clean up with runtime argument; function rename * code cleanup * general timer fixes * bug fix * clang formatted * addressing review comments * clang formatted * Addressing review comments * CI fix --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-29 15:21:05 -07:00
Thomas Ning	9d4b494f07	Expand the bandwidth of direct_global_to_lds for gfx950 (#2576) * Expand the bandwidth of direct_global_to_lds for gfx950 * clang-format * fix the remod.py and script for clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-28 23:56:53 -07:00
Illia Silin	49723e94bb	fix the clang-format (#2578)	2025-07-28 20:49:55 -07:00
Yi DING	1926cd0cb8	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-29 09:31:14 +08:00
Bartłomiej Kocot	5b244105d9	Enable multiple D for grouped conv fwd large tensors (#2572)	2025-07-28 22:39:07 +02:00
linqunAMD	0782ee8eb3	Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564) * Remove HIP_COMPILE_DEVICE * add missing files * fix clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>	2025-07-28 13:01:07 -07:00

975 Commits