From 7c14d97d0e41a773a3ba1906292276fd1a1ce41a Mon Sep 17 00:00:00 2001
From: spolifroni-amd
Date: Fri, 17 Oct 2025 14:06:04 -0400
Subject: [PATCH] updated the changelog with 7.1 and beyond info

[ROCm/composable_kernel commit: 1b95803431d50361d22c3b76c4caf6608e83069d]
---
 CHANGELOG.md | 84 +++++++++++++++++++++++++++-------------------------
 1 file changed, 44 insertions(+), 40 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9de78f3043..28bcaae5b6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,35 +2,17 @@
 
 Documentation for Composable Kernel available at [https://rocm.docs.amd.com/projects/composable_kernel/en/latest/](https://rocm.docs.amd.com/projects/composable_kernel/en/latest/).
 
-## Composable Kernel 1.2.0 for ROCm 7.0.0
+## (Unreleased) Composable Kernel for ROCm
+
+### Added
 
-### Added
 * Added a compute async pipeline in the CK TILE universal GEMM on gfx950
 * Added support for B Tensor type pk_int4_t in the CK TILE weight preshuffle GEMM.
 * Added the new api to load different memory sizes to SGPR.
 * Added support for B Tensor Preshuffle in CK TILE Grouped GEMM.
 * Added a basic copy kernel example and supporting documentation for new CK Tile developers.
 * Added support for grouped_gemm kernels to perform multi_d elementwise operation.
-* Added support for bf16, f32, and f16 for 2D and 3D NGCHW grouped convolution backward data
-* Added a fully asynchronous HOST (CPU) arguments copy flow for CK grouped GEMM kernels.
-* Added support GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW, number of instances in instance factory for NGCHW/GKYXC/NGKHW has been reduced).
-* Added support for GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW).
-* Added support for GKCYX layout for grouped convolution backward weight (NGCHW/GKCYX/NGKHW).
-* Added support for GKCYX layout for grouped convolution backward data (NGCHW/GKCYX/NGKHW).
-* Added support for Stream-K version of mixed fp8/bf16 GEMM
-* Added support for Multiple D GEMM
 * Added support for Multiple ABD GEMM
-* Added GEMM pipeline for microscaling (MX) FP8/FP6/FP4 data types
-* Added support for FP16 2:4 structured sparsity to universal GEMM.
-* Added support for Split K for grouped convolution backward data.
-* Added logit soft-capping support for fMHA forward kernels.
-* Added support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv)
-* Added support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv/bwd)
-* Added benchmarking support for tile engine GEMM.
-* Added Ping-pong scheduler support for GEMM operation along the K dimension.
-* Added rotating buffer feature for CK_Tile GEMM.
-* Added int8 support for CK_TILE GEMM.
-* Added support for elementwise kernel.
 * Added benchmarking support for tile engine GEMM Multi D.
 * Added block scaling support in CK_TILE GEMM, allowing flexible use of quantization matrices from either A or B operands.
 * Added the row-wise column-wise quantization for CK_TILE GEMM & CK_TILE Grouped GEMM.
@@ -39,19 +21,50 @@ Documentation for Composable Kernel available at [https://rocm.docs.amd.com/proj
 * Added support for batched contraction kernel.
 * Added pooling kernel in CK_TILE
 
+### Changed
+
+* Removed `BlockSize` in `make_kernel` and `CShuffleEpilogueProblem` to support Wave32 in CK_TILE (#2594)
+
+## Composable Kernel 1.1.0 for ROCm 7.1.0
+
+### Added
+
+* Added support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv/bwd)
+* Added support for elementwise kernel.
+
+### Upcoming changes
+
+* Non-grouped convolutions are deprecated. Their functionality is supported by grouped convolution.
+
+## Composable Kernel 1.1.0 for ROCm 7.0.0
+
+### Added
+
+* Added support for bf16, f32, and f16 for 2D and 3D NGCHW grouped convolution backward data
+* Added a fully asynchronous HOST (CPU) arguments copy flow for CK grouped GEMM kernels.
+* Added support GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW, number of instances in instance factory for NGCHW/GKYXC/NGKHW has been reduced).
+* Added support for GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW).
+* Added support for GKCYX layout for grouped convolution backward weight (NGCHW/GKCYX/NGKHW).
+* Added support for GKCYX layout for grouped convolution backward data (NGCHW/GKCYX/NGKHW).
+* Added support for Stream-K version of mixed fp8/bf16 GEMM
+* Added support for Multiple D GEMM
+* Added GEMM pipeline for microscaling (MX) FP8/FP6/FP4 data types
+* Added support for FP16 2:4 structured sparsity to universal GEMM.
+* Added support for Split K for grouped convolution backward data.
+* Added logit soft-capping support for fMHA forward kernels.
+* Added support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv)
+* Added benchmarking support for tile engine GEMM.
+* Added Ping-pong scheduler support for GEMM operation along the K dimension.
+* Added rotating buffer feature for CK_Tile GEMM.
+* Added int8 support for CK_TILE GEMM.
+
 ### Optimized
 
+* Optimize the gemm multiply multiply preshuffle & lds bypass with Pack of KGroup and better instruction layout.
+* Added Vectorize Transpose optimization for CK Tile
+* Added the asynchronous copy for gfx950
 
-* Optimize the gemm multiply multiply preshuffle & lds bypass with Pack of KGroup and better instruction layout. (#2166)
-* Added Vectorize Transpose optimization for CK Tile (#2131)
-* Added the asynchronous copy for gfx950 (#2425)
-
-
-### Fixes
-
-None
-
-### Changes
+### Changed
 
 * Removed support for gfx940 and gfx941 targets (#1944)
 * Replaced the raw buffer load/store intrinsics with Clang20 built-ins (#1876)
@@ -59,15 +72,6 @@ None
 * Number of instances in instance factory for grouped convolution forward NGCHW/GKYXC/NGKHW has been reduced.
 * Number of instances in instance factory for grouped convolution backward weight NGCHW/GKYXC/NGKHW has been reduced.
 * Number of instances in instance factory for grouped convolution backward data NGCHW/GKYXC/NGKHW has been reduced.
-* Removed `BlockSize` in `make_kernel` and `CShuffleEpilogueProblem` to support Wave32 in CK_TILE (#2594)
-
-### Known issues
-
-None
-
-### Upcoming changes
-
-* Non-grouped convolutions are deprecated. All of their functionality is supported by grouped convolution.
 
 ## Composable Kernel 1.1.0 for ROCm 6.1.0
 