* FMHA avoid unnecessary vmcnt0
Squashed commit of the following:
commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 22 03:15:51 2025 +0000
merge develop and solve conflicts
commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c
Merge: 2dac61a4f 13a6816fb
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 22 03:15:21 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue
commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140
Author: Ding, Yi <yi.ding@amd.com>
Date: Tue Aug 19 02:17:43 2025 +0000
update bwd
commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682
Author: Kevin Choi <kevin.choi@amd.com>
Date: Mon Aug 18 19:36:38 2025 +0000
add restrict to applicable functions
commit 45534dee5bcbe532da46fc5cd6601cde10d84387
Author: Ding, Yi <yi.ding@amd.com>
Date: Mon Aug 18 02:07:03 2025 +0000
bwd filter
commit 7abd7b372b82cba94a457238b6b4a81d093e7280
Author: Kevin Choi <kevin.choi@amd.com>
Date: Sat Aug 16 08:15:23 2025 +0000
remove noinline attr as it causes a lot more s_waitcnt's
commit 89c29746a09255c1d26038171157e91d1b68d14a
Author: Kevin Choi <kevin.choi@amd.com>
Date: Thu Aug 14 12:11:17 2025 +0000
remove innerloop, move restrict parameters to mainloop and add noinline attribute.
commit 6f61b3a5c80011411aa3aebf7983602f7c117566
Author: Kevin Choi <kevin.choi@amd.com>
Date: Thu Aug 14 07:06:51 2025 +0000
Create inner lambda with restrict parameters, add restrict to some parameters
commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a
Author: aska-0096 <haocwang@amd.com>
Date: Thu Aug 14 03:43:54 2025 +0000
save for debug
commit 5f2c3cfa86c6951208a1cc227fa556704a885a88
Merge: 25f067b4f 165a2723c
Author: aska-0096 <haocwang@amd.com>
Date: Wed Aug 13 02:15:22 2025 +0000
Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 25f067b4f09d6909a05e252c7621124046dfda57
Merge: 447c1c5d6 2ad2f97b7
Author: aska-0096 <haocwang@amd.com>
Date: Wed Aug 13 02:14:26 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 165a2723c557420b48891cc1ce3434e3675aef5d
Merge: 447c1c5d6 4491739ab
Author: asleepzzz <hanwen.chang@amd.com>
Date: Wed Aug 13 00:34:11 2025 +0800
Merge branch 'develop' into wip-async-tr-fa
commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 14:25:50 2025 +0000
refactor blockgemm change, isolate to v2;
commit 8f67083511ff77d31c880f4427d3bdf53a179568
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 09:26:13 2025 +0000
clang format
commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965
Merge: c84662c34 245071bcf
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 09:04:41 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit c84662c345755ec5f3d524fdde4aa951c8f86298
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 08:46:06 2025 +0000
Fix the bug
commit e0647ffa5646f8132529b152af02750c4010013d
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 04:02:41 2025 +0000
fix conflict. disable all v-col instance for fmha fwd
commit 781f98236c376f57591a6d481cc2ee04b36a148b
Merge: 241f3d7dc 6e03d9607
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 03:52:34 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 01:53:31 2025 +0000
clang format
commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f
Merge: 1a629c098 3639befe9
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 12 01:52:52 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 1a629c09876cc05f0750db7eade1d527dc32a1d3
Merge: f65874e5b b34a029cd
Author: aska-0096 <haocwang@amd.com>
Date: Mon Aug 11 15:59:40 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit f65874e5b07579d5b734b4c68877679a3ee04dac
Author: aska-0096 <haocwang@amd.com>
Date: Mon Aug 11 15:37:37 2025 +0000
change the warp setting for hdim32 fmha fwd
commit 7c5f5e65e97486c074ef9a138900ed9aafea547e
Author: aska-0096 <haocwang@amd.com>
Date: Mon Aug 11 14:21:09 2025 +0000
tempsave, update the blocksync functions
commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915
Author: aska-0096 <haocwang@amd.com>
Date: Sun Aug 10 06:00:51 2025 +0000
fix bug in pki4
commit 073db2e18af21f1ed1fb3d1f1c15830838df986f
Author: aska-0096 <haocwang@amd.com>
Date: Sat Aug 9 03:25:12 2025 +0000
fix bugs in gemm
commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 18:35:53 2025 +0000
fix bug on non-gfx950
commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 17:53:19 2025 +0000
fix bug
commit 30de97f473685e0bd5b82f15eee2493d9a05cffd
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 15:42:15 2025 +0000
fix bugs
commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 09:31:01 2025 +0000
fix clangformat with 18.1.3
commit e4cb185c41586d018771a5413efd909d8d53a8c5
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 09:07:40 2025 +0000
remove non-necessary change
commit 498f0d44cfba17287cce8d10855cce5c5de263db
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 09:04:02 2025 +0000
bug fix, clang format;
commit 3cb648cbc4883e6889340d85f48d803a21b9c805
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 08:08:03 2025 +0000
Remove unnecessary changes
commit 9e7ff3b611b7933b65973907a0cae312a15d31c6
Merge: a3c1bfe6d 7f14bd1df
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 07:50:12 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 06:19:31 2025 +0000
remove unnecessary files; rename some files
commit 6c257fa27729c005d539b5b71deeba3703031089
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 8 05:46:18 2025 +0000
merge fa_decode pipeline into fmha_fwd api
commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449
Author: aska-0096 <haocwang@amd.com>
Date: Wed Aug 6 05:58:43 2025 +0000
add __restrict__ to tr load
commit bbad2b979b701533b74f43452ffe0f775e019139
Author: aska-0096 <haocwang@amd.com>
Date: Tue Aug 5 07:23:51 2025 +0000
Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
commit d7fabd5f765e2a573ddbaf0857ce6f691407e562
Author: aska-0096 <haocwang@amd.com>
Date: Mon Aug 4 10:27:42 2025 +0000
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9
Author: aska-0096 <haocwang@amd.com>
Date: Mon Aug 4 10:02:17 2025 +0000
add vmcnt guard before load ktile
commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c
Author: aska-0096 <haocwang@amd.com>
Date: Mon Aug 4 06:49:01 2025 +0000
Load Q through lds, implement xor;
commit 62bb9f05177dfb8280d6c2be67a88492d6be4838
Author: aska-0096 <haocwang@amd.com>
Date: Fri Aug 1 10:44:54 2025 +0000
small refactor
commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 31 10:25:37 2025 +0000
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 31 05:13:27 2025 +0000
enable larger tile size; upgrade xor pattern
commit a468e59a01d6dd85c105ca30ac249491256c5915
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 30 12:25:33 2025 +0000
remove all lds bankconflict with xor layouts
commit 39ff55cdc377311112100fb24bc013adfd8960c0
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 30 03:51:06 2025 +0000
enable prefill overload operator().
commit a7b152a788e8035c93f8e4cbf317863182665d8f
Author: aska-0096 <haocwang@amd.com>
Date: Fri Jul 25 07:10:01 2025 +0000
fix the lds alignment caused performance regression
commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 23 09:05:57 2025 +0000
remove unnecessary features
commit 9758750801c7fd5a80f654eb982f43b87d674fa3
Author: aska-0096 <haocwang@amd.com>
Date: Tue Jul 22 08:04:05 2025 +0000
tempsave. asynccopy+trload sanity checked
commit 1c4c04d725047357224ebf8a2b94d9010a5651a6
Author: aska-0096 <haocwang@amd.com>
Date: Mon Jul 21 05:55:55 2025 +0000
tempsave, trload+asyncload done
commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c
Author: aska-0096 <haocwang@amd.com>
Date: Fri Jul 18 10:04:34 2025 +0000
compile pass
commit d41b5eace939909084d32281710fb81142ad5fec
Merge: 3f86a81ee 33204e15f
Author: aska-0096 <haocwang@amd.com>
Date: Fri Jul 18 05:17:27 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
commit 3f86a81eee75256a78df02032d50814aaa42b038
Author: aska-0096 <haocwang@amd.com>
Date: Fri Jul 18 05:16:39 2025 +0000
tempsave
commit 7d43f7446a9a20773f70e08462393f6c9afb7280
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 17 10:06:09 2025 +0000
temp save
commit 727629cd9115f1be9c1800bb65a8ea84ff06c250
Merge: aa5da19c9 94bceebc9
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 17 07:24:32 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
commit 94bceebc96ef4885e0ac861b7793e2e2897481bd
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 17 03:10:46 2025 +0000
move test_copy into test
commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 17 02:41:31 2025 +0000
remove unnecessary output
commit b1dbcacb1832560c6cc967a079dffce558228f0b
Merge: 5b0d311e6 0eaf3325a
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 17 02:26:13 2025 +0000
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit 5b0d311e649257557a7014c28fcfac0c327b77b5
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 17 02:26:10 2025 +0000
add input validation and bug fix
commit 0eaf3325a8e019402ff12a2402f446f8471f584f
Merge: a66e1d29a 08c5df68a
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Date: Wed Jul 16 11:23:57 2025 -0700
Merge branch 'develop' into test_copy_fix
commit a66e1d29a8cccc17cc8958d970ec7b1281ec8291
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 16 08:55:50 2025 +0000
fix vmcnt shift
commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 16 08:37:07 2025 +0000
Improve s_waitcnt_imm calculation
commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 16 05:39:50 2025 +0000
fix the s_waitcnt_imm calculation
commit 1c0870089a0e7c78ed71a278bf52d98fc780e482
Merge: d6ee05e36 d9de58c66
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 16 03:57:57 2025 +0000
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix
commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jul 16 03:54:33 2025 +0000
Add block_sync_lds_direct_load utility
commit c037a72040217471f52ee76bed9c07bf5b22aef4
Author: aska-0096 <haocwang@amd.com>
Date: Tue Jul 15 09:39:03 2025 +0000
fix async copytest bug
commit aa5da19c94022449b027e7a57668f2e219f0f171
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jul 10 04:29:33 2025 +0000
temp save, change all instance to 1wave
commit ddd172feb9eb2cb783420a8db6f44d51b350c370
Author: aska-0096 <haocwang@amd.com>
Date: Tue Jul 8 08:37:20 2025 +0000
tempsave, fmha_decode
commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33
Author: aska-0096 <haocwang@amd.com>
Date: Sat Jun 21 15:02:57 2025 +0000
temp save, waiting for debug
commit 71dd31f15bca01995c8cb0be9e903103f4657181
Author: aska-0096 <haocwang@amd.com>
Date: Thu Jun 19 05:11:52 2025 +0000
save an example for __bf16 type
commit cdf33e079fa7d7d5b03b06550df2356b02041d7b
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jun 18 07:27:24 2025 +0000
fix bwd code
commit d630998dc6751f44097b1e9a239bb5063a793736
Author: aska-0096 <haocwang@amd.com>
Date: Wed Jun 18 06:37:16 2025 +0000
Fix for fwd/bwd kernel build filter
commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com>
Date: Tue Aug 19 08:13:18 2025 +0000
Add restrict to avoid unnecessary vmcnt
---------
Co-authored-by: aska-0096 <haocwang@amd.com>
* Add comments for c-stype cast
* Better comments
---------
Co-authored-by: aska-0096 <haocwang@amd.com>
[ROCm/composable_kernel commit: de61e55493]
Composable Kernel
Note
The published documentation is available at Composable Kernel in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the
docsfolder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see Contribute to ROCm documentation.
The Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads across multiple architectures (GPUs, CPUs, etc.). The CK library uses general purpose kernel languages, such as HIP C++.
CK uses two concepts to achieve performance portability and code maintainability:
- A tile-based programming model
- Algorithm complexity reduction for complex machine learning (ML) operators. This uses an innovative technique called Tensor Coordinate Transformation.
The current CK library is structured into four layers:
- Templated Tile Operators
- Templated Kernel and Invoker
- Instantiated Kernel and Invoker
- Client API
General information
- CK supported operations
- CK Tile supported operations
- CK wrapper
- CK codegen
- CK profiler
- Examples (Custom use of CK supported operations)
- Client examples (Use of CK supported operations with instance factory)
- Terminology
- Contributors
CK is released under the MIT license.
Building CK
We recommend building CK inside Docker containers, which include all necessary packages. Pre-built Docker images are available on DockerHub.
-
To build a new Docker image, use the Dockerfile provided with the source code:
DOCKER_BUILDKIT=1 docker build -t ck:latest -f Dockerfile . -
Launch the Docker container:
docker run \ -it \ --privileged \ --group-add sudo \ -w /root/workspace \ -v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace \ ck:latest \ /bin/bash -
Clone CK source code from the GitHub repository and start the build:
git clone https://github.com/ROCm/composable_kernel.git && \ cd composable_kernel && \ mkdir build && \ cd buildYou must set the
GPU_TARGETSmacro to specify the GPU target architecture(s) you want to run CK on. You can specify single or multiple architectures. If you specify multiple architectures, use a semicolon between each; for example,gfx908;gfx90a;gfx942.cmake \ -D CMAKE_PREFIX_PATH=/opt/rocm \ -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \ -D CMAKE_BUILD_TYPE=Release \ -D GPU_TARGETS="gfx908;gfx90a" \ ..If you don't set
GPU_TARGETSon the cmake command line, CK is built for all GPU targets supported by the current compiler (this may take a long time). Tests and examples will only get built if the GPU_TARGETS is set by the user on the cmake command line.NOTE: If you try setting
GPU_TARGETSto a list of architectures, the build will only work if the architectures are similar, e.g.,gfx908;gfx90a, orgfx1100;gfx1101;gfx11012. Otherwise, if you want to build the library for a list of different architectures, you should use theGPU_ARCHSbuild argument, for exampleGPU_ARCHS=gfx908;gfx1030;gfx1100;gfx942. -
Build the entire CK library:
make -j"$(nproc)" -
Install CK:
make -j install
Optional post-install steps
-
Build examples and tests:
make -j examples tests -
Build and run all examples and tests:
make -j checkYou can find instructions for running each individual example in example.
-
Build and run smoke/regression examples and tests:
make -j smoke # tests and examples that run for < 30 seconds eachmake -j regression # tests and examples that run for >= 30 seconds each -
Build ckProfiler:
make -j ckProfilerYou can find instructions for running ckProfiler in profiler.
-
Build our documentation locally:
cd docs pip3 install -r sphinx/requirements.txt python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
Notes
The -j option for building with multiple threads in parallel, which speeds up the build significantly.
However, -j launches unlimited number of threads, which can cause the build to run out of memory and
crash. On average, you should expect each thread to use ~2Gb of RAM.
Depending on the number of CPU cores and the amount of RAM on your system, you may want to
limit the number of threads. For example, if you have a 128-core CPU and 128 Gb of RAM it's advisable to use -j32.
Additional cmake flags can be used to significantly speed-up the build:
-
DTYPES(default is not set) can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build instances of select data types only. The main default data types are fp32 and fp16; you can safely skip other data types. -
DISABLE_DL_KERNELS(default is OFF) must be set to ON in order not to build instances, such asgemm_dlorbatched_gemm_multi_d_dl. These instances are useful on architectures like the NAVI2x, as most other platforms have faster instances, such asxdlorwmma, available. -
DISABLE_DPP_KERNELS(default is OFF) must be set to ON in order not to build instances, such asgemm_dpp. These instances offer a slightly better performance of fp16 gemms on NAVI2x. But on other architectures faster alternatives are available. -
CK_USE_FP8_ON_UNSUPPORTED_ARCH(default is OFF) must be set to ON in order to build instances, such asgemm_universal,gemm_universal_streamkandgemm_multiply_multiplyfor fp8 data type for GPU targets which do not have native support for fp8 data type, such as gfx908 or gfx90a. These instances are useful on architectures like the MI100/MI200 for the functional support only.
Using sccache for building
The default CK Docker images come with a pre-installed version of sccache, which supports clang being used as hip-compiler (" -x hip"). Using sccache can help reduce the time to re-build code from hours to 1-2 minutes. In order to invoke sccache, you need to run:
sccache --start-server
then add the following flags to the cmake command line:
-DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache
You may need to clean up the build folder and repeat the cmake and make steps in order to take advantage of the sccache during subsequent builds.
Using CK as pre-built kernel library
You can find instructions for using CK as a pre-built kernel library in client_example.
Contributing to CK
When you contribute to CK, make sure you run clang-format on all changed files. We highly
recommend using git hooks that are managed by the pre-commit framework. To install hooks, run:
sudo script/install_precommit.sh
With this approach, pre-commit adds the appropriate hooks to your local repository and
automatically runs clang-format (and possibly additional checks) before any commit is created.
If you need to uninstall hooks from the repository, you can do so by running the following command:
script/uninstall_precommit.sh
If you need to temporarily disable pre-commit hooks, you can add the --no-verify option to the
git commit command.

