mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Po Yen Chen f351f9775c [CK_TILE] fmha forward split-kv + combine kernels (#1338)

* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

* Add num_splits option and dummy split-kv api method

* Generate fmha_fwd_splitkv()

* Add SplitKV kernel codegen logics

* Add SplitKV combine kernel codegen logics

* Fix mismatched return type

* Clean-up code

* Replace sentinel value before storing

* Fix wrong layout of LSE/LSEacc/Oacc

* Format codes

* Fix o_acc memory error

* Fix wrong kBlockSize used in policy

* Reduce # of combine kernels

* Fix split-kv combine kernel name

* Fix wrong LDS indexing logics

* Fix wrong loop counter step logic

* Undo vector size changes

* Remove no-longer used field

* Remove inconsistent comment

* Remove debug statements in example

* Remove more debug statements

* Add constness to local variables

* Clean up generate.py

* Fix unstable clang-format comment

* Remove unused include directive

* Use shorter template parameter name

* Enable non-split-kv blobs

* Update license date

* Print num_splits conditionally

* Undo disabling data types

* Remove unnecessary tile size for fp8

* Fix wrong pipeline args for fp8

* Fix example output format

* Remove more debug code in combine pipeline

* Add stride kernel arguments for LSE/O acc workspace

* Re-order split-kv pipeline call operator arguments

* Pass LSE/O strides in kernel argument

* Re-order pipeline call operator arguments

* Use tensor_descriptor to locate LSEacc elements

* Support providing invalid element for tensor view

* Set invalid element value for LSEacc tensor view

* Remove hand-written store_tile() code

* Remove necessary value-overwrite logic

* Add transposed lds descriptor

* Support load_tile() for tile_window_with_static_lengths<>

* Undo removing necessary value-overwrite logic

* Use read descriptor to locate lds elements

* Simplify pipeline source code

* Add constraint to kMaxSplits

* Default use kMaxSplits=64 in generate.py

* Revert "Add constraint to kMaxSplits"

This reverts commit 0a2132d758.

* Revert "Default use kMaxSplits=64 in generate.py"

This reverts commit c7d9c80b77.

* Decide alignment by the padding parameter

* Remove no-longer used utility functions

* Remove not-working code

* Add comment & remove no-longer used code

* Fix computation errors

* Add heuristic to override num_splits option

* Add constraint to kMaxSplits

* Fix compilation error

* Clean up pipeline code

* Wrap pointer access as lambda function

* Rename confusing methods

* Use kLogMasSplits as template parameter

* Finish splitkv combine kernel codegen

* Update kMaxSplits limit

* Use smaller kM0 for splitkv combine kernel

* Ignore dropout flag in splitkv pipeline

* Unify flag usage

* Add back flag kStoreLSE

* Merge lambda calls in pipeline

* Fix compilation errors

* Avoid all empty splits

* Always check for empty loop in splitkv pipelines

* Re-order parameters

* Remove redundant p_drop option check

* Add traits/problem for fwd splitkv kernel

* Conditionally enable uneven split boundary checks

* Add comment for the splitkv traits field

* Change even split criteria

* Re-order statements

* Refine occupancy value for hdim=128&256

* Refine occupancy value for hdim=32&64

* Remove redundant kernel argument

* Separate fmha bwd codegen logics

* Separate fmha fwd codegen logics

* Remove redundant direction parameter in fwd&bwd codegen logics

* Support generate multiple APIs for an example

* Let 'api' an alias of 'direction' option

* Remove choices for the 'direction' option

* Use dictionary to config all the functions

* Move fmha splitkv codegen logics to other file

* Add fwd_splitkv api for tile_example_fmha_fwd

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

[ROCm/composable_kernel commit: 0cb2e06ddc]

2024-06-26 17:41:15 +08:00

| Name | Last commit message | Last updated |
| --- | --- | --- |
| .azuredevops | Enable external CI pipeline triggers (#1310) | 2024-05-23 18:21:34 -04:00 |
| .github | Add ROCm Doc team as codeowners for RTD yaml (#1277) | 2024-05-06 10:07:39 -06:00 |
| client_example | Add instances of grouped convolution 3d forward with a ConvScale element-wise op for bf8@bf8->fp8 (#1326) | 2024-06-21 19:02:57 -06:00 |
| cmake | Fix cmake warnings (#1342) | 2024-06-21 09:47:58 +02:00 |
| codegen | CK Instance Gen (#1145) | 2024-06-25 16:37:35 -05:00 |
| docs | Bump rocm-docs-core from 1.3.0 to 1.4.0 in /docs/sphinx (#1327) | 2024-06-06 22:38:26 -07:00 |
| example | [CK_TILE] fmha forward split-kv + combine kernels (#1338) | 2024-06-26 17:41:15 +08:00 |
| include | [CK_TILE] fmha forward split-kv + combine kernels (#1338) | 2024-06-26 17:41:15 +08:00 |
| library | Add instances of grouped convolution 3d forward with a ConvScale element-wise op for bf8@bf8->fp8 (#1326) | 2024-06-21 19:02:57 -06:00 |
| profiler | Switch to universal gemm in grouped gemm tile loop (#1335) | 2024-06-18 09:01:49 -05:00 |
| python/ck4inductor | Make the library which generates CK instances for pytorch2 inductor's CK backend usage | 2024-05-22 13:44:22 -07:00 |
| script | Code clean-up (#1285) | 2024-05-10 09:41:39 -07:00 |
| test | Fix cmake warnings (#1342) | 2024-06-21 09:47:58 +02:00 |
| .clang-format | start adding convolution | 2018-10-08 22:49:58 -05:00 |
| .clang-tidy | ROCm 6.0 replaces all `__HIP_PLATFORM_HCC__` with `__HIP_PLATFORM_AMD__` (#1106) | 2023-12-19 07:16:49 -08:00 |
| .gitignore | introducing ck_tile! (#1216) | 2024-04-15 19:27:12 -05:00 |
| .pre-commit-config.yaml | [HotFix] add config and version files to pass on build info (#856) | 2023-08-23 11:36:17 -07:00 |
| .readthedocs.yaml | Update documentation requirements and configurations (#1272) | 2024-04-30 20:44:59 -07:00 |
| CHANGELOG.md | update the changelog for ROCm6.1 release (#1205) | 2024-03-18 10:16:45 -07:00 |
| CITATION.cff | Switch from ROCmSoftwarePlatform to ROCm org (#1091) | 2023-12-07 15:59:34 -08:00 |
| CMakeLists.txt | Remove gfx900 and gfx906 from default target device to reduce package size (#1351) | 2024-06-19 11:47:18 -07:00 |
| Config.cmake.in | Split the static library into several files. (#1044) | 2023-11-28 11:17:37 -08:00 |
| CONTRIBUTORS.md | Update the list of contributors. (#836) | 2023-08-09 13:44:13 -07:00 |
| dev-requirements.txt | upgrade the ccache version and update links (#1169) | 2024-02-15 15:46:01 -08:00 |
| Dockerfile | Upgrade to ROCm6.1 and turn on the -enable-post-misched=0 compiler flag. (#1250) | 2024-04-18 11:10:23 -05:00 |
| Jenkinsfile | disable the hipTensor test by default, only run once daily (#1321) | 2024-06-03 14:07:30 -07:00 |
| LICENSE | Randyh docfix (#1130) | 2024-01-16 09:00:37 -08:00 |
| pyproject.toml | Make the library which generates CK instances for pytorch2 inductor's CK backend usage | 2024-05-22 13:44:22 -07:00 |
| rbuild.ini | Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88) | 2022-03-03 16:59:42 -06:00 |
| README.md | fix typo (#1067) | 2023-12-14 14:21:18 -08:00 |
| requirements.txt | Fix device instance library to include all instances (#418) | 2022-09-23 13:30:18 -05:00 |

README.md

Composable Kernel

The Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads across multiple architectures (GPUs, CPUs, etc.). The CK library uses general purpose kernel languages, such as HIP C++.

CK uses two concepts to achieve performance portability and code maintainability:

* A tile-based programming model
* Algorithm complexity reduction for complex machine learning (ML) operators. This uses an innovative technique called Tensor Coordinate Transformation.

The current CK library is structured into four layers:

* Templated Tile Operators
* Templated Kernel and Invoker
* Instantiated Kernel and Invoker
* Client API

General information

To build our documentation locally, use the following code:

```
cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```

You can find a list of our developers and contributors on our Contributors page.

If you use CK, cite us as follows:

* [Realizing Tensor Operators Using Coordinate Transformations and Tile Based Programming](???):
  This paper will be available on arXiv soon.
* [CITATION.cff](/CITATION.cff)

CK is released under the MIT license.

Building CK

We recommend building CK inside Docker containers, which include all necessary packages. Pre-built Docker images are available on DockerHub.

To build a new Docker image, use the Dockerfile provided with the source code:
```
DOCKER_BUILDKIT=1 docker build -t ck:latest -f Dockerfile .
```

Launch the Docker container:

```
docker run                                     \
-it                                            \
--privileged                                   \
--group-add sudo                               \
-w /root/workspace                             \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace  \
ck:latest                                      \
/bin/bash
```

Clone CK source code from the GitHub repository and start the build:

```
git clone https://github.com/ROCm/composable_kernel.git && \
cd composable_kernel && \
mkdir build && \
cd build
```

You must set the GPU_TARGETS CMake variable to specify the GPU target architecture(s) you want to run CK on. You can specify one or more architectures. If you specify several, separate them with semicolons; for example, gfx908;gfx90a;gfx940.

```
cmake                                                                                             \
-D CMAKE_PREFIX_PATH=/opt/rocm                                                                    \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                                                         \
-D CMAKE_BUILD_TYPE=Release                                                                       \
-D GPU_TARGETS="gfx908;gfx90a"                                                                    \
..
```

If you don't set GPU_TARGETS on the cmake command line, CK is built for all GPU targets supported by the current compiler (this may take a long time).

Build the entire CK library:
```
make -j
```
Install CK:
```
make -j install
```

Optional post-install steps

Build examples and tests:
```
make -j examples tests
```
Build and run all examples and tests:
```
make -j check
```
You can find instructions for running each individual example in example.
Build ckProfiler:
```
make -j ckProfiler
```
You can find instructions for running ckProfiler in profiler.

Note the -j option for building with multiple jobs in parallel, which speeds up the build significantly. Depending on the number of CPU cores and the amount of RAM on your system, you may want to limit the number of jobs. When -j is given without a number, GNU make places no limit on how many jobs it runs at once, which can cause the build to run out of memory and crash. For example, if you have a 128-core CPU and 64 GB of RAM, you can limit the build to 32 jobs by using -j32.
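One way to pick a job limit is to cap it by both core count and memory. The sketch below is only an illustration: ck_jobs is a hypothetical helper that is not part of CK, and the default budget of 2 GiB of RAM per job is an assumption, chosen here because it reproduces the -j32 figure for the 128-core, 64 GB example.

```shell
# Hypothetical helper (not part of CK): choose a job count limited by
# both the number of CPU cores and the amount of RAM.
# Usage: ck_jobs <cores> <ram_gib> [gib_per_job]
# The default of 2 GiB per job is an assumed budget, not a CK recommendation.
ck_jobs() {
    cores=$1
    ram_gib=$2
    gib_per_job=${3:-2}

    # How many jobs fit in memory under the assumed per-job budget.
    by_ram=$(( $ram_gib / $gib_per_job ))
    if [ "$by_ram" -lt 1 ]; then
        by_ram=1
    fi

    # Use whichever limit is smaller.
    if [ "$by_ram" -lt "$cores" ]; then
        echo "$by_ram"
    else
        echo "$cores"
    fi
}

ck_jobs 128 64   # prints 32
```

On a Linux host, the result could be passed to make as make -j"$(ck_jobs "$(nproc)" 64)", with 64 replaced by the RAM in GiB of your machine. If builds still run out of memory, raise the per-job budget through the third argument.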

Additional cmake flags can be used to significantly speed up the build:

* INSTANCES_ONLY (default is OFF) must be set to ON to build only the instances and library, skipping all tests, examples, and the profiler. This is useful when you plan to use CK as a dependency and don't plan to run any examples or tests.
* DTYPES (default is not set) can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build instances of the selected data types only. The main default data types are fp32 and fp16; you can safely skip other data types.
* DL_KERNELS (default is OFF) must be set to ON to build instances such as gemm_dl or batched_gemm_multi_d_dl. These instances are useful on architectures such as NAVI2x; most other platforms have faster instances available, such as xdl or wmma.
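As a sketch of how these flags combine with the configure step, the helper below prints a cmake command line for a library-only build restricted to chosen targets and data types. ck_configure_cmd is a hypothetical name, not part of CK. It only prints the command and does not run cmake; the paths are the ones used in the configure command earlier in this README.

```shell
# Hypothetical helper (not part of CK): print a configure command for a
# library-only build. It prints the command; it does not run cmake.
# Usage: ck_configure_cmd <gpu_targets> <dtypes>
#   gpu_targets: semicolon-separated list, e.g. "gfx908;gfx90a"
#   dtypes:      semicolon-separated subset of fp64;fp32;fp16;fp8;bf16;int8
ck_configure_cmd() {
    targets=$1
    dtypes=$2

    printf 'cmake'
    # printf repeats the format once per argument, giving one -D per flag.
    printf ' -D %s' \
        "CMAKE_PREFIX_PATH=/opt/rocm" \
        "CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc" \
        "CMAKE_BUILD_TYPE=Release" \
        "GPU_TARGETS=\"$targets\"" \
        "INSTANCES_ONLY=ON" \
        "DTYPES=\"$dtypes\""
    printf ' ..\n'
}

# Example values only: two targets, and the two main data types.
ck_configure_cmd "gfx908;gfx90a" "fp32;fp16"
```

The printed line can be reviewed and then pasted into a shell inside the build directory. The semicolon-separated lists are printed inside double quotes so that the shell does not treat the semicolons as command separators. To also build the DL instances, -D DL_KERNELS=ON would be appended in the same way.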

Using sccache for building

The default CK Docker images come with a pre-installed version of sccache, which supports clang used as the HIP compiler ("-x hip"). Using sccache can help reduce the time to rebuild code from hours to 1-2 minutes. To use sccache, first start the server:

 sccache --start-server

then add the following flags to the cmake command line:

 -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache

You may need to clean up the build folder and repeat the cmake and make steps in order to take advantage of sccache during subsequent builds.

Using CK as a pre-built kernel library

You can find instructions for using CK as a pre-built kernel library in client_example.

Contributing to CK

When you contribute to CK, make sure you run clang-format on all changed files. We highly recommend using git hooks that are managed by the pre-commit framework. To install hooks, run:

sudo script/install_precommit.sh

With this approach, pre-commit adds the appropriate hooks to your local repository and automatically runs clang-format (and possibly additional checks) before any commit is created.

If you need to uninstall hooks from the repository, you can do so by running the following command:

script/uninstall_precommit.sh

If you need to temporarily disable pre-commit hooks, you can add the --no-verify option to the git commit command.

Description

[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror

Languages

C++ 91.3%

Python 6.1%

CMake 1.6%

Shell 0.5%

Pawn 0.2%

Other 0.1%