mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 10:09:41 +00:00

Anthony Chang e078585f04 Fused attention instances & padding tests (#395 )

* modify comment

* trim unnecessary check

* add gemm spec in kernel name

* add TNTT gemm_gemm + atten kernel instances

* refactor attention padding to better fit in unit tests

This streamlines usage: "ResetNaNToMinusInf" is now hidden from the user-facing device op.
Also added compile-time conditionals that load OOB values as NaN only after padding is enabled

* add adhoc padding test for atten

* shrink input value range for attention kernel validation to avoid occasional error by 1e-3

Still unsure whether this kind of deterministic floating point accuracy issue is expected
or not. May want to try exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around ~1e-3.

* attention kernel proper granular padding for all 4 dims

* IsSupportedArgument checks

* test more padded cases

* block PadK specialization in attention kernels

* workaround clang crash for gfx908

(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
this falls back to a less ideal way of handling NPadding in the fused attention kernel

* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected

[ROCm/composable_kernel commit: 868e5c555b]

2022-09-06 14:38:56 -05:00

client_example

Softmax client example (#396 )

2022-09-06 12:22:48 -05:00

cmake

int4 data type (#364 )

2022-08-18 14:53:47 -05:00

example

Fused attention instances & padding tests (#395 )

2022-09-06 14:38:56 -05:00

include/ck

Fused attention instances & padding tests (#395 )

2022-09-06 14:38:56 -05:00

library

Fused attention instances & padding tests (#395 )

2022-09-06 14:38:56 -05:00

profiler

Fused attention instances & padding tests (#395 )

2022-09-06 14:38:56 -05:00

script

Fixed splitk gemm fp32 (#384 )

2022-08-26 09:59:50 -05:00

test

Fused attention instances & padding tests (#395 )

2022-09-06 14:38:56 -05:00

.clang-format

start adding convolution

2018-10-08 22:49:58 -05:00

.clang-tidy

add tidy

2021-08-08 17:41:54 +00:00

.gitignore

layernorm external api (#379 )

2022-08-24 18:43:43 -05:00

CMakeLists.txt

Add an option to build CK with clang directly (#387 )

2022-08-26 12:51:39 -05:00

Config.cmake.in

Add host API (#220 )

2022-05-12 09:21:01 -05:00

dev-requirements.txt

Initial Setup for CI (#86 )

2022-02-18 21:44:11 -06:00

Dockerfile

Fix QA, allow switching compiler versions, fix google test compilation error. (#348 )

2022-08-08 13:49:14 -05:00

Jenkinsfile

Add an option to build CK with clang directly (#387 )

2022-08-26 12:51:39 -05:00

LICENSE

update license (#297 )

2022-06-23 01:27:30 -05:00

rbuild.ini

Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88 )

2022-03-03 16:59:42 -06:00

README.md

Clean up conv example, Instances, profiler and test (#324 )

2022-07-29 18:19:25 -05:00

requirements.txt

Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88 )

2022-03-03 16:59:42 -06:00

README.md

Docker script

docker run                                     \
-it                                            \
--privileged                                   \
--group-add sudo                               \
-w /root/workspace                             \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace  \
rocm/tensorflow:rocm5.1-tf2.6-dev              \
/bin/bash

Install newer version of rocm-cmake

https://github.com/RadeonOpenCompute/rocm-cmake

Build

mkdir build && cd build

# Specify the target ID(s); the example below builds for gfx908 and gfx90a
cmake                                                                 \
-D BUILD_DEV=OFF                                                      \
-D CMAKE_BUILD_TYPE=Release                                           \
-D CMAKE_CXX_FLAGS=" --offload-arch=gfx908 --offload-arch=gfx90a -O3" \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                             \
-D CMAKE_PREFIX_PATH=/opt/rocm                                        \
-D CMAKE_INSTALL_PREFIX=${PATH_TO_CK_INSTALL_DIRECTORY}               \
..

Build and Run Examples

 make -j examples

Instructions for running each example are under example/

Tests

 make -j examples tests
 make test

Build ckProfiler

 make -j ckProfiler

Instructions for running ckProfiler are under profiler/

Install CK

make install

Using CK as a pre-built kernel library

Instructions for using CK as a pre-built kernel library are under client_example/

Caveat

Kernel Timing and Verification

CK's own kernel timer warms up the kernel once and then runs it multiple times to measure the average kernel time. For kernels that use atomic add, this causes the output buffer to be accumulated multiple times, which makes verification fail. To work around this, do not enable CK's own timer and verification at the same time. Both the timer and verification can be enabled or disabled from the command line in each example and in ckProfiler.
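
The effect can be reproduced with a minimal host-side sketch. This is not CK code: accumulate_kernel, run_with_timer and run_for_verification are hypothetical names, and a plain loop stands in for a GPU kernel that writes its result with atomic add. It only illustrates why a warm-up launch plus repeated timed launches on the same buffer produce an output that no longer matches a single-launch reference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel that writes its result with atomic add:
// it adds into `out` instead of overwriting it.
void accumulate_kernel(const std::vector<float>& in, std::vector<float>& out)
{
    assert(in.size() == out.size());
    for(std::size_t i = 0; i < in.size(); ++i)
        out[i] += in[i];
}

// Hypothetical timer loop: one warm-up launch followed by `timed_runs`
// launches, all accumulating into the same output buffer.
void run_with_timer(const std::vector<float>& in, std::vector<float>& out, int timed_runs)
{
    accumulate_kernel(in, out); // warm-up launch
    for(int r = 0; r < timed_runs; ++r)
        accumulate_kernel(in, out); // timed launches
}

// Single launch on a zero-initialized buffer, which is what verification
// against a host reference expects.
std::vector<float> run_for_verification(const std::vector<float>& in)
{
    std::vector<float> out(in.size(), 0.0f);
    accumulate_kernel(in, out);
    return out;
}
```

With 10 timed runs, every output element ends up at 11 times the single-launch value (1 warm-up plus 10 timed launches), so comparing that buffer against a reference fails even though the kernel itself is correct.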

Languages

C++ 93.1%

Python 4.5%

CMake 1.5%

Shell 0.5%

Pawn 0.2%