Po Yen Chen 313e570496 Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380)
* Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle

* Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface

* Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle

* Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'

* Add 'TransformConvFwdToGemm<>' utility class (from Chao)

* Use 'TransformConvFwdToGemm<>' to shorten code

* Fix ill-formed method declaration

* Re-implement MakeRGridDescriptor_M() function

* Change problem description

* Use macro to define layout types

* Define K-reduced output tensor layout types

* Let user decide R output tensor layout

* Rename variables

* Add padding to the reduced output tensor if necessary

* Extract common code as helper method

* Remove debug message

* Add missing include directive

* Add partial fp16 Conv + Reduction example

* Add example verification code for 2D Conv problem

* Use type alias to simplify code

* Share code across different-dimension Conv problems

* Rename file/functions from run_conv_fwd* to run_convnd_fwd*

* Make example code more verbose

* Add code to support 1D & 3D Conv + Reduction on host

* Add more examples for data type: bf16, fp32

* Add example for int8

* Add custom target to group examples

* Use more general custom target name

* Change the description in error message

* Disable testing for example other than fp32

* Add example for int4 (just copy from int8)

* Fix wrong data type

* Use larger data type for intermediate tensors

* Finish int4 example

* Undefine macro PP_DEFINE_LAYOUT_TYPE() after use

* Use named variables to replace magic numbers

* Remove debug messages

* Use same A/B data type for host Conv in int4 example

* Add check for the 'RLayout' type argument

* Group same-dim-layouts together in 'LayoutSetting<>'

* Add 'final' specifier to utility classes

* Use different initialization method for examples

* Remove macro PP_DEFINE_LAYOUT_TYPE()

* Fix code-comment mismatch

* Use more reasonable initialization value for all data types

* Default use init_method=1 for all examples

* Remove never-used code

* Remove confusing out-of-date comments

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

[ROCm/composable_kernel commit: 46a675aa6f]
2022-08-31 16:32:17 -05:00

Docker script

docker run                                     \
-it                                            \
--privileged                                   \
--group-add sudo                               \
-w /root/workspace                             \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace  \
rocm/tensorflow:rocm5.1-tf2.6-dev              \
/bin/bash

Install a newer version of rocm-cmake

https://github.com/RadeonOpenCompute/rocm-cmake

Build

mkdir build && cd build
# Specify the target ID(s); the example below targets gfx908 and gfx90a
cmake                                                                 \
-D BUILD_DEV=OFF                                                      \
-D CMAKE_BUILD_TYPE=Release                                           \
-D CMAKE_CXX_FLAGS=" --offload-arch=gfx908 --offload-arch=gfx90a -O3" \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                             \
-D CMAKE_PREFIX_PATH=/opt/rocm                                        \
-D CMAKE_INSTALL_PREFIX=${PATH_TO_CK_INSTALL_DIRECTORY}               \
..

Build and Run Examples

 make -j examples

Instructions for running each individual example are under example/

Tests

 make -j examples tests
 make test

Build ckProfiler

 make -j ckProfiler

Instructions for running ckProfiler are under profiler/

Install CK

make install

Using CK as pre-built kernel library

Instructions for using CK as a pre-built kernel library are under client_example/

Caveat

Kernel Timing and Verification

CK's own kernel timer warms up the kernel once and then runs it multiple times to get the average kernel time. For kernels that use atomic add, this causes the output buffer to be accumulated multiple times, which makes verification fail. To work around this, do not enable CK's own timer and verification at the same time. Both the timer and verification can be enabled or disabled from the command line in each example and in ckProfiler.
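
The accumulation effect can be illustrated with a small host-side toy model. This is only a sketch, not CK code: `atomic_add_kernel` and `run_with_timer` are hypothetical stand-ins for a GPU kernel that uses atomic add and for a timer loop, and the repeat count of 10 is an assumed value chosen for illustration.

```python
# Toy model of the caveat (not CK code): a "kernel" that accumulates into its
# output buffer, the way a GPU kernel using atomic add would.

def atomic_add_kernel(inputs, output):
    """Add every input element into output[0] (stand-in for an atomic add)."""
    for x in inputs:
        output[0] += x


def run_with_timer(kernel, inputs, output, nrepeat):
    """Hypothetical timer loop: one warm-up launch plus nrepeat timed launches."""
    kernel(inputs, output)        # warm-up launch
    for _ in range(nrepeat):      # timed launches; the buffer is never cleared
        kernel(inputs, output)


inputs = [1, 2, 3, 4]
expected = sum(inputs)            # host reference result

# Timing enabled: 1 warm-up + 10 timed launches all add into the same buffer.
timed_output = [0]
run_with_timer(atomic_add_kernel, inputs, timed_output, nrepeat=10)

# Timing disabled: a single launch into a zeroed buffer.
single_output = [0]
atomic_add_kernel(inputs, single_output)

print("timed run matches reference: ", timed_output[0] == expected)
print("single run matches reference:", single_output[0] == expected)
```

In this model the timed run holds eleven times the reference value, so a comparison against the host result fails even though each individual launch is correct. The single launch matches the reference, which is why verification should be run with the timer disabled.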

Description
[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror