Fusion Conv+Bias+ReLU(+Add) (#62)

* fix relu

* clean up

* clean up

* adding 1x1 conv

* adding 1x1 conv

* added 1x1 conv

* refactor

* refactor

* refactor

* added profiler for conv+bias+relu+add

* clean up

* adding conv+bias+relu

* adding conv+bias+relu

* added conv+bias+relu

* Update README.md

* update cpu verification

* adding c shuffle

* update static_tensor for dealing with invalid element

* adding c shuffle

* debugging

* fix bug

* convert to fp16 before shuffle

* shuffle more than one M/NRepeat

* clean up

* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1

* clean up

* remove coordinate step hack from all gridwise gemm xdl

* clean up coordinate step hack

* clean up coordinate step hack

* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst

* adding output shuffle in conv+bias+relu+add

* update

* added conv+bias+relu+add with c shuffle

* added conv+bias+relu+add with c shuffle

* fix forward_sweep bugs in threadwise copy

* clean up

* refactor

* clean up

* clean up

* added conv_c_shuffle+bias_relu

* clean up

* added conv+bias+relu+atomic_add

* clean up

* clean up

* clean up

* clean up

* clean up

* clean up

* misc fixes; add 1x1 specialization

* clean up

* delete unused device op

* clean up

* add support for odd C value

[ROCm/composable_kernel commit: acbd7bd7c5]
Chao Liu
2021-12-26 08:43:42 -06:00
committed by GitHub
parent 370a49bb29
commit 886680ae94
90 changed files with 13347 additions and 2639 deletions

@@ -1,4 +1,3 @@
-#include <boost/range/adaptor/transformed.hpp>
 #include <cassert>
 #include "host_tensor.hpp"
@@ -26,8 +25,12 @@ std::size_t HostTensorDescriptor::GetElementSize() const
 std::size_t HostTensorDescriptor::GetElementSpace() const
 {
-    auto ls = mLens | boost::adaptors::transformed([](std::size_t v) { return v - 1; });
-    return std::inner_product(ls.begin(), ls.end(), mStrides.begin(), std::size_t{0}) + 1;
+    std::size_t space = 1;
+    for(int i = 0; i < mLens.size(); ++i)
+    {
+        space += (mLens[i] - 1) * mStrides[i];
+    }
+    return space;
 }
 const std::vector<std::size_t>& HostTensorDescriptor::GetLengths() const { return mLens; }
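
The GetElementSpace hunk above computes the number of elements a strided tensor spans in memory: 1 + sum over i of (len[i] - 1) * stride[i], which is the offset of the last element plus one. A minimal standalone sketch of that formula follows. The free function `element_space` and its parameters are hypothetical names, not part of the repository, and the sketch assumes every length is at least 1. It indexes with `std::size_t` rather than `int`, which is a deviation from the diff made here only to avoid a signed/unsigned comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free-function version of the element-space computation.
// Assumes lens and strides have the same rank and every length is >= 1.
std::size_t element_space(const std::vector<std::size_t>& lens,
                          const std::vector<std::size_t>& strides)
{
    assert(lens.size() == strides.size());

    // Start at 1: the element at offset 0 always occupies one slot.
    std::size_t space = 1;
    for(std::size_t i = 0; i < lens.size(); ++i)
    {
        // The largest index along dimension i is lens[i] - 1.
        space += (lens[i] - 1) * strides[i];
    }
    return space;
}
```

For a packed 2x3x4 tensor with strides {12, 4, 1} this yields 24, the same as the element count. For a 2x3 tensor whose rows are padded to a stride of 4, it yields 7, which is larger than the 6 elements actually stored.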