* re-enable fp8 and bf8 for all targets
* restore the fp8 gemm instances
* re-enable conv_3d fp8 on all architectures
* disable several fp8 gemm instances on all architectures except gfx94
* clang format fix
* Check compiler flags before using them
The user's compiler may not support these flags, so check.
Resolves failures on Fedora.
Signed-off-by: Tom Rix <trix@redhat.com>
* fix syntax in CMakeLists.txt
Fix syntax in the check_cxx_compiler_flag call.
---------
Signed-off-by: Tom Rix <trix@redhat.com>
Co-authored-by: Tom Rix <trix@redhat.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
This fixes two issues when compiling with libc++.
The first issue is an attempt to call std::numeric_limits<ranges::range_value_t<_Float16>>::min().
A std::numeric_limits specialization for _Float16 is a libstdc++ extension; _Float16 does not exist in the C++ standard[2].
Luckily, composable_kernel has a NumericLimits class that does everything needed.
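The general idea of a project-owned limits trait can be sketched as follows, assuming a simplified shape: the primary template forwards to std::numeric_limits for standard types, and an explicit specialization covers a non-standard 16-bit float. The names half_t, Min() and Max() are hypothetical and are not taken from composable_kernel's actual NumericLimits.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical 16-bit float storage type, standing in for a non-standard
// type such as _Float16 that std::numeric_limits may not know about.
struct half_t {
    std::uint16_t bits;
};

// Project-owned limits trait: for standard types, forward to std::numeric_limits.
template <typename T>
struct NumericLimits {
    static constexpr T Min() { return std::numeric_limits<T>::min(); }
    static constexpr T Max() { return std::numeric_limits<T>::max(); }
};

// Explicit specialization for the non-standard type, so no code path needs
// std::numeric_limits<half_t> (which the standard library does not provide).
template <>
struct NumericLimits<half_t> {
    // Smallest positive normal fp16 value, 2^-14 (same meaning as numeric_limits<float>::min()).
    static constexpr half_t Min() { return half_t{0x0400}; }
    // Largest finite fp16 value, 65504.
    static constexpr half_t Max() { return half_t{0x7BFF}; }
};
```

Because the trait belongs to the project, its behavior does not depend on which standard library (libstdc++ or libc++) the code is built against.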
The second issue is that the call to 'check_err' is ambiguous: there are two candidates.
This happens because composable_kernel relies on the assumption that f8_t (defined as _BitInt(8)) does not satisfy the is_integral trait.
However, libc++ treats _BitInt(N) as integral (per the standard, "any implementation-defined extended integer types" can be integral).
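The ambiguity can be reproduced in a few lines, assuming a simplified model. Here `signed char` stands in for f8_t because, like _BitInt(8) under libc++, it satisfies std::is_integral. The two check_err overloads are hypothetical and much simpler than composable_kernel's real ones. Excluding f8_t from the integral overload is one common way to remove such an ambiguity; the commit message does not say whether this is the fix it applies.

```cpp
#include <type_traits>

// Hypothetical stand-in for f8_t: an 8-bit type that is_integral accepts.
using f8_t = signed char;

// Reports which overload was chosen (for illustration only).
enum class Overload { Integral, F8 };

// Overload for genuine integer types. Without the !is_same_v<T, f8_t> guard,
// check_err(f8_t, f8_t) would match this template AND the one below, and
// neither is more specialized, so the call would be ambiguous.
template <typename T>
constexpr std::enable_if_t<std::is_integral_v<T> && !std::is_same_v<T, f8_t>, Overload>
check_err(T /*out*/, T /*ref*/)
{
    return Overload::Integral;
}

// Overload dedicated to the 8-bit float storage type.
template <typename T>
constexpr std::enable_if_t<std::is_same_v<T, f8_t>, Overload>
check_err(T /*out*/, T /*ref*/)
{
    return Overload::F8;
}
```

With the guard in place, `check_err(f8_t{1}, f8_t{1})` selects the f8 overload, and `check_err(3, 3)` selects the integral one.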
Closes: #1460
Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
* enable CI build and test on gfx1201
* skip DL kernels in CI for gfx12
* only run CI on gfx12 if rocm version >= 6.2
* remove the rocm version check for CI on gfx12
* add a switch for CI builds on gfx12
* run ck_tile benchmarks after the smoke tests and store logs
* change the path of fmha benchmark logs
* change the way of stashing ck_tile fmha logs
* prevent the errors in stages where no logs are generated
* fix the ck_tile fmha log names and headers
* generate the fmha performance logs in the root folder
* change jenkins script arguments format
* use exact file names for stashing
* modify scripts to process FMHA performance results
* unstash FMHA logs before parsing them
* add mha as a static lib
* add fmha fwd compile options
* typo
* fix python version
* python version to 3
* increase path length
* add max path flag in mha cmake
* fix long path issue
* mha currently only runs in gfx94x
* only build mha in mi300
* populate gpu_list
* add mha compile flags
* avoid building mha on GPUs other than gfx94x
* some comments and include ck_tile in rocm
* use rocm_install
* place ck_tile in include
* correct ck_tile path
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Support 64 bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances large tensor
* Fixes for transform conv to gemm
* Fixes
* fixes
* Remove unneeded instances
* example fixes
* Remove unneeded ds arrays
* Fix tests
* Add 2GB check in gridwise dl
* Fixes
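The "2GB check" above can be illustrated with a small helper, assuming the limit comes from holding byte offsets in a signed 32-bit integer: the largest such offset is 2^31 - 1, so a buffer may span at most 2^31 bytes. The function fits_in_2gb is hypothetical and is not the check used in the gridwise dl code; it only shows the arithmetic that decides when a 64-bit-indexing kernel is needed.

```cpp
#include <cstdint>
#include <initializer_list>

// 2 GiB: the largest buffer whose every byte offset fits in a signed 32-bit int.
constexpr std::int64_t kTwoGB = std::int64_t{1} << 31;

// Hypothetical helper: true if a dense tensor with these lengths and element
// size stays within 2 GiB; false means the kernel would need 64-bit offsets
// (or the problem must be split into smaller pieces).
constexpr bool fits_in_2gb(std::initializer_list<std::int64_t> lengths, std::int64_t elem_bytes)
{
    std::int64_t bytes = elem_bytes;
    for (std::int64_t len : lengths)
        bytes *= len; // assumes the product itself does not overflow int64
    return bytes <= kTwoGB;
}
```

For example, an fp16 tensor of shape 2 x 1024 x 1024 x 512 occupies exactly 2^31 bytes and still fits, while doubling the first dimension does not.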
* add --offload-compress compiler flag
* only apply the --offload-compress flag to the ckProfiler
* move the --offload-compress flag back to main cmake file
* add offload-compress to target compile option of ckProfiler
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
This is a standard option in Fedora packaging, used to check
the correctness of C++ code's use of the standard C++ library.
Signed-off-by: Tom Rix <trix@redhat.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* add ab_scale init support
* enable interwave
* add scale type; update isSupport
* adjust example
* clean
* enable f8 pure gemm rcr ckprofiler
* Add gemm_multiply_multiply instances
* clang format
* Optimize for ScaleBlockMNK=128
* enable abscale f8 gemm ck profiler
* Add pure f8 gemm test suite
* Reverting to the state of project at f60fd77
* update copyright
* clang format
* update copyright
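The log does not define "ab_scale" or "ScaleBlockMNK". One common reading is blockwise scaling: A and B are stored in fp8, and each ScaleBlockM x ScaleBlockK tile of A and each ScaleBlockK x ScaleBlockN tile of B carries one float scale factor. Assuming that reading, a host-side reference GEMM might look like the following. It uses float inputs in place of decoded fp8 values, assumes row-major A (M x K) and B (K x N), and every name in it is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference for a blockwise-scaled GEMM:
//   C[m][n] = sum over K blocks kb of
//             (sum_{k in kb} A[m][k] * B[k][n]) * a_scale[m / sbm][kb] * b_scale[kb][n / sbn]
std::vector<float> gemm_ab_scale(const std::vector<float>& a, const std::vector<float>& b,
                                 const std::vector<float>& a_scale, const std::vector<float>& b_scale,
                                 std::size_t M, std::size_t N, std::size_t K,
                                 std::size_t sbm, std::size_t sbn, std::size_t sbk)
{
    assert(M % sbm == 0 && N % sbn == 0 && K % sbk == 0);
    const std::size_t kblocks = K / sbk;   // scale blocks along K
    const std::size_t nblocks = N / sbn;   // scale blocks along N
    assert(a_scale.size() == (M / sbm) * kblocks && b_scale.size() == kblocks * nblocks);

    std::vector<float> c(M * N, 0.f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.f;
            for (std::size_t kb = 0; kb < kblocks; ++kb) {
                // Unscaled partial dot product over one K block.
                float partial = 0.f;
                for (std::size_t k = kb * sbk; k < (kb + 1) * sbk; ++k)
                    partial += a[m * K + k] * b[k * N + n];
                // Apply the A-tile and B-tile scales for this K block.
                acc += partial * a_scale[(m / sbm) * kblocks + kb] * b_scale[kb * nblocks + n / sbn];
            }
            c[m * N + n] = acc;
        }
    return c;
}
```

Scaling each K block's partial sum separately, rather than the final result, is what lets every tile use its own scale; with a single K block this reduces to ordinary per-tile scaling.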
---------
Co-authored-by: root <jizhan@amd.com>
* init for reduce_threadwise multi_d
* add reduce_threadwise_multi_d
* add reduce_multi_d
* clean
* start adding another splitk device op
* add reduce template parameter to SplitKBatchOffset
* add reduce c matrix
* clean up code
* change example data type to bf16
* add bf16Ai8B example
* remove reduce template parameter
* add splitk atomic status to v4
* example add multi d parameters
* device op add multi-d parameters
* add multi-d to reduce
* fix kbatch=1 bug
* change B layout to col in bf16Ai8B example
* remove float adding struct
* change multi-d interface
* change file and class name
* remove multi-d of bf16Ai8B example
* change IsReduce function to IsReduceAdd
* change example layout to RRR from RCR
* set ds stride according to layout
* reset parameter layout
* add gemm universal reduce instance
* add reduce factory
* add profile_gemm_universal_reduce
* add reduce to profiler
* fix reduce instance
* fix profiler reduce compiling bug
* format
* format library instance code
* add mem instance for reduce library
* fix call instance names
* add workspace for reduce in ckProfiler
* format
* add mnpading to reduce library instance
* add fp16 instance to reduce of profiler
* change copyright year
* restore profiler cmake file
* add reduce text to instances
* add DsLayout and DsDataType to instances template parameter
* fixed gemm_reduce_multi_d
* add an example without multi_d
* Update common.hpp
* Update gtest.cmake
* Update gemm_xdl_splitk_reduce_bf16.cpp
* clean
* Update gtest.cmake
* format
* fix api
* format
* default parameter change to RRR
* add vector_len for multi_d
* format
* Update gtest.cmake
* fix bf16Ai8B elementwise op
* add ReduceDataType
* move ReduceDataType to end position
* format
* remove googletest git method address
* fix copyright year
* update init data
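The split-K-plus-reduce flow in these commits can be modeled on the host, assuming a simplified scheme: K is split into kbatch chunks, each chunk writes its partial M x N product into its own workspace slot, and a separate reduce pass sums the slots and applies an elementwise epilogue with one extra D tensor (standing in for the multi-D inputs). Row-major layouts and all names are assumptions for this sketch, not the device op's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host model of split-K GEMM followed by a reduction:
//   E = (sum over kbatch slots of A[:, chunk] * B[chunk, :]) + D0
std::vector<float> splitk_gemm_reduce(const std::vector<float>& a, const std::vector<float>& b,
                                      const std::vector<float>& d0,
                                      std::size_t M, std::size_t N, std::size_t K, std::size_t kbatch)
{
    assert(kbatch >= 1 && K % kbatch == 0);
    const std::size_t k_per_batch = K / kbatch;

    // Stage 1: one independent partial GEMM per K chunk, each into its own slot.
    std::vector<float> workspace(kbatch * M * N, 0.f);
    for (std::size_t s = 0; s < kbatch; ++s)
        for (std::size_t m = 0; m < M; ++m)
            for (std::size_t n = 0; n < N; ++n) {
                float acc = 0.f;
                for (std::size_t k = s * k_per_batch; k < (s + 1) * k_per_batch; ++k)
                    acc += a[m * K + k] * b[k * N + n];
                workspace[(s * M + m) * N + n] = acc;
            }

    // Stage 2: reduce the kbatch slots and apply the D-tensor epilogue.
    std::vector<float> e(M * N);
    for (std::size_t i = 0; i < M * N; ++i) {
        float sum = 0.f;
        for (std::size_t s = 0; s < kbatch; ++s)
            sum += workspace[s * M * N + i];
        e[i] = sum + d0[i];
    }
    return e;
}
```

Because each K chunk writes to a separate slot, this model needs no atomic adds; the cost is an extra kbatch x M x N workspace and a second pass. With kbatch = 1 the reduce pass simply copies the single slot and adds D0.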
---------
Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>