* add ninja trace to CI builds
* fix ninja trace logic
* update the ninja trace logic in the Jenkinsfile
* limit the number of threads to run ninja build
* use ninja for installation after build
* update the path to ninjatracing tool
* use ninja to run check when using build trace
* fix Jenkins logic
* fix typos
* set proper setup_args for all stages
* fix ninja syntax
* replace ninja check with ninja test
* enable ninja tracing with mainline and staging compilers
* Enable CMakePresets build
* Verify Convolution, Scaling and ReLU algorithms.
* Add tensor element-wise scale and type cast operation.
* Reduction implemented but does not work.
* Exploration of Reduction functionality.
* Completed example for Convolution scaled with ReLU activation and AMAX reduction (AMAX semantics are sketched after this list).
* WIP: Add required instances for convolution.
* WIP: Create client example. Implement convolution stage.
* Add elementwise instances.
* Add elementwise scale + convert example.
* Add reduction instances.
* WIP: Client example for AMAX reduction.
* WIP: Add instances for multistage reduction.
* WIP: Implementation of multistage reduction.
* Refactoring.
* Clean up.
* Add CMakePresets.json
* Guard FP8 instances when the data type is not available.
* Add example for Scaled FP8 Convolution with AMAX reduction.
* Refactor CombConvScaleRelu instances.
* Add CombConvScale instances.
* Add client example for Scaled FP8 Convolution with AMAX reduction.
* Cleanup.
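For context, AMAX here denotes the maximum absolute value over a tensor, typically used to derive FP8 quantization scales. A minimal host-side sketch of the reduction's semantics (not CK's device implementation; the function name is illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Reference semantics of an AMAX reduction: the largest absolute value in
// the (scaled, ReLU-activated) convolution output.
float amax_reduce(const std::vector<float>& tensor)
{
    float result = 0.f;
    for(float x : tensor)
        result = std::max(result, std::fabs(x));
    return result;
}
```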
* Set RNE fp8 conversion as the default
* Update f8 tests
* Disable failing test on gfx11
* Update bf8 tests
* Add a flag
* Fix the flag
* Raise flag for gfx10 as well
* Temp commit for tolerance testing
* Update tolerances
* re-enable fp8 and bf8 for all targets
* restore the fp8 gemm instances
* re-enable conv_3d fp8 on all architectures
* disable several fp8 gemm instances on all architectures except gfx94
* clang-format fix
* Check compiler flags before using
The user's compiler may not support these flags, so check.
Resolves failures on Fedora.
Signed-off-by: Tom Rix <trix@redhat.com>
* fix syntax in CMakeLists.txt
Fix the syntax of the check_cxx_compiler_flag call (a sketch of the pattern follows this entry).
---------
Signed-off-by: Tom Rix <trix@redhat.com>
Co-authored-by: Tom Rix <trix@redhat.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
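A minimal sketch of the flag-check pattern described above, using a hypothetical flag name (the actual flags live in the project's CMakeLists.txt):

```cmake
include(CheckCXXCompilerFlag)

# Probe the compiler once; the result is cached in HAS_EXAMPLE_FLAG.
check_cxx_compiler_flag("-fexample-flag" HAS_EXAMPLE_FLAG)
if(HAS_EXAMPLE_FLAG)
    add_compile_options(-fexample-flag)
endif()
```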
This fixes two issues when compiling with libc++.
The first issue is an attempt to call std::numeric_limits<ranges::range_value_t<_Float16>>::min().
_Float16 is an extension in libstdc++; it does not exist in the C++ standard[2].
Luckily, there is a NumericLimits class in composable_kernel which does everything needed.
The second issue is that the call to 'check_err' is ambiguous: there are two candidates.
It happens because composable_kernel relies on the idea that f8_t (defined as _BitInt(8)) does not pass the is_integral trait.
However, libc++ treats _BitInt(N) as integral (per the standard, "any implementation-defined extended integer types" may be integral); a reduced illustration follows below.
Closes: #1460
Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
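A reduced illustration of the overload ambiguity, assuming illustrative check_err signatures rather than CK's real declarations:

```cpp
#include <type_traits>

using f8_t = _BitInt(8); // CK's fp8 storage type (Clang extension)

// Illustrative overloads only; CK's real check_err signatures differ.
template <typename T, std::enable_if_t<std::is_integral_v<T>, bool> = true>
bool check_err(const T& out, const T& ref); // integer comparison path

template <typename T, std::enable_if_t<std::is_same_v<T, f8_t>, bool> = true>
bool check_err(const T& out, const T& ref); // fp8 comparison path

// libstdc++: is_integral_v<f8_t> is false, so only the fp8 overload matches.
// libc++:    _BitInt(8) is an extended integer type, is_integral_v<f8_t> is
//            true, so both overloads are viable and the call is ambiguous.
```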
* enable CI build and test on gfx1201
* skip DL kernels in CI for gfx12
* only run CI on gfx12 if rocm version >= 6.2
* remove the rocm version check for CI on gfx12
* add a switch for CI builds on gfx12
* run ck_tile benchmarks after the smoke tests and store logs
* change the path of fmha benchmark logs
* change the way of stashing ck_tile fmha logs
* prevent errors in stages where no logs are generated
* fix the ck_tile fmha log names and headers
* generate the fmha performance logs in the root folder
* change the Jenkins script arguments format
* use exact file names for stashing
* modify scripts to process FMHA performance results
* unstash FMHA logs before parsing them
* adding mha as static lib
* add fmha fwd compile options
* typo
* fix python version
* python version to 3
* increase path length
* add max path flag in mha cmake
* fix long path issue
* mha currently only runs on gfx94x
* only build mha on MI300
* populate gpu_list
* add mha compile flags
* avoid building mha on GPUs other than gfx94x (the gate is sketched below)
* add some comments and include ck_tile in ROCm
* use rocm_install
* place ck_tile in include
* correct ck_tile path
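A hypothetical CMake sketch of the gfx94x gate described in these commits; the variable and source names are assumptions, not the repository's actual ones:

```cmake
# Build the static mha library only when a gfx94x (MI300-class) target is
# requested; rocm_install comes from the ROCm CMake build tools and wraps
# the standard install() signature.
if(GPU_TARGETS MATCHES "gfx94")
    add_library(mha STATIC ${MHA_SOURCES}) # MHA_SOURCES is hypothetical
    rocm_install(TARGETS mha)
endif()
```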
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Support 64-bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances for large tensors
* Fixes for the conv-to-gemm transform
* Fixes
* Fixes
* Remove unneeded instances
* Fix examples
* Remove unneeded Ds arrays
* Fix tests
* Add 2GB check in gridwise DL (the size check is sketched after this list)
* Fixes
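A hypothetical helper showing the kind of size check these commits describe: once a tensor's byte size exceeds what a signed 32-bit offset can address, dispatch to the 64-bit-indexing kernel (names are illustrative):

```cpp
#include <cstdint>

// True when offsets into the buffer no longer fit a signed 32-bit index,
// i.e. the tensor crosses the 2 GB line and needs the large-tensor kernel.
bool needs_64bit_indexing(std::int64_t num_elements, std::int64_t bytes_per_element)
{
    return num_elements * bytes_per_element > INT32_MAX;
}
```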
* add --offload-compress compiler flag
* only apply the --offload-compress flag to the ckProfiler
* move the --offload-compress flag back to the main CMake file
* add --offload-compress to the target compile options of ckProfiler (sketched below)
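The end state is likely along these lines: --offload-compress (a Clang option that compresses the embedded device code bundles, which matters most for a fat binary like ckProfiler) scoped to the one target rather than applied globally:

```cmake
# Scope the flag to ckProfiler instead of the whole project.
target_compile_options(ckProfiler PRIVATE --offload-compress)
```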
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>