* make sure cmake can handle xnack targets
* don't build xdl instances for gfx906:xnack-
* don't build xdl tests for gfx906:xnack-
[ROCm/composable_kernel commit: b6e74be1aa]
Before, generate.py appended the list at the end of the output file.
When running the cmake configuration steps multiple times on the
examples, the blob list (such as fwd_blob_list.txt) would grow at every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.
Now, generate.py overwrites the content of the output file.
The workaround in the CMakeLists.txt is no longer needed,
and the issue is solved for the example projects too.
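
The difference between the old and new behavior can be sketched as follows. This is a minimal illustration, not the code in generate.py: the function names, the blob names, and the helper `line_count_after_runs` are hypothetical, and only the file open mode ("a" versus "w") reflects the change described above.

```python
import tempfile
from pathlib import Path


def write_blob_list_append(path, blobs):
    # Old behavior (assumed): append mode, so every run adds the list again.
    with open(path, "a") as f:
        for blob in blobs:
            f.write(blob + "\n")


def write_blob_list_overwrite(path, blobs):
    # New behavior (assumed): write mode truncates the file first.
    with open(path, "w") as f:
        for blob in blobs:
            f.write(blob + "\n")


def line_count_after_runs(writer, runs):
    # Simulate running the cmake configuration step `runs` times.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "fwd_blob_list.txt"
        for _ in range(runs):
            writer(out, ["kernel_a.cpp", "kernel_b.cpp"])
        return len(out.read_text().splitlines())
```

With the append variant the list grows by two lines per configuration run, while the overwrite variant always leaves exactly two lines, which is why removing the file beforehand is no longer necessary.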
[ROCm/composable_kernel commit: 464abd235e]
* Calculate generic relative threshold pool3dfwd
* Calculate absolute error threshold pool3d fwd
* Generic threshold calculation takes max input for relative error pool3dfwd
* Remove max possible value for error calculation at runtime
* Remove debug print in pool3dfwd
* Pool3d fwd adjusted types in generic threshold calculation
* Generic threshold calculation takes into account number of accumulations and accdatatype
* Generic threshold fix final error formula
* Generic threshold calculation - num of accs fix
* Generic threshold calculation - adjust absolute error
* Generic threshold calculation - OutDataType in absolute error
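
The bullets above mention a relative and an absolute threshold that depend on the data types and the number of accumulations. One way such a calculation could look is sketched below. It is an assumption-laden illustration, not the formula used in composable_kernel: it assumes a half-ULP rounding error per type, an accumulation error that grows linearly with the number of accumulations, and an absolute threshold obtained by scaling with the largest expected value. The names `MANTISSA_BITS`, `half_ulp`, `relative_threshold`, and `absolute_threshold` are hypothetical.

```python
# Explicit mantissa bits per type (standard IEEE / bfloat16 layouts).
MANTISSA_BITS = {"fp32": 23, "fp16": 10, "bf16": 7}


def half_ulp(dtype):
    # Assumed model: worst-case rounding error is half a unit in the last place.
    return 0.5 * 2.0 ** -MANTISSA_BITS[dtype]


def relative_threshold(compute_dtype, out_dtype, acc_dtype, num_accumulations=1):
    compute_err = half_ulp(compute_dtype)
    output_err = half_ulp(out_dtype)
    # Assumed: rounding errors in the accumulator add up linearly.
    acc_err = half_ulp(acc_dtype) * num_accumulations
    return max(compute_err, output_err, acc_err)


def absolute_threshold(max_value, compute_dtype, out_dtype, acc_dtype,
                       num_accumulations=1):
    # Assumed: scale the relative bound by the largest magnitude expected.
    return abs(max_value) * relative_threshold(
        compute_dtype, out_dtype, acc_dtype, num_accumulations)
```

Under these assumptions a pooling window with few accumulations is dominated by the output type's precision, and the accumulator term only takes over when the window is very large.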
[ROCm/composable_kernel commit: 9385caa306]
* The draft on ckProfiler instance add
* support the ck profiler instance with same data types
* add a small feature on the M and N variable switch.
* Partially solve the incorrect result problem
* fix based on ci cd
[ROCm/composable_kernel commit: 560917b161]
* Add a gpu gemm reference kernel
* Switch to gpu reference in gemm examples
* Remove redundant arguments
* Update all related examples
* Update more examples
* Try less threads per block
* Try even less threads per block
* Add support for all matrix layouts
* Increase block size
* Clean up
* Remove hardcoded strides
* Clean up
* Try a column-major case
* Revert back to row-major
* Run both CPU and GPU verification
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: aa932445ea]
* update build logic with GPU_ARCHS
* fix the GPU_ARCHS build for codegen
* unset GPU_TARGETS when GPU_ARCHS are set
[ROCm/composable_kernel commit: 7d8ea5f08b]
* Add additional instances to device_mha_instance
* Add comment to describe what receipt 3 option filters
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: 3528a523ff]
* Legacy support: customized filesystem
* Update cmakefile for python alternative path
* fix build issues
* CK has no boost dependency
* More fixes to issues found on legacy systems
* fix clang format issue
* Check if blob is correctly generated in cmake
* fix the python issues
* add a compiler flag for codegen when using alternative python
* use target_link_options instead of target_compile_options
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
[ROCm/composable_kernel commit: 81bc1496b2]
* revert ckprofiler change
* temp save
* Add test and test pass
* test pass
* Fix bug inside rotating buffer when tensor is not packed
* bug fix
* clang format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: 5b10dae6a4]
* locate a newer version of python when -DRHEL=ON flag is set
* allow setting python version on cmake command line
[ROCm/composable_kernel commit: 841009c5ee]
* Set RNE fp8 conversion as a default
* Update f8 tests
* Disable failing test on gfx11
* Update bf8 tests
* Add a flag
* Fix the flag
* Raise flag for gfx10 as well
* Temp commit for tolerance testing
* Update tolerances
[ROCm/composable_kernel commit: e20f20efbf]
* re-enable fp8 and bf8 for all targets
* restore the fp8 gemm instances
* re-enable conv_3d fp8 on all architectures
* disable several fp8 gemm instances on all architectures except gfx94
* clang format fix
[ROCm/composable_kernel commit: c8b6b64240]
This fixes two issues that occur when compiling with libc++.
The first issue is an attempt to call std::numeric_limits<ranges::range_value_t<_Float16>>::min().
_Float16 is an extension of libstdc++; it does not exist in the C++ standard[2].
Luckily, there is a NumericLimits class in composable_kernel, which does everything needed.
The second issue is that the call to 'check_err' is ambiguous: there are 2 candidates.
This happens because composable_kernel relies on the idea that f8_t (defined as _BitInt(8)) does not satisfy the is_integral trait.
However, libc++ treats _BitInt(N) as integral (per the standard, "any implementation-defined extended integer types" can be integral).
Closes: #1460
Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
[ROCm/composable_kernel commit: 50c423481b]
* adding mha as static lib
* add fmha fwd compile options
* typo
* fix python version
* python version to 3
* increase path length
* add max path flag in mha cmake
* fix long path issue
* mha currently only runs in gfx94x
* only build mha in mi300
* populate gpu_list
* add mha compile flags
* avoid building mha on GPUs other than gfx94x
* some comments and include ck_tile in rocm
* use rocm_install
* place ck_tile in include
* correct ck_tile path
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: 840c5397bb]
* Support 64 bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances large tensor
* Fixes for transform conv to gemm
* Fixes
* fixes
* Remove unneeded instances
* examples fixes
* Remove unneeded ds arrays
* Fix tests
* Add 2GB check in gridwise dl
* Fixes
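
The idea behind a 2GB check can be sketched as follows. This is an illustration only, not the check added to the gridwise kernel: it assumes the limit comes from signed 32-bit byte offsets, that every length is at least 1, and the names `TWO_GB`, `element_space_size`, and `needs_64bit_indexing` are hypothetical.

```python
# Assumed limit: largest byte range reachable with a signed 32-bit offset.
TWO_GB = 2 ** 31


def element_space_size(lengths, strides):
    # Elements spanned by a strided tensor, including gaps between rows.
    # Assumes every length is >= 1.
    return 1 + sum((n - 1) * s for n, s in zip(lengths, strides))


def needs_64bit_indexing(lengths, strides, element_bytes):
    # True when the tensor's byte span no longer fits in 32-bit offsets.
    return element_space_size(lengths, strides) * element_bytes > TWO_GB
```

For example, a packed 1024 x 1024 x 1024 tensor of 2-byte elements spans exactly 2GB and still fits under this assumption, while the same shape with 4-byte elements does not and would need the large-tensor path.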
[ROCm/composable_kernel commit: 4ec5c52a0c]
* add --offload-compress compiler flag
* only apply the --offload-compress flag to the ckProfiler
* move the --offload-compress flag back to main cmake file
* add offload-compress to target compile option of ckProfiler
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
[ROCm/composable_kernel commit: 7f57b2e02c]