- Added a set of thresholds (based on input dimensions) that
determine and set the ideal number of threads to be used
for CGEMM on ZEN4 and ZEN5 architectures.
- The thread-setting logic is as follows:
- The underlying single-threaded kernels work on blocks
of MRxk of A, kxNR of B and MRxNR of C. Thus, the optimal
number of threads is initially assumed to be
ceil(m/MR)*ceil(n/NR), i.e. one thread per MRxNR tile of C.
This is the upper bound on the ideal thread count.
- The actual ideal thread count can be lower than this
upper bound, depending on the work that every thread receives.
This is mainly determined by the value of 'k'.
- If 'k' is small, the arithmetic intensity (AI) is low and
memory bandwidth becomes the limiting factor, which favors
smaller thread counts. In contrast, if 'k' is large, the AI
is high and the workload scales well with higher thread counts.
- So, we limit the number of threads when 'k' is small to avoid
bandwidth contention. Using fewer threads ensures each thread
gets more bandwidth, improving efficiency. In contrast, we allow
more threads when 'k' is large, as the workload becomes more
compute-bound and less limited by memory bandwidth, and therefore
benefits from a higher thread count.
- The new logic first sets the upper bound for the optimal number of threads
(based on the number of tiles), and then reduces it further based on the values
of 'm', 'n' and 'k'. This is part of the 'AOCL_DYNAMIC' feature for CGEMM,
specifically for ZEN4 and ZEN5 architectures.
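
The two-step logic described above can be sketched roughly as follows. This is an illustrative sketch only, not the code added by this change: the MR and NR values, the function names and the 'k' cutoffs are all hypothetical, and for brevity the reduction step here depends on 'k' alone, whereas the actual thresholds also take 'm' and 'n' into account.

```c
/* Hypothetical micro-tile dimensions; the real MR and NR depend on the
   kernel selected for the target architecture. */
#define MR 4
#define NR 12

/* Ceiling division for positive integers. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Step 1: upper bound, one thread per MR x NR tile of C. */
static long tile_upper_bound(long m, long n)
{
    return ceil_div(m, MR) * ceil_div(n, NR);
}

/* Step 2: cap based on 'k'. The cutoffs (16, 128) and caps (8, 32) are
   made-up illustrative values, not the thresholds used by the library. */
static long k_thread_cap(long k, long max_threads)
{
    if (k < 16)  return max_threads < 8  ? max_threads : 8;
    if (k < 128) return max_threads < 32 ? max_threads : 32;
    return max_threads;
}

/* Combine both steps: start from the tile count, then reduce it so that
   small-k (bandwidth-bound) problems use fewer threads. */
static long ideal_threads(long m, long n, long k, long max_threads)
{
    long nt  = tile_upper_bound(m, n);
    long cap = k_thread_cap(k, max_threads);
    if (nt > cap) nt = cap;
    return nt < 1 ? 1 : nt;
}
```

With these assumed values, a 10x24 problem yields at most 6 threads regardless of 'k' (3 x 2 tiles), while a 1000x1000 problem with 64 threads available is reduced to 8 threads for k = 8 and is allowed all 64 threads for k = 512.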
AMD-Internal: [CPUPL-6498]
Co-authored-by: Vignesh Balasubramanian <vignbala@amd.com>
Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
AOCL-BLAS library
AOCL-BLAS is AMD's optimized version of BLAS targeted for AMD EPYC and Ryzen CPUs. It is developed as a forked version of BLIS (https://github.com/flame/blis), which is developed by members of the Science of High-Performance Computing (SHPC) group in the Institute for Computational Engineering and Sciences at The University of Texas at Austin and other collaborators (including AMD). All known features and functionalities of BLIS are retained and supported in the AOCL-BLAS library. AOCL-BLAS is regularly updated with the improvements from the upstream repository.
AOCL-BLAS is optimized with the SSE2, AVX2 and AVX512 instruction sets, which are enabled based on the target Zen architecture using the dynamic dispatch feature. All prominent Level 3, Level 2 and Level 1 APIs are designed and optimized with specific paths targeting different size ranges, e.g., small, medium and large sizes. These algorithms are designed and customized to exploit the architectural improvements of the target platform.
For detailed instructions on how to configure, build, install, and link against AOCL-BLAS on AMD CPUs, please refer to the AOCL User Guide located on the AMD developer portal.
The upstream repository (https://github.com/flame/blis) contains further information on BLIS, including background information on BLIS design, usage examples, and a complete BLIS API reference.
AOCL-BLAS is developed and maintained by AMD. You can contact us by email at toolchainsupport@amd.com. You can also raise issues or suggestions on the GitHub repository at https://github.com/amd/blis/issues.