- Implemented optimized her framework calls for double-precision complex numbers.
- The zher kernel operates over 4 columns at a time. It first computes the diagonal elements of the matrix, then the 4x4 triangular part, and finally the remaining part as 4x4 tiles of the matrix up to m rows.
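The traversal order described above can be sketched in scalar C. This is only an illustration, not the vectorized BLIS kernel: the function name zher_lower_sketch, the lower-triangular column-major storage, and the handling of leftover rows and columns are assumptions made for the example.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major index helper (hypothetical, for this sketch only). */
static size_t idx(int i, int j, int lda)
{
    return (size_t)i + (size_t)j * (size_t)lda;
}

/* Hypothetical scalar sketch: A := A + alpha * x * x^H on the lower
   triangle of a column-major m x m matrix, walked 4 columns at a time. */
void zher_lower_sketch(int m, double alpha, const double complex *x,
                       double complex *a, int lda)
{
    assert(m >= 0 && lda >= (m > 0 ? m : 1));
    int j = 0;
    for (; j + 4 <= m; j += 4) {
        /* Step 1: diagonal elements of the panel; the result is kept real. */
        for (int d = j; d < j + 4; ++d) {
            double re = creal(x[d]), im = cimag(x[d]);
            a[idx(d, d, lda)] = creal(a[idx(d, d, lda)]) + alpha * (re * re + im * im);
        }
        /* Step 2: strictly lower part of the 4x4 triangular block. */
        for (int c = j; c < j + 4; ++c)
            for (int r = c + 1; r < j + 4; ++r)
                a[idx(r, c, lda)] += alpha * x[r] * conj(x[c]);
        /* Step 3: rows below the block in 4x4 tiles; the last tile may be
           shorter than 4 rows (an assumption of this sketch). */
        for (int i = j + 4; i < m; i += 4) {
            int rows = (m - i < 4) ? m - i : 4;
            for (int c = j; c < j + 4; ++c)
                for (int r = i; r < i + rows; ++r)
                    a[idx(r, c, lda)] += alpha * x[r] * conj(x[c]);
        }
    }
    /* Fewer than 4 columns left: plain scalar loop (assumption). */
    for (; j < m; ++j) {
        double re = creal(x[j]), im = cimag(x[j]);
        a[idx(j, j, lda)] = creal(a[idx(j, j, lda)]) + alpha * (re * re + im * im);
        for (int i = j + 1; i < m; ++i)
            a[idx(i, j, lda)] += alpha * x[i] * conj(x[j]);
    }
}
```

Only the lower triangle is written; the strictly upper part of A is left untouched.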
AMD-Internal: [CPUPL-2151]
Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
- Implemented an OpenMP-based standalone SGEMV kernel for
the row-major case (var 1) for multithreaded scenarios.
- Smart threading is enabled when AOCL_DYNAMIC is defined.
- The number of threads is chosen by smart threading based on
the input dimensions.
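A minimal sketch of the idea, not the kernel from this change: a row-major SGEMV whose row loop is parallelized with OpenMP, with the thread count picked from the input dimensions. The names sgemv_rowmajor_sketch and pick_num_threads and the thresholds (16384 elements, 128 rows per thread) are hypothetical; the actual smart-threading rules are not given in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical heuristic: stay single-threaded for small problems,
   otherwise use roughly one thread per 128 rows, capped at max_threads. */
static int pick_num_threads(int m, int n, int max_threads)
{
    long work = (long)m * n;
    if (work < 16384 || max_threads <= 1) return 1;
    int nt = m / 128;
    if (nt < 1) nt = 1;
    if (nt > max_threads) nt = max_threads;
    return nt;
}

/* y := alpha * A * x + beta * y, with A stored row-major (m x n, leading
   dimension lda). Each row is an independent dot product, so rows can be
   split across threads without synchronization. */
void sgemv_rowmajor_sketch(int m, int n, float alpha, const float *a, int lda,
                           const float *x, float beta, float *y, int max_threads)
{
    assert(m >= 0 && n >= 0 && lda >= n);
    int nt = pick_num_threads(m, n, max_threads);
    /* Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for num_threads(nt) schedule(static)
    for (int i = 0; i < m; ++i) {
        const float *row = a + (size_t)i * lda;
        float dot = 0.0f;
        for (int j = 0; j < n; ++j)
            dot += row[j] * x[j];
        /* When beta is zero, y is overwritten without being read. */
        y[i] = (beta == 0.0f) ? alpha * dot : alpha * dot + beta * y[i];
    }
}
```

Build with the compiler's OpenMP flag (for example -fopenmp with GCC) to get the multithreaded path.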
AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
All AMD-specific optimizations in BLIS are enclosed in the BLIS_CONFIG_EPYC
preprocessor macro. This macro was not defined in the CMake build, which
resulted in lower overall performance.
Updated version number to 3.1.1
Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
- Implemented her2 framework calls for transposed and
non-transposed kernel variants.
- The dher2 kernel operates over 4 columns at a time. It computes
the 4x4 triangular part of the matrix first; the remainder is
computed in chunks of 4x4 tiles up to m rows.
- Remainder cases (m < 4) are handled serially.
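The blocking described above might look as follows in scalar C. This is a sketch under assumptions, not the optimized kernel: it assumes lower-triangular column-major storage, and the names dher2_lower_sketch and upd are hypothetical. For real doubles the her2 update reduces to A := A + alpha * (x * y^T + y * x^T).

```c
#include <assert.h>
#include <stddef.h>

/* Element update shared by all phases:
   a(i,j) += alpha * (x[i]*y[j] + y[i]*x[j]). */
static void upd(double *a, int lda, int i, int j, double alpha,
                const double *x, const double *y)
{
    a[(size_t)i + (size_t)j * (size_t)lda] += alpha * (x[i] * y[j] + y[i] * x[j]);
}

void dher2_lower_sketch(int m, double alpha, const double *x, const double *y,
                        double *a, int lda)
{
    assert(m >= 0 && lda >= (m > 0 ? m : 1));
    int j = 0;
    for (; j + 4 <= m; j += 4) {
        /* 4x4 triangular block on the diagonal (diagonal included). */
        for (int c = j; c < j + 4; ++c)
            for (int r = c; r < j + 4; ++r)
                upd(a, lda, r, c, alpha, x, y);
        /* Full 4x4 tiles below the triangular block. */
        int i = j + 4;
        for (; i + 4 <= m; i += 4)
            for (int c = j; c < j + 4; ++c)
                for (int r = i; r < i + 4; ++r)
                    upd(a, lda, r, c, alpha, x, y);
        /* Fewer than 4 rows left: handled one row at a time. */
        for (; i < m; ++i)
            for (int c = j; c < j + 4; ++c)
                upd(a, lda, i, c, alpha, x, y);
    }
    /* Fewer than 4 columns left: handled serially. */
    for (; j < m; ++j)
        for (int i = j; i < m; ++i)
            upd(a, lda, i, j, alpha, x, y);
}
```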
AMD-Internal: [CPUPL-1968]
Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
Details:
- The axpyf-based implementation incurs function call overhead from
its calls to axpyf.
- The new implementation reduces this function call overhead.
- It uses a kernel of size 8x4.
- It performs better than the axpyf-based implementation for smaller
sizes.
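To illustrate the trade-off, the sketch below contrasts the two structures on a gemv-like update y := y + alpha * A * x. The message does not name the operation, so the gemv-like form, the column-major layout, the 4-column panel width, and all function names are assumptions made for the example. Variant A calls an axpyf-style helper once per column panel; variant B does the same arithmetic in a single routine with an inlined 8x4 tile loop.

```c
#include <assert.h>
#include <stddef.h>

/* axpyf-style helper (hypothetical): y += alpha * A[:, 0:b] * x[0:b]. */
static void axpyf_sketch(int m, int b, double alpha, const double *a, int lda,
                         const double *x, double *y)
{
    for (int j = 0; j < b; ++j)
        for (int i = 0; i < m; ++i)
            y[i] += alpha * a[(size_t)i + (size_t)j * lda] * x[j];
}

/* Variant A: one helper call per 4-column panel. */
void gemv_via_axpyf(int m, int n, double alpha, const double *a, int lda,
                    const double *x, double *y)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 0 ? m : 1));
    for (int j = 0; j < n; j += 4) {
        int b = (n - j < 4) ? n - j : 4;
        axpyf_sketch(m, b, alpha, a + (size_t)j * lda, lda, x + j, y);
    }
}

/* Variant B: a single routine with the 8x4 tile loop inlined, so no
   per-panel function call is made. Edge tiles are simply clipped. */
void gemv_fused_8x4(int m, int n, double alpha, const double *a, int lda,
                    const double *x, double *y)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 0 ? m : 1));
    for (int j = 0; j < n; j += 4) {
        int b = (n - j < 4) ? n - j : 4;
        for (int i = 0; i < m; i += 8) {
            int r = (m - i < 8) ? m - i : 8;
            for (int jj = 0; jj < b; ++jj)
                for (int ii = 0; ii < r; ++ii)
                    y[i + ii] += alpha
                        * a[(size_t)(i + ii) + (size_t)(j + jj) * lda] * x[j + jj];
        }
    }
}
```

For small m and n the fixed cost of each helper call is a larger share of the total work, which is consistent with the gain being reported for smaller sizes.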
AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
Details:
- The axpyf-based implementation incurs function call overhead from
its calls to axpyf.
- The new implementation reduces this function call overhead.
- It uses a kernel of size 4x4.
- It performs better than the axpyf-based implementation for smaller
sizes.
AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable BLAS and CBLAS
framework optimizations for zen family configurations.
- The macro must be defined in the family.h file of each respective
arch config.
- Moved zen2-specific optimized kernels to the zen folder so that they
are accessible to all zen family architectures.
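As a rough illustration of how such a macro can gate an optimized path, consider the sketch below. Only the macro name BLIS_CONFIG_EPYC comes from this change; the functions ddot_entry, dot_reference, and dot_zen_optimized are hypothetical and do not correspond to actual BLIS symbols.

```c
#include <assert.h>
#include <stddef.h>

/* In a real build the macro would come from the configuration's family.h:
       #define BLIS_CONFIG_EPYC
   It is left undefined here, so the generic path is the one compiled. */

/* Generic path, always available. */
static double dot_reference(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

#ifdef BLIS_CONFIG_EPYC
/* Stand-in for an architecture-specific kernel: 4-way unrolled dot product. */
static double dot_zen_optimized(int n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
#endif

/* Framework-level entry point: the optimized path is selected at compile
   time and only exists when the macro is defined. */
double ddot_entry(int n, const double *x, const double *y)
{
    assert(n >= 0);
#ifdef BLIS_CONFIG_EPYC
    return dot_zen_optimized(n, x, y);
#else
    return dot_reference(n, x, y);
#endif
}
```

If the macro is missing from a build, the code still compiles and gives correct results but silently takes the generic path.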
Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d