Commit Graph

8 Commits

Author SHA1 Message Date
Arnav Sharma
66b2231b65 Fixed CMake files for HER
- Removed subdirectory addition

Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1
2022-06-01 12:25:43 +05:30
Arnav Sharma
e5d5a43eab Optimized ZHER Implementation
- Implemented optimized her framework calls for double precision complex numbers.
- The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix upto m rows.

AMD-Internal: [CPUPL-2151]

Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
2022-05-25 14:03:01 +05:30
S, HariharaSudhan
a8bc55c373 Multithreaded SGEMV var 1 with smart threading
- Implemented an OpenMP based stand alone SGEMV kernel for
	  row-major (var 1) for multithread scenarios
	- Smart threading is enabled when AOCL DYNAMIC is defined
	- Number of threads are decided based on the input dims
	  using smart threading

AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
2022-05-17 18:10:39 +05:30
Dipal M Zambare
31921b9974 Updated windows build system to define BLIS_CONFIG_EPYC flag.
All AMD specific optimization in BLIS are enclosed in BLIS_CONFIG_EPYC
pre-preprocessor, this was not defined in CMake which are resulting in
overall lower performance.

Updated version number to 3.1.1

Change-Id: I9848b695a599df07da44e77e71a64414b28c75b9
2022-05-17 18:03:09 +05:30
Harsh Dave
351269219f Optimized dher2 implementation
- Impplemented her2 framework calls for transposed and non
  transposed kernel variants.

- dher2 kernel operate over 4 columns at a time. It computes
  4x4 triangular part of matrix first and remainder part is
  computed in chunk of 4x4 tile upto m rows.

- remainder cases(m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-01-05 05:51:15 -06:00
Nageshwar Singh
cbd9ea76af Complex single standalone gemv implementation independent of axpyf.
Details
- For axpyf implementation there are function(axpyf) calling overhead.
- New implementations reduces function calling overhead.
- This implementation uses kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
  compared to axpyf based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
2021-11-12 08:58:55 +05:30
Nageshwar Singh
a3d04a21a0 Complex double standalone gemv implementation independent of axpyf.
Details
- For axpyf implementation there are function(axpyf) calling overhead.
- New implementations reduces function calling overhead.
- This implementation uses kernel of size 4x4.
- This implementation gives better performance for smaller sizes when
  compared to axpyf based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
2021-11-12 08:58:54 +05:30
Meghana Vankadari
47744663d9 Enabling framework optimizations for zen family architectures.
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas
  framework optimizations for zen family configurations.
- The macro needs to be defined in family.h files of respective arch
  configs.
- Moved zen2-specific optimized kernels to zen folder, in order to be
  accessible to all zen family architectures.

Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d
2020-10-07 13:10:50 +05:30