Commit Graph

2657 Commits

Author SHA1 Message Date
Harish
77e8492cbd Added znver4 flag for config builds with AOCC 4.0 compiler version
Change-Id: I45f1031ed4c5ea2e3f594713f3821d6bbbecd4df
2022-06-28 02:47:48 -04:00
Chandrashekara K R
ff2ee0ae3f AOCL-WINDOWS: Added the windows build system to build bench folder on windows.
1. Added the checks in .c files of the bench folder to read the input parameters from the given input files on windows using fscanf.

Change-Id: Ie0497696304d318f345a646ab0ce3ba84debd4e2
2022-06-27 22:32:39 -04:00
Kiran Varaganti
47c344bc2e DGEMM Benchmark Optimizations
Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 dgemm kernel

Change-Id: I56b3df238b6d85a6f6861448c0c6f907c972146a
2022-06-27 09:32:39 -04:00
Dipal M Zambare
00c5048f65 Fixed high impact static analysis issues
Initialized ymm and xmm registers to zero to address
un-inilizaed variable errors reported in static analsys.

AMD-Internal: [CPUPL-2078]
Change-Id: Icfcc008a0f244278efd8145d7feef764ed5fcc04
2022-06-27 08:19:26 +00:00
satish kumar nuggu
aaf840d86e Disabled zgemm SUP path
- Need to identify new Thresholds for zgemm SUP path to avoid performance regression.

    AMD-Internal: [CPUPL-2148]

Change-Id: I0baa2b415dc5e296780566ba7450249445b93d43
2022-06-27 08:19:12 +00:00
satish kumar nuggu
13c71ca976 BugFix of AOCL_DYNAMIC in TRSM multithreaded small.
- Added initialization of rntm object before aocl_dynamic.
- Bugfixes in dtrsm right-side kernels,
  avoided accessing extra memory while using store for corner cases.

AMD-Internal: [CPUPL-2193] [CPUPL-2194]
Change-Id: I1c9d10edda93621626957d4de2f53d249ad531ba
2022-06-14 15:46:17 +05:30
mkadavil
e073e8b669 DAMAXV AXX512 micro kernel bug fix.
-DAMAXV AVX512 is giving wrong results when max element is present at
index in [n-u, n), where u < 32. This is a fallout of using wrong
start offset for the non-loop unrolled code.
-Functions for replacing NaN with negative numbers is replaced with
MACRO to avoid function call overhead and to remove static variables
used for stateful replacement numbers for NaN.

AMD-Internal: [CPUPL-2190]
Change-Id: Ie1435c38b264a271f869782793d0b52bbe6e1b2a
2022-06-13 10:52:53 +05:30
mkadavil
6c112632a7 Low precision gemm integrated as aocl_gemm addon.
- Multi-Threaded int8 GEMM (Input - uint8_t, int8_t, Output - int32_t).
AVX512_vnni based micro-kernel for int8 gemm. Paralellization supported
along m and n dimensions.
- Multi-Threaded B matrix reorder support for sgemm. Reordering B matrix
is packing entire B matrix upfront before sgemm. It allows sgemm to
take advantage of packed B matrix without incurring packing costs during
runtime.
- Makefile updates to addon make rules to compile avx512 code for
selected files in addon folder.
- CPU features query enhancements to check for AVX512_VNNI flag.
- Bench for int8 gemm and sgemm with B matrix reorder. Supports
performance mode for benchmarking and accuracy mode for testing code
correctness.

AMD-Internal: [CPUPL-2102]

Change-Id: I8fb25f5c2fbd97d756f95b623332cb29e3b8d182
2022-06-09 10:28:38 -04:00
Chandrashekara K R
c03699b97a AOCL-Windows: Updating windows build system to add /arch:AVX512 compiler flag only to zen4 specific source files.
Change-Id: Ia4fa65a831a00ce37f97075db6812be048bfe0bc
2022-06-09 14:32:54 +05:30
Dipal M Zambare
c87b9aab75 Added support for AVX512 for Windows and AMAVX
- Completed zen4 configuration support on windows
 - Enabled AVX512 kernels for AMAXV
 - Added zen4 configuration in amdzen for windows
 - Moved all zen4 kernels inside kernels/zen4 folder

AMD-Internal: [CPUPL-2108]
Change-Id: I9d2336998bbcdb8e2c4ca474977b5939bfa578ba
2022-06-08 11:09:48 +05:30
Dipal M Zambare
8cc15107ed Enabled AVX-512 kernels for Zen4 config
- Enabled AVX-512 skylake kernels in zen4 configuration.
    AVX-512 kernels are added for GEMM float and double types.

  - Enabled reference kernel for TRSM native path

AMD-Internal: [CPUPL-2108]
Change-Id: I66f3468346085c17183cbcbf4f2c8cfe07579b6f
2022-06-03 06:34:35 +00:00
Dipal M. Zambare
e61ec820f9 Fixed windows build issue for BLIS 4.0
- Removed extra AVX512 typedef from bli_amaxv_zen_int.c.

AMD-Internal: [CPUPL-2154]
Change-Id: Ieaa827c7d81b8d101f3a827827b99433570219f2
2022-06-02 17:52:43 +05:30
Arnav Sharma
66b2231b65 Fixed CMake files for HER
- Removed subdirectory addition

Change-Id: I419085db0b9034777409207a7d79b7ffa91eb8f1
2022-06-01 12:25:43 +05:30
Arnav Sharma
e5d5a43eab Optimized ZHER Implementation
- Implemented optimized her framework calls for double precision complex numbers.
- The zher kernel operates over 4 columns at a time. Initially, it computes the diagonal elements of the matrix, then the 4x4 triangular part is computed and finally the remaining part is computed as 4x4 tiles of the matrix upto m rows.

AMD-Internal: [CPUPL-2151]

Change-Id: I27430ee33ffb901b3ef4bdd97b034e3f748e9cca
2022-05-25 14:03:01 +05:30
Dipal M Zambare
6e2f536590 Removed Arch specific code from BLIS framework.
- Removed BLIS_CONFIG_EPYC macro
- The code dependent on this macro is handled in
  one of the three ways

  -- It is updated to work across platforms.
  -- Added in architecture/feature specific runtime checks.
  -- Duplicated in AMD specific files. Build system is updated to
      pick AMD specific files when library is built for any of the
     zen architecture

AMD-Internal: [CPUPL-1960]
Change-Id: I6f9f8018e41fa48eb43ae4245c9c2c361857f43b
2022-05-17 20:35:40 +05:30
Harsh Dave
f48ced0811 Optimized dher2 implementation
- Impplemented her2 framework calls for transposed and non
  transposed kernel variants.

- dher2 kernel operate over 4 columns at a time. It computes
  4x4 triangular part of matrix first and remainder part is
  computed in chunk of 4x4 tile upto m rows.

- remainder cases(m < 4) are handled serially.

AMD-Internal: [CPUPL-1968]

Change-Id: I12ae97b2ad673a7fd9b733c607f27b1089142313
2022-05-17 18:13:07 +05:30
Harsh Dave
718c6bc024 Optimized daxpy2v implementation
- Optimized axpy2v implementation for double
  datatype by handling rows in mulitple of 4
  and store the final computed result at the
  end of computation, preventing unnecessary
  stores for improving the performance.

- Optimal and reuse of vector registers for
  faster computation.

AMD-Internal: [CPUPL-1973]
Change-Id: I7b8ef94d0f67c1c666fdce26e9b2b7291365d2e9
2022-05-17 18:13:05 +05:30
Chandrashekara K R
43c16d8e08 AOCL-Windows: Updating the blis windows build system.
1. Removed the libomp.lib hardcoded from cmake scripts and made it user configurable. By default libomp.lib is used as an omp library.
2. Added the STATIC_LIBRARY_OPTIONS property in set_target_properties cmake command to link omp library to build static-mt blis library.
3. Updated the blis_ref_kernel_mirror.py to give support for zen4 architecture.

AMD-Internal: CPUPL-1630
Change-Id: I54b04cde2fa6a1ddc4b4303f1da808c1efe0484a
2022-05-17 18:12:51 +05:30
Dipal M Zambare
9c6c76613c Added support for zen4 architecture
- Added configuration option for zen4 architecture
  - Added auto-detection of zen4 architecture
  - Added zen4 configuration for all checks related
    to AMD specific optimizations

AMD-Internal: [CPUPL-1937]
Change-Id: I1a1a45de04653f725aa53c30dffb6c0f7cc6e39a
2022-05-17 18:12:49 +05:30
Harihara Sudhan S
0fddd9eb0d Level 1 Kernel: damaxv AVX512
Details:

- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
  to negative values based on the location
- Usage COMPARE256/COMPARE128 avoided in AVX512
  implementation for better performance
- Unrolled the loop by order of 4.

Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
2022-05-17 18:12:16 +05:30
Dipal M Zambare
8f9be1766b Disable AOCL_VERBOSE feature
- AOCL_VERBOSE implementation is causing breakage in libFLAME.
    Currently DTL code is duplicated in BLIS and libFLAME,
    Which results in duplicate symbol errors when DTL is enabled
    in both the libraries.

  - It will be addressed by making DTL as separate library.

  - The input logs can still be enabled by setting
    AOCL_DTL_LOG_ENABLE = 1 in aocldtlcf.h and recompiling
    the BLIS library.

AMD-Internal: [CPUPL-2101]
Change-Id: I8e69b68d53940e306a1d16ffbb65019def7e655a
2022-05-17 18:10:40 +05:30
mkadavil
31f8820bab Bug fixes for open mp based multi-threaded GEMM/GEMMT SUP path.
- auto_factor to be disabled if BLIS_IC_NT/BLIS_JC_NT is set
irrespective of whether num_threads (BLIS_NUM_THREADS) is modified at
runtime. Currently the auto_factor is enabled if num_threads > 0 and not
reverted if ic/jc/pc/jr/ir ways are set in bli_rntm_set_ways_from_rntm.
This results in gemm/gemmt SUP path applying 2x2 factorization of
num_threads, and thereby modifying the preset factorization. This issue
is not observed in native path since factorization happens without
checking auto_factor value.
- Setting omp threads to n_threads using omp_set_num_threads after the
global_rntm n_threads update in bli_thread_set_num_threads. This ensures
that in bli_rntm_init_from_global, omp_get_max_threads returns the same
value as set previously.

AMD-Internal: [CPUPL-2137]
Change-Id: I6c5de0462c5837cfb64793c3e6d49ec3ac2b6426
2022-05-17 18:10:40 +05:30
mkadavil
8670992c3d Default sgemv kernel to be used in single-threaded scenarios.
- sgemv calls a multi-threading friendly kernel whenever it is compiled
with open mp and multi-threading enabled. However it was observed that
this kernel is not suited for scenarios where sgemv is invoked in a
single-threaded context (eg: sgemv from ST sgemm fringe kernels and with
matrix blocking). Falling back to the default single-threaded sgemv
kernel resulted in better performance for this scenario.

AMD-Internal: [CPUPL-2136]
Change-Id: Ic023db4d20b2503ea45e56a839aa35de0337d5a6
2022-05-17 18:10:40 +05:30
Harsh Dave
349dcc459a Fixed scalapack xcsep failer due to cdotxv kernel.
-Failure was observed in zen configuration as gcc
flag safe-math-optimization was being used for reference
kernel compilation.
- Optmized kernels were being compiled without this gcc flag
resulted in computation difference resulting in test case
failure.

AMD-Internal: [CPUPL-2121]
Change-Id: I5d86e589cdea633220aecadbcab84d9b88b31f57
2022-05-17 18:10:40 +05:30
Dipal M Zambare
7247e6a150 Fixed crash issue in TRSM on non-avx platform.
- Ensured that FMA, AVX2 based kernels are called only on platforms
   supporting these instructions, otherwise standard ‘C’ kernels will
   be called.
 - Code cleanup for optimization and consistency

AMD-Internal: [CPUPL-2126]
Change-Id: I203270892b2fad2ccc9301fb55e2bae75508e050
2022-05-17 18:10:39 +05:30
Nallani Bhaskar
7658067107 Added AOCL Dynamic feature for dtrmm
Description:

1. Tuned number of threads to achive better performance
   for dtrmm

  AMD-Internal: [ CPUPL-2100 ]

Change-Id: Ib2e3df224ba76d86185721bef1837cd7855dd593
2022-05-17 18:10:39 +05:30
Dipal M Zambare
4ccb438c18 Updated Zen3 architecture detection for Ryzen 5000
- Added support to detect Ryzen 5000 Desktop and APUs

AMD-Internal: [CPUPL-2117]
Change-Id: I312a7de1a84cf368b74ba20e58192803a9f7dace
2022-05-17 18:10:39 +05:30
satish kumar nuggu
09b70de635 Performance Improvement for ztrsm small sizes
Details:
    - Handled Overflow and Underflow Vulnerabilites in
      ztrsm small right implementations.
    - Fixed failures observed in Scalapack testing.

    AMD-Internal: [CPUPL-2115]

Change-Id: I22c1ba583e0ba14d1a4684a85fa1ca6e152e8439
2022-05-17 18:10:39 +05:30
Nallani Bhaskar
2acb3f6ed0 Tuned aocl dynamic for specific range in dgemm
Description:

1. Decision logic to choose optimal number of threads for
   given input dgemm dimensions under aocl dynamic feature
   were retuned based on latest code.

2. Updated code in few file to avoid compilation warnings.

3. Added a min check for nt in bli_sgemv_var1_smart_threading
   function

AMD-Internal: [ CPUPL-2100 ]
Change-Id: I2bc70cc87c73505dd5d2bdafb06193f664760e02
2022-05-17 18:10:39 +05:30
mkadavil
a3836a560d Smart Threading for GEMM (sgemm) v1.
- Cache aware factorization.
Experiments shows that ic,jc factorization based on m,n gives better
results compared to mu,nu on a generic data set in SUP path. Also
slight adjustments in the factorizations w.r.t matrix data loads can
help in improving perf further.

- Moving native path inputs to SUP path.
Experiments shows that in multi-threaded scenarios if the per thread
data falls under SUP thresholds, taking SUP path instead of native path
results in improved performance. This is the case even if the original
matrix dimensions falls in native path. This is not applicable if A
matrix transpose is required.

- Enabling B matrix packing in SUP path.
Performance improvement is observed when B matrix is packed in cases
where gemm takes SUP path instead of native path based on per thread
matrix dimensions.

AMD-Internal: [CPUPL-659]

Change-Id: I3b8fc238a0ece1ababe5d64aebab63092f7c6914
2022-05-17 18:10:39 +05:30
S, HariharaSudhan
a8bc55c373 Multithreaded SGEMV var 1 with smart threading
- Implemented an OpenMP based stand alone SGEMV kernel for
	  row-major (var 1) for multithread scenarios
	- Smart threading is enabled when AOCL DYNAMIC is defined
	- Number of threads are decided based on the input dims
	  using smart threading

AMD-Internal: [CPUPL-1984]
Change-Id: I9b191e965ba7468e95aabcce21b35a533017502e
2022-05-17 18:10:39 +05:30
Dipal M Zambare
16de63c818 Updated version and copyright notice.
Changed AMD-BLIS version to 3.1.2

AMD-Internal: [CPUPL-2111]
Change-Id: Id8fc3fbc112f08bd5e5def646c472047352e65b5
2022-05-17 18:10:39 +05:30
Dave
963a6aa099 Enabled zgemm_sup path and removed sqp path
- Previously zgemm computation failures were due to
  status variable did not have pre-defined initial
  value which resulted in zgemm computation to return
  without being computed by any kernel. Reflected
  same change in dgemm_ function as well. 

- Enabled sup zgemm as the issue is fixed with
  status variable with bli_zgemm_small call.

-Removed calling sqp method as it is disabled





Change-Id: I0f4edfd619bc4877ebfc5cb6532c26c3888f919d
2022-05-17 18:10:39 +05:30
Dipal M Zambare
f23233eb4c Added runtime control for DTL logging Feature
The logs can be enabled with following two methods:
  -- Environment variable based control: The feature can be enabled
     by specifying environment variable AOCL_VERBOSE=1.
  -- API based control: Two API's will be added to enable/disable
     logging at runtime
     1. AOCL_DTL_Enable_Logs()
     2. AOCL_DTL_Disable_Logs()
  -- The API takes precedence over the environment settings.

AMD-Internal: [CPUPL-2101]
Change-Id: Ie71c1095496fae89226049c9b9f80b00400350d5
2022-05-17 18:10:39 +05:30
satish kumar nuggu
1a3428ddfc Parallelization of dtrsm_small routine
1. Parallelized dtrsm_small across m-dimension or n-dimension based on side(Left/Right).
2. Fine-tuning with AOCL_DYNAMIC to achieve better performance.

    AMD-Internal: [CPUPL-2103]

Change-Id: I6be6a2b579de7df9a3141e0d68bdf3e8a869a005
2022-05-17 18:10:39 +05:30
Chandrashekara K R
8e6da6b844 Added the checks to not defining the bool type for C++ code in windows to avoid redefinition build time errror.
AMD-Internal: [CPUPL-2037]
Change-Id: I065da9206ab06f60876324f258ee12fb9fe83f88
2022-05-17 18:10:39 +05:30
Harsh Dave
f17d043e1c Implemented optimal dotxv kernel
Details:
- Intrinsic implementation of zdotxv, cdotxv kernel
- Unrolling in multiple of 8, remaining corner
  cases are handled serially for zdotxv kernel
- Unrolling in multiple of 16, remainig corner
  cases are handled serially for cdotxv kernel
- Added declaration in zen contexts

AMD-Internal: [CPUPL-2050]
Change-Id: Id58b0dbfdb7a782eb50eecc7142f051b630d9211
2022-05-17 18:10:39 +05:30
Dipal M Zambare
e712ffe139 Added AOCL progress support for BLIS
-- AOCL libraries are used for lengthy computations which can go
     on for hours or days, once the operation is started, the user
     doesn’t get any update on current state of the computation.
     This (AOCL progress) feature enables user to receive a periodic
     update from the libraries.
  -- User registers a callback with the library if it is interested
     in receiving the periodic update.
  -- The library invokes this callback periodically with information
     about current state of the operation.
  -- The update frequency is statically set in the code, it can be
     modified as needed if the library is built from source.
  -- These feature is supported for GEMM and TRSM operations.

  -- Added example for GEMM and TRSM.
  -- Cleaned up and reformatted test_gemm.c and test_trsm.c to
     remove warnings and making indentation consistent across the
     file.

AMD-Internal: [CPUPL-2082]
Change-Id: I2aacdd8fb76f52e19e3850ee0295df49a8b7a90e
2022-05-17 18:10:39 +05:30
Harsh Dave
52e4fd0f11 Performance Improvement for ctrsm small sizes
Details:
- Enable ctrsm small implementation
- Handled Overflow and Underflow Vulnerabilites in
ctrsm small implementations.
- Fixed failures observed in libflame testing.
- For small sizes, ctrsm small implementation is
used for all variants.

Change-Id: I17b862dcb794a5af0ec68f585992131fef57b179
2022-05-17 18:10:39 +05:30
Arnav Sharma
caa5b37005 Optimized S/DCOMPLEX DOTXAXPYF using AVX2 Intrinsics
Details:
- Optimized implementation of DOTXAXPYF fused kernel for single and double precision complex datatype using AVX2 Intrinsics
- Updated definitions zen context

AMD-Internal: [CPUPL-2059]
Change-Id: Ic657e4b66172ae459173626222af2756a4125565
2022-05-17 18:10:39 +05:30
Sireesha Sanga
cc3069fb5e Performance Improvement for ztrsm small sizes
Details:
- Optimization of ztrsm for Non-unit Diag Variants.
- Handled Overflow and Underflow Vulnerabilites in
  ztrsm small implementations.
- Fixed failures observed in libflame testing.
- Fine-tuned ztrsm small implementations for specific
  sizes 64<= m,n <= 256, by keeping the number of
  threads to the optimum value, under AOCL_DYNAMIC flag.
- For small sizes, ztrsm small implementation is
  used for all variants.

AMD-Internal: [SWLCSG-1194]
Change-Id: I066491bb03e5cda390cb699182af4350ae60be2d
2022-05-17 18:10:39 +05:30
satish kumar nuggu
fe7f0a9085 Changes to enable zgemm small from BLAS Layer
1. Removed small gemm call from native path to avoid Single threaded
calls as a part of MultiThreaded scenarios.
2. SUP and INDUCED Method path disabled.
3. Added AOCL Dynamic for optimum number of threads to achieve higher
performance.

Change-Id: I3c41641bef4906bdbdb5f05e67c0f61e86025d92
2022-05-17 18:10:38 +05:30
Sireesha Sanga
9621ef3067 Performance Improvement for ztrsm small sizes
Details:
- Enable ztrsm small implementation
- For small sizes, Right Variants and Left Unit Diag
  Variants are using ztrsm_small implementations.
- Optimization of Left Non-Unit Diagonal Variants,
  Work In Progress

AMD-Internal: [SWLCSG-1194]
Change-Id: Ib3cce6e2e4ac0817ccd4dff4bb0fa4a23e231ca4
2022-05-17 18:09:22 +05:30
Harsh Dave
015bcb88d4 Fixed ztrsm computational failure
- Fixed memory access for edge cases such that
  all load are within memory boundary only.

- Corrected ztrsm utility APIs for dcomplex
  multiplication and division.

AMD-Internal: [CPUPL-2093]
Change-Id: Ib2c65e7921f6391b530cd20d6ea6b50f24bd705e
2022-05-17 18:09:22 +05:30
Harsh Dave
0976ed9ce5 Implement zgemm_small kernel
Details:
- Intrinsic implementation of zgemm_small nn kernel.
- Intrinsic implementation of zgemm_small_At kernel.
- Added support conjugate and hermitian transpose
- Main loop operates in multiple of 4x3 tile.
- Edge cases are handles separately.

AMD-Internal: [CPUPL-2084]
Change-Id: I512da265e4d4ceec904877544f1d15cddc147a66
2022-05-17 18:09:22 +05:30
Nallani Bhaskar
eb0ff01871 Fine-tuning dynamic threading logic of DGEMM for small dimensions
Description:
    1. For small dimensions single threads dgemm_small performing
       better than dgemmsup and native paths.
    2. Irrespecive of given number of threads we are redirecting
       into single thread dgemm_small

       AMD-Internal:[CPUPL-2053]

Change-Id: If591152d18282c2544249f70bd2f0a8cd816b94e
2022-05-17 18:09:22 +05:30
Chandrashekara K R
34fee0fdbc AOCL-Windows: Added logic in the windows build system to generate cblas.h at configure time.
AMD-Internal: [CPUPL-2037]

Change-Id: Ie4ffd1d655079c895878f96dbb6f811547ad953d
2022-05-17 18:09:22 +05:30
Sireesha Sanga
6a2c4acc66 Runtime Thread Control using OpenMP API
Details:
-  During runtime, Application can set the desired number of threads using
   standard OpenMP API omp_set_num_threads(nt).
-  BLIS Library uses standard OpenMP API omp_get_max_threads() internally,
   to fetch the latest value set by the application.
-  This value will be used to decide the number of threads in the subsequent
   BLAS calls.
-  At the time of BLIS Initialization, BLIS_NUM_THREADS environment variable
   will be given precedence, over the OpenMP standard API omp_set_num_threads(nt)
   and OMP_NUM_THREADS environment variable.
-  Order of precedence followed during BLIS Initialization is as follows
	1. Valid value of BLIS_NUM_THREADS
	2. omp_set_num_threads(nt)
	3. valid value of OMP_NUM_THREADS
	4. Number of cores
-  After BLIS initialization, if the Application issues omp_set_num_threads(nt)
   during runtime, number of threads set during BLIS Initialization,
   is overridden by the latest value set by the Application.
-  Existing precedence of BLIS_*_NT environment variables and the decision of
   optimal number of threads over the number of threads derived from the
   above process remains as it is.

AMD-Internal: [CPUPL-2076]

Change-Id: I935ba0246b1c256d0fee7d386eac0f5940fabff8
2022-05-17 18:09:22 +05:30
mkurumel
ab06f17689 DGEMMT : Tuning SUP threshold to improve ST and MT performance.
Details :
          - SUP Threshold change for native vs SUP
          - Improved the ST performances for sizes n<800
	  - Introduce PACKB in SUP to improve ST performance between 320<n<800
	  - 16T SUP Tuning for n<1600.

        AMD-Internal: [CPUPL-1981]

Change-Id: Ie59afa4d31570eb0edccf760c088deaa2e10cdda
2022-05-17 18:09:22 +05:30
Dipal M Zambare
06e386f054 Updated Windows build system to pick AMD specific sources.
The framework cleanup was done for linux as part of
f63f78d7 Removed Arch specific code from BLIS framework.

This commit adds changes needed for windows build.

AMD-Internal: [CPUPL-2052]

Change-Id: Ibd503a0adeea66850de156fb95657b124e1c4b9d
2022-05-17 18:09:20 +05:30