Commit Graph

2760 Commits

Author SHA1 Message Date
Arnav Sharma
90f915d3a9 Vectorized and parallelized zdscal routine
- Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is being calculated on the basis of input size.

AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
2022-09-30 06:11:07 -04:00
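For reference, the operation optimized by 90f915d3a9 above can be sketched as a plain scalar loop: scaling a double-complex vector by a real factor. This is only the reference form, not the AVX2 kernel or the threading logic from the commit. The struct and function names are hypothetical, and a positive stride is assumed.

```c
#include <stddef.h>

/* Hypothetical stand-in for a double-complex element. */
typedef struct { double real; double imag; } zelem_sketch;

/* x[i] := alpha * x[i] for n elements with stride incx (assumed > 0). */
void zdscal_sketch(size_t n, double alpha, zelem_sketch *x, size_t incx)
{
    for (size_t i = 0; i < n; ++i) {
        zelem_sketch *xi = x + i * incx;
        /* A real alpha scales the real and imaginary parts independently. */
        xi->real *= alpha;
        xi->imag *= alpha;
    }
}
```

Because alpha is real, each element needs two multiplications rather than a full complex multiply, which is what makes the routine a natural fit for vectorization.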
satish kumar nuggu
9c292b79e2 Fixed ASAN reported issues in [s/c]trsm small kernels
Details:
1. Fixed the partial memory access overflows for the variables
AlphaVal and ones reported by ASAN.
2. Using a 128-bit packed broadcast with 64-bit data types
after type casting would fill the destination register with
garbage data.
3. Fixed this issue by using the set_ps instruction instead of broadcast.
4. When the n remainder is 1, extra elements were accessed, which
could cause out-of-bounds memory access. Removed the extra element access.

AMD-Internal: [CPUPL-2578][CPUPL-2587]
Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75
2022-09-30 03:02:07 -04:00
mkadavil
f4702debb9 Zen4 compilation flag updates to support low precision gemm.
- BFloat16 flags added to zen4 make_defs in order to enable
compilation of low precision gemm by using zen4 config.
- Avoid -ftree-partial-pre optimization flag with gcc due to
non optimal code generation for intrinsics based kernels in
low precision gemm.
- Enable only Zen3 specific low precision gemm kernels (s16)
compilation when aocl_gemm addon is compiled on Zen3 machines.

AMD-Internal: [CPUPL-1545]
Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f
2022-09-29 08:19:40 -04:00
Eleni Vlachopoulou
1c1a0027a8 Bugfix in DNRM2 AVX path
Description:
Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy
errors.

AMD-Internal: [CPUPL-2576]
Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e
2022-09-29 04:46:04 -04:00
satish kumar nuggu
754627eae9 Addressed uninitialized variables in trsm small algo
1. Addressed uninitialized variables reported in coverity for all
datatypes of trsm small algo.

AMD-Internal: [CPUPL-2542]
Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71
2022-09-29 01:49:22 -04:00
Arnav Sharma
985c3f0fe1 Multithreaded DSCALV
- Added multithreaded support for dscalv.
- Optimal number of threads is calculated based on the input size.

AMD-Internal: [CPUPL-2583]
Change-Id: I0a253c28cb389a4f5214439dbca62304888ca5ae
2022-09-28 07:42:49 -04:00
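The idea of choosing the thread count from the input size (985c3f0fe1 above) might look roughly like the sketch below. The threshold value and the function name are purely illustrative assumptions; the commit does not state the actual heuristic.

```c
#include <stddef.h>

/* Illustrative value only: minimum number of elements worth one thread. */
#define MIN_ELEMS_PER_THREAD_SKETCH 4096

/* Use one thread per MIN_ELEMS_PER_THREAD_SKETCH elements, capped at
   max_threads, and never fewer than one. */
size_t pick_threads_sketch(size_t n, size_t max_threads)
{
    size_t wanted = n / MIN_ELEMS_PER_THREAD_SKETCH;
    if (wanted < 1) wanted = 1;
    if (max_threads < 1) max_threads = 1;
    return wanted < max_threads ? wanted : max_threads;
}
```

Small vectors stay single-threaded so that thread start-up cost does not outweigh the scaling work itself.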
Field G. Van Zee
4c3dd93eba Use kernel CFLAGS for 'kernels' subdirs in addons. (#658)
Details:
- Updated Makefile and common.mk so that the targeted configuration's
  kernel CFLAGS are applied to source files that are found in a
  'kernels' subdirectory within an enabled addon. For now, this
  behavior only applies when the 'kernels' directory is at the top
  level of the addon directory structure. For example, if there is an
  addon named 'foobar', the source code must be located in
  addon/foobar/kernels/ in order for it to be compiled with the target
  configuration's kernel CFLAGS. Any other source code within
  addon/foobar/ will be compiled with general-purpose CFLAGS (the same
  ones that were used on all addon code prior to this commit). Thanks
  to AMD (esp. Mithun Mohan) for suggesting this change and catching an
  intermediate bug in the PR.
- Comment/whitespace updates.

(cherry picked from commit fd885cf98f)
Change-Id: I9a678f78bde90b23a6293ce90377004876f51067
2022-09-27 06:37:38 -04:00
Dipal M Zambare
1f345f87f5 Fixed ASAN reported issues in bli_l3_packm.c
local_mem_s is allocated on the stack in the inner scope of the
“if” block, but it can be accessed through cntl_mem_p outside
the if block. This error is flagged by the address sanitizer. Fixed this
issue by moving the variable's declaration to function scope.

This fix addresses the issue reported for the following libflame APIs:

geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs,
stedc, steqr, syevd

AMD-Internal: [CPUPL-2587]
Change-Id: I63749c7d406c7339d2b45b0488108ccd3f90a248
2022-09-23 16:41:00 +05:30
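The scope problem described in 1f345f87f5 above can be illustrated with a minimal sketch. The type and names are hypothetical stand-ins, not the code in bli_l3_packm.c; the sketch shows only the fixed pattern, with the object declared at function scope.

```c
#include <stddef.h>

/* Hypothetical stand-in for the stack-allocated memory descriptor. */
typedef struct { int size; } mem_sketch;

/* Fixed pattern: the object lives at function scope, so a pointer
   taken inside the if block stays valid after the block ends. */
int packm_scope_sketch(int use_local)
{
    mem_sketch local_mem;        /* declared at function scope */
    mem_sketch *mem_p = NULL;

    if (use_local) {
        /* Declaring local_mem inside this block instead would leave
           mem_p dangling once the block closes, which ASAN reports
           as stack-use-after-scope. */
        local_mem.size = 42;
        mem_p = &local_mem;
    }

    return mem_p ? mem_p->size : -1;
}
```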
eashdash
d21cd51fde Accumulation type for alpha, beta values and BF16 bench integration
1. Correcting the type of alpha, and beta values from C_type
   (output type) to accumulation type.
   For the downscaled LPGEMM APIs, C_type will be the downscaled
   type but the required type for alpha and beta values should
   be the accumulation type.
2. BF16 bench integration with the LPGEMM bench for both the BF16
   (bf16bf16f32of32 and bf16bf16f32obf16) APIs

AMD-Internal: [CPUPL-2561]
Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797
2022-09-23 05:00:49 -04:00
Dipal M Zambare
3a3472e4e5 Revert "Fixed ASAN reported issues in bli_l3_packm.c"
This reverts commit 3401a0d202.

Got pushed without review.
2022-09-23 14:27:12 +05:30
Dipal M Zambare
3401a0d202 Fixed ASAN reported issues in bli_l3_packm.c
local_mem_s is allocated on the stack in the inner scope of the
“if” block, but it can be accessed through cntl_mem_p outside
the if block. This error is flagged by the address sanitizer. Fixed this
issue by moving the variable's declaration to function scope.

This fix addresses the issue reported for the following libflame APIs:

geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs,
stedc, steqr, syevd

AMD-Internal: [CPUPL-2587]
Change-Id: Ib1e9f89bc4b6911e4b7e9b910d1ecc2ba00b286a
2022-09-23 14:24:55 +05:30
Mangala V
42434fbc31 Bug fix in TRSM small for S, D, C & Z datatypes in case of MT
1. Corrected B buffer access to go through its offset instead of the
   starting address, which is required in case of MT.
3. When num_threads > 1, the B buffer is divided into blocks in the m or n
   dimension based on side right or left. Hence it must be accessed by its
   offset to reach the start of the block.
4. Currently the B matrix is divided into blocks for each thread and
   the complete matrix A is used by all threads.
   In case of a design change in the future, A buffer access is also
   modified to go through its offset to support partitioning of matrix A
   for MT.
AMD-Internal: [CPUPL-2520]

Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea
2022-09-23 03:30:25 -04:00
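The offset-based access described in 42434fbc31 above can be sketched as follows. The sketch assumes a column-major B partitioned along the n dimension, and all names are hypothetical; the real kernels choose m or n based on the side argument.

```c
#include <stddef.h>

/* Split n columns across nt threads; the first (n % nt) threads
   receive one extra column. */
void thread_range_sketch(size_t n, size_t nt, size_t tid,
                         size_t *start, size_t *end)
{
    size_t base = n / nt, rem = n % nt;
    *start = tid * base + (tid < rem ? tid : rem);
    *end   = *start + base + (tid < rem ? 1 : 0);
}

/* Column-major B with leading dimension ldb: thread tid must begin at
   the offset of its own block, not at the start of the buffer. */
double *b_block_sketch(double *b, size_t ldb, size_t n, size_t nt, size_t tid)
{
    size_t start, end;
    thread_range_sketch(n, nt, tid, &start, &end);
    (void)end;   /* only the starting column is needed for the pointer */
    return b + start * ldb;
}
```

With a single thread the offset is zero, which is why using the starting address only misbehaves once num_threads > 1.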
Dipal M. Zambare
17694b6ca5 Enabled AVX512 compiler flags for the reference kernels.
- Updated zen4 configuration to enable AVX512 flags for the
  reference kernels
- Reference and vector kernels will use the same compiler flags

AMD-Internal: [CPUPL-2533]
Change-Id: I5a2ba7e584dc3fb93625df12cca6b6c18f514ea8
2022-09-23 11:39:33 +05:30
Harihara Sudhan S
a45827b3f9 u8s8s16os16 bug fix for downscale operation
- Removed some read code from the macros for downscale
- Store permute correction
- Simplified macros for edge cases and corrected intermediate
  operation

AMD-Internal:[CPUPL-2171]

Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68
2022-09-22 05:02:17 -04:00
Nallani Bhaskar
1e5d98322d Disabled DNRM2 AVX path temporarily
Description:
Disabled AVX2 optimized path for DNRM2 to avoid accuracy
issues in netlib blas test.

AMD-Internal: [CPUPL-2576]
Change-Id: I0764725d4f6b1e4e0b5f60a255bc681bb698560e
2022-09-22 13:00:04 +05:30
mkadavil
bf4d1da1b9 Column major input support for BFloat16 gemm.
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.

AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
2022-09-22 02:50:46 -04:00
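The swap described in bf4d1da1b9 above relies on the identity C^T = B^T * A^T: a column-major matrix is, in memory, the row-major form of its transpose. A minimal sketch using float instead of bf16, with hypothetical function names, could look like this:

```c
#include <stddef.h>

/* Row-major reference: C (m x n) = A (m x k) * B (k x n). */
void gemm_rowmajor_sketch(size_t m, size_t n, size_t k,
                          const float *a, size_t lda,
                          const float *b, size_t ldb,
                          float *c, size_t ldc)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = acc;
        }
}

/* Column-major inputs: swap A and B (and m and n) and reuse the
   row-major routine. It computes C^T = B^T * A^T, which lands in
   memory exactly as column-major C. */
void gemm_colmajor_sketch(size_t m, size_t n, size_t k,
                          const float *a, size_t lda,
                          const float *b, size_t ldb,
                          float *c, size_t ldc)
{
    gemm_rowmajor_sketch(n, m, k, b, ldb, a, lda, c, ldc);
}
```

The leading dimensions pass through unchanged, so strides larger than the matrix width work without extra handling.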
Dipal M Zambare
de3247b0da Removed extra prototypes for ?gemm3m APIs
- Removed prototypes for float(sgemm3m) and double(dgemm3m) types,
  as BLIS implements this API only for scomplex(cgemm3m) and
  dcomplex(zgemm3m)

AMD-Internal: [SWLCSG-1477]
Change-Id: Ifad86a74b4c939ed240743894b85bb4fa5e6d754
2022-09-20 15:51:56 +05:30
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
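The scaling mentioned in a5891f7ead above can be illustrated with a scaled sum of squares. Note that this sketch is the simpler running-scale technique, not Blue's algorithm and not the AVX2 kernel. The function name is hypothetical, and the handling of negative increments (starting from the far end of the buffer) is an assumed convention.

```c
#include <stddef.h>

/* Scaled sum of squares: on return, sum(x_i^2) == scale * scale * ssq,
   so the 2-norm is scale * sqrt(ssq). Squaring is only ever applied to
   ratios <= 1, which avoids overflow for very large inputs. */
void sumsq_scaled_sketch(size_t n, const double *x, ptrdiff_t incx,
                         double *scale, double *ssq)
{
    *scale = 0.0;
    *ssq = 1.0;
    if (n == 0) return;

    /* Assumed convention: for negative incx, walk back from the end. */
    ptrdiff_t ix = incx < 0 ? (ptrdiff_t)(n - 1) * -incx : 0;

    for (size_t i = 0; i < n; ++i, ix += incx) {
        double a = x[ix] < 0.0 ? -x[ix] : x[ix];
        if (a == 0.0) continue;
        if (a > *scale) {
            /* New largest magnitude: rescale the running sum to it. */
            double r = *scale / a;
            *ssq = 1.0 + *ssq * r * r;
            *scale = a;
        } else {
            double r = a / *scale;
            *ssq += r * r;
        }
    }
}
```

For x = {3e200, 4e200}, a naive sum of squares overflows, while this form ends with scale = 4e200 and ssq close to 1.5625.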
satish kumar nuggu
b7bc5f204f Fix in DTRSM Small MT
Details:
1. Changes are made in the dtrsm small MT path to avoid accuracy issues.

AMD-Internal: [SWLCSG-1470]
Change-Id: I65237225892f97b7222fe71f66b02841b5956560
2022-09-20 05:54:48 -04:00
Dipal M Zambare
2cdeea3c66 CBLAS/BLAS interface decoupling for the level 1 APIs
-In BLIS, the CBLAS  interface is implemented as a wrapper around
 the BLAS interface. For example the CBLAS API  ‘cblas_dscal’
 internally invokes the BLAS API ‘dscal_’.

-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.

-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]
Change-Id: I0e80071398af29c9313296d2a92e61e3897ac28e
2022-09-19 21:50:29 +05:30
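The decoupling described in 2cdeea3c66 above can be sketched as two thin wrappers over one shared implementation. All names below are hypothetical (prefixed to avoid clashing with real BLAS symbols), and a positive increment is assumed.

```c
/* Shared implementation (hypothetical name). Neither wrapper calls
   the other, so each can be overridden independently. */
static void dscal_impl_sketch(int n, double alpha, double *x, int incx)
{
    for (int i = 0; i < n; ++i)
        x[i * incx] *= alpha;
}

/* BLAS-style wrapper: Fortran convention, arguments by pointer. */
void sketch_dscal_(const int *n, const double *alpha, double *x,
                   const int *incx)
{
    dscal_impl_sketch(*n, *alpha, x, *incx);
}

/* CBLAS-style wrapper: arguments by value. Before the change, this
   layer would have called the BLAS wrapper instead. */
void sketch_cblas_dscal(int n, double alpha, double *x, int incx)
{
    dscal_impl_sketch(n, alpha, x, incx);
}
```

If another library replaces the BLAS symbol, the CBLAS entry point still reaches the original implementation, which is the independence the commit message describes.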
Dipal M Zambare
866e8de7bf CBLAS/BLAS interface decoupling for the level 2 APIs
-In BLIS, the CBLAS  interface is implemented as a wrapper around
 the BLAS interface. For example the CBLAS API  ‘cblas_dgemv’
 internally invokes the BLAS API ‘dgemv_’.

-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.

-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]
Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c
2022-09-15 17:51:05 +05:30
Dipal M Zambare
e18db8a172 CBLAS/BLAS interface decoupling for the level 3 APIs
-In BLIS, the CBLAS  interface is implemented as a wrapper around
 the BLAS interface. For example the CBLAS API  ‘cblas_dgemm’
 internally invokes the BLAS API ‘dgemm_’.
-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.
-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]
Change-Id: Id9e307154342d2c17b0ac6db580c36f1a9ee6409
2022-09-15 06:23:46 -04:00
Dipal M Zambare
61232d540c AOCL progress callback hardening
- BLIS uses callback function to report the progress of the
  operation. The callback is implemented in the user application
  and is invoked by BLIS.

- Updated callback function prototype to make all arguments const.
  This will ensure that any attempt to write using callback’s
  argument is prevented at the compile time itself.

AMD-Internal: [CPUPL-2504]
Change-Id: I8ceb671242365d2a9155b485301cd8c75043e667
2022-09-14 15:32:10 +05:30
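A callback with all-const arguments, as described in 61232d540c above, might look like the sketch below. The prototype, argument names, and driver function are hypothetical and are not the actual AOCL declaration.

```c
#include <stddef.h>

/* Hypothetical prototype: every argument is const, so the callback
   cannot modify what the library passes in. */
typedef long (*progress_cb_sketch)(const char *const api,
                                   const long api_len,
                                   const long progress,
                                   const long current_thread,
                                   const long total_threads);

static long calls_seen_sketch = 0;

/* User-side callback. Writing api[0] = 'x' or progress = 0 here
   would be rejected at compile time. */
long count_progress_sketch(const char *const api, const long api_len,
                           const long progress, const long current_thread,
                           const long total_threads)
{
    (void)api; (void)api_len; (void)progress;
    (void)current_thread; (void)total_threads;
    ++calls_seen_sketch;
    return 0;
}

/* Stand-in for the library side: reports progress after each step
   and returns the total number of callback invocations seen. */
long run_steps_sketch(long steps, progress_cb_sketch cb)
{
    for (long s = 1; s <= steps; ++s)
        if (cb) cb("gemm", 4, s, 0, 1);
    return calls_seen_sketch;
}
```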
mkadavil
9bc59cc500 Low Precision GEMM framework fixes for downscaling.
- The temporary buffer allocated for C matrix when downscaling is
enabled is not filled properly. This results in wrong gemm accumulation
when beta != 0, and thus wrong output after downscaling. The C panel
iterators used for filling the temporary buffer are updated to fix it.
- Low precision gemm bench updated for testing/benchmarking downscaling.

AMD-Internal: [CPUPL-2514]
Change-Id: Ib1ba25ba9df2d2997edaaf0763ff0113fb35d6eb
2022-09-13 07:42:29 -04:00
Harihara Sudhan S
5b6cc5d39d Bug fix in s16 downscale operation
- Store operations were done to the C matrix and not to the C buffer

AMD-Internal:[CPUPL-2171]

Change-Id: Ic0897a20850fdae96db52f0ccc6fa087c84239fa
2022-09-13 06:01:48 -04:00
eashdash
e1349c0c71 LPGEMM BF16 MT panel based balancing
Introduced multi-thread panel based balancing for BF16 to improve the
overall MT performance.

AMD-Internal: [CPUPL-2502]
Change-Id: Iddce9548fa96e5f57bd3d3eb3e8268855ca47f25
2022-09-07 03:20:50 -04:00
Dipal M Zambare
6b71fea1f4 GCC 8 support for zen4 (and amdzen) configuration
- Added check for GCC version 8,
- Added AVX512 compiler flags needed for zen4 build using GCC 8.

AMD-Internal: [CPUPL-2494]
Change-Id: I7fd72e4b197fdd754633f674a8b87f01da8dd320
2022-09-06 16:58:02 +05:30
Edward Smyth
d69473c8f7 Update zen4 test in bli_cpuid.c
Remove test on is_arch in bli_cpuid_is_zen4() so it will report true for
all Zen4 models, based on family and AVX512 feature tests.

AMD-Internal: [CPUPL-2474]
Change-Id: I85d2e230b33391d5c9779df4585ae2a358788e72
2022-09-01 05:12:24 -04:00
eashdash
32a9e735f1 BF16 Output downscaling functionality
- The output of BF16 instructions is accumulated at the higher precision
of FP32, which needs to be converted to the lower precision of BF16 after
the GEMM operations. This is required in AI workloads where both
input and output are in BF16 format.
- BF16 downscaling is implemented as post-ops inside the GEMM
microkernels.

Change-Id: Id1606746e3db4f3ed88cba385a7709c8604002a8
2022-08-30 13:46:09 -04:00
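The FP32 to BF16 conversion that 32a9e735f1 above refers to can be sketched in scalar form. BF16 keeps the upper 16 bits of an FP32 value; the round-to-nearest-even step and the NaN handling below are assumptions about the rounding mode, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* FP32 -> BF16 with round-to-nearest-even (assumed rounding mode). */
uint16_t f32_to_bf16_sketch(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    /* NaN: keep it a NaN by forcing a mantissa bit in the upper half. */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);

    /* Add 0x7FFF plus the lowest kept bit, so ties round to even. */
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}

/* BF16 -> FP32 is exact: the low 16 mantissa bits become zero. */
float bf16_to_f32_sketch(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```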
Harihara Sudhan S
5faab43e66 Downscaling as part of u8s8s16os16
- int16 c matrix intermediate values are converted to int32,
  then the int32 values are converted to fp32. On these fp32
  values scaling is done
- The resultant value is down scaled to int8 and stored in a
  separate buffer

AMD-Internal: [2171]
Change-Id: I76ff04098def04d55d1bd88ac8c8d3f267964cab
2022-08-30 13:41:36 -04:00
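The conversion chain in 5faab43e66 above (int16 to int32 to fp32, scale, then int8) can be sketched per element. The saturation bounds and the round-half-away-from-zero step are assumptions for illustration, the function name is hypothetical, and the actual kernels operate on vectors.

```c
#include <stdint.h>

/* Downscale one int16 accumulator value to int8. */
int8_t downscale_s16_to_s8_sketch(int16_t acc, float scale)
{
    int32_t wide = (int32_t)acc;          /* int16 -> int32 */
    float scaled = (float)wide * scale;   /* int32 -> fp32, then scale */

    /* Saturate to the int8 range before converting back (assumed). */
    if (scaled > 127.0f)  scaled = 127.0f;
    if (scaled < -128.0f) scaled = -128.0f;

    /* Assumed rounding: half away from zero. */
    int32_t r = (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
    return (int8_t)r;
}
```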
mkadavil
958c9238ac Output downscaling support for low precision GEMM.
- Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
This is required in AI workloads where quantization/dequantization
routines are used.
- New GEMM APIs are introduced specifically to support this use case.
Currently downscaling support is added for s32, s16 and bfloat16 GEMM.

AMD-Internal: [CPUPL-2475]
Change-Id: I81c3ee1ba5414f62427a7a0abb6ecef0c5ff71bf
2022-08-30 10:27:19 -04:00
satish kumar nuggu
d62f12a18a Fixed bug in DZGEMM
Details:
1. On zen4, the dgemm and sgemm native kernels are column-prefer kernels,
while the cgemm and zgemm kernels are row-prefer kernels. zen3 and older
architectures use row-prefer kernels for all datatypes. The induced
transpose was carried out based on a kernel preference check in both crc
and ccr, but that check is not required to apply the induced transpose.
2. In case of ccr, the B matrix is real. We must induce a transposition and
perform C += A*B (crc), where A (formerly B) is real.
3. In case of crc, the A matrix is real. We don't require any
induced transpose.

AMD-Internal: [CPUPL-2440] [CPUPL-2449]
Change-Id: I44c53a20c8def7ddbb84797ba20260acec2086a2
2022-08-30 09:34:05 -04:00
satish kumar nuggu
99dc9066fb Fix in ZTRSM Small MT
Details:
1. Changes are made in the ztrsm small MT path to avoid accuracy issues
reported in libflame tests.

AMD-Internal: [CPUPL-2476]
Change-Id: Ic279106343fb1744e89ff4c920023adbe1d0158a
2022-08-30 18:02:19 +05:30
Dipal M Zambare
5c42afada8 Revert "CBLAS/BLAS interface decoupling for level 3 APIs"
This reverts commit d925ebeb06.

Change-Id: I2e842b29c1fedbe14bf913949cf978f3e7515ff3
2022-08-30 14:50:38 +05:30
eashdash
e674fae758 Post-Ops for bf16bf16f32
Functionality - Post-ops are a set of element-wise operations performed
on the output matrix after the GEMM operation. Support for
them is added by fusing post-ops with the GEMM operations.

- Post-ops Bias, Relu and Parametric Relu are added to all the
compute kernels of bf16bf16f32of32
- Modified bf16 interface files to add check for bf16 ISA support

Change-Id: I2f7069a405037a59ea188a41bd8d10c4aae72fb3
2022-08-30 08:14:14 +00:00
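The post-ops named in e674fae758 above (Bias, Relu, Parametric Relu) can be sketched as an element-wise pass over an output tile. The enum, the function name, and the choice of one bias value per column are assumptions for illustration; in the commit the post-ops are fused into the compute kernels rather than run as a separate pass.

```c
#include <stddef.h>

typedef enum {
    POSTOP_NONE_SKETCH,
    POSTOP_RELU_SKETCH,
    POSTOP_PRELU_SKETCH
} postop_sketch;

/* Apply an optional bias (one value per column, assumed) and an
   activation to a row-major m x n tile of the output matrix. */
void apply_postops_sketch(size_t m, size_t n, float *c, size_t ldc,
                          const float *bias, postop_sketch act,
                          float prelu_slope)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float v = c[i * ldc + j];
            if (bias) v += bias[j];
            if (v < 0.0f) {
                if (act == POSTOP_RELU_SKETCH)       v = 0.0f;
                else if (act == POSTOP_PRELU_SKETCH) v *= prelu_slope;
            }
            c[i * ldc + j] = v;
        }
}
```

Fusing these steps into the kernel avoids a second trip through memory for the output matrix.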
mkadavil
a7d1cc7369 Multi-Threading support for BFloat16 gemm.
-OpenMP based multi-threading support added for BFloat16 gemm.
Both the gemm and reorder APIs are parallelized.
-Multi-threading support for the u8s8s16 reorder API.
-Typecast issues fixed for bfloat16 gemm kernels.

AMD-Internal: [CPUPL-2459]
Change-Id: I6502d71ab32aa73bb159245976ea3d3a8e0ed109
2022-08-30 02:54:19 -04:00
Dipal M Zambare
7e42b3d2e0 Revert "CBLAS/BLAS interface decoupling for level 2 APIs"
This reverts commit 192f5313a1.

Change-Id: I876cad90902970ebc61550f109eb0ce32539ea1c
2022-08-30 11:53:46 +05:30
Dipal M Zambare
6cff8b030e Revert "CBLAS/BLAS interface decoupling for level 1 APIs"
This reverts commit 95169ca806.

Change-Id: Ic441aca616be6f27c7f1ba64e4480edcc6b17632
2022-08-30 11:34:34 +05:30
Nallani Bhaskar
ea79efa915 Fixed out of bound memory access in sgemmsup zen rv kernels
Details:

1. In the sgemmsup_zen_rv_?x2 kernels, the "vmovps" instruction
   is used to load the B matrix in the k loop and the last k loop,
   which loads 128 bits into xmm rather than the expected 64 bits.

2. Changed the vmovps instruction to vmovsd instructions,
   which load only 64 bits into the xmm register.

3. Avoided C memory access by the vfma instruction when multiplying
   with non-beta at corner cases, where the required 128-bit access
   leads to out-of-bound access. Replaced it with vmovq first to
   get the 64-bit data, then performed vfma on the xmm register in
   rv_6x8m and rv_6x4m.

   AMD-Internal: [CPUPL-2472]

Change-Id: Iad397f8f5b5cc607b4278b603b1e0ea3f6b082f2
2022-08-30 01:13:14 -04:00
Dipal M Zambare
40c71dd2e1 Revert "CBLAS/BLAS interface decoupling for swap api"
This reverts commit 2beaa6a0e6.

Reverting it as it is planned for the next release.

Change-Id: Ib9271acd0b5b4cfd10c8f8b7bbb6ef93a3d594ea
2022-08-30 10:10:06 +05:30
Edward Smyth
abf848ad12 Code cleanup and warnings fixes
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
  to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
  only defined in header file if BLIS_ENABLE_OPENMP is defined

AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
2022-08-29 08:22:30 -04:00
Edward Smyth
9b28371f45 BLIS: BLAS3 quick return functionality
Bugfix for http://gerrit-git.amd.com/q/I0ebe9d76465b0e48b2ff5c2f1cc2a75763fe187c changes in frame/compat/bla_gemm.c

AMD-Internal: [CPUPL-2373]
Change-Id: Ia58b76f060eda204bba68be6730105f4d8db7537
2022-08-29 08:17:50 -04:00
Arnav Sharma
eb83a0fe9d Enabled ZHER Optimized Path
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.

AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
2022-08-29 08:09:42 -04:00
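The three-step update described in eb83a0fe9d above can be sketched for the upper triangle of a column-major matrix. This is a scalar illustration with hypothetical names, not the optimized kernel; forcing the diagonal's imaginary part to zero is an assumption that follows the usual Hermitian convention.

```c
#include <stddef.h>

/* Hypothetical stand-in for a double-complex element. */
typedef struct { double real; double imag; } zher_elem_sketch;

/* A := A + alpha * x * x^H on the upper triangle, with alpha real. */
void zher_upper_sketch(size_t n, double alpha, const zher_elem_sketch *x,
                       zher_elem_sketch *a, size_t lda)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i <= j; ++i) {
            /* Step 1: scale x[i] by alpha. */
            double s_re = alpha * x[i].real;
            double s_im = alpha * x[i].imag;

            /* Step 2: multiply the scaled value by conj(x[j]). */
            double p_re = s_re * x[j].real + s_im * x[j].imag;
            double p_im = s_im * x[j].real - s_re * x[j].imag;

            /* Step 3: add the product to the matrix element. */
            zher_elem_sketch *aij = &a[i + j * lda];
            aij->real += p_re;
            if (i == j) aij->imag = 0.0;   /* diagonal stays real (assumed) */
            else        aij->imag += p_im;
        }
}
```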
jagar
2beaa6a0e6 CBLAS/BLAS interface decoupling for swap api
-   In BLIS, the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
    internally invokes the BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I8d81072aaca739f175318b82f6510d386103c24b
2022-08-29 16:26:01 +05:30
Dipal M Zambare
d3b503bbf2 Code cleanup and warnings fixes
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
  files

AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
2022-08-29 15:15:40 +05:30
jagar
95169ca806 CBLAS/BLAS interface decoupling for level 1 APIs
-   In BLIS, the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
    internally invokes the BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it and may
    result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I0f4521e70a02f6132bdadbd4c07715c9d52fe62a
2022-08-29 14:28:20 +05:30
Harihara Sudhan S
326d8a557f Performance regression in u8s8s16os16
- Performance of u8s8s16os16 came down by 40% after the
  introduction of post-ops
- Analysis revealed that the target compiler assumed false
  dependency and was generating sub-optimal code due to the
  post-ops structure
- Inserted vzeroupper to hint the compiler that no ISA change
  will occur

AMD-Internal: [CPUPL-2447]
Change-Id: I0b383b9742ad237d0e053394602428872691ef0c
2022-08-29 03:20:02 -04:00
jagar
192f5313a1 CBLAS/BLAS interface decoupling for level 2 APIs
-   In BLIS, the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
    internally invokes the BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it and may
    result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I8380b6468683028035f2aece48916939e0fede8a
2022-08-29 09:47:19 +05:30
Chandrashekara K R
d925ebeb06 CBLAS/BLAS interface decoupling for level 3 APIs
->In BLIS, the CBLAS interface is implemented as a wrapper around
  the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
  internally invokes the BLAS API ‘dgemm_’.
->If the end user wants to use different libraries for CBLAS
  and BLAS, the current implementation of BLIS doesn’t allow it and
  may result in recursion.
->This change separates the CBLAS and BLAS implementations by adding
  an additional level of abstraction. The implementation of the
  API is moved to a new function which is invoked directly from
  the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]

Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5
2022-08-26 05:54:29 -04:00
satish kumar nuggu
2114a43df8 Fixes to avoid Out of Bound Memory Access in TRSM small algorithm
Details:
1. Fixed the issues corresponding to out-of-bound memory access during
load and store.
2. In intrinsic code:
   i. AVX2 registers can hold 4 double elements.
   ii. In the remainder case, the number of elements is less than the
vector register width. Although fewer than 4 elements are required,
we were reading and writing in chunks of 4 elements due to
vectorization. This might cause out-of-bound memory access.
3. Redesigned code to restrict out-of-bound access by loading and
storing the exact number of elements required.

AMD-Internal: [SWLCSG-1470]
Change-Id: I786f8023cf5a5f3e5343bea413c59bd0e764df9b
2022-08-26 01:06:10 -04:00
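The remainder handling described in 2114a43df8 above can be illustrated without intrinsics. The sketch below uses a local bounce buffer so that only the required elements are read from and written to the caller's memory. This is a stand-in technique chosen for portability; the commit does not say how the kernels achieve the exact-size loads and stores. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define VEC_WIDTH_SKETCH 4   /* doubles per AVX2 register */

/* Stand-in for one 4-wide vector operation. */
static void scale4_sketch(double v[VEC_WIDTH_SKETCH], double alpha)
{
    for (int i = 0; i < VEC_WIDTH_SKETCH; ++i)
        v[i] *= alpha;
}

/* Scale n doubles in place without touching memory past x[n-1]. */
void scale_exact_sketch(size_t n, double alpha, double *x)
{
    size_t i = 0;

    /* Full chunks can use the 4-wide path directly. */
    for (; i + VEC_WIDTH_SKETCH <= n; i += VEC_WIDTH_SKETCH)
        scale4_sketch(x + i, alpha);

    /* Remainder: copy only rem elements in and out of a local buffer. */
    size_t rem = n - i;
    if (rem) {
        double tmp[VEC_WIDTH_SKETCH] = {0};
        memcpy(tmp, x + i, rem * sizeof *x);   /* load exactly rem */
        scale4_sketch(tmp, alpha);
        memcpy(x + i, tmp, rem * sizeof *x);   /* store exactly rem */
    }
}
```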