Commit Graph

2789 Commits

Author SHA1 Message Date
Meghana Vankadari
fbd9dde869 Fix bug in DZGEMM
Details:
- In case of dzgemm, if the microkernel prefers column output,
  we induce a transposition and perform C += A*B,
  where A (formerly B) becomes complex and B (formerly A)
  becomes real. Hence attach the complex alpha object to A instead of B.
- This commit reverts all the changes made by
  d62f12a18a and
  0b81f53074 as they are causing
  failures in make checkblis-md.

AMD-Internal: [CPUPL-2893]
Change-Id: I56b94ac136fb96003302c568ae2587142c836620
2023-01-09 03:30:57 -05:00
Vignesh Balasubramanian
dbd0c069d4 Implemented 3x4 RD kernels for ZGEMM in SUP path
-Implemented (r)ow preferential (d)ot product milli-kernels
 (m and n variants) for dcomplex datatype along SUP path.

-These computational kernels extend the support for handling RRC and
 CRC storage schemes along the SUP path. In case of BLAS api call,
 it corresponds to the input cases with transa equal to T and
 transb equal to N.

-In case of the B matrix being conditionally packed, the inputs are
 redirected to the existing (r)ow preferential (v)ector load optimized
 kernels due to better performance.

-Added macro for vhsubpd assembly instruction, to support the arithmetic
 for complex datatype in its interleaved storage.

AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
2022-12-22 18:31:46 +05:30
Nithya V S
6e9defe1b5 Developed AVX512 based "sup" kernels for SGEMM
- Only for RRR case, var2m kernels are added
- Main kernel is of 12x32 (AVX512), associated fringe kernels of
	- 8x32, 4x32, 2x32, 1x32 (AVX512)
	- 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512)
	- 12x8, 8x8 (AVX2)
	- 12x4, 8x4 (SSE4)
	- 12x2, 8x2 (SSE4)
	- existing AVX2/SSE4 kernels are used for other fringe
	  cases
- Currently, these kernels are not invoked in zen4 path
- Once all AVX512 kernels (n and rd) are done, invoke all of them
  together in zen4 config

AMD-Internal: [CPUPL-2801]

Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196
2022-12-16 10:18:09 +05:30
Eleni Vlachopoulou
13aa3c8cd0 Adding AVX2 support for SNRM2 and SCNRM2
- For the cases where AVX2 is available, an optimized function is called,
  based on Blue's algorithm. The fallback method based on sumsqv is used
  otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [SWLCSG-1080]

Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9
2022-12-14 04:25:45 -05:00
Edward Smyth
2592774fe8 BLIS: Nested Parallelism issues (3)
Bugfix for parallel BLAS1 and BLAS2 routines. Threading information
was not being set correctly when initializing local rntm from global.

Also ensure th_rntm is initialized along with global_rntm by updating
it in bli_thread_init(), called by bli_init_once()

AMD-Internal: [CPUPL-2433]
Change-Id: Iba658f87ae13fe16a57ca1fc279e149b7fa294cf
2022-12-13 12:38:40 -05:00
Eleni Vlachopoulou
88e549e7bd Using CMake as the build system for both Linux and Windows:
- GoogleTest headers removed. GoogleTest gets fetched at
configuration time.
- BLIS headers removed. A BLIS installation path is required at
configuration time.
- Windows has been temporarily disabled.

AMD-Internal: [CPUPL-2732]
Change-Id: I9e55c8e43b2733f96cd8b6e5449d79623decad5c
2022-12-13 19:09:23 +05:30
jagar
cff29bde76 Added gtestsuite folder into blis repo
Moved blis gtestsuite from lib-confscript to blis repo
(branch: amd-main)

Change-Id: If7ad391eef66bac6d26cf5223e6043d52b746072
2022-12-07 23:57:13 -05:00
Edward Smyth
345aacf806 BLIS: Nested parallelism issues (2)
Improvements to recent parallelism changes:

1. BLIS specific threading options: In bli_thread_update_rntm_from_env()
   set threading variables in tl_rntm to serial values when OpenMP level
   for parallelism within BLIS will not be active. User supplied BLIS
   threading values remain unchanged in global_rntm.

2. Simplify code structure in bli_thread_update_rntm_from_env().

3. Change variable declarations in bli_thread_init_rntm_from_env()
   and bli_thread_update_rntm_from_env() to avoid unused variable
   warnings in non-OpenMP builds.

AMD-Internal: [CPUPL-2433]
Change-Id: I5505657e3d2722e69bc4a1c1bb9fd8df55407fdd
2022-12-07 04:34:07 -05:00
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
The HPL script was using the manual BLIS way of setting threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in the parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.
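
The fallback described above can be sketched roughly as follows. This is
an illustration only, not the BLIS code: the helper names env_ways,
ways_product and fallback_num_threads are hypothetical, and treating an
unset or non-positive variable as 1 is an assumption.

```c
#include <stdlib.h>

/* Hypothetical helper: read one BLIS_*_NT environment variable.
   Assumption: an unset or non-positive value counts as 1 way. */
static long env_ways(const char *name)
{
    const char *s = getenv(name);
    if (s == NULL) return 1;
    long v = strtol(s, NULL, 10);
    return (v > 0) ? v : 1;
}

/* Product of the per-loop ways of parallelism, each clamped to >= 1. */
static long ways_product(long jc, long pc, long ic, long jr, long ir)
{
    long w[5] = { jc, pc, ic, jr, ir };
    long total = 1;
    for (int i = 0; i < 5; ++i)
        total *= (w[i] > 0) ? w[i] : 1;
    return total;
}

/* Hypothetical helper: if the reported thread count is not positive
   (e.g. -1), derive a local count from the BLIS_*_NT values instead. */
static long fallback_num_threads(long reported)
{
    if (reported > 0) return reported;
    return ways_product(env_ways("BLIS_JC_NT"), env_ways("BLIS_PC_NT"),
                        env_ways("BLIS_IC_NT"), env_ways("BLIS_JR_NT"),
                        env_ways("BLIS_IR_NT"));
}
```

With BLIS_JC_NT=2 and BLIS_IC_NT=4 set and the other variables unset,
this sketch would yield a local thread count of 8.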

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
Chandrashekara K R
5f1ea3246a Updated blis library version string to 4.0.1
Change-Id: I1f3eb9081bcc05ccec300e00860d905b23d9709a
2022-11-24 10:35:34 +05:30
Arnav Sharma
de7b7b8eb8 Fix for ZDSCAL Multi-thread implementation
- In the existing implementation, when the actual number of threads
  spawned is different from the indicated number of threads for the
  parallel region, only part of the job is executed.
- Fix added to identify the actual number of threads spawned and
  allocate the workload to a single thread in case of a discrepancy
  between the number of threads spawned and the number indicated.
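
A minimal sketch of this kind of fallback is shown below. It is not the
BLIS implementation: the helper name thread_range and the even-split
partitioning scheme are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: compute the half-open range [*start, *end) of
   vector elements that thread `tid` should process.
   `requested` is the thread count asked for, `actual` is the count the
   runtime really spawned. */
static void thread_range(size_t n, int requested, int actual, int tid,
                         size_t *start, size_t *end)
{
    if (actual != requested || requested < 1) {
        /* Discrepancy: thread 0 takes the whole vector, the rest idle. */
        *start = 0;
        *end = (tid == 0) ? n : 0;
        return;
    }
    size_t nt = (size_t)requested;
    size_t t = (size_t)tid;
    size_t chunk = n / nt;
    size_t rem = n % nt;
    /* The first `rem` threads each receive one extra element. */
    *start = t * chunk + (t < rem ? t : rem);
    *end = *start + chunk + (t < rem ? 1 : 0);
}
```

In an OpenMP region the `actual` value would typically come from
omp_get_num_threads() queried inside the region; that call is omitted
here so the sketch compiles without OpenMP.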

AMD-Internal: [CPUPL-2761]
Change-Id: I9eb9baccaef7f2149346be705ee151ae7af0a0e5
2022-11-22 11:03:07 +05:30
Arnav Sharma
2bc2d11e8a Fix for DSCAL Multi-thread implementation
- In the existing implementation, when the actual number of threads
  spawned is different from the indicated number of threads for the
  parallel region, only part of the job is executed.
- Fix added to identify the actual number of threads spawned and
  allocate the workload to a single thread in case of a discrepancy
  between the number of threads spawned and the number indicated.

AMD-Internal: [CPUPL-2761]
Change-Id: Ife36e6e4993bdcc5a506349b54b2177173866e32
2022-11-21 08:49:07 -05:00
Arnav Sharma
69be3b0557 Fix for DDOT Multi-thread implementation
- In the existing implementation, when the actual number of threads
  spawned is different from the indicated number of threads for the
  parallel region, only part of the job is executed.
- Fix added to identify the actual number of threads spawned and
  allocate the workload to a single thread in case of a discrepancy
  between the number of threads spawned and the number indicated.

AMD-Internal: [SWLCSG-1659]
Change-Id: I882d0af608009e502189a4469eb3483d659c2b3f
2022-11-21 06:17:26 -05:00
Harihara Sudhan S
11c42ce1d3 C matrix prefetch for BF16 GEMM
- Broke down the KR loop inside the compute kernel into
  two pieces
- Added C matrix prefetch between the two decomposed
  pieces of KR loop

AMD-Internal: [CPUPL-2693]
Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e
2022-11-21 04:57:19 -05:00
Harihara Sudhan S
2f6c26eff6 Dynamic block tuning for LPGEMM u8s8s16os16
- NC value is adjusted based on the n dimension of the B
  matrix.
- KC value is adjusted based on the k dimension of the input
  matrices.
- In order to ensure a handshake between the reorder
  function and compute function, the same dynamic block
  tuning logic is called in both places.
- Reorder followed by compute scenarios mean that tuning
  KC value based on MC value and vice versa is not feasible.

AMD-Internal: [CPUPL-2638]
Change-Id: Ie515da7de83e3dfb59946e873c92d67f11559429
2022-11-21 01:39:35 -05:00
Vignesh Balasubramanian
078b0bc626 Enabling ZGEMM SUP path for single thread
-Enabled the sup path for zgemm in case of single thread.
-Fine tuned the thresholds for small path in order to handle
 certain spectrum of skinny matrices.
-The inputs are now redirected among small, sup and native
 paths in case of single thread.

AMD-Internal: [CPUPL-2632]
Change-Id: I3eca315bcb8bfa8e015349a7b2bc4ebd5448e065
2022-11-04 16:05:48 +05:30
Edward Smyth
34730a1e4c BLIS: Nested parallelism issues
1. Check OpenMP active level against max active levels when setting
   number of threads for starting a new parallel region in
   ./frame/thread/bli_thread.c to ensure the correct number of threads
   is used when BLIS is called within nested OpenMP parallelism.

2. In subsequent BLIS calls, threading choices could be incorrectly
   set based on values used and stored in global_rntm by a previous
   call. This could apply when the OpenMP number of threads differ from
   call to call, different nested parallelism is used in different
   parts of a user's code, or different threads at the user level
   request different numbers of OpenMP threads for BLIS calls.

   Keep threading information in both global_rntm and a new Thread
   Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
   environment (as appropriate) during bli_init_auto() calls in each
   BLIS routine. The details are:
   * global_rntm is initialized on first BLIS call based on OpenMP and
     BLIS threading environment variables.
   * global_rntm is updated by any BLIS threading function calls.
   * In bli_thread_update_tl(), called by bli_init_auto(), sync with
     any BLIS values set or updated in global_rntm. Then, if BLIS
     threading control is not used, check OpenMP ICVs and set thread
     count and auto_factor appropriately.
   * Setting BLIS threading locally (using expert interfaces to pass
     a user defined rntm data structure) should work as before.

3. bli_thread_get_is_parallel can now only be called outside of
   parallelism within BLIS routines. Change calls in trsm to reflect
   this.

4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
   if any BLIS_*_NT environment variables are set.

5. Set auto_factor = FALSE when the number of threads is 1.

6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.

7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().

8. For debugging, internal information on the rntm threading data can
   be printed by defining "PRINT_THREADING" at the top of bli_rntm.h

9. bli_rntm_print() now also prints the value of blis_mt.

10. Function prototypes in bli_rntm.h moved to top of file, so that
   bli_rntm_print() can be used within inline functions defined in
   this header file.

11. Comment out bli_init_auto() and bli_finalize_auto() calls in
    Fortran interfaces in frame/compat/blis/thread/b77_thread.c

12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
    set_pack_b functions outside of the auto_factor if statements.

13. Misc code tidying.

AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
2022-10-21 07:38:39 -04:00
Mangala V
1290c27b2a Revert "Enabled MT path in DTRSM Small"
This reverts commit bdbdd20996.

Reason for revert: Functionality issue in TRSM MT Path on Genoa

Change-Id: I197ce4d58641196de9fb32828b3cc91c568b5de7
2022-10-21 05:13:38 -04:00
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
  files missing it

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Vignesh Balasubramanian
f6185f2e53 Disabling ZGEMM SUP for single thread
-Disabled the sup path for zgemm with single thread temporarily
 until we tune the performance and update the thresholds.
 The redirection of inputs in zgemm is now among small and native path
 for single thread.
-Fine tuned the input size constraints for entry to small path in zgemm
 in case of single thread.

AMD-Internal: [CPUPL-2495]

Change-Id: Ifa19116757ca132d9636c3f2a0661524d2c1d3ec
2022-10-11 08:09:53 -04:00
eashdash
63864d7dfb Added clipping while downscaling for u8s8s32os8 and u8s8s16os8.
Clipping is done during the downscaling of the accumulated result
from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8,
to saturate the final output values between [-128,127]
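
The saturation step can be illustrated with a scalar sketch. The actual
kernels are vectorized; the function name clip_s32_to_s8 is hypothetical
and only the clipping itself (not the scaling) is shown.

```c
#include <stdint.h>

/* Saturate an already-downscaled accumulator value to the int8_t
   range [-128, 127]. An int16_t accumulator can be passed as well,
   since it promotes to int32_t without loss. */
static int8_t clip_s32_to_s8(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;   /* clamp to  127 */
    if (v < INT8_MIN) return INT8_MIN;   /* clamp to -128 */
    return (int8_t)v;                    /* already in range */
}
```

Without the clamp, a plain cast of an out-of-range value such as 300 to
int8_t would not produce 127, which is why the explicit saturation
matters.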

AMD-Internal: [LWPZENDNN-493]

Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f
2022-10-11 07:28:06 -04:00
Mangala V
bdbdd20996 Enabled MT path in DTRSM Small
1. Fixed accuracy issues in dtrsm small multi-thread.
2. Fixed out of bound memory accesses in the patch
   "Fixes to avoid Out of Bound Memory Access in TRSM small algorithm"
3. Re-enabled DTRSM small MT with above fixes.

AMD-Internal: [CPUPL-2567]
Change-Id: Ibf2949b25fde4007a92cc635526bed0e0d897800
2022-10-10 09:47:45 -04:00
Harihara Sudhan S
492555785a Fixed bench accuracy issue in LPGEMM
Description:
- When the values of the result in s8 for u8s8s32 and u8s8s16 are
  close to 0, they are getting ceiled to 1.
- Used nearbyintf to round the downscaled values in bench reference.
  This fixed the result mismatch issue between the vectorized kernel
  implementation and reference implementation in bench accuracy test.

AMD-Internal: [CPUPL-2617]
Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82
2022-10-07 09:48:21 +00:00
Edward Smyth
991f2ec0e8 BLIS: Errors in Netlib LAPACK Tests
Zen4 kernel bli_damaxv_zen_int_avx512 is causing incorrect results in
the netlib LAPACK tests, specifically in:

./xlintstd < ../dtest.in > dtest.out

in the TESTING/LIN directory. Given time constraints, i.e. the need to
finalize code for AOCL 4.0 release, disable calls to AVX512 kernel
(i.e. always use the AVX2 kernel) for now, and aim to correct
bli_damaxv_zen_int_avx512 for AOCL 4.1.

AMD-Internal: [CPUPL-2590]
Change-Id: I2603dd97c3931acb9730563e8126b109ec2b2572
2022-10-03 10:55:49 +00:00
Shubham Sharma
2d72d4c862 DTRSM Native path AVX 512 kernels are developed
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.

AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
2022-10-01 15:58:24 +00:00
Nallani Bhaskar
f1caa8e630 Fixed initialization issues in ddot multithread implementation
Description:

1. n value per thread and offset are defined outside omp loop
   which can vary for each thread. Moved them to inside the
   omp loop so that the output

2. Fixed a few compilation warnings in the dnrm2 avx2 implementation

AMD-Internal: [CPUPL-2606]

Change-Id: Ifba9b3707c3c1a66f31b5e1906ecb68eabef4f81
2022-10-01 14:43:39 +05:30
Mangala V
e440cbc91a Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m
Address sanitizer reports an error when the rbp register is modified.

Register rbp was stored with rs_a which was used during prefetch
of Matrix A. Usage of rbp is avoided by using rcx register as a
temporary storage register.
Hence rcx is updated with Matrix C address before storing the
computed data.

This fix addresses the issue reported by the GEQP3 API of libflame

AMD-Internal: [CPUPL-2587]
Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d
2022-09-30 11:52:29 -04:00
Eleni Vlachopoulou
863b73dfaf Adding AVX2 support for DZNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

- Cleaned up some white space in the AVX2 implementation for DNRM2.

AMD-Internal: [CPUPL-2551]
Change-Id: I0875234ea735540307168fe7efc3f10fe6c40ffc
2022-09-30 07:51:04 -04:00
Chandrashekara K R
e36e13ec78 Parallelized DDOT using OpenMP.
Description:

1. Calculate the optimal number of threads required based on input
   dimension.

2. Request dynamic memory from the memory pool to store each
   thread's output.

3. Create optimal number of threads and divide the given input data
   among all threads.

4. Accumulate the results in rho variable.

5. Free the dynamically allocated memory back to memory pool.
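
The steps above can be sketched as follows. This is a simplified
illustration, not the BLIS kernel: the names dot_range and ddot_sketch
are hypothetical, plain malloc stands in for the BLIS memory pool, and
the thread count nt is taken as an input rather than derived from the
problem size. The OpenMP pragma is optional; without OpenMP support the
loop runs serially and gives the same result.

```c
#include <stdlib.h>

/* Dot product of x[start..end) and y[start..end). */
static double dot_range(const double *x, const double *y,
                        size_t start, size_t end)
{
    double s = 0.0;
    for (size_t i = start; i < end; ++i)
        s += x[i] * y[i];
    return s;
}

static double ddot_sketch(size_t n, const double *x, const double *y, int nt)
{
    if (nt < 1) nt = 1;

    /* One slot per thread for the partial results. */
    double *partial = malloc((size_t)nt * sizeof *partial);
    if (partial == NULL)
        return dot_range(x, y, 0, n);   /* serial fallback */

    /* Divide the input among the threads; each writes only its slot. */
    #pragma omp parallel for num_threads(nt)
    for (int t = 0; t < nt; ++t) {
        size_t start = n * (size_t)t / (size_t)nt;
        size_t end = n * (size_t)(t + 1) / (size_t)nt;
        partial[t] = dot_range(x, y, start, end);
    }

    /* Accumulate the per-thread results into rho. */
    double rho = 0.0;
    for (int t = 0; t < nt; ++t)
        rho += partial[t];

    free(partial);
    return rho;
}
```

Because the work is expressed as loop iterations rather than tied to
thread IDs, the sketch stays correct even if the runtime provides fewer
threads than requested.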

AMD-Internal: [CPUPL-2560]
Change-Id: I30982b622d531b29d3570b2f5f7569bd1d951d68
2022-09-30 07:36:17 -04:00
Arnav Sharma
90f915d3a9 Vectorized and parallelized zdscal routine
- Implemented optimized intrinsic kernel for zdscalv for the cases where AVX2 is supported.
- Also added multithreaded support for the same.
- The optimal number of threads is being calculated on the basis of input size.

AMD-Internal: [CPUPL-2602]
Change-Id: I4d05c3b1cc365a7770703286a89c6dce3875c067
2022-09-30 06:11:07 -04:00
satish kumar nuggu
9c292b79e2 Fixed ASAN reported issues in [s/c]trsm small kernels
Details:
1. Fixed the partial memory access overflows for the variables
AlphaVal, ones reported by ASAN.
2. Using 128 bit packed broadcast with the 64 bit data types
after type casting would cause the garbage data to be filled
in the destination register.
3. Fixed this issue by using set_ps instruction instead of broadcast.
4. In cases of n remainder being 1, extra elements were accessed that
could cause out of memory access. Removed the extra element access.

AMD-Internal: [CPUPL-2578][CPUPL-2587]
Change-Id: Iaa918060c66287f2f46bcb9f69e9323f6707cf75
2022-09-30 03:02:07 -04:00
mkadavil
f4702debb9 Zen4 compilation flag updates to support low precision gemm.
- BFloat16 flags added to zen4 make_defs in order to enable
compilation of low precision gemm by using zen4 config.
- Avoid -ftree-partial-pre optimization flag with gcc due to
non optimal code generation for intrinsics based kernels in
low precision gemm.
- Enable only Zen3 specific low precision gemm kernels (s16)
compilation when aocl_gemm addon is compiled on Zen3 machines.

AMD-Internal: [CPUPL-1545]
Change-Id: Id3be3410bfbf141bb6fc4b4e3391115a4e0bb79f
2022-09-29 08:19:40 -04:00
Eleni Vlachopoulou
1c1a0027a8 Bugfix in DNRM2 AVX path
Description:
Enabled DNRM2 AVX path and fixed bug that caused numerical accuracy
errors.

AMD-Internal: [CPUPL-2576]
Change-Id: Ic9fda9d9668bdfe233621f79db6acce518b4d10e
2022-09-29 04:46:04 -04:00
satish kumar nuggu
754627eae9 Addressed uninitialized variables in trsm small algo
1. Addressed uninitialized variables reported by Coverity for all
datatypes of trsm small algo.

AMD-Internal: [CPUPL-2542]
Change-Id: Ifae57ef6435493942732526720e6a9d6bec70e71
2022-09-29 01:49:22 -04:00
Arnav Sharma
985c3f0fe1 Multithreaded DSCALV
- Added multithreaded support for dscalv.
- Optimal number of threads is calculated based on the input size.

AMD-Internal: [CPUPL-2583]
Change-Id: I0a253c28cb389a4f5214439dbca62304888ca5ae
2022-09-28 07:42:49 -04:00
Field G. Van Zee
4c3dd93eba Use kernel CFLAGS for 'kernels' subdirs in addons. (#658)
Details:
- Updated Makefile and common.mk so that the targeted configuration's
  kernel CFLAGS are applied to source files that are found in a
  'kernels' subdirectory within an enabled addon. For now, this
  behavior only applies when the 'kernels' directory is at the top
  level of the addon directory structure. For example, if there is an
  addon named 'foobar', the source code must be located in
  addon/foobar/kernels/ in order for it to be compiled with the target
  configuration's kernel CFLAGS. Any other source code within
  addon/foobar/ will be compiled with general-purpose CFLAGS (the same
  ones that were used on all addon code prior to this commit). Thanks
  to AMD (esp. Mithun Mohan) for suggesting this change and catching an
  intermediate bug in the PR.
- Comment/whitespace updates.

(cherry picked from commit fd885cf98f)
Change-Id: I9a678f78bde90b23a6293ce90377004876f51067
2022-09-27 06:37:38 -04:00
Dipal M Zambare
1f345f87f5 Fixed ASAN reported issues in bli_l3_packm.c
The local_mem_s is allocated on the stack, but in the inner scope of the
“if” block; however, it can be accessed through cntl_mem_p outside
the if block. This error is flagged by address sanitizer. Fixed this
issue by moving the variable declaration to function scope.
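
The pattern behind this fix can be shown with a small stand-alone
sketch. The type local_mem_t and the function sum_first are hypothetical
stand-ins, not the code in bli_l3_packm.c.

```c
#include <stddef.h>

typedef struct { double buf[4]; } local_mem_t;   /* hypothetical stand-in */

/* Problem pattern (shown as a comment only, since it is undefined
   behaviour):
 *
 *     local_mem_t *p = shared;
 *     if (use_local) { local_mem_t local = ...; p = &local; }
 *     return p->buf[0];   // `local` no longer exists here
 */

/* Fixed pattern: the object is declared at function scope, so any
   pointer taken inside the if block stays valid until return. */
static double sum_first(int use_local, local_mem_t *shared)
{
    local_mem_t local = { { 1.0, 2.0, 3.0, 4.0 } };
    local_mem_t *p = shared;

    if (use_local) {
        p = &local;   /* safe: `local` outlives the if block */
    }
    return p->buf[0] + p->buf[1];
}
```

Address sanitizer can report the broken variant as a
stack-use-after-scope error; the fixed variant is clean because the
storage lives as long as the pointer is used.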

This fix addresses the issue reported for the following libflame APIs

geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs,
stedc, steqr, syevd

AMD-Internal: [CPUPL-2587]
Change-Id: I63749c7d406c7339d2b45b0488108ccd3f90a248
2022-09-23 16:41:00 +05:30
eashdash
d21cd51fde Accumulation type for alpha, beta values and BF16 bench integration
1. Correcting the type of alpha, and beta values from C_type
   (output type) to accumulation type.
   For the downscaled LPGEMM APIs, C_type will be the downscaled
   type but the required type for alpha and beta values should
   be the accumulation type.
2. BF16 bench integration with the LPGEMM bench for both the BF16
   (bf16bf16f32of32 and bf16bf16f32obf16) APIs

AMD-Internal: [CPUPL-2561]
Change-Id: I3a99336c743f3880be1b96605ceeeae7c3bd4797
2022-09-23 05:00:49 -04:00
Dipal M Zambare
3a3472e4e5 Revert "Fixed ASAN reported issues in bli_l3_packm.c"
This reverts commit 3401a0d202.

Got pushed without review.
2022-09-23 14:27:12 +05:30
Dipal M Zambare
3401a0d202 Fixed ASAN reported issues in bli_l3_packm.c
The local_mem_s is allocated on the stack, but in the inner scope of the
“if” block; however, it can be accessed through cntl_mem_p outside
the if block. This error is flagged by address sanitizer. Fixed this
issue by moving the variable declaration to function scope.

This fix addresses the issue reported for the following libflame APIs

geqp3, geqrf, gerq2, gerqf, gesvd, ggev, ggevx, potrf, potrs,
stedc, steqr, syevd

AMD-Internal: [CPUPL-2587]
Change-Id: Ib1e9f89bc4b6911e4b7e9b910d1ecc2ba00b286a
2022-09-23 14:24:55 +05:30
Mangala V
42434fbc31 Bug fix in TRSM small for S, D, C & Z datatypes in case of MT
1. Corrected B buffer access to use its offset instead of the
   starting address, which is required in case of MT.
3. When num_threads > 1, the B buffer is divided into blocks in the m or n
   dimension based on side right or left. Hence it must be accessed by its
   offset to reach the start of the block.
4. Currently the B matrix is divided into blocks for each thread and
   the complete matrix A is used by all threads.
   In case of a design change in the future, A buffer access was also
   modified to use its offset, to support partitioning of matrix A for MT.
AMD-Internal: [CPUPL-2520]

Change-Id: Ic09e9e945417b86e2bc2e2d4548f65db308cd2ea
2022-09-23 03:30:25 -04:00
Dipal M. Zambare
17694b6ca5 Enabled AVX512 compiler flags for the reference kernels.
- Updated zen4 configuration to enable AVX512 flags for the
  reference kernels
- Reference and vector kernels will use the same compiler flags

AMD-Internal: [CPUPL-2533]
Change-Id: I5a2ba7e584dc3fb93625df12cca6b6c18f514ea8
2022-09-23 11:39:33 +05:30
Harihara Sudhan S
a45827b3f9 u8s8s16os16 bug fix for downscale operation
- Removed some read code from the macros for downscale
- Store permute correction
- Simplified macros for edge cases and corrected intermediate
  operation

AMD-Internal: [CPUPL-2171]

Change-Id: Ifd2ff6b3d1c3874ac5cb8a545ff6daa7fb40ee68
2022-09-22 05:02:17 -04:00
Nallani Bhaskar
1e5d98322d Disabled DNRM2 AVX path temporarily
Description:
Disabled AVX2 optimized path for DNRM2 to avoid accuracy
issues in netlib blas test.

AMD-Internal: [CPUPL-2576]
Change-Id: I0764725d4f6b1e4e0b5f60a255bc681bb698560e
2022-09-22 13:00:04 +05:30
mkadavil
bf4d1da1b9 Column major input support for BFloat16 gemm.
-The bf16 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.
-Bench changes to test low precision gemm for column major inputs.
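
The swap works because a column-major matrix occupies the same memory as
the row-major layout of its transpose, so C = A*B in column-major is
C^T = B^T * A^T in row-major. A small sketch of the idea, using float
instead of bf16 and hypothetical function names (this is not the
aocl_gemm code):

```c
#include <stddef.h>

/* Reference row-major kernel: C (M x N) += A (M x K) * B (K x N).
   lda, ldb, ldc are row strides and may exceed the matrix widths. */
static void gemm_rowmajor(size_t M, size_t N, size_t K,
                          const float *A, size_t lda,
                          const float *B, size_t ldb,
                          float *C, size_t ldc)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < K; ++p)
                acc += A[i * lda + p] * B[p * ldb + j];
            C[i * ldc + j] += acc;
        }
}

/* Column-major C (m x n) += A (m x k) * B (k x n), computed with the
   row-major kernel by swapping the operands and the m/n dimensions. */
static void gemm_colmajor_via_rowmajor(size_t m, size_t n, size_t k,
                                       const float *A, size_t lda,
                                       const float *B, size_t ldb,
                                       float *C, size_t ldc)
{
    gemm_rowmajor(n, m, k, B, ldb, A, lda, C, ldc);
}
```

No data is copied or physically transposed; only the argument order and
the interpretation of the leading dimensions change.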

AMD-Internal: [CPUPL-2570]
Change-Id: I22c76f52619fd76d0c0e41531828b437a1935495
2022-09-22 02:50:46 -04:00
Dipal M Zambare
de3247b0da Removed extra prototypes for ?gemm3m APIs
- Removed prototypes for float(sgemm3m) and double(dgemm3m) types,
  as BLIS implements this API only for scomplex(cgemm3m) and
  dcomplex(zgemm3m)

AMD-Internal: [SWLCSG-1477]
Change-Id: Ifad86a74b4c939ed240743894b85bb4fa5e6d754
2022-09-20 15:51:56 +05:30
Eleni Vlachopoulou
a5891f7ead Adding AVX2 support for DNRM2
- For the cases where AVX2 is available, an optimized function is called,
based on Blue's algorithm. The fallback method based on sumsqv is used
otherwise.

- Scaling is used to avoid overflow and underflow.

- Works correctly for negative increments.

AMD-Internal: [CPUPL-2551]
Change-Id: I5d8976b29b5af463a8981061b2be907ea647123c
2022-09-20 06:05:01 -04:00
satish kumar nuggu
b7bc5f204f Fix in DTRSM Small MT
Details:
1. Changes are made in dtrsm small MT path, to avoid accuracy issues.

AMD-Internal: [SWLCSG-1470]
Change-Id: I65237225892f97b7222fe71f66b02841b5956560
2022-09-20 05:54:48 -04:00
Dipal M Zambare
2cdeea3c66 CBLAS/BLAS interface decoupling for the level 1 APIs
-In BLIS, the CBLAS interface is implemented as a wrapper around
 the BLAS interface. For example, the CBLAS API ‘cblas_dscal’
 internally invokes the BLAS API ‘dscal_’.

-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.

-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.
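
The resulting layering can be sketched as below. All three function
names are hypothetical and prefixed with example_ to avoid clashing with
real BLAS symbols; the scaling loop is a simplified stand-in for the
actual implementation.

```c
/* Shared implementation layer (hypothetical name). Like reference
   dscal, it does nothing for n <= 0 or incx <= 0. */
static void example_dscal_impl(int n, double alpha, double *x, int incx)
{
    if (n <= 0 || incx <= 0) return;
    for (int i = 0; i < n; ++i)
        x[i * incx] *= alpha;
}

/* BLAS-style wrapper: Fortran convention, arguments by reference. */
void example_dscal_(const int *n, const double *alpha,
                    double *x, const int *incx)
{
    example_dscal_impl(*n, *alpha, x, *incx);
}

/* CBLAS-style wrapper: arguments by value. It calls the shared
   implementation directly instead of going through the BLAS symbol. */
void example_cblas_dscal(int n, double alpha, double *x, int incx)
{
    example_dscal_impl(n, alpha, x, incx);
}
```

With this arrangement, an application that overrides the BLAS symbol
does not change the behaviour of the CBLAS entry point, and vice versa.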

AMD-Internal: [SWLCSG-1477]
Change-Id: I0e80071398af29c9313296d2a92e61e3897ac28e
2022-09-19 21:50:29 +05:30
Dipal M Zambare
866e8de7bf CBLAS/BLAS interface decoupling for the level 2 APIs
-In BLIS, the CBLAS interface is implemented as a wrapper around
 the BLAS interface. For example, the CBLAS API ‘cblas_dgemv’
 internally invokes the BLAS API ‘dgemv_’.

-This coupling between CBLAS and BLAS interface prevents the end
 user from overriding them individually by the application or
 other libraries.

-This change separates the CBLAS and BLAS implementation by adding
 an additional level of abstraction. The implementation of the
 API is moved to the new function which is invoked directly from
 the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]
Change-Id: Ie7cbbac86bbfa1075a5064b31b365e911f67786c
2022-09-15 17:51:05 +05:30