Commit Graph

2814 Commits

Author SHA1 Message Date
Shubham
1faee9f89e Fixed 32xk AVX512 double precision pack kernel
- Currently the pointer received as function argument is
  used for packing which causes only a partial copy of
  input buffer to output buffer due to strange optimizations
  by compiler.
- To fix this, instead of using a normal pointer for output
  buffer, we define a "restrict" local pointer variable.
- "restrict" keyword tells the compiler that the pointer is
  the only way to access the object pointed to by the pointer.
- By defining "restrict" local pointer pointing to output
  buffer, the mysterious problem of incomplete copy has
  been solved.

Change-Id: Ie2355beb1d43ff4b60b940dd88c4e2bf6f361646
2023-02-20 21:18:54 -05:00
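
Note: the "restrict" local pointer described in the commit above can be illustrated with a minimal sketch. The function name pack_copy and its signature are hypothetical and are not taken from the BLIS 32xk pack kernel; the sketch only shows the qualifier on a local alias of the output buffer, and it assumes the caller passes non-overlapping buffers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a packing routine: copies n doubles from src
   to dst. Name and signature are illustrative, not the BLIS kernel. */
void pack_copy(const double *src, double *dst, size_t n)
{
    /* Local restrict-qualified alias of the output buffer. This promises
       the compiler that, within this scope, the output buffer is accessed
       only through `out`. The caller must ensure src and dst do not
       overlap, otherwise the behavior is undefined. */
    double *restrict out = dst;

    for (size_t i = 0; i < n; ++i)
        out[i] = src[i];
}
```

Calling pack_copy(a, b, 5) on two distinct arrays of five doubles leaves b equal to a.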
Shubham
3ae84c98fd Fixed seg fault in Testsuite for DTRSM micro kernel
- In zen4 arch TRSM and GEMM have different blocksizes.
  A TRSM call updates the blocksizes in the global cntx object,
  which is incorrect for GEMM when GEMM and TRSM are
  called in parallel.
- Hence a local copy of cntx, which holds the TRSM blocksizes,
  is used instead.

AMD-Internal: [CPUPL-3019]
Change-Id: I5f0f5675b3917d2a11d582ac626ca5d8f4752c53
2023-02-20 05:34:42 +05:30
bhaskarn
91a9968a5e Developed intrinsic based f32 kernels in lpgemm
Description:

1. Developed row variant intrinsic Kernels for float32/sgemm
   which are called from lpgemm api aocl_gemm_f32f32f32of32()

2. 6x64m, 6x48m, 6x32m kernels and respective fringe kernels are
   developed using avx512.

3. 6x16m main kernel and respective n fringe and mn fringe are
   developed based on avx2 and avx

4. Modularizing, K loop unroll, perf tuning, post-ops and dynamic
   dispatch are planned next

5. When leading dims are greater than dims, bench_lpgemm needs
   to be updated to test it; this is planned next.

Change-Id: I54c78fef639ea109d6ef2c2b05c07ce396c81370
2023-02-20 01:11:22 -05:00
Arnav Sharma
46965dfc57 Developed 6x64 SGEMM row-preferred kernels
- Added kernels for all rv and rd variants.
- Main kernel is of size 6x64, and the associated fringe kernels
  added are
	- 4x64, 2x64, 1x64
	- 6x32, 4x32, 2x32, 1x32
	- 6x16, 4x16, 2x16, 1x16
- Updated the zen4 config to enable these kernels in zen4 path.
- Added C-prefetching to 6x? row-stored main kernels.
- C-prefetching for column storage yet to be added.
- K-loop unrolling for fringe kernels yet to be added.

AMD-Internal: [CPUPL-3002]
Change-Id: Ide18412cc6178b43a12a3bc7a608ce9d298fb2e4
2023-02-19 23:56:58 -05:00
Harihara Sudhan S
535b49f150 Vectorized COPYV for double complex
- Added ZCOPYV kernel that uses AVX2 and SSE instructions for
  vectorization.
- The routine returns early when the vector dimension is zero.
- The kernel takes one among the two available paths based on
  conjugation requirement of X vector.
- VZEROUPPER is added before transitioning from AVX2 to SSE.
- Added function pointer to ZEN, ZEN 2, ZEN 3 and ZEN 4 contexts.

AMD-Internal: [CPUPL-2773]
Change-Id: Ibd8a2de42060716395ef698d753c8462654cc0f0
2023-02-16 09:32:10 -05:00
Shubham
18569b42ee Added DTRSM small RUNN/RLTN variant AVX512 kernels
- 8x8 kernels are used for DTRSM SMALL
- Matrix A(a10) is packed for GEMM operations.
- Packed matrix A will be re-used in all the col-block
  along N-dimension.
- Diagonal elements of A matrix are packed(a11) for
  TRSM operations.
- Implemented fringe cases with following block sizes
   8x8, 8x4, 8x3, 8x2, 8x1
   4x8, 4x4, 4x3, 4x2, 4x1
   3x8, 3x4, 3x3, 3x2, 3x1
   2x8, 2x4, 2x3, 2x2, 2x1
   1x8, 1x4, 1x3, 1x2, 1x1

AMD-Internal: [CPUPL-2745]

Change-Id: I6a174e7f88a4c2c5778052525879552a1e82f6ad
2023-02-16 06:49:46 -05:00
mkadavil
63ee4c5e4c Remove memcpy usage in u8s8s32 lpgemm micro kernels.
-As of now, memcpy is used in u8s8s32 micro-kernel for copying in k
fringe loop (( k % 4 )!= 0) and NR' < 16 fringe kernels. However for
small k/n dimensions, memcpy invocation has high overhead.
-This issue is fixed by replacing memcpy with a MACRO based
implementation of copy routine, specifically optimized for the sizes
that will be encountered in fringe cases (k < 4, NR' < 16).

AMD-Internal: [CPUPL-3008]
Change-Id: I376bab0aac325832e42e370b291614e5fd5272dc
2023-02-16 05:52:19 -05:00
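
Note: a macro-based copy for tiny fringe sizes, as described in the commit above, might look like the following sketch. The macro name COPY_FRINGE_LT4 and the wrapper copy_k_fringe are hypothetical; the real lpgemm macros operate on kernel-specific buffers and may be structured differently. The sketch covers only the k < 4 case with byte elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: copies n (0..3) bytes with plain stores, so no
   memcpy call is made for these very small sizes. */
#define COPY_FRINGE_LT4(dst, src, n)                      \
    do {                                                  \
        uint8_t *d_ = (uint8_t *)(dst);                   \
        const uint8_t *s_ = (const uint8_t *)(src);       \
        size_t n_ = (size_t)(n);                          \
        if (n_ > 0) d_[0] = s_[0];                        \
        if (n_ > 1) d_[1] = s_[1];                        \
        if (n_ > 2) d_[2] = s_[2];                        \
    } while (0)

/* Illustrative wrapper for a k fringe of k_rem = k % 4 elements. */
void copy_k_fringe(uint8_t *dst, const uint8_t *src, size_t k_rem)
{
    COPY_FRINGE_LT4(dst, src, k_rem);
}
```

Because the size is bounded by a small constant, the compiler can emit a handful of conditional byte stores in place of a library call.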
chandrkr
399831d7cb AOCL-Windows: Script Update to mirror reference kernel files with Unique names
Details:
1. Reference kernel File names are to be mirrored with unique names
for each architecture configuration while generating fat binary.
2. Python script "blis_ref_kernel_mirror.py" is updated to append
the filenames with each configuration (zen, zen2, zen3, zen4
and generic) that is built for windows.

AMD-Internal: [CPUPL-3009]
Change-Id: Ib02206382199cf2aebe14ff9c869b6089228e1c2
2023-02-14 23:37:13 -05:00
Harihara Sudhan S
f6960a3b6a Fixed bug in scal2v reference
- When alpha is equal to 1 in scal2v the computation can be reduced to
  copyv. The condition check in else if condition has to be a check
  for alpha equal to one and not zero.
- BLIS_NO_CONJ has been passed while invoking copy. In case of single
  and double complex, the appropriate conjx needs to be passed with
  copyv so that the vector can be conjugated before copying to the
  destination.

AMD-Internal: [CPUPL-2985]
Change-Id: Ieb1ad111dff68098a08d870da7215bbd4ff5c9d0
2023-02-05 23:11:51 -05:00
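
Note: the corrected scal2v control flow (y := alpha * conjx(x)) can be sketched as below. The type dcomplex_t and the functions apply_conj, copyv_ref and scal2v_ref are hypothetical simplifications for unit-stride vectors; they are not the BLIS reference code. The points illustrated are the alpha == 1 check in the else-if branch and the forwarding of conjx to the copy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical double complex type, used to keep the sketch standalone. */
typedef struct { double re, im; } dcomplex_t;

static dcomplex_t apply_conj(dcomplex_t v, bool conjx)
{
    if (conjx) v.im = -v.im;
    return v;
}

/* y := conjx(x) */
static void copyv_ref(bool conjx, size_t n, const dcomplex_t *x, dcomplex_t *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = apply_conj(x[i], conjx);
}

/* y := alpha * conjx(x) */
void scal2v_ref(bool conjx, size_t n, dcomplex_t alpha,
                const dcomplex_t *x, dcomplex_t *y)
{
    if (n == 0) return;

    if (alpha.re == 0.0 && alpha.im == 0.0) {
        /* alpha == 0: the result is the zero vector. */
        for (size_t i = 0; i < n; ++i) { y[i].re = 0.0; y[i].im = 0.0; }
    } else if (alpha.re == 1.0 && alpha.im == 0.0) {
        /* alpha == 1: reduce to a copy, forwarding conjx so that the
           vector is conjugated when required. */
        copyv_ref(conjx, n, x, y);
    } else {
        for (size_t i = 0; i < n; ++i) {
            dcomplex_t xi = apply_conj(x[i], conjx);
            y[i].re = alpha.re * xi.re - alpha.im * xi.im;
            y[i].im = alpha.re * xi.im + alpha.im * xi.re;
        }
    }
}
```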
Harihara Sudhan S
8d7593a940 AVX2 vectorized SCALV for double complex
- Added ZSCALV that uses AVX2 and SSE instructions for vectorization.
- Return early when the vector dimension is zero. When alpha is 1 there
  is no need to perform computation hence return early.
- When alpha is zero expert interface of ZSETV is invoked. In this case,
  all the elements of the input vector are set to 0.
- Invocation of expert interface means that NULL pointer can be passed
  to the function in place of context. Expert interface of ZSETV will
  query the context and get the appropriate function pointer.
- Added BLAS interface for ZSCALV. The architecture ID is used to decide
  the function that is to be invoked.
- Created a new macro INSERT_GENTFUNCSCAL_BLAS_C to instantiate SCALV
  BLAS macro interface only for single complex type and single complex,
  float mixed type

AMD-Internal: [CPUPL-2773]
Change-Id: I0d6995bce883c0ebdc5da0046608fc59d03f6050
2023-02-05 09:36:12 -05:00
Harihara Sudhan S
18ae57305e ZAXPYF4 optimization
- Vectorized alpha scaling of X vector using SSE instructions. This
  can be done irrespective of incx.
- Added code to prefetch A matrix and Y vector to L1 cache
- Vectorized fringe case computation and non-unit stride computation
  with SSE instructions.
- Increased unroll in unit stride cases for better register
  utilization.

AMD-Internal: [CPUPL-2773]
Change-Id: I217e6ce9e3f5753ebe271c684abd9a2274fd2715
2023-02-04 12:34:50 -05:00
mkadavil
d2713d3dc0 GCC compiler optimization flag (kernel) update for zen3 and zen4 config.
-Inefficient assembly is generated for s16 gemm micro-kernel(intrinsics
code) when compiled using gcc. The presence of -fschedule-insns +
-fschedule-insns2 + -ftree-pre in O2 compiler optimization flags
results in the code being optimized to reduce data stalls, and results
in the usage of stack to store intermediate C register output. Disabling
-ftree-pre in gcc fixes the issue, even in the presence of the other
two flags.

AMD-Internal: [CPUPL-2971]
Change-Id: Ibf0dcde20b5a18708a05faad34e684eb0a9a5463
2023-02-02 23:58:14 -05:00
eashdash
672544bc04 GeLU Activation Function Post-Op for LPGEMM S16, S32 and BF16
1. Added Tanh approximation based GeLU Post-Op for S16, S32 and BF16
2. Changes are done at frame and micro-kernel level to
   implement this post-op.
3. Efficient AVX-512 and AVX-2 vector versions of TANHF and EXPF
   functions are implemented for the GeLU post-operation.
4. TANH and EXPF math functions are efficiently implemented in
   macro-based fashion to exploit register level fusion of GeLU
   with GEMM operations for improved performance
5. LPGEMM bench is changed to pass GeLU post-op as input and
   support accuracy check to verify functional correctness

AMD-Internal: [CPUPL-2978]
Change-Id: I472ac35c00a4ea1ab983cc5f6ff6a123c8035f28
2023-02-02 08:25:04 -05:00
sireesha.sanga
fe999a9653 Removal of Prohibited Source Links as per Legal Scan
AMD-Internal: [SWCSD-5866]
Change-Id: I03011b114d37fcee8a399af8d631aee610b1f924
2023-01-27 07:56:01 -05:00
sireesha.sanga
71a25a5152 DAXPY Parallelization
Details:
- DAXPYV is parallelized based on the number of threads
set by Application.
- Using the empirical data collected on genoa, optimal
number of threads is set.
- If intended number of threads to be spawned is not
equal to actual number of threads, AXPY routine will
execute sequentially.

Change-Id: I48015a712ada6428056af6320aa5570596a02b1f
AMD-Internal: [CPUPL-2747]
2023-01-24 07:32:16 -05:00
Kiran Varaganti
201db7883c Integrated 32x6 DGEMM kernel for zen4 and its related changes are added.
Details:
- Now AOCL BLIS uses AVX512 - 32x6 DGEMM kernel for native code path.
  Thanks to Moore, Branden <Branden.Moore@amd.com> for suggesting and
  implementing these optimizations.
- In the initial version of 32x6 DGEMM kernel, to broadcast elements of B packed
  we perform load into xmm (2 elements), broadcast into zmm from xmm and then to get the
  next element, we do vpermilpd(xmm). This logic is replaced with direct broadcast from
  memory, since the elements of Bpack are stored contiguously, the first broadcast fetches
  the cacheline and then subsequent broadcasts happen faster. We use two registers for broadcast
  and interleave broadcast operation with FMAs to hide any memory latencies.
- Native dTRSM uses 16x14 dgemm - therefore we need to override the default blkszs (MR,NR,..)
  when executing trsm. We call bli_zen4_override_trsm_blkszs(cntx_local) on a local cntx_t object
  for double data-type as well in the function bli_trsm_front(), bli_trsm_xx_ker_var2, xx = {ll,lu,rl,ru}.
  Renamed "BLIS_GEMM_AVX2_UKR" to "BLIS_GEMM_FOR_TRSM_UKR" and in the bli_cntx_init_zen4() we replaced
  dgemm kernel for TRSM with 16x14 dgemm kernel.
- New packm kernels - 16xk, 24xk and 32xk are added.
- New 32xk packm reference kernel is added in bli_packm_cxk_ref.c and it is
  enabled for zen4 config (bli_dpackm_32xk_zen4_ref() )
- Copyright year updated for modified files.
- cleaned up code for "zen" config - removed unused packm kernels declaration in kernels/zen/bli_kernels.h
- [SWLCSG-1374], [CPUPL-2918]

Change-Id: I576282382504b72072a6db068eabd164c8943627
2023-01-19 23:11:36 +05:30
Harsh Dave
0a699c45f0 Corrected VEXTRACTF64X2 macro in bli_x86_asm_macro file.
- Previously VEXTRACTF64X2 macro was defined as
vextractf64x4.

Change-Id: I79727a85b7d6da3b4d524064e297fc8c71d4f466
2023-01-16 23:04:48 -05:00
mkadavil
4b5e24d0d9 Column major input support for f32 gemm (sgemm for lpgemm).
-The f32 gemm framework is modified to swap input column major matrices
and compute gemm for the transposed matrices (now row major) using the
existing row-major kernels. The output is written to C matrix assuming
it is transposed.
-Framework changes to support leading dimensions that are greater than
matrix widths.

AMD-Internal: [CPUPL-2919]
Change-Id: I805f1cb9ff934bb3106e01eb74e528915ffb90a3
2023-01-16 04:04:21 -05:00
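
Note: the column-major handling described in the commit above relies on the identity C^T = B^T * A^T. A column-major m x k buffer with leading dimension lda occupies the same memory as a row-major k x m matrix with row stride lda, so swapping the operands lets a row-major kernel compute the column-major result. The sketch below uses hypothetical function names and a naive triple loop in place of the real lpgemm kernels.

```c
#include <stddef.h>

/* Row-major C (m x n) = A (m x k) * B (k x n); lda, ldb, ldc are row
   strides and may exceed the matrix widths. Naive reference loop. */
static void gemm_rowmajor(size_t m, size_t n, size_t k,
                          const float *a, size_t lda,
                          const float *b, size_t ldb,
                          float *c, size_t ldc)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = acc;
        }
}

/* Column-major C (m x n) = A (m x k) * B (k x n). Viewed as row-major,
   the buffers hold A^T, B^T and C^T, so compute C^T = B^T * A^T by
   swapping the operands and the m and n dimensions. */
void gemm_colmajor(size_t m, size_t n, size_t k,
                   const float *a, size_t lda,
                   const float *b, size_t ldb,
                   float *c, size_t ldc)
{
    gemm_rowmajor(n, m, k, b, ldb, a, lda, c, ldc);
}
```

No data is moved by the swap; only the roles of the arguments change, which is why existing row-major kernels can be reused unchanged.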
Sireesha Sanga
540509f374 Enabling AVX2 path for SCNRM2
2023-01-13 10:27:54 +05:30
Eleni Vlachopoulou
758d68467f Disabling AVX2 path for SNRM2 and SCNRM2.
AMD-Internal: [CPUPL-2865]
Change-Id: I09c67115801a6b9446c7930c54fc937bd17908a3
2023-01-12 09:58:28 -05:00
Meghana Vankadari
f39dba9fd8 Added dzgemm, DZGEMM, DZGEMM_ prototypes to wrapper file.
AMD-Internal: [CPUPL-2199]
Change-Id: Ied814cb7be60d30b8217ec42ac436b4e628ea6d2
2023-01-12 01:35:29 -05:00
mkadavil
3870792e62 Low precision gemm s32 downscale optimization.
-The post operations attributes are moved to a new struct
lpgemm_post_op_attr, and an object of this struct is passed to the
low precision gemm kernels in place of the multiple parameters.
-The u8s8s32s8 api (downscale api) performance is low when the k
value is small (k < KC). Two scenarios are observed here:
a. beta = 0: Currently, for downscale api, a temporary buffer is
used to accumulate intermediate s32 output, so that it can be used
in later iterations of pc loop (k dim). The usage of this buffer
(store) can be avoided if k < KC. Here intermediate accumulation
is not required, since after the first iteration of the pc loop,
the output can be downscaled and stored.
b. beta != 0: In this case the existing values of the original s8 C
output matrix needs to be converted to s32 and beta scaled. Currently
the s8 values are converted to s32 and stored in temporary buffer in
pc loop (5 loop algorithm) in blocks of mxNC. This temporary buffer
is passed to the micro kernel and beta scaling is applied on this.
However the mxNC block copy is costly and can be avoided if a new
condition is introduced for beta scaling in the micro kernel, whereby
the original s8 data is loaded instead of from the temporary buffer
to a register, converted to s32 and beta scaling applied on it.

AMD-Internal: [CPUPL-2884]
Change-Id: Id9b4650d500e1b553e48c4f1e4c902b3f553211c
2023-01-10 13:15:22 +05:30
Kiran Varaganti
d8d4499e54 AVX2 dgemm kernel optimization for AOCC
Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), the operation involved with
     k0 is typecasted to uint64_t to enable AOCC generate optimized code.
     Thanks to Jini Susan (jinisusan.george@amd.com) from compiler team for suggesting
     this change. Similar change was applied to sgemm, cgemm and zgemm kernels.
Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9
2023-01-09 07:49:41 -05:00
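
Note: a minimal sketch of the kind of cast the commit above describes. The typedef dim_t, the function split_k and the unroll factor of 4 are assumptions made for illustration. Since k0 is never negative, casting it to uint64_t does not change its value, and unsigned division and remainder by a power of two reduce to a shift and a mask, whereas the signed forms need extra instructions to round toward zero for negative inputs. Whether this is the exact reason AOCC generates better code is not stated in the commit.

```c
#include <stdint.h>

/* Assumption: a signed 64-bit dimension type. */
typedef int64_t dim_t;

/* Splits k0 into full unrolled iterations and leftover iterations,
   assuming a hypothetical unroll factor of 4. */
void split_k(dim_t k0, uint64_t *k_iter, uint64_t *k_left)
{
    /* k0 >= 0 is assumed, so the cast preserves the value. */
    *k_iter = (uint64_t)k0 / 4;
    *k_left = (uint64_t)k0 % 4;
}
```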
Shubham
9e8595356f Fixed bug related to block size override for TRSM
- In zen4 TRSM and GEMM have different blocksizes,
  when trsm is called, blocksizes are changed in
  global cntx object. If GEMM and TRSM are called
  in parallel, blocksizes in global cntx will not
  be correct for GEMM which will cause a seg fault.
- To fix this, a local copy of cntx is created
  and blocksizes are changed only in the local copy.

AMD-Internal:[CPUPL-2896]
Change-Id: I0e724520a92fc3b2ed0becf385ec41ab5d1b4490
2023-01-09 05:30:36 -05:00
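
Note: the local-copy fix described in the commit above can be sketched as follows. The struct cntx_sketch_t and all function names are hypothetical; the real cntx_t holds far more state. The blocksize values are illustrative only. The point is that the override is applied to a by-value copy, so a concurrent GEMM reading the global object never observes TRSM blocksizes.

```c
/* Hypothetical, heavily simplified context holding two blocksizes. */
typedef struct {
    int mr;
    int nr;
} cntx_sketch_t;

/* Global context initialized with illustrative GEMM blocksizes. */
static cntx_sketch_t global_cntx = { 32, 6 };

/* Applies illustrative TRSM blocksizes to the context it is given. */
static void override_trsm_blkszs(cntx_sketch_t *cntx)
{
    cntx->mr = 16;
    cntx->nr = 14;
}

/* TRSM path: copy the global context by value and override only the
   local copy; the global object is left untouched. */
int trsm_tile_elems(void)
{
    cntx_sketch_t cntx_local = global_cntx;
    override_trsm_blkszs(&cntx_local);
    return cntx_local.mr * cntx_local.nr;
}

/* GEMM path: reads the global context directly. */
int gemm_tile_elems(void)
{
    return global_cntx.mr * global_cntx.nr;
}
```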
Edward Smyth
82c2eb4e8e Code cleanup and warnings fixes
Corrections for some occurrences of:
- Compiler warnings about initialization of float from double
- Spelling mistakes in comments
- Incorrect indentation of code and comments

AMD-Internal: [CPUPL-2870]
Change-Id: Icb68c789687bd0684844331d43071bfffecac9fc
2023-01-09 04:34:52 -05:00
Meghana Vankadari
fbd9dde869 Fix bug in DZGEMM
Details:
- In case of dzgemm, if the microkernel prefers column output,
  We will induce a transposition and perform C += A*B
  where A (formerly B) becomes complex and B(formerly A)
  becomes real. Hence attach complex alpha object to A instead of B.
- This commit reverts all the changes made by
  d62f12a18a and
  0b81f53074 as they are causing
  failures in make checkblis-md.

AMD-Internal: [CPUPL-2893]
Change-Id: I56b94ac136fb96003302c568ae2587142c836620
2023-01-09 03:30:57 -05:00
Vignesh Balasubramanian
dbd0c069d4 Implemented 3x4 RD kernels for ZGEMM in SUP path
-Implemented (r)ow preferential (d)ot product milli-kernels
 (m and n variants) for dcomplex datatype along SUP path.

-These computational kernels extend the support for handling RRC and
 CRC storage schemes along the SUP path. In case of BLAS api call,
 it corresponds to the input cases with transa equal to T and
 transb equal to N.

-In case of the B matrix being packed(conditionally), the inputs are
 redirected to the existing (r)ow preferential (v)ector load optimized
 kernels due to better performance.

-Added macro for vhsubpd assembly instruction, to support the arithmetic
 for complex datatype in its interleaved storage.

AMD-Internal: [CPUPL-2593]
Change-Id: If90834e55e9e31aa87d3d5b711efad9ef2458da8
2022-12-22 18:31:46 +05:30
Nithya V S
6e9defe1b5 Developed AVX512 based "sup" kernels for SGEMM
- Only for RRR case, var2m kernels are added
- Main kernel is of 12x32 (AVX512), associated fringe kernels of
	- 8x32, 4x32, 2x32, 1x32 (AVX512)
	- 12x16, 8x16, 4x16, 2x16, 1x16 (AVX512)
	- 12x8, 8x8 (AVX2)
	- 12x4, 8x4 (SSE4)
	- 12x2, 8x2 (SSE4)
	- existing AVX2/SSE4 kernels are used for other fringe
	  cases
- Currently, these kernels are not invoked in zen4 path
- Once all AVX512 kernels (n and rd) are done, invoke all of them
  together in zen4 config

AMD-Internal: [CPUPL-2801]

Change-Id: I7a206fee9151e92319d83dcc5f3eed61d3bf1196
2022-12-16 10:18:09 +05:30
Eleni Vlachopoulou
13aa3c8cd0 Adding AVX2 support for SNRM2 and SCNRM2
- For the cases where AVX2 is available, an optimized function is called,
    based on Blue's algorithm. The fallback method based on sumsqv is used
    otherwise.

    - Scaling is used to avoid overflow and underflow.

    - Works correctly for negative increments.

    AMD-Internal: [SWLCSG-1080]

Change-Id: I6bf2f42652ba6b8a8631a0a9e6f6297d5b3ea5d9
2022-12-14 04:25:45 -05:00
Edward Smyth
2592774fe8 BLIS: Nested Parallelism issues (3)
Bugfix for parallel BLAS1 and BLAS2 routines. Threading information
was not being set correctly when initializing local rntm from global.

Also ensure th_rntm is initialized along with global_rntm by updating
it in bli_thread_init(), called by bli_init_once()

AMD-Internal: [CPUPL-2433]
Change-Id: Iba658f87ae13fe16a57ca1fc279e149b7fa294cf
2022-12-13 12:38:40 -05:00
Eleni Vlachopoulou
88e549e7bd Using CMake as the build system for both Linux and Windows:
- GoogleTest headers removed. GoogleTest gets fetched at
configuration time.
- BLIS headers removed. A BLIS installation path is required at
configuration time.
- Windows has been temporarily disabled.

AMD-Internal: [CPUPL-2732]
Change-Id: I9e55c8e43b2733f96cd8b6e5449d79623decad5c
2022-12-13 19:09:23 +05:30
jagar
cff29bde76 Added gtestsuite folder into blis repo
Moved blis gtestsuite from lib-confscript to blis repo
(branch: amd-main)

Change-Id: If7ad391eef66bac6d26cf5223e6043d52b746072
2022-12-07 23:57:13 -05:00
Edward Smyth
345aacf806 BLIS: Nested parallelism issues (2)
Improvements to recent parallelism changes:

1. BLIS specific threading options: In bli_thread_update_rntm_from_env()
   set threading variables in tl_rntm to serial values when OpenMP level
   for parallelism within BLIS will not be active. User supplied BLIS
   threading values remain unchanged in global_rntm.

2. Simplify code structure in bli_thread_update_rntm_from_env().

3. Change variable declarations in bli_thread_init_rntm_from_env()
   and bli_thread_update_rntm_from_env() to avoid unused variable
   warnings in non-OpenMP builds.

AMD-Internal: [CPUPL-2433]
Change-Id: I5505657e3d2722e69bc4a1c1bb9fd8df55407fdd
2022-12-07 04:34:07 -05:00
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
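
Note: the fallback thread count described in the commit above, the product of the five BLIS_*_NT values, can be sketched as below. The function nt_from_ways is hypothetical, and treating an unset or non-positive value as 1 is an assumption of this sketch, not something stated in the commit.

```c
/* ways[] holds the parsed values of BLIS_JC_NT, BLIS_PC_NT, BLIS_IC_NT,
   BLIS_JR_NT and BLIS_IR_NT, in that order. Assumption: a value <= 0
   means "not set" and contributes a factor of 1. */
int nt_from_ways(const int ways[5])
{
    int nt = 1;
    for (int i = 0; i < 5; ++i)
        nt *= (ways[i] > 0) ? ways[i] : 1;
    return nt;
}
```

With only BLIS_IC_NT set to 8, as in the HPL scenario, the product is 8 and can stand in for the -1 thread count in the BLAS1 and BLAS2 routines.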
Chandrashekara K R
5f1ea3246a Updated blis library version string to 4.0.1
Change-Id: I1f3eb9081bcc05ccec300e00860d905b23d9709a
2022-11-24 10:35:34 +05:30
Arnav Sharma
de7b7b8eb8 Fix for ZDSCAL Multi-thread implementation
- In the existing implementation, when the actual number of threads
  spawned is different from the indicated number of threads for the
  parallel region, partial job is being executed.
- Fix added to identify actual number of threads spawned and
  allocate the work load to single thread in case of discrepancy
  in the number of threads spawned vs indicated.

AMD-Internal: [CPUPL-2761]
Change-Id: I9eb9baccaef7f2149346be705ee151ae7af0a0e5
2022-11-22 11:03:07 +05:30
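
Note: the failure mode and fix described in the commit above can be sketched without a threading runtime. The functions thread_range and effective_threads are hypothetical. In an OpenMP build the actual thread count would presumably be obtained with omp_get_num_threads() inside the parallel region; here both counts are plain arguments. If the work is partitioned for 4 threads but only 2 run, the ranges of threads 2 and 3 are never processed, which is the partial job the commit mentions.

```c
#include <stddef.h>

/* Half-open range [*start, *end) of the n elements assigned to thread
   tid when the work is split across nt threads. */
void thread_range(size_t n, size_t nt, size_t tid,
                  size_t *start, size_t *end)
{
    size_t chunk = n / nt;
    size_t rem = n % nt;
    *start = tid * chunk + (tid < rem ? tid : rem);
    *end = *start + chunk + (tid < rem ? 1 : 0);
}

/* Thread count to partition with: if the runtime spawned a different
   number of threads than requested, fall back to a single thread that
   handles the whole vector. */
size_t effective_threads(size_t requested, size_t actual)
{
    return (actual == requested) ? requested : 1;
}
```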
Arnav Sharma
2bc2d11e8a Fix for DSCAL Multi-thread implementation
- In the existing implementation, when the actual number of threads
  spawned is different from the indicated number of threads for the
  parallel region, partial job is being executed.
- Fix added to identify actual number of threads spawned and
  allocate the work load to single thread in case of discrepancy
  in the number of threads spawned vs indicated.

AMD-Internal: [CPUPL-2761]
Change-Id: Ife36e6e4993bdcc5a506349b54b2177173866e32
2022-11-21 08:49:07 -05:00
Arnav Sharma
69be3b0557 Fix for DDOT Multi-thread implementation
- In the existing implementation, when the actual number of threads
  spawned is different from the indicated number of threads for the
  parallel region, partial job is being executed.
- Fix added to identify actual number of threads spawned and
  allocate the work load to single thread in case of discrepancy
  in the number of threads spawned vs indicated.

AMD-Internal: [SWLCSG-1659]
Change-Id: I882d0af608009e502189a4469eb3483d659c2b3f
2022-11-21 06:17:26 -05:00
Harihara Sudhan S
11c42ce1d3 C matrix prefetch for BF16 GEMM
- Broke down the KR loop inside the compute kernel into
  two pieces
- Added C matrix prefetch between the two decomposed
  pieces of KR loop

AMD-Internal: [CPUPL-2693]
Change-Id: Ib73bc2145de4c75bc8153d7d7d20fb057270c94e
2022-11-21 04:57:19 -05:00
Harihara Sudhan S
2f6c26eff6 Dynamic block tuning for LPGEMM u8s8s16os16
- NC value is adjusted based on the n dimension of the B
  matrix.
- KC value is adjusted based on the k dimension input
  matrices.
- In order to ensure a handshake between the reorder
  function and compute function, the same dynamic block
  tuning logic is called in both places.
- Reorder followed by compute scenarios mean that tuning
  KC value based on MC value and vice versa is not feasible.

AMD-Internal: [CPUPL-2638]
Change-Id: Ie515da7de83e3dfb59946e873c92d67f11559429
2022-11-21 01:39:35 -05:00
Vignesh Balasubramanian
078b0bc626 Enabling ZGEMM SUP path for single thread
-Enabled the sup path for zgemm in case of single thread.
-Fine tuned the thresholds for small path in order to handle
 certain spectrum of skinny matrices.
-The inputs are now redirected among small, sup and native
 paths in case of single thread.

AMD-Internal: [CPUPL-2632]
Change-Id: I3eca315bcb8bfa8e015349a7b2bc4ebd5448e065
2022-11-04 16:05:48 +05:30
Edward Smyth
34730a1e4c BLIS: Nested parallelism issues
1. Check OpenMP active level against max active levels when setting
   number of threads for starting a new parallel region in
   ./frame/thread/bli_thread.c to ensure the correct number of threads
   is used when BLIS is called within nested OpenMP parallelism.

2. In subsequent BLIS calls, threading choices could be incorrectly
   set based on values used and stored in global_rntm by a previous
   call. This could apply when the OpenMP number of threads differ from
   call to call, different nested parallelism is used in different
   parts of a user's code, or different threads at the user level
   request different numbers of OpenMP threads for BLIS calls.

   Keep threading information in both global_rntm and a new Thread
   Local Storage copy tl_rntm. Update tl_rntm from OpenMP runtime
   environment (as appropriate) during bli_init_auto() calls in each
   BLIS routine. The details are:
   * global_rntm is initialized on first BLIS call based on OpenMP and
     BLIS threading environment variables.
   * global_rntm is updated by any BLIS threading function calls.
   * In bli_thread_update_tl(), called by bli_init_auto(), sync with
     any BLIS values set or updated in global_rntm. Then, if BLIS
     threading control is not used, check OpenMP ICVs and set thread
     count and auto_factor appropriately.
   * Setting BLIS threading locally (using expert interfaces to pass
     a user defined rntm data structure) should work as before.

3. bli_thread_get_is_parallel can now only be called outside of
   parallelism within BLIS routines. Change calls in trsm to reflect
   this.

4. Ensure blis_mt is set to TRUE in bli_thread_init_rntm_from_env()
   if any BLIS_*_NT environment variables are set.

5. Set auto_factor = FALSE when the number of threads is 1.

6. bli_rntm_set_num_threads() and bli_rntm_set_ways() set blis_mt=TRUE.

7. Set blis_mt=FALSE in BLIS_RNTM_INITIALIZER and bli_rntm_init().

8. For debugging, internal information on the rntm threading data can
   be printed by defining "PRINT_THREADING" at the top of bli_rntm.h

9. bli_rntm_print() now also prints the value of blis_mt.

10. Function prototypes in bli_rntm.h moved to top of file, so that
   bli_rntm_print() can be used within inline functions defined in
   this header file.

11. Comment out bli_init_auto() and bli_finalize_auto() calls in
    Fortran interfaces in frame/compat/blis/thread/b77_thread.c

12. In frame/3/bli_l3_sup_int_amd.c move two calls to set_pack_a and
    set_pack_b functions outside of the auto_factor if statements.

13. Misc code tidying.

AMD-Internal: [CPUPL-2433]
Change-Id: I8342c37fb4e280118e5e55164fbd6ea636f858ee
2022-10-21 07:38:39 -04:00
Mangala V
1290c27b2a Revert "Enabled MT path in DTRSM Small"
This reverts commit bdbdd20996.

Reason for revert: Functionality issue in TRSM MT Path on Genoa

Change-Id: I197ce4d58641196de9fb32828b3cc91c568b5de7
2022-10-21 05:13:38 -04:00
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
  files missing them

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Vignesh Balasubramanian
f6185f2e53 Disabling ZGEMM SUP for single thread
-Disabled the sup path for zgemm with single thread temporarily
 until we tune the performance and update the thresholds.
 The redirection of inputs in zgemm is now among small and native path
 for single thread.
-Fine tuned the input size constraints for entry to small path in zgemm
 in case of single thread.

AMD-Internal: [CPUPL-2495]

Change-Id: Ifa19116757ca132d9636c3f2a0661524d2c1d3ec
2022-10-11 08:09:53 -04:00
eashdash
63864d7dfb Added clipping while downscaling for u8s8s32os8 and u8s8s16os8.
Clipping is done during the downscaling of the accumulated result
from s32 to s8 for u8s8s32os8 and from s16 to s8 for u8s8s16os8,
to saturate the final output values between [-128,127]

AMD-Internal: [LWPZENDNN-493]

Change-Id: Ica9bba5044e87b815e2b4e35809bf440bb9dd41f
2022-10-11 07:28:06 -04:00
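
Note: a scalar sketch of the clipping step described in the commit above. The function clip_s32_to_s8 is hypothetical; the actual kernels are vectorized and their implementation is not shown in the commit. The sketch only illustrates saturating an already downscaled s32 value to the s8 range [-128, 127].

```c
#include <stdint.h>

/* Saturates a 32-bit value to the signed 8-bit range [-128, 127]. */
int8_t clip_s32_to_s8(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}
```

Without the clip, a plain narrowing conversion of an out-of-range value such as 300 would wrap on typical platforms and produce an unrelated small number.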
Mangala V
bdbdd20996 Enabled MT path in DTRSM Small
1. Fixed accuracy issues in dtrsm small multi-thread.
2. Fixed out of bound memory accesses in the patch
   "Fixes to avoid Out of Bound Memory Access in TRSM small algorithm"
3. Re-enabled DTRSM small MT with above fixes.

AMD-Internal: [CPUPL-2567]
Change-Id: Ibf2949b25fde4007a92cc635526bed0e0d897800
2022-10-10 09:47:45 -04:00
Harihara Sudhan S
492555785a Fixed bench accuracy issue in LPGEMM
Description:
- When the s8 result values for u8s8s32 and u8s8s16 are
  close to 0, they are getting ceiled to 1.
- Used nearbyintf to round the downscaled values in bench reference.
  This fixed the result mismatch issue between the vectorized kernel
  implementation and reference implementation in bench accuracy test.

AMD-Internal: [CPUPL-2617]
Change-Id: Ie42d612b1933bf622e6bd80eaf3db4bcb7a3ce82
2022-10-07 09:48:21 +00:00
Edward Smyth
991f2ec0e8 BLIS: Errors in Netlib LAPACK Tests
Zen4 kernel bli_damaxv_zen_int_avx512 is causing incorrect results in
the netlib LAPACK tests, specifically in:

./xlintstd < ../dtest.in > dtest.out

in the TESTING/LIN directory. Given time constraints, i.e. the need to
finalize code for AOCL 4.0 release, disable calls to AVX512 kernel
(i.e. always use the AVX2 kernel) for now, and aim to correct
bli_damaxv_zen_int_avx512 for AOCL 4.1.

AMD-Internal: [CPUPL-2590]
Change-Id: I2603dd97c3931acb9730563e8126b109ec2b2572
2022-10-03 10:55:49 +00:00
Shubham Sharma
2d72d4c862 DTRSM Native path AVX 512 kernels are developed
- Added DTRSM AVX512 kernels for lower and upper variants in the native path.
- Changes in framework are made to accommodate these kernels.

AMD-Internal: [CPUPL-2588]
Change-Id: I1f74273ef2389018343c0645870290373ce25efe
2022-10-01 15:58:24 +00:00