2943 Commits

Author SHA1 Message Date
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Re-designed the new edge kernels that use masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store

- Following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, m_left computation is handled
  with masked load-store instructions to avoid the overhead
  of conditional checks for edge cases.

- It improves performance by reducing branching overhead
  and by being more cache friendly.
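
The masked load-store idea can be illustrated with a scalar model. This is only a sketch, not the haswell assembly kernels themselves: the function names are hypothetical, and plain loops stand in for the vmaskmovpd load and store forms (a lane takes part only when the sign bit of its mask element is set).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 4-lane masked load: a lane is read only when the sign
   bit of its mask element is set; unselected lanes are zeroed. */
void masked_load4(double dst[4], const double *src, const int64_t mask[4])
{
    for (size_t i = 0; i < 4; ++i)
        dst[i] = (mask[i] < 0) ? src[i] : 0.0;
}

/* Scalar model of a 4-lane masked store: unselected lanes leave memory
   untouched. */
void masked_store4(double *dst, const double src[4], const int64_t mask[4])
{
    for (size_t i = 0; i < 4; ++i)
        if (mask[i] < 0)
            dst[i] = src[i];
}

/* Hypothetical edge handler: scale n doubles by alpha, four at a time, and
   finish the 1-3 leftover elements with one masked load/store pair instead
   of a chain of per-size branches. */
void scale_with_masked_tail(double *x, size_t n, double alpha)
{
    assert(x != NULL || n == 0);
    size_t j = 0;
    for (; j + 4 <= n; j += 4)
        for (size_t i = 0; i < 4; ++i)
            x[j + i] *= alpha;

    if (j < n) {
        int64_t mask[4];
        double v[4];
        for (size_t i = 0; i < 4; ++i)
            mask[i] = (i < n - j) ? -1 : 0;   /* all bits set selects the lane */
        masked_load4(v, x + j, mask);
        for (size_t i = 0; i < 4; ++i)
            v[i] *= alpha;
        masked_store4(x + j, v, mask);
    }
}
```

Because unselected lanes are neither read nor written, the tail can be processed at full vector width without touching memory past the end of the row.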

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian
758ec3b5ca ZGEMM optimizations for cases with k = 1
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
  bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
  ZGEMM. The kernel is built for handling the GEMM computation
  with inputs having k = 1, and the transpose values for A and
  B as N.

- The kernel dimension has been changed from 4x6 to 4x4,
  due to the following reasons:

  - The 1xNR block of B in the n-loop can be reused over multiple
    MRx1 blocks of A in the m-loop during computation. Similar
    analogy exists for the fringe cases.

  - Every 1xNR block of B was scaled with alpha and stored in
    registers before traversing in the m-dimension. Similar change
    was done for fringe cases in n-dimension.

  - These registers should not be modified during compute, hence
    the kernel dimension was changed from 4x6 to 4x4.

- The check for early exit (with regard to the BLAS mandate) has been
  removed, since it is already present in the BLAS layer.

- The check for parallel ZGEMM has been moved post the redirection to
  this kernel, since the kernel is single-threaded.

- The bli_kernels_zen.h file was updated with the new kernel signature.
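
The reuse of an alpha-scaled 1xNR block of B across the m-loop can be sketched in scalar C. This is an illustration under assumptions, not the AVX2 kernel: the function name, the column-major layout and the way beta is applied are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

enum { NR = 4 };   /* block width in the n-dimension, as in the 4x4 kernel */

/* C (m x n, column-major, leading dimension ldc) := beta*C + alpha*a*b,
   where a is m x 1 and b is 1 x n (the k = 1, no-transpose case). */
void zgemm_k1_sketch(size_t m, size_t n, double complex alpha,
                     const double complex *a, const double complex *b,
                     double complex beta, double complex *c, size_t ldc)
{
    assert(ldc >= m);
    for (size_t j0 = 0; j0 < n; j0 += NR) {
        size_t nb = (n - j0 < NR) ? n - j0 : NR;   /* fringe block in n */
        double complex ab[NR];

        /* Scale the 1 x NR block of B by alpha once ... */
        for (size_t j = 0; j < nb; ++j)
            ab[j] = alpha * b[j0 + j];

        /* ... and reuse it, unmodified, for every row of A. */
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < nb; ++j) {
                double complex *cij = &c[i + (j0 + j) * ldc];
                *cij = beta * (*cij) + a[i] * ab[j];
            }
    }
}
```

In a vectorized kernel the scaled block would stay in registers for the whole m-loop, which is why those registers must not be modified during compute.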

AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
2023-08-07 15:10:08 +05:30
Harihara Sudhan S
c97471dce0 Added AVX512 ZDSCALV kernel
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
  the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
  instructions.
- Removed the negative incx handling checks from the blis_impli layer
  of ZDSCAL as BLAS expects early return for incx <= 0.

AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
2023-08-06 01:51:47 -04:00
Harihara Sudhan S
b126c9943b ZSCALV kernel optimization
- ZSCALV kernel now uses fmaddsub intrinsics instead of mul
  followed by addsub intrinsics.
- Removed the negative incx handling checks from the BLAS impli
  layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.
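
A scalar model may help show why fmaddsub fits complex scaling. This is a sketch with hypothetical function names: one interleaved (real, imaginary) pair is treated as two lanes, where the even lane computes a*b - c and the odd lane computes a*b + c, which is the lane pattern of the fmaddsub instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of one fmaddsub step on a two-lane (re, im) pair:
   even lane = a*b - c, odd lane = a*b + c. */
void fmaddsub2(double dst[2], const double a[2], const double b[2],
               const double c[2])
{
    dst[0] = a[0] * b[0] - c[0];
    dst[1] = a[1] * b[1] + c[1];
}

/* Scale n interleaved complex numbers x by alpha = ar + i*ai. */
void zscal_sketch(size_t n, double ar, double ai, double *x)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        double *p = x + 2 * i;
        double a[2] = {p[0], p[1]};            /* (xr, xi)               */
        double b[2] = {ar, ar};                /* real part broadcast    */
        double c[2] = {p[1] * ai, p[0] * ai};  /* swapped x times ai     */
        fmaddsub2(p, a, b, c);  /* (xr*ar - xi*ai, xi*ar + xr*ai) */
    }
}
```

The mul-then-addsub form produces the same lanes from a separate multiply and an addsub; the fused form folds the second multiply into the add/subtract step.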

AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
2023-08-04 06:57:18 -04:00
Shubham Sharma
9607f207da AOCL Dynamic tuning for DAXPYV
- Existing logic is not picking the ideal number
  of threads for some problem sizes.
- Problem size and their corresponding ideal number
  of threads are retuned for daxpy in aocl dynamic.

AMD-Internal: [CPUPL-3484]
Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7
2023-08-01 10:34:04 +05:30
Eleni Vlachopoulou
fa77d0415a Updating nrm2 GTestSuite testing
- Adding default template parameter for the type of the returned value from nrm2.
- Bugfix on NaN/Inf comparator for scalars.
- Tuning sizes of vector x to exercise the different paths for vectorized and scalar code.
- Adding wrong parameters and extreme value testing.
- Adding tests for overflow and underflow using max and min representable numbers for vectorized and scalar code.

AMD-Internal: [CPUPL-2732]
Change-Id: Ice8ee65095ecaa7b30ebd5f90ed2a890178533db
2023-07-28 05:03:00 -04:00
Shubham Sharma
954c97f858 Added NT in DTL logs for GEMMT, TRSM and NRM2
- Number of threads and gflops are added
  in the DTL logs for GEMMT, TRSM and NRM2

AMD-Internal: [CPUPL-2144]
Change-Id: If68887a5150bd0feda351180f379996497a1e678
2023-07-27 05:15:08 -04:00
Meghana Vankadari
79e174ff0a Level-3 triangular routines now use different block sizes and kernels.
Details:
    - Eliminated the need for override function in SUP for GEMMT/SYRK.
    - New set of block sizes, kernels and kernel preferences
      are added to cntx data structure for level-3 triangular routines.
    - Added supporting functions to set and get the above parameters from cntx.
    - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
      In case they are not set, use the default block sizes/kernels of
      Level-3 SUP.

AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
2023-07-26 01:26:11 -04:00
Chandrashekara K R
7b78d93282 Removing omp library linking to static multithreaded library build.
Description: We have seen library dependency issues when
linking libomp.lib or libiomp5md.lib while building the static
multithreaded library. So we are removing the linking of the
OpenMP library from the static multithreaded BLIS library build, so that
users can link any OpenMP library (libomp.lib or libiomp5md.lib) when
building their applications against the static multithreaded BLIS library.

AMD-Internal: [SWLCSG-2196]
Change-Id: I96722f3587ee555af12de664957c211c56fcf03d
2023-07-13 06:54:02 -04:00
Eleni Vlachopoulou
660cd6d1b2 Adding nrm2 target for benchmarking on Windows.
Modifying blis/bench/CMakeLists.txt to include nrm2 target and produce the corresponding executable.

AMD-Internal: [CPUPL-3625]
Change-Id: I7945416142e07ac99510ed9500a2c620053c7e13
2023-07-10 14:03:05 -04:00
Harihara Sudhan S
ffbb0e83e5 ZGEMM optimization for cases when m = 1 or n = 1
- When n = 1 and A matrix is transposed ZGEMV row major variant is
  invoked.
- When m = 1 and B matrix is not transposed ZGEMV row major variant
  is invoked.
- This redirection happens before the parallel ZGEMM check. This is done to
  avoid the unnecessary condition check. Any parallelization check is
  expected to happen in the invoked ZGEMV interface.

AMD-Internal: [CPUPL-2773]
Change-Id: I6b7b31db712edc682c089475d12e98730a960138
2023-06-30 04:54:42 -04:00
jagar
fb6f1380b2 Gtestsuite:Added util functions
- Functions to print matrix and vector elements.
- Functions to convert a matrix to a symmetric, Hermitian, or
  triangular matrix and to set diagonal elements in a matrix.

AMD-Internal: [CPUPL-2732]
Change-Id: I1ffa5289329cbb8a9581bf545bdd157801cf5baa
2023-06-27 16:33:57 +05:30
Chandrashekara K R
cdba2db827 BLIS: Added address sanitizer flag for blis library on windows.
Description: Added a cmake option to test address-related issues
using the address sanitizer (-fsanitize=address) on Windows.
When the user enables the ENABLE_ASAN_TESTS option, cmake will add
the related compiler and linker flags along with dependent libraries.

AMD-Internal: [CPUPL-2984]
Change-Id: I6d2a0cfe84fe122fc6c40e3023d8c79211d5fa71
2023-06-22 13:42:38 -04:00
jagar
003d1e9ae6 GTestSuite: Using ELEMENT_TYPE to specify generation of random numbers in tests.
Since random numbers are specified from ELEMENT_TYPE and we never generate tests for both integer and floating point numbers at the same time, we update code as described below:
- random vector/matrix generators are updated to use ELEMENT_TYPE as a default parameter.
- ::testing::Values(ELEMENT_TYPE) is removed from all test generators.

AMD-Internal: [CPUPL-2732]
Change-Id: Ibc6b05044502f541c9e8a7687931b1ca2903fb0c
2023-06-21 11:30:15 -04:00
Eleni Vlachopoulou
7b35a1283b Updating CMake to select the correct Windows runtime libraries.
- Upgraded to 3.15 as minimum version of CMake.
- Used CMAKE_MSVC_RUNTIME_LIBRARY instead of CMAKE_C_FLAGS to set MT and MD flags correctly.

AMD-Internal: [CPUPL-3559]
Change-Id: Ib82821d245b6acaa1399166219168ad2535d8d92
2023-06-16 22:04:09 +05:30
Edward Smyth
94a4abe2e5 BLIS: Incorrect ifdef in cblas.h and cblas_f77.h
Remove unnecessary ifdef BLIS_ENABLE_CBLAS statement from cblas.h
and cblas_f77.h. These were erroneously added when fixing the
--disable-blas functionality but are not needed in the CBLAS
headers, as these files will not be generated when BLAS or CBLAS
is disabled.

This is a fix to commit 5bd2a777ba

AMD-Internal: [CPUPL-3541]
Change-Id: If38bd795d31098a7023d575672b0a913338c0d2d
2023-06-07 06:52:57 -04:00
Eleni Vlachopoulou
7b2924c079 Updating object library targets in CMakeLists.txt for zen4 based on configuration
AMD-Internal: [CPUPL-3516]
Change-Id: Ibfe66f50fa77d4011829d8386f0a91f140d38335
2023-06-01 17:29:37 +05:30
sireesha.sanga
85eb7880f7 README File Update
Updated with latest and relevant details.

AMD-Internal: [CPUPL-3007]
Change-Id: I6d86c5f0c49fd8739c656bcc8187a5f8a4dc9beb
2023-05-25 14:46:33 +00:00
Harsh Dave
655955dd3b Doxygen document generation from cmake build
- Added support to generate doxygen documentation from cmake build.
- If doxygen is already installed on the machine, it will generate
documentation and print the path to the documentation.

AMD-Internal: [CPUPL-3188]

Change-Id: I6047f62df63844aa71836fd481b4df246b793696
2023-05-25 07:41:40 -04:00
Eleni Vlachopoulou
9c613c4c03 Windows CMake bugfix in object libraries for shared library option
Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory.
The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries.

AMD-Internal: [CPUPL-3241]
Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52
2023-05-24 17:30:16 +05:30
Edward Smyth
dea5fe4d12 BLIS: Missing clobbers (batch 5)
Add missing clobbers for AVX512 mask registers k0-k7
in zen4 kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I5f28c725d7af1466df4db4cdfa2d456bbc6ab36d
2023-05-23 15:40:29 -04:00
Edward Smyth
a3adfb68cf BLIS: Missing clobbers (batch 4)
Add missing clobbers haswell (sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I19fa97b85f75c8b8fe15d31b13768f937cc5e4cc
2023-05-23 14:57:08 -04:00
Edward Smyth
03965a4f07 BLIS: Missing clobbers (batch 3)
Add missing clobbers in haswell (non-sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I68f6ad0c01557fcde73b1775d250d48b5162c521
2023-05-23 14:37:31 -04:00
Edward Smyth
e960141fe2 BLIS: Missing clobbers (batch 2)
Add missing clobbers in other zen4 kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I5cceb44fe100e03269cfe21d8c4c0d2171b921c3
2023-05-23 13:12:20 -04:00
Edward Smyth
ea2eea5097 BLIS: Missing clobbers (batch 1)
Add missing clobbers in first batch of assembly kernels:
- zen3 bli_gemmsup*
- bli_zgemm_zen4_asm_12x4
- bli_gemmsup_rv_haswell_asm_sMx6

AMD-Internal: [CPUPL-3456]
Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1
2023-05-23 11:51:18 -04:00
Edward Smyth
6911d2dd21 zen config make_defs.mk improvements
Improvements to zen make_defs.mk files:
* Add -march=znver4 flag for GCC 13 and later.
* Add AVX512 flags or -march=znver4 as appropriate for upstream LLVM
  in config/zen4/make_defs.mk to enable BLIS to be built with
  LLVM rather than AOCC.
* zen make_defs.mk files were inheriting settings from the previous
  one (zen->zen2->zen3->zen4), when they should be independent
  of each other. Correct by including config/zen/amd_config.mk
  in all zen make_defs.mk files to reinitialize the compiler
  flags.
* Update zen2 and zen3 make_defs.mk for recent AOCC compiler
  releases, rather than rely on LLVM settings.
* Remove -mfpmath=sse flag in config/zen4/make_defs.mk as
  this is already specified in amd_config.mk (and should
  be the default setting anyway).
* Tidy files to simplify nested if structures and be more
  consistent with one another.

AMD-Internal: [CPUPL-3399]
Change-Id: Ice64ccedd90c2660fdee8b485348a6b405cfc5ac
2023-05-22 07:51:41 -04:00
Mangala V
5f5bc24989 Bug fix: AVX2 code being invoked on non-avx2 machine for ZGEMM API
Prevented calling avx2 based bli_zgemm_ref_k1_nn code on
non-supported systems.
Changed the name of the function bli_zgemm_ref_k1_nn to bli_zgemm_4x6_avx2_k1_nn().
Changed the name of the function bli_dgemm_ref_k1_nn to bli_dgemm_8x6_avx2_k1_nn().

Thanks to Kiran Varaganti <Kiran.Varaganti@amd.com>
for identifying and helping to fix the issue.

AMD-Internal: [CPUPL-3352]
Change-Id: I02530ab197ed84c96cbad4f7dd56eedca0109c35
2023-05-21 23:13:46 +05:30
eashdash
2c4f032e0f Fix for lack of BF16 instruction when compiled with GCC-11
GCC-11 and below support AVX512-BF16.
However, they don't support all the bf16 instructions required.

For bf16 downscale APIs, when beta scaling is done, C output
elements must be upscaled from BF16 type to Float type for
beta scaling operation.

For this upscaling operation of bf16 to float,
_mm512_cvtpbh_ps is used.

This however is not supported by GCC-11 and below
(but is supported on GCC 12 onwards)

Lack of support for this instruction in GCC 11 and below leads to
compilation issues, with this instruction (_mm512_cvtpbh_ps)
not being recognized.

To fix this, we use a set of instructions:
1. register containing bf16 type
   __m256bh a1
2. Convert bf16 to float with shift left ops
   __m512 float_a1 = (__m512)
   (_mm512_sllv_epi32
   (_mm512_cvtepi16_epi32 ((__m256i) a1), _mm512_set1_epi32 (16)));
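
The shift-based conversion works because a bf16 value is the upper 16 bits of the corresponding IEEE 754 float. A scalar sketch of the same idea for a single element follows; the function name is hypothetical, and it assumes a 32-bit IEEE 754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert one bf16 value (given as its raw 16 bits) to float by widening to
   32 bits and shifting left by 16, so the bf16 bits become the sign,
   exponent and top mantissa bits of the float. */
float bf16_bits_to_float(uint16_t h)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bits without aliasing issues */
    return f;
}
```

The vector form above does the same per lane: widen each 16-bit element to 32 bits, then shift each lane left by 16.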

AMD-Internal: [CPUPL-3454]
Change-Id: Ie4a9f04881c59ced088608633774b27f22b4ab8e
2023-05-19 10:15:08 +00:00
eashdash
061a68ff0d BF16 Downscale and Performance fix for bf16 API
This change contains the following:

1. Downscale optimization fix
   a. Similar to downscale optimizations made for s32 and s16 gemm,
      the following optimizations are done to improve the downscale
      performance for BF16 gemm
   b. The store to temporary float buffer can be avoided when k < KC
      since intermediate accumulation will not be required for the
      pc loop (only 1 iteration). The downscaled values (bf16) are
      written directly to the output C matrix.
   c. Within the micro-kernel when beta != 0, the bf16 data from the
      original C output matrix is loaded to a register, converted to
      float and beta scaling is applied on it at register level.
      This eliminates the requirement of previous design of copying the
      bf16 value to the temporary float buffer inside jc loop.

2. Alpha scaling
   a. Alpha scaling (multiply instruction) by default was resulting in
      performance regression when k dimension is small and alpha=1 in
      bf16 micro-kernels.
   b. Alpha scaling is now only done when alpha != 1.

3. K Fringe optimization
   a. Previously memcpy was used for K fringe case to load elements
      from A matrix in the microkernels
   b. Now, masked stores are used to store the downscaled and
      non-downscaled outputs without the need to use
      memcpy functions

4. N LT-16 fringe optimization
   a. Previously memcpy was used for N LT 16 fringe case in the
      microkernels for storing the downscaled and non-downscaled output.
   b. Now, masked stores are used to store the downscaled and
      non-downscaled outputs of BF16 without the need to use
      memcpy functions

5. Framework updates to avoid unnecessary pack buffer allocation
   a. The default allocation of the temporary pack buffer is removed
      and the pack buffer is now only allocated if k > KC.

AMD-Internal: [CPUPL-3437]
Change-Id: I71ff862e7d250559409a12a3533678c7a7951044
2023-05-18 10:02:56 -04:00
Shubham Sharma
26e120ea25 Fixed diagonal packing for C/Z TRSM small
- In C/Z TRSM small, packing in case of unit diagonal
  is not handled properly.
- Diagonal elements are still being read even in case of
  unit diagonal.
- This causes "Conditional jump or move depends on
  uninitialised value" error during valgrind tests.
- To fix this, diagonal elements should not be read
  in case of unit diagonal.
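
The fix can be sketched as follows. This is a simplified illustration, not the TRSM small packing code: the function name is hypothetical, it uses real doubles rather than complex types, and it only shows that the matrix diagonal is never dereferenced when the diagonal is implicitly unit.

```c
#include <assert.h>
#include <stddef.h>

/* Pack the n diagonal elements of a column-major matrix a (leading
   dimension lda) into d. For a unit-diagonal matrix the stored diagonal
   may be uninitialised, so it is not read at all. */
void pack_diag_sketch(const double *a, size_t lda, size_t n,
                      int unit_diag, double *d)
{
    assert(d != NULL || n == 0);
    assert(unit_diag || a != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        d[i] = unit_diag ? 1.0 : a[i + i * lda];
}
```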

AMD-Internal: [CPUPL-3406]
Change-Id: If3d6965299998a83d87f3a032f654fc7f8c43d4e
2023-05-18 07:57:21 -04:00
Harihara Sudhan S
9ee95e171a Control flow issue reported during static code analysis
- Missing break statement will result in unexpected control flow.
  This function will not launch the threads for the API in question
  according to the AOCL dynamic logic without the break statement.
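
A generic illustration of this class of defect follows. The enum, the thresholds and the thread counts are all hypothetical and are not the AOCL dynamic values; the point is only that each case needs its own break so a later case cannot overwrite the choice.

```c
#include <assert.h>

typedef enum { OP_AXPYV, OP_DOTV, OP_SCALV } op_sketch_t;

/* Pick a thread count for an operation of size n, capped at nt_max. */
int pick_threads_sketch(op_sketch_t op, long n, int nt_max)
{
    assert(nt_max >= 1);
    int nt = nt_max;
    switch (op) {
    case OP_AXPYV:
        nt = (n < 4000) ? 1 : 4;
        break;
    case OP_DOTV:
        nt = (n < 2500) ? 1 : 8;
        break;   /* without this break, control would fall into OP_SCALV */
    case OP_SCALV:
        nt = (n < 20000) ? 1 : 2;
        break;
    }
    return (nt < nt_max) ? nt : nt_max;
}
```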

AMD-Internal: [CPUPL-3436]
Change-Id: Ic47d773169c09e84086a27b50cd59dba33529698
2023-05-18 04:53:03 -04:00
mkadavil
1e266bbcbc LPGEMM framework updates to avoid unnecessary pack buffer allocation.
-Currently when any of the downscale APIs is called, a temporary pack
buffer is allocated (with bli_membrk_acquire_m) by each thread. It is
used to persist intermediate higher precision output accumulated by the
micro-kernel across pc loop when the number of pc iterations is more
than 1 (k > KC). The bli_membrk_acquire_m is a thread safe operation and
uses locks (pthread_mutex) to ensure thread safe checkout of memory/
block from the memory pool.
-However when k < KC, this temporary buffer is not required. But since
this pack buffer is allocated by default in downscale API, the overhead
from locks affects performance when k < KC, m or n is sufficiently small
and the number of threads involved is high. This default allocation is
removed and the pack buffer is now only allocated if k > KC.
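
The allocation rule can be sketched as below. The function names are hypothetical, malloc stands in for the lock-protected pool checkout (bli_membrk_acquire_m), and a float buffer is assumed for the intermediate output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of iterations of the pc loop for a given k and block size KC. */
size_t pc_iterations(size_t k, size_t kc)
{
    assert(kc > 0);
    return (k + kc - 1) / kc;
}

/* Returns NULL when a single pc iteration covers all of k, so no
   intermediate accumulation buffer (and no pool lock) is needed.
   Otherwise returns an m x n buffer that the caller must free. */
float *acquire_temp_c_if_needed(size_t m, size_t n, size_t k, size_t kc)
{
    if (pc_iterations(k, kc) <= 1)
        return NULL;
    return malloc(m * n * sizeof(float));
}
```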

AMD-Internal: [CPUPL-3430]
Change-Id: I492586ff4c47bc7480d364efb7af3674e31bd2c1
2023-05-17 19:16:02 +05:30
Eleni Vlachopoulou
1a7f60ff5b Update CMake system to use object libraries for haswell, skx and zen4.
- AVX2 and AVX512 flags are set up locally for each object library that requires them.
- Default ENABLE_SIMD_FLAGS value is set to none and for AVX2 option the corresponding compiler flag is set globally.
- To be able to build zen4 codepath when ENABLE_SIMD_FLAGS=AVX2, the compiler option is removed by removing the definition before building the corresponding object library.

AMD-Internal: [CPUPL-3241]
Change-Id: Ia570e60f06c4c72b7c58f4c9ca73bac4c060ae73
2023-05-12 10:04:16 -04:00
Harsh Dave
07df6ec46b Ticket id correction for previous commit.
Previous commit (30b931ae60) has an incorrect ticket id.
Correct ticket id for that commit is
AMD-Internal:[CPUPL-3328]

Change-Id: If3242714984ae3d3d9bbb0198bda91b4dd9a4bdc
2023-05-12 08:43:12 -04:00
Harsh Dave
30b931ae60 Fixed compilation error due to inconsistent compiler behavior towards AVX512 zero masking instruction syntax
- The code used the whitespace variant of the AVX512 mask instruction syntax.
Some compilers accept the whitespace variant and some don't, so to be safe we removed the whitespace.

- Whitespace variant of masked instruction "vmovupd    (%rax,%r8,1),%zmm8{%k2} {z}" is replaced with
  this instruction "vmovupd    (%rax,%r8,1),%zmm8{%k2}{z}" to resolve the compilation failure issue.

- Thanks to Shubham Sharma<shubham.sharma3@amd.com> for identifying issue.

AMD-Internal: [CPUPL-1963]

Change-Id: I290589132e8cce25cab0d1e4c195a7dd0a014937
2023-05-12 06:16:15 -04:00
mkadavil
b167e47091 LPGEMM frame and micro-kernel updates to fix gcc9.4 compilation issue.
-Micro-kernel: Some AVX512 intrinsics (e.g. _mm512_loadu_epi32) were
introduced in later versions of gcc (>10) in addition to the already
existing masked intrinsics (e.g. _mm512_mask_loadu_epi32). In order to
support compilation using gcc 9.4, either the masked intrinsic or another
gcc 9.4 compatible intrinsic needs to be used (e.g. _mm512_loadu_si512)
in LPGEMM Zen4 micro-kernels.
-Frame: BF16 LPGEMM APIs (aocl_gemm_bf16bf16f32obf16/bf16bf16f32of32)
need to be disabled if the aocl_gemm (LPGEMM) addon is compiled using gcc
9.4. BF16 intrinsics are not supported in gcc 9.4, and the micro-kernels
for BF16 LPGEMM are excluded from compilation based on the GNUC macro.

AMD-Internal: [CPUPL-3396]
Change-Id: I096b05cdceea77e3e7fec18a5e41feccdf47f0e7
2023-05-11 18:00:18 +05:30
Mangala V
7739a3fbfe Bug fix for 4xk AVX512 packing kernel
A few tests failed on Windows as some registers were not added to
the clobber list.

Added the registers below to the clobber list:
In function bli_zpackm_zen4_asm_12xk : ZMM12-ZMM15
In function bli_zpackm_zen4_asm_4xk : ZMM4-ZMM7

AMD-Internal: [CPUPL-3253]

Change-Id: I3e42130bf1a3b48717c4b437179ae3f116e5cf1d
2023-05-05 04:15:25 +05:30
vignbala
9164427e86 Code cleanup: Mismatch in assembly macros
- In the bli_x86_asm_macros.h file, the set of vinsertf?x? and
  vextractf?x? instructions are facing macro expansion errors due to
  ambiguous macro redirection. The lower-case macro definitions of
  these instructions are not properly redirected to their corresponding
  upper-case macro definitions.

- This error occurs due to ambiguity in the upper-case macro name.
  At the place of lower-case macro definition, the redirection is to
  macros of the form VINSERTF?x? and VEXTRACTF?x?, while at the place
  of upper-case macro definition, they are of the form VINSERTF?X? and
  VEXTRACTF?X?. This causes a mismatch of the upper-case macro due to
  different case sensitive 'x' being used.

- This patch corrects this issue, by changing the lower-case 'x' to
  upper-case, among the upper case macros at the place of redirection.
  This provides uniformity and facilitates the expected macro-expansion.

AMD-Internal: [CPUPL-3276]
Change-Id: Id1f45f8e4bb083cd4b87632b713ff6baba616ff2
2023-05-04 08:49:58 -04:00
Harihara Sudhan S
a6621f1241 Incorrect accumulation of results in DDOTV
- When the number of threads launched is not equal to the
  number of threads requested the garbage value in the created
  buffer will not be overwritten by valid values.
- To handle the above scenario, the created temporary buffer is
  initialized with zeroes.
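
The failure mode and the fix can be modelled without OpenMP. In this sketch a plain loop over the launched threads stands in for the parallel region; the function name and the even work split are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Dot product with one partial-sum slot per requested thread, of which
   only nt_launched threads actually run. */
double ddot_partial_sketch(const double *x, const double *y, size_t n,
                           size_t nt_requested, size_t nt_launched)
{
    assert(nt_launched >= 1 && nt_launched <= nt_requested);

    /* Zero-initialised buffer: slots of threads that never ran hold 0.0
       instead of garbage. */
    double *partial = calloc(nt_requested, sizeof *partial);
    if (partial == NULL)
        return 0.0;

    /* Stand-in for the parallel region: each launched thread t handles
       one contiguous chunk of the vectors. */
    for (size_t t = 0; t < nt_launched; ++t) {
        size_t begin = n * t / nt_launched;
        size_t end = n * (t + 1) / nt_launched;
        for (size_t i = begin; i < end; ++i)
            partial[t] += x[i] * y[i];
    }

    /* The reduction walks every requested slot, so unused slots must be zero. */
    double sum = 0.0;
    for (size_t t = 0; t < nt_requested; ++t)
        sum += partial[t];

    free(partial);
    return sum;
}
```

With an uninitialised buffer, the slots between nt_launched and nt_requested would contribute arbitrary values to the final sum.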

AMD-Internal: [CPUPL-3268]
Change-Id: I439a1da18eb1b380491fea14f42b0ede05ccf5a9
2023-05-04 10:44:15 +05:30
Eleni Vlachopoulou
bf26b8ffbc Removing /arch:AVX2 flag from high-level CMake
- Previously, this flag was set as a default in the high-level CMakeLists.txt, which means that it was used to build everything (all files and all subdirectories), including ref_kernels and testsuite. Also, all files are target sources for this project and were compiled with the same flags.
- Now, we create object files using the sources in the kernels/ directory and add the AVX2 flag to those object files explicitly. So only those files will have this flag, and it should not be used to compile ref_kernels, etc.
- This is a quick solution to enable runs on non-AVX2 machines.

AMD-Internal: [CPUPL-3241]
Change-Id: Id569b26ffeea40eaa36ab4465b0c52b6446d7650
2023-04-28 09:22:13 -04:00
Harihara Sudhan S
828ac8e2dd Partial completion of work in L1 APIs
- Partial completion of compute was happening since BLIS was unable
  to launch the required number of threads. This was because rntm
  was returning a thread count greater than the maximum number of
  threads that can be launched in the subsequent parallel region.
- Added 'omp_get_num_threads' inside the parallel regions to get the
  actual number of threads spawned. The work distribution happens
  based on the actual number of threads launched in that region.

AMD-Internal: [CPUPL-3268]
Change-Id: I086ad4b9b644f966b7bab439e43222396f0c2bf0
2023-04-27 15:17:26 +05:30
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Edward Smyth
0f0277e104 Code cleanup: dos2unix file conversion
Source and other files in some directories were a mixture of
Unix and DOS file formats. Convert all relevant files to Unix
format for consistency. Some Windows-specific files remain in
DOS format.

AMD-Internal: [CPUPL-2870]
Change-Id: Ic9a0fddb2dba6dc8bcf0ad9b3cc93774a46caeeb
2023-04-21 08:41:16 -04:00
Harihara Sudhan S
ada88e3695 Mismatch in fuse factor and kernel fuse
- In Zen 4 context, there was a mismatch between the fuse factor
  initialized in the block size parameter and fuse factor of the
  corresponding kernel initialized.

AMD-Internal: [SWLCSG-2051]
Change-Id: I65f71532692a1459605abb860b91a2a360bcca5d
2023-04-21 06:30:11 -04:00
eashdash
a72fff2be9 Added NEW LPGEMM TYPE- s8s8s16os16 and s8s8s16os8
1. New LPGEMM type - s8s8s16os16 and s8s8s16os8 are added.
2. New interface, frame and kernel files are added.
3. Frame and kernel level files added and modified for s8s8s16
4. s8s8s16 type involves design changes of 2 operations -
   Pack B and Mat Mul
5. Pack B kernel routines to pack B matrix for s16 FMA and compute the
   sum of every column of B matrix to implement the s8s8s16 operation
   using the s16 FMA instructions.
6. Mat Mul Kernel files to compute the GEMM output using s16 FMA.
   Here the A matrix elements are converted from int8 to uint8 (s16 FMA
   works with A matrix type uint8 only) by adding extra 128 to
   every A matrix element
7. Post GEMM computation, additional operations are performed on the
   accumulated outputs to get the correct results.
   Final C = C - ( (sum of column of B matrix) * 128 )
   This is done to compensate for the addition of extra 128 to every
   A matrix element
8. With this change, two new LPGEMM APIs are introduced in LPGEMM -
   s8s8s16os16 and s8s8s16os8.
9. All previously added post-ops are supported on s8s8s16os16/os8 also.
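
The compensation step follows from sum((a + 128) * b) = sum(a * b) + 128 * sum(b). A scalar sketch for one output element is shown below; the function name is hypothetical, and a 32-bit accumulator is used for simplicity where the actual kernels accumulate with s16 FMA instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of one row of A (int8) with one column of B (int8),
   computed through the uint8 x int8 path. */
int32_t s8s8_dot_via_u8(const int8_t *a, const int8_t *b, size_t k)
{
    assert(a != NULL && b != NULL);
    int32_t acc = 0;     /* accumulates (a + 128) * b            */
    int32_t b_sum = 0;   /* sum of the B column, computed at pack time */
    for (size_t p = 0; p < k; ++p) {
        uint8_t a_u8 = (uint8_t)(a[p] + 128);   /* shift -128..127 into 0..255 */
        acc += (int32_t)a_u8 * b[p];
        b_sum += b[p];
    }
    /* Final C = C - (sum of column of B) * 128 */
    return acc - 128 * b_sum;
}
```

For a = {-128, 127, -1, 5} and b = {3, -4, 127, -128}, the shifted accumulation gives -1915 and the B column sum is -2, so the compensated result is -1915 + 256 = -1659, which matches the direct int8 dot product.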

AMD-Internal: [CPUPL-3234]
Change-Id: I3cc23e3dcf27f215151dda7c8db29b3a7505f05c
2023-04-21 05:30:38 -04:00
mkadavil
3572baa9d3 aocl_softmax_f32 api's for softmax computation as part of lpgemm.
-Softmax is often used as the last activation function in a neural
network - softmax(xi) = exp(xi)/(exp(x0) + exp(x1) + ... + exp(xn)).
This step happens after the final low precision gemm computation,
and it helps to have the softmax functionality that can be invoked
as part of the lpgemm workflow. In order to support this, a new api,
aocl_softmax_f32 is introduced as part of aocl_gemm. This api
computes element-wise softmax of a matrix/vector of floats. This api
invokes ISA specific vectorized micro-kernels (vectorized only when
incx=1), and a cntx based mechanism (similar to lpgemm_cntx) is used
to dispatch to the appropriate kernel.

AMD-Internal: [CPUPL-3247]
Change-Id: If15880360947435985fa87b6436e475571e4684a
2023-04-21 05:26:08 -04:00
Arnav Sharma
4aace5f524 Smart Threading for SGEMM SUP for Zen4 Architecture
- Added Smart Threading logic for AVX-512 based SGEMM SUP.
- Calculating ic and jc for optimal work distribution to the allocated
  threads based on logic similar to Zen3.
- Zen4 Architecture specific Native-to-SUP check has been added to
  redirect few Native inputs to the SUP path based on the fact that in a
  multi-threaded environment some Native cases perform better as SUP.
- For the same, the SUP thresholds, namely, BLIS_MT and BLIS_NT have
  been increased from 512 and 200 to 682 and 512, respectively.
- Further optimizations to the work distribution logic will be added
  subsequently.

AMD-Internal: [CPUPL-3248]
Change-Id: Ibccbbefef251010ec94bd37ffc86c35b7866a5ca
2023-04-21 12:54:03 +05:30
Harsh Dave
b85b856950 Added Doxygen support for extension APIs.
Details:
- Added Doxyfile, a configuration file in docs directory for generating Doxygen document from source files.
- Currently only the CBLAS interfaces of the extension APIs (batched gemm and gemmt) are included.
- Support for BLAS interface is yet to be added.
- To generate Doxygen based document for extension API, use given command.
  $ doxygen docs/Doxyfile

AMD-Internal: [CPUPL-3188]

Change-Id: I76e70b08f0114a528e86514bcb01d666acc591e8
2023-04-21 00:54:19 -04:00
Edward Smyth
b531022bac BLIS cpuid: distinguish submodels within a microarchitecture
Incorporate a means of detecting submodels of a microarchitecture,
so that different optimizations e.g. block sizes or kernel choices
can be used. The details are as follows:
- Different models are currently only enabled for zen3 and zen4
  architectures (for server parts).
- There is a single enumeration (model_t) for all models for all
  architectures, but function bli_check_valid_model_id() should
  check the provided model_id against the suitable range within
  the enumeration for the provided arch_id.
- To enable the model_id to be used within the cntx setup functions,
  checking of a user specified value of BLIS_ARCH_TYPE against
  the enabled configurations is delayed to a separate function,
  bli_arch_check_id().
- Default selection based on hardware can be overridden using the
  BLIS_MODEL_TYPE environment variable. Valid values are:
    Genoa, Bergamo, Genoa-X, Milan, Milan-X
  Values are case-insensitive and -X can also be specified as _X or X
- Specifying an incorrect value for BLIS_MODEL_TYPE is not an error,
  but will result in the default option for that architecture being
  selected. This is different to specifying an incorrect value of
  BLIS_ARCH_TYPE, which is an error.
- The environment variable BLIS_MODEL_TYPE can be renamed using
  the --rename-blis-model-type argument to configure (or cmake
  equivalent), in a similar way to renaming BLIS_ARCH_TYPE with
  --rename-blis-arch-type.
- Configure option --disable-blis-arch-type will disable both
  BLIS_ARCH_TYPE and BLIS_MODEL_TYPE environment variables.
- Added code in bli_cpuid.c to detect L1, L2 and L3 cache sizes,
  currently only for AMD cpus. Functions are provided to query
  these from other parts of the code, namely:
    uint32_t bli_cpuid_query_{l1d,l1i,l2,l3}_cache_size()
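
The BLIS_MODEL_TYPE matching rules described above (case-insensitive values, -X also accepted as _X or X, invalid values falling back to the default) can be sketched as follows. The enum and function names are hypothetical and do not reproduce model_t or the bli_cpuid.c code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum {
    MODEL_DEFAULT, MODEL_GENOA, MODEL_BERGAMO, MODEL_GENOA_X,
    MODEL_MILAN, MODEL_MILAN_X
} model_sketch_t;

/* Case-insensitive comparison of s against a lower-case literal. */
int equals_nocase(const char *s, const char *lit)
{
    assert(s != NULL && lit != NULL);
    while (*s != '\0' && *lit != '\0') {
        if (tolower((unsigned char)*s) != *lit)
            return 0;
        ++s;
        ++lit;
    }
    return *s == '\0' && *lit == '\0';
}

/* Map an environment variable value to a model id. An unrecognised or
   missing value is not an error: the default model is selected. */
model_sketch_t parse_model_type(const char *value)
{
    static const struct { const char *name; model_sketch_t id; } table[] = {
        {"genoa", MODEL_GENOA},     {"bergamo", MODEL_BERGAMO},
        {"genoa-x", MODEL_GENOA_X}, {"genoa_x", MODEL_GENOA_X},
        {"genoax", MODEL_GENOA_X},  {"milan", MODEL_MILAN},
        {"milan-x", MODEL_MILAN_X}, {"milan_x", MODEL_MILAN_X},
        {"milanx", MODEL_MILAN_X},
    };
    if (value == NULL)
        return MODEL_DEFAULT;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i)
        if (equals_nocase(value, table[i].name))
            return table[i].id;
    return MODEL_DEFAULT;
}
```

A caller would typically pass the result of getenv for the (possibly renamed) environment variable, so an unset variable arrives as NULL and selects the default.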

AMD-Internal: [CPUPL-3033]
Change-Id: I37a3741abfd59a95e0e905d926c6ede9a0143702
2023-04-20 10:47:44 -04:00
Meghana Vankadari
f788618f27 Setting AVX-512 specific blocksizes as default for L3 SUP for zen4 config
Details:
- Overriding of blocksizes with avx-2 specific ones(6x8) is done
  for gemmt/syrk because near-to-square shaped kernel performs
  better than skewed/rectangular shaped kernel.
- Overriding is done for S,D and Z datatypes.

AMD-Internal: [CPUPL-3060]
Change-Id: I304ff4264ff735b7c31f7b803b046e1c49c9ad53
2023-04-20 08:52:34 -04:00