Commit Graph

2994 Commits

Author SHA1 Message Date
Harsh Dave
edbbfd9a86 Optimized AVX512 DGEMM SUP edge kernels
- For edge kernels which handles the corner cases and specially
for cases where there is really small amount of computation to
be done, executing FMA efficiently becomes very crucial.

- In previous implementation, edge kernels were using same, limited
number of vector register to hold FMA result, which indirectly creates
dependency on previous FMA to complete before CPU can issue new FMA.

- This commit address this issue by using different vector registers
that are available at disposal to hold FMA result.

- That way we hold FMA results in two sets of vector registers, so that
sub-sequent FMA won't have to wait for previous FMA to complete.

- At the end of un-rolled K loop these two sets of vector registers are
added together to store correct result in intended vector registers.

- Following kernels are modified:
bli_dgemmsup_rv_zen4_asm_24x4m,
bli_dgemmsup_rv_zen4_asm_24x3m,
bli_dgemmsup_rv_zen4_asm_24x2m,
bli_dgemmsup_rv_zen4_asm_24x1m,
bli_dgemmsup_rv_zen4_asm_24x1,
bli_dgemmsup_rv_zen4_asm_16x1,
bli_dgemmsup_rv_zen4_asm_8x1,
bli_dgemmsup_rv_zen4_asm_24x2,
bli_dgemmsup_rv_zen4_asm_16x2,
bli_dgemmsup_rv_zen4_asm_8x2,
bli_dgemmsup_rv_zen4_asm_24x3,
bli_dgemmsup_rv_zen4_asm_16x3,
bli_dgemmsup_rv_zen4_asm_8x3,
bli_dgemmsup_rv_zen4_asm_16x4,
bli_dgemmsup_rv_zen4_asm_8x4,
bli_dgemmsup_rv_zen4_asm_16x5,
bli_dgemmsup_rv_zen4_asm_8x5,
bli_dgemmsup_rv_zen4_asm_16x6,
bli_dgemmsup_rv_zen4_asm_8x6,
bli_dgemmsup_rv_zen4_asm_8x7,
bli_dgemmsup_rv_zen4_asm_8x8

AMD-Internal: [CPUPL-3574]
Change-Id: I318ff8e2f075820bcc0505aa1c13d0679f73af44
2023-10-16 04:03:56 -04:00
Eleni Vlachopoulou
46459a958d Updating BLIS C++ interface trsm test.
- Making A diagonally dominant to ensure that the problem at hand is solvable.

AMD-Internal: [CPUPL-3575]
Change-Id: I27cc76a212d4d10aacce880895e1e0d7532e4eb7
2023-10-16 03:59:52 -04:00
Shubham Sharma
9a2a4151ac Added improved ZTRSM AVX2 kernels
- Added 2x6 ZGEMM row-preferred kernel.
  - Kernel supports prefetch_a, prefetch_b,
    prefetch_a_next and prefetch_b_next.
  - Multiple Ways to prefetch c are supported.
  - prefetch_a and prefetch_c are enabled by
    default.
  - K loop is divided into multiple subloops for
    better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
  and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
  still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
  use of these kernels for TRSM in zen3 and
  zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
  windows build.

AMD-Internal: [CPUPL-3781]

Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
2023-10-13 07:43:33 -04:00
Harihara Sudhan S
105de694cf Optimized ZGEMV variant 1
- Added an explicit function definition for ZGEMV var 1. This
  removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
  for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
  instead of querying it.
- With this change fringe loop is vectorized using SSE
  instructions.

AMD-Internal:[CPUPL-3997]

Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
2023-10-13 05:03:53 -04:00
Meghana Vankadari
eb5ab3f762 LPGEMM: Added transB support for bf16bf16f32o<bf16|f32> APIs
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
  to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
  column-major input format.
- Updated Early-return check conditions to account for trans
  parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
2023-10-12 23:36:18 +05:30
mkadavil
ea0324ab95 Multi data type downscaling support for u8s8s16 - u8s8s16<u8|s8>
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.

AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
2023-10-12 09:19:56 -04:00
Vignesh Balasubramanian
a6a67fea2d ZAXPBYV optimizations for handling unit and non-unit strides
- Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational
  logic. The kernel performs two different sets of compute based
  on the value of alpha, for both unit and non-unit strides. There
  are no constraints on beta scaling of the 'y' vector.

- Updated the logic to support 'x' conjugate in the computation.
  The kernel supports conjugate/no conjugate operation through the
  usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... )
  intrinsics.

- Updated the early return condition in the kernel to adhere to
  the standard compliance.

- Updated the scalar computation with vector computation(using 128
  bit registers), in case of dealing with a single element(fringe case)
  in unit-stride or vectors with non-unit strides. A single dcomplex
  element occupies 128 bits in memory, thereby providing scope for
  this optimization.

- Added accuracy and extreme value testing with sufficient sizes
  and initializations, to test the required main and fringe cases
  of the computation.

AMD-Internal: [CPUPL-3623]
Change-Id: I7ae918856e7aba49424162290f3e3d592c244826
2023-10-12 06:31:08 -04:00
Meghana Vankadari
3a71550bc3 Enabling SUP blocksizes & kernels for generic config
Details:
- pack and compute extension APIs derive blocksizes(MR, NR...) from
  SUP cntx.
- SUP blocksizes are not set for generic/skx configs. As a result pack
  and compute APIs cause floating point exceptions.
- To fix these issues, we have enabled non-zero SUP blocksizes for
  generic config and zen4 SUP blocksizes for skx config.
- However, these changes will not enable SUP path for skx/generic config
  as thresholds are set to zero.
- To enable SUP path for skx config, more work is needed like non-zero
  thresholds and modifications to build system.

Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9
2023-10-12 05:27:10 -04:00
bhaskarn
5fd24c27a7 Updated expf max min precission fix nan issue in Tanh
Description:
The expf_max and expf_min have more precission than
the computation which is leading to corss the clipping at
the edge case which is causing nan's in the tanh output.

Updated the thresholds to less precission to clip the
edge cases to avoid nan's in the tanh output.

AMD-Internal: [SWLCSG-2423 ]
Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29
2023-10-12 01:04:59 -04:00
Meghana Vankadari
4874895a68 LPGEMM: Added transA support for bf16bf16f32o<bf16|f32> APIs
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
  aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
  column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
  format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
2023-10-11 07:16:08 -04:00
Shubham Sharma
25bab76f58 Changed threshold to use DTRSM Small MT
- Threshold to use DTRSM small MT code
  path is lowered.

AMD-Internal: [CPUPL-3781]
Change-Id: Ie1f232aa6d216b839df23657b54edb0448a64267
2023-10-11 01:33:00 -04:00
Chandrashekara K R
6132194468 Updated "HEADER_PATH" in main CMakeLists.txt file.
- Updated "HEADER_PATH" cmake variable to make sure blis header
  file is not missing function declarations on Windows.

AMD-Internal: [CPUPL-3881]
Change-Id: Id71ec16c800411cd727fc78e3f772ea1b751f971
2023-10-10 08:44:07 -04:00
Vignesh Balasubramanian
9828039030 Bugfix : Inversion of sign bit with early return in SNRM2_
- The bli_snormfv_unb_var1( ... ) function returns early in
  case of n = 1, and uses the blis macro bli_fabs( ... ) to
  set the norm to the absolute value of the element.

- This macro inverts the sign bit even if the element is 0.0.
  A check is added to re-invert the sign bit in this case, so
  that the norm is set to 0.0 instead of -0.0.

- Added the same early exit condition on bli_dnormfv_unb_var1( ... )
  when n = 1.

AMD-Internal: [CPUPL-3923]
Change-Id: If7f5ae41d2acfe89b505549d28215dde319d8c33
2023-10-10 04:21:09 -04:00
mkadavil
c3b97559c1 Zero Point support for <u|s>8s8s<32|16>os8 LPGEMM APIs
-Downscaled / quantized value is calculated using the formula
x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels
for these APIs only support scaling.
Zero point addition is implemented as part of this commit, with it
being fused as part of the downscale post-op in the micro-kernel. The
zero point input is a vector of int8 values, and currently only vector
based zero point addition is supported.
-Bench enhancements to test/benchmark zero point addition.

AMD-Internal: [SWLCSG-2332]
Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea
2023-10-10 12:05:47 +05:30
Edward Smyth
85f2bf6c4a Fix for x86_64 builds
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
  config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
  for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
  frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
  enabled.

Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.

AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
2023-10-09 07:24:21 -04:00
jagar
5d578684ea GtestSuite: Update in source code to make it compatible on MSVC(windows)
AMD-Internal: [CPUPL-2732]
Change-Id: Ifd9372bf9b0f00c2bf24442ea8519bfcf4e5db5b
2023-10-09 04:43:29 -04:00
jagar
712a84d50f Gtestsuite: Update in cmake to search reflib in given path
AMD-Internal: [CPUPL-2732]
Change-Id: Ide2b98a95f81f394c7c01cc3a3b5ae6fa0403a82
2023-10-05 05:39:27 -04:00
eashdash
30bdeecbcc Added BLAS Extension APIs - Get Size and Pack API
1. 4 new APIs are added to support packed compute GEMM operations
   1. dgemm_pack_get_size
   2. sgemm_pack_get_size
   3. dgemm_pack
   4. sgemm_pack

2. Pack_get_size API
   1. Returns size in bytes required for packing of input
   2. Requires identifier to identify the input matrix to
      be packed
   3. Additionally requires 3 integer parameters for input
      dimensions

3. Packed buffer is allocated using the pack size computed

4. Pack API:
   1. Performs full matrix packing of the input
   2. Additionally, performs the alpha scaling
   3. Packed buffer created contains the full packed matrix

5. The GEMM compute calls are required to be operated on the
   packed buffer with alpha = 1 since alpha scaling is already
   done by the Pack API

6. GEMM Pack API eliminate the cost of packing the input matrixes
   by avoiding on the go pack in the GEMM 5 loop.
   Packing of input matrixes are done when there is resue of
   matrixes across different GEMM calls.

AMD-Internal: [CPUPL-3560]
Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3
2023-10-04 06:43:59 -04:00
Edward Smyth
24e4d58f92 Tidy zen bli_cntx_init and bli_family files
Tidy formatting of config/*zen*/bli_cntx_init_zen*.c and
config/*zen*/bli_family_*.c files to make them more
consistent with each other and improve readability.

AMD-Internal: [CPUPL-3519]
Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b
2023-10-04 05:14:39 -04:00
Harsh Dave
df80f40ccd Fixed incorrect ymm registers usage in FMA operation.
- Incorrect ymm registers were used in dgemm SUP edge kernel,
    while computing FMA operation.

- Due to incorrect vector register, it resulted into incorrect result.

- Corrected vector registers usage for FMA operation.

AMD-Internal: [CPUPL-3964]

Change-Id: I37fcb5f8eeb5945fe994d8a5b69815a3bcca87df
2023-10-02 03:20:44 -04:00
Arnav Sharma
f0416cff08 SGEMM SUP Panel Stride Bug Fix
- The AVX512 SGEMM SUP rv m and n kernels did not accomodate for the
  use of panel strides in case of packed matrices, thus resulting in
  incorrect matrix strides when packing was explicitly enabled using
  BLIS_PACK_A=1, BLIS_PACK_B=1 or both.
- The kernels are updated to use panel strides for traversing both A
  and B matrix buffers accurately.

[AMD-Internal]: CPUPL-3673
Change-Id: I4341ed7e1e1419cc3e2063b06f278edcb9145adb
2023-09-27 03:02:24 -04:00
Kiran Varaganti
db4fbfe9a6 Fix compiler error for "inline" functions in LPGEMM bench Application
Functions which are declared as "inline" may trigger compiler error "undefined function"
    This linker error is eliminated by use "static" before "inline".
    Therefore added "static" before all inline functions.

Change-Id: I5952fb71112fc4792011c3e29be930ccfbce4562
2023-09-27 02:26:23 -04:00
jagar
29711dd5a3 Gtestsuite: Updated testings_basics.* to print matrix/vector name
AMD-Internal: [CPUPL-2732]
Change-Id: I89b4ffc97ea852e66f42b82058af67c16144fbf6
2023-09-26 08:27:19 -04:00
jagar
f96e20b894 Update in CMakeLists.txt to install on windows
Updated CMakeLists.txt to copy library and headers into
folder mentioned during cmake configuration.
Steps to install
1. cmake .. -G  ........ -DCMAKE_INSTALL_PREFIX=path_to_install
2. cmake --build . --config Release
3. cmake --install .  (install lib and headers)

Change-Id: Ic2728209a2e1d181cc92bab08b82a748bec583d4
2023-09-26 07:56:03 -04:00
Harsh Dave
e437469a99 Optimized AVX2 DGEMM SUP edge kernels
- For edge kernels which handles the corner cases and specially
for cases where there is really small amount of computation to
be done, executing FMA efficiently becomes very crucial.

- In previous implementation, edge kernels were using same, limited
number of vector register to hold FMA result, which indirectly creates
dependency on previous FMA to complete before CPU can issue new FMA.

- This commit address this issue by using different vector registers
that are available at disposal to hold FMA result.

- That way we hold FMA results in two sets of vector registers, so that
sub-sequent FMA won't have to wait for previous FMA to complete.

- At the end of un-rolled K loop these two sets of vector registers are
added together to store correct result in intended vector registers.

AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54
2023-09-22 03:43:47 -04:00
Edward Smyth
ccb8dd26fd Compiler warnings when using --int-size=32
Correct compiler warnings when building with configure --int-size=32
- bla_imatcopy.c: Cast ints to longs to match %ld format
  specification in error printf statement and change this to fprintf
  to stderr. Also copy this additional fprintf statement to other
  variants of this function.
- bli_type_defs.h: siz_t should always be the same size as a pointer.
  This corrects an issue in bli_malloc.c when casting from a pointer
  to a siz_t integer value.

AMD-Internal: [CPUPL-3519]
Change-Id: Ic87cd6142b8a6fed177b7c55bc0bb6013c5b69ab
2023-09-19 06:08:19 -04:00
Edward Smyth
15f9a747af Fix for non-x86 builds: bli_gemmt_sup_var1n2m.c
bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to
frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore
bli_gemmt_sup_var1n2m.c as of commit 10ca8710f0 as variant
for non-AMD codepath builds.

AMD-Internal: [CPUPL-3838]
Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f
2023-09-13 16:03:53 +05:30
Edward Smyth
0d16d952dc BLIS: DTL enhancements
Several improvements to BLIS DTL functionality
- For APIs that report performance statistics, test for time=0.0
  before dividing by time when calculating GFLOPS.
- Call AOCL_DTL_TRACE_EXIT in the parameter checking functions
  inlined from ./frame/compat/check/bla_*_check.h
- Correct flop count for complex routines.

AMD-Internal: [CPUPL-3736]
Change-Id: Icc515d88810dd79e66e22ea8c47d84649ca9f768
2023-09-11 08:42:11 -04:00
orequest
09e34fd2bd Added optimised CGEMM function pointers in zen4 cntx
1. Two CGEMM function pointers are added for different storage schemes
   1. bli_cgemmsup_rv_zen_asm_3x8m
   2. bli_cgemmsup_rv_zen_asm_3x8n

2. In previous commit:
  (Level-3 triangular routines now use different block sizes and kernels
   Commit Id: 79e174ff0a)

   1. bli_cntx_set_l3_sup_tri_kers cntx function was created
   2. Function holds optimised function pointers for GEMMT/SYRK API's
   3. It avoids over riding default block sizes which improves the
      performance
   4. This function did not include optimised CGEMM function pointers
      leading to regression as reference kernels were invoked

3. With this commit, 2 optimized CGEMM function pointers are added in
   bli_cntx_set_l3_sup_tri_kers
   1. This fixes the regression as optimized CGEMM functions are invoked

AMD-Internal: [CPUPL-3831] [CPUPL-3830]

Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e
2023-09-11 06:38:31 -04:00
Edward Smyth
9d19effec5 Fix for non-x86 builds: cpuid query functions
Ensure functions bli_cpuid_query_id() and
bli_cpuid_query_model_id() are defined for all
architectures in bli_cpuid.c

AMD-Internal: [CPUPL-3838]
Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e
2023-09-11 04:19:14 -04:00
Vignesh Balasubramanian
32104c400c GTestSuite : Designing test cases for ZGEMM
- Designed test cases for unit testing of ZGEMM compute
  kernel for handling inputs when k == 1. The design
  uses value-parameterized testing for checking accuracy,
  and verifying the mandate in case of exception values
  on the inputs/output.

- The design uses type-parameterized testing for verifying
  BLAS standard for invalid input cases, and also for early
  return scenarios.

- Added the function template set_ev_mat( ... ) as part of
  testinghelpers. This function is used as a helper for
  inducing exception values onto indices specified as
  arguments to the test_gemm( ... ) interface.

- Abstracted the function definition of getValueString( ... )
  from the NRM2 testing interface to testinghelpers(renamed
  as get_value_string( ... ) for naming consistency), in order
  to use it as a helper function across all APIs in case of
  exception value testing.

AMD-Internal: [CPUPL-3823]
Change-Id: I0fea21f9c8759bbbdc88ba0a016202753e28f2a7
2023-09-08 17:36:57 +05:30
mkadavil
e5e9127a68 Fixes for aocl_gemm addon compilation issues
Certain functions were updated recently and now takes extra arguments
for error handling. Usage of the same are now updated in aocl_gemm.

Change-Id: I7daca4fd1f284d57034d564f0a08cc6410ccfd5c
2023-09-06 16:00:34 +05:30
Eleni Vlachopoulou
a6641dec0b Updating GTestSuite CMake system to enable testing BLIS libraries on Windows.
- Renaming ELEMENT_TYPE to BLIS_ELEMENT_TYPE, since the first is defined on a Windows header.
- Updating refCBLAS object to have different implementation depending on the platform.
- Removing dlfcn.h from all reference headers since it's linux specific and adding it conditionally on a higher level.
- Changes on all CMakeLists.txt files to enable building on Windows.

AMD-Internal: [CPUPL-2732]
Change-Id: I6e35656a3779b35dc815a2409cf84c22dd27f3e7
2023-08-29 16:11:22 +05:30
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Shubham Sharma
0000cc88de Removed local copy of cntx in TRSM
- TRSM and GEMM has different blocksizes in zen4, in order
  to accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
  stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
  are removed. Instead of overriding the default blocksizes,
  TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
  and GEMM pack buffers have to be packed with default blocksizes.
  To check if we are packing for TRSM, "family" argument is added
  in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
  it is not set then BLIS_GEMM_UKR has to be used. This functionality
  has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
  to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
  signature of bli_packm_init_pack.

AMD-Internal: [CPUPL-3781]

Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
2023-08-16 08:09:01 -04:00
Harihara Sudhan S
278ca71706 Fixes for GEMV Functionality Issues
- Added call to dsetv in dscalv. When DSCALV is invoked by
  DGEMV the SCAL function is expected to SET the vector to
  zero when alpha is 0. This change is done to ensure BLAS
  compatibility of DGEMV.
- Fixed bug in DGEMV var 1. Reverted changes in DGEMV var
  1 to remove packing and dispatch logic.
- CMAKE now builds with _amd files for unf_var2 of GEMV.

AMD-Internal: [CPUPL-3772]
Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade
2023-08-14 13:54:48 +05:30
Chandrashekara K R
248d09c722 Version String Update
AOCL-BLIS: Updated version string to AOCL-BLIS 4.1.1 Build <YYMMDD>

Change-Id: Iced62a66d0859b3c7d4bcfe6f0e0527922e41cae
2023-08-08 07:27:41 -04:00
Harihara Sudhan S
03fa660792 Optimized xGEMV for non-unit stride X vector
- In GEMV variant 1, the input matrix A is in row major. X vector
  has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
  operation inside the kernel has been ensured by packing the input X
  vector to a temporary buffer with unit stride. Currently, the
  packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
  In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
  is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
  to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
2023-08-08 01:01:22 -04:00
Edward Smyth
c445f192d5 BLIS: Missing clobbers (batch 6)
More missing clobbers in skx and zen4 kernels, missed in
previous commits.

AMD-Internal: [CPUPL-3521]
Change-Id: I838240f0539af4bf977a10d20302a40c34710858
2023-08-07 10:52:23 -04:00
Harihara Sudhan S
3be43d264f Optimized xGEMV for non-unit stride Y vector
- In variant 2 of GEMV, A matrix is in column major. Y vector has
  to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
  GEMV operation inside the kernel has been ensured by packing the
  input Y vector to a temporary buffer with unit stride. As part of
  the packing Y is scaled by beta to reduce the number of times Y
  vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
  buffer are copied to the original buffer and the temporary one is
  released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
  to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
2023-08-07 08:12:44 -04:00
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Re-designed the new edge kernels that uses masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store

- Following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, m_left computation is handled
  with masked load-store instructions avoid overhead
  of conditional checks for edge cases.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian
758ec3b5ca ZGEMM optimizations for cases with k = 1
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
  bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
  ZGEMM. The kernel is built for handling the GEMM computation
  with inputs having k = 1, and the transpose values for A and
  B as N.

- The kernel dimension has been changed from 4x6 to 4x4,
  due to the following reasons :

  - The 1xNR block of B in the n-loop can be reused over multiple
    MRx1 blocks of A in the m-loop during computation. Similar
    analogy exists for the fringe cases.

  - Every 1xNR block of B was scaled with alpha and stored in
    registers before traversing in the m-dimension. Similar change
    was done for fringe cases in n-dimension.

  - These registers should not be modified during compute, hence
    the kernel dimension was changed from 4x6 to 4x4.

- The check for early exit(with regards to BLAS mandate) has been
  removed, since it is already present in the BLAS layer.

- The check for parallel ZGEMM has been moved post the redirection to
  this kernel, since the kernel is single-threaded.

- The bli_kernels_zen.h file was updated with the new kernel signature.

AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
2023-08-07 15:10:08 +05:30
Harihara Sudhan S
c97471dce0 Added AVX512 ZDSCALV kernel
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
  the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
  instructions.
- Removed the negative incx handling checks from the blis_impli layer
  of ZDSCAL as BLAS expects early return for incx <= 0.

AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
2023-08-06 01:51:47 -04:00
Harihara Sudhan S
b126c9943b ZSCALV kernel optimization
- ZSCALV kernel now uses fmaddsub intrinsics instead of mul
  followed by addsub instrinsics.
- Removed the negative incx handling checks from the BLAS impli
  layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.

AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
2023-08-04 06:57:18 -04:00
Shubham Sharma
9607f207da AOCL Dynamic tuning for DAXPYV
- Existing logic is not picking the ideal number
  of threads for some problem sizes.
- Problem size and their corresponding ideal number
  of threads are retuned for daxpy in aocl dynamic.

AMD-Internal: [CPUPL-3484]
Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7
2023-08-01 10:34:04 +05:30
Eleni Vlachopoulou
fa77d0415a Updating nrm2 GTestSuite testing
- Adding default template parameter for the type of the returned value from nrm2.
- Bugfix on NaN/Inf comparator for scalars.
- Tuning sizes of vector x to exercise the different paths for vectorized and scalar code.
- Adding wrong parameters and extreme value testing.
- Adding tests for overflow and underflow using max and min representable numbers for vectorized and scalar code.

AMD-Internal: [CPUPL-2732]
Change-Id: Ice8ee65095ecaa7b30ebd5f90ed2a890178533db
2023-07-28 05:03:00 -04:00
Shubham Sharma
954c97f858 Added NT in DTL logs for GEMMT, TRSM and NRM2
- Number of threads and gflops are added
  in the DTL logs for GEMMT, TRSM and NRM2

AMD-Internal: [CPUPL-2144]
Change-Id: If68887a5150bd0feda351180f379996497a1e678
2023-07-27 05:15:08 -04:00
Meghana Vankadari
79e174ff0a Level-3 triangular routines now use different block sizes and kernels.
Details:
    - Eliminated the need for override function in SUP for GEMMT/SYRK.
    - New set of block sizes, kernels and kernel preferences
      are added to cntx data structure for level-3 triangular routines.
    - Added supporting functions to set and get the above parameters from cntx.
    - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
      In case they are not set, use the default block sizes/kernels of
      Level-3 SUP.

AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
2023-07-26 01:26:11 -04:00
Chandrashekara K R
7b78d93282 Removing omp library linking to static multithreaded library build.
Description: We have seen the library dependency issue when we are
linking the libomp.lib or libiomp5md.lib while building the library
for static multithreaded scenario. So we are removing the linking of
openmp library for static multithreaded blis library build. So that
user can link any openmp library(libomp.lib or libiomp5md.lib) while
building their applications by linking static multithreaded blis library.

AMD-Internal: [SWLCSG-2196]
Change-Id: I96722f3587ee555af12de664957c211c56fcf03d
2023-07-13 06:54:02 -04:00
Eleni Vlachopoulou
660cd6d1b2 Adding nrm2 target for benchmarking on Windows.
Modifying blis/bench/CMakeLists.txt to include nrm2 target and produce the corresponding executable.

AMD-Internal: [CPUPL-3625]
Change-Id: I7945416142e07ac99510ed9500a2c620053c7e13
2023-07-10 14:03:05 -04:00