Commit Graph

1327 Commits

Author SHA1 Message Date
Edward Smyth
6e020ecc01 Include bli_lang_defs.h in cblas.h
Changes in commit 64a1f786d5 (via merge c6f3340125) included in
./frame/include/bli_type_defs.h a prototype that uses the C
restrict keyword. When using C++ we need to provide a definition
for this C language keyword. This is done in bli_lang_defs.h which
was included in blis.h but not in cblas.h.

AMD-Internal: [CPUPL-4188]
Change-Id: I75d5f32599d18794331ff452e562eb42afb5ae93
2023-11-14 10:52:35 -05:00
Edward Smyth
c6f3340125 Merge commit '5013a6cb' into amd-main
* commit '5013a6cb':
  More edits and fixes to docs/FAQ.md.
  Fixed newly broken link to CREDITS in FAQ.md.
  More minor fixes to FAQ.md and Sandboxes.md.
  Updates to FAQ.md, Sandboxes.md, and README.md.
  Safelist 'master', 'dev', 'amd' branches.
  Re-enable and fix fb93d24.
  Reverted fb93d24.
  Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
  Removed last vestige of #define BLIS_NUM_ARCHS.
  Added new packm var3 to 'gemmlike'.
  Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
  Fix more copy-paste errors in the haswell gemmsup code.
  Do a fast test on OSX. [ci skip]
  Fix AArch64 tests and consolidate some other tests.
  Use C++ cross-compiler for ARM tests.
  Attempt to fix cxx-test for OOT builds.
  Updated travis-ci.org link in README.md to .com.
  Disabled (at least temporarily) commit 8e0c425.
  Define BLIS_OS_NONE when using --disable-system.
  Updated stale calls to malloc_intl() in gemmlike.
  Blacklist clang10/gcc9 and older for 'armsve'.
  Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.
  Moved lang defs from _macro_def.h to _lang_defs.h.
  Minor tweaks to gemmlike sandbox.
  Added local _check() code to gemmlike sandbox.
  README.md citation updates (e.g. BLIS7 bibtex).
  Tweaks to gemmlike to facilitate 3rd party mods.
  Whitespace tweaks.
  Add row- and column-strides for A/B in obj_ukr_fn_t.
  Clean up some warnings that show up on clang/OSX.
  Remove schema field on obj_t (redundant) and add new API functions.
  Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.
  Disabled sanity check in bli_pool_finalize().
  Implement proposed new function pointer fields for obj_t.

AMD-Internal: [CPUPL-2698]
Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d
2023-11-10 13:05:12 -05:00
Mangala V
f6046784ce Re-Designed SGEMM SUP kernel to use mask load/store instruction
Added all fringe kernels with mask load store support
Fringe kernels cover m direction from 5 to 1 and
n direction from 15 to 1 for row storage format

- New edge kernels that uses masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmaskmovps, VMASKMOVPS for masked load-store.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

- Mask load-store is added only for row storage format

AMD-Internal: [CPUPL-4041]

Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456
2023-11-10 01:23:48 -05:00
Vignesh Balasubramanian
bd0b50a077 Introduced fast-path to kernels in DNRM2_ and DZNRM2_ APIs
- Added a conditional check to see if the vectorized kernels
  for DNRM2_ and DZNRM2_ can be called directly, without
  incurring any framework overhead.

- The condition to satisfy this fast-path is for the size to be
  such that the ideal threads required is 1, with the vector having
  unit stride( so that packing at the framework-level can be avoided ).

AMD-Internal: [CPUPL-4045]
Change-Id: Ie37e86f802ada0e226dff88e74f0341e97ebfe28
2023-11-09 21:13:10 +05:30
Eashan Dash
e4e4fe55fb Added Parameter Checks and DTL Trace for Extension APIs
1. Added input parameter checking for the extension APIs
   1. gemm_pack_get_size API
   2. gemm_pack API

2. Additionally added early returns for these APIs when
   m or n dimensions are 0.

3. Routines for input parameter check for all the 3
   BLAS extension APIs - gemm_pack_get_size, gemm_pack and
   gemm_compute are defined in:
   frame/compat/check/bla_gemm_pack_compute_check.h

4. Added AOCL DTL TRACE for all the functions of
   1. gemm_pack_get_size
   2. gemm_pack
   3. gemm_compute

AMD-Internal: [CPUPL-3560]
Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631
2023-11-09 18:53:59 +05:30
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Edward Smyth
9500cbee63 Code cleanup: spelling corrections
Corrections for some spelling mistakes in comments.

AMD-Internal: [CPUPL-3519]
Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba
2023-11-09 00:16:30 -05:00
Harsh Dave
75356d45e5 DGEMM improvement for very tiny sizes less than 24.
- This commit helps improving performance for very small input
by reducing framework check and routing all such inputs to
bli_dgemm_tiny_6x8_kernel. It forces single threaded computation
for such sizes.

- It invokes bli_dgemm_tiny_6x8_kernel for ZEN, ZEN2, ZEN3 and ZEN4
code path. Except for the case AOCL_ENABLE_INSTRUCTIONS environment
variable is set to avx512. In that case, such a small inputs are
routed to bli_dgemm_tiny_24x8_kernel avx512 kernel.

AMD-Internal: [CPUPL-1701]
Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e
2023-11-08 23:45:57 -05:00
Arnav Sharma
008b77e94d BLAS Compliance: SCALV Early Returns
- According to BLAS Standards, SCALV should return when incx .le. 0.
- To make SCALV compliant to this, added an early return inside the BLAS
  layer, for the cases where incx <= 0.
- Also, added early return for the case where alpha is a unit scalar.

AMD-Internal: [CPUPL-3562]
Change-Id: Id474fdd6ed9232226f5c5381d0398f43384e4a49
2023-11-08 06:36:02 -05:00
Vignesh Balasubramanian
5f9c8c6929 Bugfix : Fallback mechanism in SNRM2 and SCNRM2 kernels if packing fails
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
  a layer higher.

- Added a scalar loop to handle compute in case of non-unit strides.
  This loop ensures functionality in case packing fails at the
  framework level.

AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
2023-11-08 15:16:10 +05:30
mangala v
fa355c0049 Removed warning during compilation of gemv api for non-zen config
- When configured for haswell config "Warning unused variable 'zero'"
  was throwed during compilation.
- Removed zero variable which is not being used

AMD-Internal: [CPUPL-3973]
Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8
2023-11-08 01:43:33 -05:00
Harsh Dave
0de10cc86c Added k=1 avx512 dgemm kernel.
- This commit implements avx512 dgemm kernel for k=1 cases.
which gets called for zen4 codepath.

- Added architecture check for k=1 kernel in dgemm code path
to pick correct kernel based on cpu arhcitecture since now
blis is having avx2 and avx512 dgemm kernels for k=1 case.

- Previously in dgemm path bli_dgemm_8x6_avx2_k1_nn kernel was
being called irrespective of architecture type.

- Added architecture check before calling the kernel for case where
k=1, so only for respective architectures this kernel is invoked.

AMD-Internal: [CPUPL-4017]
Change-Id: I418bbc933b41db41d323b331c6d89893868a6971
2023-11-07 01:10:09 -05:00
Harihara Sudhan S
bed6bd4941 Modified DSCALV AOCL dynamic
- AOCL dynamic logic that determines the number of threads to be
  launched has been modified.

AMD-Internal: [CPUPL-3956]
Change-Id: Ia6c052515bd24e93660f020a7d0894fc75a229fc
2023-11-06 22:35:14 -05:00
Arnav Sharma
44dfc7a515 Fix for gemm_compute BLAS Check
- BLAS compute checks updated to properly check for rs_c and cs_c.
- Updated BLAS compute checks to skip validity check if m==1 or n==1.
  For the same reason, added a check just before to validate rs_c and
  cs_c are greater than or equal to 1.
- Added tiny size tests to gtestsuite as a sanity check.
- Also updated the Invalid Input Tests to test for the updated checks.

AMD-Internal: [CPUPL-4140]
Change-Id: I984339ec7909778b58409ffcdbeed4ee33f28cfb
2023-11-03 09:41:16 -04:00
Arnav Sharma
8885510db2 Fix for Missing Symbols for gemm_pack_get_size
- Symbols for gemm_pack_get_size were not being exported properly when
  BLIS was built as a shared library.
- Correctly assigned the BLIS_EXPORT_BLAS macro to ?gemm_pack_get_size_
  function declaration.
- Added missing gemm_pack and gemm_pack_get_size macros to
  bli_macro_defs.h file.
- Removed an unnecessary BLIS_EXPORT_BLAS macro from dgemm_compute
  function definition.
- Updated bli_util_api_wrap with no underscore API wrappers for pack and
  compute set of BLAS Extension APIs:
  1. ?gemm_pack_get_size
  2. ?gemm_pack
  3. ?gemm_compute

AMD-Internal: [CPUPL-4083]
Change-Id: I78cd7642c2fcbfdf02676e654a377ad2aa5295c1
2023-11-03 08:58:59 -04:00
Eashan Dash
c3d1a3878c Parallelized Pack and Compute Extension APIs
1. OpenMP based multi-threading parallelism is added for BLAS
   extension APIs of Pack and Compute

2. Both pack and compute APIs are parallelized.

3. Multi-threading of pack and compute APIs done with different
   number of threads can lead to inconsistent results due to
   output difference of the full packed matrix buffer when packed
   with different number of threads.

4. In multi-threaded execution, we ensure output of packed buffer
   is exactly the same as in single threaded execution.

5. Similarly for compute API, read of packed buffer in multi-
   threaded execution is exactly the same as in single-threaded
   execution.

6. Routines are added to compute the offsets for thread workload
   distribution for MT execution.
   1. The offsets are calculated in such a way that it resembles
      the reorder buffer traversal in single threaded reordering.
   2. The panel boundaries (KCxNC) remain as it is accessed in
      single thread, and as a consequence a thread with jc_start
      inside the panel cannot consider NC range for reorder.
   3. It has to work with NC' < NC, and the offset is calulated
      using prev NC panels spanning k dim + cur NC panel spaning
      pc loop cur iteration + (NC - NC') spanning current
      kc0 (<= KC).

7. Routines to ensure the same are added for MT execution
   1. frame/base/bli_pack_compute_utils.c
   2. frame/base/bli_pack_compute_utils.h

AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
2023-11-03 08:47:17 -04:00
Vignesh Balasubramanian
84faccdd7d Enabling the vectorized path for SNRM2_
- Enabled the vectorized AVX-2 code-path for SNRM2_. The
  framework queries the architecture ID and calls the
  vectorized kernel based on the architecture support.

- In case of not having the architecture support, we use
  the default path based on the sumsqv method.

AMD-Internal: [CPUPL-3277]
Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f
2023-11-03 17:45:56 +05:30
Edward Smyth
d8b8f68066 Improvements to xerbla functionality (2)
Improvements to functionality introduced in commit
6d0444497f:
- Call to bli_init_auto() before calling PASTEBLACHK macro in
  gemv caused significant runtime overhead. Initialize stored
  info_value directly.
- Add similar code in frame/compat/f2c routines.

AMD-Internal: [CPUPL-3520]
Change-Id: I2df201aed7dbceb4cbe66d6c81b5a03e8092de89
2023-11-03 06:38:40 -04:00
Vignesh Balasubramanian
ef545b928e Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
  present in the bli_gemv_unf_var2_amd.c file, as part of the
  bli_sgemv_unf_var2( ... ) function. This was changed to
  bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor
  from 6 to 5 ), in accordance to the function pointer present
  in the zen3 and zen4 context files.

- Changed the accumulator type to double from float, inside the
  fringe loop for unit-strides(vectorized path) and non-unit strides
  (scalar code).

AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
2023-11-03 01:37:43 -04:00
Harihara Sudhan S
106342f402 ZGEMV optimization for special cases in beta
- Avoiding scaling of y vector by beta when beta is 1.

AMD-Internal: [CPUPL-3829]
Change-Id: I9cf46f44c5f1c2da3653937ff035594b4046b4a1
2023-11-02 08:21:46 -04:00
Edward Smyth
248dc2af9a Implement AOCL_ENABLE_INSTRUCTIONS environment variable
Add AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative
to BLIS_ARCH_TYPE. The details are:

1. AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE env vars are both
   supported, with BLIS_ARCH_TYPE taking precedence if both are set.
2. Values of "avx2" and "avx512" are aliases for "zen3" and "zen4"
   code paths respectively in AMD focused builds, or for "skx" and
   "haswell" respectively in Intel focused builds. These names are
   not case-sensitive.
3. BLIS_ARCH_TYPE specifies the code path to use. If this is
   unsupported, e.g. zen4 code path on a Milan or earlier system,
   that code path is still executed, likely resulting in an illegal
   instruction error.
4. By contrast, AOCL_ENABLE_INSTRUCTIONS will check ISA support on
   the system (for AVX2 and AVX512), and try a "lower" ISA option if
   the desired one is not supported, i.e. AVX512->AVX2, AVX2->generic.
5. Appropriate messages are printed if BLIS_ARCH_DEBUG=1 is set.

AMD-Internal: [CPUPL-4105]
Change-Id: Ia941b41d4b7d11f5589d7c5e16f607618baed315
2023-10-27 14:59:33 -04:00
Shubham Sharma
d45d1d68c6 Reset ZMM Registers before exiting, in L3 APIs
- Register ZMM16 to ZMM31 are zeroed after L3 api calls.
- This change is done only for ZEN4 code path.
- bli_zero_zmm function is added which resets these registers.

AMD-Internal: [CPUPL-3882]
Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8
(cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)
2023-10-27 00:51:04 -04:00
Harsh Dave
7bcb701b79 Fixed functionality failure for dgemm tiny kernel.
- For k > KC, C matrix is getting scaled by beta on each
iteration. It should be scaled only once. Fixed the scaling
of C matrix by beta in K loop.

- Corrected A and B matrix buffer offsets, for cases where k > KC.

AMD-Internal: [CPUPL-4078]
AMD-Internal: [CPUPL-4079]
AMD-Internal: [CPUPL-4081]
AMD-Internal: [CPUPL-4080]
AMD-Internal: [CPUPL-4087]
Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa
2023-10-26 15:11:59 +05:30
Edward Smyth
f5505be9f3 Merge commit 'e366665c' into amd-main
* commit 'e366665c':
  Fixed stale API calls to membrk API in gemmlike.
  Fixed bli_init.c compile-time error on OSX clang.
  Fixed configure breakage on OSX clang.
  Fixed one-time use property of bli_init() (#525).
  CREDITS file update.
  Added Graviton2 Neoverse N1 performance results.
  Remove unnecesary windows/zen2 directory.
  Add vzeroupper to Haswell microkernels. (#524)
  Fix Win64 AVX512 bug.
  Add comment about make checkblas on Windows
  CREDITS file update.
  Test installation in Travis CI
  Add symlink to blis.pc.in for out-of-tree builds
  Revert "Always run `make check`."
  Always run `make check`.
  Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script.   if the string contains zen and zen2, and zen need to be replaced with another string, then zen2   also be incorrectly replaced.
  Update POWER10.md
  Rework POWER10 sandbox
  Skip clearing temp microtile in gemmlike sandbox.
  Fix asm warning
  Sandbox header edits trigger full library rebuild.
  Add vhsubpd/vhsubpd.
  Fixed bugs in cpackm kernels, gemmlike code.
  Armv8A Rename Regs for Safe Darwin Compile
  Armv8A Rename Regs for Clang Compile: FP32 Part
  Armv8A Rename Regs for Clang Compile: FP64 Part
  Asm Flag Mingling for Darwin_Aarch64
  Added a new 'gemmlike' sandbox.
  Updated Fugaku (a64fx) performance results.
  Add explicit compiler check for Windows.
  Remove `rm-dupls` function in common.mk.
  Travis CI Revert Unnecessary Extras from 91d3636
  Adjust TravisCI
  Travis Support Arm SVE
  Added 512b SVE-based a64fx subconfig + SVE kernels.
  Replace bli_dlamch with something less archaic (#498)
  Allow clang for ThunderX2 config

AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
2023-10-18 09:09:54 -04:00
Edward Smyth
6d0444497f Improvements to xerbla functionality
The following improvements have been implemented:
- Option to stop in xerbla on error. This is controlled by
  setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
  is controlled by setting the environment variable
  BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
  assuming xerbla was not set to stop on error. Example call is

     info = bli_info_get_info_value();

The default behaviour remains to print but don't stop on error,
i.e. the equivalent to

     export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0

Implementation details:
- Values of the environment variables are stored and retrieved
  from global_rntm.
- Info value is stored and retrieved from tl_rntm. It is set
  to 0 during initialization for all calls and updated by xerbla
  if an error has occurred.
- Call to bli_init_auto before calling PASTEBLACHK macro (which
  calls xerbla) will reinitialize info_value to 0 via call to
  bli_thread_update_rntm_from_env

AMD-Internal: [CPUPL-3520]
Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d
2023-10-16 08:48:51 -04:00
Arnav Sharma
c8f14edcf5 BLAS Extension API - ?gemm_compute()
- Added support for 2 new APIs:
	1. sgemm_compute()
	2. dgemm_compute()
  These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
  APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
  packed matrix identifier) and performs the GEMM operation:
  C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
  scheme isn't matching, and the respective matrix being loaded isn't
  packed either, on-the-go packing has been enabled for such cases to
  pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
  it is the responsibility of the user to pack only one matrix with
  alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
  and compute APIs are forced to take n_threads=1.

AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
2023-10-16 08:18:52 -04:00
Vignesh Balasubramanian
81161066e5 Multithreading the DNRM2 and DZNRM2 API
- Updated the bli_dnormfv_unb_var1( ... ) and
  bli_znormfv_unb_var1( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Added the logic to distribute the job among the threads such
  that only one thread has to deal with fringe case(if required).
  The remaining threads will execute only the AVX-2 code section
  of the computational kernel.

- Added reduction logic post parallel region, to handle overflow
  and/or underflow conditions as per the mandate. The reduction
  for both the APIs involve calling the vectorized kernel of
  dnormfv operation.

- Added changes to the kernel to have the scaling factors and
  thresholds prebroadcasted onto the registers, instead of
  broadcasting every time on a need basis.

- Non-unit stride cases are packed to be redirected to the
  vectorized implementation. In case the packing fails, the
  input is handled by the fringe case loop in the kernel.

- Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... )
  and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
  cases of size = 2 ( and ) size = 1 or non-unit strides respectively.

AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
2023-10-16 07:26:27 -04:00
Harsh Dave
7a4f84fbac Optimized dgemm for tiny input sizes.
- This commit focused on enhancing the performance of dgemm
for matrices for very small dimenstions.

- blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing
the conventional SUP framework code path. As SUP framework code path
requires the creation and initilization of blis objects,
accessing all the needed meta-information from objects, querying contexts
which adds performance penaulty while computing for matrices with  very
small dimensions.

- To avoid such performance penaulty blis_dgemm_tiny function implements
a lightweight support code so that it can re-use dgemm SUP kernels such a way
that it directly operates on input buffers. It avoids framework overhead of
creating and intializing blis objects, context intialization, accessing other
large framework data structures.

- blis_dgemm_tiny function checks for threshold condition to match before
picking the kernel. For zen, zen2, zen3 architecture tiny kernel is invoked
for any shape as long as m < 8 and k <= 1500 or m < 1000 and n <= 24 and k <=1500.
While for zen4 as long as dimensions are less than 1500 for m,n,k tiny kernel is
invoked.

-blis_dgemm_tiny function supports single threaded computation as of now.

AMD-Internal: [CPUPL-3574]
Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f
2023-10-16 05:52:49 -04:00
Harihara Sudhan S
105de694cf Optimized ZGEMV variant 1
- Added an explicit function definition for ZGEMV var 1. This
  removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
  for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
  instead of querying it.
- With this change fringe loop is vectorized using SSE
  instructions.

AMD-Internal:[CPUPL-3997]

Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
2023-10-13 05:03:53 -04:00
Shubham Sharma
25bab76f58 Changed threshold to use DTRSM Small MT
- Threshold to use DTRSM small MT code
  path is lowered.

AMD-Internal: [CPUPL-3781]
Change-Id: Ie1f232aa6d216b839df23657b54edb0448a64267
2023-10-11 01:33:00 -04:00
Vignesh Balasubramanian
9828039030 Bugfix : Inversion of sign bit with early return in SNRM2_
- The bli_snormfv_unb_var1( ... ) function returns early in
  case of n = 1, and uses the blis macro bli_fabs( ... ) to
  set the norm to the absolute value of the element.

- This macro inverts the sign bit even if the element is 0.0.
  A check is added to re-invert the sign bit in this case, so
  that the norm is set to 0.0 instead of -0.0.

- Added the same early exit condition on bli_dnormfv_unb_var1( ... )
  when n = 1.

AMD-Internal: [CPUPL-3923]
Change-Id: If7f5ae41d2acfe89b505549d28215dde319d8c33
2023-10-10 04:21:09 -04:00
Edward Smyth
85f2bf6c4a Fix for x86_64 builds
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
  config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
  for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
  frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
  enabled.

Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.

AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
2023-10-09 07:24:21 -04:00
eashdash
30bdeecbcc Added BLAS Extension APIs - Get Size and Pack API
1. 4 new APIs are added to support packed compute GEMM operations
   1. dgemm_pack_get_size
   2. sgemm_pack_get_size
   3. dgemm_pack
   4. sgemm_pack

2. Pack_get_size API
   1. Returns size in bytes required for packing of input
   2. Requires identifier to identify the input matrix to
      be packed
   3. Additionally requires 3 integer parameters for input
      dimensions

3. Packed buffer is allocated using the pack size computed

4. Pack API:
   1. Performs full matrix packing of the input
   2. Additionally, performs the alpha scaling
   3. Packed buffer created contains the full packed matrix

5. The GEMM compute calls are required to be operated on the
   packed buffer with alpha = 1 since alpha scaling is already
   done by the Pack API

6. GEMM Pack API eliminate the cost of packing the input matrixes
   by avoiding on the go pack in the GEMM 5 loop.
   Packing of input matrixes are done when there is resue of
   matrixes across different GEMM calls.

AMD-Internal: [CPUPL-3560]
Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3
2023-10-04 06:43:59 -04:00
Edward Smyth
ccb8dd26fd Compiler warnings when using --int-size=32
Correct compiler warnings when building with configure --int-size=32
- bla_imatcopy.c: Cast ints to longs to match %ld format
  specification in error printf statement and change this to fprintf
  to stderr. Also copy this additional fprintf statement to other
  variants of this function.
- bli_type_defs.h: siz_t should always be the same size as a pointer.
  This corrects an issue in bli_malloc.c when casting from a pointer
  to a siz_t integer value.

AMD-Internal: [CPUPL-3519]
Change-Id: Ic87cd6142b8a6fed177b7c55bc0bb6013c5b69ab
2023-09-19 06:08:19 -04:00
Edward Smyth
15f9a747af Fix for non-x86 builds: bli_gemmt_sup_var1n2m.c
bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to
frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore
bli_gemmt_sup_var1n2m.c as of commit 10ca8710f0 as variant
for non-AMD codepath builds.

AMD-Internal: [CPUPL-3838]
Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f
2023-09-13 16:03:53 +05:30
Edward Smyth
0d16d952dc BLIS: DTL enhancements
Several improvements to BLIS DTL functionality
- For APIs that report performance statistics, test for time=0.0
  before dividing by time when calculating GFLOPS.
- Call AOCL_DTL_TRACE_EXIT in the parameter checking functions
  inlined from ./frame/compat/check/bla_*_check.h
- Correct flop count for complex routines.

AMD-Internal: [CPUPL-3736]
Change-Id: Icc515d88810dd79e66e22ea8c47d84649ca9f768
2023-09-11 08:42:11 -04:00
Edward Smyth
9d19effec5 Fix for non-x86 builds: cpuid query functions
Ensure functions bli_cpuid_query_id() and
bli_cpuid_query_model_id() are defined for all
architectures in bli_cpuid.c

AMD-Internal: [CPUPL-3838]
Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e
2023-09-11 04:19:14 -04:00
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Shubham Sharma
0000cc88de Removed local copy of cntx in TRSM
- TRSM and GEMM has different blocksizes in zen4, in order
  to accommodate this, a local copy of cntx was created in TRSM.
- Local copy of cntx has been removed and TRSM blocksizes are
  stored in cntx->trsmblkszs.
- Functions to override and restore default blocksizes for TRSM
  are removed. Instead of overriding the default blocksizes,
  TRSM blocksizes are stored separately in cntx.
- Pack buffers for TRSM have to be packed with TRSM blocksizes
  and GEMM pack buffers have to be packed with default blocksizes.
  To check if we are packing for TRSM, "family" argument is added
  in bli_packm_init_pack function.
- BLIS_GEMM_FOR_TRSM_UKR has to be used for TRSM if it is set, if
  it is not set then BLIS_GEMM_UKR has to be used. This functionality
  has been added to all TRSM macro kernels.
- Methods to retrieve TRSM blocksizes from cntx are added
  to bli_cntx.h.
- Tests for micro kernels are modified to accommodate the change in
  signature of bli_packm_init_pack.

AMD-Internal: [CPUPL-3781]

Change-Id: Ia567215d6d1aa0f14eae5d3177f4a3dd63b4b20a
2023-08-16 08:09:01 -04:00
Harihara Sudhan S
278ca71706 Fixes for GEMV Functionality Issues
- Added call to dsetv in dscalv. When DSCALV is invoked by
  DGEMV the SCAL function is expected to SET the vector to
  zero when alpha is 0. This change is done to ensure BLAS
  compatibility of DGEMV.
- Fixed bug in DGEMV var 1. Reverted changes in DGEMV var
  1 to remove packing and dispatch logic.
- CMAKE now builds with _amd files for unf_var2 of GEMV.

AMD-Internal: [CPUPL-3772]
Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade
2023-08-14 13:54:48 +05:30
Harihara Sudhan S
03fa660792 Optimized xGEMV for non-unit stride X vector
- In GEMV variant 1, the input matrix A is in row major. X vector
  has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
  operation inside the kernel has been ensured by packing the input X
  vector to a temporary buffer with unit stride. Currently, the
  packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
  In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
  is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
  to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
2023-08-08 01:01:22 -04:00
Harihara Sudhan S
3be43d264f Optimized xGEMV for non-unit stride Y vector
- In variant 2 of GEMV, A matrix is in column major. Y vector has
  to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
  GEMV operation inside the kernel has been ensured by packing the
  input Y vector to a temporary buffer with unit stride. As part of
  the packing Y is scaled by beta to reduce the number of times Y
  vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
  buffer are copied to the original buffer and the temporary one is
  released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
  to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
2023-08-07 08:12:44 -04:00
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Re-designed the new edge kernels that uses masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store

- Following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, m_left computation is handled
  with masked load-store instructions avoid overhead
  of conditional checks for edge cases.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
Vignesh Balasubramanian
758ec3b5ca ZGEMM optimizations for cases with k = 1
- Implemented bli_zgemm_4x4_avx2_k1_nn( ... ) kernel to replace
  bli_zgemm_4x6_avx2_k1_nn( ... ) kernel in the BLAS layer of
  ZGEMM. The kernel is built for handling the GEMM computation
  with inputs having k = 1, and the transpose values for A and
  B as N.

- The kernel dimension has been changed from 4x6 to 4x4,
  due to the following reasons :

  - The 1xNR block of B in the n-loop can be reused over multiple
    MRx1 blocks of A in the m-loop during computation. Similar
    analogy exists for the fringe cases.

  - Every 1xNR block of B was scaled with alpha and stored in
    registers before traversing in the m-dimension. Similar change
    was done for fringe cases in n-dimension.

  - These registers should not be modified during compute, hence
    the kernel dimension was changed from 4x6 to 4x4.

- The check for early exit(with regards to BLAS mandate) has been
  removed, since it is already present in the BLAS layer.

- The check for parallel ZGEMM has been moved post the redirection to
  this kernel, since the kernel is single-threaded.

- The bli_kernels_zen.h file was updated with the new kernel signature.

AMD-Internal: [CPUPL-3622]
Change-Id: Iaf03b00d5075dd74cc412290d77a401986ba0bea
2023-08-07 15:10:08 +05:30
Harihara Sudhan S
c97471dce0 Added AVX512 ZDSCALV kernel
- Added AVX512-based kernel for ZDSCAL. This will be dispatched from
  the BLAS layer for machines that have AVX512 flags.
- In AVX2 kernel for ZDSCALV, vectorized fringe compute using SSE
  instructions.
- Removed the negative incx handling checks from the blis_impli layer
  of ZDSCAL as BLAS expects early return for incx <= 0.

AMD-Internal: [CPUPL-3648]
Change-Id: I820808e3158036502b78b703f5f7faa799e5f7d9
2023-08-06 01:51:47 -04:00
Harihara Sudhan S
b126c9943b ZSCALV kernel optimization
- ZSCALV kernel now uses fmaddsub intrinsics instead of mul
  followed by addsub instrinsics.
- Removed the negative incx handling checks from the BLAS impli
  layer as BLAS expects early return for incx <= 0.
- Moved all exceptions in the kernel to the BLAS impli layer.

AMD-Internal: [SWLCSG-2224]
Change-Id: I03b968d21ca5128cb78ddcef5acfd5e579b22674
2023-08-04 06:57:18 -04:00
Shubham Sharma
9607f207da AOCL Dynamic tuning for DAXPYV
- Existing logic is not picking the ideal number
  of threads for some problem sizes.
- Problem size and their corresponding ideal number
  of threads are retuned for daxpy in aocl dynamic.

AMD-Internal: [CPUPL-3484]
Change-Id: Ice874ceef0a1815383f74f1a4b9677677b276af7
2023-08-01 10:34:04 +05:30
Shubham Sharma
954c97f858 Added NT in DTL logs for GEMMT, TRSM and NRM2
- Number of threads and gflops are added
  in the DTL logs for GEMMT, TRSM and NRM2

AMD-Internal: [CPUPL-2144]
Change-Id: If68887a5150bd0feda351180f379996497a1e678
2023-07-27 05:15:08 -04:00
Meghana Vankadari
79e174ff0a Level-3 triangular routines now use different block sizes and kernels.
Details:
    - Eliminated the need for override function in SUP for GEMMT/SYRK.
    - New set of block sizes, kernels and kernel preferences
      are added to cntx data structure for level-3 triangular routines.
    - Added supporting functions to set and get the above parameters from cntx.
    - Modified GEMMT/SYRK SUP code to use these new block sizes/kernels.
      In case they are not set, use the default block sizes/kernels of
      Level-3 SUP.

AMD-Internal: [CPUPL-3649]
Change-Id: Iee11bd4c4f1d8fbbb749c296258d1b8121c009a0
2023-07-26 01:26:11 -04:00
Harihara Sudhan S
ffbb0e83e5 ZGEMM optimization for cases when m = 1 or n = 1
- When n = 1 and A matrix is transposed ZGEMV row major variant is
  invoked.
- When m = 1 and B matrix is not transposed ZGEMV row major variant
  is invoked.
- This redirection happens before parallel ZGEMM check. This is done to
  avoid the unneccesary condition check. Any parallelization check is
  expected to happen in the invoked ZGEMV interface.

AMD-Internal: [CPUPL-2773]
Change-Id: I6b7b31db712edc682c089475d12e98730a960138
2023-06-30 04:54:42 -04:00