Commit Graph

3063 Commits

Author SHA1 Message Date
Vignesh Balasubramanian
84faccdd7d Enabling the vectorized path for SNRM2_
- Enabled the vectorized AVX-2 code-path for SNRM2_. The
  framework queries the architecture ID and calls the
  vectorized kernel based on the architecture support.

- In case of not having the architecture support, we use
  the default path based on the sumsqv method.

AMD-Internal: [CPUPL-3277]
Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f
2023-11-03 17:45:56 +05:30
Edward Smyth
d8b8f68066 Improvements to xerbla functionality (2)
Improvements to functionality introduced in commit
6d0444497f:
- Call to bli_init_auto() before calling PASTEBLACHK macro in
  gemv caused significant runtime overhead. Initialize stored
  info_value directly.
- Add similar code in frame/compat/f2c routines.

AMD-Internal: [CPUPL-3520]
Change-Id: I2df201aed7dbceb4cbe66d6c81b5a03e8092de89
2023-11-03 06:38:40 -04:00
Meghana Vankadari
f8f4343b55 Updated cntx with packA function pointer for AVX512_VNNI support
Details:
- Modified bench to support testing for sizes where matrix
  strides are larger than the corresponding dimensions.
- Modified early-return checks in all interface APIs to
  check validity of strides in relation to the corresponding
  dimension rather than checking if strides are equal to dimensions.

Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4
2023-11-03 04:50:00 -04:00
Vignesh Balasubramanian
ef545b928e Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
  present in the bli_gemv_unf_var2_amd.c file, as part of the
  bli_sgemv_unf_var2( ... ) function. This was changed to
  bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor
  from 6 to 5 ), in accordance to the function pointer present
  in the zen3 and zen4 context files.

- Changed the accumulator type to double from float, inside the
  fringe loop for unit-strides(vectorized path) and non-unit strides
  (scalar code).

AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
2023-11-03 01:37:43 -04:00
Harihara Sudhan S
106342f402 ZGEMV optimization for special cases in beta
- Avoiding scaling of y vector by beta when beta is 1.

AMD-Internal: [CPUPL-3829]
Change-Id: I9cf46f44c5f1c2da3653937ff035594b4046b4a1
2023-11-02 08:21:46 -04:00
mkadavil
d1844678f4 LPGEMM <u|s>8s8s16ou8 fixes for incorrect zero point addition.
-The zero point data type is different based on the downscale data
type. For int8_t downscale type, zero point type is int8_t whereas for
uint8_t downscale type, it is uint8_t. During downscale post-op, the
micro-kernels upscales the zero point from its data type (int8_t or
uint8_t) to that of the accumulation data type and then performs the
zero point addition. The accumulated output is then stored as downscaled
type in a later storage phase. For the <u|s>8s8s16 micro-kernels, the
upscaling to int16_t (accumulation type) is always performed assuming
the zero point is int8_t using the _mm256_cvtepi8_epi16 instruction.
However this will result in incorrect upscaled zero point values if the
downscale type is uint8_t and the associated zero point type is also
uint8_t. This issue is corrected by switching between the correct
upscale instruction based on the zero point type.

AMD-Internal: [SWLCSG-2500]
Change-Id: I92eed4aed686c447d29312836b9e551d6dd4b076
2023-11-02 01:30:48 -04:00
Nallani Bhaskar
b3391ef5da Updated ERF threshold and packa changes in bf16
Description:
    1. Updated ERF function threshold from 3.91920590400 to 3.553
       to match with the reference erf float implementation which
       reduced errors a the borders and also clipped the output
       to 1.0
    2. Updated packa function call with pack function ptr in bf16
       api to avoid compilation issues for non avx512bf16 archs

    3. Updated lpgemm bench

    [AMD-Internal: SWLCSG-2423 ]

Change-Id: Id432c0669521285e6e6a151739d9a72a7340381d
2023-10-29 23:55:46 +05:30
Edward Smyth
248dc2af9a Implement AOCL_ENABLE_INSTRUCTIONS environment variable
Add AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative
to BLIS_ARCH_TYPE. The details are:

1. AOCL_ENABLE_INSTRUCTIONS and BLIS_ARCH_TYPE env vars are both
   supported, with BLIS_ARCH_TYPE taking precedence if both are set.
2. Values of "avx2" and "avx512" are aliases for "zen3" and "zen4"
   code paths respectively in AMD focused builds, or for "skx" and
   "haswell" respectively in Intel focused builds. These names are
   not case-sensitive.
3. BLIS_ARCH_TYPE specifies the code path to use. If this is
   unsupported, e.g. zen4 code path on a Milan or earlier system,
   that code path is still executed, likely resulting in an illegal
   instruction error.
4. By contrast, AOCL_ENABLE_INSTRUCTIONS will check ISA support on
   the system (for AVX2 and AVX512), and try a "lower" ISA option if
   the desired one is not supported, i.e. AVX512->AVX2, AVX2->generic.
5. Appropriate messages are printed if BLIS_ARCH_DEBUG=1 is set.

AMD-Internal: [CPUPL-4105]
Change-Id: Ia941b41d4b7d11f5589d7c5e16f607618baed315
2023-10-27 14:59:33 -04:00
Edward Smyth
834bf604c1 Option to use shared library for BLIS tests
Current BLIS makefile always uses the static library on Linux for
all BLIS test programs. This commit adds the option to use the shared
library instead by specifying e.g.

make checkblis USE_SHARED=yes

Executables are generated in different sub-directories for static
and shared libraries.

AMD-Internal: [CPUPL-4107]
Change-Id: I3ab5d505cfbc5f6ef47aa28fcbb846c52d56c3f2
2023-10-27 11:32:00 -04:00
Shubham Sharma
d45d1d68c6 Reset ZMM Registers before exiting, in L3 APIs
- Register ZMM16 to ZMM31 are zeroed after L3 api calls.
- This change is done only for ZEN4 code path.
- bli_zero_zmm function is added which resets these registers.

AMD-Internal: [CPUPL-3882]
Change-Id: I7f16fde567c72ae6e9d5d6c6d5d167dd7d54a3b8
(cherry picked from commit d245ef5fb264cd1fcfa03c842ea97a436a26e7a2)
2023-10-27 00:51:04 -04:00
Harsh Dave
7bcb701b79 Fixed functionality failure for dgemm tiny kernel.
- For k > KC, C matrix is getting scaled by beta on each
iteration. It should be scaled only once. Fixed the scaling
of C matrix by beta in K loop.

- Corrected A and B matrix buffer offsets, for cases where k > KC.

AMD-Internal: [CPUPL-4078]
AMD-Internal: [CPUPL-4079]
AMD-Internal: [CPUPL-4081]
AMD-Internal: [CPUPL-4080]
AMD-Internal: [CPUPL-4087]
Change-Id: I27f426caf48e094fd75f1f719acb4ac37d9daeaa
2023-10-26 15:11:59 +05:30
Meghana Vankadari
ac3e8ff01b Bug fix and enhancements in bf16bf16f32obf16|f32
Details:
- Updated pack function call in ic loop to accept correct params.
- Modified documentation in bench file to reflect updated usage of
  bench for downscaled APIs.
- Modified memory allocation for C panel in BF16 APIs to use
  BLIS_BUFFER_FOR_GEN_USE while requesting for memory from pool.

Change-Id: Id624ed92ae7c8dafd7f6a32fc1554d2357de4df5
2023-10-25 23:28:31 +05:30
mkadavil
26d1ab5ebc <u|s>8s8s<16|32>os8 memory allocation fix to circumvent scaling issue.
-When bli_pba_acquire_m api is used for packbuf type BLIS_BUFFER_FOR_
<A_BLOCK|B_PANEL|C_PANEL>, the memory is allocated by checking out a
block from an internal memory pool. In order to ensure thread safety,
the memory pool checkout is protected using mutex (bli_pba_lock/
bli_pba_unlock). When the number of threads trying to checkout memory
(in parallel) are high, these locks tend to become a scaling bottleneck,
especially when the memory is to be used for non-packing purposes
(packing could hide some of this cost). LPGEMM uses bli_pba_acquire_m
with BLIS_BUFFER_FOR_C_PANEL to checkout memory when downscale is
enabled for temporary C accumulation. This multi-threaded lock overhead
becomes prominent when m/n dimensions are relatively small, even when k
is large. In order to address this, bli_pba_acquire_m is used with
BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE, the memory is
allocated using aligned malloc instead of checking out from memory pool.
Experiments have shown malloc costs to be far lower than memory pool
guarded by locks, especially for higher thread count.
-LPGEMM bench fixes for crash observed when benchmarking with post-ops
enabled and no downscale.

AMD-Internal: [SWLCSG-2354]
Change-Id: I4e92feadd2cf638bb26dd03b773556800a1a3d50
2023-10-23 10:00:32 -04:00
Edward Smyth
f5505be9f3 Merge commit 'e366665c' into amd-main
* commit 'e366665c':
  Fixed stale API calls to membrk API in gemmlike.
  Fixed bli_init.c compile-time error on OSX clang.
  Fixed configure breakage on OSX clang.
  Fixed one-time use property of bli_init() (#525).
  CREDITS file update.
  Added Graviton2 Neoverse N1 performance results.
  Remove unnecesary windows/zen2 directory.
  Add vzeroupper to Haswell microkernels. (#524)
  Fix Win64 AVX512 bug.
  Add comment about make checkblas on Windows
  CREDITS file update.
  Test installation in Travis CI
  Add symlink to blis.pc.in for out-of-tree builds
  Revert "Always run `make check`."
  Always run `make check`.
  Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script.   if the string contains zen and zen2, and zen need to be replaced with another string, then zen2   also be incorrectly replaced.
  Update POWER10.md
  Rework POWER10 sandbox
  Skip clearing temp microtile in gemmlike sandbox.
  Fix asm warning
  Sandbox header edits trigger full library rebuild.
  Add vhsubpd/vhsubpd.
  Fixed bugs in cpackm kernels, gemmlike code.
  Armv8A Rename Regs for Safe Darwin Compile
  Armv8A Rename Regs for Clang Compile: FP32 Part
  Armv8A Rename Regs for Clang Compile: FP64 Part
  Asm Flag Mingling for Darwin_Aarch64
  Added a new 'gemmlike' sandbox.
  Updated Fugaku (a64fx) performance results.
  Add explicit compiler check for Windows.
  Remove `rm-dupls` function in common.mk.
  Travis CI Revert Unnecessary Extras from 91d3636
  Adjust TravisCI
  Travis Support Arm SVE
  Added 512b SVE-based a64fx subconfig + SVE kernels.
  Replace bli_dlamch with something less archaic (#498)
  Allow clang for ThunderX2 config

AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
2023-10-18 09:09:54 -04:00
Arnav Sharma
c1612f6838 Gtestsuite Framework and Unit Tests for Pack and Compute Extension APIs
- Added framework for unit testing of BLAS and CBLAS interfaces for the
  Pack and Compute Extension APIs.
- These test the integrated functionality of the trio of
  ?gemm_pack_get_size(), ?gemm_pack() and ?gemm_compute() APIs.
- Note: Only MKL can be used as reference for now.

AMD-Internal: [CPUPL-3560]
Change-Id: I801654447a716da06c9ccf9db01d553817871571
2023-10-16 09:35:42 -04:00
Edward Smyth
6d0444497f Improvements to xerbla functionality
The following improvements have been implemented:
- Option to stop in xerbla on error. This is controlled by
  setting the environment variable BLIS_STOP_ON_ERROR=1
- Option to disable printing of error message from BLIS. This
  is controlled by setting the environment variable
  BLIS_PRINT_ON_ERROR=0
- Added a function to return the value of INFO passed to xerbla,
  assuming xerbla was not set to stop on error. Example call is

     info = bli_info_get_info_value();

The default behaviour remains to print but don't stop on error,
i.e. the equivalent to

     export BLIS_PRINT_ON_ERROR=1 BLIS_STOP_ON_ERROR=0

Implementation details:
- Values of the environment variables are stored and retrieved
  from global_rntm.
- Info value is stored and retrieved from tl_rntm. It is set
  to 0 during initialization for all calls and updated by xerbla
  if an error has occurred.
- Call to bli_init_auto before calling PASTEBLACHK macro (which
  calls xerbla) will reinitialize info_value to 0 via call to
  bli_thread_update_rntm_from_env

AMD-Internal: [CPUPL-3520]
Change-Id: I151f6de9b5a437c3a6e3fcf453d5b8fa9c579b9d
2023-10-16 08:48:51 -04:00
Arnav Sharma
c8f14edcf5 BLAS Extension API - ?gemm_compute()
- Added support for 2 new APIs:
	1. sgemm_compute()
	2. dgemm_compute()
  These are dependent on the ?gemm_pack_get_size() and ?gemm_pack()
  APIs.
- ?gemm_compute() takes the packed matrix buffer (represented by the
  packed matrix identifier) and performs the GEMM operation:
  C := A * B + beta * C.
- Whenever the kernel storage preference and the matrix storage
  scheme isn't matching, and the respective matrix being loaded isn't
  packed either, on-the-go packing has been enabled for such cases to
  pack that matrix.
- Note: If both the matrices are packed using the ?gemm_pack() API,
  it is the responsibility of the user to pack only one matrix with
  alpha scalar and the other with a unit scalar.
- Note: Support is presently limited to Single Thread only. Both, pack
  and compute APIs are forced to take n_threads=1.

AMD-Internal: [CPUPL-3560]
Change-Id: I825d98a0a5038d31668d2a4b84b3ccc204e6c158
2023-10-16 08:18:52 -04:00
Vignesh Balasubramanian
81161066e5 Multithreading the DNRM2 and DZNRM2 API
- Updated the bli_dnormfv_unb_var1( ... ) and
  bli_znormfv_unb_var1( ... ) function to support
  multithreaded calls to the respective computational
  kernels, if and when the OpenMP support is enabled.

- Added the logic to distribute the job among the threads such
  that only one thread has to deal with fringe case(if required).
  The remaining threads will execute only the AVX-2 code section
  of the computational kernel.

- Added reduction logic post parallel region, to handle overflow
  and/or underflow conditions as per the mandate. The reduction
  for both the APIs involve calling the vectorized kernel of
  dnormfv operation.

- Added changes to the kernel to have the scaling factors and
  thresholds prebroadcasted onto the registers, instead of
  broadcasting every time on a need basis.

- Non-unit stride cases are packed to be redirected to the
  vectorized implementation. In case the packing fails, the
  input is handled by the fringe case loop in the kernel.

- Added the SSE implementation in bli_dnorm2fv_unb_var1_avx2( ... )
  and bli_dznorm2fv_unb_var1_avx2( ... ) kernels, to handle fringe
  cases of size = 2 ( and ) size = 1 or non-unit strides respectively.

AMD-Internal: [CPUPL-3916][CPUPL-3633]
Change-Id: Ib9131568d4c048b7e5f2b82526145622a5e8f93d
2023-10-16 07:26:27 -04:00
Harsh Dave
7a4f84fbac Optimized dgemm for tiny input sizes.
- This commit focused on enhancing the performance of dgemm
for matrices for very small dimenstions.

- blis_dgemm_tiny function re-uses dgemm sup kernels, bypassing
the conventional SUP framework code path. As SUP framework code path
requires the creation and initilization of blis objects,
accessing all the needed meta-information from objects, querying contexts
which adds performance penaulty while computing for matrices with  very
small dimensions.

- To avoid such performance penaulty blis_dgemm_tiny function implements
a lightweight support code so that it can re-use dgemm SUP kernels such a way
that it directly operates on input buffers. It avoids framework overhead of
creating and intializing blis objects, context intialization, accessing other
large framework data structures.

- blis_dgemm_tiny function checks for threshold condition to match before
picking the kernel. For zen, zen2, zen3 architecture tiny kernel is invoked
for any shape as long as m < 8 and k <= 1500 or m < 1000 and n <= 24 and k <=1500.
While for zen4 as long as dimensions are less than 1500 for m,n,k tiny kernel is
invoked.

-blis_dgemm_tiny function supports single threaded computation as of now.

AMD-Internal: [CPUPL-3574]
Change-Id: Ife66d35b51add4fccbeebd29911e0c957e59a05f
2023-10-16 05:52:49 -04:00
Harsh Dave
edbbfd9a86 Optimized AVX512 DGEMM SUP edge kernels
- For edge kernels which handles the corner cases and specially
for cases where there is really small amount of computation to
be done, executing FMA efficiently becomes very crucial.

- In previous implementation, edge kernels were using same, limited
number of vector register to hold FMA result, which indirectly creates
dependency on previous FMA to complete before CPU can issue new FMA.

- This commit address this issue by using different vector registers
that are available at disposal to hold FMA result.

- That way we hold FMA results in two sets of vector registers, so that
sub-sequent FMA won't have to wait for previous FMA to complete.

- At the end of un-rolled K loop these two sets of vector registers are
added together to store correct result in intended vector registers.

- Following kernels are modified:
bli_dgemmsup_rv_zen4_asm_24x4m,
bli_dgemmsup_rv_zen4_asm_24x3m,
bli_dgemmsup_rv_zen4_asm_24x2m,
bli_dgemmsup_rv_zen4_asm_24x1m,
bli_dgemmsup_rv_zen4_asm_24x1,
bli_dgemmsup_rv_zen4_asm_16x1,
bli_dgemmsup_rv_zen4_asm_8x1,
bli_dgemmsup_rv_zen4_asm_24x2,
bli_dgemmsup_rv_zen4_asm_16x2,
bli_dgemmsup_rv_zen4_asm_8x2,
bli_dgemmsup_rv_zen4_asm_24x3,
bli_dgemmsup_rv_zen4_asm_16x3,
bli_dgemmsup_rv_zen4_asm_8x3,
bli_dgemmsup_rv_zen4_asm_16x4,
bli_dgemmsup_rv_zen4_asm_8x4,
bli_dgemmsup_rv_zen4_asm_16x5,
bli_dgemmsup_rv_zen4_asm_8x5,
bli_dgemmsup_rv_zen4_asm_16x6,
bli_dgemmsup_rv_zen4_asm_8x6,
bli_dgemmsup_rv_zen4_asm_8x7,
bli_dgemmsup_rv_zen4_asm_8x8

AMD-Internal: [CPUPL-3574]
Change-Id: I318ff8e2f075820bcc0505aa1c13d0679f73af44
2023-10-16 04:03:56 -04:00
Eleni Vlachopoulou
46459a958d Updating BLIS C++ interface trsm test.
- Making A diagonally dominant to ensure that the problem at hand is solvable.

AMD-Internal: [CPUPL-3575]
Change-Id: I27cc76a212d4d10aacce880895e1e0d7532e4eb7
2023-10-16 03:59:52 -04:00
Shubham Sharma
9a2a4151ac Added improved ZTRSM AVX2 kernels
- Added 2x6 ZGEMM row-preferred kernel.
  - Kernel supports prefetch_a, prefetch_b,
    prefetch_a_next and prefetch_b_next.
  - Multiple Ways to prefetch c are supported.
  - prefetch_a and prefetch_c are enabled by
    default.
  - K loop is divided into multiple subloops for
    better c prefetch.
- Added 2x6 ZTRSM row-preferred lower
  and upper kernels using AVX2 ISA.
- These kernels are used for ZTRSM only, zgemm
  still uses 3x4 kernel.
- Kernels support row/col/gen storage.
- Updated the zen3 and zen4 config to enable
  use of these kernels for TRSM in zen3 and
  zen4 path.
- Updated CMakeLists.txt with ZGEMM kernels for
  windows build.

AMD-Internal: [CPUPL-3781]

Change-Id: I236205f63a7f6b60bf1a5127a677d27425511e73
2023-10-13 07:43:33 -04:00
Harihara Sudhan S
105de694cf Optimized ZGEMV variant 1
- Added an explicit function definition for ZGEMV var 1. This
  removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
  for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
  instead of querying it.
- With this change fringe loop is vectorized using SSE
  instructions.

AMD-Internal:[CPUPL-3997]

Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
2023-10-13 05:03:53 -04:00
Meghana Vankadari
eb5ab3f762 LPGEMM: Added transB support for bf16bf16f32o<bf16|f32> APIs
Details:
- Modified aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
  to allow reordering from column major input matrix.
- Added new pack kernels that packs/reorders B matrix from
  column-major input format.
- Updated Early-return check conditions to account for trans
  parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c
2023-10-12 23:36:18 +05:30
mkadavil
ea0324ab95 Multi data type downscaling support for u8s8s16 - u8s8s16<u8|s8>
Downscaling is used when GEMM output is accumulated at a higher
precision and needs to be converted to a lower precision afterwards.
Currently the u8s8s16 flavor of api only supports downscaling to s8
(int8_t) via aocl_gemm_u8s8s16os8 after results are accumulated at
int16_t.
LPGEMM is modified to support downscaling to different data types,
like u8, s16, apart from s8. The framework (5 loop) passes the
downscale data type to the micro-kernels. Within the micro-kernel,
based on the downscale type, appropriate beta scaling and output
buffer store logic is executed. This support is only enabled for
u8s8s16 flavor of api's.
The LPGEMM bench is also modified to support passing downscale data
type for performance and accuracy testing.

AMD-Internal: [SWLCSG-2313]
Change-Id: I723d0802baf8649e5e41236b239880a6043bfd30
2023-10-12 09:19:56 -04:00
Vignesh Balasubramanian
a6a67fea2d ZAXPBYV optimizations for handling unit and non-unit strides
- Updated the bli_zaxpbyv_zen_int( ... ) kernel's computational
  logic. The kernel performs two different sets of compute based
  on the value of alpha, for both unit and non-unit strides. There
  are no constraints on beta scaling of the 'y' vector.

- Updated the logic to support 'x' conjugate in the computation.
  The kernel supports conjugate/no conjugate operation through the
  usage of _mm256_fmsubadd_pd( ... ) and _mm256_addsub_pd( ... )
  intrinsics.

- Updated the early return condition in the kernel to adhere to
  the standard compliance.

- Updated the scalar computation with vector computation(using 128
  bit registers), in case of dealing with a single element(fringe case)
  in unit-stride or vectors with non-unit strides. A single dcomplex
  element occupies 128 bits in memory, thereby providing scope for
  this optimization.

- Added accuracy and extreme value testing with sufficient sizes
  and initializations, to test the required main and fringe cases
  of the computation.

AMD-Internal: [CPUPL-3623]
Change-Id: I7ae918856e7aba49424162290f3e3d592c244826
2023-10-12 06:31:08 -04:00
Meghana Vankadari
3a71550bc3 Enabling SUP blocksizes & kernels for generic config
Details:
- pack and compute extension APIs derive blocksizes(MR, NR...) from
  SUP cntx.
- SUP blocksizes are not set for generic/skx configs. As a result pack
  and compute APIs cause floating point exceptions.
- To fix these issues, we have enabled non-zero SUP blocksizes for
  generic config and zen4 SUP blocksizes for skx config.
- However, these changes will not enable SUP path for skx/generic config
  as thresholds are set to zero.
- To enable SUP path for skx config, more work is needed like non-zero
  thresholds and modifications to build system.

Change-Id: I54483ab0c196845ca175b8cb8deeb9e9ac2a42b9
2023-10-12 05:27:10 -04:00
bhaskarn
5fd24c27a7 Updated expf max min precission fix nan issue in Tanh
Description:
The expf_max and expf_min have more precission than
the computation which is leading to corss the clipping at
the edge case which is causing nan's in the tanh output.

Updated the thresholds to less precission to clip the
edge cases to avoid nan's in the tanh output.

AMD-Internal: [SWLCSG-2423 ]
Change-Id: I25a665475692f47443f30ca5dd09e8e06a0bfe29
2023-10-12 01:04:59 -04:00
Meghana Vankadari
4874895a68 LPGEMM: Added transA support for bf16bf16f32o<bf16|f32> APIs
Details:
- Added new params(order, trans) to aocl_get_reorder_buf_size_ and
  aocl_reorder_ APIs.
- Added new pack kernels that packs A matrix from either row-major or
  column major input matrix to pack buffer with row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of A matrix is handled by packing A matrix to row-major
  format during run-time.
- Updated Early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.

AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4
2023-10-11 07:16:08 -04:00
Shubham Sharma
25bab76f58 Changed threshold to use DTRSM Small MT
- Threshold to use DTRSM small MT code
  path is lowered.

AMD-Internal: [CPUPL-3781]
Change-Id: Ie1f232aa6d216b839df23657b54edb0448a64267
2023-10-11 01:33:00 -04:00
Chandrashekara K R
6132194468 Updated "HEADER_PATH" in main CMakeLists.txt file.
- Updated "HEADER_PATH" cmake variable to make sure blis header
  file is not missing function declarations on Windows.

AMD-Internal: [CPUPL-3881]
Change-Id: Id71ec16c800411cd727fc78e3f772ea1b751f971
2023-10-10 08:44:07 -04:00
Vignesh Balasubramanian
9828039030 Bugfix : Inversion of sign bit with early return in SNRM2_
- The bli_snormfv_unb_var1( ... ) function returns early in
  case of n = 1, and uses the blis macro bli_fabs( ... ) to
  set the norm to the absolute value of the element.

- This macro inverts the sign bit even if the element is 0.0.
  A check is added to re-invert the sign bit in this case, so
  that the norm is set to 0.0 instead of -0.0.

- Added the same early exit condition on bli_dnormfv_unb_var1( ... )
  when n = 1.

AMD-Internal: [CPUPL-3923]
Change-Id: If7f5ae41d2acfe89b505549d28215dde319d8c33
2023-10-10 04:21:09 -04:00
mkadavil
c3b97559c1 Zero Point support for <u|s>8s8s<32|16>os8 LPGEMM APIs
-Downscaled / quantized value is calculated using the formula
x' = (x / scale_factor) + zero_point. As it stands, the micro-kernels
for these APIs only support scaling.
Zero point addition is implemented as part of this commit, with it
being fused as part of the downscale post-op in the micro-kernel. The
zero point input is a vector of int8 values, and currently only vector
based zero point addition is supported.
-Bench enhancements to test/benchmark zero point addition.

AMD-Internal: [SWLCSG-2332]
Change-Id: I96b4b1e5a384a4683b50ca310dcfb63debb1ebea
2023-10-10 12:05:47 +05:30
Edward Smyth
85f2bf6c4a Fix for x86_64 builds
Configuration x86_64 includes all Intel and AMD sub-configurations.
Fixes to enable this to work correctly again are:
- In config_registry use amdzen rather than amd64 in x86_64 family.
- Copy settings from config/amdzen/bli_family_amdzen.h to
  config/x86_64/bli_family_x86_64.h
- Modify configure to set enable_aocl_zen=yes for x86_64, but not
  for amd64_legacy.
- Add "if defined(BLIS_FAMILY_X86_64)" to frame/3/bli_l3_sup.c and
  frame/3/bli_l3_sup_int_amd.c so zen-specific code paths are
  enabled.

Note: sub-configurations knl and bulldozer use instructions that are
not supported on most x86_64 processors.

AMD-Internal: [CPUPL-3838]
Change-Id: I0bd8fd89ccd846f80e5491ef44ade7d409970b04
2023-10-09 07:24:21 -04:00
jagar
5d578684ea GtestSuite: Update in source code to make it compatible on MSVC(windows)
AMD-Internal: [CPUPL-2732]
Change-Id: Ifd9372bf9b0f00c2bf24442ea8519bfcf4e5db5b
2023-10-09 04:43:29 -04:00
jagar
712a84d50f Gtestsuite: Update in cmake to search reflib in given path
AMD-Internal: [CPUPL-2732]
Change-Id: Ide2b98a95f81f394c7c01cc3a3b5ae6fa0403a82
2023-10-05 05:39:27 -04:00
eashdash
30bdeecbcc Added BLAS Extension APIs - Get Size and Pack API
1. 4 new APIs are added to support packed compute GEMM operations
   1. dgemm_pack_get_size
   2. sgemm_pack_get_size
   3. dgemm_pack
   4. sgemm_pack

2. Pack_get_size API
   1. Returns size in bytes required for packing of input
   2. Requires identifier to identify the input matrix to
      be packed
   3. Additionally requires 3 integer parameters for input
      dimensions

3. Packed buffer is allocated using the pack size computed

4. Pack API:
   1. Performs full matrix packing of the input
   2. Additionally, performs the alpha scaling
   3. Packed buffer created contains the full packed matrix

5. The GEMM compute calls are required to be operated on the
   packed buffer with alpha = 1 since alpha scaling is already
   done by the Pack API

6. GEMM Pack API eliminate the cost of packing the input matrixes
   by avoiding on the go pack in the GEMM 5 loop.
   Packing of input matrixes are done when there is resue of
   matrixes across different GEMM calls.

AMD-Internal: [CPUPL-3560]
Change-Id: Ieeb5df2d2f3b10ebf2d00dab6f455cf64a047de3
2023-10-04 06:43:59 -04:00
Edward Smyth
24e4d58f92 Tidy zen bli_cntx_init and bli_family files
Tidy formatting of config/*zen*/bli_cntx_init_zen*.c and
config/*zen*/bli_family_*.c files to make them more
consistent with each other and improve readability.

AMD-Internal: [CPUPL-3519]
Change-Id: I32c2bf6dc8365264a748a401cf3c83be4976f73b
2023-10-04 05:14:39 -04:00
Harsh Dave
df80f40ccd Fixed incorrect ymm registers usage in FMA operation.
- Incorrect ymm registers were used in dgemm SUP edge kernel,
    while computing FMA operation.

- Due to incorrect vector register, it resulted into incorrect result.

- Corrected vector registers usage for FMA operation.

AMD-Internal: [CPUPL-3964]

Change-Id: I37fcb5f8eeb5945fe994d8a5b69815a3bcca87df
2023-10-02 03:20:44 -04:00
Arnav Sharma
f0416cff08 SGEMM SUP Panel Stride Bug Fix
- The AVX512 SGEMM SUP rv m and n kernels did not accomodate for the
  use of panel strides in case of packed matrices, thus resulting in
  incorrect matrix strides when packing was explicitly enabled using
  BLIS_PACK_A=1, BLIS_PACK_B=1 or both.
- The kernels are updated to use panel strides for traversing both A
  and B matrix buffers accurately.

[AMD-Internal]: CPUPL-3673
Change-Id: I4341ed7e1e1419cc3e2063b06f278edcb9145adb
2023-09-27 03:02:24 -04:00
Kiran Varaganti
db4fbfe9a6 Fix compiler error for "inline" functions in LPGEMM bench Application
Functions which are declared as "inline" may trigger compiler error "undefined function"
    This linker error is eliminated by use "static" before "inline".
    Therefore added "static" before all inline functions.

Change-Id: I5952fb71112fc4792011c3e29be930ccfbce4562
2023-09-27 02:26:23 -04:00
jagar
29711dd5a3 Gtestsuite: Updated testings_basics.* to print matrix/vector name
AMD-Internal: [CPUPL-2732]
Change-Id: I89b4ffc97ea852e66f42b82058af67c16144fbf6
2023-09-26 08:27:19 -04:00
jagar
f96e20b894 Update in CMakeLists.txt to install on windows
Updated CMakeLists.txt to copy library and headers into
folder mentioned during cmake configuration.
Steps to install
1. cmake .. -G  ........ -DCMAKE_INSTALL_PREFIX=path_to_install
2. cmake --build . --config Release
3. cmake --install .  (install lib and headers)

Change-Id: Ic2728209a2e1d181cc92bab08b82a748bec583d4
2023-09-26 07:56:03 -04:00
Harsh Dave
e437469a99 Optimized AVX2 DGEMM SUP edge kernels
- For edge kernels which handles the corner cases and specially
for cases where there is really small amount of computation to
be done, executing FMA efficiently becomes very crucial.

- In previous implementation, edge kernels were using same, limited
number of vector register to hold FMA result, which indirectly creates
dependency on previous FMA to complete before CPU can issue new FMA.

- This commit address this issue by using different vector registers
that are available at disposal to hold FMA result.

- That way we hold FMA results in two sets of vector registers, so that
sub-sequent FMA won't have to wait for previous FMA to complete.

- At the end of un-rolled K loop these two sets of vector registers are
added together to store correct result in intended vector registers.

AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54
2023-09-22 03:43:47 -04:00
Edward Smyth
ccb8dd26fd Compiler warnings when using --int-size=32
Correct compiler warnings when building with configure --int-size=32
- bla_imatcopy.c: Cast ints to longs to match %ld format
  specification in error printf statement and change this to fprintf
  to stderr. Also copy this additional fprintf statement to other
  variants of this function.
- bli_type_defs.h: siz_t should always be the same size as a pointer.
  This corrects an issue in bli_malloc.c when casting from a pointer
  to a siz_t integer value.

AMD-Internal: [CPUPL-3519]
Change-Id: Ic87cd6142b8a6fed177b7c55bc0bb6013c5b69ab
2023-09-19 06:08:19 -04:00
Edward Smyth
15f9a747af Fix for non-x86 builds: bli_gemmt_sup_var1n2m.c
bli_gemmt_sup_var1n2m.c contained x86 specific code. Move to
frame/3/gemmt/bli_gemmt_sup_var1n2m_amd.c and restore
bli_gemmt_sup_var1n2m.c as of commit 10ca8710f0 as variant
for non-AMD codepath builds.

AMD-Internal: [CPUPL-3838]
Change-Id: I88db20b93b2dbcbbf5092a4cb78f14dd1179975f
2023-09-13 16:03:53 +05:30
Edward Smyth
0d16d952dc BLIS: DTL enhancements
Several improvements to BLIS DTL functionality
- For APIs that report performance statistics, test for time=0.0
  before dividing by time when calculating GFLOPS.
- Call AOCL_DTL_TRACE_EXIT in the parameter checking functions
  inlined from ./frame/compat/check/bla_*_check.h
- Correct flop count for complex routines.

AMD-Internal: [CPUPL-3736]
Change-Id: Icc515d88810dd79e66e22ea8c47d84649ca9f768
2023-09-11 08:42:11 -04:00
orequest
09e34fd2bd Added optimised CGEMM function pointers in zen4 cntx
1. Two CGEMM function pointers are added for different storage schemes
   1. bli_cgemmsup_rv_zen_asm_3x8m
   2. bli_cgemmsup_rv_zen_asm_3x8n

2. In previous commit:
  (Level-3 triangular routines now use different block sizes and kernels
   Commit Id: 79e174ff0a)

   1. bli_cntx_set_l3_sup_tri_kers cntx function was created
   2. Function holds optimised function pointers for GEMMT/SYRK API's
   3. It avoids over riding default block sizes which improves the
      performance
   4. This function did not include optimised CGEMM function pointers
      leading to regression as reference kernels were invoked

3. With this commit, 2 optimized CGEMM function pointers are added in
   bli_cntx_set_l3_sup_tri_kers
   1. This fixes the regression as optimized CGEMM functions are invoked

AMD-Internal: [CPUPL-3831] [CPUPL-3830]

Change-Id: Ie8b41a5e62439de2a65e7df0b07d63ee2383e51e
2023-09-11 06:38:31 -04:00
Edward Smyth
9d19effec5 Fix for non-x86 builds: cpuid query functions
Ensure functions bli_cpuid_query_id() and
bli_cpuid_query_model_id() are defined for all
architectures in bli_cpuid.c

AMD-Internal: [CPUPL-3838]
Change-Id: I7b0582a4d63d9f28076761749cf5c24d87316f3e
2023-09-11 04:19:14 -04:00
Vignesh Balasubramanian
32104c400c GTestSuite : Designing test cases for ZGEMM
- Designed test cases for unit testing of ZGEMM compute
  kernel for handling inputs when k == 1. The design
  uses value-parameterized testing for checking accuracy,
  and verifying the mandate in case of exception values
  on the inputs/output.

- The design uses type-parameterized testing for verifying
  BLAS standard for invalid input cases, and also for early
  return scenarios.

- Added the function template set_ev_mat( ... ) as part of
  testinghelpers. This function is used as a helper for
  inducing exception values onto indices specified as
  arguments to the test_gemm( ... ) interface.

- Abstracted the function definition of getValueString( ... )
  from the NRM2 testing interface to testinghelpers(renamed
  as get_value_string( ... ) for naming consistency), in order
  to use it as a helper function across all APIs in case of
  exception value testing.

AMD-Internal: [CPUPL-3823]
Change-Id: I0fea21f9c8759bbbdc88ba0a016202753e28f2a7
2023-09-08 17:36:57 +05:30