Commit Graph

1457 Commits

Author SHA1 Message Date
Edward Smyth
1f0fb05277 Code cleanup: Copyright notices (2)
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.

AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
2025-02-07 05:41:44 -05:00
Shubham Sharma
f8c83fedb6 Added new ZTRSM small code path for ZEN5
- Added new ZTRSM kernels for right and left variants.
- Kernel dimensions are 12x4.
- 12x4 ZGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- These kernels do not support conjugate transpose.
- Only column major inputs are supported.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6356]
Change-Id: I33ba3d337b0fcd972ca9cfe4668cb23d2b279b6e
2025-02-06 18:01:10 +05:30
Hari Govind S
b6e9cde317 Revert "Optimisation of AOCL-dynamic for dotv API"
This reverts commit a028108cbb.

Reason for revert: With libgomp, scalability issues were observed with a
                   higher number of threads, leading to the use of fewer
                   threads. However, with different OpenMP libraries
                   like libomp, this scalability issue was not observed,
                   and using fewer threads resulted in performance loss.
                   The AOCL dynamic logic has been updated to select a
                   higher number of threads, considering the iomp OpenMP
                   library.

Change-Id: I2432b715eff01fc99b2c0f8b60bdecfaf5a6568f
2025-02-05 06:33:06 -05:00
Hari Govind S
fe73445813 Introduced fast-path in DCOPYV API and fix compiler warning for AXPYV
- Added a conditional check to invoke the vectorized
  DCOPYV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal for
  single-threaded execution. Thus, we avoid the call to
  bli_nthreads_l1() function to set the ideal number of threads.

- Used macros to protect the declaration of fast_path_thresh in
  DAXPYV API to avoid compiler warnings.

AMD-Internal: [CPUPL-4875][CPUPL-5895]
Change-Id: Id4141cd22e2382ece9e36fc02934bf6c11bd02cb
2025-02-05 04:41:55 -05:00
Shubham Sharma
bac0fed3cf Fixed Bug in Dynamic Blocksizes
- Mixed precision datatypes use a modified cntx.
- For some variants of mixed precision, complex and real blocksizes
  are needed to be same. This is achieved by creating a local copy of
  cntx and copying complex blocksizes onto real blocksizes.
- By using the dynamic blocksizes, the changes made to the
  blocksizes for mixed precision are overwritten by changes made
  by dynamic blocksizes.
- This mismatch between complex and real blocksizes is causing a issue
  where the pack buffer is allocated based on complex blocksizes but
  amount of data packed is based on real blocksizes.
- This makes the pack buffer sizes smaller than the required sizes.
- To fix this, dynamic blocksizes are disabled for mixed precision.

AMD-Internal: [CPUPL-6384]
Change-Id: Ib9792f90b4ea113e54059a0da8fb4241622b5f83
2025-02-05 01:09:24 -05:00
Hari Govind S
3d2653f1ab DDOTV Optimization for ZEN3 Architecture
- Reduced the blocking size of 'bli_ddotv_zen_int10'
  kernel from 40 elements to 20 elements for better
  utilization of vector registers

- Replaced redundant 'for' loops in 'bli_ddotv_zen_int10'
  kernel with 'if' conditions to handle reminder
  iterations. As only a single iteration is used when
  reminder is less than the primary unroll factor.

- Added a conditional check to invoke the vectorized
  DDOTV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal
  for single-threaded execution. Thus, we avoid the
  call to bli_nthreads_l1() function to set the ideal
  number of threads.

- Updated getestsuite ukr tests for 'bli_ddotv_zen_int10'
  kernel.

AMD-Internal: [CPUPL-4877]
Change-Id: If43f0fcff1c5b1563ad233005717398b5b6fb8f2
2025-02-04 06:01:04 -05:00
Shubham Sharma
26bd265cfd Optimized DTRSV for tiny sizes
- Replaced switch case with if else, lookup table for switch case
  is palced at the end of .text section which causes a huge jump.
- Reduced number of branches for tiny sizes.
- Cpuid query is slow, therefore added a new if statement which avoids cpuid
  query for tiny sizes(<200).
- Redirected tiny sizes to AVX2 kernel.

AMD-Internal: [CPUPL-5407]
Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c
2025-01-30 01:23:09 -05:00
harsh dave
c5e842e8d3 Revert changes to force dgemm inputs under threshold to arch-specific kernels in single-threaded mode
- This patch reverts the previous changes that removed the enforcement
  of dgemm inputs under a certain threshold to be processed by kernels
  selected based on architecture ID and handled in single-threaded mode.

- This change is now forcing such small inputs to be computed in tiny
  path. Previously when this check was not there, it was routing these
  inputs to SUP path and causing performance regression due to framework
  overhead.

AMD-Internal: [CPUPL-5927]
Change-Id: I4a4b21fdcf7c3ffaa09efa46ba12798eca0f10bb
2025-01-29 04:22:16 -05:00
Shubham Sharma
9fd2aebd25 Tuned DTRSM thresholds for ZEN5
- Tuned AOCL_dynamic thresholds for DTRSM for ZEN5.
- Tuned thresholds for better selection of code path.

AMD-Internal: [CPUPL-5408]
Change-Id: Ic40b5c8d276c8ce8399fd49ce0d0569f79ec98be
2025-01-28 07:33:10 -05:00
Vignesh Balasubramanian
fb6dcc4edb Support for Tiny-GEMM interface(ZGEMM)
- As part of AOCL-BLAS, there exists a set of vectorized
  SUP kernels for GEMM, that are performant when invoked
  in a bare-metal fashion.

- Designed a macro-based interface for handling tiny
  sizes in GEMM, that would utilize there kernels. This
  is currently instantiated for 'Z' datatype(double-precision
  complex).

- Design breakdown :
  - Tiny path requires the usage of AVX2 and/or AVX512
    SUP kernels, based on the micro-architecture. The
    decision logic for invoking tiny-path is specific
    to the micro-architecture. These thresholds are defined
    in their respective configuration directories(header files).

  - List of AVX2/AVX512 SUP kernels(lookup table), and their
    lookup functions are defined in the base-architecture from
    which the support starts. Since we need to support backward
    compatibility when defining the lookup table/functions, they
    are present in the kernels folder(base-architecture).

- Defined a new type to be used to create the lookup table and its
  entries. This type holds the kernel pointer, blocking dimensions
  and the storage preference.

- This design would only require the appropriate thresholds and
  the associated lookup table to be defined for the other datatypes
  and micro-architecture support. Thus, is it extensible.

- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
         kernels. Thus, the blocking in framework is done accordingly.
         In case of adding the support for n-var, the variant
         information could be encoded in the object definition.

- Added test-cases to validate the interface for functionality(API
  level tests). Also added exception value tests, which have been
  disabled due to the SUP kernel optimizations.

AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
2025-01-24 12:59:26 -05:00
Hari Govind S
349fc47ec5 DGEMV Optimizations for TRANSPOSE Cases
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
2025-01-24 00:38:34 -05:00
Arnav Sharma
66461b8df3 Improved Multi-threaded Performance of DSCALV
- Added AOCL_DYNAMIC thresholds for DSCALV for Zen4 and Zen5
  architectures, since earlier they were using the Zen thresholds.
- Also updated ST_THRESH for Zen4 and Zen5 to avoid the OpenMP overheads
  incurred when the single-threaded path is optimally performant.

AMD-Internal: [CPUPL-5934]
Change-Id: I2d89cf5392516206fab83b672498fb8d98a5b033
2025-01-22 03:55:38 -05:00
Edward Smyth
3c2dedb13c Restore or add update from env in bli_thread_get APIs
We want bli_thread_get_num_threads() and bli_thread_get_*_nt()
to report the threading values modified to reflect what will
be in effect given OpenMP nesting and active levels. This was
lost in commit 0c6d006225 for
bli_thread_get_num_threads() and wasn't previously implemented
in bli_thread_get_*_nt()

AMD-Internal: [CPUPL-6168]
Change-Id: Ife2d281546d2f79fc17cd712e574f29b06c30ccd
2025-01-20 08:58:22 -05:00
Vignesh Balasubramanian
8e660215c3 Introduced fast-path in DAXPYV API
- Added a conditional check to invoke the vectorized
  DAXPYV kernels directly(fast-path), without incurring
  any additional framework overhead.

- The fast-path is taken when the input size is ideal for
  single-threaded execution. Thus, we avoid the call to
  bli_nthreads_l1() function to set the ideal number of threads.

AMD-Internal: [CPUPL-4878]
Change-Id: I001fd1b8bbd2d691ecb3e2423ec7998e130850bb
2025-01-10 09:19:38 -05:00
Vignesh Balasubramanian
345204d69b Additional updates to the thresholds for ZGEMM small path
- Further updated the thresholds for entry to ZGEMM small
  path(AVX2), when the execution is mulithreaded. The newer
  thresholds account for more skinnier inputs, compatible with
  single-threaded small path, as opposed to multithreaded
  SUP path.

AMD-Internal: [CPUPL-6040][CPUPL-5930]
Change-Id: I333f97d8af49733310e4ae48b12baba15ef828d6
2025-01-10 08:29:31 -05:00
Edward Smyth
567039a7fe Fortran interfaces for bli_thread_get APIs
Create and export Fortran interfaces for bli_thread_get_num_threads()
and bli_thread_get_{jc,pc,ic,jr,ir}_nt() APIs.

bli_thread_get_is_parallel() is intended for internal BLIS usage, so
not adding a Fortran interfaces for it at this time.

AMD-Internal: [CPUPL-6168]
Change-Id: Ieba2537e5455cc289536aec3de5d4b5866e607f1
2025-01-10 05:07:33 -05:00
harsh dave
aeda581539 Fix: Rename bli_tiny_gemm.c to bli_tiny_gemm_amd.c
When compiling with config generic (or any non-zen build),
the bli_dgemm_tiny_6x8 kernel is not defined. Since bli_dgemm_tiny()
is only used within amd specific file, bli_tiny_gemm.c has been renamed
to bli_tiny_gemm_amd.c to reflect its specific usage.

Thanks to Smyth, Edward<edward.smyth@amd.com> for identifying and helping to fix the issue.

Change-Id: If5d134aeba6d30d0a51e6d7d6fa9b3c4450a3307
2025-01-07 23:29:32 -05:00
Vignesh Balasubramanian
cdaa2ac7fd Bugfix and optimizations for AVX512 AMAXV micro-kernels
- Bug : The current {S/D}AMAXV AVX512 kernels produced an
  incorrect functionality with multiple absolute maximums.
  They returned the last index when having multiple occurences,
  instead of the first one.

- Implemented a bug-fix to handle this issue on these AVX512
  kernels. Also ensured that the kernels are compliant with
  the standard when handling exception values.

- Further optimized the code by decoupling the logic to find
  the maximum element and its search space for index. This way,
  we use lesser latency instructions to compute the maximum
  first.

- Updated the unit-tests, exception value tests and early return
  tests for the API to ensure code-coverage.

AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
2025-01-07 22:56:20 +05:30
Vignesh Balasubramanian
f548f42607 Fixing compiler warnings on ZGEMM
- Scoped some of the variables used in zgemm_blis_impl()
  when determining the thresholds to small path. These
  variables will be used only when the architecture is
  ZEN5 or ZEN4.

AMD-Internal: [CPUPL-5895]
Change-Id: I6f90856f34454423ac777e33c74fe5ec6bb94e13
2025-01-07 10:59:43 +05:30
Edward Smyth
4ce708c316 Move some BLAS extension APIs to extra subdirectories
In preparation for merging next group of changes from upstream BLIS,
move some BLAS extension APIs to new extra subdirectories in
frame/compat and frame/compat/cblas/src. Other extension APIs will
be moved in later commits.

Some tidying up to better match upstream BLIS code has also been done.

AMD-Internal: [CPUPL-2698]
Change-Id: I0780a775d37242fba562c3f13666da0ad2b2cdfb
2024-12-17 04:54:39 -05:00
Hari Govind S
a028108cbb Optimisation of AOCL-dynamic for dotv API
-  Adding AOCL-dynamic logic for dotv on zen4
   and zen5 architecture.

AMD-Internal: [CPUPL-4877]
Change-Id: I421ad5e48cf001d1fc5c13e9f1cbbf4db0bf5f37
2024-12-16 22:09:59 -05:00
Edward Smyth
0c6d006225 Changes to rntm to reduce mutex operations
Change usage of global_rntm and tl_rntm to elimate need
for mutex operations when accessing global_rntm. Usage of
these data structures is now as follows:
* global_rntm is set once during bli_init_apis and includes
  all getenv calls to check BLIS threading and error printing
  environment variables. global_rntm is then read-only.
* tl_rntm is intialized once from global_rntm on each
  application thread. Any calls to BLIS set threading/ways
  APIs will update tl_rntm for that application thread only
  (Previously they updated global_rntm for all application threads).
* Re-initialize info_value in tl_rntm in every call to bli_init APIs.
* In bli_rntm_init_from_global() we initialize the local (per API
  call) rntm as a copy of tl_rntm and then update threading values
  in bli_thread_update_rntm_from_env() to reflect the current status
  of OpenMP runtime ICVs.

AMD-Internal: [CPUPL-6168][SWLCSG-3143]
Change-Id: Ib9387ee2b51f507ed08cc38267057109acea14a6
2024-12-16 04:45:26 -05:00
Shubham Sharma
beaea1b88f Added new DTRSM small code path for ZEN5
- Added new DTRSM kernels for right  and left variants.
- Kernel dimensions are 24x8.
- 24x8 DGEMM SUP kernels are used internally
  for solving GEMM subproblem.
- Tuned thresholds to pick efficent code path for ZEN5.

AMD-Internal: [CPUPL-6016]
Change-Id: I743d6dc47717952c2913085c0db3454ae9d046db
2024-12-16 10:38:45 +05:30
Vignesh Balasubramanian
609af9bfe2 Threshold tuning for ZGEMM small path
- Updated the threshold check for ZGEMM small path to include
  runtime checks for redirection, specific to the micro-architecture.

- The current ZGEMM small path has only its AVX2 variant available.
  Post implementing an AVX512(same/different algorithm), the thresholds
  will further be fine-tuned.

- Included the dot-product based AVX512 ZGEMM kernels in the ZEN5
  context. It will be used as part of handling RRC and CRC storage
  schemes of C, A and B matrices in both single-thread and multi-thread
  runs.

AMD-Internal: [CPUPL-5949]
Change-Id: Ic8b7cf0e00b7c477f748669f160c4b01df995c75
2024-12-13 12:51:22 -05:00
harsdave
54b46ec1ed Enhance 24x8 DGEMM SUP/Tiny Kernel Performance with Optimized Loops and Edge Kernels
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:

1. **IR Loop Optimization:**
   - The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
     with `begin_asm` and `end_asm` calls, resulting in more efficient execution.

2. **JR Loop Integration:**
   - The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
     of stack frame management for each JR iteration, thereby enhancing loop performance.

3. **Kernel Decomposition Strategy:**
   - The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
   - For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.

1. **Interleaved Scaling by Alpha:**
   - Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
     and reduce latency.

2. **Efficient Mask Preparation:**
   - Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
     minimizing unnecessary overhead.

3. **Broadcast Instruction Optimization:**
   - In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
     the broadcast instruction is replaced with `mem_1to8`.
   - This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
     dependency chains and improving execution efficiency.

4. **C Matrix Update Optimization:**
   - During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
     This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
     performance bottlenecks and enhancing throughput.

These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.

This patch also involves changes for tiny gemm interface. A light
interface for calling kernels and removing calls to avx2 dgemm kernels
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.

For zen4 and zen5 when A matrix transposed(CRC, RRC), tiny kernel does not have
the support to handle such inputs and thus such inputs are routed to
gemm_small path.

AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
2024-12-13 00:03:00 -05:00
Edward Smyth
5c946e10dc Always define function bli_nthreads_optimum
bli_nthreads_optimum is exported and called directly by AOCL libFLAME,
however it was only defined if building a multithreaded BLIS library
with AOCL_DYNAMIC enabled. Change to always define this function. If
BLIS is serial or if AOCL_DYNAMIC is disabled, this function returns
without modifying the supplied rntm.

Change-Id: Ie65690e9e6ec2a8ea77b3778f96676a68e6260be
2024-12-12 12:43:43 -05:00
Arnav Sharma
25e59fcbb9 DGEMV Optimizations for NO_TRANSPOSE Cases
- AVX512 specific DGEMV native kernels are added for Zen4/5
  architectures to handle the NO_TRANSPOSE cases and are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
  m-dimension and for this kernel beta scaling is handled beforehand
  within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
  using fused kernel, namely AXPYF, to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
2024-12-12 10:26:50 -05:00
Vignesh Balasubramanian
da6e9defcb Dynamic selection of AVX2 or AVX512 DNRM2 kernels
- Added a kernel selection logic based on the input
  dimension(runtime parameter), to choose between
  deploying AVX2 or AVX512 computational kernel for
  single-thread execution.

- An empirical analysis was conducted to arrive at the
  thresholds, for ZEN4 and ZEN5 architectures.

- Updated the fast-path threshold for ZEN4 to be in hand
  with the tipping points of its dynamic thread-setter(used
  when AOCL_DYNAMIC is enabled).

AMD-Internal: [CPUPL-5937]
Change-Id: I96d7f167658c9e25a0098c4c67e12e4ba673e228
2024-12-10 10:53:54 +05:30
Shubham Sharma.
be6fbadd95 BlockSize Tuning for ZEN4 and ZEN5
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.

AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
2024-11-29 03:19:16 -05:00
Shubham Sharma
f2320a1fef Enabled DGEMM row major kernel for ZEN4
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
  kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
  is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.

AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
2024-11-29 08:18:48 +00:00
Shubham Sharma
266bd32dea Enable fringe case handling in DGEMM ZEN5 macro kernel
- Generic kernel is used if N is not multiple of NR
  or M is not multiple of MR.
- This limit the maximum values of NR that can be used.
- Support for fringe case handling is added in DGEMM
  macro kernel so that macro kernel can be used for
  all problem sizes.

AMD-Internal: [CPUPL-5912]
Change-Id: I85c17e91d7511bb35ffed0f346d6ff0376baf62f
2024-11-29 00:22:33 -05:00
Kiran Varaganti
3e2795f406 OpenMP barrier overhead bug fix
In the function bli_thread_update_rntm_from_env()mutex is used for reading global_rntm
"bli_pthread_mutex_lock( &global_rntm_mutex );" This causes regression when application is
Multithreaded. The cause of this regression is due to these mutexes, Imagine a scenario
two threads launched, one thread acquires this mutex, second thread stalls till mutex is
freed by first thread, as a result second thread will be slower to arrive at openmp barrier
in application thereby increasing the openmp barrier overhead.
Things get worst when more number of threads are launched.
Thanks to rocHPL for sharing standalone panelfact application to reproduce this issue.
Thanks to @Edward Symth (edward.smyth@amd.com) for finding this bug.

[SWLCSG-3143]
2024-11-22 15:36:30 +05:30
Vignesh Balasubramanian
4da1ad2cd9 Added CBLAS wrappers for complex precision ?ROT and ?ROTG APIs
- Added the appropriate CBLAS wrappers for CROTG, CSROT,
  ZROTG and ZDROT APIs. These would internally call their
  ?_blis_impl() layer.

AMD-Internal: [CPUPL-5813]
Change-Id: I6037f20092f99cc5a5e2794d03bbe76d6a55eb97
2024-09-19 08:49:46 -04:00
Shubham Sharma
b3b56ae3bb BugFix: Fixed GEMM mixed precision failure in ZEN5
- Optimized DGEMM macro kernel does not
  support mixed precision.
- This kernel was being used for solving
  some of the mixed precision problems.
- Currently only ( bli_obj_elem(A) == 8 ) is used for checking
  if the problem being solved is mixed precision.
- bli_obj_elem(A) will be equal to 8 for both double precision
  data type and mixed precision case single-complex.
- Added extra checks (bli_obj_is_real( a )) to make sure that
  A and B are real and DGEMM macro kernel is being used only
  for DDDGEMM.

AMD-Internal: [CPUPL-5804]
Change-Id: Iaa1accf8d851d11533f8ba31dc0235fbc14f89a9
2024-09-19 04:54:53 -04:00
Edward Smyth
a07e041b1f SCALV alpha=zero BLAS compliance
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiple all vector element by alpha. This has
different behaviour for propagating NaNs or Infs.

Changes in this commit:
- Standardize early returns from SCALV reference and optimized
  kernels.
- User supplied N<0 is handled at the top level API layer. Use
  negative values of N in kernel calls to signify that SETV
  should _not_ be used when alpha=0. This should only be
  required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
  overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.

AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
2024-09-16 07:10:28 -04:00
Edward Smyth
711dce14d0 Export full set of _blis_impl interfaces
The _blis_impl layer provide a BLAS-like API for use in builds
where BLAS and CBLAS interfaces are not desirable. This patch
generates interfaces in uppercase and with and without trailing
underscores, to match what is generated for the regular BLAS
interface.

AMD-Internal: [CPUPL-5650]
Change-Id: I3ba9d0992291b0977479ab479acb71e42277c7c2
2024-09-03 04:13:06 -04:00
Hari Govind S
6dd8f06aff Bug Fix: When calculating number of threads for level1 APIs when BLIS_IC_* or BLIS_JC_* are set
-  Reverted the change done for tuning ddotv API. When number of threads
   is mentioned using BLIS_IC_NT or BLIS_JC_NT, ... number of threads
   are not calculated and as a result number of threads value is -1.
   OpenMP threads are launched with -1 value. This results in crash.
   This bug is fixed by correctly calculating number of threads.

AMD-Internal: [SWLCSG-3028][CPUPL-5689]
Change-Id: Ib9284dca02bdb115752926109beb28dc342e300a
2024-08-29 05:42:03 -04:00
Edward Smyth
1f18eeb267 Determine AMD FP/SIMD execution datapath width
Different Zen processors may have a 512-bit, 256-bit or 128-bit
FP/SIMD execution datapath width (FP512, FP256, FP128). Zen5 allows
a selection of FP512 or FP256 width in BIOS settings. Add cpuid
code to detect the width and store an indication of it in the
global variable bli_fp_datapath. This should be accessed internally
via the function bli_cpuid_query_fp_datapath(). This functionality
is currently only enabled on x86_64 platforms and only currently
reports a value for AMD CPUs.

Also add Zen3 as a fallback path for any unknown AMD processors if
AVX512 is not supported or has been disabled.

AMD-Internal: [CPUPL-4415]
Change-Id: Idf3fb5a697b43bc035ce110e86f60706dcc67f2a
2024-08-28 12:25:57 -04:00
Vignesh Balasubramanian
189a0b7224 Bugfix for {D/C/Z}AXPBY and ZAXPY BLAS APIs
- Bug : For non-zen architectures, {D/C/Z}AXPBY had
  incorrect datatypes passed when querying the computational
  kernel from context. The right datatype is now passed to
  each variant.

- Bug : For ZAXPY, a NULL context was passed to the kernel
  when using the single-threaded path. In case of further
  using the context inside the kernel, this would be an issue.
  We now pass the context instead of a null pointer.

AMD-Internal: [CPUPL-5643]
Change-Id: I01bb78bda6be61c43543b16fda0ac02a988a07bf
2024-08-22 14:12:14 +05:30
Shubham Sharma
d322cc11f8 Tiny size optimization for DTRSV var2
- Use AVX2 kernels for tiny sizes on genoa.
- Removed the runtime init overhead for small sizes.

AMD-Internal: [CPUPL-5407]
Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30
2024-08-20 00:40:25 -04:00
Shubham Sharma.
1a17e95332 Fixed failures for mixed precision on ZEN5 in GEMM
- Optimized macro kernel (bli_dgemm_avx512_asm_8x24_macro_kernel)
  for zen5 do not support alpha scaling. Alpha scaling is
  supported by zen5 micro kernel (bli_dgemm_avx512_asm_8x24).
- Optimized macro kernel expects alpha scaling to be done during
  packing. The packing kernel used for mixed precision do not support
  alpha scaling. Therefore, the optimized Zen5 macro kernel is not
  compatible with existing packing logic.
- Changes have been made to use the generic macro kernel which in turn
  used zen5 micro kernel for mixed precision which supports alpha scaling.

AMD-Internal: [CPUPL-5058]
Change-Id: I1bfeb32ae07eedafadad7dd2c62d63913a46e446
2024-08-20 00:40:12 -04:00
Edward Smyth
7fff7b4026 Code cleanup: Miscellaneous fixes
- Delete unused cmake files.
- Add guards around call to bli_cpuid_is_avx2fma3_supported
  in frame/3/bli_l3_sup.c, currently assumes that non-x86
  platforms will not use bli_gemmtsup.
- Correct variable in frame/base/bli_arch.c on non-x86
  builds.
- Add guards around omp pragma to avoid possible gcc
  compiler warning in kernels/zen/2/bli_gemv_zen_int_4.c.
- Add missing registers in clobber list in
  kernels/zen4/1/bli_dotv_zen_int_avx512.c.
- Add gtestsuite ERS_IIT tests for TRMV, copied from TRSV.
- Correct calls to cblas_{c,z}swap in gtestsuite.
- Correct test name in ddotxf gtestsuite program.

AMD-Internal: [CPUPL-4415]
Change-Id: I69ad56390017676cc609b4d3aba3244a2df6a6b5
2024-08-06 06:56:01 -04:00
Hari Govind S
d349f89df6 Fix warning caused by dscalv
-  Setting the value for ST_THRESH for default
   code path in dscalv API to avoid warning
   message.

Change-Id: I8ace2070350267904faa498197b8356de9af58d1
2024-08-06 12:13:23 +05:30
Edward Smyth
89f52a6df5 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-4500]
Change-Id: I33e28932b0e26bbed850c55602dee12fd002da7f
2024-08-05 16:18:51 -04:00
Edward Smyth
82bdf7c8c7 Code cleanup: Copyright notices
- Standardize formatting (spacing etc).
- Add full copyright to cmake files (excluding .json)
- Correct copyright and disclaimer text for frame and
  zen, skx and a couple of other kernels to cover all
  contributors, as is commonly used in other files.
- Fixed some typos and missing lines in copyright
  statements.

AMD-Internal: [CPUPL-4415]
Change-Id: Ib248bb6033c4d0b408773cf0e2a2cda6c2a74371
2024-08-05 15:35:08 -04:00
Edward Smyth
09c45525f4 Missing early returns (2)
Add missing early return in axpyv.

AMD-Internal: [CPUPL-5540]
Change-Id: I522fd6f5551a4dab24e8c164fa38818c900b89f8
2024-08-05 12:18:33 -04:00
Edward Smyth
591a3a7395 Code cleanup: file formats and permissions
- Remove execute file permission from source and make files.
- dos2unix conversion.
- Add missing eol at end of files.

Also update .gitignore to not exclude build directory but to
exclude any build_* created by cmake builds.

AMD-Internal: [CPUPL-4415]
Change-Id: I5403290d49fe212659a8015d5e94281fe41eb124
2024-08-05 11:52:33 -04:00
Shubham Sharma.
dac0524ac8 BugFix in AVX512 DGEMMT SUP ST RRC variant
- C<- alpha * op(A) *op(B) + beta *C.
  C(nxn) - A(n x k) * B(k x n)
  For ZEN4 and ZEN5
  DGEMM is col-preferred kernel
  DGEMMT = DGEMM + DGEMMT
  DGEMM is col-preferred and DGEMMT is row-preferred.
  DGEMM is evaluated as C = A*B (all col-storage)
  whereas DGEMMT is evaluated as C = B * A (row-storage).
  When A is packed it is packed as row-panels with col-stored elements.
  So DGEMM is evaluated as C = A*B (A is col-stored) it aligns
  with col-stored preference.
  For DGEMMT: C = B * A, here A will become col-stored
  because of packingand as result it will break the DGEMMT
  kernel assumption that A is row-storage.
- Fixed this by disabling this optimization for ZEN4
  and ZEN5.

AMD-Internal: [CPUPL-5542}
Change-Id: I9645624be009d1050ecb908d65c04aadcfa04379
2024-08-05 05:21:24 -04:00
Arnav Sharma
0a5c057475 DGEMV Optimizations for Tiny Sizes
- Added reference kernel for dgemv that handles computation for tiny
  sizes (m < 8 && n < 8).

- The reference kernel, bli_dgemv_zen_ref( ... ), supports both
  row/column storage schemes as well as transpose and no transpose
  cases.

- Added additional unit-tests for functional verification.

AMD-Internal: [CPUPL-5098]
Change-Id: I66fdf0a40e90bdb3fed40152c45ab28a17a87ada
2024-08-05 12:19:42 +05:30
Hari Govind S
3ae466697b Fixed performance drop of multi-threaded dscalv
-  Avoid performance degradation of dscalv for ST when OpenMP is enabled
   by using fast-path to skip the overhead caused by 'bli_nthreads_l1'
   function if the input size is less than a particular threshold.

-  Replaced 'bli_thread_vector_partition' work distribution function
   with 'bli_thread_range_sub'.

AMD-Internal: [CPUPL-5522]
Change-Id: I4ad0041d6e448c4a26fcd47ce44e0321a41b8b9f
2024-08-05 01:51:30 -04:00