166 Commits

Author SHA1 Message Date
Smyth, Edward
05e837d176 BLIS: Implement zen6 sub-configuration
Implement zen6 cpuid and arch changes, and add zen6 as a
separate BLIS sub-configuration and code path within amdzen
configuration family. Currently all optimization choices are
copies of zen5 sub-configuration.

AMD-Internal: [CPUPL-7162]
2026-03-05 13:33:56 +00:00
Smyth, Edward
011c75dddb Remove unnecessary OpenMP include (AOCL)
Copy of similar change in upstream BLIS (843a5e8) to fix issues
https://github.com/flame/blis/issues/873 and
https://github.com/amd/blis/issues/50

Details:
- Previously, `<omp.h>` was included in `bli_thrcomm_openmp.h` so that the
  framework could access the necessary OpenMP functions.
- As @melven reported (#873), this causes issues when `blis.h` is included
  in C++ code since the `<omp.h>` include happens with `extern "C"`.
- Move the include from the header to the necessary .c files so that it
  does not "pollute" `blis.h`.

Thanks to @DaAwesomeP and @bartoldeman for reporting this issue in
AOCL BLIS

AMD-Internal: [CPUPL-7303]
2026-02-06 10:41:38 +00:00
Smyth, Edward
8310b2d5d3 Optimize bli_arch_query_id and related functions
bli_arch_query_id() is used to select kernels in optimized BLAS APIs. Previous
implementation incurred the overhead of multiple function calls. This has
been reduced by:
- Changing the function to be defined in a header file so it can be inlined.
- Avoiding call to bli_arch_check_id_once that was a wrapper for a call to
  bli_pthread_once. Instead bli_pthread_once is called directly.
- For builds with a single BLIS sub-configuration, correct arch_id is taken
  directly from a header file in the corresponding config subdirectory,
  avoiding the bli_pthread_once call and making the value explicit at
  compile time, which may enable additional optimizations.

To enable these changes, the variables arch_id and model_id defined in
frame/base/bli_arch.c are no longer static, as they must be accessed in multiple
files (i.e. they are now global variables). Rename to g_arch_id and g_model_id
to distinguish from any locally defined arch_id or model_id variables.
2026-02-04 13:16:46 +00:00
S, Hari Govind
4ecfbde082 Fix extreme values handling in GEMV
- When alpha == 0, we are expected to only scale y vector with beta and not read A or X at all.
- This scenario is not handled properly in all code paths, which causes NaN and Inf values from A and X to be wrongly propagated. For example, for non-zen architectures (default block in the switch case) no such check is present; similarly, some of the avx512 kernels are also missing these checks.
- When beta == 0, we are not expected to read Y at all; this is also not handled correctly in one of the avx512 kernels.
- To fix these, early return condition for alpha == 0 is added to bla layer itself so that each kernel does not have to implement the logic.
- DGEMV AVX512 transpose kernel has been fixed to load vector Y only when beta != 0.

AMD-Internal: [CPUPL-7585]
2025-11-08 12:30:03 +05:30
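The alpha == 0 and beta == 0 rules described in commit 4ecfbde082 can be illustrated with a scalar reference routine. This is a minimal sketch, not the AOCL BLIS code: the function name `ref_dgemv_n`, its argument order and the column-major layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference GEMV (column-major, no transpose):
 *   y := beta*y + alpha*A*x
 * Illustrates the early return only; not the AOCL BLIS implementation. */
void ref_dgemv_n(size_t m, size_t n, double alpha, const double *a, size_t lda,
                 const double *x, double beta, double *y)
{
    assert(lda >= m);

    /* Scale y first. When beta == 0, y is overwritten and never read,
     * so NaN or Inf values already stored in y are discarded. */
    for (size_t i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* Early return: with alpha == 0, A and x are not touched at all,
     * so NaN or Inf values stored in them cannot reach y. */
    if (alpha == 0.0)
        return;

    for (size_t j = 0; j < n; ++j) {
        const double ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += ax * a[i + j * lda];
    }
}
```

Placing the check in one common layer, as the commit does, means individual kernels do not each need their own copy of it.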
Varaganti, Kiran
49961aa569 Fix DTL dynamic thread logging in BLAS operations (#230)
- Remove redundant AOCL_DTL_LOG_NUM_THREADS calls from early return paths
- Update thread count logging to use AOCL_get_requested_threads_count() for early exits
- Clean up duplicate DTL logging in gemv_unf_var1 and gemv_unf_var2 implementations
- Remove thread count logging from bli_dgemv_n_zen4_int kernel variants
- Simplify aocldtl_blis.c AOCL_DTL_log_gemv_sizes by removing redundant conditional
- Standardize DTL trace exit patterns across axpy, scal, and gemv operations
- Remove commented-out DTL logging code in zen4 gemv kernel

This patch ensures thread count is logged only once per operation and uses
the correct API (AOCL_get_requested_threads_count) for early exit scenarios
where the actual execution thread count may differ from requested threads.
2025-10-24 13:34:00 +01:00
Rayan, Rohan
dc4e0f72c1 Fixing an integer division in GEMV that was supposed to be a double operation (#218)
---------

Co-authored-by: Rayan <rohrayan@amd.com>
2025-09-30 14:04:39 +05:30
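Commit dc4e0f72c1 does not say where the division occurs, so the following is only a generic illustration of the bug class. The helper names and the "work per thread" interpretation are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: both operands are integers, so the division is
 * performed in integer arithmetic and only the truncated result is
 * converted to double. */
double work_per_thread_truncating(long n_elems, long n_threads)
{
    assert(n_threads != 0);
    return n_elems / n_threads;
}

/* Converting an operand first forces a floating-point division. */
double work_per_thread(long n_elems, long n_threads)
{
    assert(n_threads != 0);
    return (double)n_elems / (double)n_threads;
}
```

With 7 elements and 2 threads the first version yields 3.0 and the second 3.5.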
Varaganti, Kiran
807de2a990 DTL Log update
* DTL Log update
Updates logs with nt and AOCL Dynamic selected nt for axpy, scal and dgemv
Modified bench_gemv.c to be able to process modified dtl logs.

* Updated DTL log for copy routine with actual nt and dynamic nt

* Refactor OpenMP pragmas and clean up code

Removed unnecessary nested OpenMP pragma and cleaned up function end comment.

* Fixed DTL log for sequential build

* Added thread logging in bla_gemv_check for invalid inputs

---------

Co-authored-by: Smyth, Edward <Edward.Smyth@amd.com>
2025-09-22 11:32:00 +05:30
Smyth, Edward
ae6c7d86df Tidying code
- AMD specific BLAS1 and BLAS2 framework: changes to make variants
  more consistent with each other
- Initialize kernel pointers to NULL where not immediately set
- Fix code indentation and other whitespace changes in DTL
  code and addon/aocl_gemm/frame/s8s8s32/lpgemm_s8s8s32_sym_quant.c
- Fix typos in DTL comments
- Add missing newline at end of test/CMakeLists.txt
- Standardize on using arch_id variable name

AMD-Internal: [CPUPL-6579]
2025-09-16 14:52:54 +01:00
Smyth, Edward
fb2a682725 Miscellaneous changes
- Change begin_asm and end_asm comments and unused code in files
     kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c
     kernels/zen4/3/sup/bli_gemmsup_cd_zen4_asm_z12x4m.c
  to avoid problems in clobber checking script.
- Add missing clobbers in files
     kernels/zen4/1m/bli_packm_zen4_asm_d24xk.c
     kernels/zen4/1m/bli_packm_zen4_asm_z12xk.c
     kernels/zen4/3/sup/bli_gemmsup_cv_zen4_asm_z12x4m.c
- Add missing newline at end of files.
- Update some copyright years for recent changes.
- Standardize license text formatting.

AMD-Internal: [CPUPL-6579]
2025-08-26 16:37:43 +01:00
Smyth, Edward
509aa07785 Standardize Zen kernel names
Naming of Zen kernels and associated files was inconsistent with BLIS
conventions for other sub-configurations and between different Zen
generations. Other anomalies existed, e.g. dgemmsup 24x column
preferred kernel names with _rv_ instead of _cv_. This patch renames
kernels and file names to address these issues.

AMD-Internal: [CPUPL-6579]
2025-08-19 18:19:51 +01:00
Sharma, Shubham
b0a4914417 Added DGEMV no transpose multithreaded Implementations (#12)
* Added DGEMV no transpose multithreaded Implementations
- Added new avx512 M and N kernels for DGEMV.
- Added multiple MT implementations for same kernels.
- Added AOCL_dynamic logic for L2 apis.
- Tuned AOCL_dynamic and code path selection for DGEMV on ZEN5.
- Added same kernels for SGEMV, but these kernels are not enabled yet.
- Added SGEMV reference kernel.

AMD-Internal: [SWLCSG-3408]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-08-12 10:39:12 +05:30
S, Hari Govind
273a05f0bd Fix for performance regression caused by non-unit stride y in DGEMV API (#91)
- Temporary fix for regression in DGEMV for non-unit stride y inputs. The code
  section responsible for handling non-unit stride y has been removed from the
  frame.

- The kernel code is extended with if condition to handle both unit and non-unit
  stride y.

AMD-Internal: [CPUPL-6869]
2025-07-25 10:57:57 +05:30
S, Hari Govind
8d41565822 Fix build failure when AOCL_DYNAMIC is disabled (#57)
- The build was failing when AOCL_DYNAMIC was disabled because
  `fast_path_thresh` was only declared when both AOCL_DYNAMIC and
  OpenMP were enabled. This variable was used in an `if` condition for
  single-thread execution without an AOCL_DYNAMIC guard.

- To resolve this, the test expression for single-thread execution has
  been replaced with a macro. This macro is set to 0 when AOCL_DYNAMIC
  is disabled, ensuring the condition is handled correctly.

AMD-Internal: [CPUPL-6854]
2025-06-23 15:56:15 +05:30
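The macro approach in commit 8d41565822 can be sketched as below. Only `AOCL_DYNAMIC` and `fast_path_thresh` come from the commit message; the macro `IS_FAST_PATH`, the function `choose_path`, the threshold value and the use of `BLIS_ENABLE_OPENMP` as the OpenMP guard are assumptions for illustration.

```c
/* Hypothetical names and values; a sketch of the guard pattern only. */
#if defined(AOCL_DYNAMIC) && defined(BLIS_ENABLE_OPENMP)
static const long fast_path_thresh = 4000;   /* illustrative value */
#define IS_FAST_PATH(n) ((n) < fast_path_thresh)
#else
/* fast_path_thresh is not declared in this build, so the test collapses
 * to the constant 0 and never refers to the missing variable. */
#define IS_FAST_PATH(n) 0
#endif

/* Returns 1 when the single-thread fast path is taken, 0 otherwise. */
int choose_path(long n)
{
    if (IS_FAST_PATH(n))
        return 1;
    return 0;
}
```

Because every use of the threshold goes through the macro, a build without AOCL_DYNAMIC compiles without any reference to the undeclared variable.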
S, Hari Govind
e097346658 Implemented Multithreading Support and Optimization of DGEMV API (#10)
- Implemented multithreading framework for the DGEMV API on Zen architectures. Architecture specific AOCL-dynamic logic determines the optimal number of threads for improved performance.

- The condition check for the value of beta is optimized by utilizing masked operations. The mask value is set based on value of beta, and the masked operations are applied when the vector y is loaded or scaled with beta.

AMD-Internal: [CPUPL-6746]
2025-06-17 12:39:48 +05:30
Smyth, Edward
49ae7db89a Avoid including .c files (#40)
Including a C file directly in another C file is not recommended, and some
build systems (e.g. Bazel and Buck) do not allow .c files to include other
.c files. This commit changes the tapi and oapi framework files that are
included from the _ex and _ba file variants from .c filenames to .h
filenames.

AMD-Internal: [CPUPL-6784]

Co-authored-by: Varaganti, Kiran <Kiran.Varaganti@amd.com>
2025-06-10 11:33:33 +05:30
Hari Govind S
29f30c7863 Optimisation for DCOPY API
-  Introduced new assembly kernel that copies data from source
   to destination from the front and back of the vector at the
   same time. This kernel provides better performance for larger
   input sizes.

-  Added a wrapper function responsible for selecting the kernel
   used by DCOPYV API to handle the given input for zen5
   architecture.

-  Updated AOCL-dynamic threshold for DCOPYV API in zen4 and
   zen5 architectures.

-  New unit-tests were included in the gtestsuite for the new
   kernel.

AMD-Internal: [CPUPL-6650]
Change-Id: Ie2af88b8e97196b6aa02c089e59247742002f568
2025-04-28 05:58:21 -04:00
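The access pattern of the kernel in commit 29f30c7863 (copying from the front and the back of the vector at the same time) can be shown with a scalar loop. The real kernel is written in assembly; the function name `dcopy_front_back` and the unit-stride restriction here are assumptions.

```c
#include <stddef.h>

/* Hypothetical scalar sketch: copy n doubles from x to y (unit stride),
 * advancing one cursor from the front and one from the back. */
void dcopy_front_back(size_t n, const double *x, double *y)
{
    size_t lo = 0;   /* next element to copy from the front */
    size_t hi = n;   /* one past the next element to copy from the back */

    while (hi - lo >= 2) {
        y[lo] = x[lo];
        y[hi - 1] = x[hi - 1];
        ++lo;
        --hi;
    }

    if (lo < hi)     /* odd length: one middle element remains */
        y[lo] = x[lo];
}
```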
Hari Govind S
8998839c71 Optimisation of DGEMV Transpose Case for unit stride
- Included a new code section to handle input having non-unit strided y
  vector for dgemv transpose case. Removed the same from the respective
  kernels to avoid repeated branching caused by condition checks within
  the 'for' loop.

- The check for beta equal to zero in the primary kernels is moved
  outside the for loop to avoid repeated branching.

- The '_mm512_reduce_pd' operations in the primary kernel are replaced by
  a series of operations to reduce the number of instructions required
  to reduce the 8 registers.

- Changing naming convention for DGEMV transpose kernels.

- Modified unit kernel test to avoid y increment for dgemv transpose
  kernels during the test.

AMD-Internal: [CPUPL-6565]
Change-Id: I1ac516d6b8f156ac53ac9f6eb18badd50e152e05
2025-03-06 05:15:58 -05:00
Arnav Sharma
b4c1026ec2 Added Support for General Stride in DGEMV
- Updated the bli_dgemv_zen_ref( ... ) kernel to support general stride.
- Since the latest dgemv kernels don't support general stride, added
  checks to invoke bli_dgemv_zen_ref( ... ) when A matrix has a general
  stride.
- Thanks to Vignesh Balasubramanian <vignesh.balasubramanian@amd.com>
  for finding this issue.

AMD-Internal: [CPUPL-6492]
Change-Id: Ia987ce7674cb26cb32eea4a6e9bd6623f2027328
2025-02-27 12:47:21 -05:00
Shubham Sharma
26bd265cfd Optimized DTRSV for tiny sizes
- Replaced switch case with if else, since the lookup table for the switch
  case is placed at the end of the .text section, which causes a huge jump.
- Reduced number of branches for tiny sizes.
- Cpuid query is slow, therefore added a new if statement which avoids cpuid
  query for tiny sizes(<200).
- Redirected tiny sizes to AVX2 kernel.

AMD-Internal: [CPUPL-5407]
Change-Id: I8e73777b2f00c9dcff9775ddfcb7ca3f74fa901c
2025-01-30 01:23:09 -05:00
Hari Govind S
349fc47ec5 DGEMV Optimizations for TRANSPOSE Cases
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
2025-01-24 00:38:34 -05:00
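The column blocking described in commit 349fc47ec5 (a primary path over 8 columns, then fringe handling for the remaining 1 to 7) can be sketched in scalar C. This is not the AVX512 or AVX2 kernel: the names `ref_dgemv_t` and `col_dot` are hypothetical, and the 32-element row chunking is omitted because it only matters for vector registers.

```c
#include <assert.h>
#include <stddef.h>

enum { NR = 8 };   /* columns per primary step, as in the commit message */

/* Dot product of one column of A with x; stands in for a fringe kernel. */
static double col_dot(size_t m, const double *col, const double *x)
{
    double acc = 0.0;
    for (size_t i = 0; i < m; ++i)
        acc += col[i] * x[i];
    return acc;
}

/* Hypothetical sketch of y := beta*y + alpha*A^T*x, A column-major (m x n). */
void ref_dgemv_t(size_t m, size_t n, double alpha, const double *a, size_t lda,
                 const double *x, double beta, double *y)
{
    size_t j = 0;
    assert(lda >= m);

    /* Primary path: 8 columns at a time, one accumulator per column. */
    for (; j + NR <= n; j += NR) {
        double acc[NR] = { 0.0 };
        for (size_t i = 0; i < m; ++i)
            for (size_t k = 0; k < NR; ++k)
                acc[k] += a[i + (j + k) * lda] * x[i];
        /* Reduce step: fold each accumulator into y. */
        for (size_t k = 0; k < NR; ++k)
            y[j + k] = (beta == 0.0 ? 0.0 : beta * y[j + k]) + alpha * acc[k];
    }

    /* Fringe path: the remaining 1 to 7 columns. */
    for (; j < n; ++j)
        y[j] = (beta == 0.0 ? 0.0 : beta * y[j])
             + alpha * col_dot(m, a + j * lda, x);
}
```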
Arnav Sharma
25e59fcbb9 DGEMV Optimizations for NO_TRANSPOSE Cases
- AVX512 specific DGEMV native kernels are added for Zen4/5
  architectures to handle the NO_TRANSPOSE cases and are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
  m-dimension and for this kernel beta scaling is handled beforehand
  within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
  using fused kernel, namely AXPYF, to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
2024-12-12 10:26:50 -05:00
Shubham Sharma
d322cc11f8 Tiny size optimization for DTRSV var2
- Use AVX2 kernels for tiny sizes on genoa.
- Removed the runtime init overhead for small sizes.

AMD-Internal: [CPUPL-5407]
Change-Id: I0db7d93abc659012916ef706f22528c7fabb4e30
2024-08-20 00:40:25 -04:00
Arnav Sharma
9583ee2e23 DGEMV Optimizations for NO_TRANSPOSE cases
- Enabled AVX512 DAXPYF kernels for DGEMV var2 for NO_TRANSPOSE cases.

- Added DAXPYF kernels with fuse factors of 2, 4, 6 and 16.

- Added a wrapper for DAXPYF kernels for redirection to kernels with a
  smaller fuse factor than 32.

- Also added UKR tests for the new fused kernels.

AMD-Internal: [CPUPL-5098]
Change-Id: I0b102b67c6c068873393bac0494284f379c253f2
2024-07-24 15:59:36 +05:30
Hari Govind S
38824244d5 Implementation of AXPYF Kernels for DTRSV
-  Implemented two new axpyf kernels for fuse factors 8 and 12
   by manually unrolling the loops. Used to achieve better performance
   in var2 case.

AMD-Internal: [CPUPL-5184]
Change-Id: I40d2930d003c6ce90323b5c8a52564563d1f23f5
2024-07-16 06:23:01 -04:00
Shubham Sharma
7553abad8e Fixed compilation error with AOCC in TRSV
- Added a {} around zen4 switch case to avoid AOCC error.
- Error is caused because in C declarations are not statements, therefore
  they cannot be labeled, hence the compiler is not able to create a label
  for the jump.

AMD-Internal: [CPUPL-4880]
Change-Id: Icfeedafd80bf9a955e430ca967b6a93dcbbf075e
2024-05-03 21:08:38 +05:30
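The brace fix in commit 7553abad8e follows a general C rule: in older C standards a label, including a case label, must be attached to a statement, and a declaration is not a statement. A minimal sketch, with a hypothetical enum and function (`pick_fuse_factor`) and illustrative return values:

```c
/* Hypothetical dispatcher; the arch ids and values are made up. */
enum arch { ARCH_ZEN3, ARCH_ZEN4 };

int pick_fuse_factor(enum arch id)
{
    switch (id) {
    case ARCH_ZEN4: {
        /* Without the braces this declaration would directly follow the
         * case label. Older C standards reject that, because a label must
         * be followed by a statement. The braces form a compound
         * statement, which the label can be attached to. */
        int fuse = 32;
        return fuse;
    }
    default:
        return 8;
    }
}
```

Some compilers accept the unbraced form as an extension, which may be why the error showed up only with AOCC.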
Shubham Sharma
1d983e6124 Added AVX512 kernels for DAXPYF and DDOTXF
- Added DAXPYF and DDOTXF AVX512 kernels.
- Fuse factor for ddotxf kernel is 8.
- 2 DAXPYF kernels are added, with fuse
  factor 8 and 32.
- Multithreading is also added to the DAXPYF
  kernel with fuse factor 32.
- These kernels are internally used by TRSM.
- Added changes in TRSV to call these kernels
  in ZEN4

AMD-Internal: [CPUPL-4880]
Change-Id: I12850de974b437bbca07677b68bc3d6a35858770
2024-05-03 05:10:22 -04:00
Vignesh Balasubramanian
4e2966f9b0 AVX512 optimizations for ZGEMV API with transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with transpose to A matrix.

- This includes the set of ZDOTXF and ZDOTXV kernels. ZDOTXF
  kernels include those with fuse-factor 8 (main kernel), 4
  and 2(fringe kernels).

- Updated the bli_zgemv_unf_var1( ... ) function to update
  the function pointers to these kernels, based on the
  configuration.

AMD-Internal: [CPUPL-4974]
Change-Id: I313ae0abe9dc119de849da42f9825b71f11b1fda
2024-05-03 04:38:52 -04:00
Vignesh Balasubramanian
53cb83d0cc AVX512 optimizations for ZGEMV API with no-transpose case
- Implemented AVX512 kernels for handling the calls to ZGEMV
  with no-transpose to A matrix.

- This includes the ZAXPYF, ZAXPYV and ZSETV kernels.
  The set of ZAXPYF kernels include those with fuse-factor 8
  (main kernel), 4 and 2(fringe kernels).

- Updated the bli_zgemv_unf_var2( ... ) function to set
  the function pointers to these kernels, based on the
  configuration. Further added the call to ZSETV at this
  layer in case beta is 0.

AMD-Internal: [CPUPL-4974]
Change-Id: Iee4b724719e49023138bb16479765be44d677cd9
2024-05-03 07:04:47 +00:00
Shubham Sharma
632c32767b Avoid alpha scaling in ZTRSV/ZTRSM when alpha = 1
- Scaling vector X is skipped when alpha is 1 in ZTRSV.
- Scaling matrix A is skipped when alpha is 1 in ZTRSM.

AMD-Internal: [CPUPL-4324]
Change-Id: I03c5a454ed1f5be36dac0f121408749bfc9cfc81
2024-04-16 02:24:02 -04:00
Edward Smyth
2450a1813b BLIS: Implement zen5 sub-configuration
Implement full support for zen5 as a separate BLIS sub-configuration
and code path within amdzen configuration family.

AMD-Internal: [CPUPL-3518]
Change-Id: Iaa5096e0b83bf0f0c3fd1c41e601ccd29bda3c09
2024-04-12 07:26:31 -04:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
mangala v
fa355c0049 Removed warning during compilation of gemv api for non-zen config
- When configured for haswell config "Warning unused variable 'zero'"
  was thrown during compilation.
- Removed zero variable which is not being used

AMD-Internal: [CPUPL-3973]
Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8
2023-11-08 01:43:33 -05:00
Vignesh Balasubramanian
ef545b928e Bugfix : Changing fuse factor for the call to vectorized SAXPYF kernel
- The call to the bli_saxpyf_zen_int_6( ... ) is explicitly
  present in the bli_gemv_unf_var2_amd.c file, as part of the
  bli_sgemv_unf_var2( ... ) function. This was changed to
  bli_saxpyf_zen_int_5( ... )( thereby changing the fuse factor
  from 6 to 5 ), in accordance to the function pointer present
  in the zen3 and zen4 context files.

- Changed the accumulator type to double from float, inside the
  fringe loop for unit-strides(vectorized path) and non-unit strides
  (scalar code).

AMD-Internal: [CPUPL-4028]
Change-Id: Iab1a0318f461cba9a7041093c6865ae8396d231e
2023-11-03 01:37:43 -04:00
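The accumulator change in commit ef545b928e (double instead of float inside the fringe loop) can be illustrated with a scalar dot product. Both function names are hypothetical and the loops are simplified to unit stride.

```c
#include <stddef.h>

/* Hypothetical fringe loop with a float accumulator: every partial sum
 * is rounded to single precision. */
float sdot_float_acc(size_t n, const float *x, const float *y)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Same loop with a double accumulator: partial sums keep extra precision
 * and the result is rounded to float only once, at the end. */
float sdot_double_acc(size_t n, const float *x, const float *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i] * (double)y[i];
    return (float)acc;
}
```

The difference shows up when a small term is added to a large partial sum: in single precision the small term can be rounded away entirely, while the double accumulator retains it.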
Harihara Sudhan S
106342f402 ZGEMV optimization for special cases in beta
- Avoiding scaling of y vector by beta when beta is 1.

AMD-Internal: [CPUPL-3829]
Change-Id: I9cf46f44c5f1c2da3653937ff035594b4046b4a1
2023-11-02 08:21:46 -04:00
Harihara Sudhan S
105de694cf Optimized ZGEMV variant 1
- Added an explicit function definition for ZGEMV var 1. This
  removes the need to query the context for Zen architectures.
- Added a new INSERT_GENTFUNC to generate the definition only
  for scomplex type.
- Rewrote ZDOTXF kernel and added the function name for ZDOTV
  instead of querying it.
- With this change fringe loop is vectorized using SSE
  instructions.

AMD-Internal:[CPUPL-3997]

Change-Id: I790214d528f9e39f63387bc95bf611f84d3faca3
2023-10-13 05:03:53 -04:00
Edward Smyth
bb4c158e63 Merge commit 'b683d01b' into amd-main
* commit 'b683d01b':
  Use extra #undef when including ba/ex API headers.
  Minor preprocessor/header cleanup.
  Fixed typo in cpp guard in bli_util_ft.h.
  Defined eqsc, eqv, eqm to test object equality.
  Defined setijv, getijv to set/get vector elements.
  Minor API breakage in bli_pack API.
  Add err_t* "return" parameter to malloc functions.
  Always stay initialized after BLAS compat calls.
  Renamed membrk files/vars/functions to pba.
  Switch allocator mutexes to static initialization.

AMD-Internal: [CPUPL-2698]
Change-Id: Ied2ca8619f144d4b8a7123ac45a1be0dda3875df
2023-08-21 07:01:38 -04:00
Harihara Sudhan S
278ca71706 Fixes for GEMV Functionality Issues
- Added call to dsetv in dscalv. When DSCALV is invoked by
  DGEMV the SCAL function is expected to SET the vector to
  zero when alpha is 0. This change is done to ensure BLAS
  compatibility of DGEMV.
- Fixed bug in DGEMV var 1. Reverted changes in DGEMV var
  1 to remove packing and dispatch logic.
- CMAKE now builds with _amd files for unf_var2 of GEMV.

AMD-Internal: [CPUPL-3772]
Change-Id: I0d60c9e1025a3a56419d6ae47ded509d50e5eade
2023-08-14 13:54:48 +05:30
Harihara Sudhan S
03fa660792 Optimized xGEMV for non-unit stride X vector
- In GEMV variant 1, the input matrix A is in row major. X vector
  has to be of unit stride if the operation is to be vectorized.
- In cases when X vector is non-unit stride, vectorization of the GEMV
  operation inside the kernel has been ensured by packing the input X
  vector to a temporary buffer with unit stride. Currently, the
  packing is done using the SCAL2V.
- In case of DGEMV, X vector is scaled by alpha as part of packing.
  In CGEMV and ZGEMV, alpha is passed as 1 while packing.
- The temporary buffer created is released once the GEMV operation
  is complete.
- In DGEMV variant 1, moved problem decomposition for Zen architecture
  to the DOTXF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3475]
Change-Id: Icd9fd91eccd831f1fcb9fbf0037fcbbc2e34268e
2023-08-08 01:01:22 -04:00
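The packing step in commit 03fa660792 can be sketched as follows for the DGEMV case, where x is scaled by alpha while it is packed. The names `scal2v_pack` and `gemv_var1_packed`, the row-major layout and the use of `malloc` for the temporary buffer are assumptions; the library uses its own SCAL2V kernel and memory management.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical scal2v: dst[i] := alpha * src[i*incx], dst has unit stride. */
static void scal2v_pack(size_t n, double alpha, const double *src,
                        size_t incx, double *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = alpha * src[i * incx];
}

/* y := y + alpha*A*x for row-major A (m x n), with x of stride incx.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int gemv_var1_packed(size_t m, size_t n, double alpha, const double *a,
                     size_t lda, const double *x, size_t incx, double *y)
{
    double *xbuf = malloc(n * sizeof *xbuf);
    if (xbuf == NULL && n > 0)
        return -1;

    /* Pack x into a unit-stride buffer, folding alpha into the copy. */
    scal2v_pack(n, alpha, x, incx, xbuf);

    /* The inner loop now reads only contiguous data, which is what makes
     * it suitable for vectorization. */
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i * lda + j] * xbuf[j];
        y[i] += acc;
    }

    free(xbuf);   /* release the temporary buffer once GEMV is complete */
    return 0;
}
```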
Harihara Sudhan S
3be43d264f Optimized xGEMV for non-unit stride Y vector
- In variant 2 of GEMV, A matrix is in column major. Y vector has
  to be of unit stride if the operation is to be vectorized.
- In cases when Y vector is non-unit stride, vectorization of the
  GEMV operation inside the kernel has been ensured by packing the
  input Y vector to a temporary buffer with unit stride. As part of
  the packing Y is scaled by beta to reduce the number of times Y
  vector is to be loaded.
- After performing the GEMV operation, the results in the temporary
  buffer are copied to the original buffer and the temporary one is
  released.
- In DGEMV var 2, moved problem decomposition for Zen architecture
  to the AXPYF kernel.
- Removed flag check based kernel dispatch logic from DGEMV. Now,
  kernels will be picked from the context for non-avx machines. For
  avx machines, the kernel(s) to be dispatched is(are) assigned to
  the function pointer in the unf_var layer.

AMD-Internal: [CPUPL-3485]
Change-Id: I7b2efb00a9fa9abca65abca07ee80f38229bf654
2023-08-07 08:12:44 -04:00
Edward Smyth
7e50ba669b Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

Also correct file format of windows/tests/inputs.yaml, which
was missed in commit 0f0277e104

AMD-Internal: [CPUPL-2870]
Change-Id: Icb83a4a27033dc0ff325cb84a1cf399e953ec549
2023-04-21 10:02:48 -04:00
Edward Smyth
6835205ba8 Code cleanup: spelling corrections
Corrections for spelling and other mistakes in code comments
and doc files.

AMD-Internal: [CPUPL-2870]
Change-Id: Ifbb5df7df2d6312fe73e06ee6d41c00b16c593ce
2023-04-19 12:44:56 -04:00
Aayush Kumar
71272ab574 Fixed Compiler warnings for GCC 12 and AOCC 4.0
- Set the variables to zero to avoid the compiler warning
  (-Wmaybe-uninitialized) in bli_dgemm_ref_k1.c,
  bli_gemm_small.c, bli_trsm_small.c, bli_zgemm_ref_k1.c and
  bli_trsm_small_AVX512.c

- Changed the datatype from dim_t to siz_t for i,k,j
  in bli_hemv_unf_var1_amd.c and bli_hemv_unf_var3_amd.c to
  avoid the compiler warning (-Waggressive-loop-optimizations)

AMD-Internal: [CPUPL-2870]

Change-Id: Ib2bc050fa47cb8a280d719283ab4539c70e19d03
2023-04-14 13:29:17 +00:00
Harihara Sudhan S
2e6724262e ZGEMV var 2 bug fix
- Fixed segmentation fault that was seen on non zen and non avx2
  machines.
- cntx object was not passed to the invoked kernel causing a seg
  fault.

AMD-Internal: [CPUPL-3167]
Change-Id: I2640d3f905e78398935cf6ed667b04a6418baa5d
2023-04-05 01:31:24 -04:00
Edward Smyth
1ac03e64b5 BLIS cpuid tidy and bugfix.
Improvements to BLIS cpuid functionality:
- Tidy names of avx support test functions, especially rename
  bli_cpuid_is_avx_supported() to bli_cpuid_is_avx2fma3_supported()
  to more accurately describe what it tests.
- Fix bug in frame/base/bli_check.c related to changes in commit
  6861fcae91

AMD-Internal: [CPUPL-3031]
Change-Id: Iacd8fb0ffbd45288e536fc6314660709055ea2d5
2023-04-03 08:46:37 -04:00
Harihara Sudhan S
4b36529a8b Added vector packing logic to ZGEMV variant 2
- In cases when incy != 1, a buffer is created for the y vector. The
  contents of vector y are scaled by beta and stored in this buffer.
- After performing the compute using the ZAXPYF kernel, the results in
  the y buffer are copied back to the original buffer using ZCOPYV.
- In cases when alpha is zero, we only scale the y vector by beta
  without using the buffer and return.
- The kernels are picked based on the architecture ID. For any zen
  based architecture, AVX2 kernels are invoked. For other, the
  kernels are invoked based on the context.
- In ZSCAL2V, query for the context if NULL pointer is passed.

AMD-Internal: [CPUPL-2773]
Change-Id: If409ca5c438fc2eebe73480c011577088d52c65f
2023-03-22 03:19:18 -04:00
Edward Smyth
7f86561d26 BLIS-Nov2022: HPL memory issues with GCC.
HPL script was using BLIS manual way to set threading, i.e. setting
BLIS_IC_NT explicitly. This causes bli_rntm_num_threads() to return
-1, which wasn't trapped in parallelised BLAS1 and BLAS2 routines.

Fix: if this occurs, set local number of threads based on product of
BLIS_JC_NT * BLIS_PC_NT * BLIS_IC_NT * BLIS_JR_NT * BLIS_IR_NT values.

Note: BLIS_PC_NT should always be 1, but this environment variable
is currently being read (contrary to documentation), so include it
for now.

Other changes:
* implement _Pragma convention in all code used on AMD
* frame/2/gemv/bli_gemv_unf_var1_amd.c: Remove is_omp_mt_enabled flag

AMD-Internal: [CPUPL-2803]
Change-Id: I37e8b038e5640d6693a87be0609888186322b465
2022-12-06 05:10:34 -05:00
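The fallback in commit 7f86561d26 can be sketched as below. The environment variable names are taken from the commit message; the helper `env_nt`, the function `local_num_threads` and the choice to treat unset or invalid variables as 1 are assumptions, and the library reads these values through its own runtime object instead of calling `getenv` directly.

```c
#include <stdlib.h>

/* Read a positive integer from an environment variable. Unset or invalid
 * values count as 1 (an assumption made for this sketch). */
static long env_nt(const char *name)
{
    const char *s = getenv(name);
    char *end;
    long v;

    if (s == NULL)
        return 1;
    v = strtol(s, &end, 10);
    return (end != s && v > 0) ? v : 1;
}

/* Hypothetical fallback: when the queried thread count is not positive
 * (for example -1 because only BLIS_IC_NT was set), use the product of
 * the per-loop values instead. */
long local_num_threads(long nt_from_rntm)
{
    if (nt_from_rntm > 0)
        return nt_from_rntm;

    return env_nt("BLIS_JC_NT") * env_nt("BLIS_PC_NT") * env_nt("BLIS_IC_NT")
         * env_nt("BLIS_JR_NT") * env_nt("BLIS_IR_NT");
}
```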
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
  files missing them

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Edward Smyth
abf848ad12 Code cleanup and warnings fixes
- Removed some additional compiler warnings reported by GCC 12.1
- Fixed a couple of typos in comments
- frame/3/bli_l3_sup.c: routines were returning before final call
  to AOCL_DTL_TRACE_EXIT
- frame/2/gemv/bli_gemv_unf_var1_amd.c: bli_multi_sgemv_4x2 is
  only defined in header file if BLIS_ENABLE_OPENMP is defined

AMD-Internal: [CPUPL-2460]
Change-Id: I2eacd5687f2548d8f40c24bd1b930859eefbbcde
2022-08-29 08:22:30 -04:00
Arnav Sharma
eb83a0fe9d Enabled ZHER Optimized Path
- While calculating the diagonal and corner elements, the combined
operation of calculating the product of x and x hermitian and
simultaneously scaling it with alpha and adding the result to the matrix
was the cause of increased underflow and overflow errors in netlib
tests.
- So the above calculation is now being done in three steps: scaling x
vector with alpha, then calculating its product with x hermitian and
later adding the final result to the matrix.

AMD-Internal: [CPUPL-2213]
Change-Id: I32df572b013bc3189340662dbf17eddcaec9f0f8
2022-08-29 08:09:42 -04:00
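The three-step update in commit eb83a0fe9d can be written out for a single matrix element. This is a scalar sketch under assumptions: the type `zc_t` and the function `zher_update_elem` are hypothetical, alpha is real as in the BLAS definition of ZHER, and the vectorized kernel is not reproduced.

```c
/* Hypothetical double-precision complex type for this sketch. */
typedef struct { double re, im; } zc_t;

/* Update one element: a := a + alpha * x_i * conj(x_j), with alpha real. */
void zher_update_elem(double alpha, zc_t xi, zc_t xj, zc_t *a)
{
    /* Step 1: scale x_i by alpha. */
    const zc_t ax = { alpha * xi.re, alpha * xi.im };

    /* Step 2: multiply the scaled value by conj(x_j). */
    const zc_t p = { ax.re * xj.re + ax.im * xj.im,
                     ax.im * xj.re - ax.re * xj.im };

    /* Step 3: add the finished product to the matrix element. */
    a->re += p.re;
    a->im += p.im;
}
```

Keeping the steps separate fixes the order of the intermediate roundings, which is the property the commit relies on to match the netlib tests.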