2716 Commits

Author SHA1 Message Date
Dipal M Zambare
d3b503bbf2 Code cleanup and warnings fixes
- Removed all compiler warnings as reported by GCC 11 and AOCC 3.2
- Removed unused files
- Removed commented and disabled code (#if 0, #if 1) from some
  files

AMD-Internal: [CPUPL-2460]
Change-Id: Ifc976f6fe585b09e2e387b6793961ad6ef05bb4a
2022-08-29 15:15:40 +05:30
jagar
95169ca806 CBLAS/BLAS interface decoupling for level 1 APIs
-   In BLIS the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
    internally invokes the BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it and
    may result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I0f4521e70a02f6132bdadbd4c07715c9d52fe62a
2022-08-29 14:28:20 +05:30
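The decoupling described in 95169ca806 can be sketched as follows. This is only an illustration of the pattern, not the BLIS source: the names demo_ddot_impl, demo_ddot_ and demo_cblas_ddot are hypothetical, and a level 1 dot product is used because it is short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared implementation. Neither wrapper calls the other,
   so a foreign BLAS or CBLAS library cannot cause recursion. */
static double demo_ddot_impl(int n, const double *x, int incx,
                             const double *y, int incy)
{
    double acc = 0.0;
    if (n <= 0) return acc;
    assert(x != NULL && y != NULL);
    /* BLAS convention: a negative increment walks the vector backwards. */
    ptrdiff_t ix = (incx >= 0) ? 0 : (ptrdiff_t)(1 - n) * incx;
    ptrdiff_t iy = (incy >= 0) ? 0 : (ptrdiff_t)(1 - n) * incy;
    for (int i = 0; i < n; ++i) {
        acc += x[ix] * y[iy];
        ix += incx;
        iy += incy;
    }
    return acc;
}

/* Fortran-style BLAS wrapper: scalar arguments are passed by reference. */
double demo_ddot_(const int *n, const double *x, const int *incx,
                  const double *y, const int *incy)
{
    return demo_ddot_impl(*n, x, *incx, y, *incy);
}

/* CBLAS wrapper: scalar arguments are passed by value. */
double demo_cblas_ddot(int n, const double *x, int incx,
                       const double *y, int incy)
{
    return demo_ddot_impl(n, x, incx, y, incy);
}
```

Both wrappers only translate the calling convention; all arithmetic lives in the shared function.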
Harihara Sudhan S
326d8a557f Performance regression in u8s8s16os16
- Performance of u8s8s16os16 came down by 40% after the
  introduction of post-ops
- Analysis revealed that the target compiler assumed a false
  dependency and was generating sub-optimal code due to the
  post-ops structure
- Inserted vzeroupper to hint to the compiler that no ISA change
  will occur

AMD-Internal: [CPUPL-2447]
Change-Id: I0b383b9742ad237d0e053394602428872691ef0c
2022-08-29 03:20:02 -04:00
jagar
192f5313a1 CBLAS/BLAS interface decoupling for level 2 APIs
-   In BLIS the CBLAS interface is implemented as a wrapper around
    the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
    internally invokes the BLAS API ‘dgemm_’.
-   If the end user wants to use different libraries for CBLAS
    and BLAS, the current implementation of BLIS doesn’t allow it and
    may result in recursion.
-   This change separates the CBLAS and BLAS implementations by adding
    an additional level of abstraction. The implementation of the
    API is moved to a new function which is invoked directly from
    the CBLAS and BLAS wrappers.

AMD-Internal: [SWLCSG-1477]

Change-Id: I8380b6468683028035f2aece48916939e0fede8a
2022-08-29 09:47:19 +05:30
Chandrashekara K R
d925ebeb06 CBLAS/BLAS interface decoupling for level 3 APIs
->In BLIS the CBLAS interface is implemented as a wrapper around
  the BLAS interface. For example, the CBLAS API ‘cblas_dgemm’
  internally invokes the BLAS API ‘dgemm_’.
->If the end user wants to use different libraries for CBLAS
  and BLAS, the current implementation of BLIS doesn’t allow it and
  may result in recursion.
->This change separates the CBLAS and BLAS implementations by adding
  an additional level of abstraction. The implementation of the
  API is moved to a new function which is invoked directly from
  the CBLAS and BLAS wrappers.
AMD-Internal: [SWLCSG-1477]

Change-Id: I6218a3e81060fc8045f4de0ace87f708465dfae5
2022-08-26 05:54:29 -04:00
satish kumar nuggu
2114a43df8 Fixes to avoid Out of Bound Memory Access in TRSM small algorithm
Details:
1. Fixed the issues corresponding to out-of-bounds memory access during
load and store.
2. In intrinsic code:
   i. AVX2 registers can hold 4 double elements.
   ii. In the remainder case, the number of elements is less than the
vector register width. Though the required number of elements is less
than 4, we were reading and writing in chunks of 4 elements due to
vectorization. This might cause out-of-bounds memory access.
3. Redesigned the code to restrict out-of-bounds access by loading and
storing the exact number of elements required.

AMD-Internal: [SWLCSG-1470]
Change-Id: I786f8023cf5a5f3e5343bea413c59bd0e764df9b
2022-08-26 01:06:10 -04:00
Harihara Sudhan S
5ca632e0f0 Added API to check for BF16 ISA support
- Checking for AVX512 bfloat16 instruction support in the
  architecture using CPUID

AMD-Internal: [CPUPL-2446]
Change-Id: I088a8aa46b037af837b2e58a96b59eae70c1dbf0
2022-08-25 11:00:35 -04:00
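A CPUID-based check like the one in 5ca632e0f0 might look roughly like the sketch below. It is not the BLIS API: the function names are hypothetical, it relies on the GCC/Clang <cpuid.h> helper, and it only reads the AVX512_BF16 feature flag (CPUID leaf 7, sub-leaf 1, EAX bit 5). A complete check would also confirm that the operating system enables AVX-512 register state, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#endif

/* Returns true when the given bit (0..31) of reg is set. */
static bool bit_is_set(unsigned int reg, unsigned int bit)
{
    assert(bit < 32u);
    return ((reg >> bit) & 1u) != 0u;
}

/* Hypothetical helper: does the CPU report AVX512_BF16? */
bool cpu_reports_avx512_bf16(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* Sub-leaf 0 of leaf 7 returns the highest valid sub-leaf in EAX. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || eax < 1u)
        return false;
    __get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx);
    return bit_is_set(eax, 5u);   /* AVX512_BF16 feature flag */
#else
    return false;                 /* not an x86 target */
#endif
}
```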
mkadavil
584069bf74 Parametric ReLU post-ops support for u8s8s32 and u8s8s16 GEMM.
-Parametric ReLU is the generalization of leaky ReLU in which the
leakage coefficient is tunable. The support for the same is added
following the register-level fusion technique.
-Low precision bench enhancement to check accuracy/performance of
low precision gemm with PReLU.
-Bug fixes in low precision gemm kernels.

AMD-Internal: [CPUPL-2442]
Change-Id: I81336405b185a994297d122b2d868b758ae6dad5
2022-08-25 13:33:02 +05:30
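For reference, the parametric ReLU mentioned in 584069bf74 keeps positive values and scales the rest by a tunable coefficient. The sketch below shows only that definition in scalar form on float data; it is an assumption-laden illustration, not the fused integer kernel code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* PReLU: f(x) = x for x > 0, alpha * x otherwise. */
static float prelu(float x, float alpha)
{
    return (x > 0.0f) ? x : alpha * x;
}

/* Applies PReLU in place to n output elements. */
void prelu_inplace(float *c, size_t n, float alpha)
{
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        c[i] = prelu(c[i], alpha);
}
```

With alpha fixed to a small constant this reduces to leaky ReLU, and with alpha equal to zero it reduces to plain ReLU.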
eashdash
4e3e00fb7e Added low precision GEMM - bf16bf16f32of32
Feature Addition: Added a new variant of low precision GEMM to addon - BFloat16. The kernel takes bf16 type inputs and performs BF16 GEMM operations. The intermediate accumulation and output are in float.

1. Compute kernels will perform computations only if B matrix is reordered in accordance with the usage of AVX-512 BF16 instruction - dpbf16_ps
2. Kernel for packing B matrix is provided

Change-Id: If5d08213068869eff060c9998596d2d2703a6793
2022-08-24 03:27:00 -04:00
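The data flow named by bf16bf16f32of32 (bf16 inputs, float accumulation, float output) can be illustrated with a scalar reference. This sketch assumes the standard bfloat16 layout (the upper 16 bits of an IEEE-754 binary32 value) and row-major storage; it does not use the reordered B layout or the AVX-512 BF16 instruction from the commit, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;   /* hypothetical alias for raw bf16 bits */

/* Widen bf16 to float by placing its bits in the upper half of a binary32. */
static float bf16_to_f32(bf16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reference row-major GEMM: C (m x n) += A (m x k) * B (k x n),
   accumulating in float. */
void ref_gemm_bf16_f32(int m, int n, int k,
                       const bf16_t *a, const bf16_t *b, float *c)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = c[i * n + j];
            for (int p = 0; p < k; ++p)
                acc += bf16_to_f32(a[i * k + p]) * bf16_to_f32(b[p * n + j]);
            c[i * n + j] = acc;
        }
    }
}
```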
satish kumar nuggu
219c41ded9 ZTRSM Improvements
Details:

1. Optimized ztrsm for small sizes up to 500 in multi-thread scenarios.
2. Enabled multithreading execution for bli_trsm_small implementation
for double complex data type.
3. Added decision logic to choose between native vs multi-threaded small
path for sizes up to 500 and threads up to 8.

AMD-Internal: [CPUPL-2340]
Change-Id: I4df9d7e6ee152baa9cf33e58d36e1c17f75a00c1
2022-08-20 05:08:47 -04:00
satish kumar nuggu
0b81f53074 Fixed bug in DZGEMM
1. In zen4, the dgemm and sgemm native kernels are column-preferring
kernels, while the cgemm and zgemm native kernels are row-preferring
kernels. zen3 and older architectures use row-preferring kernels for all
datatypes, hence the induced transpose was carried out based on the
kernel preference check alone. Added a condition so that the output
matrix storage format is checked along with the kernel preference, to
avoid the induced transpose for zen4.

2. Added functions bli_cntx_l3_vir_ukr_dislikes_storage_of_md,
bli_cntx_l3_vir_ukr_prefers_storage_of_md for checking output matrix
storage format and micro kernel preference of mixed datatypes.

AMD-Internal: [CPUPL-2347]
Change-Id: Ib77676f4e2152f7876ad7dc91de716547f5ba3a5
2022-08-19 12:37:01 -04:00
Shubham Sharma
b8b339416a DGEMMT optimizations
Details:

1. For lower and upper, "B" column major storage variants of gemmt,
   new kernels are developed and optimized to compute only the
   required outputs in the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to
   compute Lower and Upper Variant diagonal outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I9748b2b52557718e7497ecf046530d3031636a63
2022-08-19 12:31:35 -04:00
Arnav Sharma
035ed98b51 Temporarily disabling optimized ZHER
- Disabling optimized ZHER pending verification with netlib BLAS test.

AMD-Internal: [CPUPL-2416]
Change-Id: I74c4d16e1c99ddeb1df91130a8e14feafd0952d0
2022-08-19 12:16:46 -04:00
Edward Smyth
6861fcae91 BLIS: Improve architecture selection at runtime
Make BLIS_ARCH_TYPE=0 be an error, so that incorrect meaningful names
will get an error rather than "skx" code path. BLIS_ARCH_TYPE=1 is
now "generic", so that it should be constant as new code paths are
added. Thus all other code path enum values have increased by 2.

Also added new options to BLIS configure program to allow:
1. BLIS_ARCH_TYPE functionality to be disabled, e.g.:

./configure --disable-blis-arch-type amdzen

2. Renaming the environment variable tested from "BLIS_ARCH_TYPE" to a
   specified value, e.g.:

./configure --rename-blis-arch-type=MY_NAME_FOR_ARCH_TYPE amdzen

On Windows, these can be enabled with e.g.:

cmake ... -DDISABLE_BLIS_ARCH_TYPE=ON

or

cmake ... -DRENAME_BLIS_ARCH_TYPE=MY_NAME_FOR_ARCH_TYPE

This implements changes 2 and 3 in the Jira ticket below.

AMD-Internal: [CPUPL-2235]
Change-Id: Ie42906bd909f9d83f00a90c5bef9c5bf3ef5adb4
2022-08-19 10:59:35 -04:00
Sireesha Sanga
22af681a11 Runtime Thread Control Feature Update
Details:
1.  Runtime Thread Control Feature is enhanced to create a provision
    for the application to allocate a different number of threads to
    BLIS from the number of threads the application is using for itself.

2.  In the previous implementation, if application sets BLIS_NUM_THREADS
    with a valid value, BLIS internally calls omp_set_num_threads() API with
    same value. Due to this, application could not differentiate between
    the number of threads used in BLIS library and the application.

3.  With the current solution, if Application wants to allocate
    different number of threads for BLIS API and application, Application
    can choose either BLIS_NUM_THREADS environment variable or
    bli_thread_set_num_threads(nt) API for BLIS,
    and OpenMP APIs or environment variables for itself, respectively.

4.  If BLIS_NUM_THREADS is set with a valid value, same value
    will be used in the subsequent parallel regions unless
    bli_thread_set_num_threads() API is used by the Application
    to modify the desired number of threads during BLIS API execution.

5.  Once BLIS_NUM_THREADS environment variable or
    bli_thread_set_num_threads(nt) API is used by the application,
    BLIS module would always give precedence to these values. BLIS API would
    not consider the values set using OpenMP API omp_set_num_threads(nt) API
    or OMP_NUM_THREADS environment variable.

6.  If BLIS_NUM_THREADS is not set, then if Application is multithreaded and
    issued omp_set_num_threads(nt) with desired number of threads,
    omp_get_max_threads() API will fetch the number of threads set earlier.

7.  If BLIS_NUM_THREADS is not set, omp_set_num_threads(nt) is not called
    by the application, but only OMP_NUM_THREADS is set,
    omp_get_max_threads() API will fetch the value of OMP_NUM_THREADS.

8.  If both environment variables are not set, or if they are set with
    invalid values, and omp_set_num_threads(nt) is not issued by
    application, omp_get_max_threads() API will return the number of the
    cores in the current context.

9.  BLIS will initialize rntm->num_threads with the same value.
    However, if omp_set_nested is false, BLIS APIs called from parallel
    threads will run sequentially. But if nested parallelism is enabled,
    then each application thread will launch MT BLIS.

10. Order of precedence used for number of threads:
      0. value set using bli_thread_set_num_threads(nt) by the application
      1. valid value set for BLIS_NUM_THREADS environment variable
      2. omp_set_num_threads(nt) issued by the application
      3. valid value set for OMP_NUM_THREADS environment variable
      4. Number of cores

11. If nt is not a valid value for omp_set_num_threads(nt) API,
    number of threads would be set to 1.
    omp_get_max_threads() API will return 1.

12. OMP_NUM_THREADS env. variable is applicable only when OpenMP is enabled.

AMD-Internal: [CPUPL-2342]
Change-Id: I2041ac1d824f0b57a23a2a69abd6017c800f21b6
2022-08-19 05:43:01 -04:00
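The order of precedence listed in item 10 of 22af681a11 can be written down as a small pure function. This is a sketch under stated assumptions, not the BLIS implementation: the function and macro names are hypothetical, each source is passed in as a plain integer instead of being read from OpenMP or the environment, and any non-positive value is treated as "not set" (so the special case in item 11 is not modelled).

```c
#include <assert.h>

#define NT_UNSET (-1)   /* hypothetical marker for "source not set" */

static int nt_is_valid(int nt)
{
    return nt > 0;
}

/* Picks the thread count from the first valid source, in precedence order. */
int resolve_num_threads(int api_nt,       /* bli_thread_set_num_threads(nt) */
                        int blis_env_nt,  /* BLIS_NUM_THREADS               */
                        int omp_api_nt,   /* omp_set_num_threads(nt)        */
                        int omp_env_nt,   /* OMP_NUM_THREADS                */
                        int num_cores)    /* fallback: number of cores      */
{
    assert(num_cores > 0);
    if (nt_is_valid(api_nt))      return api_nt;
    if (nt_is_valid(blis_env_nt)) return blis_env_nt;
    if (nt_is_valid(omp_api_nt))  return omp_api_nt;
    if (nt_is_valid(omp_env_nt))  return omp_env_nt;
    return num_cores;
}
```

For example, with BLIS_NUM_THREADS at 8 and OMP_NUM_THREADS at 2, the BLIS value wins unless the application has also called bli_thread_set_num_threads().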
Vignesh Balasubramanian
cf31fcd020 Fine tuned threshold and aocl dynamic for zgemm for skinny matrices.
-Updated optimal threads in zgemm sup path for skinny matrices.
-Fine tuned the threshold values for small and sup paths
 to improve overall zgemm.
-Zgemm small is selected for inputs with transb as N.
-Redirection of input among small, sup and native path
 was fine tuned.

AMD-Internal : [CPUPL-1900]

Change-Id: Ide37c8255def770b4b74bc6e7c6edb5ee15d3b1f
2022-08-19 01:19:14 -04:00
Shubham Sharma
32c9239c7f Optimization of DGEMMT SUP kernels
Details:
1. Optimized the kernels by replacing the macros with
   the actual computation of required output elements.

AMD-Internal: [CPUPL-2341]
Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5
2022-08-18 08:30:19 -04:00
Shubham Sharma
8adef27aca Optimization of DGEMMT SUP kernel for beta zero cases.
Details:
1. In kernels for non-transpose variants, changes
   are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
   GTestSuite(Functionality, Valgrind, Integer Tests)
   and Netlib Tests.
3. Fixed warnings during the build process.

AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
2022-08-18 08:05:58 -04:00
Edward Smyth
7f322da01d BLIS: BLAS3 quick return functionality
Implement netlib BLAS style quick return functionality for when no
work is required. Similar functionality was already in HERK and HER2K
routines.

AMD copyrights updated.

AMD-Internal: [CPUPL-2373]
Change-Id: I0ebe9d76465b0e48b2ff5c2f1cc2a75763fe187c
2022-08-18 04:09:56 -04:00
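For GEMM, the netlib reference returns early when either output dimension is zero, or when the product term vanishes and beta is one. The sketch below expresses that condition in C; the function name is hypothetical, and 7f322da01d may structure the check differently for each BLAS3 routine.

```c
#include <assert.h>
#include <stdbool.h>

/* True when C := alpha*A*B + beta*C would leave C unchanged,
   following the quick-return test in the netlib reference DGEMM. */
bool gemm_quick_return(int m, int n, int k, double alpha, double beta)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    return m == 0 || n == 0 ||
           ((alpha == 0.0 || k == 0) && beta == 1.0);
}
```

Note that k == 0 alone is not enough: when beta differs from one, C still has to be scaled.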
mkadavil
171fb7358d SGEMM Optimization
-sup GEMM has 2 variants: var2m (block-panel) and var1n (panel-block).
We added decision logic to choose between var1n and var2m for single
thread SGEMM. var1n is the favorable option when "n" is very large
compared to "m".
-Also fixed a bug related to fetching "MR" "NR" values in
bli_gemmsup_int(). We replaced "bli_cntx_get_blksz_def_dt()" (used for
native) with "bli_cntx_get_l3_sup_blksz_def_dt()".

AMD-Internal: [CPUPL-2406]
Change-Id: If36529015b1c5f8f87eb40c05ebcf433c471d4d5
2022-08-18 01:39:42 -04:00
Harsh Dave
46e7727ea8 DGEMM Improvements
- In case of DGEMM, when m, n and leading dimensions are large,
  packing of A and B matrices is required for optimal performance.

- Modified decision logic to choose between sup vs native;
  now apart from matrix dimensions, we also incorporate matrix
  leading dimensions into this decision.

AMD-Internal: [CPUPL-2366]
Change-Id: I255db5f7049d783e22d7c912edf8bbf023e32ed8
2022-08-18 01:14:40 -04:00
Nallani Bhaskar
39196d163e Enable packing of A & B dynamically in dgemmsup.
Details:
- When work distributed for each thread is larger than caches,
  it is advisable to perform packing of B for sup dgemm.

- Work distribution per thread is calculated based on the values
  of jc_nt and ic_nt.

- For RRC and CRC cases we want to avoid rd kernels which are not
  efficient in performance compared to rv kernels. Therefore we
  perform packing of A as well so that rv kernels are invoked for
  these cases.

- These changes result in improved DGEMM performance.

- Dynamic packing is done using the API "bli_rntm_set_pack_b( 1, rntm )"

Change-Id: I8344520b4a2591e57518bb54183a15957f60f94b
2022-08-17 12:36:15 -04:00
Harsh Dave
e2e1dadee1 DGEMM Improvements
- We prefetch next panel while packing 8xk panel.
- Modified prefetch offsets for dgemm native and 
  dgemm_small kernel.

AMD-Internal: [CPUPL-2366]

Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
2022-08-16 08:07:30 -04:00
satish kumar nuggu
88e44c64e3 Fixed Memory Leaks in TRSM
1. Fixed the memory leaks in corner cases which were caused by extra
loads in all datatypes (s,d,c,z).
2. In remainder cases, instead of loading the required number of elements,
extra elements were loaded, which led to memory leaks. Fixed memory leaks by
restricting the number of loads to the required number of elements.

AMD-Internal: [CPUPL-2280]

Change-Id: Ia49a02565e01d5ed05e98090b7773a444587cd8a
2022-08-16 05:08:40 -04:00
mkadavil
6fbdfc3cf2 Low precision gemm refactoring and bug fixes.
-The micro-kernel function signatures follow a common pattern. These
functions can be represented as an instantiation of a MACRO as is done
in BLIS, and thus the number of micro-kernel header files can be brought
down. A new single header file containing all the MACRO definitions with
the instantiation is added, and the existing unnecessary header files
are removed.
-The bias addition in micro-kernel for n remaining < 16 reads the bias
array assuming it contains 16 elements. This can result in seg-faults,
since out of bound memory is accessed. It is fixed by copying required
elements to an intermediate buffer and using that buffer for loading.
-Input matrix storage type parameter is added to lpgemm APIs. It can be
either row or column major, denoted by r and c respectively. Currently
only row major input matrices are supported.
-Bug fix in s16 fringe micro-kernel to use correct offset while storing
output.

AMD-Internal: [CPUPL-2386]
Change-Id: Idfa23e69d54ad7e06a67b1e36a5b5558fbff03a3
2022-08-14 17:39:00 +05:30
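The bias fix in 6fbdfc3cf2 (copying only the required elements to an intermediate buffer before loading) can be illustrated in portable C. This is a sketch, not the lpgemm kernel: the function name and VEC_LANES constant are hypothetical, and a scalar loop stands in for the 16-lane vector load and add.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_LANES 16   /* hypothetical: lanes of 32-bit data per vector */

/* Adds bias to a fringe row of C holding n_rem (< 16) valid columns.
   Only n_rem bias values are read from the caller's array. */
void add_bias_fringe(int32_t *c_row, const int32_t *bias, int n_rem)
{
    assert(n_rem > 0 && n_rem < VEC_LANES);

    /* Zero-padded staging buffer: a full-width load from it is always
       in bounds, unlike a full-width load from bias itself. */
    int32_t staged[VEC_LANES] = {0};
    memcpy(staged, bias, (size_t)n_rem * sizeof staged[0]);

    /* A real kernel would issue one vector load from staged here. */
    for (int j = 0; j < n_rem; ++j)
        c_row[j] += staged[j];
}
```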
Shubham Sharma
f5ef30a44a Fix in DGEMMT SUP kernel
Details:
1. Due to error in C output buffer address computation in
   kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid
   memory is being accessed. This is causing seg fault in
   libflame netlib testing.
2. Validated the fix with libflame netlib testing.

AMD-Internal: [CPUPL-2341]
Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed
2022-08-12 21:03:36 +05:30
Arnav Sharma
a226e54421 AVX512 based SGEMM Optimizations
- Updated with optimal cache-blocking sizes for MC, KC and NC for AVX512 Native SGEMM kernel.

AMD-Internal: [CPUPL-2385]
Change-Id: I1feae5ac79e960c6b26df24756d460243820b797
2022-08-12 02:33:39 -04:00
Dipal M. Zambare
c85bbfdb50 Updated BLIS version string format
- Updated version string to match the recommended format
  “AOCL-BLIS 3.2.1 Build 20220727”.
- Fixed issues with include paths which were preventing the compile time
  version string definition from being passed via build commands.
- Removed version string determination based on git tag
  using ‘git describe’, version string will always be taken from
  the version file.

AMD-Internal: [CPUPL-2324]
Change-Id: Idc7edf1211f66d348ec3b5b43f2507c2b810f088
2022-08-12 05:53:35 +00:00
Shubham Sharma
4bca7f6f4a DGEMMT optimizations
Details:

1. For lower and upper, non-transpose variants of gemmt, new kernels
   are developed and optimized to compute only the required outputs in
   the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Kernels specific to compute Lower and Upper Variant diagonal
   outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a
2022-08-11 06:16:33 -04:00
Mangala.V
8504ef013d Optimisation of DTRSM and ZTRSM
1. Extract instruction replaced with cast when accessing the first 128 bits,
   as the cast instruction needs no cycles but extract takes a few cycles
2. Added prefetch of A buffer when computing gemm operation
3. Added prefetch of C11 buffer before TRSM operation, with offset of 7 to cs_c

With the above changes, performance improvements are observed in the single-thread case

Change-Id: Id377c490ddac8b06384acfa9a6d89dbe11bbc7be
2022-08-11 01:39:40 -04:00
Edward Smyth
737e08cd7a BLIS: Improve architecture selection at runtime
Enable meaningful names as options for BLIS_ARCH_TYPE environment
variable. For example,
BLIS_ARCH_TYPE=zen4 or BLIS_ARCH_TYPE='ZEN4' or BLIS_ARCH_TYPE=6
will select the same code path (in this release). The meaningful
names are not case sensitive.

This implements change 1 in the Jira ticket below.

Following review comments:
1. Use names from arch_t enum in function bli_env_get_var_arch_type()
   rather than directly using numbers.
2. AMD copyrights updated.

AMD-Internal: [CPUPL-2235]
Change-Id: I8cfd43d34765d5e8c7e35680d18825d9934753ad
2022-08-10 08:26:49 -04:00
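Case-insensitive matching of meaningful names, as described in 737e08cd7a, could be sketched as below. The enum, its values and the function names are hypothetical and do not reflect the real arch_t numbering; numeric strings such as "6" are not handled in this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical enum for illustration only. */
typedef enum {
    DEMO_ARCH_UNKNOWN = -1,
    DEMO_ARCH_ZEN3,
    DEMO_ARCH_ZEN4,
    DEMO_ARCH_GENERIC
} demo_arch_t;

/* Compares two strings while ignoring ASCII case. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == '\0' && *b == '\0';
}

/* Maps a name such as "zen4" or "ZEN4" to an enum value. */
demo_arch_t demo_parse_arch_name(const char *s)
{
    static const struct { const char *name; demo_arch_t id; } table[] = {
        { "zen3",    DEMO_ARCH_ZEN3    },
        { "zen4",    DEMO_ARCH_ZEN4    },
        { "generic", DEMO_ARCH_GENERIC },
    };
    assert(s != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i)
        if (equal_ignore_case(s, table[i].name))
            return table[i].id;
    return DEMO_ARCH_UNKNOWN;
}
```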
Harihara Sudhan S
d1eaf65a26 Post-Ops for u8s8s16os16
Functionality - Post-ops is an operation performed on every element
of the output matrix after GEMM operation is completed.

	- Post-ops relu and bias added to all the compute kernels
	  of u8s8s16os16
	- Post-ops are done on the value loaded into the register
	  to avoid reloading of C matrix elements
	- Minor bug fixes in openmp thread decorator of lpgemm
	- Added test cases to lpgemm bench input file

AMD-Internal: [CPUPL-2171]

Change-Id: If49f763fdfac19749f6665c172348691165d8631
2022-08-09 14:52:41 +05:30
Harihara Sudhan S
60de0a1856 Multithreading and support for unpacked B matrix in u8s8s16os16
Functionality - When the B matrix is not reordered before the
u8s8s16os16 compute kernel call, packing of the B matrix is done as
part of the five loop algorithm. The state of the B matrix (packed
or unpacked) is given as a user input.

	- Packing of B matrix is done as part of the five loop
	  compute.
	- Temporary buffer for pack B is allocated in the five
	  loop algorithm
	- Multithreading for computation kernel
	- Configuration constants for u8s8s16os16 are part of the
	  lpgemm config

AMD-Internal: [CPUPL-2171]

Change-Id: I22b4f0ec7fc29a2add4be0cff7d75f92dd3e60b8
2022-08-05 19:28:37 +05:30
mkadavil
828d3cd3d3 Post operations support for low precision gemm.
- Low precision gemm is often used in ML/DNN workloads and is used
in conjunction with pre and post operations. Performing gemm and ops
together at the micro kernel level results in better overall performance
due to cache/register reuse of output matrix. The provision for defining
the post-operations and invoking the micro-kernel with it from the
framework is added as part of this change. This includes adding new data
structures/functions to define the post-ops to be applied and an
extensible template using which new post-ops can easily be integrated.
As for the post-operations, RELU and Bias Add for u8s8s32 is implemented
in this first cut.
- aocl_gemm bench modifications to test/benchmark RELU and Bias Add.

AMD-Internal: [CPUPL-2316]
Change-Id: Iad5fe9e54965bb52d5381ae459a69800946c7d18
2022-08-05 11:53:05 +05:30
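The idea in 828d3cd3d3 of applying a list of post-operations to each output value before it is stored can be sketched in scalar C. The types and names below are hypothetical and far simpler than the actual lpgemm data structures; the sketch also assumes one bias value per output column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical post-op description. */
typedef enum { POST_OP_BIAS, POST_OP_RELU } post_op_kind;

typedef struct {
    post_op_kind   kind;
    const int32_t *bias;   /* used by POST_OP_BIAS only, one value per column */
} post_op;

/* Applies the post-ops in order to one row of n accumulated results.
   Each value is updated while it is "in a register" and stored once. */
void apply_post_ops(int32_t *c_row, int n, const post_op *ops, size_t num_ops)
{
    assert(n >= 0);
    for (int j = 0; j < n; ++j) {
        int32_t v = c_row[j];
        for (size_t i = 0; i < num_ops; ++i) {
            switch (ops[i].kind) {
            case POST_OP_BIAS: v += ops[i].bias[j]; break;
            case POST_OP_RELU: if (v < 0) v = 0;    break;
            }
        }
        c_row[j] = v;
    }
}
```

Adding a new post-op in this sketch means adding one enum value and one switch case, which is the kind of extensibility the commit message refers to.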
Harihara Sudhan S
e5d4fc2a70 Added low precision GEMM (u8s8s16os16)
Feature Addition : Added low precision GEMM to addon. The
kernel takes unsigned int8 and signed int8 as inputs and
performs GEMM operation. The intermediate accumulation and
output are in signed int16.

	- The compute kernel will perform computation only
	  if the B matrix is reordered to suit the usage of the AVX2
	  instruction vpmaddubsw.
	- Kernel for packing the B matrix is provided.
	- LPGEMM bench code was modified to test the
	  performance and accuracy of the new variant.

AMD-Internal: [CPUPL-2171]

Change-Id: Id9a6d90b79f4bf82fb2e2f3093974dbf37275f9b
2022-08-02 02:20:00 -04:00
Devin Matthews
76fbf1233d Add vzeroupper to Haswell microkernels. (#524)
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
  microkernels so as to avoid a performance penalty when mixing AVX
  and SSE instructions. These vzeroupper instructions were once part
  of the haswell kernels, but were inadvertently removed during a source
  code shuffle some time ago when we were managing duplicate 'haswell'
  and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
  and re-inserting the missing instructions.

Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9
2022-08-01 09:29:04 +05:30
Field G. Van Zee
2a81437bd8 Fixed bugs in cpackm kernels, gemmlike code.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
  bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
  kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
  of 4 bytes) from the real component. This was almost certainly a
  copy-paste bug carried over from the corresponding zpackm kernels. Thanks to
  Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. (This bug was never observed in output but rather
  noticed analytically. It probably would have also manifested as
  intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
  relating to debugging.

Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011
2022-08-01 09:11:58 +05:30
Minh Quan Ho
6d4d6a7514 Alloc at least 1 elem in pool_t block_ptrs. (#560)
Details:
- Previously, the block_ptrs field of the pool_t was allowed to be
  initialized as any unsigned integer, including 0. However, a length of
  0 could be problematic given that malloc(0) is undefined and therefore
  variable across implementations. As a safety measure, we check for
  block_ptrs array lengths of 0 and, in that case, increase them to 1.
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>

Change-Id: I1e885d887aaba5e73df091ef52e6c327fd6418de
2022-08-01 08:45:00 +05:30
Minh Quan Ho
3c01fcb9fc Fix insufficient pool-growing logic in bli_pool.c. (#559)
Details:
- The current mechanism for growing a pool_t doubles the length of the
  block_ptrs array every time the array length needs to be increased
  due to new blocks being added. However, that logic did not take into
  account the new total number of blocks, and the fact that the caller
  may be requesting more blocks than would fit even after doubling the
  current length of block_ptrs. The code comments now contain two
  illustrating examples that show why, even after doubling, we must
  always have at least enough room to fit all of the old blocks plus
  the newly requested blocks.
- This commit also happens to fix a memory corruption issue that stems
  from growing any pool_t that is initialized with a block_ptrs length
  of 0. (Previously, the memory pool for packed buffers of C was
  initialized with a block_ptrs length of 0, but because it is unused
  this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho <minh-quan.ho@kalray.eu>

Change-Id: Ie4963c56e03cbc197d26e29f2def6494f0a6046d
2022-08-01 08:19:30 +05:30
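The growth rule described in 3c01fcb9fc (double the array, but never end up with less room than the old blocks plus the newly requested ones) can be sketched as a single function. The name and signature are hypothetical and the real bli_pool.c logic may compute the new length differently.

```c
#include <assert.h>
#include <stddef.h>

/* Returns a new block_ptrs length that is at least double the current
   length and always large enough for num_blocks + num_add entries. */
size_t grown_block_ptrs_len(size_t cur_len, size_t num_blocks, size_t num_add)
{
    assert(num_blocks <= cur_len);
    size_t needed  = num_blocks + num_add;
    size_t doubled = 2 * cur_len;
    return (doubled >= needed) ? doubled : needed;
}
```

Doubling alone fails in two situations covered here: a large request (4 blocks held, 10 more requested, doubled length only 8) and a current length of 0, where doubling yields 0.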
Devin Matthews
3d655a951b Fix data race in testsuite.
Change-Id: I7704037bad0f7485e7b352de68c2c4535d364226
2022-08-01 07:49:19 +05:30
Devin Matthews
9495401b73 Fix more copy-paste errors in the haswell gemmsup code.
Fixes #486.

Change-Id: I568386b5d67a698ea9c0b6b17f133df86c2894bd
2022-07-31 21:36:21 +05:30
Devin Matthews
ea163fc23b Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
The fix is to use the same (valid) source register twice in the horizontal addition.

Change-Id: I96ed39e289aaeeb44be9117074b32bd8d4c19de6
2022-07-31 21:15:28 +05:30
Field G. Van Zee
faff30b46a Fixed out-of-bounds bug in sup s6x16m haswell kernel.
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
  assembly kernels. This bug is similar to the one fixed in 17b0caa
  and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
  Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.

Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9
2022-07-31 21:10:58 +05:30
Field G. Van Zee
4b1663213c Fixed out-of-bounds read in haswell gemmsup kernels.
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
  kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
  single-precision elements of C, via instructions such as:

	vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

  in situations where only two elements are guaranteed to exist. (These
  bugs may not have manifested in earlier tests due to the leading
  dimension alignment that BLIS employs by default.) The issue was fixed
  by replacing lines like the one above with:

	vmovsd(mem(rcx), xmm0)
	vfmadd231ps(xmm0, xmm3, xmm4)

  Thus, we use vmovsd to explicitly load only two elements of C into
  registers, and then operate on those values using register addressing.
  Thanks to Daniël de Kok for reporting these bugs in #635, and to
  Bhaskar Nallani for proposing the fix.
- CREDITS file update.

Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b
2022-07-31 21:06:08 +05:30
Nallani Bhaskar
1d31386c02 Fixed few out of bound memory reads in sgemmsup kernels
Details:
- Fixed memory access bugs in the bli_sgemmsup_rd_zen_asm_s1x16()
  kernel. The bugs were caused by loading four
  single-precision elements of C, via instructions such as:

	vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4)

  or

	vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

  in situations where only two elements are guaranteed to exist. (These
  bugs may not have manifested in earlier tests due to the leading
  dimension alignment that BLIS employs by default.) The issue was fixed
  by replacing lines like the one above with:

	vmovsd(mem(rcx), xmm0)
	vfmadd231ps(xmm0, xmm3, xmm4)

  Thus, we use vmovsd to explicitly load only two elements of C into
  registers, and then operate on those values using register addressing.

  AMD_CPUPLID: CPUPL-2279

Change-Id: Ic39290d651f5218b2e548351a87ac5e4b5b79c68
2022-07-29 09:09:12 -04:00
Chandrashekara K R
fde812015f Updated blis library version from 4.0 to 3.2.1
AMD-Internal: [CPUPL-2322]
Change-Id: I3a6a61543dd2754e2590d7f5f22442c9fdeaee95
2022-07-29 15:55:10 +05:30
Mangala V
6c1acc74c8 ZGEMM optimizations
-- Conditional packing of the B matrix is enabled in the zgemmsup path,
   which performs better when the B matrix is large

-- Incorporated decision logic to choose between zgemm_small vs
   zgemm sup based on matrix dimensions "m, n and k".

-- ZGEMV is called when matrix dimension m or n = 1.
   Very good performance improvement is observed.

Change-Id: I7c64020f4f78a6a51617b184cc88076213b5527d
2022-07-28 14:55:24 +05:30
Vignesh Balasubramanian
808d79a610 Implemented efficient ZGEMM algorithm when k=1
Problem statement :
To improve the performance of the zgemm kernel for dealing with input sizes with k=1 by fine tuning its previous implementation.
In the previous implementation, usage of SIMD parallelism along m and n dimensions instead of the k dimension proved to provide a better performance to the zgemm kernel. This code was subjected to further improvements along the following lines:

- Cases to deal with alpha=0 and beta!=0 (i.e. just scaling of C) were handled at the beginning separately, using the bli_zscalm api.
- Register blocking was further improved, increasing the kernel size from 4x5 to 4x6.
- Prefetching was added to the code, by empirically finding out a suitable value to be added to the pointer. Overall, it provided a mild improvement to the performance.
- Conditional statements were removed from the kernel loop, and a logic was deduced to allow such removal without affecting the output.

The performance improvement of this single threaded implementation also proved to compete with that of the default implementation for multiple threads, as long as m and n are under 128. An improvement to this patch would be to find out a suitable feature which would establish a relationship between the number of threads and the input size constraints, thereby providing a unique size constraint for different number of threads.

AMD-Internal: [CPUPL-2236]
Change-Id: I3d401c8fd78bec80ce62eef390fa85e6287df847
2022-07-28 02:09:45 -04:00
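With k=1, the GEMM in 808d79a610 reduces to a rank-1 update of C. The scalar reference below is only a sketch of that arithmetic: it assumes column-major storage and no transposition, the function name is hypothetical, and it contains none of the SIMD, register blocking or prefetching the commit describes. It mirrors the commit in one respect: the alpha=0 case is handled up front as a pure scaling of C.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* C (m x n, column-major, leading dimension ldc) := beta*C + alpha * a * b^T,
   where a holds the single column of A and b holds the single row of B. */
void zgemm_k1_ref(int m, int n, double complex alpha,
                  const double complex *a, const double complex *b,
                  double complex beta, double complex *c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldc >= (m > 0 ? m : 1));

    if (alpha == 0.0) {
        /* Only scaling of C is required. */
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                c[i + (size_t)j * ldc] *= beta;
        return;
    }

    for (int j = 0; j < n; ++j) {
        double complex ab = alpha * b[j];   /* hoisted out of the inner loop */
        for (int i = 0; i < m; ++i) {
            size_t idx = i + (size_t)j * ldc;
            c[idx] = beta * c[idx] + a[i] * ab;
        }
    }
}
```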
Kiran Varaganti
6054b888fb Fixed Bug in bench_trsm.c
When bli_trsm() API is called, we make sure the "side" argument is "side_t" and
not f77_char and argument is passed by value and not by its address.

Change-Id: I5a616eb054c034be2d67640b8ab3b9615706a8c9
2022-07-25 15:38:30 +00:00
Kiran Varaganti
eff436c653 Bug Fix to replace vzeroall
Fixed syntax in AVX512 dgemm native kernel.
zen4 configuration follows Intel ASM syntax whereas other AMD configs
follow AT&T ASM syntax. Bug was introduced due to following AT&T syntax
in AVX512 dgemm kernel. In this commit we changed the syntax to Intel ASM
format. src and dst operands are interchanged.

Change-Id: Ie61dc7c5e8309b79437d471331318f3104bcd447
2022-07-22 03:42:17 -04:00