92 Commits

Author SHA1 Message Date
Mangala V
e9124ffca7 BUGFIX: Updated ZGEMM microkernel to handle alpha = 0 case
BUG:
When both the real and imaginary parts of alpha are zero,
the output is computed as C = Beta * C + A * B instead of C = Beta * C

FIX:
Updated the kernel to scale the A * B product by alpha even when alpha=0

Existing framework design:
- When both the real and imaginary parts of alpha are zero, the framework
skips the kernel call to avoid the alpha * A * B operation
- SCALM is invoked to perform Beta * C

- No accuracy issue was observed because alpha=0 was handled in the framework
- If the kernel is called directly with alpha=0, the results would be wrong
- The issue was found during microkernel testing using gtestsuite

AMD-Internal: [CPUPL-4454]
Change-Id: Ib6113f5226cd7c26a63781cdd20d35660f453803
2024-06-20 02:58:43 -04:00
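The fixed behavior described in the commit above can be illustrated with a minimal reference sketch in C. This is not the assembly microkernel itself: the type `zval_t` and the functions `zval_mul` and `zgemm_ukr_sketch` are hypothetical names introduced for illustration, and a contiguous column-major layout is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a double-precision complex value. */
typedef struct { double real, imag; } zval_t;

/* Complex multiply: x * y. */
zval_t zval_mul(zval_t x, zval_t y)
{
    zval_t r = { x.real * y.real - x.imag * y.imag,
                 x.real * y.imag + x.imag * y.real };
    return r;
}

/* Reference sketch: C = beta*C + alpha*(A*B).
   A is m x k, B is k x n, C is m x n; all column-major and contiguous
   (an assumption made for this example). */
void zgemm_ukr_sketch(size_t m, size_t n, size_t k,
                      zval_t alpha, const zval_t *a, const zval_t *b,
                      zval_t beta, zval_t *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            zval_t ab = { 0.0, 0.0 };
            for (size_t p = 0; p < k; ++p) {
                zval_t t = zval_mul(a[i + p * m], b[p + j * k]);
                ab.real += t.real;
                ab.imag += t.imag;
            }
            /* Always scale the product by alpha, so alpha == 0 drops the
               A*B term instead of adding it unscaled. */
            zval_t sab = zval_mul(alpha, ab);
            zval_t bc  = zval_mul(beta, c[i + j * m]);
            c[i + j * m].real = bc.real + sab.real;
            c[i + j * m].imag = bc.imag + sab.imag;
        }
    }
}
```

Calling `zgemm_ukr_sketch` directly with alpha = 0 leaves only the Beta * C term, which is the result the framework previously obtained by skipping the kernel call.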
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Edward Smyth
50608f28df BLIS: Missing clobbers (batch 7)
Add missing clobbers in:
- bli_gemmsup_rv_haswell kernels
- spare copies of kernels in old, other and broken subdirectories
- misc kernels for legacy platforms

AMD-Internal: [CPUPL-3521]
Change-Id: I7cdb7fd1cb29630d8b7fa914b1002a270dfe9ef5
2023-11-22 17:51:46 -05:00
Edward Smyth
f471615c66 Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
2023-11-22 17:11:10 -05:00
Edward Smyth
c6f3340125 Merge commit '5013a6cb' into amd-main
* commit '5013a6cb':
  More edits and fixes to docs/FAQ.md.
  Fixed newly broken link to CREDITS in FAQ.md.
  More minor fixes to FAQ.md and Sandboxes.md.
  Updates to FAQ.md, Sandboxes.md, and README.md.
  Safelist 'master', 'dev', 'amd' branches.
  Re-enable and fix fb93d24.
  Reverted fb93d24.
  Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
  Removed last vestige of #define BLIS_NUM_ARCHS.
  Added new packm var3 to 'gemmlike'.
  Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
  Fix more copy-paste errors in the haswell gemmsup code.
  Do a fast test on OSX. [ci skip]
  Fix AArch64 tests and consolidate some other tests.
  Use C++ cross-compiler for ARM tests.
  Attempt to fix cxx-test for OOT builds.
  Updated travis-ci.org link in README.md to .com.
  Disabled (at least temporarily) commit 8e0c425.
  Define BLIS_OS_NONE when using --disable-system.
  Updated stale calls to malloc_intl() in gemmlike.
  Blacklist clang10/gcc9 and older for 'armsve'.
  Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.
  Moved lang defs from _macro_def.h to _lang_defs.h.
  Minor tweaks to gemmlike sandbox.
  Added local _check() code to gemmlike sandbox.
  README.md citation updates (e.g. BLIS7 bibtex).
  Tweaks to gemmlike to facilitate 3rd party mods.
  Whitespace tweaks.
  Add row- and column-strides for A/B in obj_ukr_fn_t.
  Clean up some warnings that show up on clang/OSX.
  Remove schema field on obj_t (redundant) and add new API functions.
  Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.
  Disabled sanity check in bli_pool_finalize().
  Implement proposed new function pointer fields for obj_t.

AMD-Internal: [CPUPL-2698]
Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d
2023-11-10 13:05:12 -05:00
Harsh Dave
77161c1e5d Design change of DGEMM 6x8 native kernel.
- The following optimizations are included for the dgemm 6x8 native kernel.
1) Reorganized the C update and store to reduce register dependencies.
2) Moved the C prefetch to part-way through the kernel so that the C matrix
is prefetched at an appropriate distance.
3) Offset the A matrix so that the kernel can use a smaller instruction
encoding, saving i-cache space.
4) Aligned the K iteration loop.

- Thanks to Moore, Branden <Branden.Moore@amd.com> for these design
  changes of DGEMM 6x8 native kernels.

- Additional change: reorganized the C update and store for the
beta zero case to facilitate out-of-order execution of the stores to the C
matrix.


Change-Id: I9d1ec8d39f1154b0f38b136bd6a04b05d7d1e6ba
2023-11-09 23:07:43 -05:00
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to the Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Edward Smyth
f5505be9f3 Merge commit 'e366665c' into amd-main
* commit 'e366665c':
  Fixed stale API calls to membrk API in gemmlike.
  Fixed bli_init.c compile-time error on OSX clang.
  Fixed configure breakage on OSX clang.
  Fixed one-time use property of bli_init() (#525).
  CREDITS file update.
  Added Graviton2 Neoverse N1 performance results.
  Remove unnecessary windows/zen2 directory.
  Add vzeroupper to Haswell microkernels. (#524)
  Fix Win64 AVX512 bug.
  Add comment about make checkblas on Windows
  CREDITS file update.
  Test installation in Travis CI
  Add symlink to blis.pc.in for out-of-tree builds
  Revert "Always run `make check`."
  Always run `make check`.
  Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. If the string contains both zen and zen2, and zen needs to be replaced with another string, then zen2 would also be incorrectly replaced.
  Update POWER10.md
  Rework POWER10 sandbox
  Skip clearing temp microtile in gemmlike sandbox.
  Fix asm warning
  Sandbox header edits trigger full library rebuild.
  Add vhsubpd/vhsubpd.
  Fixed bugs in cpackm kernels, gemmlike code.
  Armv8A Rename Regs for Safe Darwin Compile
  Armv8A Rename Regs for Clang Compile: FP32 Part
  Armv8A Rename Regs for Clang Compile: FP64 Part
  Asm Flag Mingling for Darwin_Aarch64
  Added a new 'gemmlike' sandbox.
  Updated Fugaku (a64fx) performance results.
  Add explicit compiler check for Windows.
  Remove `rm-dupls` function in common.mk.
  Travis CI Revert Unnecessary Extras from 91d3636
  Adjust TravisCI
  Travis Support Arm SVE
  Added 512b SVE-based a64fx subconfig + SVE kernels.
  Replace bli_dlamch with something less archaic (#498)
  Allow clang for ThunderX2 config

AMD-Internal: [CPUPL-2698]
Change-Id: I561ca3959b7049a00cc128dee3617be51ae11bc4
2023-10-18 09:09:54 -04:00
Harsh Dave
df80f40ccd Fixed incorrect ymm registers usage in FMA operation.
- Incorrect ymm registers were used in the dgemm SUP edge kernel
    while computing the FMA operation.

- The wrong vector registers produced incorrect results.

- Corrected the vector register usage for the FMA operation.

AMD-Internal: [CPUPL-3964]

Change-Id: I37fcb5f8eeb5945fe994d8a5b69815a3bcca87df
2023-10-02 03:20:44 -04:00
Harsh Dave
e437469a99 Optimized AVX2 DGEMM SUP edge kernels
- Edge kernels handle the corner cases, and especially in
cases where there is only a small amount of computation to
be done, executing FMAs efficiently becomes crucial.

- In the previous implementation, edge kernels used the same limited
number of vector registers to hold FMA results, which indirectly created
a dependency on the previous FMA completing before the CPU could issue a new FMA.

- This commit addresses the issue by using the other vector registers
that are available to hold FMA results.

- That way FMA results are held in two sets of vector registers, so that
subsequent FMAs do not have to wait for the previous FMA to complete.

- At the end of the unrolled K loop these two sets of vector registers are
added together to store the correct result in the intended vector registers.

AMD-Internal: [CPUPL-3574]
Change-Id: I48fa9e29b6650a785321097b9feeddc3326e3c54
2023-09-22 03:43:47 -04:00
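The two-accumulator idea from the commit above can be sketched in scalar C. This is only an illustration of the dependency-breaking technique, not the AVX2 kernel; the function name `dot_two_accumulators` and the unroll factor of 2 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product over k using two independent accumulation chains, so each
   multiply-add depends only on the previous result of its own chain. */
double dot_two_accumulators(const double *a, const double *b, size_t k)
{
    assert(a != NULL && b != NULL);
    double acc0 = 0.0;  /* first set of partial sums  */
    double acc1 = 0.0;  /* second set of partial sums */
    size_t p = 0;

    /* Unrolled K loop: the two updates do not depend on each other. */
    for (; p + 1 < k; p += 2) {
        acc0 += a[p]     * b[p];
        acc1 += a[p + 1] * b[p + 1];
    }
    /* Leftover iteration when k is odd. */
    if (p < k)
        acc0 += a[p] * b[p];

    /* Combine the two sets at the end of the loop. */
    return acc0 + acc1;
}
```

In the vectorized kernels the same pattern applies per vector register: two register sets accumulate independently and are added together once after the loop.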
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Redesigned the edge kernels to use masked load-store
  instructions for handling corner cases.

- Added macros for the masked load-store instructions:
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store.

- The following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, the m_left computation is handled
  with masked load-store instructions to avoid the overhead
  of conditional checks for edge cases.

- This improves performance by reducing branching overhead
  and by being more cache friendly.

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
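As a rough illustration of the masked load-store approach in the commit above, the following scalar C model mimics how a vmaskmovpd-style operation touches only the lanes whose mask element has its sign bit set. It is a sketch of the semantics, not the kernel code; `LANES`, `make_mask`, `masked_load` and `masked_store` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* doubles per 256-bit register */

/* Build a mask with the first n_left lanes enabled (all bits set). */
void make_mask(size_t n_left, int64_t mask[LANES])
{
    assert(n_left <= LANES);
    for (size_t i = 0; i < LANES; ++i)
        mask[i] = (i < n_left) ? -1 : 0;
}

/* Masked load: disabled lanes read as 0.0 and memory is not touched. */
void masked_load(const double *src, const int64_t mask[LANES],
                 double dst[LANES])
{
    for (size_t i = 0; i < LANES; ++i)
        dst[i] = (mask[i] < 0) ? src[i] : 0.0;
}

/* Masked store: disabled lanes leave memory unchanged. */
void masked_store(double *dst, const int64_t mask[LANES],
                  const double src[LANES])
{
    for (size_t i = 0; i < LANES; ++i)
        if (mask[i] < 0)
            dst[i] = src[i];
}
```

With a mask built for three remaining elements, a full-width load and store can be used on a 3-element edge without reading or writing past the end of the row, which is what removes the per-case branching.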
Eleni Vlachopoulou
9c613c4c03 Windows CMake bugfix in object libraries for shared library option
Define BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in the kernels/ directory.
The macro definition was not propagated from the top-level CMake, so it must be defined explicitly for the object libraries.

AMD-Internal: [CPUPL-3241]
Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52
2023-05-24 17:30:16 +05:30
Edward Smyth
a3adfb68cf BLIS: Missing clobbers (batch 4)
Add missing clobbers in haswell (sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I19fa97b85f75c8b8fe15d31b13768f937cc5e4cc
2023-05-23 14:57:08 -04:00
Edward Smyth
03965a4f07 BLIS: Missing clobbers (batch 3)
Add missing clobbers in haswell (non-sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I68f6ad0c01557fcde73b1775d250d48b5162c521
2023-05-23 14:37:31 -04:00
Edward Smyth
ea2eea5097 BLIS: Missing clobbers (batch 1)
Add missing clobbers in first batch of assembly kernels:
- zen3 bli_gemmsup*
- bli_zgemm_zen4_asm_12x4
- bli_gemmsup_rv_haswell_asm_sMx6

AMD-Internal: [CPUPL-3456]
Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1
2023-05-23 11:51:18 -04:00
Eleni Vlachopoulou
1a7f60ff5b Update CMake system to use object libraries for haswell, skx and zen4.
- AVX2 and AVX512 flags are set up locally for each object library that requires them.
- Default ENABLE_SIMD_FLAGS value is set to none and for AVX2 option the corresponding compiler flag is set globally.
- To be able to build zen4 codepath when ENABLE_SIMD_FLAGS=AVX2, the compiler option is removed by removing the definition before building the corresponding object library.

AMD-Internal: [CPUPL-3241]
Change-Id: Ia570e60f06c4c72b7c58f4c9ca73bac4c060ae73
2023-05-12 10:04:16 -04:00
Harsh Dave
238d9fda9e Fixed ASAN memory issue due to modifying RBP register
- RBP is the base pointer, which points to the base of the current stack frame.
  The ASAN tool relies on rbp and rsp for stack-related validation, so overwriting
  or modifying the RBP register terminates the application with a stack overflow
  error.
- Removed all code in the gemm sup kernels for the double datatype that used the
  rbp register for prefetching matrices and, in some cases, for loading elements
  from memory.
- Removed rbp from the register clobber lists as well, to completely avoid
  use of the rbp register.

AMD-Internal: [CPUPL-2613, CPUPL-2587]

Change-Id: Idd402d3c644c4dd66e8d4988aede539ad8c77b28
2023-03-23 21:23:44 -04:00
Harsh Dave
222e00e840 Obliterated usage of rbp register in SUP gemm kernel
- Mx4 edge kernels were overwriting the rbp
register for prefetches.
- Since rbp, along with rsp, defines the stack frame,
this resulted in a stack overflow issue.
- Replaced rbp with rdx register for prefetches.

AMD-Internal: [CPUPL-2987]
Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd
2023-03-01 11:09:57 -05:00
Kiran Varaganti
d8d4499e54 AVX2 dgemm kernel optimization for AOCC
Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), so the operations involving
     k0 are typecast to uint64_t to enable AOCC to generate optimized code.
     Thanks to Jini Susan (jinisusan.george@amd.com) from the compiler team for suggesting
     this change. A similar change was applied to the sgemm, cgemm and zgemm kernels.
Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9
2023-01-09 07:49:41 -05:00
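A minimal sketch of the kind of cast the commit above describes, assuming the kernel splits k0 into an unrolled iteration count and a remainder. The helper name `split_k` and the unroll factor of 4 are assumptions for illustration and are not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Split a non-negative k0 into unrolled iterations and leftover iterations.
   Casting to uint64_t before dividing lets the compiler use unsigned
   arithmetic (a shift and a mask for a power-of-two divisor) instead of
   the extra fix-up code that signed division requires. */
void split_k(int64_t k0, uint64_t *k_iter, uint64_t *k_left)
{
    assert(k0 >= 0);
    assert(k_iter != 0 && k_left != 0);
    *k_iter = (uint64_t)k0 / 4;
    *k_left = (uint64_t)k0 % 4;
}
```

The cast is only valid because k0 is known to be non-negative; a negative value would wrap to a very large unsigned number.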
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
          files that were missing it

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Mangala V
e440cbc91a Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m
Address sanitizer reports an error when the rbp register is modified.

Register rbp held rs_a, which was used during prefetch
of Matrix A. Usage of rbp is avoided by using the rcx register as a
temporary storage register.
Hence rcx is updated with the Matrix C address before storing the
computed data.

This fix addresses the issue reported by the GEQP3 API of libflame

AMD-Internal: [CPUPL-2587]
Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d
2022-09-30 11:52:29 -04:00
Shubham Sharma
b8b339416a DGEMMT optimizations
Details:

1. For lower and upper, "B" column major storage variants of gemmt,
   new kernels are developed and optimized to compute only the
   required outputs in the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to
   compute Lower and Upper Variant diagonal outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I9748b2b52557718e7497ecf046530d3031636a63
2022-08-19 12:31:35 -04:00
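The idea of computing only the required outputs of a diagonal block, as described in the commit above, can be sketched in plain C for the lower-triangular case. This is an illustrative model, not the SUP kernel: the function name `gemmt_lower_block`, the row-major layout, and the `row0`/`col0` block offsets are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 6, NR = 8 };  /* block dimensions used by the 6x8 kernels */

/* Update only the lower-triangular outputs of an MR x NR block of C whose
   top-left element sits at global position (row0, col0).
   a: MR x k, row-major; b: k x NR, row-major; c: row-major with row
   stride ldc. Elements above the diagonal are neither computed nor stored,
   and results go straight into C without a temporary buffer. */
void gemmt_lower_block(size_t k, const double *a, const double *b,
                       double *c, size_t ldc, size_t row0, size_t col0)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            if (row0 + i < col0 + j)
                continue;  /* strictly above the diagonal: skip */
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i * k + p] * b[p * NR + j];
            c[i * ldc + j] += ab;
        }
    }
}
```

Compared with filling all 48 outputs into a temporary and copying the needed ones, this form avoids both the extra copy and the dot products for elements that would be discarded.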
Shubham Sharma
32c9239c7f Optimization of DGEMMT SUP kernels
Details:
1. Optimized the kernels by replacing the macros with
   the actual computation of required output elements.

AMD-Internal: [CPUPL-2341]
Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5
2022-08-18 08:30:19 -04:00
Shubham Sharma
8adef27aca Optimization of DGEMMT SUP kernel for beta zero cases.
Details:
1. In kernels for non-transpose variants, changes
   are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
   GTestSuite(Functionality, Valgrind, Integer Tests)
   and Netlib Tests.
3. Fixed warnings during the build process.

AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
2022-08-18 08:05:58 -04:00
Harsh Dave
e2e1dadee1 DGEMM Improvements
- We prefetch next panel while packing 8xk panel.
- Modified prefetch offsets for dgemm native and 
  dgemm_small kernel.

AMD-Internal: [CPUPL-2366]

Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
2022-08-16 08:07:30 -04:00
Shubham Sharma
f5ef30a44a Fix in DGEMMT SUP kernel
Details:
1. Due to an error in the C output buffer address computation in
   kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid
   memory was being accessed. This caused a seg fault in
   libflame netlib testing.
2. Validated the fix with libflame netlib testing.

AMD-Internal: [CPUPL-2341]
Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed
2022-08-12 21:03:36 +05:30
Shubham Sharma
4bca7f6f4a DGEMMT optimizations
Details:

1. For lower and upper, non-transpose variants of gemmt, new kernels
   are developed and optimized to compute only the required outputs in
   the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Kernels specific to compute Lower and Upper Variant diagonal
   outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a
2022-08-11 06:16:33 -04:00
Devin Matthews
76fbf1233d Add vzeroupper to Haswell microkernels. (#524)
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
  microkernels so as to avoid a performance penalty when mixing AVX
  and SSE instructions. These vzeroupper instructions were once part
  of the haswell kernels, but were inadvertently removed during a source
  code shuffle some time ago when we were managing duplicate 'haswell'
  and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
  and re-inserting the missing instructions.

Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9
2022-08-01 09:29:04 +05:30
Field G. Van Zee
2a81437bd8 Fixed bugs in cpackm kernels, gemmlike code.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
  bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
  kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
  of 4 bytes) from the real component. This was almost certainly a copy-
  paste bug carried over from the corresponding zpackm kernels. Thanks to
  Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. (This bug was never observed in output but rather
  noticed analytically. It probably would have also manifested as
  intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
  relating to debugging.

Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011
2022-08-01 09:11:58 +05:30
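The offset mix-up fixed in the cpackm kernels above comes down to element size: the imaginary part follows the real part by one element, which is 4 bytes for single precision and 8 bytes for double. The sketch below shows this in C; `scomplex_t`, `dcomplex_t` and `load_imag_c` are hypothetical stand-ins, and a 4-byte `float` and 8-byte `double` are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for single- and double-precision complex types. */
typedef struct { float  real, imag; } scomplex_t;
typedef struct { double real, imag; } dcomplex_t;

/* Load the imaginary part of a single-precision complex scalar by byte
   offset, the way an assembly kernel addresses it. The offset must be
   sizeof(float), not sizeof(double). */
float load_imag_c(const scomplex_t *kappa)
{
    assert(kappa != NULL);
    const char *base = (const char *)kappa;
    float im;
    memcpy(&im, base + sizeof(float), sizeof im);
    return im;
}
```

Using the double-precision offset on a single-precision scalar reads past the end of the value, which is consistent with the intermittent failures the commit describes.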
Devin Matthews
9495401b73 Fix more copy-paste errors in the haswell gemmsup code.
Fixes #486.

Change-Id: I568386b5d67a698ea9c0b6b17f133df86c2894bd
2022-07-31 21:36:21 +05:30
Devin Matthews
ea163fc23b Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
The fix is to use the same (valid) source register twice in the horizontal addition.

Change-Id: I96ed39e289aaeeb44be9117074b32bd8d4c19de6
2022-07-31 21:15:28 +05:30
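To see why passing the same register twice is safe, here is a scalar C model of the 256-bit horizontal add. It is only a sketch of the instruction's lane behavior; the function name `hadd_pd_model` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a 256-bit vhaddpd with sources a and b:
   result lanes are { a0+a1, b0+b1, a2+a3, b2+b3 }. */
void hadd_pd_model(const double a[4], const double b[4], double dst[4])
{
    assert(a != NULL && b != NULL && dst != NULL);
    /* Compute into temporaries first so dst may alias a or b. */
    double r0 = a[0] + a[1];
    double r1 = b[0] + b[1];
    double r2 = a[2] + a[3];
    double r3 = b[2] + b[3];
    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}
```

Every result lane depends on one of the two sources, so if the second source is a register that was never written, lanes 1 and 3 carry garbage. Passing the valid accumulator as both `a` and `b` keeps all inputs initialized, and the lanes that an Mx1 kernel actually needs are unchanged.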
Field G. Van Zee
faff30b46a Fixed out-of-bounds bug in sup s6x16m haswell kernel.
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
  assembly kernels. This bug is similar to the one fixed in 17b0caa
  and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
  Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.

Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9
2022-07-31 21:10:58 +05:30
Field G. Van Zee
4b1663213c Fixed out-of-bounds read in haswell gemmsup kernels.
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
  kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
  single-precision elements of C, via instructions such as:

	vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

  in situations where only two elements are guaranteed to exist. (These
  bugs may not have manifested in earlier tests due to the leading
  dimension alignment that BLIS employs by default.) The issue was fixed
  by replacing lines like the one above with:

	vmovsd(mem(rcx), xmm0)
	vfmadd231ps(xmm0, xmm3, xmm4)

  Thus, we use vmovsd to explicitly load only two elements of C into
  registers, and then operate on those values using register addressing.
  Thanks to Daniël de Kok for reporting these bugs in #635, and to
  Bhaskar Nallani for proposing the fix.
- CREDITS file update.

Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b
2022-07-31 21:06:08 +05:30
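The fix in the commit above can be modeled in scalar C: load exactly two single-precision elements of C into a zero-filled 4-lane register, then do the multiply-add on registers only. This is a sketch of the access pattern, not the kernel; `load_two_floats` and `fma_from_two` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of vmovsd from memory: copy 64 bits (two floats) into the low
   lanes of a 4-lane register and zero the upper lanes. */
void load_two_floats(const float *c, float reg[4])
{
    assert(c != NULL && reg != NULL);
    memset(reg, 0, 4 * sizeof(float));
    memcpy(reg, c, 2 * sizeof(float));
}

/* Model of the register-only multiply-add: acc += creg * beta, lane by
   lane. Only two elements of c are ever read from memory. */
void fma_from_two(const float *c, const float beta[4], float acc[4])
{
    float creg[4];
    load_two_floats(c, creg);
    for (size_t i = 0; i < 4; ++i)
        acc[i] += creg[i] * beta[i];
}
```

The original form used a memory operand in the multiply-add itself, which reads all four lanes from C even when only two elements exist.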
Kiran Varaganti
86134c7278 Replaced vzeroall
Replaced the vzeroall instruction with vxorpd and vmovapd for dgemm kernels,
both AVX2 and AVX512. vzeroall is an expensive instruction, so it was replaced
with a faster way of zeroing all registers. A vzeroupper() instruction is
also added at the end of AVX2 kernels to avoid any AVX2/SSE transition
penalties. Note that only the main kernels are modified.

Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4
2022-07-20 01:16:59 -04:00
mkurumel
9f1ce594a5 BLIS : Compiler warning fixes
Details :
  - Fixed warnings with AOCC and GCC compilers.

AMD-Internal: [CPUPL-1662]

Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce
2021-11-12 08:58:52 +05:30
Devin Matthews
e3dc1954ff Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
The fix is to use the same (valid) source register twice in the horizontal addition.
2021-09-16 10:59:37 -05:00
Devin Matthews
5191c43fac Fix more copy-paste errors in the haswell gemmsup code.
Fixes #486.
2021-09-16 10:16:17 -05:00
Devin Matthews
4f70eb7913 Clean up some warnings that show up on clang/OSX.
2021-08-13 11:12:43 -05:00
Devin Matthews
17729cf449 Add vzeroupper to Haswell microkernels. (#524)
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' 
  microkernels so as to avoid a performance penalty when mixing AVX
  and SSE instructions. These vzeroupper instructions were once part 
  of the haswell kernels, but were inadvertently removed during a source 
  code shuffle some time ago when we were managing duplicate 'haswell' 
  and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down 
  and re-inserting the missing instructions.
2021-07-09 14:59:48 -05:00
nphaniku
2bdee3cd6c Unifying BLIS Windows and Linux codebase
1. Removed dependency on bli_config.h inclusion in blis.h
2. Provided AOCL DYNAMIC / TRSM PRE INVERSION / COMPLEX RETURN configuration flags.
3. CMAKE changes to incorporate new changes as per 3.1 code base.
4. Removed zen2 folder from Windows directory.

AMD Internal : [CPUPL-1532]

Change-Id: I9261851087d10f73ab563d466fa3f7bb72ddee47
2021-06-03 15:28:10 +05:30
Field G. Van Zee
7f7d72610c Fixed bugs in cpackm kernels, gemmlike code.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
  bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
  kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
  of 4 bytes) from the real component. This was almost certainly a copy-
  paste bug carried over from the corresponding zpackm kernels. Thanks to
  Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. (This bug was never observed in output but rather
  noticed analytically. It probably would have also manifested as
  intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
  relating to debugging.
2021-05-31 16:50:18 -05:00
lcpu
7401effc03 BLIS:merge:
Fixed the merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch.

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
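The automatic thread-count reduction mentioned in the merge commit above can be sketched as follows. The commit only states that the count is reduced when it is prime and above the threshold; reducing it by exactly one is an assumption of this example, and `NT_MAX_PRIME`, `is_prime` and `adjust_num_threads` are hypothetical local names (the threshold value 11 mirrors the stated default of BLIS_NT_MAX_PRIME).

```c
#include <assert.h>

#define NT_MAX_PRIME 11u  /* mirrors the stated default threshold */

/* Trial-division primality test; adequate for thread counts. */
int is_prime(unsigned n)
{
    if (n < 2)
        return 0;
    for (unsigned d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return 0;
    return 1;
}

/* If nt is prime and exceeds the threshold, step down to nt - 1
   (assumed reduction) so that it can be factored across loops. */
unsigned adjust_num_threads(unsigned nt)
{
    assert(nt > 0);
    if (nt > NT_MAX_PRIME && is_prime(nt))
        return nt - 1;
    return nt;
}
```

A prime thread count can only be split as 1 x nt, so moving to a nearby composite number gives the automatic factorization more ways to divide the work.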
nphaniku
d78defa0fc AOCL Windows: 3.1 BLIS changes
1. CMake script changes for adding new files to the build.
2. Added upper-case support for a couple of APIs.
3. bool is not supported in clang, so it has been defined explicitly.

AMD Internal : [CPUPL-1422]

Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61
2021-03-08 22:32:13 +05:30
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00
Meghana Vankadari
943b1362c7 Enabled vectorized pack kernels for zen2 configuration.
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
  implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.

AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
2021-02-12 19:16:57 +05:30
Madan mohan Manokar
3ab9104dae Handling zgemm real(+/-1) alpha and beta
1. Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2. Change made in bli_zgemmsup_rv_zen_asm_3x4n.
3. Change made in bli_zgemmsup_rv_zen_asm_3x4m.
4. Change made in bli_zgemm_haswell_asm_3x4.

Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
2021-02-10 02:58:58 -05:00
Field G. Van Zee
ed50c94738 Merge branch 'master' into dev
2021-01-04 14:31:44 -06:00
Field G. Van Zee
7038bbaa05 Optionally disable trsm diagonal pre-inversion.
Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
  optionally disables the pre-inversion of diagonal elements of the
  triangular matrix in the trsm operation and instead uses division
  instructions within the gemmtrsm microkernels. Pre-inversion is
  enabled by default. When it is disabled, performance may suffer
  slightly, but numerical robustness should improve for certain
  pathological cases involving denormal (subnormal) numbers that would
  otherwise result in overflow in the pre-inverted value. Thanks to
  Bhaskar Nallani for reporting this issue via #461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
  gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
  to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
  instructions.
2020-12-04 16:08:15 -06:00
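The numerical difference between the two modes described in the commit above can be shown with a small C sketch. It models a single diagonal solve step in isolation; the function names `solve_preinverted` and `solve_divide` are hypothetical, and default IEEE 754 double arithmetic with subnormals enabled is assumed.

```c
#include <assert.h>
#include <math.h>

/* Pre-inversion: compute 1/diag once, then multiply. For a subnormal
   diag the reciprocal overflows to infinity. */
double solve_preinverted(double b, double diag)
{
    assert(diag != 0.0);
    double inv = 1.0 / diag;
    return b * inv;
}

/* No pre-inversion: divide directly, as the microkernels do when the
   feature is disabled. */
double solve_divide(double b, double diag)
{
    assert(diag != 0.0);
    return b / diag;
}
```

For example, with `diag = 1e-310` (a subnormal double) and `b = 1e-310`, the pre-inverted path yields infinity because `1.0 / 1e-310` exceeds the largest finite double, while direct division yields exactly 1.0.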