77 Commits

Author SHA1 Message Date
Harsh Dave
5bdf5e2aaa Optimized AVX2 DGEMM SUP and small edge kernels.
- Redesigned the edge kernels to use masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmovdqu, VMOVDQU for setting up the mask.
  vmaskmovpd, VMASKMOVPD for masked load-store

- Following edge kernels are added for 6x8m dgemm sup.
  n-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_6x7m
  - bli_dgemmsup_rv_haswell_asm_6x5m
  - bli_dgemmsup_rv_haswell_asm_6x3m

  m-left edge kernels
  - bli_dgemmsup_rv_haswell_asm_5x7
  - bli_dgemmsup_rv_haswell_asm_4x7
  - bli_dgemmsup_rv_haswell_asm_3x7
  - bli_dgemmsup_rv_haswell_asm_2x7
  - bli_dgemmsup_rv_haswell_asm_1x7

  - bli_dgemmsup_rv_haswell_asm_5x5
  - bli_dgemmsup_rv_haswell_asm_4x5
  - bli_dgemmsup_rv_haswell_asm_3x5
  - bli_dgemmsup_rv_haswell_asm_2x5
  - bli_dgemmsup_rv_haswell_asm_1x5

  - bli_dgemmsup_rv_haswell_asm_5x3
  - bli_dgemmsup_rv_haswell_asm_4x3
  - bli_dgemmsup_rv_haswell_asm_3x3
  - bli_dgemmsup_rv_haswell_asm_2x3
  - bli_dgemmsup_rv_haswell_asm_1x3

- For 16x3 dgemm_small, m_left computation is handled
  with masked load-store instructions to avoid the overhead
  of conditional checks for edge cases.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

AMD-Internal: [CPUPL-3574]

Change-Id: I976d6a9209d2a1a02b2830d03d21d200a5aad173
2023-08-07 07:30:50 -04:00
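The masked edge handling described in commit 5bdf5e2aaa can be sketched in C. This is a minimal illustration, not the kernel code: `axpy_tail` and `mask_table` are hypothetical names, and the intrinsics `_mm256_loadu_si256`, `_mm256_maskload_pd` and `_mm256_maskstore_pd` stand in for the `vmovdqu` and `vmaskmovpd` instructions named in the message. A scalar fallback is used when the compiler is not targeting AVX.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: y[i] += alpha * x[i] for i in [0, n_left),
   where 0 <= n_left <= 4. Lanes at or beyond n_left are never
   read from or written to memory. */
void axpy_tail(double alpha, const double *x, double *y, int n_left)
{
    assert(n_left >= 0 && n_left <= 4);
#if defined(__AVX__)
    /* A sliding window over this table yields n_left all-ones lanes
       followed by zero lanes. */
    static const int64_t mask_table[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };
    __m256i mask = _mm256_loadu_si256(
        (const __m256i *)(mask_table + 4 - n_left));   /* vmovdqu */
    __m256d vx = _mm256_maskload_pd(x, mask);          /* vmaskmovpd (load) */
    __m256d vy = _mm256_maskload_pd(y, mask);
    vy = _mm256_add_pd(vy, _mm256_mul_pd(_mm256_set1_pd(alpha), vx));
    _mm256_maskstore_pd(y, mask, vy);                  /* vmaskmovpd (store) */
#else
    /* Scalar fallback with the same semantics. */
    for (int i = 0; i < n_left; i++)
        y[i] += alpha * x[i];
#endif
}
```

One masked sequence covers every tail length, so no per-length branch is needed. That matches the commit's stated goal of reducing branching overhead.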
Eleni Vlachopoulou
9c613c4c03 Windows CMake bugfix in object libraries for shared library option
Defining BLIS_IS_BUILDING_LIBRARY if BUILD_SHARED_LIBS=ON for the object libraries created in kernels/ directory.
The macro definition was not propagated from high level CMake, so we need to define explicitly for the object libraries.

AMD-Internal: [CPUPL-3241]
Change-Id: Ifc5243861eb94670e7581367ef4bc7467c664d52
2023-05-24 17:30:16 +05:30
Edward Smyth
a3adfb68cf BLIS: Missing clobbers (batch 4)
Add missing clobbers in haswell (sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I19fa97b85f75c8b8fe15d31b13768f937cc5e4cc
2023-05-23 14:57:08 -04:00
Edward Smyth
03965a4f07 BLIS: Missing clobbers (batch 3)
Add missing clobbers in haswell (non-sup) kernels.

AMD-Internal: [CPUPL-3456]
Change-Id: I68f6ad0c01557fcde73b1775d250d48b5162c521
2023-05-23 14:37:31 -04:00
Edward Smyth
ea2eea5097 BLIS: Missing clobbers (batch 1)
Add missing clobbers in first batch of assembly kernels:
- zen3 bli_gemmsup*
- bli_zgemm_zen4_asm_12x4
- bli_gemmsup_rv_haswell_asm_sMx6

AMD-Internal: [CPUPL-3456]
Change-Id: I33c321043a197b2b885cfd6cd589532fc633a6a1
2023-05-23 11:51:18 -04:00
Eleni Vlachopoulou
1a7f60ff5b Update CMake system to use object libraries for haswell, skx and zen4.
- AVX2 and AVX512 flags are set up locally for each object library that requires them.
- Default ENABLE_SIMD_FLAGS value is set to none and for AVX2 option the corresponding compiler flag is set globally.
- To be able to build zen4 codepath when ENABLE_SIMD_FLAGS=AVX2, the compiler option is removed by removing the definition before building the corresponding object library.

AMD-Internal: [CPUPL-3241]
Change-Id: Ia570e60f06c4c72b7c58f4c9ca73bac4c060ae73
2023-05-12 10:04:16 -04:00
Harsh Dave
238d9fda9e Fixed ASAN memory issue due to modifying RBP register
- RBP is the base pointer, which points to the base of the current stack frame.
  ASAN relies on rbp and rsp for stack-related validation, so overwriting
  or modifying the RBP register terminates the application with a
  stack overflow error.
- Removed all the code snippets which were using rbp register for prefetching matrices
  and sometimes loading elements from memory in all of the gemm sup kernels for double
  datatype.
- Removed reference to rbp from register clobber list as well to completely avoid the
  usage of rbp register.

AMD-Internal: [CPUPL-2613, CPUPL-2587]

Change-Id: Idd402d3c644c4dd66e8d4988aede539ad8c77b28
2023-03-23 21:23:44 -04:00
Harsh Dave
222e00e840 Obliterated usage of rbp register in SUP gemm kernel
- Mx4 edge kernels were overwriting the rbp
register for prefetches.
- Since rbp, along with rsp, defines the stack frame,
this resulted in a stack overflow issue.
- Replaced rbp with rdx register for prefetches.

AMD-Internal: [CPUPL-2987]
Change-Id: I4e52cf691b70be5ab63f562d7630d640b29e1cfd
2023-03-01 11:09:57 -05:00
Kiran Varaganti
d8d4499e54 AVX2 dgemm kernel optimization for AOCC
Details: k0 is always positive in bli_dgemm_haswell_asm_6x8(), so the operations involving
     k0 are cast to uint64_t to enable AOCC to generate optimized code.
     Thanks to Jini Susan (jinisusan.george@amd.com) from the compiler team for suggesting
     this change. A similar change was applied to the sgemm, cgemm and zgemm kernels.
Change-Id: I423c949e0c1835652142a6931dadf4a7d190aeb9
2023-01-09 07:49:41 -05:00
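A minimal sketch of the cast described in commit d8d4499e54. The function `split_k`, the `dim_t` typedef and the unroll factor of 4 are assumptions made for illustration; the commit only states that operations on k0 are cast to uint64_t.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t dim_t;   /* assumption: a signed dimension type */

/* Hypothetical helper: split the k dimension into unrolled iterations
   and a remainder, assuming an unroll factor of 4. */
void split_k(dim_t k0, uint64_t *k_iter, uint64_t *k_left)
{
    assert(k0 >= 0);               /* the precondition the commit relies on */
    uint64_t k = (uint64_t)k0;     /* the cast tells the compiler k is non-negative */
    *k_iter = k / 4;               /* unsigned: a plain right shift */
    *k_left = k % 4;               /* unsigned: a plain bit mask */
}
```

Signed division and remainder by a power of two need extra sign-correction instructions because C rounds toward zero. The unsigned forms do not, which is one plausible reason the cast helps the compiler.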
Harihara Sudhan S
42d631bced Copyright modification
- Added copyright information to modified/newly created
          files missing them

Change-Id: If4e73b680246d0363de09587d6dc54bee00ecd71
2022-10-14 12:43:35 +05:30
Mangala V
e440cbc91a Fixed ASAN reported issues in bli_dgemmsup_rd_haswell_asm_6x8m
Address sanitizer reports an error when the rbp register is modified.

Register rbp was stored with rs_a which was used during prefetch
of Matrix A. Usage of rbp is avoided by using rcx register as a
temporary storage register.
Hence rcx is updated with Matrix C address before storing the
computed data.

This fix addresses the issue reported by the GEQP3 API of libflame.

AMD-Internal: [CPUPL-2587]
Change-Id: Ica790259010d8e71528c3d0ab1cd49069c56fc1d
2022-09-30 11:52:29 -04:00
Shubham Sharma
b8b339416a DGEMMT optimizations
Details:

1. For lower and upper, "B" column major storage variants of gemmt,
   new kernels are developed and optimized to compute only the
   required outputs in the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Customized bli_dgemmsup_rd_haswell_asm_6x8m Kernels specific to
   compute Lower and Upper Variant diagonal outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I9748b2b52557718e7497ecf046530d3031636a63
2022-08-19 12:31:35 -04:00
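The idea in points 2 to 4 of commit b8b339416a (compute only the required outputs of a 6x8 diagonal block, with no temporary buffer) can be modelled in scalar C. This is a reference-style sketch for the lower variant only. `gemmt_lower_block`, its row-major layout and its arguments are hypothetical and do not reflect the assembly kernels.

```c
#include <assert.h>

enum { MR = 6, NR = 8 };   /* block shape named in the commit */

/* Hypothetical reference model. a is MR x k, b is k x NR, c is MR x NR,
   all row-major. (i0, j0) is the global position of the block's top-left
   element. Only elements on or below the global diagonal are computed
   and written. Returns the number of elements updated. */
int gemmt_lower_block(int k, const double *a, const double *b,
                      double *c, int i0, int j0)
{
    assert(k >= 0);
    int updated = 0;
    for (int i = 0; i < MR; i++) {
        for (int j = 0; j < NR; j++) {
            if (i0 + i < j0 + j)
                continue;              /* strictly upper part: skipped */
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * NR + j];
            c[i * NR + j] += acc;      /* written directly into C */
            updated++;
        }
    }
    return updated;
}
```

For a block whose top-left corner lies on the diagonal, this touches 21 of the 48 elements, which is where the saving in both computation and copying comes from.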
Shubham Sharma
32c9239c7f Optimization of DGEMMT SUP kernels
Details:
1. Optimized the kernels by replacing the macros with
   the actual computation of required output elements.

AMD-Internal: [CPUPL-2341]
Change-Id: Ieefb80ac9b2dc2955b683710e259cf45d581e1b5
2022-08-18 08:30:19 -04:00
Shubham Sharma
8adef27aca Optimization of DGEMMT SUP kernel for beta zero cases.
Details:
1. In kernels for non-transpose variants, changes
   are made to optimize the cases of beta zero.
2. Validated the changes with BLIS Testsuite,
   GTestSuite(Functionality, Valgrind, Integer Tests)
   and Netlib Tests.
3. Fixed warnings during the build process.

AMD-Internal: [CPUPL-2341]
Change-Id: I8bb53ad619eb2413c999fe18eafd67c75fe1f83a
2022-08-18 08:05:58 -04:00
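A scalar sketch of what a beta-zero special case, as in commit 8adef27aca, typically looks like. The function `update_c` is hypothetical, and the commit does not say exactly how the kernels branch.

```c
#include <assert.h>

/* Hypothetical model of the C update step: c = beta * c + alpha * ab.
   When beta is zero, C is overwritten without being read. */
void update_c(double *c, const double *ab, int n, double alpha, double beta)
{
    assert(n >= 0);
    if (beta == 0.0) {
        for (int i = 0; i < n; i++)
            c[i] = alpha * ab[i];                 /* no load of C */
    } else {
        for (int i = 0; i < n; i++)
            c[i] = beta * c[i] + alpha * ab[i];   /* general case */
    }
}
```

Skipping the load saves memory traffic. It also keeps NaN or Inf values already present in C from leaking into the result, in line with the usual BLAS convention for beta equal to zero.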
Harsh Dave
e2e1dadee1 DGEMM Improvements
- We prefetch next panel while packing 8xk panel.
- Modified prefetch offsets for dgemm native and 
  dgemm_small kernel.

AMD-Internal: [CPUPL-2366]

Change-Id: Ife609e789c8b87169c73bb0a30d6f1af20fb30ed
2022-08-16 08:07:30 -04:00
Shubham Sharma
f5ef30a44a Fix in DGEMMT SUP kernel
Details:
1. Due to an error in the C output buffer address computation in
   kernel bli_dgemmsup_rv_haswell_asm_6x8m_6x8_L, invalid
   memory was being accessed. This caused a segmentation fault in
   libflame netlib testing.
2. Validated the fix with libflame netlib testing.

AMD-Internal: [CPUPL-2341]
Change-Id: I9ca0cf09cf2d177ade73f840054b5028eae3a0ed
2022-08-12 21:03:36 +05:30
Shubham Sharma
4bca7f6f4a DGEMMT optimizations
Details:

1. For lower and upper, non-transpose variants of gemmt, new kernels
   are developed and optimized to compute only the required outputs in
   the diagonal blocks.
2. In the previous implementation, all the 48 outputs of the given
   6x8 block of C matrix are computed and stored into a temporary
   buffer. Later, the required elements are copied into the final C
   output buffer.
3. Changes are made to compute only the required outputs of the 6x8
   block of C matrix and directly stored in the final C output buffer.
4. With this optimization, we are avoiding copy operation and also
   reducing the number of computations.
5. Kernels specific to compute Lower and Upper Variant diagonal
   outputs have been added.
6. SUP Framework changes to integrate the new kernels have been added.
7. These kernels are part of the SUP framework.

AMD-Internal: [CPUPL-2341]
Change-Id: I0ec8f24a0fb19d9b1ef7254732b8e09f06e1486a
2022-08-11 06:16:33 -04:00
Devin Matthews
76fbf1233d Add vzeroupper to Haswell microkernels. (#524)
Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
  microkernels so as to avoid a performance penalty when mixing AVX
  and SSE instructions. These vzeroupper instructions were once part
  of the haswell kernels, but were inadvertently removed during a source
  code shuffle some time ago when we were managing duplicate 'haswell'
  and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
  and re-inserting the missing instructions.

Change-Id: I418fea9fed27ba3ad7d395cf96d1be507955d8e9
2022-08-01 09:29:04 +05:30
Field G. Van Zee
2a81437bd8 Fixed bugs in cpackm kernels, gemmlike code.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
  bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
  kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
  of 4 bytes) from the real component. This was almost certainly a
  copy-paste bug carried over from the corresponding zpackm kernels. Thanks to
  Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
  bls_gemm_bp_var2.c that initializes the elements of the temporary
  microtile to zero. (This bug was never observed in output but rather
  noticed analytically. It probably would have also manifested as
  intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
  relating to debugging.

Change-Id: I899e20df203806717fb5270b5f3dd0bf1f685011
2022-08-01 09:11:58 +05:30
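The 4-byte versus 8-byte offset in the cpackm fix of commit 2a81437bd8 follows directly from the element size. A small sketch, with hypothetical type and function names and assuming the usual 4-byte float and 8-byte double:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for single- and double-precision complex. */
typedef struct { float  real; float  imag; } scomplex_t;
typedef struct { double real; double imag; } dcomplex_t;

/* Load the imaginary part of kappa by byte offset, as assembly would.
   For scomplex_t the offset is 4 bytes; reusing the double-precision
   offset of 8 would read past the end of the scalar. */
float load_kappa_imag_c(const scomplex_t *kappa)
{
    assert(kappa != NULL);
    float v;
    memcpy(&v, (const char *)kappa + offsetof(scomplex_t, imag), sizeof v);
    return v;
}

double load_kappa_imag_z(const dcomplex_t *kappa)
{
    assert(kappa != NULL);
    double v;
    memcpy(&v, (const char *)kappa + offsetof(dcomplex_t, imag), sizeof v);
    return v;
}
```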
Devin Matthews
9495401b73 Fix more copy-paste errors in the haswell gemmsup code.
Fixes #486.

Change-Id: I568386b5d67a698ea9c0b6b17f133df86c2894bd
2022-07-31 21:36:21 +05:30
Devin Matthews
ea163fc23b Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
The fix is to use the same (valid) source register twice in the horizontal addition.

Change-Id: I96ed39e289aaeeb44be9117074b32bd8d4c19de6
2022-07-31 21:15:28 +05:30
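A scalar model shows why passing the same source register twice, as in commit ea163fc23b, is safe for an Mx1 reduction. `hadd_pd_model` mirrors the lane behaviour of the 256-bit vhaddpd instruction. Both function names are hypothetical.

```c
#include <assert.h>

/* Scalar model of 256-bit vhaddpd: adjacent pairs of a and b are summed
   and interleaved per 128-bit half. */
void hadd_pd_model(const double a[4], const double b[4], double dst[4])
{
    dst[0] = a[0] + a[1];
    dst[1] = b[0] + b[1];
    dst[2] = a[2] + a[3];
    dst[3] = b[2] + b[3];
}

/* Reduce one 4-lane accumulator to a scalar. Passing acc as both sources
   means no second (possibly uninitialized) register is involved. */
double reduce4(const double acc[4])
{
    assert(acc != 0);
    double t[4];
    hadd_pd_model(acc, acc, t);
    return t[0] + t[2];
}
```

With a single output column there is only one valid accumulator per row, so the second operand carries no useful data. Reusing the first operand keeps every lane of the result well defined.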
Field G. Van Zee
faff30b46a Fixed out-of-bounds bug in sup s6x16m haswell kernel.
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
  assembly kernels. This bug is similar to the one fixed in 17b0caa
  and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
  Kannan for reporting this bug (and a suitable fix) in #635.
- CREDITS file update.

Change-Id: I10ccf4d4f471d93e8c8cc4df422c686438fb04e9
2022-07-31 21:10:58 +05:30
Field G. Van Zee
4b1663213c Fixed out-of-bounds read in haswell gemmsup kernels.
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
  kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
  single-precision elements of C, via instructions such as:

	vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

  in situations where only two elements are guaranteed to exist. (These
  bugs may not have manifested in earlier tests due to the leading
  dimension alignment that BLIS employs by default.) The issue was fixed
  by replacing lines like the one above with:

	vmovsd(mem(rcx), xmm0)
	vfmadd231ps(xmm0, xmm3, xmm4)

  Thus, we use vmovsd to explicitly load only two elements of C into
  registers, and then operate on those values using register addressing.
  Thanks to Daniël de Kok for reporting these bugs in #635, and to
  Bhaskar Nallani for proposing the fix.
- CREDITS file update.

Change-Id: Ib525c36bcbf20b2bbbe380da3d74d142b338fe9b
2022-07-31 21:06:08 +05:30
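The fix in commit 4b1663213c can be modelled in portable C. The sketch below is a scalar emulation, not the kernel code. `load_two_floats` plays the role of vmovsd (an 8-byte load into the low half of a register, with the rest zeroed), and `update_c_row_n2` is a hypothetical name for the Mx2 update of one row of C.

```c
#include <assert.h>
#include <string.h>

/* Emulates vmovsd(mem(rcx), xmm0): read exactly 8 bytes (two floats)
   into the low half of a 4-float register and zero the upper half. */
void load_two_floats(const float *src, float reg[4])
{
    memset(reg, 0, 4 * sizeof(float));
    memcpy(reg, src, 2 * sizeof(float));
}

/* Hypothetical model of one row update when only two columns exist:
   c[j] = beta * c[j] + ab[j] for j in {0, 1}. */
void update_c_row_n2(float *c, const float ab[4], float beta)
{
    assert(c != 0);
    float creg[4];
    load_two_floats(c, creg);            /* never touches c[2] or c[3] */
    for (int j = 0; j < 2; j++)
        c[j] = beta * creg[j] + ab[j];   /* register-only multiply-add */
}
```

The original memory-operand form of vfmadd231ps reads 16 bytes, so it could touch two floats past the end of a two-column row.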
Kiran Varaganti
86134c7278 Replaced vzeroall
Replaced the vzeroall instruction with vxorpd and vmovapd in the dgemm kernels
(both AVX2 and AVX512). vzeroall is an expensive instruction, so it was replaced
with a faster sequence for zeroing all registers. A vzeroupper instruction is
also added at the end of the AVX2 kernels to avoid AVX2/SSE transition
penalties. Note that only the main kernels are modified.

Change-Id: Ieb9bc629db01f0f94dd0e8e55550940d3d7eb2a4
2022-07-20 01:16:59 -04:00
mkurumel
9f1ce594a5 BLIS : Compiler warning fixes
Details :
  - Fixed warnings with AOCC and GCC compilers.

AMD-Internal: [CPUPL-1662]

Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce
2021-11-12 08:58:52 +05:30
nphaniku
2bdee3cd6c Unifying BLIS Windows and Linux codebase
1. Removed dependency on bli_config.h inclusion in blis.h
2. Provided AOCL DYNAMIC / TRSM PRE INVERSION / COMPLEX RETURN configuration flags.
3. CMAKE changes to incorporate new changes as per 3.1 code base.
4. Removed zen2 folder from Windows directory.

AMD Internal : [CPUPL-1532]

Change-Id: I9261851087d10f73ab563d466fa3f7bb72ddee47
2021-06-03 15:28:10 +05:30
lcpu
7401effc03 BLIS:merge:
Fixed the merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch.

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
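The automatic thread-count reduction described in merge commit 7401effc03 can be sketched as below. The threshold of 11 comes from the stated default of BLIS_NT_MAX_PRIME. The function names are hypothetical, and reducing by exactly one thread is an assumption, since the message only says the count is reduced.

```c
#include <assert.h>
#include <stdbool.h>

#define NT_MAX_PRIME 11   /* mirrors the stated default of BLIS_NT_MAX_PRIME */

/* Trial-division primality check; adequate for thread counts. */
bool is_prime(int n)
{
    if (n < 2)
        return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

/* Hypothetical model: a prime thread count above the threshold is lowered
   so that it can be factored across loops. The decrement by one is an
   assumption. */
int adjust_num_threads(int nt)
{
    assert(nt >= 1);
    if (nt > NT_MAX_PRIME && is_prime(nt))
        return nt - 1;
    return nt;
}
```

A prime count cannot be split into a two-dimensional thread grid, whereas 12 threads can be arranged as 3x4 or 2x6.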
nphaniku
d78defa0fc AOCL Windows: 3.1 BLIS changes
1. CMake script changes for adding new files to the build.
2. Added upper-case support for a couple of APIs.
3. bool is not supported in clang, so it is defined explicitly.

AMD Internal : [CPUPL-1422]

Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61
2021-03-08 22:32:13 +05:30
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00
Meghana Vankadari
943b1362c7 Enabled vectorized pack kernels for zen2 configuration.
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
  implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.

AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
2021-02-12 19:16:57 +05:30
Madan mohan Manokar
3ab9104dae Handling zgemm real(+/-1) alpha and beta
1. Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2. Change done in bli_zgemmsup_rv_zen_asm_3x4n.
3. Change done in bli_zgemmsup_rv_zen_asm_3x4m.
4. Change done in bli_zgemm_haswell_asm_3x4.

Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
2021-02-10 02:58:58 -05:00
Field G. Van Zee
ed50c94738 Merge branch 'master' into dev
2021-01-04 14:31:44 -06:00
Field G. Van Zee
7038bbaa05 Optionally disable trsm diagonal pre-inversion.
Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
  optionally disables the pre-inversion of diagonal elements of the
  triangular matrix in the trsm operation and instead uses division
  instructions within the gemmtrsm microkernels. Pre-inversion is
  enabled by default. When it is disabled, performance may suffer
  slightly, but numerical robustness should improve for certain
  pathological cases involving denormal (subnormal) numbers that would
  otherwise result in overflow in the pre-inverted value. Thanks to
  Bhaskar Nallani for reporting this issue via #461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
  gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
  to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
  instructions.
2020-12-04 16:08:15 -06:00
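The numerical issue behind --disable-trsm-preinversion in commit 7038bbaa05 can be reproduced with two scalar helpers. Both function names are hypothetical. The `volatile` qualifier models the pre-inverted diagonal being stored as a double before use.

```c
#include <assert.h>
#include <float.h>

/* Pre-inversion: the diagonal element is inverted once, then multiplied.
   If a is subnormal, 1/a can overflow to infinity. */
double solve_preinverted(double b, double a)
{
    assert(a != 0.0);
    volatile double inv_a = 1.0 / a;   /* stored pre-inverted value */
    return b * inv_a;
}

/* Division inside the kernel: slower, but no intermediate overflow. */
double solve_divide(double b, double a)
{
    assert(a != 0.0);
    return b / a;
}
```

For example, with a = DBL_MIN / 8 (a subnormal) and b = DBL_MIN, the division returns 8, while the pre-inverted path overflows to infinity.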
Field G. Van Zee
b43dae9a5d Fixed copy-paste bugs in edge-case sup kernels.
Details:
- Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and
  bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly
  instructions that were left over from when the kernels were first
  written. These instructions would cause segmentation faults in some
  situations where extra memory was not allocated beyond the end of
  the matrix buffers. Thanks to Kiran Varaganti for reporting these
  bugs and to Bhaskar Nallani for identifying the cause and solution.
2020-12-01 16:44:38 -06:00
bhaskarn
91909c1562 Fix for segmentation crash in dgemmsup kernels
Description:

[AMD Internal]: CPUPL-1336

Removed extra/unnecessary loads in dgemmsup kernels that were
accessing memory beyond the boundaries and causing segmentation
faults.

Kernels:
bli_dgemmsup_rd_haswell_asm_1x4
bli_dgemmsup_rv_haswell_asm_1x6

Change-Id: Idaeed36ebd9f13550943394a37e372b8d015b2d3
2020-11-24 10:15:57 -05:00
Kumar, Phani
477fc41fff Cmake script changes and blis.h changes for amd-staging-milan-3.0
AMD Internal : [CPUPL-1083]

Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580
2020-11-24 06:12:25 -05:00
Meghana Vankadari
9a330f1754 Added debug trace and log support for gemmt and TRSM APIs
Details:
- Added debug trace support for DGEMMT and DTRSM APIs.
- Added log support for gemmt, trsm APIs.
- Modified gemm dump_sizes function to dump transpose parameters.

AMD-Internal: [CPUPL-1210]
Change-Id: Ice1effe27ec349203ce5def030a6b85b204bd91e
2020-10-02 12:31:47 +05:30
Field G. Van Zee
e293cae2d1 Implemented sgemmsup assembly kernels.
Details:
- Created a set of single-precision real millikernels and microkernels
  comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
  bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
  file in test/sup. This included edits that allow for separate "small"
  dimensions for single- and double-precision as well as for single-
  vs. multithreaded execution.
2020-09-15 16:09:11 -05:00
dzambare
267a959af1 Rebased amd-staging-milan-3.0 branch on master
-- Rebased on top of master commit # 6e522e5823
  -- Updated merged code to remove duplicated code added by auto-merging
  -- Updated merged code to rename bool_t type
  -- Updated merged code to rename bli_thread_obarrier
  -- Updated merged code to rename bli_thread_obroadcast

Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c
AMD-Internal: [CPUPL-1067]
2020-08-06 10:09:29 +05:30
Mangala V
5b8c2bc9e2 Revert "CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed"
This reverts commit 725bf5aceb.

Reason for revert: <INSERT REASONING HERE>

Change-Id: I7dd6b84731f091c8b39080ed9321a708fa5f11d8
2020-08-06 10:09:29 +05:30
managalv
2a0928aad4 CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed
Details:
- Problem:
	If row major, the first four elements of the last column of output matrix C were not updated.
	If col major, the first four elements of the last row of output matrix C were not updated.
- Solution:
	Update the elements at the correct offset after the computation is done in bli_dgemmsup_rv_haswell_asm_5x8().

Change-Id: I588c60f2f3cd5f51e475cfc140e3bf0e9d5a4dae
2020-08-06 10:09:29 +05:30
Devrajegowda, Kiran
6b5c68b9ed "Merge Selective Packing code from amd branch flame/blis"
Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed
2020-08-06 10:09:28 +05:30
Kiran Varaganti
307ddc3110 Revert " Merge Selective Packing code from amd branch flame/blis"
This reverts commit e4a6af33f5.

Reason for revert: <Review not done>

Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2
2020-08-06 10:09:28 +05:30
Field G. Van Zee
889b90888f Implemented gemm on skinny/unpacked matrices.
Details:
- Implemented a new sub-framework within BLIS to support the management
  of code and kernels that specifically target matrix problems for which
  at least one dimension is deemed to be small, which can result in long
  and skinny matrix operands that are ill-suited for the conventional
  level-3 implementations in BLIS. The new framework tackles the problem
  in two ways. First the stripped-down algorithmic loops forgo the
  packing that is famously performed in the classic code path. That is,
  the computation is performed by a new family of kernels tailored
  specifically for operating on the source matrices as-is (unpacked).
  Second, these new kernels will typically (and in the case of haswell
  and zen, do in fact) include separate assembly sub-kernels for
  handling of edge cases, which helps smooth performance when performing
  problems whose m and n dimension are not naturally multiples of the
  register blocksizes. In a reference to the sub-framework's purpose of
  supporting skinny/unpacked level-3 operations, the "sup" operation
  suffix (e.g. gemmsup) is typically used to denote a separate namespace
  for related code and kernels. NOTE: Since the sup framework does not
  perform any packing, it targets row- and column-stored matrices A, B,
  and C. For now, if any matrix has non-unit strides in both dimensions,
  the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
  bli_gemmsup_ref_var2() provides a block-panel variant (in which the
  2nd loop around the microkernel iterates over n and the 1st loop
  iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
  variant (2nd loop over m and 1st loop over n). However, these variants
  are not used by default and provided for reference only. Instead, the
  default sup handler calls _var2m() and _var1n(), which are similar
  to _var2() and _var1(), respectively, except that they defer to the
  sup kernel itself to iterate over the m and n dimension, respectively.
  In other words, these variants rely not on microkernels, but on
  so-called "millikernels" that iterate along m and k, or n and k.
  The benefit of using millikernels is a reduction of function call
  and related (local integer typecast) overhead as well as the ability
  for the kernel to know which micropanel (A or B) will change during
  the next iteration of the 1st loop, which allows it to focus its
  prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
  of A changes while the same upanel of B is reused. In _var1n()'s, the
  upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
  enabled by default. However, the default thresholds at which the
  default sup handler is activated are set to zero for each of the m, n,
  and k dimensions, which effectively disables the implementation. (The
  default sup handler only accepts the problem if at least one dimension
  is smaller than or equal to its corresponding threshold. If all
  dimensions are larger than their thresholds, the problem is rejected
  by the sup front-end and control is passed back to the conventional
  implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
  the sup framework, most notably:
  - sup thresholds: the thresholds at which the sup handler is called.
  - sup handlers: the address of the function to call to implement
    the level-3 skinny/unpacked matrix implementation.
  - sup blocksizes: the register and cache blocksizes used by the sup
    implementation (which may be the same or different from those used
    by the conventional packm-based approach).
  - sup kernels: the kernels that the handler will use in implementing
    the sup functionality.
  - sup kernel prefs: the IO preference of the sup kernels, which may
    differ from the preferences of the conventional gemm microkernels'
    IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
  handling should be enabled/disabled. This allows per-call control
  of whether the sup implementation is used, which is useful for test
  drivers that wish to switch between the conventional and sup codes
  without having to link to different copies of BLIS. The corresponding
  accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
  directory, kernels/haswell/3/sup. These kernels include two general
  implementation types--'rd' and 'rv'--for the 6x8 base shape, with
  two specialized millikernels that embed the 1st loop within the kernel
  itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
  gemmsup microkernels. NOTE: These microkernels, unlike the current
  crop of conventional (pack-based) microkernels, do not use constant
  loop bounds. Additionally, their inner loop iterates over the k
  dimension.
- Defined new typedef enums:
  - stor3_t: captures the effective storage combination of the level-3
    problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
    special value of BLIS_XXX is used to denote an arbitrary combination
    which, in practice, means that at least one of the operands is
    stored according to general stride.
  - threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
  can be passed "-1, -1" as a lazy request for row storage. (Note that
  "0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
  including imul, vhaddps/pd, and other instructions related to integer
  vectors.
- Disabled the older small matrix handling code inserted by AMD in
  bli_gemm_front.c, since the sup framework introduced in this commit
  is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
  drivers, a Makefile, a runme.sh script, and an 'octave' directory
  containing scripts compatible with GNU Octave. (They also may work
  with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
  level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
  of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
2020-08-03 11:48:42 +05:30
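The acceptance rule of the default sup handler, as described in commit 889b90888f, reduces to a single predicate. The struct and function names below are hypothetical, and any non-zero threshold values used with them are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three per-dimension sup thresholds. */
typedef struct { long m, n, k; } sup_thresh_t;

/* The problem is accepted by the sup path if at least one dimension is
   less than or equal to its threshold. Otherwise it falls back to the
   conventional implementation. */
bool sup_accepts(long m, long n, long k, sup_thresh_t t)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    return m <= t.m || n <= t.n || k <= t.k;
}
```

With the default thresholds of zero, every problem with positive dimensions is rejected, which is why the message says the defaults effectively disable the implementation.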
Field G. Van Zee
4f5b014c05 Added missing rv_d?x6 edge cases to sup kernel.
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
  various n = 6 edge cases with a single sup kernel call. Previously,
  only n = {4,2,1} were handled explicitly as single kernel calls;
  that is, cases where n = 6 were previously being executed via two
  kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
2020-08-03 11:23:40 +05:30
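The effect of adding explicit n = 6 kernels in commit 4f5b014c05 can be shown by counting kernel calls for a given leftover width. This sketch assumes a greedy dispatch over the available widths. `count_kernel_calls` is a hypothetical helper, not part of BLIS.

```c
#include <assert.h>

/* Count kernel calls needed to cover n_left columns, given the available
   edge-kernel widths in descending order (the last width must be 1).
   Greedy selection is an assumption about the dispatch logic. */
int count_kernel_calls(int n_left, const int *widths, int num_widths)
{
    assert(n_left >= 0 && num_widths > 0 && widths[num_widths - 1] == 1);
    int calls = 0;
    for (int w = 0; w < num_widths; w++) {
        while (n_left >= widths[w]) {
            n_left -= widths[w];
            calls++;
        }
    }
    return calls;
}
```

With widths {4, 2, 1}, n = 6 takes two calls (4 + 2). With {6, 4, 2, 1} it takes one.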
Field G. Van Zee
0651b466c2 Bugfixes, cleanup of sup dgemm ukernels.
Details:
- Fixed a few not-really-bugs:
  - Previously, the d6x8m kernels were still prefetching the next upanel
    of A using MR*rs_a instead of ps_a (same for prefetching of next
    upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
    that the upanels might be packed, using ps_a or ps_b is the correct
    way to compute the prefetch address.
  - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
    executed as intended even though it was based on a faulty pointer
    management. Basically, in the rd_d6x8m kernel, the pointer for B
    (stored in rdx) was loaded only once, outside of the jj loop, and in
    the second iteration its new position was calculated by incrementing
    rdx by the *absolute* offset (four columns), which happened to be the
    same as the relative offset (also four columns) that was needed. It
    worked only because that loop only executed twice. A similar issue
    was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
  - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
    that it is loaded only once outside of the loops rather than
    multiple times inside the loops.
  - Changed outer loop in rd kernels so that the jump/comparison and
    loop bounds more closely mimic what you'd see in higher-level source
    code. That is, something like:
      for( i = 0; i < 6; i+=3 )
    rather than something like:
      for( i = 0; i <= 3; i+=3 )
  - Switched row-based IO to use byte offsets instead of byte column
    strides (e.g. via rsi register), which were known to be 8 anyway
    since otherwise that conditional branch wouldn't have executed.
  - Cleaned up and homogenized prefetching a bit.
  - Updated the comments that show the before and after of the
    in-register transpositions.
  - Added comments to column-based IO cases to indicate which columns
    are being accessed/updated.
  - Added rbp register to clobber lists.
  - Removed some dead (commented out) code.
  - Fixed some copy-paste typos in comments in the rv_6x8n kernels.
  - Cleaned up whitespace (including leading ws -> tabs).
  - Moved edge case (non-milli) kernels to their own directory, d6x8,
    and split them into separate files based on the "NR" value of the
    kernels (Mx8, Mx4, Mx2, etc.).
  - Moved config-specific reference Mx1 kernels into their own file
    (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
  - Added rd_dMx1 assembly kernels, which seem marginally faster than
    the corresponding reference kernels.
  - Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
    the row-oriented reference kernels for all storage combos.
2020-08-03 11:22:32 +05:30
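The pointer-management bug in rd_d6x8m described in commit 0651b466c2 (incrementing by an absolute offset where a relative one was needed) can be modelled with two small functions. Both names are hypothetical. `step` stands for the four-column stride mentioned in the message.

```c
#include <assert.h>

/* Correct scheme: each iteration advances the pointer by the relative
   offset (step), so after `iter` iterations it sits at iter * step. */
long b_offset_relative(int iter, long step)
{
    assert(iter >= 0);
    long off = 0;
    for (int j = 1; j <= iter; j++)
        off += step;
    return off;
}

/* Faulty scheme: the pointer is loaded once outside the loop, and each
   iteration adds the absolute offset j * step to the running pointer. */
long b_offset_absolute_bug(int iter, long step)
{
    assert(iter >= 0);
    long off = 0;
    for (int j = 1; j <= iter; j++)
        off += (long)j * step;
    return off;
}
```

The two schemes agree for the first and second iterations and diverge from the third. That is why the kernel worked while its loop executed only twice.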
Field G. Van Zee
2605eb4d99 Added missing rv_d?x6 edge cases to sup kernel.
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
  various n = 6 edge cases with a single sup kernel call. Previously,
  only n = {4,2,1} were handled explicitly as single kernel calls;
  that is, cases where n = 6 were previously being executed via two
  kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
2020-07-15 15:25:19 -05:00
phakumar
ccf0772d6e BLIS library porting on to Windows:
The library was ported to Windows 10 using CMake scripts and Visual Studio 2019 with the clang compiler.
AMD internal:[CPUPL-657]

Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
2020-06-16 18:29:00 +05:30