Commit Graph

419 Commits

Author SHA1 Message Date
Nageshwar Singh
cbd9ea76af Complex single standalone gemv implementation independent of axpyf.
Details
- For axpyf implementation there are function(axpyf) calling overhead.
- New implementations reduces function calling overhead.
- This implementation uses kernel of size 8x4.
- This implementation gives better performance for smaller sizes when
  compared to axpyf based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: Ic9a5e59363290caf26284548638da9065952fd48
2021-11-12 08:58:55 +05:30
Harsh Dave
590c763e22 Implemented ctrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 8x3 CGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 24 scomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ctrsm_small for in ctrsm_ BLAS path for single thread
   when (m,n)<1000 and multithread (m+n)<320
-- Taken care of --disable_pre_inversion configuration
-- Achieved 13% average performance improvement for sizes less than 1000
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I557c5bcd8cb7c034acd99ce0666bc411e9c4fe64
2021-11-12 08:58:55 +05:30
satish kumar nuggu
23278627f4 STRSM small kernel implementation
Details:
-- AMD Internal Id: [CPUPL-1702]
-- Used 16x6 SGEMM kernel with vector fma by utilizing ymm registers
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Taken care of --disable_pre_inversion configuration
-- modularized strsm 16 combinations of trsm into 4 kernels

Change-Id: I30a1551967c36f6bae33be3b7ae5b7fcc7c905ea
2021-11-12 08:58:55 +05:30
Nageshwar Singh
a3d04a21a0 Complex double standalone gemv implementation independent of axpyf.
Details
- For axpyf implementation there are function(axpyf) calling overhead.
- New implementations reduces function calling overhead.
- This implementation uses kernel of size 4x4.
- This implementation gives better performance for smaller sizes when
  compared to axpyf based implementation

AMD-Internal: [CPUPL-1402]
Change-Id: I5fa421b8c1d2b44c991c2a05e8f5b01b83eb4b37
2021-11-12 08:58:54 +05:30
Harsh Dave
a7f600b3a4 Implemented ztrsm small kernels
Details:
-- AMD Internal Id: CPUPL-1702
-- Used 4x3 ZGEMM kernel with vector fma by utilizing ymm registers
   efficiently to produce 12 dcomplex outputs at a time
-- Used packing of matrix A to effectively cache and reuse
-- Implemented kernels using macro based modular approach
-- Added ztrsm_small for in ztrsm_ BLAS path for single thread
   when (m,n)<500 and multithread (m+n)<128
-- Taken care of --disable_pre_inversion configuration
-- Achieved 10% average performance improvement for sizes less than 500
-- modularized all 16 combinations of trsm into 4 kernels

Change-Id: I3cb42a1385f6b3b82d6c470912242675789cce75
2021-11-12 08:58:54 +05:30
Nageshwar Singh
a263146a4c Optimized scalv for complex data-types c and z (cscalv and zscalv)
AMD-Internal: [CPUPL-1551]
Change-Id: Ie6855409d89f1edfd2a27f9e5f9efa6cd94bc0c9
2021-11-12 08:58:53 +05:30
satish kumar nuggu
3bafdf3923 DGEMM kernel implementation for case k = 1.
Details :
      - DGEMM kernel implementation for case k = 1, vectorized with 8x6 block implementation (Rank-1 update in DGEMM Optimization).

Change-Id: I7d06378adeb8bcc5b965e2a94314d731629d0b4c
2021-11-12 08:58:53 +05:30
mkurumel
595f7b7edf dnrm2 optimization with dot method
1.  Added new kernel bli_dnorm2fv_unb_var1 kernel to compute
	norm with dot operation.
    2.  Added vectorization to compute square of 32 double element
	block size from vector X.
    3.  Defined a new Macro BLIS_ENABLE_DNRM2_FAST under config header
	to compute nrm2 using new kernel.
    4.  Dot kernel definitions and implementation have a possibility for
	accuracy issues .we can switch to traditional implementation by
	disabling the MACRO BLIS_ENABLE_DNRM2_FAST to compute L2-norm
	for Vector X .

    AMD-Internal: [CPUPL-1757]

Change-Id: I1adcaf1b3b4e33837758593c998c25705ff0fe11
2021-11-12 08:58:53 +05:30
mkurumel
9f1ce594a5 BLIS : Compiler warning fixes
Details :
  - Fixed warnings with AOCC and GCC compilers.

AMD-Internal: [CPUPL-1662]

Change-Id: Ia0e298a169d4dd4664b11e03a4e3cd340e9fdfce
2021-11-12 08:58:52 +05:30
Meghana Vankadari
10ca8710f0 Optimized SUP code for GEMMT
Details:
- Eliminated the IR loop in ref_var2m functions.
- Handled the rectangular and triangular portions of C matrix
  separately.
- Added a condition to check and eliminate zero regions inside IC loop.
- modified kc selection logic to choose optimal KC in SUP
- Updated thresholds to choose between SUP and native.

Change-Id: I21908eaa6bc3a8f37bdea29f7bfca7e6fcfee724
2021-11-12 08:58:52 +05:30
Kiran Varaganti
7196b86f05 Removed packm kernels of zen
intrinsic optimized packm kernels written for zen are no longer used.
Therefore removing it. Currently packm kernels from haswell configuration
are being used for zen2 and zen3 configs.
2021-11-12 08:58:51 +05:30
Madan mohan Manokar
3dda4ebf22 Induced method turned off, fix for beta=0 & C = NAN
1. Induced Method turned off, till the path fully tested for different alpha,beta conditions.
2. Fix for Beta =0, and C = NAN done.

Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
2021-11-12 08:58:50 +05:30
Nallani Bhaskar
15b7fff159 Fixed reading C when beta=0 in few sgemm asm kernels
Description:

1. When beta is zero we should not be doing any arthemetic operation
   on C data and should not assume anything on values of C matrix

2. It is taken care in all sgemmsup kernels already except
   in bli_sgemmsup_rv_zen_asm_2x8 and bli_sgemmsup_rv_zen_asm_3x8
   kernels, when beta zero and C is column storage case. Fixed
   this issue by removing reading C matrix in these kernels.

3. When C has NaN or Inf and when we multiply NaN or Inf with zero
   (beta) the result becomes NaN only.

Change-Id: I3fb8c0cd37cf1d52a7909f6b402aa9c40c7c3846
2021-11-12 08:58:50 +05:30
Meghana Vankadari
5c770cafee Removed syrk_small code
- The current implementation of syrk_small computes the entire C matrix
  rather than computing triangular part. This implementation is not
  efficient.

AMD-Internal: [CPUPL-1571]
Change-Id: I9a153207471a55e52634429062d18ba1a225fed9
2021-11-12 08:58:49 +05:30
Abhiram S
7787bc79b1 Level1 samaxv: AVX512 implementation
Details:
 1. Unrolled by a factor 5. This gave around 1GFLOPS gain
 2. Changed CMP to subs and remove nan. CMP uses a lot of
    compare, which is higher in latency and more number of
    instructions. Replacing with subs and remove nan
    reduced it to 3 instructions and lighter ones.
 3. Added remove nan function.
 4. Added AVX512 definition in skx context.
 5. Disabled code in AMAXV kernel depending on AVX512 flag
    exists or not

Change-Id: I191725a55bc33edf8d537156292cf997d6a5fe35
2021-09-27 16:10:08 +05:30
Harihara Sudhan S
bdb5e32176 Level 1 Kernel: damaxv AVX512
Details:

- Developed damaxv for AVX512 extension
- Implemented removeNAN function that converts NAN values
  to negative values based on the location
- Usage COMPARE256/COMPARE128 avoided in AVX512
  implementation for better performance
- Unrolled the loop by order of 4.

Change-Id: Icf2a3606cf311ecc646aeb3db0628b293b9a3326
2021-09-02 09:47:08 +05:30
satish kumar nuggu
5adb7bf1a4 Combined variants to reduce redundancy in dtrsm small
1. Left Lower non-trans,Left Upper trans
2. Left Upper non-trans,Left Lower trans
3. Right Lower non-trans.Right Upper trans
4. Right Upper non-trans,Right Lower trans

Change-Id: I0b0155d7c3a55ec74d53c8f1f49f1bceb63b15f5
2021-08-26 15:04:14 +05:30
Meghana Vankadari
6bad157754 Added a new field in cntx to store l3 threshold function pointers
Details:
- Adding threshold function pointers to cntx gives flexibility to choose
  different threshold functions for different configurations.
- In case of fat binary where configuration is decided at run-time,
  adding threshold functions under a macro enables these functions for
  all the configs under a family. This can be avoided by adding function
  pointers to cntx which can be queried from cntx during run-time
  based on the config chosen.

Change-Id: Iaf7e69e45ae5bb60e4d0f75c7542a91e1609773f
2021-08-16 00:10:01 -04:00
Kiran Varaganti
adfd569591 DGEMM Optimizations
Improve DGEMM performance for smaller sizes. AOCL DYNAMIC is incorporated at blas interface to enable
calling bli_dgemm_small when optimum number of threads implied is 1 for (n and k < 10).
Improved smart threading logic for dgemm,
Additional conditions at the blas interface added to invoke bli_dgemm_small.
Removed  N > 3  condition from bli_dgemm_small.

Change-Id: Id751528dfe9de37800b02ffaf765b6c82487093e
2021-08-10 12:34:43 -04:00
Nallani Bhaskar
40a5a614b2 Updated vector loads/stores in reminder cases to avoid access beyond the matrix boundary
Description:
1. While processing reminder cases in bli_trsm_small algorithm
   there were few loads and stores which were accessing
   beyond the given matrix buffer because of vectorized instructions.

2. Modified 256bit vector loads at edges into 128bit or 64 bit loads/stores
   such that no read/write happens beyond the matrix boundary.

AMD-Internal: [CPUPL-1759] [SWLCSG-819]

Change-Id: Iba51d0ed9bb28d1b0948a219755b8dbcc86a7fa9
2021-08-09 10:51:26 +05:30
Madan mohan Manokar
d3542ff0e0 3m_sqp conjugate support added
1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.

AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
2021-07-05 18:44:55 +05:30
Madan mohan Manokar
70e9d327a2 squarePacked(sqp) framework and multi-instance handling
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to seperate kernel file.
7. residue kernel core added to handle mx<8.
8. multi-instance tuning for 3m_sqp done.
9. user can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.

AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
2021-06-28 15:40:11 +05:30
Chandrashekara KR
d7377f967c Merge "AOCL-Windows: Update BLIS build system" into amd-staging-milan-3.1 2021-06-17 08:49:55 -04:00
Nallani Bhaskar
e328bdc549 Added prefetch in left cases of dtrsm small
Details:

1. Added prefetching next micro-panel of A and B in dgemm block,
   which are helping in reducing load latency and improved performance.

2. Removed unnecessary unrolls in gemm loops and moved 8x6,6x8 core
   dgemm into macros and made it more modular

3. Packing and diagonal packing in main dgemm loops are modularized.
   Fringe cases are yet to modularize.

4. Updated dtrsm small thresholds for single and multi thread cases

5. Updated div/scale based on disable/enable of trsm pre-inversion

6. Code clean up

Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df
2021-06-15 23:15:22 +05:30
Chandrashekara K R
f94e3ad237 AOCL-Windows: Update BLIS build system
1. Added support in cmake scripts for linking libomp for blis multithreading build.
 2. Added ${CMAKE_CURRENT_SOURCE_DIR}/bli_axpyf_zen_int_6.c statement in blis\kernels\zen\1f cmake file to build newly added file.
 3. Added the new macros in blis/frame/include/bli_macro_defs.h for ENABLE_NO_UNDERSCORE_API support for gemm_batch and axpby API's.
 4. Modified the file open mode from binary to text mode in blis/testsuite/src/test_libblis.c file to avoid the line ending issue on different OS.
 5. Added the definition for the macro BLIS_DISABLE_TRSM_PREINVERSION in main CmakeLists.txt file.

AMD Internal : [CPUPL-1630]

Change-Id: Iba1b7b6d014a4317de7cbaf42f812cad20111e4f
2021-06-15 16:49:08 +05:30
satish kumar nuggu
8885136786 Added prefetch in gemm module for single threaded dtrsm small for right cases
Details:

1. By adding prefetch in gemm module we observed average gain of 10% in dtrsm right cases.
2. For skinny sizes with sizes m<=2000 and n<=1000, performance is equivalent to MKL.

Change-Id: I6a5f4b676aa133eb71edb249eccc4644d97da605
2021-06-08 17:39:23 +05:30
Nagarapu Phanikumar
7ea32e6d0b Merge " Unifying BLIS Windows and Linux codebase" into amd-staging-milan-3.1 2021-06-03 06:03:26 -04:00
nphaniku
2bdee3cd6c Unifying BLIS Windows and Linux codebase
1. Removed dependency on bli_config.h inclusion in blis.h
 2. Provided AOCL DYNAMIC / TRSM PRE INVERSION / COMPLEX RETURN configuration flags.
 3. CMAKE changes to incorporate new changes as per 3.1 code base.
 4. Removed zen2 folder from Windows directory.

AMD Internal : [CPUPL-1532]

Change-Id: I9261851087d10f73ab563d466fa3f7bb72ddee47
2021-06-03 15:28:10 +05:30
mkurumel
99e3bce065 SGEMV : single Precision axpyf kernel optimization for SGEMV
Details :
  - Implemented saxpyf kernel with fuse factor=6 for sgemv.

AMD-Internal: [CPUPL-1403]
Change-Id: I72fd30c08a789603267cf58910138549d45d231a
2021-06-02 07:55:48 -04:00
Nageshwar Singh
2e1a5bc1dd Optimized double complex axpyf kernel for zgemv
Details:
  - Implemented zaxpyf kernel with fuse factor=4 for zgemv.
  - Modified BLAS interface call for zgemv to reduce framework overhead.
  - Directed gemv to dotv in the case where dimension of y vector is 1.
  - when alpha = 0, gemv becomes scalv of Y with beta. Added code to
    return early after scaling Y vector with beta.

AMD-Internal: [CPUPL-1402]
Change-Id: I2231285fe3060982d4434466346a040b7ab803fc
2021-06-01 18:03:29 +05:30
satish kumar nuggu
82087773a0 Optimized single threaded dtrsm small for right cases
Details:

1. Added optimized dtrsm kernels for all 8 right side cases
   Below are few notable optimizations which improved performance

   a. Loading, transposing (for transa cases), packing and reusing
      of a01 block required for GEMM operation. The block size
      increases from 0 to 6X(n-6) in steps of 6x6 while solving TRSM
      from one end of A to other end of triangular A
   b. Packing of 6 diagonal elements in one location helped to utilize
      cache line efficiently

      AMD-Internal: [CPUPL-1563]

Change-Id: Iabd37536216d5215fc69ee1f8ec671b52f1be9d3
2021-05-25 01:09:50 -04:00
Meghana Vankadari
8c9a7c21b4 Optimized axpyf kernel for scomplex datatype
Details:
- Implemented axpyf kernel with fuse factor=4 for scomplex datatype.
- Modified BLAS interface call for cgemv to reduce framework overhead.
- Directed gemv to dotv in the case where dimension of y vector is 1.
- when alpha = 0, gemv becomes scalv of Y with beta. Added code to
return early after scaling Y vector with beta.

AMD-Internal: [CPUPL-1402]
Change-Id: Ibaab078008d76953332ba4da3515993578c0e586
2021-05-24 14:40:17 +05:30
Nallani Bhaskar
3a2e4c3db8 Added optimized single threaded dtrsm small for left cases
Details:

1. Added optimized dtrsm kernels for all 8 left side cases
   Below are few notable optimizations which improved performance

   a. Loading, transposing (for transa cases), packing and reusing
      of a10 block required for GEMM operation. The block size
      increases from 0 to 8X(m-8) in steps of 8x8 while solving TRSM
      from one end of A to other end of triangular A
   b. Performing inregister transpose whenever required
   c. Packing of 8 diagonal elements in one location helped to utilize
      cache line efficiently

2. Enabled calling dtrsm small for smaller sizes at cblas level itself
   to avoid frame work overhead, which is significant for very small
   sizes

3. Thanks to SatishKumar.Nuggu@amd.com for implementing lln, llt, lun
   and manideep.kurumella@amd.com for implementing lut kernels
   using intrinsics.

4. Removed all older implementations of strsm which are not
   developed as per the guide lines, can be refered from
   older releases if required.

Change-Id: I66ad6ef364cbcf5c99a3c4a4dcac12929865ade6
2021-05-18 16:16:00 +05:30
Nallani Bhaskar
b239a5aee7 Bug fix in sgemmsup 1x16n kernel
Details:

Address increment was missing in bli_sgemmsup_rv_zen_asm_1x16 kernel
while storing output in column major order in beta zero case

JIRA: CPUPL-1548

Change-Id: I36269cd28de6fbef2256451e399f90f0437b0ce1
2021-04-28 21:33:30 +05:30
lcpu
7401effc03 BLIS:merge:
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch

Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)

Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.

Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.

Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)

Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)

Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.

Minor code consolidation in all level-3 _front() functions.

Reorganized Windows cpp branch of bli_pthreads.c.

Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.

Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.

Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.

AMD-internal-[CPUPL-1523]

Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
2021-04-27 11:09:48 +05:30
Madan mohan Manokar
f6088ac1cf Enabling 3m_sqp and 3m1 methods
1. Re-enabling 3m methods for zgemm.
2. Vectorization of pack_sum routines re-enabled with bug fix.
3. 8mx6n kernel added.

AMD-Internal: [CPUPL-1352]
Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810
2021-04-22 02:47:31 -04:00
Mangala V
9e147912ee Merge "ZGEMM SUP: Removed unused assembly intructions" into amd-staging-milan-3.1 2021-04-19 03:08:31 -04:00
managalv
6c3741cd3e ZGEMM SUP: Removed unused assembly intructions
Removed memory operations which were being unused
Modified labels to be unquie to a file
Rowstride update is done at once to avoid multiple mul instruction

AMD Internal  : [CPUPL-1419]

Change-Id: I9b1a61e5d73f46f7527339a43789edd8e2402103
2021-04-19 20:31:03 +05:30
mkurumel
f8525a888e SGEMV performance improvement.
1.bli_sdotxf_zen_int_8 :
               added hadd_ps intrinsic instead of dp_ps for
               add partial dot outputs.

AMD Internal  : [CPUPL-1512]

Change-Id: I6e8e71a9cf8c1f30a1710dd1c67f193a998beb03
2021-04-12 10:47:23 +05:30
Madan mohan Manokar
7112b73d0d disabled zgemm induced and gemm sqp temporarily.
1. mx1, mx4 kernel addition and framework modification.
2. 8mx6n kernel addition.
3. NULL check added in dgemm_sqp malloc.
4. mem tracing added.
5. Restricted 3m_sqp to limited matrix sizes.
6. Induced methods disabled temporarily for debug.

AMD-Internal: [CPUPL-1352]
Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2
2021-04-04 20:43:03 +05:30
Nicholai Tukanov
22c6b5dc4c Fixed bug in power10 microkernel I/O. (#488)
Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
  not store the microtile result correctly due to incorrect indices
  calculations. (The error was introduced when I reorganized the 
  'kernels/power10/3' directory.)
2021-03-30 19:07:42 -05:00
Madan Mohan Manokar
4f19ef8339 Merge "3m_sqp vectorization" into amd-staging-milan-3.1 2021-03-10 02:03:23 -05:00
Madan mohan Manokar
a424e8b426 3m_sqp vectorization
1. bli_malloc modified to normal malloc and address alignment within 3m_sqp.
2. function added to pack A real,imag and sum.
3. function added to pack B real,imag and sum.
4. function added to pack C real,imag and beta handling.
4. sum and sub vectorized.

AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
2021-03-10 11:54:50 +05:30
nphaniku
d78defa0fc AOCL Windows: 3.1 BLIS changes
1. CMake script changes for adding new files to the build.
 2. Added Upper case support for couple of API's.
 3. bool is not support in clang so defined it.

AMD Internal : [CPUPL-1422]

Change-Id: I4cac8fb8ef86cd6bacfd29e3b1a84c5da1310f61
2021-03-08 22:32:13 +05:30
nphaniku
b3628cdfd3 AOCL Windows: 3.1 BLIS changes
1. CMake script changes for build with Clang compiler.
 2. CMake script changes for build test and testsuite based on the lib type ST/MT
 3. CMake script changes for testcpp and blastest
 4. Added python scripts to support library build and testsuite build.

AMD Internal : [CPUPL-1422]

Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
2021-03-08 19:04:17 +05:30
Nicholai Tukanov
670bc7b60f Add low-precision POWER10 gemm kernels (#467)
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations 
  based on low-precision gemm kernels, and (2) extends the BLIS typed 
  API for those new implementations. Currently, these new kernels can 
  only be used for the POWER10 microarchitecture; however, they may 
  provide a template for developing similar kernels for other 
  microarchitectures (even those beyond POWER), as changes would likely 
  be limited to select places in the microkernel and possibly the 
  packing routines. The new low-precision operations that are now 
  supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more 
  information, refer to the POWER10.md document that is included in 
  'sandbox/power10'.
2021-03-05 13:53:43 -06:00
Field G. Van Zee
f5871c7e06 Added complex asm packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision complex domain (c and z) and housed them in the 'haswell'
  kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
  were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), upon which these complex kernels are
  partially based.
2021-02-28 17:03:57 -06:00
Field G. Van Zee
426ad679f5 Added assembly packm kernels for 'haswell' set.
Details:
- Implemented assembly-based packm kernels for single- and double-
  precision real domain (s and d) and housed them in the 'haswell'
  kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
  optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
  and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
  packm kernels (d6xk and d8xk), which I have now tweaked and used to
  create comparable single-precision real kernels (s6xk and s16xk).
2021-02-27 18:39:56 -06:00
Meghana Vankadari
22d4689360 Implemented 16x3 based gemm kernel for the case where A has transpose
Details:
- This implementation does a transpose operation while packing 16xk of A
  buffer and passes it to 16x3-nn kernel.
- The same implementation works for the case where B has transpose.

AMD-Internal: [CPUPL-1376]
Change-Id: I81f74deb609926598f62c30f5bd6fc80fb1b9a17
2021-02-18 16:47:14 +05:30
Meghana Vankadari
cf7d9c7314 Disabled calling of bli_dgemm_small from gemm_front
Details:
- Decision logic to choose small_gemm has been moved to blas interface.
- Redirecting all the calls to small_gemm from gemm_front to native
  implementation.

AMD-Internal: [CPUPL-1376]
Change-Id: I6490f67113e9f7c272269f441c86f2a0b3c89a53
2021-02-16 11:30:20 +05:30