3171 Commits

Author SHA1 Message Date
Shubham Sharma
d5cd5836b1 Fixed DGEMM 8x24 kernel for beta zero
- Column stride is not taken into consideration in
  current implementation when writing to C buffer
  if beta is zero and C is column major stored.

- Fixed C storage in case of column major stored C
  when beta is zero in 8x24 DGEMM kernel.
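
A minimal sketch of the idea behind this fix, not the actual 8x24 assembly kernel: when beta is zero, C is only written, and every write has to go through both the row stride and the column stride. The helper name `store_tile_beta_zero` and the dense row-major accumulator layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the BLIS kernel: write an m x n tile of
   alpha * AB into C when beta == 0. C is addressed through a row stride
   rs_c and a column stride cs_c, so column-major C (rs_c == 1,
   cs_c == ldc) and row-major C (rs_c == ldc, cs_c == 1) both work.
   C is never read, so stale values (even NaN) cannot leak through. */
static void store_tile_beta_zero(double *c, size_t rs_c, size_t cs_c,
                                 const double *ab, size_t m, size_t n,
                                 double alpha)
{
    assert(c != NULL && ab != NULL);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            /* ab is assumed to be a dense row-major m x n accumulator. */
            c[i * rs_c + j * cs_c] = alpha * ab[i * n + j];
}
```

Ignoring `cs_c` here would place every column of the tile on top of the first one, which is the kind of failure the commit describes for column-major C.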

AMD-Internal: [CPUPL-4404]
Change-Id: I5b8dfce962995e3238cf902b5a09dd1bf90002a8
2024-02-05 06:57:06 -05:00
mangala v
aa5731eba7 Gtestsuite: Updated SGemm test scenario
1. Earlier tests took a long time to initialise and run, so test cases
   that are already covered by another scenario were removed.
2. Added two categories of tests:
   a. Tests to cover all sizes of m, n, k for
      bli_sgemmsup_rv_zen_asm_6x16m kernel
   b. Tests to cover various alpha and beta values for above kernel

With current update building and running takes less than 2 minutes.

Change-Id: I1479a8ca960c04d4642857fdc7949458646dafb7
2024-02-05 04:22:21 -05:00
Shubham Sharma
fc91932b4a Fixed out of bounds read in DTRSM small kernels
- In 3x1 fringe case in [RLN/RUT] kernel, 4 double
  precision floats are being read instead of 3 doubles.

- Fixed the code to read only 3 double.
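
As an illustration only (the real kernels use vector loads), a fringe load can copy exactly the valid element count into a padded lane buffer. The name `load_fringe` and the 4-wide buffer are assumptions for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical illustration of a 3x1 fringe load: copy exactly `count`
   doubles (count <= 4) into a zero-padded 4-wide lane buffer, so the
   source is never read past its last valid element. */
static void load_fringe(double lanes[4], const double *src, size_t count)
{
    assert(count <= 4);
    memset(lanes, 0, 4 * sizeof(double));
    memcpy(lanes, src, count * sizeof(double));
}
```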

AMD-Internal: [CPUPL-4403]
Change-Id: If0afb155efefabe13487cf322d479981f1838aa2
2024-02-02 10:31:12 +05:30
mangala v
0659a647e0 Gtestsuite: Micro Kernel Testing of ZGEMM API
Summary:
- Aims to perform accuracy testing of ZGEMM micro kernel.
- BLIS kernel is called directly from the gtestsuite framework.
- Micro kernel is invoked with required input, output parameters.
- No objects are created to call micro kernel.
- No framework code would be invoked in this method.

Below AVX2 & AVX512 Micro kernels are being tested using gtestsuite

Native Kernels:
 - AVX2: bli_zgemm_haswell_asm_3x4
         bli_zgemm_zen_asm_2x6(Required for TRSM computation)
 - AVX512: bli_zgemm_zen4_asm_12x4
           bli_zgemm_zen4_asm_4x12(Required for TRSM computation)

SUP Kernels:
- AVX2 Kernels:
      bli_zgemmsup_rd_zen_asm_3x4m
      bli_zgemmsup_rd_zen_asm_3x2m
      bli_zgemmsup_rd_zen_asm_3x4n
      bli_zgemmsup_rd_zen_asm_2x4n
      bli_zgemmsup_rd_zen_asm_(2/1)x4
      bli_zgemmsup_rd_zen_asm_(2/1)x2
      bli_zgemmsup_rv_zen_asm_(2/1)x4
      bli_zgemmsup_rv_zen_asm_(2/1)x2
      bli_zgemmsup_rv_zen_asm_3x4m
      bli_zgemmsup_rv_zen_asm_3x2m
      bli_zgemmsup_rv_zen_asm_3x4n
      bli_zgemmsup_rv_zen_asm_2x4n
      bli_zgemmsup_rv_zen_asm_1x4n
      bli_zgemmsup_rv_zen_asm_3x2

- AVX512 kernels:
     bli_zgemmsup_cv_zen4_asm_12x4m
     bli_zgemmsup_cv_zen4_asm_12x3m
     bli_zgemmsup_cv_zen4_asm_12x2m
     bli_zgemmsup_cv_zen4_asm_12x1m
     bli_zgemmsup_cv_zen4_asm_8x(4/3/2/1)
     bli_zgemmsup_cv_zen4_asm_4x(4/3/2/1)
     bli_zgemmsup_cv_zen4_asm_2x(4/3/2/1)

Above kernels are tested with different combination of parameters such as storage, alpha, beta, transpose & dimensions.

DGEMM: Minor update in DGEMM micro kernel (Buffer allocation, comment section, order of passing arguments)

AMD-Internal: [CPUPL-4426]

Change-Id: I9d6ab24278450f57d13589ad89151a4acc641f08
2024-01-31 10:30:57 -05:00
Eleni Vlachopoulou
b9a808e5d8 GTestSuite: Updating datagenerators helper functions.
- Moved function definitions in the header to avoid explicit template
  instantiations.
- Templatized from and to bounds to enable combinations of integer or
  floating-point values.
- Used an enum class for the element type instead of a char to make it
  more robust since chars get cast to integers. Now we should be
  getting better error messages if there is a mismatch.
- Deleted argument for datatypes that was a leftover from the past.
  Default argument is used instead.

Change-Id: I3f95d73f03028de46324b310826edca8057e561d
2024-01-31 07:08:25 -05:00
eashdash
ef134dc49f Added Trans A feature for all INT8 LPGEMM APIs
1. Added Trans A feature to handle column major inputs
   for A matrix.
2. Trans A is enabled by on-the-go pack of A matrix.
3. The on-the-go pack of A converts a column storage
   MCxKC block of A into row storage MCxKC block as
   LPGEMM kernels are row major kernels.
4. New pack routines are added for conversion of A matrix
   from column major storage to row major storage.
5. LPGEMM Cntx is updated with pack kernel function
   pointers.
6. Packing of A matrix:
   -  Converts column major input A to row major
      in blocks of MCxKC with newly added pack A
      functions when cs_a > 1.
7. Pack routines are added for AVX512 and AVX2
   INT8 LPGEMM APIs.
8. Trans A feature is now supported in:
   1. u8s8s32os32/os8
   2. u8s8s16os16/os8/ou8
   3. s8s8s32os32/os8
   4. s8s8s16os16/os8

AMD-Internal: SWLCSG-2582
Change-Id: I7ce331545525a9a09f3853280615b55fcf2edabf
2024-01-30 03:40:56 -05:00
Vignesh Balasubramanian
ddec0c1de0 Negative parameter testing for ?COPY, ?AXPY and ?AXPBY APIs
- As per the standard compliance, the ?copy(), ?axpy() and
  ?axpby() APIs do not require invalid input testing(IIT)
  with respect to the input parameters they receive, as part
  of BLAS and CBLAS calls.

- Thus, test-cases have been added to verify early return scenarios
  (ERS) as per the compliance. The testsuite is type-parameterized,
  since the compliance for early return cases is the same across the
  datatypes.

- Updated the conditional directives in micro-kernel(ukr) test files
  to include the new set of macros generated as part of the
  buildsystem in GTestsuite.

- Updated the conditional macro to enable the appropriate code
  section for compilation of ref_axpbyv(), based on our choice
  of reference library when building GTestsuite.
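
For context on the early return scenarios mentioned above, here is a reference-style sketch of daxpy. It follows the reference BLAS behaviour of returning immediately when n <= 0 or alpha is zero; the function name `ref_daxpy` is an assumption and this is not the GTestsuite code.

```c
#include <assert.h>

/* Reference-style daxpy sketch (y := y + alpha * x). Nothing is touched
   when n <= 0 or alpha == 0, which is what an early-return test checks. */
static void ref_daxpy(int n, double alpha, const double *x, int incx,
                      double *y, int incy)
{
    if (n <= 0 || alpha == 0.0)
        return; /* early return scenario: y must stay unchanged */
    assert(x != 0 && y != 0);
    /* BLAS convention: a negative increment walks the vector backwards. */
    int ix = incx < 0 ? (1 - n) * incx : 0;
    int iy = incy < 0 ? (1 - n) * incy : 0;
    for (int i = 0; i < n; ++i, ix += incx, iy += incy)
        y[iy] += alpha * x[ix];
}
```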

AMD-Internal: [CPUPL-4402]
Change-Id: Ibea2bc34469b008f4d4558ce359717c08b92e978
2024-01-29 06:31:18 -05:00
Kiran Varaganti
63be4c8ce4 AOCL-BLIS changed to AOCL-BLAS
AOCL-BLIS replaced with AOCL-BLAS at various places like "configure",
"CMakeLists.txt" and documentation files.

Change-Id: I75c3fbe8a1abc91828eeacb25672fd7bc905d226
2024-01-25 04:31:25 -05:00
Eleni Vlachopoulou
e4ac153a3e GTestSuite: Set macros for kernel testing depending on hardware capabilities.
- During configuration, CMake system detects if AVX2, AVX512, AVX512VNNI or AVX512BF16 is supported and sets up a macro.
- Those macros need to be used in addition to BLIS_KERNELS_ZEN* to build/run only those tests supported by a specific architecture.

Change-Id: I60adc57d3a570f7bdd6dc834e2562da6bfb52bcc
2024-01-22 08:04:12 -05:00
Shubham Sharma
c1a3dbadf1 Micro-kernel testing of DTRSM kernels
- Added unit tests for avx512 and avx2 native path
  DTRSM kernels for various values of storage, stride,
  K, alpha, ldc.

AMD-Internal: [CPUPL-4403]

Change-Id: I42b1f08aa98c73af39a6e3bd94049965e7c51ae9
2024-01-22 06:24:17 -05:00
Shubham Sharma
006b86c22f Added tests for DTRSM
- Added API tests for DTRSM.
- Added Extreme Value Test cases (EVT) for DTRSM.
  - Tests for various combinations of INFs
    and NANs in A and B matrix are added.
- Added Invalid input test cases (IIT).
  - Added tests to check for cases where inputs
    are not blas compliant.

AMD-Internal: [CPUPL-4403]

Change-Id: Id8af1f1ec65a4e5bc7abba4e86df2756bce6cd42
2024-01-22 06:23:57 -05:00
Harsh Dave
156bc734f0 Micro-kernel testing of DGEMM kernels
- Added unit tests for avx512 and avx2 native and sup path
  DGEMM kernels for various values of storage, M, N,
  K, alpha, beta, ldc.

AMD-Internal: [CPUPL-4404]
Change-Id: I33a8098b6a20b55c9f1f1bcffa6812bd792890b1
2024-01-22 05:39:45 -05:00
Arnav Sharma
823e8bfb2d Functional Testing for DDOTV, DSCALV and DASUMV
- Added unit-tests for the following kernels:
  DDOTV
	- bli_ddotv_zen_int( ... )
	- bli_ddotv_zen_int10( ... )
	- bli_ddotv_zen_int_avx512( ... )
  DSCALV
	- bli_dscalv_zen_int( ... )
	- bli_dscalv_zen_int10( ... )
	- bli_dscalv_zen_int_avx512( ... )

- Added API level unit-tests for the following cases:
	- Unit Positive Increments
	- Non-Unit Positive Increments
	- Negative Increments

- Added gtestsuite framework for (s/d/sc/dz)ASUMV.

AMD-Internal: [CPUPL-4406]
Change-Id: I086c51c563fecc7a7e67791c4c4eee8b56c5417b
2024-01-19 07:05:11 -05:00
Edward Smyth
05be482203 GTestSuite: Threshold comparison
Changes to threshold comparison:
- Use error <= threshold as measure of success rather than
  error < threshold.
- Report error compared to epsilon as well as absolute value.
- Correct typo.
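
A small sketch of why the comparison operator matters, using hypothetical helper names rather than the GTestSuite API: with `<=`, an exact result still passes when the threshold is zero, which `<` would reject.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Hypothetical check: a test passes when error <= threshold. A NaN error
   compares false and therefore fails. */
static bool within_threshold(double error, double threshold)
{
    assert(threshold >= 0.0);
    return error <= threshold;
}

/* Express the error as a multiple of machine epsilon for reporting. */
static double error_in_epsilons(double error)
{
    return error / DBL_EPSILON;
}
```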

AMD-Internal: [CPUPL-4378]
Change-Id: I58e718504ee863294dcdd6bd3cd7637de2638dbc
2024-01-19 05:05:10 -05:00
Vignesh Balasubramanian
476ae9359c Functionality testing of DAXPBYV, DAXPYV and DCOPYV APIs
- Implemented a design to allow isolation of micro-kernels to
  compare against the standard reference as part of GTestSuite.
  The design requires the kernel address to be passed as one of
  the values from the instantiator. This is further sent to the
  testing interface, which makes the call to the micro-kernel
  directly.

- The testing interface is templatized with both the datatype and
  the function-pointer type. This interface makes the direct call
  to the micro-kernel(address passed as a parameter), in addition
  to the call to the reference API.

- Added unit tests to cover the functionality testing of the following
  kernels :
  - bli_daxpbyv_zen_int10( ... ) and bli_daxpbyv_zen_int( ... )
  - bli_daxpyv_zen_int10( ... ), bli_daxpyv_zen_int10( ... ) and
    bli_daxpyv_zen_int_avx512( ... ).
  - bli_dcopyvv_zen_int( ... ).
  Further added dummy tests for bli_saxpbyv_zen_int10( ... ),
  bli_saxpbyv_zen_int( ... ) and bli_zaxpbyv_zen_int( ... ) kernels
  to verify the templatized testing interface.

- Added API level test-cases, to verify the functionality
  of DAXPY and DAXPBY APIs. These tests cover unit increments,
  negative increments(BLAS/CBLAS) and non-unit positive increments.
  Furthermore, for DAXPY an instantiator tests with sizes
  corresponding to the AOCL_DYNAMIC thresholds, since it is
  multithreaded at the framework level.

- Updated the API-level tests for ZAXPBY to allow negative increment
  testing only if GTestsuite is not configured for native BLIS
  typed interface as reference.
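
The kernel-isolation design in the first two points can be approximated in plain C with a function pointer. GTestSuite itself is templated C++, so this is only an analogy; every name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kernel signature: y := y + alpha * x, unit stride. */
typedef void (*daxpyv_ker_ft)(size_t n, double alpha,
                              const double *x, double *y);

/* Plain reference implementation. */
static void daxpyv_reference(size_t n, double alpha,
                             const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Stand-in for an optimized kernel: unrolled by two with a scalar tail. */
static void daxpyv_unrolled(size_t n, double alpha,
                            const double *x, double *y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
    }
    if (i < n)
        y[i] += alpha * x[i];
}

/* Deliberately faulty kernel that forgets the tail element. */
static void daxpyv_no_tail(size_t n, double alpha,
                           const double *x, double *y)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
    }
}

enum { MAX_N = 16 };

/* Testing interface: the kernel address arrives as a parameter and is
   called directly, then compared element-wise with the reference. */
static bool kernel_matches_reference(daxpyv_ker_ft ker, size_t n,
                                     double alpha, const double *x,
                                     const double *y0, double threshold)
{
    assert(ker != NULL && n <= MAX_N);
    double y_ker[MAX_N], y_ref[MAX_N];
    for (size_t i = 0; i < n; ++i)
        y_ker[i] = y_ref[i] = y0[i];
    ker(n, alpha, x, y_ker);
    daxpyv_reference(n, alpha, x, y_ref);
    for (size_t i = 0; i < n; ++i) {
        double err = y_ker[i] - y_ref[i];
        if (err < 0.0)
            err = -err;
        if (!(err <= threshold))
            return false;
    }
    return true;
}
```

Passing the kernel as a value keeps the harness independent of which kernel the framework would normally dispatch to.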

AMD-Internal: [CPUPL-4402]
Change-Id: I86b3b52d0737075897a9e9bc5e8d9654f75072fc
2024-01-19 01:53:34 -05:00
Edward Smyth
f93ccb0cea BLIS: zen5 cpuid and arch changes
Implement initial support for Zen5 systems:
- Detect new Zen5 AVXVNNI, AVX512VP2INTERSECT, MOVDIRI and MOVDIR64B
  instructions.
- Assume for now that Zen5 will use Zen4 code path. BLIS_ARCH_TYPE=zen5
  will therefore function as an alias for BLIS_ARCH_TYPE=zen4, but
  different hardware model will still be detected.

AMD-Internal: [CPUPL-3518]
Change-Id: I00fb413d743f152a5412ace3e740df1fd39a1600
2024-01-17 11:41:15 -05:00
mkadavil
864170f5cb Scalar value support for zero-point and scale-factor.
-As it stands, in LPGEMM, users are expected to pass an array of values
with length the same as N dimension as inputs for zero point or scale
factor. However at times, a single scalar value is used as zero point
or scale factor for the entire downscaling operation. The mandate to
pass an array requires the user to allocate extra memory and fill it
with the scalar value so as to be used in downscaling. This limitation
is lifted as part of this commit, and now scalar values can be passed
as zero point or scale factor.
-LPGEMM bench enhancements along with new input format to improve
readability as well as flexibility.
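
A sketch of how a downscale step might accept either a scalar or a per-column array, assuming a length argument of 1 or n selects between the two. The function name, parameter names, and the order `acc * scale + zero_point` are assumptions for illustration, not the LPGEMM interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical downscale step: out[j] = acc[j] * scale + zero_point,
   saturated to int8. sf_len and zp_len are either 1 (one scalar applied
   to every column) or n (one value per column). */
static void downscale_row(int8_t *out, const int32_t *acc, size_t n,
                          const float *scale, size_t sf_len,
                          const int8_t *zero_point, size_t zp_len)
{
    assert(sf_len == 1 || sf_len == n);
    assert(zp_len == 1 || zp_len == n);
    for (size_t j = 0; j < n; ++j) {
        float sf = scale[sf_len == 1 ? 0 : j];
        float zp = (float)zero_point[zp_len == 1 ? 0 : j];
        float v = (float)acc[j] * sf + zp;
        /* Clamp in floating point first so the cast below cannot overflow. */
        if (v > 127.0f)
            v = 127.0f;
        if (v < -128.0f)
            v = -128.0f;
        /* Round half away from zero, then clamp the rounded value. */
        int32_t r = (int32_t)(v < 0.0f ? v - 0.5f : v + 0.5f);
        if (r > INT8_MAX)
            r = INT8_MAX;
        if (r < INT8_MIN)
            r = INT8_MIN;
        out[j] = (int8_t)r;
    }
}
```

With the scalar form the caller no longer needs to allocate and fill an n-element array with a repeated value.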

AMD-Internal: [SWLCSG-2581]
Change-Id: Ibd0d89f03e1acadd099382dffcabfec324ceb50f
2024-01-12 04:37:35 +05:30
Eleni Vlachopoulou
29e4ce644d CMake: Removing blatest-related targets for Windows/shared libs.
Due to the way dlls resolve internal symbols, calling custom function
xerbla_() is not possible. Because of that the targets result in errors
which are independent of BLIS library.
Testing Windows/static version will suffice.

To enable make check and make test targets, we throw warnings for the
Windows/shared cases and not depend on checkblas.

AMD-Internal: [CPUPL-2748]
Change-Id: Iaa93399dec5781277ee94611074f5ed4e70bcb37
2024-01-10 05:06:41 -05:00
jagar
1821c2142b CMake:Fix in testsuite cmake to work for static-st on linux
CMakeLists.txt is updated in blis/testsuite to make it work for
static single thread version of BLIS.

AMD-Internal: [CPUPL-2748]
Change-Id: I004e19d4ddbf9cb94d6d23699893a2f684a3fb35
2024-01-09 09:13:03 -05:00
Meghana Vankadari
6567df7b12 bf16bf16f32o<bf16|f32> Fix for scaling issue when transA is enabled.
Details:
- LPGEMM uses bli_pba_acquire_m with BLIS_BUFFER_FOR_A_BLOCK to checkout
  memory when A matrix needs to be packed. This multi-threaded lock
  overhead becomes prominent when m/n dimensions are relatively small,
  even when k is large. In order to address this, bli_pba_acquire_m
  is used with BLIS_BUFFER_FOR_GEN_USE for LPGEMM. For *GEN_USE,
  the memory is allocated using aligned malloc instead of checking
  out from memory pool. Experiments have shown malloc costs to be
  far lower than memory pool guarded by locks, especially for higher
  thread count.
- Deleted few unnecessary instructions from packing kernels.
- Replaced bench_input.txt with lesser number of inputs.

AMD-Internal: [CPUPL-4329]
Change-Id: I5982a0a4df9dc72fab0cffab795c23822d5c8774
2023-12-21 04:53:32 +05:30
Edward Smyth
98006ea422 fatal error: malloc.h: No such file or directory #785
Replace include of non-standard header malloc.h by
stdlib.h to fix issue reported on upstream BLIS
github.

https://github.com/flame/blis/issues/785

AMD-Internal: [CPUPL-4307]
Change-Id: I4ac5cb3164fe7050bba6579b08cc2d3ff412ccba
2023-12-13 04:05:07 -05:00
mkadavil
89bb999afa Aligning rntm_t struct to 64 byte address to address performance issues.
Non aligned rntm_t struct can potentially have its first/last cache
line shared with other objects in memory. This could affect performance
depending on how much the shared cache lines are used. rntm_t struct is
aligned to 64 bytes to workaround this issue.
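
One way to request this alignment in C11 is shown below, assuming a 64-byte cache line. `demo_rntm_t` is a made-up stand-in; the real rntm_t has different fields, and BLIS may use its own alignment macro instead of `alignas`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical stand-in for a runtime struct. alignas(64) on the first
   member raises the alignment of the whole struct to 64 bytes, and
   sizeof is padded to a multiple of 64, so an instance does not share
   its first or last cache line with a neighbouring object. */
typedef struct {
    alignas(64) int num_threads;
    int thread_ways[5];
} demo_rntm_t;

static_assert(alignof(demo_rntm_t) == 64, "struct must be 64-byte aligned");

/* Returns nonzero when p sits on a 64-byte boundary. */
static int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p % 64u) == 0;
}
```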

Change-Id: Id0956fca771be062ada9f81e8cd75ac1f290fd8e
2023-12-12 03:39:04 -05:00
jagar
4c77ef4953 GTestsuite:Search library in user specified-path
In Gtestsuite CMakeLists.txt, find_library() will search
user-mentioned library in default system paths first then
in user specified paths. To avoid this CMake is updated
to search the user mentioned library in user specified
path and ignore searching in default path.

AMD-Internal: [CPUPL-4284]
Change-Id: Ia99cf59eb39deac4110d3d733f17548d432dde64
2023-12-11 00:20:48 -05:00
Vignesh Balasubramanian
8693c996ac Fixing coverity issues on SNRM2_ and SCNRM2_
- The bli_snormfv_unb_var1( ... ) and bli_cnormfv_unb_var1( ... )
  functions posed an uninitialized pointer read coverity issue,
  due to the local rntm_t object being declared as part of the
  function scope, but initialized only on a need basis(i.e, when
  attempting to pack x vector if incx != 1).

- The fix was to have the declaration and initialization inside
  the case where incx != 1, thereby making the scope of the rntm_t
  and mem_t objects more stringent.

- This required an additional condition to call the kernel in case
  of unit stride.
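
The scoping change can be sketched as follows. A sum of squares stands in for the norm kernel and `malloc` stands in for the rntm_t/mem_t packing machinery; all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a unit-stride vector kernel: returns the sum of squares. */
static double sumsq_unit_stride(size_t n, const double *x)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * x[i];
    return s;
}

/* Hypothetical framework wrapper. The packing buffer is declared and
   initialized only inside the incx != 1 path, so no code path can read
   it uninitialized; the unit-stride case calls the kernel directly. */
static double sumsq(size_t n, const double *x, size_t incx)
{
    assert(x != NULL && incx >= 1);
    if (incx == 1)
        return sumsq_unit_stride(n, x);

    double *packed = malloc(n * sizeof *packed);
    if (packed == NULL) {
        /* Fall back to a strided loop when no packing memory is available. */
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i * incx] * x[i * incx];
        return s;
    }
    for (size_t i = 0; i < n; ++i)
        packed[i] = x[i * incx];
    double result = sumsq_unit_stride(n, packed);
    free(packed);
    return result;
}
```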

AMD-Internal: [CPUPL-4278]
Change-Id: I763b1d4920532557749d8943f12b6df626aa5372
2023-12-06 23:56:09 +05:30
Shubham Sharma
054d4fde82 Changed threshold for ZTRSM small code path
- Changed the threshold for using ZTRSM small
  code path when multithreading is enabled.
- Very skinny matrices are not taken into
  consideration in existing threshold tuning.

AMD-Internal: [CPUPL-4267]
Change-Id: I4294ec58a8535af7a9d618ae8f0d86407b66f341
2023-12-01 02:05:28 -05:00
Edward Smyth
48444d4316 cblas.h: Correct order of including other header files
Include bli_config.h before bli_system.h in
./frame/compat/cblas/src/cblas.h so that BLIS_ENABLE_SYSTEM is
defined correctly before it is needed. This copies the change
to ./frame/include/blis.h made in 1f527a93b9 (via merge
c6f3340125). Also standardize some comments and formatting
between blis.h and cblas.h

AMD-Internal: [CPUPL-4251]
Change-Id: Ie5cab646367f15003c25fa126344b02640d9106e
2023-11-24 11:46:17 -05:00
mangala v
70343cba5b Gtestsuite: Updated sgemm testcase for sup
Updated sgemm testcase to handle multiple values of alpha, beta for different input sizes.
Added sgemm testcase to cover m, n, k dimensions up to at least size 20 in steps of 1.

Change-Id: Id10ba3d7a05154b171511ef11ea76297494672cd
2023-11-24 10:42:22 -05:00
Eleni Vlachopoulou
79ad303902 GTestSuite: Clean-up on build system.
- and a small bugfix so that it works again on Windows.

Change-Id: I986b81d74d0f00c55eee497712aed5b268211d5f
2023-11-24 04:59:36 -05:00
Bhaskar Nallani
21d6ab6a21 Improved thread balancing for aocl_gemm f32 API
Description:
1. Updated the thread partition logic for aocl_gemm_f32f32f32of32
   for m<MR, n<NR cases and also balanced threads in the m, n directions
   such that each thread gets an equal amount of work and no thread is
   spawned without any work.
2. Disabled dynamic enabling of packing of a and b matrices for
   smaller sizes for genoa architecture.
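
One possible way to balance threads across the m and n directions is sketched below. It is a simple exhaustive search under assumed names and is not the partition logic used by aocl_gemm_f32f32f32of32.

```c
#include <assert.h>
#include <stddef.h>

static size_t ceil_div(size_t a, size_t b)
{
    return (a + b - 1) / b;
}

/* Hypothetical partitioner: split at most nt threads into a tm x tn grid
   over the MR x NR micro-tiles of an m x n output. tm never exceeds the
   number of row tiles and tn never exceeds the number of column tiles,
   so every thread in the grid can be given at least one tile. Among the
   valid grids, the one with the smallest per-thread tile count wins. */
static void partition_threads(size_t nt, size_t m, size_t n,
                              size_t mr, size_t nr,
                              size_t *tm, size_t *tn)
{
    assert(nt >= 1 && m >= 1 && n >= 1 && mr >= 1 && nr >= 1);
    size_t m_tiles = ceil_div(m, mr);
    size_t n_tiles = ceil_div(n, nr);
    size_t best_work = m_tiles * n_tiles;
    *tm = 1;
    *tn = 1;
    for (size_t cand_m = 1; cand_m <= nt && cand_m <= m_tiles; ++cand_m) {
        size_t cand_n = nt / cand_m;
        if (cand_n > n_tiles)
            cand_n = n_tiles;
        size_t work = ceil_div(m_tiles, cand_m) * ceil_div(n_tiles, cand_n);
        if (work < best_work) {
            best_work = work;
            *tm = cand_m;
            *tn = cand_n;
        }
    }
}
```

For m < MR there is a single row tile, so all threads end up in the n direction, and the reverse holds for n < NR.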

AMD-Internal: [SWLCSG-2353 , SWLCSG-2391]
Change-Id: I03b2c50e592c2e9d336ea84c0e0394af63a34cec
2023-11-24 03:45:44 -05:00
mkadavil
2676ac8249 LPGEMM s32 micro-kernel updates to fix gcc10.2 compilation issue.
Some AVX512 intrinsics(eg: _mm_loadu_epi8) were introduced in later
versions of gcc (11+) in addition to already existing masked intrinsic
(eg: _mm_mask_loadu_epi8). In order to support compilation using gcc
10.2, either the masked intrinsic or other gcc 10.2 compatible intrinsic
needs to be used (eg: _mm_loadu_si128) in LPGEMM <u|s>8s8os32 kernels.

AMD-Internal: [SWLCSG-2542]
Change-Id: I6cfedfdcb28711b19df63d162ab267f5eea8d2ef
2023-11-24 01:58:58 -05:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Eleni Vlachopoulou
52fb555ea2 CMake: Improving how CMake system handles targets.
- Instead of putting the built libraries in  blis/bin directory, build them in the chosen build-cmake directory.
- Install headers in <prefix>/include instead of <prefix>/include/blis.
- Fix on some targets to match configure/make system.
- Update documentation.

AMD-Internal: [CPUPL-2748]
Change-Id: I15553948209345dbee350e89965b6a3c72a4e340
2023-11-23 16:43:03 +05:30
Edward Smyth
50608f28df BLIS: Missing clobbers (batch 7)
Add missing clobbers in:
- bli_gemmsup_rv_haswell kernels
- spare copies of kernels in old, other and broken subdirectories
- misc kernels for legacy platforms

AMD-Internal: [CPUPL-3521]
Change-Id: I7cdb7fd1cb29630d8b7fa914b1002a270dfe9ef5
2023-11-22 17:51:46 -05:00
Edward Smyth
f471615c66 Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
2023-11-22 17:11:10 -05:00
Edward Smyth
dc41fa3829 User selection of code path in single architecture builds
User control over code path using AOCL_ENABLE_INSTRUCTIONS
or BLIS_ARCH_TYPE only makes sense for fat binary builds.
Thus this functionality is now disabled by default for
single architecture builds. User can still override the default
selections by using configure options --enable-blis-arch-type
or --disable-blis-arch-type.

Other changes:
- include x86_64 family as using zen codepaths in cmake build system.
- Update help and error messages to include AOCL_ENABLE_INSTRUCTIONS.

AMD-Internal: [CPUPL-4202]
Change-Id: I7aa5fcf89df8675bcc12d81f81781de647e0fcf8
2023-11-22 10:48:44 -05:00
mangala v
e0df20806a Updated prefetching in SGEMM SUP (mask load/store) kernels
1. Prefetch only MR rows or rows required for fringe cases
2. Specify prefetching offset - the least column address supported
   by masked functions
3. Removed unnecessary prefetches in fringe case for mx4 kernels

Updated gtestsuite for sgemm calls

AMD_Internal: [CPUPL-4221]
Change-Id: I1e2e7d3ebce37dc54a2f0a5c1c70ce0a6d4c8d6c
2023-11-21 06:31:47 -05:00
Harsh Dave
e91d23ff05 Re-implements ddotv edge kernel using masked instructions
- This commit uses avx2 and avx512  masked load instructions
for handling edge case where vector size is not exact multiple
of avx2/avx512 vector register size.

- Thanks to Shubham, Sharma <shubham.sharma3@amd.com> for
avx512 ddotv kernel changes

Change-Id: I998651eeb1083caf3308f1b45bd7d55b7974bcb4
2023-11-21 02:25:00 -05:00
Harsh Dave
c6ed490907 Fixed functionality failure in c/z trsm framework code.
- For the inputs where either m or n is 1, based on right or
left side, it invokes c/z scalv kernel and post that it scales the
matrix post checking whether the input is blis conjugate transpose
or not.

- Previously the check condition was case sensitive *diaga = 'n',
and as a result, it is always executing the "else" code-part.

- Fixed the condition check.

AMD-Internal: [CPUPL-4204]
Change-Id: Iae2514c742ab17ac6c6e43036da095a74ad131c5
2023-11-19 09:58:46 -05:00
mangala v
3256a7b074 BugFix: Re-Designed SGEMM SUP kernel to use mask load/store instruction
Segfault was reported through nightly jenkins job.
Issue was observed when running in MT mode.

Issue was due to extra broadcast being used.
Extra broadcast would access out of bound memory on input buffer

Cleaned up clobber list by removing unused registers.

AMD_Internal: [CPUPL-4180]

Change-Id: I1c8715b2850ef855328f2ef12f215987299bdb2b
2023-11-17 18:14:34 +05:30
Edward Smyth
6e020ecc01 Include bli_lang_defs.h in cblas.h
Changes in commit 64a1f786d5 (via merge c6f3340125) included in
./frame/include/bli_type_defs.h a prototype that uses the C
restrict keyword. When using C++ we need to provide a definition
for this C language keyword. This is done in bli_lang_defs.h which
was included in blis.h but not in cblas.h.
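
A language-definitions header can make `restrict` usable from C++ roughly as sketched below. This is an assumption about the general approach, not a copy of bli_lang_defs.h, and `scale_copy` is a made-up prototype.

```c
#include <assert.h>

/* Sketch of a language-definitions header: C++ has no restrict keyword,
   so let it expand to nothing there. In C99 and later it is a keyword
   and needs no definition. */
#ifdef __cplusplus
  #ifndef restrict
  #define restrict
  #endif
#endif

/* A prototype that uses restrict now parses under both C and C++. */
static void scale_copy(int n, double alpha,
                       const double *restrict x, double *restrict y)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i];
}
```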

AMD-Internal: [CPUPL-4188]
Change-Id: I75d5f32599d18794331ff452e562eb42afb5ae93
2023-11-14 10:52:35 -05:00
Eleni Vlachopoulou
9c9bc20c9e CMake: Adding more dependencies in the generation of blis.h and cblas.h
- In case the build directory doesn't get cleaned between different configurations this should re-generate the headers correctly.

AMD-Internal: [CPUPL-2748]
Change-Id: I57cd03a9ae87d8ddfee64fe8b1a1ee9ea1b7ad3c
2023-11-10 15:38:16 -05:00
Edward Smyth
c6f3340125 Merge commit '5013a6cb' into amd-main
* commit '5013a6cb':
  More edits and fixes to docs/FAQ.md.
  Fixed newly broken link to CREDITS in FAQ.md.
  More minor fixes to FAQ.md and Sandboxes.md.
  Updates to FAQ.md, Sandboxes.md, and README.md.
  Safelist 'master', 'dev', 'amd' branches.
  Re-enable and fix fb93d24.
  Reverted fb93d24.
  Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
  Removed last vestige of #define BLIS_NUM_ARCHS.
  Added new packm var3 to 'gemmlike'.
  Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
  Fix more copy-paste errors in the haswell gemmsup code.
  Do a fast test on OSX. [ci skip]
  Fix AArch64 tests and consolidate some other tests.
  Use C++ cross-compiler for ARM tests.
  Attempt to fix cxx-test for OOT builds.
  Updated travis-ci.org link in README.md to .com.
  Disabled (at least temporarily) commit 8e0c425.
  Define BLIS_OS_NONE when using --disable-system.
  Updated stale calls to malloc_intl() in gemmlike.
  Blacklist clang10/gcc9 and older for 'armsve'.
  Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.
  Moved lang defs from _macro_def.h to _lang_defs.h.
  Minor tweaks to gemmlike sandbox.
  Added local _check() code to gemmlike sandbox.
  README.md citation updates (e.g. BLIS7 bibtex).
  Tweaks to gemmlike to facilitate 3rd party mods.
  Whitespace tweaks.
  Add row- and column-strides for A/B in obj_ukr_fn_t.
  Clean up some warnings that show up on clang/OSX.
  Remove schema field on obj_t (redundant) and add new API functions.
  Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.
  Disabled sanity check in bli_pool_finalize().
  Implement proposed new function pointer fields for obj_t.

AMD-Internal: [CPUPL-2698]
Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d
2023-11-10 13:05:12 -05:00
mangala v
3136e57a39 Fixed memory leak issue reported by ASAN in testsuite.
Memory allocated for pointer chars_for_dt was not freed at the
end of function in testsuite.

Freeing up of the buffer fixed the issue.

AMD-Internal: [CPUPL-3932]
Change-Id: I432c3ff95d289159f02a871b6d4fff5ab252ea9e
2023-11-10 12:11:55 -05:00
Eleni Vlachopoulou
acdaa91786 Revert "CMake : placing include folder under blis directory."
This reverts commit 59f1333883.

Reason for revert: Potentially breaking CI

Change-Id: I65a92a96896091cb92cc534d5b458070524ab75a
2023-11-10 10:37:56 -05:00
jagar
59f1333883 CMake : placing include folder under blis directory.
Updating cmake files to place include folder under
blis directory in new cmake system on windows.

AMD-Internal: [CPUPL-2748]
Change-Id: I650cca95193f7c89b39648ac1bda1fa1093b1560
2023-11-10 04:14:23 -04:00
Mangala V
f6046784ce Re-Designed SGEMM SUP kernel to use mask load/store instruction
Added all fringe kernels with mask load store support
Fringe kernels cover m direction from 5 to 1 and
n direction from 15 to 1 for row storage format

- New edge kernels that use masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmaskmovps, VMASKMOVPS for masked load-store.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

- Mask load-store is added only for row storage format

AMD-Internal: [CPUPL-4041]

Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456
2023-11-10 01:23:48 -05:00
Harsh Dave
77161c1e5d Design change of DGEMM 6x8 native kernel.
- Following optimizations are included for dgemm 6x8 native kernel.
1) Reorganized the C update and store to reduce register dependencies.
2) moved the C prefetch to part-way through the kernel for efficiently
prefetching C matrix at appropriate distance.
3) Offsetting A matrix, so that kernel can use a smaller instruction
encoding, saving i-cache space.
4) Aligned the K iteration loop.

- Thanks to Moore, Branden <Branden.Moore@amd.com> for these design
  changes of DGEMM 6x8 native kernels.

- Additional change, reorganization of C update and store for
beta zero case to facilitate out of order execution of storing of C
matrix.


Change-Id: I9d1ec8d39f1154b0f38b136bd6a04b05d7d1e6ba
2023-11-09 23:07:43 -05:00
Meghana Vankadari
77bd9a7f17 Added parameter checking for LPGEMM APIs
Change-Id: I6ea89fd0d2516539e5a4e9cd8537570b23194d89
2023-11-09 21:50:55 -05:00
Vignesh Balasubramanian
bd0b50a077 Introduced fast-path to kernels in DNRM2_ and DZNRM2_ APIs
- Added a conditional check to see if the vectorized kernels
  for DNRM2_ and DZNRM2_ can be called directly, without
  incurring any framework overhead.

- The condition to satisfy this fast-path is for the size to be
  such that the ideal threads required is 1, with the vector having
  unit stride( so that packing at the framework-level can be avoided ).
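
The fast-path condition can be sketched as a small predicate. The threshold constant and the way the ideal thread count is derived are purely illustrative assumptions; the real values come from AOCL tuning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed tuning constant, purely illustrative: sizes below this are
   taken to need a single thread. */
enum { SINGLE_THREAD_MAX_N = 4000 };

/* Hypothetical estimate of how many threads a vector of length n merits. */
static size_t ideal_thread_count(size_t n, size_t max_threads)
{
    assert(max_threads >= 1);
    size_t want = n / SINGLE_THREAD_MAX_N + 1;
    return want < max_threads ? want : max_threads;
}

/* Take the direct kernel call only for unit stride and one ideal thread,
   so that neither packing nor threading setup is needed. */
static bool use_nrm2_fast_path(size_t n, ptrdiff_t incx, size_t max_threads)
{
    return incx == 1 && ideal_thread_count(n, max_threads) == 1;
}
```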

AMD-Internal: [CPUPL-4045]
Change-Id: Ie37e86f802ada0e226dff88e74f0341e97ebfe28
2023-11-09 21:13:10 +05:30
Eashan Dash
e4e4fe55fb Added Parameter Checks and DTL Trace for Extension APIs
1. Added input parameter checking for the extension APIs
   1. gemm_pack_get_size API
   2. gemm_pack API

2. Additionally added early returns for these APIs when
   m or n dimensions are 0.

3. Routines for input parameter check for all the 3
   BLAS extension APIs - gemm_pack_get_size, gemm_pack and
   gemm_compute are defined in:
   frame/compat/check/bla_gemm_pack_compute_check.h

4. Added AOCL DTL TRACE for all the functions of
   1. gemm_pack_get_size
   2. gemm_pack
   3. gemm_compute
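
The combination of parameter checks and early returns might be structured as below. The function names, the error-code convention, and the size formula are assumptions for illustration and do not reproduce bla_gemm_pack_compute_check.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: 0 means valid, otherwise the 1-based position of
   the first offending argument (in the style of the BLAS xerbla info). */
static int check_pack_get_size(char identifier, long m, long n, long k)
{
    if (identifier != 'A' && identifier != 'a' &&
        identifier != 'B' && identifier != 'b')
        return 1;
    if (m < 0)
        return 2;
    if (n < 0)
        return 3;
    if (k < 0)
        return 4;
    return 0;
}

/* Returns the bytes needed for a packed buffer, or 0 on invalid input or
   when m or n is 0 (early return). The size formula is a placeholder
   without the padding a real library would add. */
static size_t pack_get_size(char identifier, long m, long n, long k)
{
    if (check_pack_get_size(identifier, m, n, k) != 0)
        return 0;
    if (m == 0 || n == 0)
        return 0;
    long rows = (identifier == 'A' || identifier == 'a') ? m : n;
    assert(rows > 0 && k >= 0);
    return (size_t)rows * (size_t)k * sizeof(double);
}
```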

AMD-Internal: [CPUPL-3560]
Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631
2023-11-09 18:53:59 +05:30