Commit Graph

3151 Commits

Author SHA1 Message Date
Edward Smyth
98006ea422 fatal error: malloc.h: No such file or directory #785
Replace include of non-standard header malloc.h with
stdlib.h to fix issue reported on upstream BLIS
GitHub.

https://github.com/flame/blis/issues/785

AMD-Internal: [CPUPL-4307]
Change-Id: I4ac5cb3164fe7050bba6579b08cc2d3ff412ccba
2023-12-13 04:05:07 -05:00
mkadavil
89bb999afa Aligning rntm_t struct to 64 byte address to address performance issues.
A non-aligned rntm_t struct can potentially have its first/last cache
line shared with other objects in memory. This could affect performance
depending on how much the shared cache lines are used. The rntm_t struct is
aligned to 64 bytes to work around this issue.

Change-Id: Id0956fca771be062ada9f81e8cd75ac1f290fd8e
2023-12-12 03:39:04 -05:00
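As a rough illustration of the alignment described in 89bb999afa, the C11 sketch below aligns a struct to 64 bytes so that it starts on a cache-line boundary and its size is padded to a multiple of 64. The type demo_rntm_t, its fields, and the helper demo_is_cache_aligned are hypothetical stand-ins, not the actual rntm_t definition; the 64-byte figure is the one named in the commit message.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a runtime struct; the fields are illustrative only. */
typedef struct
{
    alignas(64) int num_threads; /* aligning the first member aligns the whole struct */
    int ways[5];
    int pack_a;
    int pack_b;
} demo_rntm_t;

/* Returns 1 if the object starts on a 64-byte boundary, 0 otherwise. */
static int demo_is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}
```

Because the struct's alignment is 64, the compiler also pads its size to a multiple of 64, so neither its first nor its last cache line is shared with a neighbouring object.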
jagar
4c77ef4953 GTestsuite:Search library in user specified-path
In the Gtestsuite CMakeLists.txt, find_library() searches for the
user-specified library in the default system paths first and then
in the user-specified paths. To avoid this, CMake is updated
to search for the library only in the user-specified
path and to skip the default paths.

AMD-Internal: [CPUPL-4284]
Change-Id: Ia99cf59eb39deac4110d3d733f17548d432dde64
2023-12-11 00:20:48 -05:00
Vignesh Balasubramanian
8693c996ac Fixing coverity issues on SNRM2_ and SCNRM2_
- The bli_snormfv_unb_var1( ... ) and bli_cnormfv_unb_var1( ... )
  functions posed an uninitialized pointer read coverity issue,
  due to the local rntm_t object being declared as part of the
  function scope, but initialized only on an as-needed basis (i.e., when
  attempting to pack x vector if incx != 1).

- The fix was to have the declaration and initialization inside
  the case where incx != 1, thereby narrowing the scope of the rntm_t
  and mem_t objects.

- This required an additional condition to call the kernel in case
  of unit stride.

AMD-Internal: [CPUPL-4278]
Change-Id: I763b1d4920532557749d8943f12b6df626aa5372
2023-12-06 23:56:09 +05:30
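The scoping fix described in 8693c996ac can be sketched roughly as follows. This is a simplified illustration, not the BLIS code: demo_sumsq and demo_sumsq_kernel are hypothetical names, a plain malloc buffer stands in for the rntm_t/mem_t packing machinery, the stride is assumed positive, and the function returns a sum of squares rather than a norm to keep the example short.

```c
#include <stdlib.h>

/* Hypothetical unit-stride kernel: plain sum of squares. */
static float demo_sumsq_kernel(size_t n, const float *x)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Hypothetical wrapper: the pack buffer exists only in the incx != 1 path. */
static float demo_sumsq(size_t n, const float *x, size_t incx)
{
    if (n == 0)
        return 0.0f;

    if (incx == 1)
        return demo_sumsq_kernel(n, x); /* unit stride: call the kernel directly */

    /* Non-unit stride: declare and initialize the buffer where it is used,
       so no code path can read it before it has been set. */
    float *packed = malloc(n * sizeof *packed);
    if (packed == NULL)
        return -1.0f; /* illustrative error signal only */

    for (size_t i = 0; i < n; ++i)
        packed[i] = x[i * incx];

    float result = demo_sumsq_kernel(n, packed);
    free(packed);
    return result;
}
```

Keeping the declaration inside the non-unit-stride path is what requires the separate kernel call for the unit-stride case.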
Shubham Sharma
054d4fde82 Changed threshold for ZTRSM small code path
- Changed the threshold for using ZTRSM small
  code path when multithreading is enabled.
- Very skinny matrices are not taken into
  consideration in existing threshold tuning.

AMD-Internal: [CPUPL-4267]
Change-Id: I4294ec58a8535af7a9d618ae8f0d86407b66f341
2023-12-01 02:05:28 -05:00
Edward Smyth
48444d4316 cblas.h: Correct order of including other header files
Include bli_config.h before bli_system.h in
./frame/compat/cblas/src/cblas.h so that BLIS_ENABLE_SYSTEM is
defined correctly before it is needed. This copies the change
to ./frame/include/blis.h made in 1f527a93b9 (via merge
c6f3340125). Also standardize some comments and formatting
between blis.h and cblas.h

AMD-Internal: [CPUPL-4251]
Change-Id: Ie5cab646367f15003c25fa126344b02640d9106e
2023-11-24 11:46:17 -05:00
mangala v
70343cba5b Gtestsuite: Updated sgemm testcase for sup
Updated sgemm testcase to handle multiple values of alpha and beta for different input sizes.
Added sgemm testcases covering m, n, k dimensions up to at least 20 in steps of 1.

Change-Id: Id10ba3d7a05154b171511ef11ea76297494672cd
2023-11-24 10:42:22 -05:00
Eleni Vlachopoulou
79ad303902 GTestSuite: Clean-up on build system.
- and a small bugfix so that it works again on Windows.

Change-Id: I986b81d74d0f00c55eee497712aed5b268211d5f
2023-11-24 04:59:36 -05:00
Bhaskar Nallani
21d6ab6a21 Improved thread balancing for aocl_gemm f32 API
Description:
1. Updated the thread partition logic for aocl_gemm_f32f32f32of32
   for m<MR, n<NR cases and also balanced threads in the m, n directions
   such that each thread gets an equal amount of work and no thread is
   spawned without any work.
2. Disabled dynamic enabling of packing of a and b matrices for
   smaller sizes for genoa architecture.

AMD-Internal: [SWLCSG-2353 , SWLCSG-2391]
Change-Id: I03b2c50e592c2e9d336ea84c0e0394af63a34cec
2023-11-24 03:45:44 -05:00
mkadavil
2676ac8249 LPGEMM s32 micro-kernel updates to fix gcc10.2 compilation issue.
Some AVX512 intrinsics(eg: _mm_loadu_epi8) were introduced in later
versions of gcc (11+) in addition to already existing masked intrinsic
(eg: _mm_mask_loadu_epi8). In order to support compilation using gcc
10.2, either the masked intrinsic or other gcc 10.2 compatible intrinsic
needs to be used (eg: _mm_loadu_si128) in LPGEMM <u|s>8s8os32 kernels.

AMD-Internal: [SWLCSG-2542]
Change-Id: I6cfedfdcb28711b19df63d162ab267f5eea8d2ef
2023-11-24 01:58:58 -05:00
Edward Smyth
ed5010d65b Code cleanup: AMD copyright notice
Standardize format of AMD copyright notice.

AMD-Internal: [CPUPL-3519]
Change-Id: I98530e58138765e5cd5bc0c97500506801eb0bf0
2023-11-23 08:54:31 -05:00
Eleni Vlachopoulou
52fb555ea2 CMake: Improving how CMake system handles targets.
- Instead of putting the built libraries in blis/bin directory, build them in the chosen build-cmake directory.
- Install headers in <prefix>/include instead of <prefix>/include/blis.
- Fix on some targets to match configure/make system.
- Update documentation.

AMD-Internal: [CPUPL-2748]
Change-Id: I15553948209345dbee350e89965b6a3c72a4e340
2023-11-23 16:43:03 +05:30
Edward Smyth
50608f28df BLIS: Missing clobbers (batch 7)
Add missing clobbers in:
- bli_gemmsup_rv_haswell kernels
- spare copies of kernels in old, other and broken subdirectories
- misc kernels for legacy platforms

AMD-Internal: [CPUPL-3521]
Change-Id: I7cdb7fd1cb29630d8b7fa914b1002a270dfe9ef5
2023-11-22 17:51:46 -05:00
Edward Smyth
f471615c66 Code cleanup: No newline at end of file
Some text files were missing a newline at the end of the file.
One has been added.

AMD-Internal: [CPUPL-3519]
Change-Id: I4b00876b1230b036723d6b56755c6ca844a7ffce
2023-11-22 17:11:10 -05:00
Edward Smyth
dc41fa3829 User selection of code path in single architecture builds
User control over code path using AOCL_ENABLE_INSTRUCTIONS
or BLIS_ARCH_TYPE only makes sense for fat binary builds.
Thus this functionality is now disabled by default for
single architecture builds. User can still override the default
selections by using configure options --enable-blis-arch-type
or --disable-blis-arch-type.

Other changes:
- include x86_64 family as using zen codepaths in cmake build system.
- Update help and error messages to include AOCL_ENABLE_INSTRUCTIONS.

AMD-Internal: [CPUPL-4202]
Change-Id: I7aa5fcf89df8675bcc12d81f81781de647e0fcf8
2023-11-22 10:48:44 -05:00
mangala v
e0df20806a Updated prefetching in SGEMM SUP (mask load/store) kernels
1. Prefetch only MR rows or rows required for fringe cases
2. Specify prefetching offset - the least column address supported
   by masked functions
3. Removed unnecessary prefetches in fringe case for mx4 kernels

Updated gtestsuite for sgemm calls

AMD-Internal: [CPUPL-4221]
Change-Id: I1e2e7d3ebce37dc54a2f0a5c1c70ce0a6d4c8d6c
2023-11-21 06:31:47 -05:00
Harsh Dave
e91d23ff05 Re-implements ddotv edge kernel using masked instructions
- This commit uses avx2 and avx512 masked load instructions
for handling the edge case where the vector size is not an exact multiple
of the avx2/avx512 vector register size.

- Thanks to Shubham, Sharma <shubham.sharma3@amd.com> for
avx512 ddotv kernel changes

Change-Id: I998651eeb1083caf3308f1b45bd7d55b7974bcb4
2023-11-21 02:25:00 -05:00
Harsh Dave
c6ed490907 Fixed functionality failure in c/z trsm framework code.
- For inputs where either m or n is 1, depending on whether the
operation is right- or left-sided, the code invokes the c/z scalv kernel
and then scales the matrix after checking whether the input is a blis
conjugate transpose or not.

- Previously the check condition was case sensitive (*diaga = 'n'),
and as a result the "else" code path was always executed.

- Fixed the condition check.

AMD-Internal: [CPUPL-4204]
Change-Id: Iae2514c742ab17ac6c6e43036da095a74ad131c5
2023-11-19 09:58:46 -05:00
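The kind of case-insensitive check that c6ed490907 refers to could look roughly like the sketch below. The helper demo_flag_is is hypothetical and is not the condition used in the BLIS trsm code; it only shows how a BLAS-style character argument can be compared so that 'n' and 'N' are treated alike.

```c
#include <ctype.h>

/* Returns 1 when the character flag matches `expected`, ignoring case. */
static int demo_flag_is(const char *flag, char expected)
{
    /* Cast to unsigned char first: passing a negative char to tolower
       is undefined behaviour. */
    return tolower((unsigned char)*flag) == tolower((unsigned char)expected);
}
```

A caller would then write demo_flag_is(diaga, 'n') instead of comparing *diaga against a single literal.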
mangala v
3256a7b074 BugFix: Re-Designed SGEMM SUP kernel to use mask load/store instruction
Segfault was reported through nightly jenkins job.
Issue was observed when running in MT mode.

Issue was due to extra broadcast being used.
Extra broadcast would access out of bound memory on input buffer

Cleaned up clobber list by removing unused registers.

AMD-Internal: [CPUPL-4180]

Change-Id: I1c8715b2850ef855328f2ef12f215987299bdb2b
2023-11-17 18:14:34 +05:30
Edward Smyth
6e020ecc01 Include bli_lang_defs.h in cblas.h
Changes in commit 64a1f786d5 (via merge c6f3340125) included in
./frame/include/bli_type_defs.h a prototype that uses the C
restrict keyword. When using C++ we need to provide a definition
for this C language keyword. This is done in bli_lang_defs.h which
was included in blis.h but not in cblas.h.

AMD-Internal: [CPUPL-4188]
Change-Id: I75d5f32599d18794331ff452e562eb42afb5ae93
2023-11-14 10:52:35 -05:00
Eleni Vlachopoulou
9c9bc20c9e CMake: Adding more dependencies in the generation of blis.h and cblas.h
- In case the build directory doesn't get cleaned between different configurations this should re-generate the headers correctly.

AMD-Internal: [CPUPL-2748]
Change-Id: I57cd03a9ae87d8ddfee64fe8b1a1ee9ea1b7ad3c
2023-11-10 15:38:16 -05:00
Edward Smyth
c6f3340125 Merge commit '5013a6cb' into amd-main
* commit '5013a6cb':
  More edits and fixes to docs/FAQ.md.
  Fixed newly broken link to CREDITS in FAQ.md.
  More minor fixes to FAQ.md and Sandboxes.md.
  Updates to FAQ.md, Sandboxes.md, and README.md.
  Safelist 'master', 'dev', 'amd' branches.
  Re-enable and fix fb93d24.
  Reverted fb93d24.
  Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).
  Removed last vestige of #define BLIS_NUM_ARCHS.
  Added new packm var3 to 'gemmlike'.
  Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.
  Fix more copy-paste errors in the haswell gemmsup code.
  Do a fast test on OSX. [ci skip]
  Fix AArch64 tests and consolidate some other tests.
  Use C++ cross-compiler for ARM tests.
  Attempt to fix cxx-test for OOT builds.
  Updated travis-ci.org link in README.md to .com.
  Disabled (at least temporarily) commit 8e0c425.
  Define BLIS_OS_NONE when using --disable-system.
  Updated stale calls to malloc_intl() in gemmlike.
  Blacklist clang10/gcc9 and older for 'armsve'.
  Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.
  Moved lang defs from _macro_def.h to _lang_defs.h.
  Minor tweaks to gemmlike sandbox.
  Added local _check() code to gemmlike sandbox.
  README.md citation updates (e.g. BLIS7 bibtex).
  Tweaks to gemmlike to facilitate 3rd party mods.
  Whitespace tweaks.
  Add row- and column-strides for A/B in obj_ukr_fn_t.
  Clean up some warnings that show up on clang/OSX.
  Remove schema field on obj_t (redundant) and add new API functions.
  Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.
  Disabled sanity check in bli_pool_finalize().
  Implement proposed new function pointer fields for obj_t.

AMD-Internal: [CPUPL-2698]
Change-Id: I6fc33351fa824580cf4f25b63f0370383cd9422d
2023-11-10 13:05:12 -05:00
mangala v
3136e57a39 Fixed memory leak issue reported by ASAN in testsuite.
Memory allocated for pointer chars_for_dt was not freed at the
end of function in testsuite.

Freeing up of the buffer fixed the issue.

AMD-Internal: [CPUPL-3932]
Change-Id: I432c3ff95d289159f02a871b6d4fff5ab252ea9e
2023-11-10 12:11:55 -05:00
Eleni Vlachopoulou
acdaa91786 Revert "CMake : placing include folder under blis directory."
This reverts commit 59f1333883.

Reason for revert: Potentially breaking CI

Change-Id: I65a92a96896091cb92cc534d5b458070524ab75a
2023-11-10 10:37:56 -05:00
jagar
59f1333883 CMake : placing include folder under blis directory.
Updating cmake files to place include folder under
blis directory in new cmake system on windows.

AMD-Internal: [CPUPL-2748]
Change-Id: I650cca95193f7c89b39648ac1bda1fa1093b1560
2023-11-10 04:14:23 -04:00
Mangala V
f6046784ce Re-Designed SGEMM SUP kernel to use mask load/store instruction
Added all fringe kernels with mask load store support
Fringe kernels cover m direction from 5 to 1 and
n direction from 15 to 1 for row storage format

- New edge kernels that use masked load-store
  instructions for handling corner cases.

- Mask load-store instruction macros are added.
  vmaskmovps, VMASKMOVPS for masked load-store.

- It improves performance by reducing branching overhead
  and by being more cache friendly.

- Mask load-store is added only for row storage format

AMD-Internal: [CPUPL-4041]

Change-Id: I563c036c79bf8e476a8ebde37f8f6db751fb3456
2023-11-10 01:23:48 -05:00
Harsh Dave
77161c1e5d Design change of DGEMM 6x8 native kernel.
- Following optimizations are included for dgemm 6x8 native kernel.
1) Reorganized the C update and store to reduce register dependencies.
2) Moved the C prefetch to part-way through the kernel for efficiently
prefetching C matrix at appropriate distance.
3) Offsetting A matrix, so that kernel can use a smaller instruction
encoding, saving i-cache space.
4) Aligned the K iteration loop.

- Thanks to Moore, Branden <Branden.Moore@amd.com> for these design
  changes of DGEMM 6x8 native kernels.

- Additional change, reorganization of C update and store for
beta zero case to facilitate out of order execution of storing of C
matrix.


Change-Id: I9d1ec8d39f1154b0f38b136bd6a04b05d7d1e6ba
2023-11-09 23:07:43 -05:00
Meghana Vankadari
77bd9a7f17 Added parameter checking for LPGEMM APIs
Change-Id: I6ea89fd0d2516539e5a4e9cd8537570b23194d89
2023-11-09 21:50:55 -05:00
Vignesh Balasubramanian
bd0b50a077 Introduced fast-path to kernels in DNRM2_ and DZNRM2_ APIs
- Added a conditional check to see if the vectorized kernels
  for DNRM2_ and DZNRM2_ can be called directly, without
  incurring any framework overhead.

- The condition to satisfy this fast-path is for the size to be
  such that the ideal threads required is 1, with the vector having
  unit stride( so that packing at the framework-level can be avoided ).

AMD-Internal: [CPUPL-4045]
Change-Id: Ie37e86f802ada0e226dff88e74f0341e97ebfe28
2023-11-09 21:13:10 +05:30
Eashan Dash
e4e4fe55fb Added Parameter Checks and DTL Trace for Extension APIs
1. Added input parameter checking for the extension APIs
   1. gemm_pack_get_size API
   2. gemm_pack API

2. Additionally added early returns for these APIs when
   m or n dimensions are 0.

3. Routines for input parameter check for all the 3
   BLAS extension APIs - gemm_pack_get_size, gemm_pack and
   gemm_compute are defined in:
   frame/compat/check/bla_gemm_pack_compute_check.h

4. Added AOCL DTL TRACE for all the functions of
   1. gemm_pack_get_size
   2. gemm_pack
   3. gemm_compute

AMD-Internal: [CPUPL-3560]
Change-Id: I4351b8494d888eae7e7431a7e1e23e442ffc8631
2023-11-09 18:53:59 +05:30
Eleni Vlachopoulou
75a4d2f72f CMake: Adding new portable CMake system.
- A completely new system, made to be closer to Make system.

AMD-Internal: [CPUPL-2748]
Change-Id: I83232786406cdc4f0a0950fb6ac8f551e5968529
2023-11-09 15:49:45 +05:30
Meghana Vankadari
0c12b72651 LPGEMM bench enhancements
Details:
- Moved the downscale & postop options from command line to
  input file.
- Now the format of the input file is as follows:
  dt_in dt_out stor transa transb op_a op_b m n k lda ldb ldc postops
- In case of no-postops, 'none' has to be passed in the place of
  postops.
- Removed duplication of mat_mul_bench_main function for bf16 APIs.
- Added a function called print_matrix for each datatype which can
  help in printing matrices while debugging.
- Added printing of ref, computed and diff values while reporting
  failure.
- Added new functions for memory allocation and freeing. A different
  type of memory allocation is chosen based on the mode the bench is
  running in (performance or accuracy mode).

Change-Id: Ia7d740c53035bc76e578a03869590c9f04396b72
2023-11-09 03:55:10 -05:00
mkadavil
ed052c6c44 Smart Threading for LPGEMM <u|s>8s8s<16|32>os<8|16|32> API.
The LPGEMM micro-kernel operates on blocks of dimension MRxKC and
KCxNR. Current LPGEMM design involves using all the available threads
for computing the output. If the number of threads assigned along ic
or jc direction is more than M/MR or N/NR blocks respectively, it
could result in threads sleeping due to the lack of MR or NR blocks.
This scenario is now handled by reducing the number of threads if
there are threads without any work (MR or NR blocks).

AMD-Internal: [SWLCSG-2354, SWLCSG-2389, SWLCSG-2267]
Change-Id: I74819337c7a0d3ab05ea0e18bb42780f977ea8f6
2023-11-09 00:50:30 -05:00
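The thread reduction described in ed052c6c44 amounts to capping the threads assigned along a dimension at the number of blocks available in it. The sketch below is a minimal illustration under stated assumptions: demo_num_blocks and demo_cap_threads are hypothetical names, the block count is rounded up so that a partial fringe block still counts as work, and the actual LPGEMM logic may differ.

```c
#include <stddef.h>

/* Number of blocks of size `blk` needed to cover `dim` (ceiling division). */
static size_t demo_num_blocks(size_t dim, size_t blk)
{
    return (dim + blk - 1) / blk;
}

/* Caps the threads requested along one dimension at the number of
   blocks available, so that no thread is left without a block. */
static size_t demo_cap_threads(size_t requested, size_t dim, size_t blk)
{
    size_t blocks = demo_num_blocks(dim, blk);
    if (blocks == 0)
        blocks = 1; /* keep at least one thread for an empty dimension */
    return requested < blocks ? requested : blocks;
}
```

For example, with M = 20 and MR = 6 there are four blocks along ic, so a request for eight threads in that direction is reduced to four.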
Edward Smyth
d2ad268525 Compilation error when using pthreads on FreeBSD
We are using pthread_self to get a thread id for use in the DTL
tracing functionality to name individual output files per thread.
This is not an appropriate use of pthread_self as its return type
(pthread_t) is an opaque type that can vary between implementations.
On Linux we haven't had a problem, as pthread_t is an unsigned long
int. However on FreeBSD it is a pointer to an empty struct. The
difference between this and the int type we used for its value within
the BLIS code was causing a compile error.

The best long term solution would be for pthread builds to maintain
their own internal thread id. A mechanism to implement this has not
yet been identified. In the meantime, we make the following changes as
a stopgap:
- Explicitly cast from pthread_t return value to our BLIS internal
  data type AOCL_TID.
- Make AOCL_TID a long int rather than pid_t (i.e. an int) in pthread
  builds to match the sizes expected on both Linux and FreeBSD.

AMD-Internal: [CPUPL-4167]
Change-Id: Ia07ee8f97273cc3bab46f6bca1eeb7954320415b
2023-11-09 00:20:13 -05:00
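The stopgap described in d2ad268525 can be sketched as below. The type demo_tid_t is a hypothetical stand-in for AOCL_TID, and demo_get_tid is an illustrative name. The cast compiles where pthread_t is an integer or a pointer (as on Linux and FreeBSD per the commit message), but it is not portable to platforms where pthread_t is a struct.

```c
#include <pthread.h>

/* Hypothetical stand-in for AOCL_TID in pthread builds: a long int is
   wide enough for an unsigned long or a pointer on common 64-bit
   Linux and FreeBSD targets. */
typedef long int demo_tid_t;

/* Returns a value usable for naming per-thread output files.
   pthread_t is opaque, so the explicit cast is a stopgap rather than
   a portable thread id. */
static demo_tid_t demo_get_tid(void)
{
    return (demo_tid_t)pthread_self();
}
```

The value is stable for the lifetime of a thread, which is all that naming per-thread trace files requires.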
Edward Smyth
9500cbee63 Code cleanup: spelling corrections
Corrections for some spelling mistakes in comments.

AMD-Internal: [CPUPL-3519]
Change-Id: I9a82518cde6476bc77fc3861a4b9f8729c6380ba
2023-11-09 00:16:30 -05:00
Harsh Dave
75356d45e5 DGEMM improvement for very tiny sizes less than 24.
- This commit helps improve performance for very small inputs
by reducing framework checks and routing all such inputs to
bli_dgemm_tiny_6x8_kernel. It forces single-threaded computation
for such sizes.

- It invokes bli_dgemm_tiny_6x8_kernel for the ZEN, ZEN2, ZEN3 and ZEN4
code paths, except when the AOCL_ENABLE_INSTRUCTIONS environment
variable is set to avx512. In that case, such small inputs are
routed to the bli_dgemm_tiny_24x8_kernel avx512 kernel.

AMD-Internal: [CPUPL-1701]
Change-Id: Idf59f4a8ee76ee8f2514a33be2b618e3ce02383e
2023-11-08 23:45:57 -05:00
Arnav Sharma
008b77e94d BLAS Compliance: SCALV Early Returns
- According to BLAS Standards, SCALV should return when incx .le. 0.
- To make SCALV compliant to this, added an early return inside the BLAS
  layer, for the cases where incx <= 0.
- Also, added early return for the case where alpha is a unit scalar.

AMD-Internal: [CPUPL-3562]
Change-Id: Id474fdd6ed9232226f5c5381d0398f43384e4a49
2023-11-08 06:36:02 -05:00
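The early returns described in 008b77e94d can be illustrated with a minimal scaling routine. demo_dscal is a hypothetical function written for this example and is not the BLIS BLAS layer; the n <= 0 check is an addition here that mirrors the reference BLAS behaviour, while the incx <= 0 and unit-alpha checks come from the commit message.

```c
/* Hypothetical BLAS-style vector scale: x := alpha * x. */
static void demo_dscal(int n, double alpha, double *x, int incx)
{
    /* Early return: nothing to do for an empty vector or a
       non-positive increment. */
    if (n <= 0 || incx <= 0)
        return;

    /* Early return: scaling by one leaves x unchanged. */
    if (alpha == 1.0)
        return;

    for (int i = 0; i < n; ++i)
        x[i * incx] *= alpha;
}
```

Called with incx = 0 or a negative incx, the routine leaves x untouched rather than walking the vector.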
Vignesh Balasubramanian
5f9c8c6929 Bugfix : Fallback mechanism in SNRM2 and SCNRM2 kernels if packing fails
- Abstracted packing from the vectorized kernels for SNRM2 and SCNRM2 to
  a layer higher.

- Added a scalar loop to handle compute in case of non-unit strides.
  This loop ensures functionality in case packing fails at the
  framework level.

AMD-Internal: [CPUPL-3633]
Change-Id: I555aea519d7434d43c541bb0f661f81105135b98
2023-11-08 15:16:10 +05:30
mangala v
fa355c0049 Removed warning during compilation of gemv api for non-zen config
- When configured for haswell config, "Warning unused variable 'zero'"
  was thrown during compilation.
- Removed the unused variable zero.

AMD-Internal: [CPUPL-3973]
Change-Id: I45a1f16b4c50307b07148bba63ca5332c48648b8
2023-11-08 01:43:33 -05:00
Vignesh Balasubramanian
06f23c4fd4 Bugfix : Functional correctness of DNRM2_ and DZNRM2_ APIs
- Updated the final reduction of partial sums( AVX-2 code section )
  to use scalar accumulation entirely, instead of using the
  _mm256_hadd_pd( ... ) intrinsic. This will in turn change the
  associativity in the reduction step.

- Reverted to using scalar code on the fringe cases in AVX-2 kernel
  for DNRM2 and DZNRM2, for improving functional correctness.

AMD-Internal: [CPUPL-4049]
Change-Id: I9d320b39d23a0cbcc77fb24d951fced778ea5ea5
2023-11-07 10:21:41 -05:00
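Commit 06f23c4fd4 notes that replacing the horizontal add with scalar accumulation changes the associativity of the final reduction. The scalar sketch below shows why the grouping matters in floating point. It does not use AVX2 intrinsics and is not the BLIS kernel: both functions are hypothetical, and the pairwise version only mimics the adjacent-pair grouping that a horizontal-add step produces.

```c
/* Sequential (scalar) accumulation of four partial sums:
   ((p0 + p1) + p2) + p3. */
static double demo_reduce_sequential(const double p[4])
{
    double acc = p[0];
    acc += p[1];
    acc += p[2];
    acc += p[3];
    return acc;
}

/* Pairwise accumulation: (p0 + p1) + (p2 + p3), the grouping that
   adding adjacent lanes first would give. */
static double demo_reduce_pairwise(const double p[4])
{
    double lo = p[0] + p[1];
    double hi = p[2] + p[3];
    return lo + hi;
}
```

With the partial sums {1e16, 1.0, 1.0, 1.0} and strict IEEE 754 double evaluation, the sequential version returns 1e16 because each added 1.0 is rounded away, while the pairwise version returns 1e16 + 2. Neither grouping is inherently right; the point is only that the two orders are not interchangeable bit for bit.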
Harsh Dave
0de10cc86c Added k=1 avx512 dgemm kernel.
- This commit implements an avx512 dgemm kernel for k=1 cases,
which gets called for the zen4 codepath.

- Added architecture check for k=1 kernel in dgemm code path
to pick the correct kernel based on cpu architecture, since
blis now has avx2 and avx512 dgemm kernels for the k=1 case.

- Previously in dgemm path bli_dgemm_8x6_avx2_k1_nn kernel was
being called irrespective of architecture type.

- Added architecture check before calling the kernel for case where
k=1, so only for respective architectures this kernel is invoked.

AMD-Internal: [CPUPL-4017]
Change-Id: I418bbc933b41db41d323b331c6d89893868a6971
2023-11-07 01:10:09 -05:00
Harihara Sudhan S
bed6bd4941 Modified DSCALV AOCL dynamic
- AOCL dynamic logic that determines the number of threads to be
  launched has been modified.

AMD-Internal: [CPUPL-3956]
Change-Id: Ia6c052515bd24e93660f020a7d0894fc75a229fc
2023-11-06 22:35:14 -05:00
Arnav Sharma
dd1cf23090 Gtestsuite Update for Pack and Compute Extension APIs
- Pack and compute are now compared against GEMM operation of reference
  library when MKL is not used as a reference.
- For the case where both A and B are unpacked, the reference GEMM is
  invoked with a unit-alpha scalar.
- If MKL is used as reference, then these APIs are compared against pack
  and compute operations of MKL.
- Updated description in ref_gemm_compute.cpp to reflect this behavior.

AMD-Internal: [CPUPL-4084]
Change-Id: Id0521c9cad8743a7ae471a7f3c547ceb67191f86
2023-11-03 09:45:42 -04:00
Shubham Sharma
ffa8f584be Added ZTRSM AVX512 native path kernels
- Added 4x12 ZGEMM row-preferred kernel.
- Added 4x12 ZTRSM row-preferred lower
  and upper kernels using AVX512 ISA.
- These kernels are used for ZTRSM only, zgemm
  still uses 12x4 kernel.
- Kernels support row/col/gen storage.
- Kernels support A prefetch, B prefetch,
  A_next prefetch, B_next prefetch and c prefetch.
- B prefetch, B_next prefetch and C prefetch
  are enabled by default.
- Updated CMakeLists.txt with ZGEMM kernels for
  windows build.
AMD-Internal: [CPUPL-3781]

Change-Id: I0fb4b2ec2f4bd66db6499c25f12bcc4bdb09804a
2023-11-03 09:42:24 -04:00
Arnav Sharma
44dfc7a515 Fix for gemm_compute BLAS Check
- BLAS compute checks updated to properly check for rs_c and cs_c.
- Updated BLAS compute checks to skip validity check if m==1 or n==1.
  For the same reason, added a check just before to validate rs_c and
  cs_c are greater than or equal to 1.
- Added tiny size tests to gtestsuite as a sanity check.
- Also updated the Invalid Input Tests to test for the updated checks.

AMD-Internal: [CPUPL-4140]
Change-Id: I984339ec7909778b58409ffcdbeed4ee33f28cfb
2023-11-03 09:41:16 -04:00
Arnav Sharma
8885510db2 Fix for Missing Symbols for gemm_pack_get_size
- Symbols for gemm_pack_get_size were not being exported properly when
  BLIS was built as a shared library.
- Correctly assigned the BLIS_EXPORT_BLAS macro to ?gemm_pack_get_size_
  function declaration.
- Added missing gemm_pack and gemm_pack_get_size macros to
  bli_macro_defs.h file.
- Removed an unnecessary BLIS_EXPORT_BLAS macro from dgemm_compute
  function definition.
- Updated bli_util_api_wrap with no underscore API wrappers for pack and
  compute set of BLAS Extension APIs:
  1. ?gemm_pack_get_size
  2. ?gemm_pack
  3. ?gemm_compute

AMD-Internal: [CPUPL-4083]
Change-Id: I78cd7642c2fcbfdf02676e654a377ad2aa5295c1
2023-11-03 08:58:59 -04:00
Eashan Dash
c3d1a3878c Parallelized Pack and Compute Extension APIs
1. OpenMP based multi-threading parallelism is added for BLAS
   extension APIs of Pack and Compute

2. Both pack and compute APIs are parallelized.

3. Multi-threading of pack and compute APIs done with different
   number of threads can lead to inconsistent results due to
   output difference of the full packed matrix buffer when packed
   with different number of threads.

4. In multi-threaded execution, we ensure output of packed buffer
   is exactly the same as in single threaded execution.

5. Similarly for compute API, read of packed buffer in multi-
   threaded execution is exactly the same as in single-threaded
   execution.

6. Routines are added to compute the offsets for thread workload
   distribution for MT execution.
   1. The offsets are calculated in such a way that it resembles
      the reorder buffer traversal in single threaded reordering.
   2. The panel boundaries (KCxNC) remain as it is accessed in
      single thread, and as a consequence a thread with jc_start
      inside the panel cannot consider NC range for reorder.
   3. It has to work with NC' < NC, and the offset is calculated
      using prev NC panels spanning k dim + cur NC panel spanning
      pc loop cur iteration + (NC - NC') spanning current
      kc0 (<= KC).

7. Routines to ensure the same are added for MT execution
   1. frame/base/bli_pack_compute_utils.c
   2. frame/base/bli_pack_compute_utils.h

AMD-Internal: [CPUPL-3560]
Change-Id: I0dad33e0062519de807c32f6071e61fba976d9ac
2023-11-03 08:47:17 -04:00
Vignesh Balasubramanian
84faccdd7d Enabling the vectorized path for SNRM2_
- Enabled the vectorized AVX-2 code-path for SNRM2_. The
  framework queries the architecture ID and calls the
  vectorized kernel based on the architecture support.

- In case of not having the architecture support, we use
  the default path based on the sumsqv method.

AMD-Internal: [CPUPL-3277]
Change-Id: Ic60c0782dec0b7eb09fac21818eb625e57b1d14f
2023-11-03 17:45:56 +05:30
Edward Smyth
d8b8f68066 Improvements to xerbla functionality (2)
Improvements to functionality introduced in commit
6d0444497f:
- Call to bli_init_auto() before calling PASTEBLACHK macro in
  gemv caused significant runtime overhead. Initialize stored
  info_value directly.
- Add similar code in frame/compat/f2c routines.

AMD-Internal: [CPUPL-3520]
Change-Id: I2df201aed7dbceb4cbe66d6c81b5a03e8092de89
2023-11-03 06:38:40 -04:00
Meghana Vankadari
f8f4343b55 Updated cntx with packA function pointer for AVX512_VNNI support
Details:
- Modified bench to support testing for sizes where matrix
  strides are larger than the corresponding dimensions.
- Modified early-return checks in all interface APIs to
  check validity of strides in relation to the corresponding
  dimension rather than checking if strides are equal to dimensions.

Change-Id: I382529b636a4acc75f6d93d997af22a168a7bfc4
2023-11-03 04:50:00 -04:00
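The relaxed early-return check described in f8f4343b55 can be sketched as a comparison of each stride against its dimension. The two helpers below are hypothetical and only contrast the checks: the stricter one accepts a stride only when it equals the dimension, the relaxed one accepts any stride at least as large as the dimension.

```c
/* Stricter check: the stride must equal the dimension exactly. */
static int demo_stride_is_exact(long dim, long stride)
{
    return stride == dim;
}

/* Relaxed check: the stride may exceed the dimension, as when the
   matrix is a sub-block of a larger allocation. */
static int demo_stride_is_valid(long dim, long stride)
{
    return dim > 0 && stride >= dim;
}
```

A matrix with 4 columns stored with a leading dimension of 6 fails the exact check but passes the relaxed one, which is the case the updated bench is meant to exercise.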