Commit Graph

1998 Commits

Author SHA1 Message Date
Meghana
28bb28b79f Modified Function definition for BLAS and CBLAS interfaces of DOTV and SWAPV APIs
Details:
-Kernel is called directly from API call to avoid framework
 overhead in case of single and double precisions.
-Currently these changes are applicable only for zen2 configuration.
 They will be enabled for zen family processors in future.
-These changes improve performance of BLAS and CBLAS interfaces of API.
 They do not affect BLIS-specific APIs.

Change-Id: I1eb7ca470ced82c3cfa8b22f2b53000d42fef96c
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847,CPUPL-816]
2020-05-04 15:06:07 +05:30
Devrajegowda, Kiran
4ad5b1a5e6 Update zen2 kernel context with number of level1 kernels
Details:
       - Adding missed copyv simd kernel changes while rebasing.

Change-Id: Iaedabd39fdf297fefec1e48e7d2c1a2f3d7eb08d
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-04-24 12:22:24 +05:30
Devrajegowda, Kiran
4caee59466 Adding a simd kernel for copyv function
Details:
    - Separate kernel for copyv function added to improve performance.
    - Modified cntx_init file in zen and zen2 configuration
    - Added test_copyv.c in test folder
    - Modified test/Makefile to include test_copyv.c

Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-04-24 01:55:25 -04:00
Nallani Bhaskar
ba00f75f64 Merge "JIRA: CPUPL-853: Fix for the redefinition of _unsigned int __get_cpuid_max(unsigned int, unsigned int*)_. http://ontrack-internal.amd.com/browse/CPUPL-853 https://github.com/flame/blis/issues/393" into amd-staging-rome-2.2 2020-04-24 01:12:00 -04:00
Nallani Bhaskar
ea3865fbf2 JIRA: CPUPL-853: Fix for the redefinition of _unsigned int __get_cpuid_max(unsigned int, unsigned int*)_. http://ontrack-internal.amd.com/browse/CPUPL-853 https://github.com/flame/blis/issues/393
Change-Id: I88c23b2fdad0beb3796d0e6acbcf215fe9daab2d
2020-04-23 17:14:24 +05:30
Meghana
f80e21ca7b Modified Function definition for BLAS and CBLAS interfaces of I?AMAX API
Details:
-Kernel is called directly from API call to avoid framework
 overhead in case of single and double precisions.
-Currently these changes are applicable only for zen2 configuration.
 They will be enabled for zen family processors in future.
-These changes improve performance of BLAS and CBLAS interfaces of API.
 They do not affect BLIS-specific APIs.

Change-Id: Ib12f5a4f66a3227681fb3028207a08cb69cc2406
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-855]
2020-04-22 17:43:55 +05:30
Meghana
138bc75063 Modified function definition for AXPY CBLAS interface
Details:
-Kernel is called directly from API call to avoid framework
 overhead in case of single and double precisions.
-Currently these changes are applicable only for zen2 configuration.
 They will be enabled for zen family processors in future.

Change-Id: Ifa17dc28d3b38e1e16b28bb785d9fdf4a223d909
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-805]
2020-04-21 07:59:07 -04:00
Devrajegowda, Kiran
1c76723320 Block parameters tuning to improve sgemm performance on Rome
Details:
    - Tuned block sizes to get better performance for sgemm default path.

Change-Id: I892e8642fa2d03a07a6d53537131536e6b1b091e
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-832]
2020-04-21 07:34:12 -04:00
Meghana Vankadari
139fbbb77f Merge "Added opt kernels for SWAPV" into amd-staging-rome-2.2 2020-04-21 05:54:19 -04:00
Kiran Varaganti
0fdb539d40 Fixed CPUPL-845 - expert interfaces consistent with other interfaces w.r.t disabling selective packing in sup by defaut
Change-Id: Id678ee727e8e9197e1c5b48a994fafd7797c48f2
2020-04-20 16:15:05 +05:30
Meghana
b846059bcf Added opt kernels for SWAPV
Details:
-Added SIMD kernels for SWAPV for both single and double precisions.
-Modified cntx_init file for zen and zen2 configurations to choose opt kernels for
 SWAPV.
-Added test_swapv.c in test folder.
-Modified test/Makefile to include test_swapv.c

Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847]
2020-04-20 01:21:44 -05:00
Dipal Madhukar Zambare
f7bb291f6b Merge "Disable execution and debug trace by default." into amd-staging-rome-2.2 2020-04-20 01:16:04 -04:00
dzambare
80de43a483 Disable execution and debug trace by default.
Change-Id: I126336bc3d8a49019b083e66621c6a79725f7f0d
2020-04-20 10:44:32 +05:30
Meghana
80086fad15 Modified function definition for AXPY BLAS interface
Details:
-Calling the kernel directly from API call to avoid framework
overhead.
-Currently these changes are only applicable for zen2 configuration.
 They will be enabled for zen family processors in future.

Change-Id: I0139e185178f726f5cd8cba0ff6a441a00d67868
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-805]
2020-04-19 23:55:27 -05:00
Vasanthakumar Rajagopal
489d501f2e Merge "Execution and Debug trace support." into amd-staging-rome-2.2 2020-04-15 06:30:50 -04:00
Meghana
e56cf63a3f Optimized "bli_dotv_zen_int10" kernels
Details:
- Fixed issues in "bli_dotv_zen_int10" kernels and optimized them.
- Changed cntx_init file to choose "bli_dotv_zen_int10" kernel for dotv
 API call.

Change-Id: Iee8d7519f3a22a2d41166390be6047e9cb37557f
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-824]
2020-04-14 09:52:57 +05:30
dzambare
d40edf7dac Execution and Debug trace support.
Added support add debug logging, execution trace and decode.

Change-Id: I024bf6165daa9e23a62423f2401c0f1c5de459ba
AMD-Internal: [CPUPL-806]
2020-04-07 08:48:59 +05:30
Meghana
b5fe75e104 Closing input and output files in test_gemm.c and test_trsm.c
Change-Id: I75cdd5adc2bd2dac7d0eca9c050e06dbd52bec26
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
2020-03-24 09:09:58 +05:30
Meghana Vankadari
ddcb3d8a52 Modified test_trsm.c file in test folder to read input sizes from a file
Details:
-A Macro 'FILE_IN_OUT' is defined to read matrix dimensions and strides from a csv file.
 Format for input file if 'FILE_IN_OUT' is defined:
 Each line defines a TRSM problem with the following parameters: m n cs_a cs_b
 The operation implemented by default is AX=B where A is lower-triangular and matrices are in column-major order.
 When macro is disabled, it reverts back to original implementation.
 Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv
-A macro 'READ_ALL_PARAMS_FROM_FILE' is defined to read all the parameters for TRSM from a csv file.
 This macro can be defined only when 'FILE_IN_OUT' is already defined.
 Format for the input file if 'READ_ALL_PARAMS_FROM_FILE' is defined:
 Each line defines a TRSM problem with the following paramenters: sideA uploA transA diagA m n cs_a cs_b
 By default, column-major order is chosen as storage scheme for matrices.
 Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv

Change-Id: I349bc69ca968911c16e04d1ce70974d01e65a2fb
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
2020-03-20 07:15:17 -04:00
Meghana
c20c96d9c0 Made some critical changes to small_gemm kernels
Details:
- In case of GEMM, whenever beta is zero, we need to perform C = alpha
*(A * B) instead of C = beta * C + alpha * (A * B)
 Added conditions to check the value of beta at different levels inside
 small_gemm kernels and decide whether to perform scaling C with beta or
 not.
-Modified small_gemm kernels to use BLIS specific functions to retrieve
 different fields of objects.
-Calling bli_gemm_check before entering bli_gemm_small to facilitate
 early return in case of invalid inputs.
-For corner cases inside small_gemm kernels, a buffer called f_temp
 is used to load and store data to and from registers.
 populating the buffer with zeroes before use.
-In bli_gemm_front, datatypes of status and return value from
 bli_gemm_small are not matching.
 Corrected the datatype of the variable 'status' inside bli_gemm_front
 to err_t.

Change-Id: I8b52ad55008f028d6c8b7e0d20f746a869d9daea
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-689,SWLCSG-104]
2020-03-19 16:30:04 +05:30
Field G. Van Zee
efe85b3c19 Added missing return to bli_thread_partition_2x2().
Details:
- Added a missing return statement to the body of an early case handling
  branch in bli_thread_partition_2x2(). This bug only affected cases
  where n_threads < 4, and even then, the code meant to handle cases
  where n_threads >= 4 executes and does the right thing, albeit using
  more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
  for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).

Change-Id: I2182be0911f76861dd14bec9b6bacb6c20c2725d
2020-03-16 12:28:25 +05:30
Field G. Van Zee
a7c5723e77 Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.

Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0
2020-03-13 01:10:34 -04:00
Kiran Devrajegowda
04fc9d3710 Merge "Fixed bug(s) in mt sup when single-threaded." into amd-staging-rome-2.2 2020-03-13 01:10:22 -04:00
Field G. Van Zee
574de9e29e Fixed bug(s) in mt sup when single-threaded.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
  changing function interface for the thread entry point function
  (of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
  a memory leak in the sba at bli_finalize() time. It turns out that,
  due to the new multithreading-capable variant code useing thrinfo_t
  objects--specifically, their calling of bli_thrinfo_grow()--we
  have to pass in a real thrinfo_t object rather than the global
  objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
  Thus, I inserted the appropriate logic from the OpenMP and pthreads
  versions so that single-threaded execution would work as intended
  with the newly upgraded variants.

Change-Id: I2bfff849abf3fa30c73e0c5876128400854bbcb5
2020-03-13 01:10:04 -04:00
Field G. Van Zee
1a284828d1 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-03-13 01:09:29 -04:00
Nallani Bhaskar
83745c7ffc Beta Zero Check for sgemm small. Core Software Group SWLCSG-137 BLIS-ST validation failures
Change-Id: I21d5eae6ec390438be847f2dca42350b97059d6e
2020-03-09 02:55:51 -04:00
Nallani Bhaskar
e0c95d77e1 Beta Zero Checks for sgemm_small
Change-Id: I111b66ad54a27b1977d155904738a55a351e6689
2020-03-09 02:55:25 -04:00
Meghana Vankadari
cc98047fd6 Made framework changes to initialize specific cache block sizes for TRSM.
Details:
-This commit addresses the performance optimization(single-thread and
 multi-thread) for DTRSM on zen2.
-This new optimization employs different MC, KC & NC values for TRSM than
 what is being used in other Level-3 routines like DGEMM.
-Changed TRSM framework code to choose these blocksizes for TRSM
 on zen family configurations.
-Added a new field called "trsm_blkszs" to cntx structure in order to
 store TRSM specific block sizes.
-Implemented routines to initialize, set and query the TRSM-specific
 block sizes.
-Defined a new macro "AOCL_BLIS_ZEN" in configure script.
 This macro is automatically defined for zen family architectures.
 It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes.

Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6
Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com>
AMD-Internal: [CPUPL-656]
2020-03-09 10:33:42 +05:30
dzambare
f965b95d8b CPUPL-587: Corrected condition for A packing in sgemm_small
Change-Id: I1e5dc4a1dbe2f1d17f9c72e8dd0c6728ac1fd750
2020-01-27 11:08:20 +05:30
Meghana
b3e2938b9e Fix for CPUPL-549: TRSM for AlXB case results in NaN values
For the kernel of size 4x8, cs_b is used instead of cs_a to calculate address of diagonal elements of matrix A.
Correcting the mistake.

Change-Id: Ie74e0f6a397fcd32fefb5804cd00f1e90bfe5523
AOCL-2.1 2.1
2019-12-21 23:12:09 +05:30
Dipal M Zambare
72f4a7ab1e Increased pool buffer size to accommodate packing buffers needed in small_gemm to make it reentrant.
Change-Id: I96ac19ce97c39becce2c6e7ab47c3e7624560b30
2019-12-19 14:45:13 +05:30
Meghana Vankadari
62e00b4d64 Merge "Change in threshold condition for trsm_small kernels" into amd-staging-rome-rel-2.1 2019-12-17 23:54:01 -05:00
Meghana Vankadari
8eb264f78b Change in threshold condition for trsm_small kernels
Change-Id: I396e246b1639d300fcb94bdf7e5fa8bc8c87e994
2019-12-16 18:54:48 +05:30
Devrajegowda, Kiran
1fe8edbed0 "Merge Selective Packing code from amd branch flame/blis"
Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed
2019-12-16 14:48:53 +05:30
Nallani Bhaskar
a8af07f68c Added support to handle unsupported storage formats in sgemmsup using normal/small gemm path
Change-Id: I8762059c89e50f60e765a2a2983c5b2bdcdd8bc1
2019-12-13 15:31:28 +05:30
Kiran Devrajegowda
21224e8264 Merge "Revert " Merge Selective Packing code from amd branch flame/blis"" into amd-staging-rome-rel-2.1 2019-12-13 00:45:34 -05:00
Nallani Bhaskar
10a26a7357 Merge "Fix for CPUPL-550: AOCC clang compiler error. Resolved: Duplicate back to back declaration of a lable in asm file" into amd-staging-rome-rel-2.1 2019-12-13 00:25:49 -05:00
Kiran Varaganti
1650bcb623 Revert " Merge Selective Packing code from amd branch flame/blis"
This reverts commit e4a6af33f5.

Reason for revert: <Review not done>

Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2
2019-12-13 00:01:35 -05:00
Nallani Bhaskar
dc4e7d1203 Fix for CPUPL-550: AOCC clang compiler error. Resolved: Duplicate back to back declaration of a lable in asm file
Change-Id: I82c386d5fc00139da74fa031980d65c6a3874bd0
2019-12-12 20:43:47 +05:30
Devrajegowda, Kiran
e4a6af33f5 Merge Selective Packing code from amd branch flame/blis
Change-Id: I6d577f67ec84febe6af3635b10e5c9c77844ccd2
2019-12-12 15:22:21 +05:30
Kiran Varaganti
edc8f04dea Merge "Fix for CPUPL-541,When threading is enabled blis-mt library gets generated otherwise blis.a gets generated for sequential builds. However blis.h header file will differ for both type of libraries. The difference is about enable/disable of defining BLIS_ENABLE_OPENMP or BLIS_ENABLE_PTHREADS when threading is enabled. Appropriate header file needs to be included in the application." into amd-staging-rome-rel-2.1 2019-12-11 01:26:06 -05:00
Nallani Bhaskar
44edee7404 Added support to handle 7x16,8x16,9x16 efficiently in 6x16n kernel 2019-12-10 16:09:46 +05:30
Kiran Varaganti
82ec21f1c7 Fix for CPUPL-541,When threading is enabled blis-mt library gets generated otherwise blis.a gets generated for sequential builds. However blis.h header file will differ for both type of libraries. The difference is about enable/disable of defining BLIS_ENABLE_OPENMP or BLIS_ENABLE_PTHREADS when threading is enabled. Appropriate header file needs to be included in the application.
Change-Id: I82a4ae2bf529eedd83868e059a43749714cbe246
2019-12-10 15:40:27 +05:30
Kiran Varaganti
9b6c04d075 Merge " change in threshold condition for SUP and small kernels" into amd-staging-rome-rel-2.1 2019-12-08 23:42:25 -05:00
Devrajegowda, Kiran
3192914a1c change in threshold condition for SUP and small kernels
Change-Id: I7dbd30b2004c67122a639f081efc36e0f0d69fad
2019-12-09 01:31:58 +05:30
Kiran Varaganti
27d2b5a0db Merge "Made some improvements to trsm_small kernels" into amd-staging-rome-rel-2.1 2019-12-06 05:21:34 -05:00
Meghana
17b3a2639e Made some improvements to trsm_small kernels
Interchanged some loops to favour column-major storage.
Added check condiion to identify last column and load it using a 'for' loop to avoid memory accesses out of buffer

Change-Id: Id5d2e16c65017a7f4b641d33228d23903efd09ac
2019-12-06 14:48:28 +05:30
Nallani Bhaskar
af94ba29cf Added sup support for sgemm under zen and related frame work changes.
Change-Id: Ia7e88b96d3a3617e8d24754f50db081ffe2e9955
2019-12-04 10:56:10 +05:30
Meghana Vankadari
31bfe8985f re-enabling the boundary check condition for bli_dtrsm_small_AlXB. It was disabled by mistake in previous commits.
Change-Id: Ib7d2d0c5e133ff10559ce3dc5f7e624707e43c11
2019-12-03 17:07:37 +05:30
Meghana
cef185250e Fixed Segmentation fault in trsm_small kernels for the case AlXB.
For matrix sizes which are not multiples of 4, trsm_small kernels access memory outside the allocated buffers which causes segmentation fault.
This is fixed by handling each of the corner cases separately.

Change-Id: Ia7cfad5d65339a209a7376cc1654382593c933af
2019-12-03 17:05:57 +05:30