Commit Graph

2013 Commits

Author SHA1 Message Date
managalv
9b09dd7d6c CPUPL-929:Improve Complex GEMM performance
Added context for CCR format

Change-Id: I81ac1b882f176235b1c48f4952ec30f44c6b138c
2020-05-23 11:10:35 +05:30
managalv
154bedc785 CPUPL-929:Improve Complex GEMM performance
Removed print which was part of kernel

Change-Id: I288e0151ba8da8d6dd4415734c88ed3474ba3a5b
2020-05-22 14:39:12 +05:30
dzambare
8ce6e49a34 Added file and copyright header for aoclflist.c file.
AMD Internal: CPUPL-968

Change-Id: I1d0905fac62cac8224e38ae8d0912b9c5663799d
2020-05-22 12:54:18 +05:30
managalv
11570dbc14 CPUPL-929:Improve Complex GEMM performance
Updated BLIS_MC value and created SUP context for CCC storage format

Change-Id: I5032b29834ea545d7b5f7a9469bc5655c71b7fe5
2020-05-22 10:47:28 +05:30
managalv
f630b3fc36 CPUPL-929:Improve Complex GEMM performance
Details:
SUP support added for ZGEMM for different storage formats in M direction
SUP kernels and sub kernels are implemented to cover all dimensions of square matrix
SUP kernels support RRR, RCR, CRR, and CCR storage formats

Change-Id: I2c846a430dfcf356cac8ebf62015b1f743157381
2020-05-20 17:36:04 +05:30
Meghana Vankadari
9ea0472f4c Replaced all the instances of zen_basic with zen_ref_c
Change-Id: Id53f2c1ce7e9878991a831c3651061f0b679b080
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-885]
2020-05-19 20:27:17 +05:30
Meghana Vankadari
4fcc4e499d Optimized DGEMV kernel and changed BLAS interface call
Details:
- Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of dgemv to remove dependency on cntx variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified BLAS interface call for DGEMV to reduce framework overhead.
- Currently these changes are applicable for zen2 configuration.
  They will be enabled for zen family processors in future.
- Changed naming convention for new BLAS macros to indicate their use.
- Added new optimized kernel for axpyf under zen2 folder.
- Implemented basic GEMV kernel without using axpyv or axpyf.
  This kernel is chosen for small sizes.

Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-885]
2020-05-19 06:40:44 -04:00
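The daxpyf change in commit 4fcc4e499d mentions fuse_factor=5 and iter_unroll=2. As a rough illustration only, the scalar sketch below shows what a fused axpy computes: y := y + alpha * A * x, where A has one column per fused axpy. The function name `daxpyf_sketch`, the column-major layout with column stride `lda`, and the reading of iter_unroll as "two rows per loop iteration" are assumptions made for this example; the actual zen2 kernel is SIMD code and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

enum { FUSE_FAC = 5 };   /* assumed: number of axpy operations fused together */

/* Hypothetical scalar sketch, not the BLIS kernel.
 * Computes y[i] += alpha * sum_j a[i + j*lda] * x[j] for j in [0, FUSE_FAC),
 * with A stored column-major (column stride lda). */
static void daxpyf_sketch(size_t m, double alpha,
                          const double *a, size_t lda,
                          const double *x, double *y)
{
    double ax[FUSE_FAC];
    size_t i = 0;

    assert(lda >= m);

    /* Scale x by alpha once, outside the row loop. */
    for (size_t j = 0; j < FUSE_FAC; ++j)
        ax[j] = alpha * x[j];

    /* Main loop: two rows per iteration (assumed meaning of iter_unroll=2). */
    for (; i + 2 <= m; i += 2) {
        double y0 = y[i];
        double y1 = y[i + 1];
        for (size_t j = 0; j < FUSE_FAC; ++j) {
            y0 += ax[j] * a[i     + j * lda];
            y1 += ax[j] * a[i + 1 + j * lda];
        }
        y[i]     = y0;
        y[i + 1] = y1;
    }

    /* Remainder row when m is odd. */
    for (; i < m; ++i) {
        double yi = y[i];
        for (size_t j = 0; j < FUSE_FAC; ++j)
            yi += ax[j] * a[i + j * lda];
        y[i] = yi;
    }
}
```

Fusing several columns means each element of y is loaded and stored once per group of five columns rather than once per column, which is the usual motivation for building gemv on axpyf instead of repeated axpyv calls.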
managalv
af1ad806f2 CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices
Details:
Supports all XXR storage formats in cgemm SUP

Change-Id: I1f1ac6b47f0b54141acac65e2cb4f3a2aaa3bac6
2020-05-18 21:06:57 +05:30
managalv
310dda928f CPUPL-709: Improve Complex GEMM performance - Level 1 Optimization
Details:
Added SUP support for cgemm in M direction
SUP kernels 3x8m, 3x4m, and 3x2m are implemented
Sub kernels are implemented to support various dimensions
SUP CGEMM supports matrix C & A row/col major and Matrix B is row major matrix

Change-Id: Ia6854b929d3b5741a4900422d05df1257f5d014d
2020-05-18 20:43:49 +05:30
Nallani Bhaskar
b3a308b689 CPUPL-948: Selective Packing changes are implemented in sgemm sup
Description:
Panel strides are updated using variables rather than constant values to
support selective packing in sgemm sup kernels

Change-Id: Ic098eb70592d12d7d2174a1166aebf3bc749140c
2020-05-18 11:46:33 +05:30
Devrajegowda, Kiran
884f2febd1 Revert "Block parameters tuning to improve sgemm performance on Rome"
Details:
    - Reverts commit 1c76723320.
    - Regression in Multi-Threaded sgemm performance with new block parameters on Rome

Change-Id: I67b050f6434f6ade2c982b3cd10aa863c0077601
Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-920]
2020-05-14 11:18:46 +05:30
Devrajegowda, Kiran
6f33fd6aac Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API
Details:
    -Kernel is called directly from API call to avoid framework
     overhead in case of single and double precisions.
    -Currently these changes are applicable only for zen2 configuration.
     They will be enabled for zen family processors in future.
    -These changes improve performance of BLAS and CBLAS interfaces of API.
     They do not affect BLIS-specific APIs.
    -setv simd kernel is added for single and double precision elements

Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a
    Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com>
    AMD-Internal: [CPUPL-817]
2020-05-13 01:51:48 -04:00
Meghana Vankadari
d6db8d1d2c Merge "Modified Function definition for BLAS and CBLAS interfaces of DOTV and SWAPV APIs" into amd-staging-rome-2.2 2020-05-06 00:34:37 -04:00
Nallani Bhaskar
830f1a44c6 CPUPL-849: BLIS SGEMM general stride test cases fails for smaller matrix sizes
Details:
Support for inputs with general strides is not yet implemented in sup. This check will redirect to default path.

Change-Id: I594672e56ffb60c8d89a634e27f30f2ac2a7e38f
2020-05-05 16:23:46 +05:30
Meghana
28bb28b79f Modified Function definition for BLAS and CBLAS interfaces of DOTV and SWAPV APIs
Details:
-Kernel is called directly from API call to avoid framework
 overhead in case of single and double precisions.
-Currently these changes are applicable only for zen2 configuration.
 They will be enabled for zen family processors in future.
-These changes improve performance of BLAS and CBLAS interfaces of API.
 They do not affect BLIS-specific APIs.

Change-Id: I1eb7ca470ced82c3cfa8b22f2b53000d42fef96c
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847,CPUPL-816]
2020-05-04 15:06:07 +05:30
Nallani Bhaskar
49cd7a96d5 CPUPL-866: ZenDNN gtest cases failing with blis 2.1 and later releases
Change-Id: Ib9ddfb133576d06cea6642fc3fefd818317fe922
2020-05-03 13:00:43 +05:30
Devrajegowda, Kiran
4ad5b1a5e6 Update zen2 kernel context with number of level1 kernels
Details:
       - Adding missed copyv simd kernel changes while rebasing.

Change-Id: Iaedabd39fdf297fefec1e48e7d2c1a2f3d7eb08d
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-04-24 12:22:24 +05:30
Devrajegowda, Kiran
4caee59466 Adding a simd kernel for copyv function
Details:
    - Separate kernel for copyv function added to improve performance.
    - Modified cntx_init file in zen and zen2 configuration
    - Added test_copyv.c in test folder
    - Modified test/Makefile to include test_copyv.c

Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
2020-04-24 01:55:25 -04:00
Nallani Bhaskar
ba00f75f64 Merge "JIRA: CPUPL-853: Fix for the redefinition of _unsigned int __get_cpuid_max(unsigned int, unsigned int*)_. http://ontrack-internal.amd.com/browse/CPUPL-853 https://github.com/flame/blis/issues/393" into amd-staging-rome-2.2 2020-04-24 01:12:00 -04:00
Nallani Bhaskar
ea3865fbf2 JIRA: CPUPL-853: Fix for the redefinition of _unsigned int __get_cpuid_max(unsigned int, unsigned int*)_. http://ontrack-internal.amd.com/browse/CPUPL-853 https://github.com/flame/blis/issues/393
Change-Id: I88c23b2fdad0beb3796d0e6acbcf215fe9daab2d
2020-04-23 17:14:24 +05:30
Meghana
f80e21ca7b Modified Function definition for BLAS and CBLAS interfaces of I?AMAX API
Details:
-Kernel is called directly from API call to avoid framework
 overhead in case of single and double precisions.
-Currently these changes are applicable only for zen2 configuration.
 They will be enabled for zen family processors in future.
-These changes improve performance of BLAS and CBLAS interfaces of API.
 They do not affect BLIS-specific APIs.

Change-Id: Ib12f5a4f66a3227681fb3028207a08cb69cc2406
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-855]
2020-04-22 17:43:55 +05:30
Meghana
138bc75063 Modified function definition for AXPY CBLAS interface
Details:
-Kernel is called directly from API call to avoid framework
 overhead in case of single and double precisions.
-Currently these changes are applicable only for zen2 configuration.
 They will be enabled for zen family processors in future.

Change-Id: Ifa17dc28d3b38e1e16b28bb785d9fdf4a223d909
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-805]
2020-04-21 07:59:07 -04:00
Devrajegowda, Kiran
1c76723320 Block parameters tuning to improve sgemm performance on Rome
Details:
    - Tuned block sizes to get better performance for sgemm default path.

Change-Id: I892e8642fa2d03a07a6d53537131536e6b1b091e
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-832]
2020-04-21 07:34:12 -04:00
Meghana Vankadari
139fbbb77f Merge "Added opt kernels for SWAPV" into amd-staging-rome-2.2 2020-04-21 05:54:19 -04:00
Kiran Varaganti
0fdb539d40 Fixed CPUPL-845 - expert interfaces consistent with other interfaces w.r.t. disabling selective packing in sup by default
Change-Id: Id678ee727e8e9197e1c5b48a994fafd7797c48f2
2020-04-20 16:15:05 +05:30
Meghana
b846059bcf Added opt kernels for SWAPV
Details:
-Added SIMD kernels for SWAPV for both single and double precisions.
-Modified cntx_init file for zen and zen2 configurations to choose opt kernels for
 SWAPV.
-Added test_swapv.c in test folder.
-Modified test/Makefile to include test_swapv.c

Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847]
2020-04-20 01:21:44 -05:00
Dipal Madhukar Zambare
f7bb291f6b Merge "Disable execution and debug trace by default." into amd-staging-rome-2.2 2020-04-20 01:16:04 -04:00
dzambare
80de43a483 Disable execution and debug trace by default.
Change-Id: I126336bc3d8a49019b083e66621c6a79725f7f0d
2020-04-20 10:44:32 +05:30
Meghana
80086fad15 Modified function definition for AXPY BLAS interface
Details:
-Calling the kernel directly from API call to avoid framework
overhead.
-Currently these changes are only applicable for zen2 configuration.
 They will be enabled for zen family processors in future.

Change-Id: I0139e185178f726f5cd8cba0ff6a441a00d67868
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-805]
2020-04-19 23:55:27 -05:00
Vasanthakumar Rajagopal
489d501f2e Merge "Execution and Debug trace support." into amd-staging-rome-2.2 2020-04-15 06:30:50 -04:00
Meghana
e56cf63a3f Optimized "bli_dotv_zen_int10" kernels
Details:
- Fixed issues in "bli_dotv_zen_int10" kernels and optimized them.
- Changed cntx_init file to choose "bli_dotv_zen_int10" kernel for dotv
 API call.

Change-Id: Iee8d7519f3a22a2d41166390be6047e9cb37557f
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-824]
2020-04-14 09:52:57 +05:30
dzambare
d40edf7dac Execution and Debug trace support.
Added support for debug logging, execution trace, and decode.

Change-Id: I024bf6165daa9e23a62423f2401c0f1c5de459ba
AMD-Internal: [CPUPL-806]
2020-04-07 08:48:59 +05:30
Meghana
b5fe75e104 Closing input and output files in test_gemm.c and test_trsm.c
Change-Id: I75cdd5adc2bd2dac7d0eca9c050e06dbd52bec26
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
2020-03-24 09:09:58 +05:30
Meghana Vankadari
ddcb3d8a52 Modified test_trsm.c file in test folder to read input sizes from a file
Details:
-A Macro 'FILE_IN_OUT' is defined to read matrix dimensions and strides from a csv file.
 Format for input file if 'FILE_IN_OUT' is defined:
 Each line defines a TRSM problem with the following parameters: m n cs_a cs_b
 The operation implemented by default is AX=B where A is lower-triangular and matrices are in column-major order.
 When macro is disabled, it reverts back to original implementation.
 Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv
-A macro 'READ_ALL_PARAMS_FROM_FILE' is defined to read all the parameters for TRSM from a csv file.
 This macro can be defined only when 'FILE_IN_OUT' is already defined.
 Format for the input file if 'READ_ALL_PARAMS_FROM_FILE' is defined:
 Each line defines a TRSM problem with the following parameters: sideA uploA transA diagA m n cs_a cs_b
 By default, column-major order is chosen as storage scheme for matrices.
 Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv

Change-Id: I349bc69ca968911c16e04d1ce70974d01e65a2fb
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
2020-03-20 07:15:17 -04:00
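The 'FILE_IN_OUT' input format described in commit ddcb3d8a52 (one TRSM problem per line: m n cs_a cs_b) could be read along the following lines. This is only a sketch: the struct `trsm_dims_t` and the function `parse_trsm_line` are hypothetical names, and whitespace-separated fields are an assumption, since the message calls the file a csv but lists the parameters separated by spaces. The actual parsing code in test_trsm.c is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container for one line of the input file. */
typedef struct {
    long m;     /* rows of B */
    long n;     /* columns of B */
    long cs_a;  /* column stride of A */
    long cs_b;  /* column stride of B */
} trsm_dims_t;

/* Parses "m n cs_a cs_b" from one line.
 * Returns 1 when all four fields were read, 0 otherwise. */
static int parse_trsm_line(const char *line, trsm_dims_t *out)
{
    assert(line != NULL && out != NULL);

    /* Assumption: fields are separated by whitespace. */
    return sscanf(line, "%ld %ld %ld %ld",
                  &out->m, &out->n, &out->cs_a, &out->cs_b) == 4;
}
```

A driver would typically call this once per line obtained with fgets() and skip any line for which it returns 0.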
Meghana
c20c96d9c0 Made some critical changes to small_gemm kernels
Details:
- In case of GEMM, whenever beta is zero, we need to perform
 C = alpha * (A * B) instead of C = beta * C + alpha * (A * B).
 Added conditions to check the value of beta at different levels inside
 small_gemm kernels and decide whether to perform scaling C with beta or
 not.
-Modified small_gemm kernels to use BLIS specific functions to retrieve
 different fields of objects.
-Calling bli_gemm_check before entering bli_gemm_small to facilitate
 early return in case of invalid inputs.
-For corner cases inside small_gemm kernels, a buffer called f_temp
 is used to load and store data to and from registers.
 The buffer is now populated with zeroes before use.
-In bli_gemm_front, datatypes of status and return value from
 bli_gemm_small are not matching.
 Corrected the datatype of the variable 'status' inside bli_gemm_front
 to err_t.

Change-Id: I8b52ad55008f028d6c8b7e0d20f746a869d9daea
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-689,SWLCSG-104]
2020-03-19 16:30:04 +05:30
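The beta handling described in commit c20c96d9c0 can be illustrated with a plain reference loop. The sketch below is not the small_gemm kernel: `gemm_ref` is a hypothetical name, and column-major storage with leading dimensions `lda`, `ldb`, and `ldc` is assumed. The point it shows is that when beta is zero, C is overwritten rather than scaled, so any NaN or Inf already present in C cannot propagate into the result (0 * NaN is NaN in IEEE arithmetic).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine, not a BLIS function.
 * Computes C = beta * C + alpha * (A * B) for column-major
 * A (m x k), B (k x n), and C (m x n). */
static void gemm_ref(size_t m, size_t n, size_t k,
                     double alpha,
                     const double *a, size_t lda,
                     const double *b, size_t ldb,
                     double beta,
                     double *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i + p * lda] * b[p + j * ldb];

            if (beta == 0.0) {
                /* beta == 0: do not read C at all, just overwrite it. */
                c[i + j * ldc] = alpha * ab;
            } else {
                c[i + j * ldc] = beta * c[i + j * ldc] + alpha * ab;
            }
        }
    }
}
```

Checking beta once per element keeps the sketch short; a tuned kernel would more likely hoist the test out of the inner loops, as the commit message suggests by mentioning checks "at different levels".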
Field G. Van Zee
efe85b3c19 Added missing return to bli_thread_partition_2x2().
Details:
- Added a missing return statement to the body of an early case handling
  branch in bli_thread_partition_2x2(). This bug only affected cases
  where n_threads < 4, and even then, the code meant to handle cases
  where n_threads >= 4 executes and does the right thing, albeit using
  more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
  for reporting this bug via issue #377.
- Whitespace changes to bli_thread.c (spaces -> tabs).

Change-Id: I2182be0911f76861dd14bec9b6bacb6c20c2725d
2020-03-16 12:28:25 +05:30
Field G. Van Zee
a7c5723e77 Skip building thrinfo_t tree when mt is disabled.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
  address is equal to either &BLIS_GEMM_SINGLE_THREADED or
  &BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
  bli_l3_sup_decor_single.c that (by default) disables code that
  creates and frees the thrinfo_t tree and instead passes
  &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
  sup implementation.
- The net effect of the above changes is that a small amount of
  thrinfo_t overhead is avoided when running small/skinny dgemm
  problems when BLIS is compiled with multithreading disabled.

Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0
2020-03-13 01:10:34 -04:00
Kiran Devrajegowda
04fc9d3710 Merge "Fixed bug(s) in mt sup when single-threaded." into amd-staging-rome-2.2 2020-03-13 01:10:22 -04:00
Field G. Van Zee
574de9e29e Fixed bug(s) in mt sup when single-threaded.
Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
  changing function interface for the thread entry point function
  (of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
  a memory leak in the sba at bli_finalize() time. It turns out that,
  due to the new multithreading-capable variant code using thrinfo_t
  objects--specifically, their calling of bli_thrinfo_grow()--we
  have to pass in a real thrinfo_t object rather than the global
  objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
  Thus, I inserted the appropriate logic from the OpenMP and pthreads
  versions so that single-threaded execution would work as intended
  with the newly upgraded variants.

Change-Id: I2bfff849abf3fa30c73e0c5876128400854bbcb5
2020-03-13 01:10:04 -04:00
Field G. Van Zee
1a284828d1 Support multithreading within the sup framework.
Details:
- Added multithreading support to the sup framework (via either OpenMP
  or pthreads). Both variants 1n and 2m now have the appropriate
  threading infrastructure, including data partitioning logic, to
  parallelize computation. This support handles all four combinations
  of packing on matrices A and B (neither, A only, B only, or both).
  This implementation tries to be a little smarter when automatic
  threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
  recalculate the factorization in units of micropanels (rather than
  using the raw dimensions) in bli_l3_sup_int.c, when the final
  problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
  or column-stored matrices. (This is used for the rrc and crc storage
  cases.) Previously, copym was used, but that would no longer suffice
  because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
  bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
  instead of from the variant functions. This has the effect of making
  the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
  and inserted usage of these functions within bli_thrinfo_init(), which
  previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
  whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
  tests, as well as appropriate octave/matlab scripts to plot the
  resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
  that specifying any BLIS_*_NT variable, even if it is set to 1, will
  be considered manual specification for the purposes of determining
  whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]

Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
2020-03-13 01:09:29 -04:00
Nallani Bhaskar
83745c7ffc Beta Zero Check for sgemm small. Core Software Group SWLCSG-137 BLIS-ST validation failures
Change-Id: I21d5eae6ec390438be847f2dca42350b97059d6e
2020-03-09 02:55:51 -04:00
Nallani Bhaskar
e0c95d77e1 Beta Zero Checks for sgemm_small
Change-Id: I111b66ad54a27b1977d155904738a55a351e6689
2020-03-09 02:55:25 -04:00
Meghana Vankadari
cc98047fd6 Made framework changes to initialize specific cache block sizes for TRSM.
Details:
-This commit addresses the performance optimization(single-thread and
 multi-thread) for DTRSM on zen2.
-This new optimization employs different MC, KC & NC values for TRSM than
 what is being used in other Level-3 routines like DGEMM.
-Changed TRSM framework code to choose these blocksizes for TRSM
 on zen family configurations.
-Added a new field called "trsm_blkszs" to cntx structure in order to
 store TRSM specific block sizes.
-Implemented routines to initialize, set and query the TRSM-specific
 block sizes.
-Defined a new macro "AOCL_BLIS_ZEN" in configure script.
 This macro is automatically defined for zen family architectures.
 It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes.

Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6
Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com>
AMD-Internal: [CPUPL-656]
2020-03-09 10:33:42 +05:30
dzambare
f965b95d8b CPUPL-587: Corrected condition for A packing in sgemm_small
Change-Id: I1e5dc4a1dbe2f1d17f9c72e8dd0c6728ac1fd750
2020-01-27 11:08:20 +05:30
Meghana
b3e2938b9e Fix for CPUPL-549: TRSM for AlXB case results in NaN values
For the kernel of size 4x8, cs_b is used instead of cs_a to calculate address of diagonal elements of matrix A.
Correcting the mistake.

Change-Id: Ie74e0f6a397fcd32fefb5804cd00f1e90bfe5523
AOCL-2.1 2.1
2019-12-21 23:12:09 +05:30
Dipal M Zambare
72f4a7ab1e Increased pool buffer size to accommodate packing buffers needed in small_gemm to make it reentrant.
Change-Id: I96ac19ce97c39becce2c6e7ab47c3e7624560b30
2019-12-19 14:45:13 +05:30
Meghana Vankadari
62e00b4d64 Merge "Change in threshold condition for trsm_small kernels" into amd-staging-rome-rel-2.1 2019-12-17 23:54:01 -05:00
Meghana Vankadari
8eb264f78b Change in threshold condition for trsm_small kernels
Change-Id: I396e246b1639d300fcb94bdf7e5fa8bc8c87e994
2019-12-16 18:54:48 +05:30
Devrajegowda, Kiran
1fe8edbed0 "Merge Selective Packing code from amd branch flame/blis"
Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed
2019-12-16 14:48:53 +05:30
Nallani Bhaskar
a8af07f68c Added support to handle unsupported storage formats in sgemmsup using normal/small gemm path
Change-Id: I8762059c89e50f60e765a2a2983c5b2bdcdd8bc1
2019-12-13 15:31:28 +05:30