amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	99da76fd64	Fixed 'configure' breakage introduced in `6433831`. Details: - Added a missing 'fi' (endif) keyword to a conditional block added in the configure script in commit `6433831`.	2020-05-21 11:40:00 +05:30
Field G. Van Zee	38ecda47e7	Updated 1m draft article link in README.md.	2020-05-21 11:40:00 +05:30
Jeff Hammond	570d51483b	blacklist ICC 18 for knl/skx due to test failures Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>	2020-05-21 11:40:00 +05:30
Jeff Hammond	afc57adc1b	blacklist Intel 19+ Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>	2020-05-21 11:40:00 +05:30
Jeff Hammond	dd54e792a7	fix link to docs the comment contains an incorrect link, which is trivially fixed here. @fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.	2020-05-21 11:40:00 +05:30
Field G. Van Zee	d988a5bbd7	Fixed bugs in cblas_sdsdot(), sdsdot_(). Details: - Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar, named 'sb'. This value was already being added by the underlying sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub(). Thanks to Simon Lukas Märtens for reporting this bug via #367. - Fixed a second bug in order of typecasting intermediate products in sdsdot_(). Previously, the "alpha" scalar was being added after the "outer" typecast to float. However, the operation is supposed to first add the dot product to the (promoted) scalar and THEN downcast the sum to float. Thanks to Devin Matthews for catching this bug.	2020-05-21 11:40:00 +05:30
Field G. Van Zee	afee36b251	Annotated missing thread-related symbols for export. Details: - Added BLIS_EXPORT_BLIS annotation to function prototypes for bli_thrcomm_bcast(), bli_thrcomm_barrier(), and bli_thread_range_sub() so that these functions are exported to shared libraries by default. This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for reporting this bug. - CREDITS file update.	2020-05-21 11:40:00 +05:30
Nicholai Tukanov	718b64814d	Add prototypes for POWER9 reference kernels (#365 ) Updates and fixes to power9 subconfig. Details: - Register s,c,z reference gemm and trsm ukernels that assume elements of B have been broadcast. - Added prototypes for level-3 ukernels that assume elements of B have been broadcast. Also added prototype for an spackm function that employs a duplication/broadcast factor of 4. - Register virtual gemmtrsm ukernels that work with broadcasting of B. - Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h. - Thanks to Nicholai Tukanov for providing these updates.	2020-05-21 11:40:00 +05:30
Meghana Vankadari	9ea0472f4c	Replaced all the instances of zen_basic with zen_ref_c Change-Id: Id53f2c1ce7e9878991a831c3651061f0b679b080 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-885]	2020-05-19 20:27:17 +05:30
Meghana Vankadari	4fcc4e499d	Optimized DGEMV kernel and changed BLAS interface call Details: - Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2. - Modified framework files of dgemv to remove dependency on cntx variable. - Updated cntx_init file of zen2 to choose optimized kernels. - Modified BLAS interface call for DGEMV to reduce framework overhead. - Currently these changes are applicable for zen2 configuration. They will be enabled for zen family processors in future. - Changed naming convention for new BLAS macros to indicate their use. - Added new optimized kernel for axpyf under zen2 folder. - Implemented basic GEMV kernel without using axpyv or axpyf. This kernel is chosen for small sizes. Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-885]	2020-05-19 06:40:44 -04:00
managalv	af1ad806f2	CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices Details: Supports cgemm SUP all storage formats for XXR format Change-Id: I1f1ac6b47f0b54141acac65e2cb4f3a2aaa3bac6	2020-05-18 21:06:57 +05:30
managalv	310dda928f	CPUPL-709: Improve Complex GEMM performance - Level 1 Optimization Details: Added SUP support for cgemm in M direction. SUP kernels 3x8m, 3x4m, and 3x2m are implemented. Sub-kernels are implemented to support various dimensions. SUP CGEMM supports matrices C and A in row/col major and matrix B in row major. Change-Id: Ia6854b929d3b5741a4900422d05df1257f5d014d	2020-05-18 20:43:49 +05:30
Nallani Bhaskar	b3a308b689	CPUPL-948: Selective packing changes are implemented in sgemm sup Description: Panel strides are updated using variables rather than constant values to support selective packing in sgemm sup kernels Change-Id: Ic098eb70592d12d7d2174a1166aebf3bc749140c	2020-05-18 11:46:33 +05:30
Devrajegowda, Kiran	884f2febd1	Revert "Block parameters tuning to improve sgemm performance on Rome" Details: - Reverts commit `1c76723320`. - Regression in Multi-Threaded sgemm performance with new block parameters on Rome Change-Id: I67b050f6434f6ade2c982b3cd10aa863c0077601 Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-920]	2020-05-14 11:18:46 +05:30
Devrajegowda, Kiran	6f33fd6aac	Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. -setv simd kernel is added for single and double precision elements Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-817]	2020-05-13 01:51:48 -04:00
Meghana Vankadari	d6db8d1d2c	Merge "Modified Function definition for BLAS and CBLAS interfaces of DOTV and SWAPV APIs" into amd-staging-rome-2.2	2020-05-06 00:34:37 -04:00
Nallani Bhaskar	830f1a44c6	CPUPL-849: BLIS SGEMM general stride test cases fails for smaller matrix sizes Details: Support for inputs with general strides is not yet implemented in sup. This check will redirect to default path. Change-Id: I594672e56ffb60c8d89a634e27f30f2ac2a7e38f	2020-05-05 16:23:46 +05:30
Meghana	28bb28b79f	Modified Function definition for BLAS and CBLAS interfaces of DOTV and SWAPV APIs Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. Change-Id: I1eb7ca470ced82c3cfa8b22f2b53000d42fef96c Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-847,CPUPL-816]	2020-05-04 15:06:07 +05:30
Nallani Bhaskar	49cd7a96d5	CPUPL-866: ZenDNN gtest cases failing with blis 2.1 and later releases Change-Id: Ib9ddfb133576d06cea6642fc3fefd818317fe922	2020-05-03 13:00:43 +05:30
Devrajegowda, Kiran	4ad5b1a5e6	Update zen2 kernel context with number of level1 kernels Details: - Adding missed copyv simd kernel changes while rebasing. Change-Id: Iaedabd39fdf297fefec1e48e7d2c1a2f3d7eb08d Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-818]	2020-04-24 12:22:24 +05:30
Devrajegowda, Kiran	4caee59466	Adding a simd kernel for copyv function Details: - Separate kernel for copyv function added to improve performance. - Modified cntx_init file in zen and zen2 configuration - Added test_copyv.c in test folder - Modified test/Makefile to include test_copyv.c Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41 Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-818]	2020-04-24 01:55:25 -04:00
Nallani Bhaskar	ba00f75f64	Merge "JIRA: CPUPL-853: Fix for the redefinition of _unsigned int __get_cpuid_max(unsigned int, unsigned int*)_. http://ontrack-internal.amd.com/browse/CPUPL-853 https://github.com/flame/blis/issues/393 " into amd-staging-rome-2.2	2020-04-24 01:12:00 -04:00
Nallani Bhaskar	ea3865fbf2	JIRA: CPUPL-853: Fix for the redefinition of _unsigned int __get_cpuid_max(unsigned int, unsigned int*)_. http://ontrack-internal.amd.com/browse/CPUPL-853 https://github.com/flame/blis/issues/393 Change-Id: I88c23b2fdad0beb3796d0e6acbcf215fe9daab2d	2020-04-23 17:14:24 +05:30
Meghana	f80e21ca7b	Modified Function definition for BLAS and CBLAS interfaces of I?AMAX API Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. Change-Id: Ib12f5a4f66a3227681fb3028207a08cb69cc2406 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-855]	2020-04-22 17:43:55 +05:30
Meghana	138bc75063	Modified function definition for AXPY CBLAS interface Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. Change-Id: Ifa17dc28d3b38e1e16b28bb785d9fdf4a223d909 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-805]	2020-04-21 07:59:07 -04:00
Devrajegowda, Kiran	1c76723320	Block parameters tuning to improve sgemm performance on Rome Details: - Tuned block sizes to get better performance for sgemm default path. Change-Id: I892e8642fa2d03a07a6d53537131536e6b1b091e Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-832]	2020-04-21 07:34:12 -04:00
Meghana Vankadari	139fbbb77f	Merge "Added opt kernels for SWAPV" into amd-staging-rome-2.2	2020-04-21 05:54:19 -04:00
Kiran Varaganti	0fdb539d40	Fixed CPUPL-845 - expert interfaces consistent with other interfaces w.r.t. disabling selective packing in sup by default Change-Id: Id678ee727e8e9197e1c5b48a994fafd7797c48f2	2020-04-20 16:15:05 +05:30
Meghana	b846059bcf	Added opt kernels for SWAPV Details: -Added SIMD kernels for SWAPV for both single and double precisions. -Modified cntx_init file for zen and zen2 configurations to choose opt kernels for SWAPV. -Added test_swapv.c in test folder. -Modified test/Makefile to include test_swapv.c Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-847]	2020-04-20 01:21:44 -05:00
Dipal Madhukar Zambare	f7bb291f6b	Merge "Disable execution and debug trace by default." into amd-staging-rome-2.2	2020-04-20 01:16:04 -04:00
dzambare	80de43a483	Disable execution and debug trace by default. Change-Id: I126336bc3d8a49019b083e66621c6a79725f7f0d	2020-04-20 10:44:32 +05:30
Meghana	80086fad15	Modified function definition for AXPY BLAS interface Details: -Calling the kernel directly from API call to avoid framework overhead. -Currently these changes are only applicable for zen2 configuration. They will be enabled for zen family processors in future. Change-Id: I0139e185178f726f5cd8cba0ff6a441a00d67868 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-805]	2020-04-19 23:55:27 -05:00
Vasanthakumar Rajagopal	489d501f2e	Merge "Execution and Debug trace support." into amd-staging-rome-2.2	2020-04-15 06:30:50 -04:00
Meghana	e56cf63a3f	Optimized "bli_dotv_zen_int10" kernels Details: - Fixed issues in "bli_dotv_zen_int10" kernels and optimized them. - Changed cntx_init file to choose "bli_dotv_zen_int10" kernel for dotv API call. Change-Id: Iee8d7519f3a22a2d41166390be6047e9cb37557f Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-824]	2020-04-14 09:52:57 +05:30
dzambare	d40edf7dac	Execution and Debug trace support. Added support for debug logging, execution trace and decode. Change-Id: I024bf6165daa9e23a62423f2401c0f1c5de459ba AMD-Internal: [CPUPL-806]	2020-04-07 08:48:59 +05:30
Meghana	b5fe75e104	Closing input and output files in test_gemm.c and test_trsm.c Change-Id: I75cdd5adc2bd2dac7d0eca9c050e06dbd52bec26 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>	2020-03-24 09:09:58 +05:30
Meghana Vankadari	ddcb3d8a52	Modified test_trsm.c file in test folder to read input sizes from a file Details: -A Macro 'FILE_IN_OUT' is defined to read matrix dimensions and strides from a csv file. Format for input file if 'FILE_IN_OUT' is defined: Each line defines a TRSM problem with the following parameters: m n cs_a cs_b The operation implemented by default is AX=B where A is lower-triangular and matrices are in column-major order. When macro is disabled, it reverts back to original implementation. Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv -A macro 'READ_ALL_PARAMS_FROM_FILE' is defined to read all the parameters for TRSM from a csv file. This macro can be defined only when 'FILE_IN_OUT' is already defined. Format for the input file if 'READ_ALL_PARAMS_FROM_FILE' is defined: Each line defines a TRSM problem with the following parameters: sideA uploA transA diagA m n cs_a cs_b By default, column-major order is chosen as storage scheme for matrices. Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv Change-Id: I349bc69ca968911c16e04d1ce70974d01e65a2fb Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>	2020-03-20 07:15:17 -04:00
Meghana	c20c96d9c0	Made some critical changes to small_gemm kernels Details: - In case of GEMM, whenever beta is zero, we need to perform C = alpha * (A * B) instead of C = beta * C + alpha * (A * B). Added conditions to check the value of beta at different levels inside small_gemm kernels and decide whether to perform scaling C with beta or not. -Modified small_gemm kernels to use BLIS specific functions to retrieve different fields of objects. -Calling bli_gemm_check before entering bli_gemm_small to facilitate early return in case of invalid inputs. -For corner cases inside small_gemm kernels, a buffer called f_temp is used to load and store data to and from registers. Populating the buffer with zeroes before use. -In bli_gemm_front, datatypes of status and return value from bli_gemm_small are not matching. Corrected the datatype of the variable 'status' inside bli_gemm_front to err_t. Change-Id: I8b52ad55008f028d6c8b7e0d20f746a869d9daea Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-689,SWLCSG-104]	2020-03-19 16:30:04 +05:30
Field G. Van Zee	efe85b3c19	Added missing return to bli_thread_partition_2x2(). Details: - Added a missing return statement to the body of an early case handling branch in bli_thread_partition_2x2(). This bug only affected cases where n_threads < 4, and even then, the code meant to handle cases where n_threads >= 4 executes and does the right thing, albeit using more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti for reporting this bug via issue #377. - Whitespace changes to bli_thread.c (spaces -> tabs). Change-Id: I2182be0911f76861dd14bec9b6bacb6c20c2725d	2020-03-16 12:28:25 +05:30
Field G. Van Zee	a7c5723e77	Skip building thrinfo_t tree when mt is disabled. Details: - Return early from bli_thrinfo_sup_grow() if the thrinfo_t object address is equal to either &BLIS_GEMM_SINGLE_THREADED or &BLIS_PACKM_SINGLE_THREADED. - Added preprocessor logic to bli_l3_sup_thread_decorator() in bli_l3_sup_decor_single.c that (by default) disables code that creates and frees the thrinfo_t tree and instead passes &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the sup implementation. - The net effect of the above changes is that a small amount of thrinfo_t overhead is avoided when running small/skinny dgemm problems when BLIS is compiled with multithreading disabled. Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0	2020-03-13 01:10:34 -04:00
Kiran Devrajegowda	04fc9d3710	Merge "Fixed bug(s) in mt sup when single-threaded." into amd-staging-rome-2.2	2020-03-13 01:10:22 -04:00
Field G. Van Zee	574de9e29e	Fixed bug(s) in mt sup when single-threaded. Details: - Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of changing function interface for the thread entry point function (of type l3supint_t). - Unfortunately, fixing the interface was not enough, as it caused a memory leak in the sba at bli_finalize() time. It turns out that, due to the new multithreading-capable variant code using thrinfo_t objects--specifically, their calling of bli_thrinfo_grow()--we have to pass in a real thrinfo_t object rather than the global objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED. Thus, I inserted the appropriate logic from the OpenMP and pthreads versions so that single-threaded execution would work as intended with the newly upgraded variants. Change-Id: I2bfff849abf3fa30c73e0c5876128400854bbcb5	2020-03-13 01:10:04 -04:00
Field G. Van Zee	1a284828d1	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. - Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_*_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates. AMD-Internal: [CPUPL-713] Change-Id: I9536648e7befac4d2dc17805e44ef34470961662	2020-03-13 01:09:29 -04:00
Nallani Bhaskar	83745c7ffc	Beta Zero Check for sgemm small. Core Software Group SWLCSG-137 BLIS-ST validation failures Change-Id: I21d5eae6ec390438be847f2dca42350b97059d6e	2020-03-09 02:55:51 -04:00
Nallani Bhaskar	e0c95d77e1	Beta Zero Checks for sgemm_small Change-Id: I111b66ad54a27b1977d155904738a55a351e6689	2020-03-09 02:55:25 -04:00
Meghana Vankadari	cc98047fd6	Made framework changes to initialize specific cache block sizes for TRSM. Details: -This commit addresses the performance optimization(single-thread and multi-thread) for DTRSM on zen2. -This new optimization employs different MC, KC & NC values for TRSM than what is being used in other Level-3 routines like DGEMM. -Changed TRSM framework code to choose these blocksizes for TRSM on zen family configurations. -Added a new field called "trsm_blkszs" to cntx structure in order to store TRSM specific block sizes. -Implemented routines to initialize, set and query the TRSM-specific block sizes. -Defined a new macro "AOCL_BLIS_ZEN" in configure script. This macro is automatically defined for zen family architectures. It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes. Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6 Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com> AMD-Internal: [CPUPL-656]	2020-03-09 10:33:42 +05:30
dzambare	f965b95d8b	CPUPL-587: Corrected condition for A packing in sgemm_small Change-Id: I1e5dc4a1dbe2f1d17f9c72e8dd0c6728ac1fd750	2020-01-27 11:08:20 +05:30
Meghana	b3e2938b9e	Fix for CPUPL-549: TRSM for AlXB case results in NaN values For the kernel of size 4x8, cs_b is used instead of cs_a to calculate address of diagonal elements of matrix A. Correcting the mistake. Change-Id: Ie74e0f6a397fcd32fefb5804cd00f1e90bfe5523 AOCL-2.1 2.1	2019-12-21 23:12:09 +05:30
Dipal M Zambare	72f4a7ab1e	Increased pool buffer size to accommodate packing buffers needed in small_gemm to make it reentrant. Change-Id: I96ac19ce97c39becce2c6e7ab47c3e7624560b30	2019-12-19 14:45:13 +05:30
Meghana Vankadari	62e00b4d64	Merge "Change in threshold condition for trsm_small kernels" into amd-staging-rome-rel-2.1	2019-12-17 23:54:01 -05:00
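
Note on commit d988a5bbd7 (sdsdot fix): the message says the dot product should be added to the promoted scalar first, and only the final sum downcast to float. The following is a minimal sketch of that ordering, not the BLIS source. The function name `sdsdot_sketch` and the unit-stride signature are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (not the BLIS implementation): computes
 * sb + dot(x, y) with double-precision accumulation and a single
 * downcast at the very end, as the commit message describes. */
float sdsdot_sketch(size_t n, float sb, const float *x, const float *y)
{
    double dot = 0.0;

    /* Accumulate the products in double precision. */
    for (size_t i = 0; i < n; ++i)
        dot += (double)x[i] * (double)y[i];

    /* Add the promoted scalar to the double-precision dot product,
     * THEN downcast the sum. Writing (float)dot + sb instead would
     * round twice and can lose the low-order bits of the result. */
    return (float)((double)sb + dot);
}
```

The single downcast matters when the dot product is not exactly representable as a float: rounding it before adding `sb` can discard information that the final sum would otherwise keep.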

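
Note on commit c20c96d9c0 (beta-zero handling in small_gemm): the message says that when beta is zero the kernel must compute C = alpha * (A * B) rather than scaling the existing C. The sketch below shows that check in a plain reference loop. It is not the BLIS small_gemm kernel; the name `gemm_ref`, the column-major layout, and the leading-dimension parameters are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical reference GEMM (not the BLIS kernel).
 * Column-major storage: element (i, j) of X is x[i + j * ldx]. */
void gemm_ref(size_t m, size_t n, size_t k,
              double alpha, const double *a, size_t lda,
              const double *b, size_t ldb,
              double beta, double *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i + p * lda] * b[p + j * ldb];

            if (beta == 0.0) {
                /* Overwrite C without reading it, so NaN or Inf
                 * already stored in C cannot leak into the result. */
                c[i + j * ldc] = alpha * ab;
            } else {
                c[i + j * ldc] = beta * c[i + j * ldc] + alpha * ab;
            }
        }
    }
}
```

Without the branch, `0.0 * c[...]` evaluates to NaN whenever the old entry of C is NaN or infinite, which is why the output buffer is written rather than scaled in the beta-zero case.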

2016 Commits
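
Note on commit b3e2938b9e (CPUPL-549): the message attributes the NaN results to using cs_b instead of cs_a when computing the address of the diagonal elements of A. The sketch below solves AX = B for lower-triangular, column-major A (the default case described in commit ddcb3d8a52) and marks where each stride belongs. It is a plain reference loop under assumed names (`trsm_lower_ref`), not the optimized 4x8 kernel.

```c
#include <stddef.h>

/* Hypothetical reference solve of A * X = B (not the BLIS kernel).
 * A is m x m lower-triangular, column-major with column stride cs_a.
 * B is m x n, column-major with column stride cs_b, and is
 * overwritten with the solution X. */
void trsm_lower_ref(size_t m, size_t n,
                    const double *a, size_t cs_a,
                    double *b, size_t cs_b)
{
    for (size_t j = 0; j < n; ++j) {
        double *x = b + j * cs_b;   /* column j of B: uses cs_b */

        /* Forward substitution down the column. */
        for (size_t i = 0; i < m; ++i) {
            double s = x[i];
            for (size_t p = 0; p < i; ++p)
                s -= a[i + p * cs_a] * x[p];

            /* Diagonal element A(i, i) must be addressed with cs_a.
             * Using cs_b here reads some other location whenever the
             * two strides differ. */
            x[i] = s / a[i + i * cs_a];
        }
    }
}
```

When the two strides differ, indexing the diagonal with the wrong one can land on a zero or on padding, and the resulting division produces Inf or NaN, which is consistent with the symptom the commit reports.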