amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-24 18:34:40 +00:00

Author	SHA1	Message	Date
Nageshwar Singh	dd5b38d221	Added BLIS, BLAS, and CBLAS interface for cblas?amin Details: - Amin api returns index of minimum absolute value in a vector. - Added amin reference blis kernel. - Added blas and cblas interface for amin. AMD-Internal: [CPUPL-1155] Change-Id: I89c1e37e86950a4582bba70a5d8fc70ac915bd3c	2020-10-28 17:50:27 +05:30
Meghana Vankadari	47744663d9	Enabling framework optimizations for zen family architectures. Details: - Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas framework optimizations for zen family configurations. - The macro needs to be defined in family.h files of respective arch configs. - Moved zen2-specific optimized kernels to zen folder, in order to be accessible to all zen family architectures. Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d	2020-10-07 13:10:50 +05:30
dzambare	267a959af1	Rebased amd-staging-milan-3.0 branch on master -- Rebased on top of master commit # `6e522e5823` -- Updated merged code to remove duplicated code added by auto-merging -- Updated merged code to rename bool_t type -- Updated merged code to rename bli_thread_obarrier -- Updated merged code to rename bli_thread_obroadcast Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c AMD-Internal: [CPUPL-1067]	2020-08-06 10:09:29 +05:30
Devrajegowda, Kiran	6b5c68b9ed	"Merge Selective Packing code from amd branch flame/blis" Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed	2020-08-06 10:09:28 +05:30
Kiran Varaganti	307ddc3110	Revert " Merge Selective Packing code from amd branch flame/blis" This reverts commit `e4a6af33f5`. Reason for revert: <Review not done> Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2	2020-08-06 10:09:28 +05:30
Field G. Van Zee	889b90888f	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en\|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. - sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. - Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2020-08-03 11:48:42 +05:30
Field G. Van Zee	e4d07f93db	Removed export macros from all internal prototypes. Details: - After merging PR #303, at Isuru's request, I removed the use of BLIS_EXPORT_BLIS from all function prototypes except those that we potentially wish to be exported in shared/dynamic libraries. In other words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of functions that can be considered private or for internal use only. This is likely the last big modification along the path towards implementing the functionality spelled out in issue #248. Thanks again to Isuru Fernando for his initial efforts of sprinkling the export macros throughout BLIS, which made removing them where necessary relatively painless. Also, I'd like to thank Tony Kelman, Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for participating in the initial discussion in issue #37 that was later summarized and restated in issue #248. - CREDITS file update.	2020-08-03 11:47:18 +05:30
Isuru Fernando	ceda852482	Add symbol export macro for all functions (#302 ) * initial export of blis functions * Regenerate def file for master * restore bli_extern_defs exporting for now	2020-08-03 11:42:15 +05:30
Field G. Van Zee	fd5db714f4	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-08-03 11:27:13 +05:30
Field G. Van Zee	14d95a2183	Redefined bool_t typedef in terms of C99 bool. Details: - Changed the typedef that defines bool_t from: typedef gint_t bool_t; where gint_t is a signed integer that forms the basis of most other integers in BLIS, to: typedef bool bool_t; - Changed BLIS's TRUE and FALSE macro definitions from being in terms of integer literals: #define TRUE 1 #define FALSE 0 to being in terms of C99 boolean constants: #define TRUE true #define FALSE false which are provided by stdbool.h. - This commit constitutes the second phase of a transition toward using C99's bool instead of bool_t, which will address issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`.	2020-08-03 11:23:40 +05:30
Field G. Van Zee	8b9257df67	Cleaned up bool_t usage and various typecasts. Details: - Fixed various typecasts in frame/base/bli_cntx.h frame/base/bli_mbool.h frame/base/bli_rntm.h frame/include/bli_misc_macro_defs.h frame/include/bli_obj_macro_defs.h frame/include/bli_param_macro_defs.h that were missing or being done improperly/incompletely. For example, many return values were being typecast as (bool_t)x && y rather than (bool_t)(x && y) Thankfully, none of these deficiencies had manifested as actual bugs at the time of this commit. - Changed the return type of bli_env_get_var() from dim_t to gint_t. This reflects the fact that bli_env_get_var() needs to be able to return a signed integer, and even though dim_t is currently defined as a signed integer, it does not intuitively appear to necessarily be signed by inspection (i.e., an integer named "dim_t" for matrix "dimension"). Also, updated use of bli_env_get_var() within bli_pack.c to reflect the changed return type. - Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t and added comments to the bli_thrcomm_*.h files that will explain a planned replacement of bool_t with C99's bool type. - Note: These changes are being made to facilitate the substitution of 'bool' for 'bool_t', which will eliminate the namespace conflict with arm_sve.h as reported in issue #420. This commit implements the first phase of that transition. Thanks to RuQing Xu for reporting this issue. - CREDITS file update.	2020-08-03 11:23:40 +05:30
Field G. Van Zee	3eef698711	Declare/define static functions via BLIS_INLINE. Details: - Updated all static function definitions to use the cpp macro BLIS_INLINE instead of the static keyword. This allows blis.h to use a different keyword (inline) to define these functions when compiling with C++, which might otherwise trigger "defined but not used" warning messages. Thanks to Giorgos Margaritis for reporting this issue and Devin Matthews for suggesting the fix. - Updated the following files, which are used by configure's hardware auto-detection facility, to unconditionally #define BLIS_INLINE to the static keyword (since we know BLIS will be compiled with C, not C++): build/detect/config/config_detect.c frame/base/bli_arch.c frame/base/bli_cpuid.c - CREDITS file update.	2020-08-03 11:23:40 +05:30
Dipal M Zambare	25d23cdda2	Zen3 support, disabled IR, JR loop parallelization AMD-Internal: [CPUPL-1013] Change-Id: I859152d63d1a56519c508dfa19587f25123e08b4	2020-07-24 20:55:47 +05:30
dzambare	9c7814da1c	Added support for zen3 configuration - User can now specify zen3 configuration, currently it reuses block sizes and kernels from zen2. - Auto configuration can detect and enable if zen3 config is needed - Added support for amd64 bundle which contains all zen platforms - Moved exiting amd bundle to amd64 legacy. AMD-Internal: [CPUPL-500, CPUPL-1013] Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957	2020-07-22 18:24:26 +05:30
nprasadm	af1f9ab98d	BLIS: 'zdotc_' API modified to support Fortran invocation in flang environment. 1) Added dcomplex based zdotc_ version as a function with additional parameter. 2) The datatypes (single , double, Complex) functions retained as the macros. 3) This modification handles the ZDOTC_ invocation from Fortran based application for 'double complex' datatypes. 4) The modifications are placed under macro 'AOCL_F2C'. 5) Blis, Blas Test suites verified ALL PASS with GCC and Flang + with and without 'AOCL_F2C' macro on Ubuntu machine. 6) Adding BLIS_EXPORT_BLAS to make the APIs visible when linking dll. Change-Id: I4ada39a73f416e3794708f5b55e947342c261117 Signed-off-by: Meghana <Meghana.Vankadari@amd.com>, Nagendra <Nagendra.PrasadM@amd.com> AMD-Internal: [SWLCSG-177]	2020-07-14 00:53:07 -04:00
Meghana Vankadari	6a0a65ee23	Added sup kernels and code path for gemmt similar to GEMM.GEMMT now also supports complex data types. Details: - Added framework code for GEMMT SUP. - Implemented SUP for GEMMT using similar techniques as native path. - Moved update routines to frame/util folder. - Ported update routines for complex datatypes. Change-Id: I17adfd0586d07f5a23dca6a07b2d48f4c9fcf71c Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>, Dipal M Zambare <DipalMadhukar.Zambare@amd.com>, Mangala V <managala.v@amd.com>	2020-07-13 16:26:32 +05:30
Meghana Vankadari	f59d4befb5	Added framework support and interface APIs for GEMMT Details: - Added new API Which Computes a matrix-matrix product with general matrices but updates only the upper or lower triangular part of the result matrix. cblas_?gemmt() and ?gemmt_(). - These routines are similar to the ?gemm routines, but they only access and update a triangular part of the square result matrix. - Added DGEMMT functionality by reusing GEMM kernels. - Created a new folder for GEMMT under l3, and added GEMMT specific framework code. - Modified cntl_create routine to choose different macro kernel for GEMMT. - Added routines to copy lower/upper triangular part of a block to the buffer. - Defined BLIS, BLAS and CBLAS interface APIs for GEMMT. - Added test_gemmt.c to test folder and Updated the Makefile. - Added a macro 'CBLAS' in test_gemm.c to call CBLAS APIs. Change-Id: Ie00c1a15b9c654b65c687a9ca781cbc6f9641791	2020-07-06 00:51:16 -04:00
Field G. Van Zee	32365b3ea5	Ensure random objects' 1-norms are non-zero. Details: - Fixed an innocuous bug that manifested when running the testsuite on extremely small matrices with randomization via the "powers of 2 in narrow precision range" option enabled. When the randomization function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will then compute 0.0/0.0 during the normalization process, which leads to NaN residuals. The solution entails smarter implementaions of randv, randnv, randm, and randnm, each of which will compute the 1-norm of the vector or matrix in question. If the object has a 1-norm of 0.0, the object is re-randomized until the 1-norm is not 0.0. Thanks to Kiran Varaganti for reporting this issue (#413). - Updated the implementation of randm_unb_var1() so that it loops over a call to the randv_unb_var1() implementation directly rather than calling it indirectly via randv(). This was done to avoid the overhead of multiple calls to norm1v() when randomizing the rows/columns of a matrix. - Updated comments. Change-Id: I0e3d65ff97b26afde614da746e17ed33646839d1	2020-06-19 15:40:55 +05:30
Dipal M Zambare	80b3127ff1	Added support for logging gemm input values. Added BLIS specific extension to AOCL DTL, in this added support to print the input matrix sizes from BLIS library. AMD Internal: [CPUPL-806] Change-Id: I80ed779d65f9b1c48466137fc2f05629fa2fb561	2020-06-15 14:21:22 +05:30
Meghana	9fce1ec4a4	Optimized SGEMV kernel and changed BLAS interface call Details: - Optimized saxpyf kernel with fuse_factor=5 and iter_unroll=2. - Modified framework files of sgemv to remove dependency on cntx variable. - Updated cntx_init file of zen2 to choose optimized kernels. - Modified BLAS interface call for SGEMV to reduce framework overhread. - Currently these changes are applicable for zen2 configuration. Change-Id: Iabc36ae640e82e65f8764f3c6dee513ad64b22fd Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-707]	2020-06-04 02:49:08 -04:00
Guodong Xu	66ec22705b	New kernel set for Arm SVE using assembly (#396 ) Here adds two kernels for Arm SVE vector extensions. 1. a gemm kernel for double at sizes 8x8. 2. a packm kernel for double at dimension 8xk. To achive best performance, variable length agonostic programming is not used. Vector length (VL) of 256 bits is mandated in both kernels. Kernels to support other VLs can be added later. "SVE is a vector extension for AArch64 execution mode for the A64 instruction set of the Armv8 architecture. Unlike other SIMD architectures, SVE does not define the size of the vector registers, but constrains into a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. Instructions are provided specifically to query an implementation for its register size, to guarantee that the applications can run on different implementations of the ISA without the need to recompile the code." [1] [1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning Signed-off-by: Guodong Xu <guodong.xu@linaro.org>	2020-05-21 11:56:45 +05:30
Meghana Vankadari	4fcc4e499d	Optimized DGEMV kernel and changed BLAS interface call Details: - Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2. - Modified framework files of dgemv to remove dependency on cntx variable. - Updated cntx_init file of zen2 to choose optimized kernels. - Modified BLAS interface call for DGEMV to reduce framework overhread. - Currently these changes are applicable for zen2 configuration. They will be enabled for zen family processors in future. - Changed naming convention for new BLAS macros to indicate their use. - Added new optimized kernel for axpyf under zen2 folder. - Implemented basic GEMV kernel without using axpyv or axpyf. This kernel is chosen for small sizes. Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-885]	2020-05-19 06:40:44 -04:00
Devrajegowda, Kiran	6f33fd6aac	Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. -setv simd kernel is added for single and double precision elements Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-817]	2020-05-13 01:51:48 -04:00
Meghana	28bb28b79f	Modified Function definition for BLAS and CBLAS interfaces of DOTV and SWAPV APIs Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. Change-Id: I1eb7ca470ced82c3cfa8b22f2b53000d42fef96c Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-847,CPUPL-816]	2020-05-04 15:06:07 +05:30
Devrajegowda, Kiran	4caee59466	Adding a simd kernel for copyv function Details: - Separate kernel for copyv function added to improve performance. - Modified cntx_init file in zen and zen2 configuration - Added test_copyv.c in test folder - Modified test/Makefile to include test_copyv.c Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41 Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-818]	2020-04-24 01:55:25 -04:00
Meghana	80086fad15	Modified function definition for AXPY BLAS interface Details: -Calling the kernel directly from API call to avoid framework overhead. -Currently these changes are only applicable for zen2 configuration. They will be enabled for zen family processors in future. Change-Id: I0139e185178f726f5cd8cba0ff6a441a00d67868 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-805]	2020-04-19 23:55:27 -05:00
dzambare	d40edf7dac	Execution and Debug trace support. Added support add debug logging, execution trace and decode. Change-Id: I024bf6165daa9e23a62423f2401c0f1c5de459ba AMD-Internal: [CPUPL-806]	2020-04-07 08:48:59 +05:30
Field G. Van Zee	1a284828d1	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. - Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS__NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates. AMD-Internal: [CPUPL-713] Change-Id: I9536648e7befac4d2dc17805e44ef34470961662	2020-03-13 01:09:29 -04:00
Meghana Vankadari	cc98047fd6	Made framework changes to initialize specific cache block sizes for TRSM. Details: -This commit addresses the performance optimization(single-thread and multi-thread) for DTRSM on zen2. -This new optimization employs different MC, KC & NC values for TRSM than what is being used in other Level-3 routines like DGEMM. -Changed TRSM framework code to choose these blocksizes for TRSM on zen family configurations. -Added a new field called "trsm_blkszs" to cntx structure in order to store TRSM specific block sizes. -Implemented routines to initialize, set and query the TRSM-specific block sizes. -Defined a new macro "AOCL_BLIS_ZEN" in configure script. This macro is automatically defined for zen family architectures. It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes. Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6 Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com> AMD-Internal: [CPUPL-656]	2020-03-09 10:33:42 +05:30
Devrajegowda, Kiran	1fe8edbed0	"Merge Selective Packing code from amd branch flame/blis" Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed	2019-12-16 14:48:53 +05:30
Kiran Varaganti	1650bcb623	Revert " Merge Selective Packing code from amd branch flame/blis" This reverts commit `e4a6af33f5`. Reason for revert: <Review not done> Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2	2019-12-13 00:01:35 -05:00
Devrajegowda, Kiran	e4a6af33f5	Merge Selective Packing code from amd branch flame/blis Change-Id: I6d577f67ec84febe6af3635b10e5c9c77844ccd2	2019-12-12 15:22:21 +05:30
Meghana	fb75044ea2	Removed zen and zen2 configurations from amd64 family amd64 family supports all the architectures before zen. Assigned (BLIS_ARCH_GENERIC+1) to BLIS_NUM_ARCHS in order to avoid update for every new architecture. Change-Id: I8241e643f6dfd0ebe272e053ca8b6a9c1463d9dc	2019-12-03 16:48:34 +05:30
Field G. Van Zee	39fa7136f4	Added support for selective packing to gemmsup. Details: - Implemented optional packing for A or B (or both) within the sup framework (which currently only supports gemm). The request for packing either matrix A or matrix B can be made via setting environment variables BLIS_PACK_A or BLIS_PACK_B (to any non-zero value; if set, zero means "disable packing"). It can also be made globally at runtime via bli_pack_set_pack_a() and bli_pack_set_pack_b() or with individual rntm_t objects via bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert interface of either the BLIS typed or object APIs. (If using the BLAS API, environment variables are the only way to communicate the packing request.) - One caveat (for now) with the current implementation of selective packing is that any blocksize extension registered in the _cntx_init function (such as is currently used by haswell and zen subconfigs) will be ignored if the affected matrix is packed. The reason is simply that I didn't get around to implementing the necessary logic to pack a larger edge-case micropanel, though this is entirely possible and should be done in the future. - Spun off the variant-choosing portion of bli_gemmsup_ref() into bli_gemmsup_int(), in bli_l3_sup_int.c. - Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along with corresponding headers, in which higher-level packm-related functions are defined for use within the sup framework. The actual packm variant code resides in bli_l3_sup_packm_var.c. - Pass the following new parameters into var1n and var2m: packa, packb bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now always NULL), and pointer to a thrinfo_t* (which for nowis the address of the global single-threaded packm thread control node). - Added panel strides ps_a and ps_b to the auxinfo_t structure so that the millikernel can query the panel stride of the packed matrix and step through it accordingly. If the matrix isn't packed, the panel stride of interest for the given millikernel will be set to the appropriate value so that the mkernel may step through the unpacked matrix as it normally would. - Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate panel strides (ps_a and ps_b, respectively) instead of computing them on the fly. - Spun off the environment variable getting and setting functions into a new file, bli_env.c (with a corresponding prototype header). These functions are now used by the threading infrastructure (e.g. BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B). - Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER. - Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER, for use within the definition of BLIS_MEM_INITIALIZER. - Moved the global_rntm object to bli_rntm.c and extern it where needed. This means that the function bli_thread_init_rntm() was renamed to bli_rntm_init_from_global() and relocated accordingly. - Added a new bli_pack.c function, which serves as the home for functions that manage the pack_a and pack_b fields of the global rntm_t, including from environment variables, just as we have functions to manage the threading fields of the global rntm_t in bli_thread.c. - Reorganized naming for files in frame/thread, which mostly involved spinning off the bli_l3_thread_decorator() functions into their own files. This change makes more sense when considering the further addition of bli_l3_sup_thread_decorator() functions (for now limited only to the single-threaded form found in the _single.c file). - Explicitly initialize the reference sup handlers in both bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more obvious how to customize to a different handler, if desired. - Removed various snippets of disabled code. - Various comment updates.	2019-11-29 15:27:07 -06:00
Jérôme Duval	f377bb4485	Add Haiku to the known OS list (#361 )	2019-11-07 16:39:29 -06:00
Field G. Van Zee	e29b1f9706	Fixed failing testsuite gemmtrsm_ukr for power9. Details: - Added code that fixes false failures in the gemmtrsm_ukr module of the testsuite. The tests were failing because the computation (bli_gemv()) that performs the numerical check was not able to properly travserse the matrix operands bx1 and b11 that are views into the micropanel of B, which has duplicated/broadcast elements under the power9 subconfig. (For example, a micropanel of B with duplication factor of 2 needs to use a column stride of 2; previously, the column stride was being interpreted as 1.) - Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride() static functions in bli_obj_macro_defs.h. (Previously, only the function bli_obj_set_strides() was defined. Amazing to think that we got this far without these former functions.) - Updated/expounded upon comments.	2019-11-05 17:15:19 -06:00
Field G. Van Zee	c84391314d	Reverted minor temp/wspace changes from `b426f9e`. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files.	2019-11-04 13:57:12 -06:00
Nicholai Tukanov	b426f9e04e	POWER9 DGEMM (#355 ) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and not xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list.	2019-11-01 17:57:03 -05:00
Field G. Van Zee	6218ac95a5	Merge branch 'master' into amd	2019-10-11 11:53:51 -05:00
Field G. Van Zee	29b0e1ef4e	Code review + tweaks to AMD's AOCL 2.0 PR (#349 ). Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertantly not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS. - Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly- added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used. - Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes.	2019-10-11 10:24:24 -05:00
Field G. Van Zee	c60db26aee	Fixed bad loop counter in bli_[cz]scal2bbs_mxn(). Details: - Fixed a typo in the loop counter for the 'd' (duplication) dimension in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h. They shouldn't be used by anyone yet, but thankfully clang via AppVeyor spit out warnings that alerted me to the issue.	2019-09-17 18:04:17 -05:00
Field G. Van Zee	31c8657f1d	Added support for pre-broadcast when packing B. Details: - Added support for being able to duplicate (broadcast) elements in memory when packing matrix B (ie: the left-hand operand) in level-3 operations. This turns out advantageous for some architectures that can afford the cost of the extra bandwidth and somehow benefit from the pre-broadcast elements (and thus being able to avoid using broadcast-style load instructions on micro-rows of B in the gemm microkernel). - Support optionally disabling right-side hemm and symm. If this occurs, hemm_r is implemented in terms of hemm_l (and symm_r in terms of symm_l). This is needed when broadcasting during packing because the alternative--supporting the broadcast of B while also allowing matrix B to be Hermitian/symmetric--would be an absolute mess. - Support alignment factors for packed blocks of A, B, and C separately (as well as for general-purpose buffers). In addition, we support byte offsets from those alignment values (which is different from aligning by align+offset bytes to begin with). The default alignment values are BLIS_PAGE_SIZE in all four cases, with the offset values defaulting to zero. - Pass pack_t schema into bli_?packm_cxk() so that it can be then passed into the packm kernel, where it will be needed by packm kernels that perform broadcasts of B, since the idea is that we only want to broadcast when packing micropanels of B and not A. - Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be used to set custom virtual level-3 microkernels in the cntx_t, which would typically be done in the bli_cntx_init_*() function defined in the subconfiguration of interest. - Added a "broadcast B" kernel function for use with NP/NR = 12/6, defined in in ref_kernels/1m/bli_packm_cxk_bb_ref.c. - Added a gemm, gemmtrsm, and trsm "broadcast B" reference kernels defined in ref_kernels/3/bb. (These kernels have been tested with double real with NP/NR = 12/6.) - Added #ifndef ... #endif guards around several macro constants defined in frame/include/bli_kernel_macro_defs.h. - Defined a few "broadcast B" static functions in frame/include/level0/bb for use by "broadcast B"-style packm reference kernels. For now, only the real domain kernels are tested and fully defined. - Output the alignment and offset values for packed blocks of A and B in the testsuite's "BLIS configuration info" section. - Comment updates to various files. - Bumped so_version to 3.0.0.	2019-09-17 17:42:10 -05:00
kdevraje	cac127182d	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis with public repo commit id `565fa3853b`. Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42	2019-06-24 14:05:54 +05:30
Field G. Van Zee	ad937db950	Added missing #include "bli_family_thunderx2.h". Details: - Added a cpp-conditional directive block to bli_arch_config.h that #includes "bli_family_thunderx2.h". The code has been missing since `adf5c17f`. However, this never manifested as an error because the file is virtually empty and not needed for thunderx2 (or most subconfigs). Thanks to Jeff Diamond for helping to spot this.	2019-06-07 11:34:08 -05:00
Field G. Van Zee	6bf449cc69	Merge branch 'amd'	2019-05-31 17:42:40 -05:00
kdevraje	13806ba3b0	This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019 Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-05-27 16:24:43 +05:30
kdevraje	a3554eb1dc	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis to configure zen2 Change-Id: I97e17bca9716b80b862925f97bb513c07b4b0cae	2019-05-23 11:53:32 +05:30
Kiran Varaganti	a23f92594c	config_registry: New AMD zen2 architecture configuration added. frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS] frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...). frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..). frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20 Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866	2019-05-22 05:28:16 -04:00
kdevraje	df755848b8	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis into rome2.0 Change-Id: Ie8aad1ab810f0f3c0b90ec67f9dd3dfb8dcc74cc	2019-05-22 13:30:07 +05:30
Field G. Van Zee	fa7e6b182b	Define _POSIX_C_SOURCE in bli_system.h. Details: - Added #ifndef _POSIX_C_SOURCE #define _POSIX_C_SOURCE 200809L #endif to bli_system.h so that an application that uses BLIS (specifically, an application that #includes blis.h) does not need to remember to #define the macro itself (either on the command line or in the code that includes blis.h) in order to activate things like the pthreads. Thanks to Christos Psarras for reporting this issue and suggesting this fix. - Commented out #include <sys/time.h> in bli_system.h, since I don't think this header is used/needed anymore. - Comment update to function macro for bli_?normiv_unb_var1() in frame/util/bli_util_unb_var1.c.	2019-05-01 19:13:00 -05:00

1 2 3 4 5 ...

382 Commits