Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
microkernels so as to avoid a performance penalty when mixing AVX
and SSE instructions. These vzeroupper instructions were once part
of the haswell kernels, but were inadvertently removed during a source
code shuffle some time ago when we were managing duplicate 'haswell'
and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
and re-inserting the missing instructions.
Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresponding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
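The offset in question follows directly from the scalar layouts; a minimal sketch (the struct names below are stand-ins for BLIS's scomplex/dcomplex types):

```c
#include <stddef.h>

/* stand-ins for BLIS's scomplex/dcomplex pair types (names hypothetical) */
typedef struct { float  real; float  imag; } scomplex_sk;
typedef struct { double real; double imag; } dcomplex_sk;

/* the imaginary component of a c (float complex) kappa sits 4 bytes
   after the real component; 8 bytes is correct only for z (double) */
_Static_assert(offsetof(scomplex_sk, imag) == 4, "c: imag at +4 bytes");
_Static_assert(offsetof(dcomplex_sk, imag) == 8, "z: imag at +8 bytes");
```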
Avoid x18 use in FP32 kernel:
- C address lines x[18-26] renamed to x[19-27] (reg index +1)
- Original role of x27 fulfilled by x5, which is free after the k-loop part.
The FP64 kernel does not require changes since x18 is not used there.
Roughly the same as 916e1fa , additionally with x15 clobbering removed.
- x15: Not used at all.
Compilation with Clang shows a warning about the x18 reservation, but
compilation itself succeeds and all tests pass.
- x7, x8: Used to store the addresses of Alpha and Beta.
As Alpha & Beta are not used in the k-loops, use x0 and x1 to load
Alpha & Beta's addresses after the k-loops are completed, since A & B's
addresses are no longer needed there.
This "ldr addr; ldr val, [addr]" sequence should not cause much
performance drawback since it is done outside the k-loops and there are
plenty of instructions between the loading of Alpha & Beta and their use.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
any longer. Loading cs_c directly into x10 and scaling by 8 frees
x9 straightforwardly.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Like x9, loaded and scaled by 8 into x14, except that x13 is
also used in a conditional branch, so "cmp x13, #1" needs to be
modified into "cmp x14, #8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in the k-loops. Load
these addresses into x0 and x1 after Alpha & Beta are both loaded,
since by then neither the address of A/B nor that of Alpha/Beta is needed.
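The x13/x14 change relies on a simple equivalence, sketched here in C (names illustrative; the real code is AArch64 assembly):

```c
/* once the row stride has been scaled into a byte stride for doubles
   (x14 = x13 * 8), the unit-stride test on the element stride
   ("cmp x13, #1") is equivalent to testing the byte stride against 8
   ("cmp x14, #8"), so x13 no longer needs to stay live */
long to_byte_stride(long rs_c)            { return rs_c * 8; }
int  is_unit_stride_bytes(long rs_bytes)  { return rs_bytes == 8; }
```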
Details:
- Added a 512-bit-specific 'a64fx' subconfiguration that uses block
sizes empirically tuned by Stepan Nassyr. This subconfig also sets the sector
cache size and enables memory-tagging code in SVE gemm kernels. This
subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
blocksizes according to the analytical model. This part is ported from
Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE
at size (2*VL, 10). These kernels use unindexed FMLA instructions
because indexed FMLA takes 2 FMA units in many implementations.
PS: There are indexed-FMLA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for
size (12, k). This dpackm kernel is not currently used by any
subconfiguration.
- Implemented several experimental dgemmsup kernels which would
improve performance in a few cases. However, those dgemmsup kernels
generally underperform hence they are not currently used in any
subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
PR #424.
Details:
- Added an err_t* parameter to memory allocation functions including
bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
already use the return value to return the allocated memory address,
they can't communicate errors to the caller through the return value.
This commit does not employ any error checking within these functions
or their callers, but this sets up BLIS for a more comprehensive
commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
bli_type_defs.h. This was done so that what remains of bli_malloc.h
can be included after the definition of the err_t enum. (This ordering
was needed because bli_malloc.h now contains function prototypes that
use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
bli_param_macro_defs.h. These functions provide easy checks for error
codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
the API for bli_malloc_user(), which is an exported symbol in the
shared library. However, it's quite possible that the only application
that calls bli_malloc_user()--indeed, the reason it was marked for
symbol exporting to begin with--is the BLIS testsuite. And if that's
the case, this breakage won't affect anyone. Nonetheless, the "major"
part of the so_version file has been updated accordingly to 4.0.0.
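The new convention can be sketched as follows; these are simplified stand-ins, not BLIS's actual definitions:

```c
#include <stdlib.h>

/* simplified stand-ins for BLIS's err_t values (names hypothetical) */
enum { SKETCH_SUCCESS = 0, SKETCH_MALLOC_RETURNED_NULL = -1 };

/* the pointer remains the return value; status travels through the
   err_t*-style out-parameter, since the return value is already taken */
void* malloc_with_err(size_t size, int* r_val) {
    void* p = malloc(size);
    *r_val = (p != NULL) ? SKETCH_SUCCESS : SKETCH_MALLOC_RETURNED_NULL;
    return p;
}

/* analogues of the bli_is_success()/bli_is_failure() helpers */
int is_success(int e) { return e == SKETCH_SUCCESS; }
int is_failure(int e) { return e != SKETCH_SUCCESS; }
```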
Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
not store the microtile result correctly due to incorrect indices
calculations. (The error was introduced when I reorganized the
'kernels/power10/3' directory.)
Details:
- Renamed the files, variables, and functions relating to the packing
block allocator from its legacy name (membrk) to its current name
(pba). This more clearly contrasts the packing block allocator with
the small block allocator (sba).
- Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that
caused the function to erroneously change the value of the pack_a
field of the global rntm_t instead of the pack_b field. (Apparently
nobody has used this API yet.)
- Comment updates.
Details:
- This commit adds a new BLIS sandbox that (1) provides implementations
based on low-precision gemm kernels, and (2) extends the BLIS typed
API for those new implementations. Currently, these new kernels can
only be used for the POWER10 microarchitecture; however, they may
provide a template for developing similar kernels for other
microarchitectures (even those beyond POWER), as changes would likely
be limited to select places in the microkernel and possibly the
packing routines. The new low-precision operations that are now
supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more
information, refer to the POWER10.md document that is included in
'sandbox/power10'.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision complex domain (c and z) and housed them in the 'haswell'
kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), upon which these complex kernels are
partially based.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision real domain (s and d) and housed them in the 'haswell'
kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), which I have now tweaked and used to
create comparable single-precision real kernels (s6xk and s16xk).
Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
optionally disables the pre-inversion of diagonal elements of the
triangular matrix in the trsm operation and instead uses division
instructions within the gemmtrsm microkernels. Pre-inversion is
enabled by default. When it is disabled, performance may suffer
slightly, but numerical robustness should improve for certain
pathological cases involving denormal (subnormal) numbers that would
otherwise result in overflow in the pre-inverted value. Thanks to
Bhaskar Nallani for reporting this issue via #461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
instructions.
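The trade-off can be illustrated with a small sketch (values chosen to exercise the subnormal case; these helpers are illustrative, not the kernel code):

```c
#include <math.h>

/* with pre-inversion, the packed triangular factor stores 1/d and the
   gemmtrsm ukernel multiplies; without it, the ukernel divides by d,
   which is slower but avoids overflow when 1/d rounds to infinity */
double solve_preinv(double b, double inv_d) { return b * inv_d; }
double solve_div   (double b, double d)     { return b / d;     }
```

For a subnormal diagonal element d = 1e-310, 1/d overflows to infinity, so the pre-inverted path yields an infinite result even though b/d = 1e10 is perfectly representable.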
Details:
- Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and
bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly
instructions that were left over from when the kernels were first
written. These instructions would cause segmentation faults in some
situations where extra memory was not allocated beyond the end of
the matrix buffers. Thanks to Kiran Varaganti for reporting these
bugs and to Bhaskar Nallani for identifying the cause and solution.
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support gemmtsup is
included in this commit, including:
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertently being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
Windows systems.
- Various whitespace changes.
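The beta = 0 fix described above amounts to the following pattern (a sketch of a single-element update, not the actual kernel code):

```c
#include <math.h>

/* when beta == 0, C must be overwritten rather than scaled: C may hold
   uninitialized data such as NaN, and NaN * 0.0 is still NaN */
void update_c(float* c, float alpha, float ab, float beta) {
    if (beta == 0.0f) *c = alpha * ab;
    else              *c = beta * (*c) + alpha * ab;
}
```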
Details:
- Created a set of single-precision real millikernels and microkernels
comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
file in test/sup. This included edits that allow for separate "small"
dimensions for single- and double-precision as well as for single-
vs. multithreaded execution.
Details:
- Changed double* pointers in sgemm function signature to float*. At
this point I've lost track of whether this was my fault or another
dormant bug like the one described in ece9f6a, but at this point I
no longer care. It's one of those days (aka I didn't ask for this).
Details:
- Fixed dormant type mismatches in the use of the prototype-generating
macros in bli_kernels_knl.h. Specifically, some float prototypes
were incorrectly using double as their ctype. This didn't actually
matter until the type changes in 645d771, as previously those types
were not used since packm was prototyped with void* pointers.
Details:
- Changed all void* function arguments in reference packm kernels to
those of the native type (ctype*). These pointers no longer need to
be void* and are better represented by their native types anyway.
(See below for details.) Updated knl packm kernels accordingly.
- In the definition of the PACKM_KER_PROT prototype macro template in
frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a,
and p from void* to ctype*. They were originally void* because these
function signatures had to share the same type so they could all be
stored in a single array of that shared type, from which they were
queried and called by packm_cxk(). This is no longer how the function
pointers are stored, and so it no longer makes sense to force the
caller of packm kernels to use void*, only so that the implementor
of the packm kernels can typecast back to the native datatype within
the kernel definition. This change has no effect internally within
BLIS because currently all packm kernels are called after querying
the function addresses from the context and then typecasting to the
appropriate function pointer type, which is based upon type-specific
function pointers like float* and double*.
- Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and
misleading due to changes to the handling of packm kernels since
moving them into the context.
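The before/after can be sketched as follows (names illustrative, not the actual BLIS prototypes):

```c
/* Old scheme: all packm kernels shared one void*-based signature so
   their addresses could live in a single homogeneous array. */
typedef void (*packm_void_ft)(void* kappa, void* a, void* p);

/* New scheme: each kernel uses its native type directly, since kernels
   are now queried from the context and cast to a typed function
   pointer by the caller anyway. */
void spackm_sk(float* kappa, float* a, float* p)    { *p = *kappa * *a; }
void dpackm_sk(double* kappa, double* a, double* p) { *p = *kappa * *a; }
```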
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
various n = 6 edge cases with a single sup kernel call. Previously,
only n = {4,2,1} were handled explicitly as single kernel calls;
that is, cases where n = 6 were previously being executed via two
kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
Details:
- Fixed a few not-really-bugs:
- Previously, the d6x8m kernels were still prefetching the next upanel
of A using MR*rs_a instead of ps_a (same for prefetching of next
upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
that the upanels might be packed, using ps_a or ps_b is the correct
way to compute the prefetch address.
- Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
executed as intended even though it was based on faulty pointer
management. Basically, in the rd_d6x8m kernel, the pointer for B
(stored in rdx) was loaded only once, outside of the jj loop, and in
the second iteration its new position was calculated by incrementing
rdx by the *absolute* offset (four columns), which happened to be the
same as the relative offset (also four columns) that was needed. It
worked only because that loop only executed twice. A similar issue
was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
- Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
that it is loaded only once outside of the loops rather than
multiple times inside the loops.
- Changed outer loop in rd kernels so that the jump/comparison and
loop bounds more closely mimic what you'd see in higher-level source
code. That is, something like:
for( i = 0; i < 6; i+=3 )
rather than something like:
for( i = 0; i <= 3; i+=3 )
- Switched row-based IO to use byte offsets instead of byte column
strides (e.g. via rsi register), which were known to be 8 anyway
since otherwise that conditional branch wouldn't have executed.
- Cleaned up and homogenized prefetching a bit.
- Updated the comments that show the before and after of the
in-register transpositions.
- Added comments to column-based IO cases to indicate which columns
are being accessed/updated.
- Added rbp register to clobber lists.
- Removed some dead (commented out) code.
- Fixed some copy-paste typos in comments in the rv_6x8n kernels.
- Cleaned up whitespace (including leading ws -> tabs).
- Moved edge case (non-milli) kernels to their own directory, d6x8,
and split them into separate files based on the "NR" value of the
kernels (Mx8, Mx4, Mx2, etc.).
- Moved config-specific reference Mx1 kernels into their own file
(e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
- Added rd_dMx1 assembly kernels, which seem marginally faster than
the corresponding reference kernels.
- Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
the row-oriented reference kernels for all storage combos.
This bug occurs in a corner case, when k_iter == 0 and we jump to
CONSIDERKLEFT.
In the current design, the first row/col. of a and b is loaded twice.
The fix is to rearrange the loading instructions for the first row/col.
of a and b.
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
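The control flow at issue follows the usual k_iter/k_left split; here is a C analogue of that structure (the actual kernel is assembly):

```c
/* main loop unrolled by 4 (k_iter) plus a remainder loop (k_left,
   i.e. CONSIDERKLEFT); when k_iter == 0 we enter the remainder path
   directly, so any a/b elements preloaded for the main loop must not
   be loaded again there */
double dot_sketch(const double* a, const double* b, int k) {
    int k_iter = k / 4, k_left = k % 4;
    double acc = 0.0;
    for (int i = 0; i < k_iter; i++, a += 4, b += 4)
        acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    for (int i = 0; i < k_left; i++)
        acc += a[i] * b[i];
    return acc;
}
```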
Here adds two kernels for Arm SVE vector extensions.
1. a gemm kernel for double at sizes 8x8.
2. a packm kernel for double at dimension 8xk.
To achieve best performance, vector-length-agnostic programming
is not used. A vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.
"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code." [1]
[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
* Fix vectorized version of bli_amaxv
To match Netlib, i?amax should return:
- the lowest index among equal values
- the first NaN if one is encountered
* Fix typos.
* And another one...
* Update ref. amaxv kernel too.
* Re-enabled optimized amaxv kernels.
Details:
- Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen'
kernel set for use in haswell, zen, zen2, knl, and skx subconfigs.
These two kernels (for s and d datatypes) were temporarily disabled in
e186d71 as part of issue #380. However, the key missing semantic
properties that prompted the disabling of these kernels--returning the
index of the *first* rather than of the last element with largest
absolute value, and returning the index of the first NaN if one is
encountered--were added as part of #382 thanks to Devin Matthews.
Thus, now that the kernels are working as expected once more, this
commit causes these kernels to once again be registered for the
affected subconfigs, which effectively reverts all code changes
included in e186d71.
- Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c.
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
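The required semantics can be captured in a small reference routine (a sketch, not the actual BLIS kernel):

```c
#include <math.h>

/* Netlib i?amax semantics: return the lowest index among elements with
   the largest absolute value, or the index of the first NaN if any */
int amax_ref(const double* x, int n) {
    double maxabs = -1.0;
    int idx = 0;
    for (int i = 0; i < n; i++) {
        if (isnan(x[i])) return i;   /* first NaN wins */
        if (fabs(x[i]) > maxabs) {   /* strict '>' keeps the lowest index */
            maxabs = fabs(x[i]);
            idx = i;
        }
    }
    return idx;
}
```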
Details:
- Implemented optional packing for A or B (or both) within the sup
framework (which currently only supports gemm). The request for
packing either matrix A or matrix B can be made via setting
environment variables BLIS_PACK_A or BLIS_PACK_B (to any
non-zero value; if set, zero means "disable packing"). It can also
be made globally at runtime via bli_pack_set_pack_a() and
bli_pack_set_pack_b() or with individual rntm_t objects via
bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
interface of either the BLIS typed or object APIs. (If using the
BLAS API, environment variables are the only way to communicate the
packing request.)
- One caveat (for now) with the current implementation of selective
packing is that any blocksize extension registered in the _cntx_init
function (such as is currently used by haswell and zen subconfigs)
will be ignored if the affected matrix is packed. The reason is
simply that I didn't get around to implementing the necessary logic
to pack a larger edge-case micropanel, though this is entirely
possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
with corresponding headers, in which higher-level packm-related
functions are defined for use within the sup framework. The actual
packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
always NULL), and pointer to a thrinfo_t* (which for now is the address
of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
the millikernel can query the panel stride of the packed matrix and
step through it accordingly. If the matrix isn't packed, the panel
stride of interest for the given millikernel will be set to the
appropriate value so that the mkernel may step through the unpacked
matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
panel strides (ps_a and ps_b, respectively) instead of computing them
on the fly.
- Spun off the environment variable getting and setting functions into
a new file, bli_env.c (with a corresponding prototype header). These
functions are now used by the threading infrastructure (e.g.
BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
This means that the function bli_thread_init_rntm() was renamed to
bli_rntm_init_from_global() and relocated accordingly.
- Added a new bli_pack.c function, which serves as the home for
functions that manage the pack_a and pack_b fields of the global
rntm_t, including from environment variables, just as we have
functions to manage the threading fields of the global rntm_t in
bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
spinning off the bli_l3_thread_decorator() functions into their own
files. This change makes more sense when considering the further
addition of bli_l3_sup_thread_decorator() functions (for now limited
only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
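For the BLAS API case, the request is made purely through the environment, e.g.:

```shell
# any non-zero value enables packing; zero (or unset) disables it
export BLIS_PACK_A=1   # pack matrix A within the sup framework
export BLIS_PACK_B=0   # leave matrix B unpacked
```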
Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.
Implemented and registered power9 dgemm ukernel.
Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.
Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
inadvertently not incremented when the Zen2 subconfiguration was
added.
- In bli_gemm_front(), added a missing conditional constraint around the
call to bli_gemm_small() that ensures that the computation precision
of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
that existed around the call to bli_syrk_small() into bli_syrk_small()
to minimize the calling code footprint and also to bring that code
into stylistic harmony with similar code in bli_gemm_front() and
bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
proper accessor static functions (e.g. 'a->dim[0]' becomes
'bli_obj_length( a )').
- Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
strictly speaking unnecessary, but it serves as a useful visual cue to
those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
version check for availability of -march=znver2, and added appropriate
support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
config/zen/amd_config.mk, including: removal of -march=znver1 et al.
from CKVECFLAGS (since the -march flag is added within make_defs.mk);
setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.
Details:
- Added the pack_t schema argument to the knl packm kernel functions.
This change was intended for inclusion in 31c8657. (Thank you SDE +
Travis CI.)
Updated copyright information for the kernels/zen/bli_trsm_small.c file.
Removed separate kernels for the zen2 architecture; instead added
threshold conditions in the zen kernels for both ROME and NAPLES.
Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5