amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 18:52:14 +00:00

Author	SHA1	Message	Date
Devin Matthews	90508192f2	Update do_sde.sh (#489) Update to a newer version of SDE, and do a direct download as it seems you don't have to click through the license anymore.	2021-03-30 21:16:44 -05:00
Nicholai Tukanov	22c6b5dc4c	Fixed bug in power10 microkernel I/O. (#488) Details: - Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did not store the microtile result correctly due to incorrect index calculations. (The error was introduced when I reorganized the 'kernels/power10/3' directory.)	2021-03-30 19:07:42 -05:00
Field G. Van Zee	159ca6f01a	Made test/3/octave scripts robust to missing data. Details: - Modified the octave scripts in test/3 so that the script does not choke when one or more of the expected OpenBLAS, Eigen, or vendor data files is missing. (The BLIS data set, however, must be complete.) When a file is missing, that data series is simply not included on that particular graph. Also factored out a lot of the redundant logic from plot_panel_4x5.m into a separate function in read_data.m.	2021-03-24 15:57:32 -05:00
Field G. Van Zee	545e6c2f6d	CHANGELOG update (0.8.1)	2021-03-22 17:42:33 -05:00
Field G. Van Zee	8535b3e11d	Version file update (0.8.1)	2021-03-22 17:42:33 -05:00
Field G. Van Zee	e56d9f2d94	ReleaseNotes.md update in advance of next version.	2021-03-22 17:40:50 -05:00
Field G. Van Zee	ca83f955d4	CREDITS file update.	2021-03-22 17:21:21 -05:00
Field G. Van Zee	57ef61f6cd	Merge branch 'master' of github.com:flame/blis	2021-03-19 13:05:43 -05:00
Field G. Van Zee	bf1b578ea3	Reduced KC on skx from 384 to 256. Details: - Reduced the KC cache blocksize for double real on the skx subconfig from 384 to 256. The maximum (extended) KC was also reduced accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting this change.	2021-03-19 13:03:17 -05:00
Nicholai Tukanov	e7a4a8edc9	Fix calculation of new pb size (#487) Details: - Added missing parentheses to the i8 and i4 instantiations of the GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.	2021-03-17 19:43:31 -05:00
Field G. Van Zee	4493cf516e	Redefined BLIS_NUM_ARCHS to update automatically. Details: - Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum value in the arch_t enum. This means that it no longer needs to get updated manually whenever new subconfigurations are added to BLIS. Also removed the explicit initial index assignment of 0 from the first enum value, which was unnecessary due to how the C language standard mandates indexing of enum values. Thanks to Devin Matthews for originally submitting this as a PR in #446. - Updated docs/ConfigurationHowTo.md to reflect the aforementioned change.	2021-03-15 13:12:49 -05:00
Field G. Van Zee	a4b73de84c	Disabled _self() and _equal() in bli_pthread API. Details: - Disabled the _self() and _equal() extensions to the bli_pthread API introduced in d479654. These functions were disabled after I realized that they aren't actually needed yet. Thanks to Devin Matthews for helping me reason through the appropriate consumer code that will appear in BLIS (eventually) in a future commit. (Also, I could never get the Windows branch to link properly in clang builds in AppVeyor. See the comment I left in the code, and #485, for more info.)	2021-03-12 19:47:39 -06:00
Field G. Van Zee	f9d604679d	Added _self() and _equal() to bli_pthread API. Details: - Expanded the bli_pthread API to include equivalents to pthread_self() and pthread_equal(). Implemented these two functions for all three cpp branches present within bli_pthread.c: systemless, Windows, and Linux/BSD.	2021-03-12 19:47:39 -06:00
Field G. Van Zee	fa9b3c8f6b	Shuffled code in Windows branch of bli_pthreads.c. Details: - Reordered the definitions in the cpp branch in bli_pthreads.c that defines the bli_pthreads API in terms of Windows API calls. Also added missing comments that mark sections of the API, which brings the code into harmony with other cpp branches (as well as bli_pthread.h).	2021-03-11 15:13:51 -06:00
Field G. Van Zee	95d4f3934d	Moved cpp macro redef of strerror_r to bli_env.c. Details: - Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r (in terms of strerror_s) from bli_thread.h to bli_env.c. It was likely left behind in bli_thread.h in a previous commit, when code that now resides in bli_env.c was moved from bli_thread.c. (I couldn't find any other instance of strerror_r being used in BLIS, so I moved the #define directly to bli_env.c rather than place it in bli_env.h.) The code that uses strerror_r is currently disabled, though, so this commit should have no effect on BLIS.	2021-03-11 13:50:40 -06:00
Field G. Van Zee	8a3066c315	Relocated gemmsup_ref general stride handling. Details: - Moved the logic that checks for general stridedness in any of the matrix operands in a gemmsup problem. The logic previously resided near the top of bli_gemmsup_int(), which is the thread entry point for the parallel region of the current gemmsup implementation. The problem with this setup was that the code would attempt to reject problems with any general-strided operands by returning BLIS_FAILURE, and that return value was then being ignored by the l3_sup thread decorator, which unconditionally returns BLIS_SUCCESS. To solve this issue, rather than try to manage n return values, one from each of n threads, I simply moved the logic into bli_gemmsup_ref(). I didn't move it any higher (e.g. bli_gemmsup()) because I still want the logic to be part of the current gemmsup handler implementation. That is, perhaps someone else will create a different handler, and that author wants to handle general stride differently. (We don't want to force them into a particular way of handling general stride.) - Removed the general stride handling from bli_gemmtsup_int(), even though this function is inoperative for now. - This commit addresses issue #484. Thanks to RuQing Xu for reporting this issue.	2021-03-09 17:52:59 -06:00
Nicholai Tukanov	670bc7b60f	Add low-precision POWER10 gemm kernels (#467) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'.	2021-03-05 13:53:43 -06:00
RuQing Xu	b8dcc5bc75	Fixed typed API definition for gemmt (#476) Details: - Fixed incorrect definition and prototype of bli_?gemmt() in frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously defined identically to gemm, which was wrong because it did not take into account the uplo property of C. - Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md. Specifically, the document erroneously listed only a single transab parameter instead of transa and transb.	2021-03-01 16:58:24 -06:00
Ilknur	a0e4fe2340	Fixed double free() in level1v example (#482) Details: - In examples/tapi/00level1v.c, pointer 'z' was being freed twice and pointer 'a' was not being freed at all. This commit correctly frees each pointer exactly once.	2021-03-01 16:06:56 -06:00
Field G. Van Zee	f5871c7e06	Added complex asm packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision complex domain (c and z) and housed them in the 'haswell' kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Minor modifications to the corresponding s and d packm kernels that were introduced in `426ad67`. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), upon which these complex kernels are partially based.	2021-02-28 17:03:57 -06:00
Field G. Van Zee	426ad679f5	Added assembly packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision real domain (s and d) and housed them in the 'haswell' kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), which I have now tweaked and used to create comparable single-precision real kernels (s6xk and s16xk).	2021-02-27 18:39:56 -06:00
Devin Matthews	f50c1b7e58	Merge pull request #473 from ajaypanyala/pkgconfig build: generate pkgconfig file	2021-02-01 11:55:51 -06:00
Field G. Van Zee	8f39aea11f	Merge branch 'dev'	2021-01-30 17:59:56 -06:00
Field G. Van Zee	f8db9fb33b	Fixed missing parentheses in README.md Citations.	2021-01-28 08:04:52 -06:00
Ajay Panyala	b3953b938e	drop CFLAGS in the generated pkgconfig file	2021-01-12 17:07:04 -08:00
Ajay Panyala	b02d9376ba	add datadir	2021-01-12 11:47:58 -08:00
Ajay Panyala	d8d8deeb6d	generate pkgconfig file	2021-01-11 17:47:50 -08:00
Devin Matthews	8c65411c7c	Merge pull request #471 from flame/fix-470 Fix kernel-to-config mapping for intel64	2021-01-11 16:01:45 -06:00
Devin Matthews	874c3f04ec	Update configure Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes #470.	2021-01-08 13:56:30 -06:00
Field G. Van Zee	2a815d5b36	Support trsm pre-inversion in 1m, bb, ref kernels. Details: - Expanded support for disabling trsm diagonal pre-inversion to other microkernel types, including the reference microkernel as well as the kernel implementations for 1m and the pre-broadcast B (bb) format used by the power9 subconfig. This builds on the 'haswell' and 'penryn' kernel support added in `7038bba`. Thanks to Bhaskar Nallani for reminding me, in #461 (post-closure), that 1m support was missing from that commit. - Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the omp simd implementation after making a stripped-down copy in 'old'. This code has been disabled for some time and it seemed better suited to rot away out of sight rather than clutter up a file that is already cluttered by the presence of lower and upper versions. - Minor comment update to bli_ind_init().	2021-01-04 18:03:39 -06:00
Field G. Van Zee	c3ed2cbb9f	Enable 1m only if real domain ukr is not reference. Details: - Previously, BLIS would automatically enable use of the 1m method for a given precision if the complex domain microkernel was a reference kernel. This commit adds an additional constraint so that 1m is only enabled if the corresponding real domain microkernel is NOT reference. That is, BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Note that this does not prevent 1m from being enabled manually under those conditions; it only means that 1m will not be enabled automatically at initialization-time.	2021-01-04 16:16:32 -06:00
Field G. Van Zee	ed50c94738	Merge branch 'master' into dev	2021-01-04 14:31:44 -06:00
Devin Matthews	328b4f8872	Shared object (dylib) was not built correctly for partial build. The SO build rule used $? instead of $^. Observed on macOS, not sure if it affected Linux or not.	2020-12-30 17:54:18 -06:00
Devin Matthews	ae6ef66ef8	bli_diag_offset_with_trans had wrong return type. Fixes #468.	2020-12-30 17:34:55 -06:00
Devin Matthews	ebcf197fb8	Merge pull request #466 from isuruf/patch-3 fix cc_vendor for crosstool-ng toolchains	2020-12-05 22:26:27 -06:00
Isuru Fernando	21aa67e11c	fix cc_vendor for crosstool-ng toolchains	2020-12-05 21:59:13 -06:00
Field G. Van Zee	472f138cb9	Fixed typo in README.md to CodingConventions.md.	2020-12-05 14:13:52 -06:00
Field G. Van Zee	0cef09aa92	Consolidated code in level-3 _front() functions. Details: - Reduced a code segment that appears in all of the bli_*_front() functions except for bli_gemm_front(). Previously, the code looked like this (taken from bli_herk_front()): if ( bli_cntx_method( cntx ) == BLIS_NAT ) { bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local ); bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local ); } else // if ( bli_cntx_method( cntx ) != BLIS_NAT ) { pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); } This code segment is part of a sort-of-hack that allows us to communicate the pack schemas into the level-3 thread decorator, which needs them so that they can be passed into bli_l3_cntl_create_if(), where the control tree is created. However, the first conditional case above is unnecessary because the second case is fully generalized. That is, even in the native case, the context contains correct, queryable schemas. Thus, these code segments were reduced to something like: pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); There's always a small chance that the seemingly unnecessary code in the first branch case has some special use that is not apparent to me, but the testsuite's default input parameters seem to think this commit will be fine.	2020-12-04 16:40:59 -06:00
Field G. Van Zee	7038bbaa05	Optionally disable trsm diagonal pre-inversion. Details: - Implemented a configure-time option, --disable-trsm-preinversion, that optionally disables the pre-inversion of diagonal elements of the triangular matrix in the trsm operation and instead uses division instructions within the gemmtrsm microkernels. Pre-inversion is enabled by default. When it is disabled, performance may suffer slightly, but numerical robustness should improve for certain pathological cases involving denormal (subnormal) numbers that would otherwise result in overflow in the pre-inverted value. Thanks to Bhaskar Nallani for reporting this issue via #461. - Added preprocessor macro guards to bli_trsm_cntl.c as well as the gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant to the aforementioned feature. - Added macros to frame/include/bli_x86_asm_macros.h related to division instructions.	2020-12-04 16:08:15 -06:00
Field G. Van Zee	78aee79452	Allow amaxv testsuite module to run with dim = 0. Details: - Exit early from libblis_test_amaxv_check() when the vector dimension (length) of x is 0. This allows the module to run when the testsuite driver passes in a problem size of 0. Thanks to Meghana Vankadari for alerting us to this issue via #459. - Note: All other testsuite modules appear to work with problem sizes of 0, except for the microkernel modules. I chose not to "fix" those modules because a failure (or segmentation fault, as happens in this case) is actually meaningful in that it alerts the developer that some microkernels cannot be used with k = 0. Specifically, the 'haswell' kernel set contains microkernels that preload elements of B. Those microkernels would need to be restructured to avoid preloading in order to support usage when k = 0.	2020-12-02 13:02:36 -06:00
Field G. Van Zee	92d2b12a44	Fixed obscure testsuite gemmt dependency bug. Details: - Fixed a bug in the gemmt testsuite module that only manifested when testing of gemmt is enabled but testing of gemv is disabled. The bug was due to a copy-paste error dating back to the introduction of gemmt in `88ad841`.	2020-12-02 13:02:00 -06:00
Field G. Van Zee	b43dae9a5d	Fixed copy-paste bugs in edge-case sup kernels. Details: - Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly instructions that were left over from when the kernels were first written. These instructions would cause segmentation faults in some situations where extra memory was not allocated beyond the end of the matrix buffers. Thanks to Kiran Varaganti for reporting these bugs and to Bhaskar Nallani for identifying the cause and solution.	2020-12-01 16:44:38 -06:00
Field G. Van Zee	11dfc176a3	Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out calls to printf() to bli_rntm.c. - Whitespace changes to bli_thread.c.	2020-12-01 19:51:27 +00:00
Devin Matthews	6d3bafacd7	Update BuildSystem.md Add git version >= 1.8.5 requirement (see #462).	2020-11-28 17:17:56 -06:00
Field G. Van Zee	64856ea5a6	Auto-reduce (by default) prime numbers of threads. Details: - When requesting multithreaded parallelism by specifying the total number of threads (whether it be via environment variable, globally at runtime, or locally at runtime), reduce the number of threads actually used by one if the original value (a) is prime and (b) exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set to 11 by default. If, when specifying the total number of threads (and not the individual ways of parallelism for each loop), prime numbers of threads are desired, this feature may be overridden by defining the BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that corresponds to the configuration family targeted at configure-time. (For now, there are no configure options to control this feature.) Thanks to Jeff Diamond for suggesting this change. - Defined a new function in bli_thread.c, bli_is_prime(), that returns a bool that determines whether an integer is prime. This function is implemented in terms of existing functions in bli_thread.c. - Updated docs/Multithreading.md to document the above feature, along with unrelated minor edits.	2020-11-23 16:54:51 -06:00
Field G. Van Zee	55933b6ff6	Added missing attribution to docs/ReleaseNotes.md.	2020-11-20 10:39:32 -06:00
Field G. Van Zee	e310f57b4b	CHANGELOG update (0.8.0)	2020-11-19 13:33:37 -06:00
Field G. Van Zee	9b387f6d5a	Version file update (0.8.0) 0.8.0	2020-11-19 13:33:37 -06:00
Field G. Van Zee	2928ec750d	ReleaseNotes.md update in advance of next version. Details: - Updated docs/ReleaseNotes.md in preparation for next version.	2020-11-18 18:31:35 -06:00
Field G. Van Zee	b9899bedff	CREDITS file update.	2020-11-18 16:52:41 -06:00
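
The thread-count rule described in commit 64856ea5a6 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in bli_thread.c: the names sketch_is_prime(), sketch_adjust_num_threads(), and SKETCH_NT_MAX_PRIME are hypothetical, and the primality test uses plain trial division, whereas the commit message says bli_is_prime() is built on existing functions in bli_thread.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for BLIS_NT_MAX_PRIME (11 by default per the commit). */
#define SKETCH_NT_MAX_PRIME 11

/* Trial-division primality test (an assumption; not the BLIS implementation). */
static bool sketch_is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
    {
        if (n % d == 0) return false;
    }
    return true;
}

/* Reduce the requested thread count by one when it is prime and exceeds
   the threshold; otherwise return it unchanged. */
static long sketch_adjust_num_threads(long nt)
{
    assert(nt >= 1);
    if (nt > SKETCH_NT_MAX_PRIME && sketch_is_prime(nt))
    {
        return nt - 1;
    }
    return nt;
}
```

Under these assumptions, a request for 13 threads would be reduced to 12, while requests for 11 (prime but not above the threshold) or 16 (not prime) would be left alone. A composite count factors into more balanced ways of parallelism across loops, which appears to be the motivation for the change.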

1974 Commits
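
The numerical issue behind commit 7038bbaa05 (--disable-trsm-preinversion) can be illustrated with a scalar sketch. It assumes IEEE 754 double precision with subnormals enabled (no flush-to-zero), and the function names are hypothetical; the actual change lives in the gemmtrsm microkernels and bli_trsm_cntl.c, not in code like this.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Pre-invert a diagonal element once, as done when pre-inversion is enabled. */
static double sketch_preinvert(double alpha11)
{
    return 1.0 / alpha11;
}

/* Solve alpha11 * x = b by multiplying with the stored inverse. */
static double sketch_solve_preinverted(double inv_alpha11, double b)
{
    return b * inv_alpha11;
}

/* Solve alpha11 * x = b with a division, as done when pre-inversion is disabled. */
static double sketch_solve_division(double alpha11, double b)
{
    assert(alpha11 != 0.0);
    return b / alpha11;
}
```

With a subnormal diagonal such as alpha11 = DBL_MIN / 4.0 and a right-hand side b = DBL_MIN / 2.0, the division path returns exactly 2.0, whereas the inverse 1.0 / alpha11 overflows to infinity and the pre-inverted path returns infinity. That is the kind of pathological case the commit message describes; the trade-off is that division instructions are typically slower than multiplication.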