amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-04-19 23:28:52 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	2e3b3782cf	Merge branch 'master' into amd	2020-04-06 14:55:35 -05:00
Field G. Van Zee	2bca03ea9d	Updates, tweaks to runme.sh in test/1m4m. Details: - Made several updates to test/1m4m/runme.sh, including: - Added missing handling for 1m and 4m1a implementations when setting the BLIS_??_NT environment variables. - Added support for using numactl to run the test executables. - Several other cleanups.	2020-03-28 22:10:00 +00:00
Meghana	b5fe75e104	Closing input and output files in test_gemm.c and test_trsm.c Change-Id: I75cdd5adc2bd2dac7d0eca9c050e06dbd52bec26 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>	2020-03-24 09:09:58 +05:30
Meghana Vankadari	ddcb3d8a52	Modified test_trsm.c file in test folder to read input sizes from a file Details: - A macro 'FILE_IN_OUT' is defined to read matrix dimensions and strides from a csv file. Format for input file if 'FILE_IN_OUT' is defined: Each line defines a TRSM problem with the following parameters: m n cs_a cs_b The operation implemented by default is AX=B where A is lower-triangular and matrices are in column-major order. When the macro is disabled, the driver reverts to the original implementation. Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv - A macro 'READ_ALL_PARAMS_FROM_FILE' is defined to read all the parameters for TRSM from a csv file. This macro can be defined only when 'FILE_IN_OUT' is already defined. Format for the input file if 'READ_ALL_PARAMS_FROM_FILE' is defined: Each line defines a TRSM problem with the following parameters: sideA uploA transA diagA m n cs_a cs_b By default, column-major order is chosen as storage scheme for matrices. Usage: ./test_trsm_<mkl/blis/openblas>.x input.csv output.csv Change-Id: I349bc69ca968911c16e04d1ce70974d01e65a2fb Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>	2020-03-20 07:15:17 -04:00
Field G. Van Zee	1a284828d1	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. 
- Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_??_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates. AMD-Internal: [CPUPL-713] Change-Id: I9536648e7befac4d2dc17805e44ef34470961662	2020-03-13 01:09:29 -04:00
Field G. Van Zee	8c3d9b9eeb	Merge branch 'amd' of github.com:flame/blis into amd	2020-03-10 14:03:33 -05:00
Field G. Van Zee	71249fe8dd	Merged test/sup, test/supmt into test/sup. Details: - Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able to compile and run both single-threaded and multithreaded experiments. This should help with maintenance going forward. - Created a test/sup/octave_st directory of scripts (based on the previous test/sup/octave scripts) as well as a test/sup/octave_mt directory (based on the previous test/supmt/octave scripts). The octave scripts are slightly different and not easily mergeable, and thus for now I'll maintain them separately. - Preserved the previous test/sup directory as test/sup/old/supst and the previous test/supmt directory as test/sup/old/supmt.	2020-03-10 13:55:29 -05:00
Field G. Van Zee	0f9e0399e1	Updated sup performance graphs; added mt results. Details: - Reran all existing single-threaded performance experiments comparing BLIS sup to other implementations (including the conventional code path within BLIS), using the latest versions (where appropriate). - Added multithreaded results for the three existing hardware types showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc (Zen1). - Various minor updates to the text in docs/PerformanceSmall.md. - Updates to the octave scripts in test/sup/octave, test/supmt/octave.	2020-03-05 17:03:21 -06:00
Field G. Van Zee	90db88e572	Updated sup[mt] Makefiles for variable dim ranges. Details: - Updated test/sup/Makefile and test/supmt/Makefile to allow specifying different problem size ranges for the drivers where one, two, or three matrix dimensions is large. This will facilitate the generation of more meaningful graphs, particularly when two dimensions are tiny.	2020-03-02 15:06:48 -06:00
Field G. Van Zee	31f11a06ea	Updates to octave scripts in test/sup[mt]/octave. Details: - Optimized scripts in test/sup/octave and test/supmt/octave for use with octave 5.2.0 on Ubuntu 18.04. - Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m, which were not only unnecessary but also causing issues with versions 5.x.	2020-02-27 14:33:20 -06:00
Field G. Van Zee	c0558fde45	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. 
- Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_??_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates.	2020-02-17 14:08:08 -06:00
Devrajegowda, Kiran	b074c5e09c	Added a macro MATRIX_INITIALISATION for matrix initialisation in test application Change-Id: I8e5c9902e603a549218d4e8509a481288792266d	2019-12-01 13:12:02 +05:30
Devrajegowda, Kiran	c4047e491a	Merge branch 'amd-blis-nov-mergetest' into amd-staging-rome2.1 Change-Id: I1e04592dd9494faa34555008dd1edbca8a092a44	2019-11-29 23:01:51 +05:30
Devrajegowda, Kiran	85fa9e4107	resolved merge conflicts when merged with public repo master branch Change-Id: Iad6ba809680ba5081cc9d7879794ef58cc8f8a40	2019-11-25 14:46:48 +05:30
Field G. Van Zee	c84391314d	Reverted minor temp/wspace changes from `b426f9e`. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files.	2019-11-04 13:57:12 -06:00
Nicholai Tukanov	b426f9e04e	POWER9 DGEMM (#355) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and not xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list.	2019-11-01 17:57:03 -05:00
Kiran Varaganti	97a4236c82	Matrices are not initialized when inputs dimensions are fed through file, now these are fixed. test_gemm.c contains matrices initialized for file-based inputs as well. Change-Id: I4c3625a51dcbf64c99f56f354dcd898e66035cb1	2019-10-24 13:57:55 +05:30
Field G. Van Zee	29b0e1ef4e	Code review + tweaks to AMD's AOCL 2.0 PR (#349). Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertently not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS. - Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly- added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used. 
- Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes.	2019-10-11 10:24:24 -05:00
Field G. Van Zee	80e6c10b72	Added reproduction section to Performance docs. Details: - Added section titled "Reproduction" to both Performance.md and PerformanceSmall.md that briefly nudges the motivated reader in the right direction if he/she wishes to run the same performance benchmarks used to produce the graphs shown in those documents. Thanks to Dave Love for making this suggestion.	2019-08-29 12:12:08 -05:00
Field G. Van Zee	b02e0aae8c	Updated test drivers to iterate backwards. Details: - Updated test driver source in test, test/3, test/1m4m, and test/mixeddt to iterate through the problem space backwards. This can help avoid certain situations where the CPU frequency does not immediately throttle up to its maximum. Thanks to Robert van de Geijn for recommending this fix (originally made to test/sup drivers in `57e422a`). - Applied off-by-one matlab output bugfix from `b6017e5` to test drivers in test, test/3, test/1m4m, and test/mixeddt directories.	2019-08-27 14:37:46 -05:00
Field G. Van Zee	b6017e53f4	Bugfix of output text + tweaks to test/sup driver. Details: - Fixed an off-by-one bug in the output of matlab row indices in test/sup/test_gemm.c that only manifested when the problem size increment was equal to 1. - Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage combinations for blissup drivers in test/sup. This helps make the building of drivers complete sooner. - Trivial changes to test/sup/runme.sh.	2019-08-27 14:18:14 -05:00
Field G. Van Zee	40781774df	Updated sup performance graphs with libxsmm. Details: - Added libxsmm to column-stored sup graphs presented in docs/PerformanceSmall.md. - Updated sup results for BLASFEO. - Added sup results for Lonestar5 (Haswell). - Addresses issue #326.	2019-08-26 16:47:37 -05:00
Field G. Van Zee	4a0a6e89c5	Changed test/sup alpha to 1; test libxsmm+netlib. Details: - Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is needed because libxsmm currently only optimizes gemm operations where alpha is unit (and beta is unit or zero). - Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its fallback library. This is the library that will be called when the problem dimensions are deemed too large, or when any other criteria for optimization are not met. (This was done not because it is realistic, but rather so that it would be very clear when libxsmm ceased handling gemm calls internally when the data are graphed.)	2019-08-24 15:25:16 -05:00
Field G. Van Zee	7aa52b5783	Use libxsmm API in test/sup; add missing -ldl. Details: - Switch the driver source in test/sup so that libxsmm_?gemm() is called instead of ?gemm_() when compiling for / linking against libxsmm. libxsmm's documentation isn't clear on whether it is even trying to provide BLAS API compatibility, and I got tired of trying to figure it out. - Added missing -ldl in LDFLAGS when linking against libxsmm.	2019-08-23 16:12:50 -05:00
Field G. Van Zee	57e422aa16	Added libxsmm support to test/sup drivers. Details: - Modified test/sup/Makefile to build drivers that test the performance of skinny/small problems via libxsmm. - Modified test/sup/runme.sh to run aforementioned drivers. - Modified test/sup/test_gemm.c so that problem sizes are tested in reverse order (from largest to smallest). This can help avoid certain situations where the CPU frequency does not immediately throttle up to its maximum. Thanks to Robert van de Geijn for recommending this fix.	2019-08-23 14:17:52 -05:00
kdevraje	c4368c66ed	This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019 Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-08-23 14:18:55 +05:30
Field G. Van Zee	06c5a5c4a9	Added test/1m4m driver directory. Details: - Added a new standalone test driver directory named '1m4m' that can build and run performance experiments for BLIS 1m, 4m1a, assembly, OpenBLAS, and the vendor library (MKL). This new driver directory was used to regenerate performance results for the 1m paper. - Added alternate (commented-out) cache blocksizes to config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to work well on a 12-core Intel Xeon E5-2650 v3.	2019-08-23 14:18:09 +05:30
Field G. Van Zee	b3974dafac	New cntx_t blksz "set" functions + misc tweaks. Details: - Defined two new static functions in bli_cntx.h: bli_cntx_set_blksz_def_dt() bli_cntx_set_blksz_max_dt() which developers may find convenient when experimenting with different values of cache blocksizes. - Updated one- and two-socket multithreaded problem size range and increment values in test/3/Makefile. - Changed default to column storage in test/3/test_gemm.c. - Fixed typo in comment in testsuite/src/test_subm.c.	2019-08-23 14:18:09 +05:30
Field G. Van Zee	66c43ca427	Updated BLASFEO results in PerformanceSmall.md. Details: - Updated the BLASFEO performance graphs shown in PerformanceSmall.md using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md accordingly. - Updated test/sup/octave/plot_l3sup_perf.m so that the .m files containing the mpnpkp results do not need to be preprocessed in order to plot half the problem size range (ie: up to 400 instead of the 800 range of the other shape cases). - Trivial updates to runme.m.	2019-08-23 14:18:09 +05:30
Field G. Van Zee	bb4a01f130	Added BLASFEO results to docs/PerformanceSmall.md. Details: - Updated the graphs linked in PerformanceSmall.md with BLASFEO results, and added documenting language accordingly. - Updated scripts in test/sup/octave to plot BLASFEO data. - Minor tweak to language re: how OpenBLAS was configured for docs/Performance.md.	2019-08-23 14:18:09 +05:30
Field G. Van Zee	ecfc223e62	Minor tweaks to test/sup. Details: - Changed starting problem and increment from 16 to 4. - Added 'lll' (square problems) to list of problem size shapes to compile and run with. - Define BLASFEO location and added BLASFEO-related definitions.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	55e7b045c3	Added sup performance graphs/document to 'docs'. Details: - Added a new markdown document, docs/PerformanceSmall.md, which publishes new performance graphs for Kaby Lake and Epyc showcasing the new BLIS sup (small/skinny/unpacked) framework logic and kernels. For now, only single-threaded dgemm performance is shown. - Reorganized graphs in docs/graphs into docs/graphs/large, with new graphs being placed in docs/graphs/sup. - Updates to scripts in test/sup/octave, mostly to allow decent output in both GNU octave and Matlab. - Updated README.md to mention and refer to the new PerformanceSmall.md document.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	5e03ca6fc7	Increased MT sup threshold for double to 201. Details: - Fine-tuned the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 180 to 201 for haswell and 180 to 256 for zen. - Updated octave scripts in test/sup/octave to include a seventh column to display performance for m = n = k.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	fb305d0837	Minor build system housekeeping. Details: - Commented out redundant setting of LIBBLIS_LINK within all driver- level Makefiles. This variable is already set within common.mk, and so the only time it should be overridden is if the user wants to link to a different copy of libblis. - Very minor changes to build/gen-make-frags/gen-make-frag.sh. - Whitespace and inconsequential quoting change to configure. - Moved top-level 'windows' directory into a new 'attic' directory.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	4f08619855	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en\|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. 
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	f73cef483e	Support row storage in Eigen gemm test/3 driver. Details: - Added preprocessor branches to test/3/test_gemm.c to explicitly support row-stored matrices. Column-stored matrices are also still supported (and are the default for now). (This is mainly residual work left over from initial integration of Eigen into the test drivers, so if we ever want to test Eigen with row-stored matrices, the code will be ready to use, even if it is not yet integrated into the Makefile in test/3.)	2019-08-23 14:18:08 +05:30
Field G. Van Zee	22768bf959	Updated Eigen results in docs/graphs with 3.3.90. Details: - Updated the level-3 performance graphs in docs/graphs with new Eigen results, this time using a development version cloned from their git mirror on March 27, 2019 (version 3.3.90). Performance is improved over 3.3.7, though still noticeably short of BLIS/MKL in most cases. - Very minor updates to docs/Performance.md and matlab scripts in test/3/matlab.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	ee44719d43	Added ability to plot with Eigen in test/3/matlab. Details: - Updated matlab scripts in test/3/matlab to optionally plot/display Eigen performance curves. Whether Eigen is plotted is determined by a new boolean function parameter, with_eigen. - Updated runme.m scratchpad to reflect the latest invocations of the plot_panel_4x5() function (with Eigen plotting enabled).	2019-08-23 14:18:08 +05:30
Field G. Van Zee	bd6cdd884b	Fixed mislabeled eigen output from test/3 drivers. Details: - Fixed the Makefile in test/3 so that it no longer incorrectly labels the matlab output variables from Eigen-linked hemm, herk, trmm, and trsm driver output as "vendor". (The gemm drivers were already correctly outputting matlab variables containing the "eigen" label.)	2019-08-23 14:18:08 +05:30
Field G. Van Zee	b495ca9b76	Link to Eigen BLAS for non-gemm drivers in test/3. Details: - Adjusted test/3/Makefile so that the test drivers are linked against Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do this since Eigen's headers don't define implementations to the standard BLAS APIs. - Simplified #included headers in hemm, herk, trmm, and trsm source driver files, since nothing specific to Eigen is needed at compile-time for those operations.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	728e9666e1	Add more support for Eigen to drivers in test/3. Details: - Use compile-time implementations of Eigen in test_gemm.c via new EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS library is not necessary.) However, as of Eigen 3.3.7, Eigen only parallelizes the gemm operation and not hemm, herk, trmm, trsm, or any other level-3 operation. - Fixed a bug in trmm and trsm drivers whereby the wrong function (bli_does_trans()) was being called to determine whether the object for matrix A should be created for a left- or right-side case. This was corrected by changing the function to bli_is_left(), as is done in the hemm driver. - Added support for running Eigen test drivers from runme.sh.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	a1c8b11b3f	Added Eigen support to test/3 Makefile, runme.sh. Details: - Added targets to test/3/Makefile that link against a BLAS library built by Eigen. It appears, however, that Eigen's BLAS library does not support multithreading. (It may be that multithreading is only available when using the native C++ APIs.) - Updated runme.sh with a few Eigen-related tweaks. - Minor tweaks to docs/Performance.md.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	c6793be46e	Added docs/Performance.md and docs/graphs subdir. Details: - Added a new markdown document, docs/Performance.md, which reports performance of a representative set of level-3 operations across a variety of hardware architectures, comparing BLIS to OpenBLAS and a vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs, in pdf and png formats, reside in docs/graphs. - Updated README.md to link to new Performance.md document. - Minor updates to CREDITS, docs/Multithreading.md. - Minor updates to matlab scripts in test/3/matlab.	2019-08-23 14:18:08 +05:30
Field G. Van Zee	da590746b0	Renamed test/3m4m to test/3. Details: - Renamed '3m4m' directory to '3', which captures the directory nicely since it builds test drivers to test level-3 operations. - These test drivers ceased to be used to test the 3m and 4m (or even 1m) induced methods long ago, hence the name change.	2019-08-23 14:18:07 +05:30
Field G. Van Zee	2ebe4aafe5	More minor updates and edits to test/3m4m. Details: - Further updates to matlab scripts, mostly for compatibility with GNU Octave. - More tweaks to runme.sh. - Updates to runme.m that allow copy-paste into matlab interactive session to generate graphs.	2019-08-23 14:18:07 +05:30
Field G. Van Zee	09ed72c5a7	Very minor updates to test/3m4m for ul252. Details: - Very minor updates to the newly revamped test/3m4m drivers when used on a Xeon Platinum (SkylakeX).	2019-08-23 14:18:07 +05:30
Field G. Van Zee	8beda64ea5	Overhauled test/3m4m Makefile and scripts. Details: - Rewrote much of Makefile to generate executables for single- and dual- socket multithreading as well as single-threaded. Each of the three can also use a different problem size range/increment, as is often appropriate when doubling/halving the number of threads. - Rewrote runme.sh script to flexibly execute as many threading parameter scenarios as is given in the input parameter string (currently set within the script itself). The string also encodes the maximum problem size for each threading scenario, which is used to identify the executable to run. Also improved the "progress" output of the script to reduce redundant info and improve readability in terminals that are not especially wide. - Minor updates to test_*.c source files. - Updated matlab scripts according to changes made to the Makefile, test drivers, and runme.sh script, and renamed 'plot_all.m' to 'runme.m'.	2019-08-23 14:18:07 +05:30
Field G. Van Zee	7918a6deca	Updates (from ls5) to test/3m4m/runme.sh. Details: - Lonestar5-specific updates to runme.sh.	2019-08-23 14:18:07 +05:30
Field G. Van Zee	84282bba54	Updates to 3m4m/matlab scripts. Details: - Minor updates to matlab graph-generating scripts. - Added a plot_all.m script that is more of a scratchpad for copying and pasting function invocations into matlab to generate plots that are presently of interest to us.	2019-08-23 14:18:07 +05:30
Field G. Van Zee	deda4ca8a0	Added test/1m4m driver directory. Details: - Added a new standalone test driver directory named '1m4m' that can build and run performance experiments for BLIS 1m, 4m1a, assembly, OpenBLAS, and the vendor library (MKL). This new driver directory was used to regenerate performance results for the 1m paper. - Added alternate (commented-out) cache blocksizes to config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to work well on a 12-core Intel Xeon E5-2650 v3.	2019-07-22 13:59:05 -05:00
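
Commit 4f08619855 above states the rule the sup front-end uses to accept a problem: at least one of m, n, k must be smaller than or equal to its threshold, otherwise control returns to the conventional implementation. A minimal sketch of that rule in C follows. The names `sup_thresh_t` and `sup_accepts` are hypothetical and chosen for illustration; they are not BLIS identifiers, and the real thresholds live in the cntx_t structure, whose layout is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the three per-dimension sup thresholds.
   This is NOT the BLIS cntx_t layout; it only illustrates the rule. */
typedef struct {
    size_t m;
    size_t n;
    size_t k;
} sup_thresh_t;

/* Returns true when the problem would be accepted by the sup handler,
   i.e. when at least one dimension is <= its corresponding threshold.
   Returns false when every dimension exceeds its threshold, in which
   case the conventional (packing) code path would run instead. */
static bool sup_accepts(const sup_thresh_t *t, size_t m, size_t n, size_t k)
{
    assert(t != NULL);
    return m <= t->m || n <= t->n || k <= t->k;
}
```

With all three thresholds set to zero, which the commit message gives as the default, `sup_accepts` is false for every problem with nonzero dimensions, matching the statement that the defaults effectively disable the implementation. A value such as 201 (the haswell double-precision MT threshold mentioned in commit 5e03ca6fc7) is used below purely as an illustrative number: with every threshold set to 201, a problem with m = 8 is accepted, while a 500 x 500 x 500 problem is not.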


288 Commits