amd/blis

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
Dipal M Zambare	1ec020b33e	AMD kernel updates; frame-specific AMD updates. (#597 ) Details: - Allow building BLIS with certain framework files (each with the '_amd' suffix) that have been customized by AMD for Zen-based hardware. These customized files were derived from portable versions of the same files (i.e., those without the '_amd' suffix). Whether the portable or AMD-specific files are compiled is now controlled by a new configure option, --[en|dis]able-amd-frame-tweaks. This option is disabled by default in vanilla BLIS, though AMD may choose to enable it by default in their fork. For now, the added AMD-specific files are: - bli_gemv_unf_var2_amd.c - bla_copy_amd.c - bla_gemv_amd.c These files reside in 'amd' subdirectories found within the directory housing their generic counterparts. - Register optimized real-domain copyv, setv, and swapv kernels in bli_cntx_init_zen.c. - Various minor updates to level-1v kernels in 'zen' kernel set. - Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to the 'zen' kernel set. - If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim, call gemv instead and return early. - Combined variable declarations with their initialization in various level-2 and level-3 BLAS compatibility files, and also inserted 'const' qualifier in those same declaration statements. - Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ . - Added copyv and swapv test drivers to 'test' directory. - Whitespace, comment changes.	2022-03-29 16:15:36 -05:00
Bhaskar Nallani	0db2bd5341	Added BLAS/CBLAS APIs for gemm3m. (#590 ) Details: - Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply invoke the 1m implementation unconditionally. (Note that these APIs bypass sup handling.) - Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h. - Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h. - Relocated: frame/compat/cblas/src/cblas_?gemmt.c files into frame/compat/cblas/src/extra/ - Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ . - Minor reorganization of prototypes and cpp macro directives in bli_blas.h, cblas.h, and cblas_f77.h. - Trivial whitespace change to cblas_zgemm.c.	2022-03-24 18:41:55 -05:00
Devin Matthews	54fa28bd84	Move edge cases to gemm ukr; more user-custom mods. (#583 ) Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather than one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct. - Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations.	2021-12-24 08:00:33 -06:00
Madan mohan Manokar	9be97c150e	Support all four dts in test/test_her[2][k].c (#578 ) Details: - Replaced the hard-coded calls to double-precision real syr, syr2, syrk, and syr2k in the corresponding standalone test drivers in the 'test' directory with conditional branches that will call the appropriate BLAS interface depending on which datatype is enabled. Thanks to Madan mohan Manokar for this improvement. - CREDITS file update.	2021-11-17 13:16:46 -06:00
Meghana-vankadari	7bc8ab485e	Added BLAS/CBLAS APIs for axpby, gemm_batch. (#566 ) Details: - Expanded the BLAS compatibility layer to include support for ?axpby_() and ?gemm_batch_(). The former is a straightforward BLAS-like interface into the axpbyv operation while the latter implements a batched gemm via loops over bli_?gemm(). Also expanded the CBLAS compatibility layer to include support for cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari for submitting these new APIs via #566. - Fixed a long-standing bug in common.mk that for some reason never manifested until now. Previously, CBLAS source files were compiled without the location of cblas.h being specified via a -I flag. I'm not sure why this worked, but it may be due to the fact that the cblas.h file resided in the same directory as all of the CBLAS source, and perhaps compilers implicitly add a -I flag for the directory that corresponds to the location of the source file being compiled. This bug only showed up because some CBLAS-like source code was moved into an 'extra' subdirectory of that frame/compat/cblas/src directory. After moving the code, compilation for those files failed (because the cblas.h header file, presumably, could not be found in the same location). This bug was fixed within common.mk by explicitly adding the cblas.h directory to the list of -I flags passed to the compiler. - Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory, and updated test/Makefile to build those drivers. - Fixed typo in error message string in cblas_sgemm.c.	2021-11-11 16:46:14 -06:00
Field G. Van Zee	f065a8070f	Removed support for 3m, 4m induced methods. Details: - Removed support for all induced methods except for 1m. This included removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any code that existed only to support those implementations. These implementations were rarely used and posed code maintenance challenges for BLIS's maintainers going forward. - Removed reference kernels for packm that pack 3m and 4m micropanels, and removed 3m/4m-related code from bli_cntx_ref.c. - Removed support for 3m/4m from the code in frame/ind, then reorganized and streamlined the remaining code in that directory. The ind(), nat(), and 1m() APIs were all removed. (These additional API layers no longer made as much sense with only one induced method (1m) being supported.) The bli_ind.c file (and header) were moved to frame/base and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to frame/3. - Removed 3m/4m support from the code in frame/1m/packm. - Removed 3m/4m support from trmm/trsm macrokernels and simplified some pointer arithmetic that was previously expressed in terms of the bli_ptr_inc_by_frac() static inline function (whose definition was also removed). - Removed the following subdirectories of level-0 macro headers from frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros defined in these directories were used exclusively for 3m and 4m method codes. - Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in light of 1m being the only induced method left within BLIS. - Removed dt_on_output field within auxinfo_t and its associated accessor functions. - Re-indexed the 1e/1r pack schemas after removing those associated with variants of the 3m and 4m methods. This leaves two bits unused within the pack format portion of the schema bitfield. (See bli_type_defs.h for more info.) - Spun off the basic and expert interfaces to the object and typed APIs into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c and bli_l3_tapi_ex.c. - Moved the level-3 operation-specific _check function calls from the operations' _front() functions to the corresponding _ex() function of the object API. (This change roughly maintains where the _check() functions are called in the call stack but lays the groundwork for future changes that may come to the level-3 object APIs.) Minor modifications to bli_l3_check.c to allow the check() functions to be called from the expert interface APIs. - Removed support within the testsuite for testing the aforementioned induced methods, and updated the standalone test drivers in the 'test' directory to reflect the retirement of those induced methods. - Modified the sandbox contract so that the user is obliged to define bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light of the nat() functions no longer existing.) Also updated the existing 'power10' and 'gemmlike' sandboxes to come into compliance with the new sandbox rules. - Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation to reflect the retirement of 3m/4m, and also modified Sandboxes.md to bring the document into alignment with new conventions. - Updated various comments; removed segments of commented-out code.	2021-10-28 16:05:43 -05:00
Field G. Van Zee	cc9206df66	Added Graviton2 Neoverse N1 performance results. Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on a Graviton2 Neoverse N1 server. Special thanks to Nicholai Tukanov for collecting these results via the Arm-HPC/AWS hackathon. - Corrected what was supposed to be a temporary tweak to the legend labels in test/3/octave/plot_l3_perf.m.	2021-07-16 15:48:37 -05:00
Field G. Van Zee	82af05f54c	Updated Fugaku (a64fx) performance results. Details: - Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx entry within Performance.md, and also updated the experiment details accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2 experiments reflected in this commit. - In Performance.md, added an English translation of the project name under which the Fugaku results were gathered, courtesy of RuQing Xu.	2021-05-25 15:25:08 -05:00
Field G. Van Zee	ba3ba8da83	Minor updates and fixes to test/3/octave scripts. Details: - Fixed an issue where the wrong string was being passed in for the vendor legend string. - Changed the graph in which the legends appear. - Updates to runthese.m.	2021-04-06 18:39:58 -05:00
Field G. Van Zee	159ca6f01a	Made test/3/octave scripts robust to missing data. Details: - Modified the octave scripts in test/3 so that the script does not choke when one or more of the expected OpenBLAS, Eigen, or vendor data files is missing. (The BLIS data set, however, must be complete.) When a file is missing, that data series is simply not included on that particular graph. Also factored out a lot of the redundant logic from plot_panel_4x5.m into a separate function in read_data.m.	2021-03-24 15:57:32 -05:00
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457 ) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including: - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them. - Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00
Field G. Van Zee	7677e9ba60	Merge branch 'dev' of github.com:flame/blis into dev	2020-10-09 15:41:25 -05:00
Field G. Van Zee	addcd46b05	Added Epyc 7742 Zen2 ("Rome") sup perf results. Details: - Added single-threaded and multithreaded sup performance results to docs/PerformanceSmall.md for both sgemm and dgemm. These results were gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2 microarchitecture. Special thanks to Jeff Diamond for facilitating access to the system via the Oracle Cloud. - Updates to octave scripts in test/sup/octave for use with Octave 5.2 and for use with subplot_tight(). - Minor updates to octave scripts in test/3/octave. - Renamed files containing the previous Zen performance results for consistency with the new results. - Decreased line thickness slightly in large/conventional Zen2 graphs. I'm done tweaking those this time. Really. - Added missing line regarding eigen header installation for each microarchitecture section.	2020-10-09 15:41:09 -05:00
Field G. Van Zee	a0849d390d	Register l3 sup kernels in zen2 subconfig. Details: - Registered full suite of sgemm and dgemm sup millikernels, blocksizes, and crossover thresholds in bli_cntx_init_zen2.c. - Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742 system.	2020-10-09 20:22:17 +00:00
Field G. Van Zee	bc4a213a2c	Updated matlab (now octave) plot code in test/3. Details: - Renamed test/3/matlab to test/3/octave. - Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m files for use with octave (which is free and doesn't crash on me mid-way through my use of subplot). - Updated runthese.m scratchpad for zen2 invocations. - Added Nikolay S.'s subplot_tight() function, along with its license.	2020-09-30 15:28:20 -05:00
Field G. Van Zee	c77ddc4181	Added optional numactl usage to test/3/runme.sh.	2020-09-30 20:15:43 +00:00
Field G. Van Zee	4fd8d9fec2	Tweaked zen2 subconfig's MC cache blocksizes. Details: - Updated the MC cache blocksizes registered by the 'zen2' subconfig. - Minor updates to test/3/Makefile and test/3/runme.sh.	2020-09-28 23:39:05 +00:00
Field G. Van Zee	e293cae2d1	Implemented sgemmsup assembly kernels. Details: - Created a set of single-precision real millikernels and microkernels comparable to the dgemmsup kernels that already exist within BLIS. - Added prototypes for all kernels within bli_kernels_haswell.h. - Registered entry-point millikernels in bli_cntx_init_haswell.c and bli_cntx_init_zen.c. - Added sgemmsup support to the Makefile, runme.sh script, and source file in test/sup. This included edits that allow for separate "small" dimensions for single- and double-precision as well as for single- vs. multithreaded execution.	2020-09-15 16:09:11 -05:00
Field G. Van Zee	00e14cb6d8	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-07-29 14:24:34 -05:00
Field G. Van Zee	e01dd12558	Fail-safe updates to Makefiles in 'test' dir. Details: - Updated Makefiles in test, test/3, and test/sup so that running any of the usual targets without having first built BLIS results in a helpful error message. For example, if BLIS is not yet configured, make will output: Makefile:327: *** Cannot proceed: config.mk not detected! Run configure first. Stop. Similarly, if BLIS is configured but not yet built, make will output: Makefile:340: *** Cannot proceed: BLIS library not yet built! Run make first. Stop. In previous commits, these actions would result in a rather cryptic make error such as: make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x', needed by 'blis-nat-st'. Stop.	2020-07-24 15:41:46 -05:00
Field G. Van Zee	b37634540f	Support ldims, packing in sup/test drivers. Details: - Updated the test/sup source file (test_gemm.c) and Makefile to support building matrices with small or large leading dimensions, and updated runme.sh to support executing both kinds of test drivers. - Updated runme.sh to allow for executing sup drivers with unpacked (the default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B environment variables), and for capturing output to files that encode both the leading dimension (small or large) and packing status into the filenames. - Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt into test/sup/octave and updated the octave code in that consolidated directory to read the new output filename format (encoding ldim and packing). Also added comments and streamlined code, particularly in plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0. - Moved old octave_st, octave_mt directories to test/sup/old.	2020-06-25 16:05:12 -05:00
Field G. Van Zee	2e3b3782cf	Merge branch 'master' into amd	2020-04-06 14:55:35 -05:00
Field G. Van Zee	2bca03ea9d	Updates, tweaks to runme.sh in test/1m4m. Details: - Made several updates to test/1m4m/runme.sh, including: - Added missing handling for 1m and 4m1a implementations when setting the BLIS_??_NT environment variables. - Added support for using numactl to run the test executables. - Several other cleanups.	2020-03-28 22:10:00 +00:00
Field G. Van Zee	8c3d9b9eeb	Merge branch 'amd' of github.com:flame/blis into amd	2020-03-10 14:03:33 -05:00
Field G. Van Zee	71249fe8dd	Merged test/sup, test/supmt into test/sup. Details: - Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able to compile and run both single-threaded and multithreaded experiments. This should help with maintenance going forward. - Created a test/sup/octave_st directory of scripts (based on the previous test/sup/octave scripts) as well as a test/sup/octave_mt directory (based on the previous test/supmt/octave scripts). The octave scripts are slightly different and not easily mergeable, and thus for now I'll maintain them separately. - Preserved the previous test/sup directory as test/sup/old/supst and the previous test/supmt directory as test/sup/old/supmt.	2020-03-10 13:55:29 -05:00
Field G. Van Zee	0f9e0399e1	Updated sup performance graphs; added mt results. Details: - Reran all existing single-threaded performance experiments comparing BLIS sup to other implementations (including the conventional code path within BLIS), using the latest versions (where appropriate). - Added multithreaded results for the three existing hardware types showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc (Zen1). - Various minor updates to the text in docs/PerformanceSmall.md. - Updates to the octave scripts in test/sup/octave, test/supmt/octave.	2020-03-05 17:03:21 -06:00
Field G. Van Zee	90db88e572	Updated sup[mt] Makefiles for variable dim ranges. Details: - Updated test/sup/Makefile and test/supmt/Makefile to allow specifying different problem size ranges for the drivers where one, two, or three matrix dimensions is large. This will facilitate the generation of more meaningful graphs, particularly when two dimensions are tiny.	2020-03-02 15:06:48 -06:00
Field G. Van Zee	31f11a06ea	Updates to octave scripts in test/sup[mt]/octave. Details: - Optimized scripts in test/sup/octave and test/supmt/octave for use with octave 5.2.0 on Ubuntu 18.04. - Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m, which were not only unnecessary but also causing issues with versions 5.x.	2020-02-27 14:33:20 -06:00
Field G. Van Zee	c0558fde45	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. - Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_*_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates.	2020-02-17 14:08:08 -06:00
Field G. Van Zee	c84391314d	Reverted minor temp/wspace changes from `b426f9e`. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files.	2019-11-04 13:57:12 -06:00
Nicholai Tukanov	b426f9e04e	POWER9 DGEMM (#355 ) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and not xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list.	2019-11-01 17:57:03 -05:00
Field G. Van Zee	29b0e1ef4e	Code review + tweaks to AMD's AOCL 2.0 PR (#349 ). Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertently not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS. - Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly-added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used. - Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes.	2019-10-11 10:24:24 -05:00
Field G. Van Zee	80e6c10b72	Added reproduction section to Performance docs. Details: - Added section titled "Reproduction" to both Performance.md and PerformanceSmall.md that briefly nudges the motivated reader in the right direction if he/she wishes to run the same performance benchmarks used to produce the graphs shown in those documents. Thanks to Dave Love for making this suggestion.	2019-08-29 12:12:08 -05:00
Field G. Van Zee	b02e0aae8c	Updated test drivers to iterate backwards. Details: - Updated test driver source in test, test/3, test/1m4m, and test/mixeddt to iterate through the problem space backwards. This can help avoid certain situations where the CPU frequency does not immediately throttle up to its maximum. Thanks to Robert van de Geijn for recommending this fix (originally made to test/sup drivers in `57e422a`). - Applied off-by-one matlab output bugfix from `b6017e5` to test drivers in test, test/3, test/1m4m, and test/mixeddt directories.	2019-08-27 14:37:46 -05:00
Field G. Van Zee	b6017e53f4	Bugfix of output text + tweaks to test/sup driver. Details: - Fixed an off-by-one bug in the output of matlab row indices in test/sup/test_gemm.c that only manifested when the problem size increment was equal to 1. - Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage combinations for blissup drivers in test/sup. This helps make the building of drivers complete sooner. - Trivial changes to test/sup/runme.sh.	2019-08-27 14:18:14 -05:00
Field G. Van Zee	40781774df	Updated sup performance graphs with libxsmm. Details: - Added libxsmm to column-stored sup graphs presented in docs/PerformanceSmall.md. - Updated sup results for BLASFEO. - Added sup results for Lonestar5 (Haswell). - Addresses issue #326.	2019-08-26 16:47:37 -05:00
Field G. Van Zee	4a0a6e89c5	Changed test/sup alpha to 1; test libxsmm+netlib. Details: - Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is needed because libxsmm currently only optimizes gemm operations where alpha is unit (and beta is unit or zero). - Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its fallback library. This is the library that will be called when the problem dimensions are deemed too large, or any other criteria for optimization are not met. (This was done not because it is realistic, but rather so that it would be very clear when libxsmm ceased handling gemm calls internally when the data are graphed.)	2019-08-24 15:25:16 -05:00
Field G. Van Zee	7aa52b5783	Use libxsmm API in test/sup; add missing -ldl. Details: - Switch the driver source in test/sup so that libxsmm_?gemm() is called instead of ?gemm_() when compiling for / linking against libxsmm. libxsmm's documentation isn't clear on whether it is even trying to provide BLAS API compatibility, and I got tired of trying to figure it out. - Added missing -ldl in LDFLAGS when linking against libxsmm.	2019-08-23 16:12:50 -05:00
Field G. Van Zee	57e422aa16	Added libxsmm support to test/sup drivers. Details: - Modified test/sup/Makefile to build drivers that test the performance of skinny/small problems via libxsmm. - Modified test/sup/runme.sh to run aforementioned drivers. - Modified test/sup/test_gemm.c so that problem sizes are tested in reverse order (from largest to smallest). This can help avoid certain situations where the CPU frequency does not immediately throttle up to its maximum. Thanks to Robert van de Geijn for recommending this fix.	2019-08-23 14:17:52 -05:00
Field G. Van Zee	deda4ca8a0	Added test/1m4m driver directory. Details: - Added a new standalone test driver directory named '1m4m' that can build and run performance experiments for BLIS 1m, 4m1a, assembly, OpenBLAS, and the vendor library (MKL). This new driver directory was used to regenerate performance results for the 1m paper. - Added alternate (commented-out) cache blocksizes to config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to work well on a 12-core Intel Xeon E5-2650 v3.	2019-07-22 13:59:05 -05:00
Field G. Van Zee	c4cc6fa702	New cntx_t blksz "set" functions + misc tweaks. Details: - Defined two new static functions in bli_cntx.h: bli_cntx_set_blksz_def_dt() bli_cntx_set_blksz_max_dt() which developers may find convenient when experimenting with different values of cache blocksizes. - Updated one- and two-socket multithreaded problem size range and increment values in test/3/Makefile. - Changed default to column storage in test/3/test_gemm.c. - Fixed typo in comment in testsuite/src/test_subm.c.	2019-07-16 13:00:35 -05:00
Field G. Van Zee	c152109e9a	Updated BLASFEO results in PerformanceSmall.md. Details: - Updated the BLASFEO performance graphs shown in PerformanceSmall.md using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md accordingly. - Updated test/sup/octave/plot_l3sup_perf.m so that the .m files containing the mpnpkp results do not need to be preprocessed in order to plot half the problem size range (i.e., up to 400 instead of the 800 range of the other shape cases). - Trivial updates to runme.m.	2019-06-19 13:23:24 -05:00
Field G. Van Zee	cbaa22e1ca	Added BLASFEO results to docs/PerformanceSmall.md. Details: - Updated the graphs linked in PerformanceSmall.md with BLASFEO results, and added documenting language accordingly. - Updated scripts in test/sup/octave to plot BLASFEO data. - Minor tweak to language re: how OpenBLAS was configured for docs/Performance.md.	2019-06-04 16:06:58 -05:00
Field G. Van Zee	763fa39c30	Minor tweaks to test/sup. Details: - Changed starting problem and increment from 16 to 4. - Added 'lll' (square problems) to list of problem size shapes to compile and run with. - Define BLASFEO location and added BLASFEO-related definitions.	2019-06-04 14:46:45 -05:00
Field G. Van Zee	09ba05c6f8	Added sup performance graphs/document to 'docs'. Details: - Added a new markdown document, docs/PerformanceSmall.md, which publishes new performance graphs for Kaby Lake and Epyc showcasing the new BLIS sup (small/skinny/unpacked) framework logic and kernels. For now, only single-threaded dgemm performance is shown. - Reorganized graphs in docs/graphs into docs/graphs/large, with new graphs being placed in docs/graphs/sup. - Updates to scripts in test/sup/octave, mostly to allow decent output in both GNU octave and Matlab. - Updated README.md to mention and refer to the new PerformanceSmall.md document.	2019-06-03 16:53:19 -05:00
Field G. Van Zee	6bf449cc69	Merge branch 'amd'	2019-05-31 17:42:40 -05:00
Field G. Van Zee	a4e8801d08	Increased MT sup threshold for double to 201. Details: - Fine-tuned the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 180 to 201 for haswell and 180 to 256 for zen. - Updated octave scripts in test/sup/octave to include a seventh column to display performance for m = n = k.	2019-05-31 17:30:51 -05:00
kdevraje	13806ba3b0	Updated copyright information to (start year) - 2019. Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041	2019-05-27 16:24:43 +05:30
Field G. Van Zee	057f5f3d21	Minor build system housekeeping. Details: - Commented out redundant setting of LIBBLIS_LINK within all driver-level Makefiles. This variable is already set within common.mk, and so the only time it should be overridden is if the user wants to link to a different copy of libblis. - Very minor changes to build/gen-make-frags/gen-make-frag.sh. - Whitespace and inconsequential quoting change to configure. - Moved top-level 'windows' directory into a new 'attic' directory.	2019-05-23 12:51:17 -05:00
kdevraje	a3554eb1dc	Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis to configure zen2 Change-Id: I97e17bca9716b80b862925f97bb513c07b4b0cae	2019-05-23 11:53:32 +05:30

178 Commits