amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-13 02:25:39 +00:00

Author	SHA1	Message	Date
Dipal Madhukar Zambare	821fa267c9	Merge "Updated makefiles to fix issues introduced in merge" into amd-staging-milan-3.1	2021-05-05 23:42:15 -04:00
Dipal M Zambare	7454cca9e7	Updated makefiles to fix issues introduced in merge - Updated Makefile to include DTL files in library build - Updated Makefile to include cpp header file installation - Updated test/makefile to include extra API added by AMD team. AMD-Internal: [CPUPL-1559] Change-Id: I249c6935d5ff5fb645f9deec7e0218575484be13	2021-05-05 14:59:15 +05:30
Nallani Bhaskar	f917d826b5	Updated test application to work with row major cblas Details: 1. Fixed reading leading dimensions in test_gemm.c based on row/col major 2. Reduced redundant code and adjusted alignment Change-Id: I8ca8c81223386fc21c6cc7c1d8f8a2109c9f5343	2021-05-02 23:09:13 +05:30
lcpu	7401effc03	BLIS:merge: Fixed merge conflicts that arose while downstreaming BLIS code from master to milan-3.1 branch Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
satish kumar nuggu	c0d16f70e4	Added CBLAS API interface and memory alignment check Details: 1. Added CBLAS API in test_gemm.c and test_trsm.c 2. Creating matrixes with memory aligned leading dimensions irrespective of dimension. 3. Shifting powers of 2 aligned memory to non-powers of 2 by adding extra cache line size at the end. AMD-Internal: [CPUPL-1500] [CPUPL-1450] Change-Id: I4180cec29b62d0388b974abee3e9b699cce3af6a	2021-04-26 13:44:30 +05:30
Field G. Van Zee	ba3ba8da83	Minor updates and fixes to test/3/octave scripts. Details: - Fixed an issue where the wrong string was being passed in for the vendor legend string. - Changed the graph in which the legends appear. - Updates to runthese.m.	2021-04-06 18:39:58 -05:00
Kiran Varaganti	657f58b82b	Fix test_dotv.c when complex-return=intel When BLIS is built with --complex-return=intel, the zdotu_ and cdotu_ function prototype changes. Now "return parameter" will become the first argument of these functions and these functions return void. This fix addresses the change in the function declarations when --complex-return=intel is enabled. Missing Trace and Log statements for this configuration are now added. [CPUPL-1376] Change-Id: Ib420989da71839211c16088bf431a2ad775a3978	2021-04-06 15:38:12 +05:30
Field G. Van Zee	159ca6f01a	Made test/3/octave scripts robust to missing data. Details: - Modified the octave scripts in test/3 so that the script does not choke when one or more of the expected OpenBLAS, Eigen, or vendor data files is missing. (The BLIS data set, however, must be complete.) When a file is missing, that data series is simply not included on that particular graph. Also factored out a lot of the redundant logic from plot_panel_4x5.m into a separate function in read_data.m.	2021-03-24 15:57:32 -05:00
nphaniku	b3628cdfd3	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for build with Clang compiler. 2. CMake script changes for build test and testsuite based on the lib type ST/MT 3. CMake script changes for testcpp and blastest 4. Added python scripts to support library build and testsuite build. AMD Internal : [CPUPL-1422] Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898	2021-03-08 19:04:17 +05:30
Kumar, Phani	477fc41fff	Cmake script changes and blis.h changes for amd-staging-milan-3.0 AMD Internal : [CPUPL-1083] Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580	2020-11-24 06:12:25 -05:00
Dipal M Zambare	0a3d94c9a2	Updated test drivers for dotv, scalv and swapv. Added traces in cblas layer for these API's. These test drivers didn't have calls for complex data types, the drivers are updated to support them. AMD-Internal : [CPUPL-1315] Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b	2020-11-24 10:26:32 +05:30
Madan mohan Manokar	1d8fab0996	Test driver fix for her and her2 Fixed test driver code for her, her2 Support added to handle complex and double complex data type in test driver. Change-Id: If65939e99d8cf77e0fb70561166d84bf67d0321d AMD-Internal: [CPUPL-1326]	2020-11-23 04:10:43 -05:00
Madan Mohan Manokar	35d33bab6a	Merge "Test driver fix for her, her2, herk and her2k" into amd-staging-milan-3.0	2020-11-19 05:47:31 -05:00
Madan mohan Manokar	38698f0dfd	Test driver fix for her, her2, herk and her2k Fixed test driver code for her, her2, herk and her2k function. Above functions supports only complex and double complex data type, test code is updated accordingly. Change-Id: Iee7b79abda4a2959a265c420d23879bf47f2c38d AMD-Internal: [CPUPL-1313]	2020-11-19 12:58:21 +05:30
satish kumar nuggu	17f994bd15	Added Blas interface for ?imatcopy, ?omatcopy, ?omatadd, ?omatcopy2 AMD-Internal: [CPUPL-1116] Original review was in this commit http://gerrit-git.amd.com/c/cpulibraries/er/blis/+/428165. Added new commit for transpose API's Change-Id: I322389cc0be0aaccf82d1d0bb4476beea8694cd8	2020-11-18 12:55:36 +05:30
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them.
- Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00
managalv	aae48c2221	Optimised AXPYF routine for complex float and complex double Details: - Added SIMD code - Processing 5 rows at a time in SIMD loop to improve performance AMD-Internal: [CPUPL-1054] Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a	2020-11-06 18:42:13 +05:30
managalv	68b4ff976b	Added Function trace and Input logging for dotv and gemv Change-Id: I992bd80b2322d6c387f609ecd70c1109c13f6254 AMD-Internal: [CPUPL-1274]	2020-11-06 03:53:43 +05:30
Meghana Vankadari	0775f09b41	Added debug trace and log support for copy and ger routines Change-Id: Id7fb64c0a626b2f8f53e89ee7df4391693eb4f4c	2020-11-02 22:56:58 -05:00
Kiran Varaganti	65daaab6ac	Fix Bug in DTL When library is built as single thread and trace is enabled, the test applications in test folder fail to compile. In the file aoclos.c the function AOCL_gettid() uses omp_get_thread_num() to get thread_id, which is only enabled when OpenMP based parallel BLIS library is generated. To fix this in single thread case we now return zero for thread id, openmp function is used only when BLIS_ENABLE_OPENMP macro is defined. However this is not a complete fix. If library is built with pthread, AOCL_gettid() always returns 0, which is not the intended behaviour. Change-Id: I5b79ed57d27d0022d3dcab0e2a3a557c8e4ff8ee	2020-11-02 12:05:09 +05:30
bhaskarn	376ac5856b	Added BLAS Extension API's: CBLAS_?GEMM3M AMD-Internal: [CPUPL-1151] Induced 3M1 method is enabled for CGEMM3M and ZGEMM3M Change-Id: I8276c5018340d0a45694551f48aad5b735819eae	2020-10-29 17:06:30 +05:30
Nageshwar Singh	dd5b38d221	Added BLIS, BLAS, and CBLAS interface for cblas?amin Details: - Amin api returns index of minimum absolute value in a vector. - Added amin reference blis kernel. - Added blas and cblas interface for amin. AMD-Internal: [CPUPL-1155] Change-Id: I89c1e37e86950a4582bba70a5d8fc70ac915bd3c	2020-10-28 17:50:27 +05:30
Nageshwar Singh	dbd7b28373	Development of AVX2 axpyv kernels for c and z datatypes. Details - Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv). - Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family. AMD-Internal: [CPUPL-1231] Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7	2020-10-23 09:33:35 +05:30
Nageshwar Singh	b245ea9c65	cblas_?cabs1 test cblas header file bug Details: - Added BLIS_ENABLE_CBLAS around cblas header file. AMD-Internal: [CPUPL-1129] Change-Id: I3baacd26aa96c8eeb753d95210817ffe9b2a3f85	2020-10-13 01:43:55 -04:00
Field G. Van Zee	7677e9ba60	Merge branch 'dev' of github.com:flame/blis into dev	2020-10-09 15:41:25 -05:00
Field G. Van Zee	addcd46b05	Added Epyc 7742 Zen2 ("Rome") sup perf results. Details: - Added single-threaded and multithreaded sup performance results to docs/PerformanceSmall.md for both sgemm and dgemm. These results were gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2 microarchitecture. Special thanks to Jeff Diamond for facilitating access to the system via the Oracle Cloud. - Updates to octave scripts in test/sup/octave for use with Octave 5.2 and for use with subplot_tight(). - Minor updates to octave scripts in test/3/octave. - Renamed files containing the previous Zen performance results for consistency with the new results. - Decreased line thickness slightly in large/conventional Zen2 graphs. I'm done tweaking those this time. Really. - Added missing line regarding eigen header installation for each microarchitecture section.	2020-10-09 15:41:09 -05:00
Field G. Van Zee	a0849d390d	Register l3 sup kernels in zen2 subconfig. Details: - Registered full suite of sgemm and dgemm sup millikernels, blocksizes, and crossover thresholds in bli_cntx_init_zen2.c. - Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742 system.	2020-10-09 20:22:17 +00:00
Meghana Vankadari	016885348c	Added CBLAS interface and test file for gemm_batch API AMD-Internal: [CPUPL-1184] Change-Id: Icc5c41429b0d92f1a66a955769cc0518ca4706ee	2020-10-07 06:55:08 -04:00
Kiran Varaganti	aa56c36b82	Fixed Logs & code cleanup Fixed AOCL DTL logs printing incorrect alpha and beta values for single precision. Added missing info like data-types, lower or upper triangular and Side parameter in the case of TRSM. Code cleanup and formatting the files test_cabs1.c, test_axpbyv.c and test_gemm.c. In dumping trsm parameters replaced 'side' with bli_is_right(side). Change-Id: Ic81503ae696956eb074ec208f7109d1a394183d7	2020-10-04 22:17:31 +05:30
Field G. Van Zee	bc4a213a2c	Updated matlab (now octave) plot code in test/3. Details: - Renamed test/3/matlab to test/3/octave. - Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m files for use with octave (which is free and doesn't crash on me mid-way through my use of subplot). - Updated runthese.m scratchpad for zen2 invocations. - Added Nikolay S.'s subplot_tight() function, along with its license.	2020-09-30 15:28:20 -05:00
Field G. Van Zee	c77ddc4181	Added optional numactl usage to test/3/runme.sh.	2020-09-30 20:15:43 +00:00
Field G. Van Zee	4fd8d9fec2	Tweaked zen2 subconfig's MC cache blocksizes. Details: - Updated the MC cache blocksizes registered by the 'zen2' subconfig. - Minor updates to test/3/Makefile and test/3/runme.sh.	2020-09-28 23:39:05 +00:00
Nageshwar Singh	5243da5cec	BLIS: CBLAS Extensions. cblas_?cabs1 : Absolute value of a complex number cabs1 Details: - added cblas extension cblas_?cabs1. - Functionality : res=|Re(z)|+|Im(z)|, z is a complex number, and res is a value containing the absolute value of a complex number z. AMD-Internal: [CPUPL-1129] Change-Id: I4a3c265c89527c8fd3060c5d2ed38b1953ce6343	2020-09-23 19:01:02 +05:30
Meghana Vankadari	80828f6fda	Added few changes in test_axpbyv.c file Details: - Corrected "#if" directive in line 89 - Commented out "#define print" to disable printing the vectors Change-Id: I9ec3cbfb716540dd3e2264f5c3925d9e0c0c294a	2020-09-16 23:48:01 -04:00
Kiran Varaganti	5a8bd9f41c	Enable BLAS interface in BLIS Modified Makefile in test folder to enable calling BLAS interfaces for BLIS as well. This is possible by replacing -DBLIS with -DBLAS=\"aocl\" in the makefile. Also added linking to multi-threaded MKL library. Change-Id: Iccf2ec99b48bb35da985b69218bc680f678ff7c9	2020-09-16 23:26:44 +05:30
Meghana Vankadari	6c9bf36424	Added BLAS and CBLAS interfaces for axpby API Details: - The axpby routines perform a vector-vector operation defined as y = ax + by where a, b are scalars and x, y are vectors. - This API is part of BLAS-like extension APIs Change-Id: I17a53b03bba97de7ae1995a9f086084bd241bcdc AMD-Internal: [CPUPL-1118]	2020-09-16 09:49:31 +05:30
Field G. Van Zee	e293cae2d1	Implemented sgemmsup assembly kernels. Details: - Created a set of single-precision real millikernels and microkernels comparable to the dgemmsup kernels that already exist within BLIS. - Added prototypes for all kernels within bli_kernels_haswell.h. - Registered entry-point millikernels in bli_cntx_init_haswell.c and bli_cntx_init_zen.c. - Added sgemmsup support to the Makefile, runme.sh script, and source file in test/sup. This included edits that allow for separate "small" dimensions for single- and double-precision as well as for single- vs. multithreaded execution.	2020-09-15 16:09:11 -05:00
dzambare	267a959af1	Rebased amd-staging-milan-3.0 branch on master -- Rebased on top of master commit # `6e522e5823` -- Updated merged code to remove duplicated code added by auto-merging -- Updated merged code to rename bool_t type -- Updated merged code to rename bli_thread_obarrier -- Updated merged code to rename bli_thread_obroadcast Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c AMD-Internal: [CPUPL-1067]	2020-08-06 10:09:29 +05:30
phakumar	c7a914411f	BLIS library porting on to Windows: GEMMT changes porting on to Windows AMD Internal : [CPUPL-1061] Change-Id: I587d1789cd29ea18b04f8ab43e5742b4d902067a	2020-08-06 10:09:29 +05:30
Field G. Van Zee	26cd966af7	Merged test/sup, test/supmt into test/sup. Details: - Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able to compile and run both single-threaded and multithreaded experiments. This should help with maintenance going forward. - Created a test/sup/octave_st directory of scripts (based on the previous test/sup/octave scripts) as well as a test/sup/octave_mt directory (based on the previous test/supmt/octave scripts). The octave scripts are slightly different and not easily mergeable, and thus for now I'll maintain them separately. - Preserved the previous test/sup directory as test/sup/old/supst and the previous test/supmt directory as test/sup/old/supmt. Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed	2020-08-06 10:09:28 +05:30
Field G. Van Zee	2839b88f00	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. 
- Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_*_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates. AMD-Internal: [CPUPL-713] Change-Id: I9536648e7befac4d2dc17805e44ef34470961662	2020-08-06 10:09:28 +05:30
Field G. Van Zee	068072b6b4	Updated BLASFEO results in PerformanceSmall.md. Details: - Updated the BLASFEO performance graphs shown in PerformanceSmall.md using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md accordingly. - Updated test/sup/octave/plot_l3sup_perf.m so that the .m files containing the mpnpkp results do not need to be preprocessed in order to plot half the problem size range (ie: up to 400 instead of the 800 range of the other shape cases). - Trivial updates to runme.m.	2020-08-03 11:48:43 +05:30
Field G. Van Zee	50c0ffe390	Added BLASFEO results to docs/PerformanceSmall.md. Details: - Updated the graphs linked in PerformanceSmall.md with BLASFEO results, and added documenting language accordingly. - Updated scripts in test/sup/octave to plot BLASFEO data. - Minor tweak to language re: how OpenBLAS was configured for docs/Performance.md.	2020-08-03 11:48:43 +05:30
Field G. Van Zee	89105695cc	Added sup performance graphs/document to 'docs'. Details: - Added a new markdown document, docs/PerformanceSmall.md, which publishes new performance graphs for Kaby Lake and Epyc showcasing the new BLIS sup (small/skinny/unpacked) framework logic and kernels. For now, only single-threaded dgemm performance is shown. - Reorganized graphs in docs/graphs into docs/graphs/large, with new graphs being placed in docs/graphs/sup. - Updates to scripts in test/sup/octave, mostly to allow decent output in both GNU octave and Matlab. - Updated README.md to mention and refer to the new PerformanceSmall.md document.	2020-08-03 11:48:43 +05:30
Field G. Van Zee	889b90888f	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. 
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2020-08-03 11:48:42 +05:30
Field G. Van Zee	58bede1460	Link to Eigen BLAS for non-gemm drivers in test/3. Details: - Adjusted test/3/Makefile so that the test drivers are linked against Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do this since Eigen's headers don't define implementations to the standard BLAS APIs. - Simplified #included headers in hemm, herk, trmm, and trsm source driver files, since nothing specific to Eigen is needed at compile-time for those operations.	2020-08-03 11:47:40 +05:30
Field G. Van Zee	e6ba82f11c	Add more support for Eigen to drivers in test/3. Details: - Use compile-time implementations of Eigen in test_gemm.c via new EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS library is not necessary.) However, as of Eigen 3.3.7, Eigen only parallelizes the gemm operation and not hemm, herk, trmm, trsm, or any other level-3 operation. - Fixed a bug in trmm and trsm drivers whereby the wrong function (bli_does_trans()) was being called to determine whether the object for matrix A should be created for a left- or right-side case. This was corrected by changing the function to bli_is_left(), as is done in the hemm driver. - Added support for running Eigen test drivers from runme.sh.	2020-08-03 11:47:40 +05:30
Field G. Van Zee	8db5abba14	Updates to 3m4m/matlab scripts. Details: - Minor updates to matlab graph-generating scripts. - Added a plot_all.m script that is more of a scratchpad for copying and pasting function invocations into matlab to generate plots that are presently of interest to us.	2020-08-03 11:37:41 +05:30
Chithra Sankar	7834ee191d	test folder files reverted to previous commit	2020-08-03 11:32:37 +05:30
Chithra Sankar	1052413e10	Return typename corrected in dot function	2020-08-03 11:31:56 +05:30


258 Commits