amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-13 10:35:38 +00:00

Author	SHA1	Message	Date
Nallani Bhaskar	e328bdc549	Added prefetch in left cases of dtrsm small Details: 1. Added prefetching of the next micro-panel of A and B in the dgemm block, which reduces load latency and improves performance. 2. Removed unnecessary unrolls in gemm loops, moved the 8x6 and 6x8 core dgemm into macros, and made it more modular. 3. Packing and diagonal packing in main dgemm loops are modularized. Fringe cases are yet to be modularized. 4. Updated dtrsm small thresholds for single and multi thread cases 5. Updated div/scale based on disable/enable of trsm pre-inversion 6. Code clean up Change-Id: I5de16805ff050a31d2b424bb3f6ae0a4019332df	2021-06-15 23:15:22 +05:30
Dipal M Zambare	21130ebece	Added configure option for AOCL Dynamic feature. - AOCL Dynamic feature is added in BLIS which determines optimal number of threads for the current problem size. - This feature can be enabled/disabled by modifying the source code - This change adds support to enable/disable this feature during configuration time by adding a new option in configure script AOCL-Internal : [CPUPL-1565] Change-Id: I590693f793cabc44d27a7f815adc41631dd01bbe	2021-05-12 00:41:13 -04:00
Meghana Vankadari	eea347b02e	Added dynamic threading support for GEMM SUP code path Details: - Introduced new feature called AOCL_DYNAMIC. - When this macro is defined, the optimum number of threads to solve DGEMM is estimated based on the dimensions (M,N,K). - Range of optimum number of threads will be [1, num_threads], where "num_threads" is the number of threads set by the application. - num_threads is derived from either the environment variable "OMP_NUM_THREADS" or "BLIS_NUM_THREADS", or the bli_set_num_threads() API. - Only the local copy of rntm is modified by the AOCL_DYNAMIC feature. The global_rntm data structure remains unchanged in order to keep track of the original number of threads set by the application. - Optimum number of threads calculation is done only for SUP. - Since the 'native' code path handles larger problem sizes, we use the max number of threads recommended by the application. AMD-Internal: [CPUPL-1376] Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3	2021-05-07 09:52:51 +05:30
managalv	c420bd63e2	Enabled optimised packed routines on zen3 Change-Id: I5eb57f8ab2cccd20d0f778ada539fd1474cf6338	2021-05-06 01:25:08 -04:00
Madan mohan Manokar	c1fa9abe32	zgemm native path tuning 1. NC and MC values are tuned for both single-instance and multi-instance run. 2. zen2 and zen3 configs updated. 3. SUP path disabled for zgemm, since tuned native path performed better. To be re-enabled after setting right threshold for SUP selection. AMD-Internal: [CPUPL-1442] Change-Id: I0eb86926744d2983530a443e20e3e4e2ee3f3239	2021-05-06 01:15:35 -04:00
lcpu	7401effc03	BLIS:merge: Merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch have been fixed. Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (i.e., the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond) Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations. Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu) Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu) Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs. Minor code consolidation in all level-3 _front() functions. Reorganized Windows cpp branch of bli_pthreads.c. Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS. Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion. Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv. AMD-internal-[CPUPL-1523] Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd	2021-04-27 11:09:48 +05:30
Field G. Van Zee	bf1b578ea3	Reduced KC on skx from 384 to 256. Details: - Reduced the KC cache blocksize for double real on the skx subconfig from 384 to 256. The maximum (extended) KC was also reduced accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting this change.	2021-03-19 13:03:17 -05:00
nphaniku	b3628cdfd3	AOCL Windows: 3.1 BLIS changes 1. CMake script changes for build with Clang compiler. 2. CMake script changes for build test and testsuite based on the lib type ST/MT 3. CMake script changes for testcpp and blastest 4. Added python scripts to support library build and testsuite build. AMD Internal : [CPUPL-1422] Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898	2021-03-08 19:04:17 +05:30
Nicholai Tukanov	670bc7b60f	Add low-precision POWER10 gemm kernels (#467) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'.	2021-03-05 13:53:43 -06:00
Field G. Van Zee	f5871c7e06	Added complex asm packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision complex domain (c and z) and housed them in the 'haswell' kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Minor modifications to the corresponding s and d packm kernels that were introduced in `426ad67`. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), upon which these complex kernels are partially based.	2021-02-28 17:03:57 -06:00
Field G. Van Zee	426ad679f5	Added assembly packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision real domain (s and d) and housed them in the 'haswell' kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), which I have now tweaked and used to create comparable single-precision real kernels (s6xk and s16xk).	2021-02-27 18:39:56 -06:00
Meghana Vankadari	943b1362c7	Enabled vectorized pack kernels for zen2 configuration. Details: - These kernels are implemented by Field G. Van Zee as part of TRSM SUP implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a. AMD-Internal: [CPUPL-1376] Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2	2021-02-12 19:16:57 +05:30
Dipal M Zambare	38a8008cd8	Enabled znver3 flag for zen3 architecture znver3 flag will be enabled if compiler is AOCC Clang version 3.0 and configuration is zen3 Change-Id: Ie164f4d469bf3f8df31ccf8fed9f80dfc62efb39 AMD-Internal: [CPUPL-1353]	2020-12-04 12:28:22 +05:30
Field G. Van Zee	11dfc176a3	Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out calls to printf() to bli_rntm.c. - Whitespace changes to bli_thread.c.	2020-12-01 19:51:27 +00:00
Dipal M Zambare	c2f63fcc54	Update amd64 bundle configuration The configuration is updated to - Enable EPYC architecture optimizations - Macros to override block sizes. AMD-Internal : [CPUPL-1350] Change-Id: Id712f9abe6e81c9ece2baaab9d965b405e72977a	2020-12-01 14:37:13 +05:30
Kumar, Phani	477fc41fff	Cmake script changes and blis.h changes for amd-staging-milan-3.0 AMD Internal : [CPUPL-1083] Change-Id: Ia29a1f328ee32e2aec59a7fc70c04400d6ee6580	2020-11-24 06:12:25 -05:00
Field G. Van Zee	88ad841434	Squash-merge 'pr' into 'squash'. (#457) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed when running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them.
- Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c. - Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system, caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes.	2020-11-14 09:39:48 -06:00
Dipal M Zambare	ce99b1ecef	Added dynamic block size selection logic for DGEMM. Block sizes (MC, KC, NC) for DGEMM are determined at runtime based on following parameters - Single or multithreaded build - Processor Architecture (currently support only zen3) - Number of threads requested while running the library Change-Id: Ia793484b77adb87486e630d0d3b4c7856ae52094 AMD-Internal: [CPUPL-660, CPUPL-661]	2020-11-12 22:40:38 +05:30
Field G. Van Zee	e14424f55b	Merge branch 'dev'	2020-11-07 13:02:50 -06:00
managalv	aae48c2221	Optimised AXPYF routine for complex float and complex double Details: - Added SIMD code - Processing 5 rows at a time in SIMD loop to improve performance AMD-Internal: [CPUPL-1054] Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a	2020-11-06 18:42:13 +05:30
Nageshwar Singh	dbd7b28373	Development of AVX2 axpyv kernels for c and z datatypes. Details - Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv). - Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family. AMD-Internal: [CPUPL-1231] Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7	2020-10-23 09:33:35 +05:30
managalv	90f30e4c37	Optimised dotv kernel by SIMD approach and by removing framework overhead Details: - Kernel is called directly from API call to avoid framework overhead in case of complex float and complex double precisions. - Added SIMD code for complex float and complex double and unrolled for loop 5 times to improve performance AMD-Internal: [CPUPL-1057] Change-Id: I3b9d202398cacc0168882c9d6da2b450c27466a0	2020-10-13 18:59:31 +05:30
Field G. Van Zee	a0849d390d	Register l3 sup kernels in zen2 subconfig. Details: - Registered full suite of sgemm and dgemm sup millikernels, blocksizes, and crossover thresholds in bli_cntx_init_zen2.c. - Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742 system.	2020-10-09 20:22:17 +00:00
Meghana Vankadari	47744663d9	Enabling framework optimizations for zen family architectures. Details: - Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas framework optimizations for zen family configurations. - The macro needs to be defined in family.h files of respective arch configs. - Moved zen2-specific optimized kernels to zen folder, in order to be accessible to all zen family architectures. Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d	2020-10-07 13:10:50 +05:30
Nicholai Tukanov	2d8ec164e7	Add POWER10 support to BLIS (#450)	2020-09-29 16:52:18 -05:00
Field G. Van Zee	4fd8d9fec2	Tweaked zen2 subconfig's MC cache blocksizes. Details: - Updated the MC cache blocksizes registered by the 'zen2' subconfig. - Minor updates to test/3/Makefile and test/3/runme.sh.	2020-09-28 23:39:05 +00:00
Field G. Van Zee	e293cae2d1	Implemented sgemmsup assembly kernels. Details: - Created a set of single-precision real millikernels and microkernels comparable to the dgemmsup kernels that already exist within BLIS. - Added prototypes for all kernels within bli_kernels_haswell.h. - Registered entry-point millikernels in bli_cntx_init_haswell.c and bli_cntx_init_zen.c. - Added sgemmsup support to the Makefile, runme.sh script, and source file in test/sup. This included edits that allow for separate "small" dimensions for single- and double-precision as well as for single- vs. multithreaded execution.	2020-09-15 16:09:11 -05:00
dzambare	95bbdc12ab	Added -fomit-frame-pointer in kernel options Some of the SUP kernels now use rbp register. This register was also used by compiler to support automatic function call tracing, which was creating the conflict. Automatic call tracing feature is removed for now. If needed it can be enabled for non kernel code. Change-Id: Ib7ad00875f501ee2ad552cbb2ecdc245002d63b7 AMD-Internal: [CPUPL-1135]	2020-08-31 09:53:16 +05:30
Devin Matthews	7d41128219	Use -O2 for all framework code. (#435) It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes #341 and fixes #342.	2020-08-13 17:50:58 -05:00
Kiran Varaganti	fb0b4b57c1	Fixed missing changes Corrections in bli_gemm_front.c, taken the corrections from both public repo and 2.2.1 branch [CPUPL-1067] Change-Id: I4887ece6aa20bdfb87d97e7acebbe04cb9feea02	2020-08-13 00:29:37 +05:30
Dave Love	9c5b485d35	Don't override -mcpu with -march on ARM (#353) * Use -mcpu for ARM See the GCC doc about -march, -mtune, and -mcpu and maybe https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu * Fix typo in flags * Fix typo in cortexa9 flags * Modify cortexa53 compilation flags to fix failing BLAS check (#341)	2020-08-07 15:11:18 -05:00
dzambare	267a959af1	Rebased amd-staging-milan-3.0 branch on master -- Rebased on top of master commit # `6e522e5823` -- Updated merged code to remove duplicated code added by auto-merging -- Updated merged code to rename bool_t type -- Updated merged code to rename bli_thread_obarrier -- Updated merged code to rename bli_thread_obroadcast Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c AMD-Internal: [CPUPL-1067]	2020-08-06 10:09:29 +05:30
Devrajegowda, Kiran	013d6a2cab	Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. -setv simd kernel is added for single and double precision elements Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-817]	2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran	0ecc147f03	Adding a simd kernel for copyv function Details: - Separate kernel for copyv function added to improve performance. - Modified cntx_init file in zen and zen2 configuration - Added test_copyv.c in test folder - Modified test/Makefile to include test_copyv.c Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41 Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-818]	2020-08-06 10:09:28 +05:30
Meghana	aee0376b16	Added opt kernels for SWAPV Details: -Added SIMD kernels for SWAPV for both single and double precisions. -Modified cntx_init file for zen and zen2 configurations to choose opt kernels for SWAPV. -Added test_swapv.c in test folder. -Modified test/Makefile to include test_swapv.c Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-847]	2020-08-06 10:09:28 +05:30
Meghana Vankadari	a03a0f8f70	Made framework changes to initialize specific cache block sizes for TRSM. Details: -This commit addresses the performance optimization(single-thread and multi-thread) for DTRSM on zen2. -This new optimization employs different MC, KC & NC values for TRSM than what is being used in other Level-3 routines like DGEMM. -Changed TRSM framework code to choose these blocksizes for TRSM on zen family configurations. -Added a new field called "trsm_blkszs" to cntx structure in order to store TRSM specific block sizes. -Implemented routines to initialize, set and query the TRSM-specific block sizes. -Defined a new macro "AOCL_BLIS_ZEN" in configure script. This macro is automatically defined for zen family architectures. It enables us to choose different cache block sizes for TRSM instead of common level-3 block sizes. Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6 Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com> AMD-Internal: [CPUPL-656]	2020-08-06 10:09:28 +05:30
Devrajegowda, Kiran	6b5c68b9ed	"Merge Selective Packing code from amd branch flame/blis" Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed	2020-08-06 10:09:28 +05:30
Kiran Varaganti	307ddc3110	Revert " Merge Selective Packing code from amd branch flame/blis" This reverts commit `e4a6af33f5`. Reason for revert: <Review not done> Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2	2020-08-06 10:09:28 +05:30
Nallani Bhaskar	447cdf5ea8	Added sup support for sgemm under zen and related frame work changes. Change-Id: Ia7e88b96d3a3617e8d24754f50db081ffe2e9955	2020-08-06 10:09:28 +05:30
Kiran Varaganti	d334452df4	Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration. This is the same as in release 1.3. It was originally added to improve DGEMM multithreaded scalability on Naples when the number of threads is greater than 16. It was deleted by mistake among the many changes made for the 2.0 release, so this change adds it back. Also, code cleanup in bli_gemm_front.c. Change-Id: I6827b58d2dab1041fe182fef5a007b679ac4bb1f	2020-08-06 10:09:10 +05:30
sraut	e082fc2b1f	Fix for multi-instance performance issue on EPYC machines. Issue: With the default values of mc, kc and nc in multi-instance mode, performance across the cores dips drastically. Fix: After experimentation, found a different set of values (mc, kc and nc) that fits in the cache size, so that performance remains the same across all the cores. Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe	2020-08-06 10:07:08 +05:30
Field G. Van Zee	889b90888f	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2020-08-03 11:48:42 +05:30
Field G. Van Zee	fd5db714f4	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-08-03 11:27:13 +05:30
Field G. Van Zee	b1144a856b	Added -fomit-frame-pointer option to CKOPTFLAGS. Details: - Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS variable in the following make_defs.mk files: config/haswell/make_defs.mk config/skx/make_defs.mk as well as comments that mention why the compiler option is needed. This option is needed to prevent the compiler from using the rbp frame register (in the very early portion of kernel code, typically where k_iter and k_left are defined and computed), which, as of `1c719c9`, is used explicitly by the gemmsup millikernels. Thanks to Devin Matthews for identifying this missing option and to Jeff Diamond for reporting the original bug in #417. - The file config/zen/amd_config.mk which feeds into the make_defs.mk for both zen and zen2 subconfigs, was also touched, but only to add a commented-out compiler option (and the aforementioned explanatory comment) since that file already uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of CKOPTFLAGS.	2020-08-03 11:22:33 +05:30
Field G. Van Zee	a20b5e3c60	Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator. - Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. 
- Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_*_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates.	2020-08-03 11:10:38 +05:30
Field G. Van Zee	00e14cb6d8	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-07-29 14:24:34 -05:00
Dipal M Zambare	25d23cdda2	Zen3 support, disabled IR, JR loop parallelization AMD-Internal: [CPUPL-1013] Change-Id: I859152d63d1a56519c508dfa19587f25123e08b4	2020-07-24 20:55:47 +05:30
dzambare	9c7814da1c	Added support for zen3 configuration - User can now specify zen3 configuration; currently it reuses block sizes and kernels from zen2. - Auto configuration can detect and enable the zen3 config if needed - Added support for amd64 bundle which contains all zen platforms - Moved existing amd bundle to amd64 legacy. AMD-Internal: [CPUPL-500, CPUPL-1013] Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957	2020-07-22 18:24:26 +05:30
Meghana Vankadari	6a0a65ee23	Added sup kernels and code path for gemmt similar to GEMM. GEMMT now also supports complex data types. Details: - Added framework code for GEMMT SUP. - Implemented SUP for GEMMT using similar techniques as native path. - Moved update routines to frame/util folder. - Ported update routines for complex datatypes. Change-Id: I17adfd0586d07f5a23dca6a07b2d48f4c9fcf71c Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>, Dipal M Zambare <DipalMadhukar.Zambare@amd.com>, Mangala V <managala.v@amd.com>	2020-07-13 16:26:32 +05:30
Field G. Van Zee	5fc701ac5f	Added -fomit-frame-pointer option to CKOPTFLAGS. Details: - Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS variable in the following make_defs.mk files: config/haswell/make_defs.mk config/skx/make_defs.mk as well as comments that mention why the compiler option is needed. This option is needed to prevent the compiler from using the rbp frame register (in the very early portion of kernel code, typically where k_iter and k_left are defined and computed), which, as of `1c719c9`, is used explicitly by the gemmsup millikernels. Thanks to Devin Matthews for identifying this missing option and to Jeff Diamond for reporting the original bug in #417. - The file config/zen/amd_config.mk which feeds into the make_defs.mk for both zen and zen2 subconfigs, was also touched, but only to add a commented-out compiler option (and the aforementioned explanatory comment) since that file already uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of CKOPTFLAGS.	2020-07-01 15:48:58 -05:00
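
Commit `889b90888f` above describes when the default sup handler accepts a problem: at least one of m, n, or k must be less than or equal to its threshold, otherwise control returns to the conventional path. A minimal sketch of that rule follows. It is an illustration only, not BLIS code: the type `sup_thresh_t` and the function `sup_accepts()` are hypothetical names, and the threshold values used in practice are set per subconfiguration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the three sup thresholds (not a BLIS type). */
typedef struct {
    size_t mt; /* threshold for the m dimension */
    size_t nt; /* threshold for the n dimension */
    size_t kt; /* threshold for the k dimension */
} sup_thresh_t;

/* Returns true when at least one dimension is at or below its threshold,
 * which is the acceptance condition described in the commit message.
 * Returns false when every dimension exceeds its threshold, in which case
 * the caller would fall back to the conventional (packing) code path. */
bool sup_accepts(size_t m, size_t n, size_t k, const sup_thresh_t *t)
{
    assert(t != NULL);
    return m <= t->mt || n <= t->nt || k <= t->kt;
}
```

With all three thresholds at zero (the default the commit mentions), no problem with nonzero dimensions passes this check, which is why the message says the zero defaults effectively disable the sup implementation.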

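Commit `7401effc03` above mentions an automatic reduction of the thread count when the requested number is prime and exceeds BLIS_NT_MAX_PRIME (default 11). The sketch below illustrates that condition under stated assumptions: the names `NT_MAX_PRIME`, `is_prime()`, and `adjust_auto_threads()` are hypothetical, and reducing by exactly one thread is an assumption made for illustration, since the commit message does not say by how much the count is reduced.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for BLIS_NT_MAX_PRIME; 11 is the default named in the commit. */
#define NT_MAX_PRIME 11

/* Trial-division primality test; adequate for small thread counts. */
bool is_prime(int n)
{
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* If the requested count is prime and exceeds the threshold, use one
 * thread fewer (ASSUMPTION: the size of the reduction is not stated in
 * the commit message). Otherwise return the count unchanged. */
int adjust_auto_threads(int nt)
{
    assert(nt >= 1);
    if (nt > NT_MAX_PRIME && is_prime(nt))
        return nt - 1;
    return nt;
}
```

The motivation is that a large prime thread count cannot be factored across two loop dimensions, whereas a neighboring composite count (for example 12 = 4 x 3 instead of 13) can.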

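Commit `88ad841434` above notes that a missing beta = 0 special case caused uninitialized C data (possibly NaN) to be multiplied by beta. The following sketch shows why the explicit branch matters, since NaN times zero is still NaN. It is a simplified elementwise update, not the armv7a microkernel: the function `update_c()` and its parameters are hypothetical, and `ab` stands for an already computed product A*B.

```c
#include <assert.h>
#include <stddef.h>

/* Computes c := beta * c + alpha * ab for n elements.
 * When beta is exactly zero, c is overwritten without being read, so
 * garbage or NaN values in an uninitialized c cannot leak into the
 * result. */
void update_c(size_t n, float alpha, const float *ab, float beta, float *c)
{
    assert(ab != NULL && c != NULL);
    if (beta == 0.0f) {
        /* Explicit beta = 0 path: ignore the previous contents of c. */
        for (size_t i = 0; i < n; ++i)
            c[i] = alpha * ab[i];
    } else {
        /* General path: scale the existing c, then accumulate. */
        for (size_t i = 0; i < n; ++i)
            c[i] = beta * c[i] + alpha * ab[i];
    }
}
```

Without the first branch, a c buffer holding NaN would produce NaN outputs even with beta = 0, which matches the false test failure the commit describes.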
466 Commits