amd/blis - blis - Public git mirror


mirror of https://github.com/amd/blis.git synced 2026-05-11 17:50:00 +00:00

Author	SHA1	Message	Date
Nageshwar Singh	dbd7b28373	Development of AVX2 axpyv kernels for c and z datatypes. Details - Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv). - Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family. AMD-Internal: [CPUPL-1231] Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7	2020-10-23 09:33:35 +05:30
managalv	90f30e4c37	Optimised dotv kernel by SIMD approach and by removing framework overhead Details: - Kernel is called directly from API call to avoid framework overhead in case of complex float and complex double precisions. - Added SIMD code for complex float and complex double and unrolled for loop 5 times to improve performance AMD-Internal: [CPUPL-1057] Change-Id: I3b9d202398cacc0168882c9d6da2b450c27466a0	2020-10-13 18:59:31 +05:30
Meghana Vankadari	47744663d9	Enabling framework optimizations for zen family architectures. Details: - Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas framework optimizations for zen family configurations. - The macro needs to be defined in family.h files of respective arch configs. - Moved zen2-specific optimized kernels to zen folder, in order to be accessible to all zen family architectures. Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d	2020-10-07 13:10:50 +05:30
Meghana Vankadari	9a330f1754	Added debug trace and log support for gemmt and TRSM APIs Details: - Added debug trace support for DGEMMT and DTRSM APIs. - Added log support for gemmt, trsm APIs. - Modified gemm dump_sizes function to dump transpose parameters. AMD-Internal: [CPUPL-1210] Change-Id: Ice1effe27ec349203ce5def030a6b85b204bd91e	2020-10-02 12:31:47 +05:30
Meghana Vankadari	43d90e3110	Handling beta=0 case separately for gemv inside bli_dgemv_zen_ref_c function Details: - For GEMV whenever beta = 0, we should not scale vector 'y' with beta, instead overwrite the 'y' vector with zeroes before carrying out the operation. Change-Id: I159afba6c6ac3b72b74718fab7a4f4ec293012c5	2020-09-08 07:50:34 -04:00
bhaskarn	d186cfdf2e	CPUPL-1074: - Bug fix in sgemmsup 1x16 kernel for beta zero with C in column storage: the rcx register increment was missing, so 4 values in the output were overwritten. Change-Id: Ia3028040dce3e615f1db5a331498d86faadcf916	2020-08-11 01:26:26 -04:00
dzambare	267a959af1	Rebased amd-staging-milan-3.0 branch on master -- Rebased on top of master commit # `6e522e5823` -- Updated merged code to remove duplicated code added by auto-merging -- Updated merged code to rename bool_t type -- Updated merged code to rename bli_thread_obarrier -- Updated merged code to rename bli_thread_obroadcast Change-Id: I39879f1ef3b42ecbe5808af3b559d88c36dbbf6c AMD-Internal: [CPUPL-1067]	2020-08-06 10:09:29 +05:30
Mangala V	5b8c2bc9e2	Revert "CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed" This reverts commit `725bf5aceb`. Reason for revert: <INSERT REASONING HERE> Change-Id: I7dd6b84731f091c8b39080ed9321a708fa5f11d8	2020-08-06 10:09:29 +05:30
managalv	2a0928aad4	CPUPL-1059: Failures seen in DGEMM SUP for specific size is fixed Details: - Problem: If row major, first four elements of last column on output matrix C was not updated If col major, first four elements of last row on output matrix C was not updated - Solution: Updating elements after computation is done on right offset in bli_dgemmsup_rv_haswell_asm_5x8() Change-Id: I588c60f2f3cd5f51e475cfc140e3bf0e9d5a4dae	2020-08-06 10:09:29 +05:30
Devrajegowda, Kiran	6b5c68b9ed	"Merge Selective Packing code from amd branch flame/blis" Change-Id: Ifbdf49735f56a66fbbc96dab6d3ca6069302daed	2020-08-06 10:09:28 +05:30
Kiran Varaganti	307ddc3110	Revert " Merge Selective Packing code from amd branch flame/blis" This reverts commit `e4a6af33f5`. Reason for revert: <Review not done> Change-Id: Iae548f949a81a66281023c860c2bcffdfdae21b2	2020-08-06 10:09:28 +05:30
Kiran Varaganti	87d5166b28	Merged BLIS Release 1.3 Modified config/zen/make_defs.mk, now CKVECFLAGS := -mavx2 -mfpmath=sse -mfma -march=znver1 Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81	2020-08-06 10:07:34 +05:30
Field G. Van Zee	889b90888f	Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. 
Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en\|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. - sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. 
- sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. 
- Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat.	2020-08-03 11:48:42 +05:30
Field G. Van Zee	fd5db714f4	Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit `a69a4d7`. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit `2c554c2`.	2020-08-03 11:27:13 +05:30
Field G. Van Zee	4f5b014c05	Added missing rv_d?x6 edge cases to sup kernel. Details: - Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling various n = 6 edge cases with a single sup kernel call. Previously, only n = {4,2,1} were handled explicitly as single kernel calls; that is, cases where n = 6 were previously being executed via two kernel calls (n = 4 and n = 2). - Added commented debug line to testsuite's test_libblis.c.	2020-08-03 11:23:40 +05:30
Field G. Van Zee	0651b466c2	Bugfixes, cleanup of sup dgemm ukernels. Details: - Fixed a few not-really-bugs: - Previously, the d6x8m kernels were still prefetching the next upanel of A using MRrs_a instead of ps_a (same for prefetching of next upanel of B in d6x8n kernels using NRcs_b instead of ps_b). Given that the upanels might be packed, using ps_a or ps_b is the correct way to compute the prefetch address. - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck, executed as intended even though it was based on a faulty pointer management. Basically, in the rd_d6x8m kernel, the pointer for B (stored in rdx) was loaded only once, outside of the jj loop, and in the second iteration its new position was calculated by incrementing rdx by the absolute offset (four columns), which happened to be the same as the relative offset (also four columns) that was needed. It worked only because that loop only executed twice. A similar issue was fixed in the rd_d6x8n kernels. - Various cleanups and additions, including: - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so that it is loaded only once outside of the loops rather than multiple times inside the loops. - Changed outer loop in rd kernels so that the jump/comparison and loop bounds more closely mimic what you'd see in higher-level source code. That is, something like: for( i = 0; i < 6; i+=3 ) rather than something like: for( i = 0; i <= 3; i+=3 ) - Switched row-based IO to use byte offsets instead of byte column strides (e.g. via rsi register), which were known to be 8 anyway since otherwise that conditional branch wouldn't have executed. - Cleaned up and homogenized prefetching a bit. - Updated the comments that show the before and after of the in-register transpositions. - Added comments to column-based IO cases to indicate which columns are being accessed/updated. - Added rbp register to clobber lists. - Removed some dead (commented out) code. 
- Fixed some copy-paste typos in comments in the rv_6x8n kernels. - Cleaned up whitespace (including leading ws -> tabs). - Moved edge case (non-milli) kernels to their own directory, d6x8, and split them into separate files based on the "NR" value of the kernels (Mx8, Mx4, Mx2, etc.). - Moved config-specific reference Mx1 kernels into their own file (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory. - Added rd_dMx1 assembly kernels, which seems marginally faster than the corresponding reference kernels. - Updated comments in ref_kernels/bli_cntx_ref.c and changed to using the row-oriented reference kernels for all storage combos.	2020-08-03 11:22:32 +05:30
dzambare	9c7814da1c	Added support for zen3 configuration - User can now specify zen3 configuration, currently it reuses block sizes and kernels from zen2. - Auto configuration can detect and enable if zen3 config is needed - Added support for amd64 bundle which contains all zen platforms - Moved existing amd bundle to amd64 legacy. AMD-Internal: [CPUPL-500, CPUPL-1013] Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957	2020-07-22 18:24:26 +05:30
phakumar	ccf0772d6e	BLIS library porting on to Windows: This library ported on Windows 10 using CMake scripts and Visual Studio 2019 with clang compiler AMD internal:[CPUPL-657] Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9	2020-06-16 18:29:00 +05:30
Dipal M Zambare	dad7e2f235	Added support multiple trace levels & optimization of file size requirements Multiple trace levels will allow user to set the nested call levels up to which the traces to be limited. It will also reduce file size requirements. Also optimized auto trace output to reduce file size by removing thread ID's from individual lines. AMD Internal: [CPUPL-806] Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd	2020-06-10 16:00:49 +05:30
Dipal M Zambare	305c744131	Added traces in dgemm and sgemm paths. Added traces from blas/cblas API's till kernels for dgemm and sgemm. By default the traces will be disabled, user need to enable them in their local workspace, please check aocl_dtl/aocldtlcf.h file. AMD Internal : CPUPL-806 Change-Id: I83b310509fb1a599c114387192bcf882ef0480f9	2020-06-08 12:01:22 +05:30
Meghana	9fce1ec4a4	Optimized SGEMV kernel and changed BLAS interface call Details: - Optimized saxpyf kernel with fuse_factor=5 and iter_unroll=2. - Modified framework files of sgemv to remove dependency on cntx variable. - Updated cntx_init file of zen2 to choose optimized kernels. - Modified BLAS interface call for SGEMV to reduce framework overhead. - Currently these changes are applicable for zen2 configuration. Change-Id: Iabc36ae640e82e65f8764f3c6dee513ad64b22fd Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-707]	2020-06-04 02:49:08 -04:00
managalv	b4e599ecc2	CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices Failure was seen in libflame function (FLASH_UDdate_UT_inc) Due to typecasting double complex pointer as double pointer Change-Id: If6e2f4663575450a13a9a07dddd5622628f5c6b0	2020-06-02 22:27:54 +05:30
Nallani Bhaskar	6f01cd2c54	Fix for sblat3.x failure in make check Details: ymm registers were being used, storing 8 float values rather than 4 float values. Changed register from ymm to xmm in required places. The failure shows up only when the leading dimension is greater than the actual dimension. Change-Id: I39f04eac18c4fa3a8c93048c977d6a83aa92b800	2020-06-01 17:04:59 +05:30
managalv	f7bc37ea32	CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices Details Added Support of N SUP kernel for complex float and complex double Removed prefetching in M SUP kernels for complex float and complex double Removed all warnings Change-Id: I05ffde0f0613681927fe7576db7f5f1a4486fd05	2020-06-01 06:24:12 -04:00
Kiran Varaganti	c8f3cec5f7	Merge "Code cleanup in 6xk DGEMM pack Kernel" into amd-staging-rome-2.2	2020-06-01 05:08:58 -04:00
Nallani Bhaskar	5e0ad13f8e	Code Cleanup and replaced vzeroall with vxorps Change-Id: I74c2cc2183a407aad86eab5c3285c33690de9abd	2020-06-01 10:14:06 +05:30
Nallani Bhaskar	2413c31672	CPUPL-923: Implemented dot Product Kernels in SGEMM SUP for transpose cases. Details: Added two new kernels bli_sgemmsup_rd_zen_asm_6x16m and bli_sgemmsup_rd_zen_asm_6x16n to support dot product in Row Major (A * Transpose(B)) and in Column Major (Transpose(A) * B) Change-Id: I264fd75c4c4b68fb7dc4fd229eaa44d09e9f3432	2020-05-31 22:37:03 +05:30
Kiran Varaganti	3ebd5f8aa0	Code cleanup in 6xk DGEMM pack Kernel Removed conditional check if(*kappa_cast==0.0) in 6xk dgemm packing kernel Change-Id: Ie543787133d303aeb2532e67b83d6ba96e3d558e	2020-05-31 21:41:45 +05:30
Kiran Varaganti	f8ddd48594	Code Clean-up in DGEMM packing kernels Removed conditional check for (*kappa_cast == 1.0) because its always 1.0 in DGEMM packing kernels. [CPUPL-636] Change-Id: Ib04f2a3cdbb0f138036a8b0486d1dec073e40407	2020-05-30 21:55:29 +05:30
prangana	0c52aaefe1	Merge branch 'ref/heads/amd-staging-rome-2.2' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging-rome-2.2 Change-Id: I46acf48354ff73fb4eaeac255132d21095ea4d98	2020-05-30 10:31:10 +05:30
prangana	bb7eeec843	Change loop test expression in bli_packm_zen_int.c PRAGMA SIMD loop has issues with test expression (k !=0) Changed usage to (k > 0) Change-Id: I50204dbd0194de43f0d6cdcbfc586bb16aa25968	2020-05-30 10:00:21 +05:30
Kiran Varaganti	739803a441	DGEMM Packing Kernels for Native DGEMM implementation [CPUPL-858] Packing kernels for dgemm 6x8 kernel are added explicitly for zen2 configuration. Apart from generic packing kernels used by level-3 routines and for all combinations of the input parameters, introduced DGEMM specific packing kernels for the case op(A) & op(B) is no transpose. This helps us to vectorize these packing kernels and eliminate unnecessary conditional branch checks. The packing kernels are also optimized at the boundary. These boundary condition optimizations help when the input matrix dimensions "m" and "n" are not multiples of register block-sizes "MR & NR". Typical DGEMM operation is C = beta * C + alpha * op(A) * op(B). Kindly note the multiplication with alpha is handled inside the kernel, hence in these dgemm packing routines alpha is always considered 1.0. These routines are "bli_dpackm_8xk_nn_zen" & "bli_dpackm_6xk_nn_zen". The generic packing routines are "bli_dpackm_6xk_gen_zen" & "bli_dpackm_8xk_gen_zen". These routines are enabled from "bli_cntx_init_zen2()" through bli_cntx_set_packm_kers(). In this checkout the generic packing kernels are enabled by default. Later will introduce run-time mechanism to change these packing kernels based on the DGEMM input parameters. Change-Id: I079b4dce0757d558224cb8c55d024bfea6a4de91	2020-05-28 02:01:43 -04:00
managalv	154bedc785	CPUPL-929:Improve Complex GEMM performance Removed print which was part of kernel Change-Id: I288e0151ba8da8d6dd4415734c88ed3474ba3a5b	2020-05-22 14:39:12 +05:30
Guodong Xu	72443e7173	avoid loading twice in armv8a gemm kernel (#403 ) This bug happens at a corner case, when k_iter == 0 and we jump to CONSIDERKLEFT. In current design, first row/col. of a and b are loaded twice. The fix is to rearrange a and b (first row/col.) loading instructions. Change-Id: I4a985a3abf9b1e7a0ee29e17c7d39a4a27138c4c Signed-off-by: Guodong Xu <guodong.xu@linaro.org>	2020-05-21 12:37:53 +05:30
Guodong Xu	66ec22705b	New kernel set for Arm SVE using assembly (#396 ) Here adds two kernels for Arm SVE vector extensions. 1. a gemm kernel for double at sizes 8x8. 2. a packm kernel for double at dimension 8xk. To achieve best performance, variable length agnostic programming is not used. Vector length (VL) of 256 bits is mandated in both kernels. Kernels to support other VLs can be added later. "SVE is a vector extension for AArch64 execution mode for the A64 instruction set of the Armv8 architecture. Unlike other SIMD architectures, SVE does not define the size of the vector registers, but constrains into a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. Instructions are provided specifically to query an implementation for its register size, to guarantee that the applications can run on different implementations of the ISA without the need to recompile the code." [1] [1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning Signed-off-by: Guodong Xu <guodong.xu@linaro.org>	2020-05-21 11:56:45 +05:30
Field G. Van Zee	9e76059f15	Renamed bli_thread_obarrier(), _obroadcast(). Details: - Renamed two bli_thread_*() APIs: bli_thread_obarrier() -> bli_thread_barrier() bli_thread_obroadcast() -> bli_thread_broadcast() The 'o' was a leftover from when thrcomm_t objects tracked both "inner" and "outer" communicators. They have long since been simplified to only support the latter, and thus the 'o' is superfluous. Change-Id: If9ec9a2383dfb02e1cfc74918f87a1fabddbd55b	2020-05-21 11:54:37 +05:30
managalv	f630b3fc36	CPUPL-929:Improve Complex GEMM performance Details: SUP support added for ZGEMM for different storage formats in M direction SUP kernels and sub kernels are implemented to cover all dimensions of square matrix SUP kernels supports RRR, RCR, CRR, CCR storage formats Change-Id: I2c846a430dfcf356cac8ebf62015b1f743157381	2020-05-20 17:36:04 +05:30
Meghana Vankadari	9ea0472f4c	Replaced all the instances of zen_basic with zen_ref_c Change-Id: Id53f2c1ce7e9878991a831c3651061f0b679b080 Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-885]	2020-05-19 20:27:17 +05:30
Meghana Vankadari	4fcc4e499d	Optimized DGEMV kernel and changed BLAS interface call Details: - Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2. - Modified framework files of dgemv to remove dependency on cntx variable. - Updated cntx_init file of zen2 to choose optimized kernels. - Modified BLAS interface call for DGEMV to reduce framework overhead. - Currently these changes are applicable for zen2 configuration. They will be enabled for zen family processors in future. - Changed naming convention for new BLAS macros to indicate their use. - Added new optimized kernel for axpyf under zen2 folder. - Implemented basic GEMV kernel without using axpyv or axpyf. This kernel is chosen for small sizes. Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-885]	2020-05-19 06:40:44 -04:00
managalv	af1ad806f2	CPUPL-929: Improve Complex GEMM performance - Support all storage formats and non Transpose/Conjugate Matrices Details: Supports cgemm SUP all storage formats for XXR format Change-Id: I1f1ac6b47f0b54141acac65e2cb4f3a2aaa3bac6	2020-05-18 21:06:57 +05:30
managalv	310dda928f	CPUPL-709: Improve Complex GEMM performance - Level 1 Optimization Details: Added SUP support for cgemm in M direction. SUP kernels 3x8m, 3x4m, 3x2m are implemented. Sub kernels are implemented to support various dimensions. SUP CGEMM supports matrices C & A in row/col major with matrix B in row major Change-Id: Ia6854b929d3b5741a4900422d05df1257f5d014d	2020-05-18 20:43:49 +05:30
Nallani Bhaskar	b3a308b689	CPUPL-948: Selective Packing changes are implemented in sgemm sup Description: Panel strides are updated using variables rather than constant values to support selective packing in sgemm sup kernels Change-Id: Ic098eb70592d12d7d2174a1166aebf3bc749140c	2020-05-18 11:46:33 +05:30
Devrajegowda, Kiran	6f33fd6aac	Modified Function definition for BLAS and CBLAS interfaces of ?SCALV API Details: -Kernel is called directly from API call to avoid framework overhead in case of single and double precisions. -Currently these changes are applicable only for zen2 configuration. They will be enabled for zen family processors in future. -These changes improve performance of BLAS and CBLAS interfaces of API. They do not affect BLIS-specific APIs. -setv simd kernel is added for single and double precision elements Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-817]	2020-05-13 01:51:48 -04:00
Nallani Bhaskar	49cd7a96d5	CPUPL-866: ZenDNN gtest cases failing with blis 2.1 and later releases Change-Id: Ib9ddfb133576d06cea6642fc3fefd818317fe922	2020-05-03 13:00:43 +05:30
Devrajegowda, Kiran	4caee59466	Adding a simd kernel for copyv function Details: - Separate kernel for copyv function added to improve performance. - Modified cntx_init file in zen and zen2 configuration - Added test_copyv.c in test folder - Modified test/Makefile to include test_copyv.c Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41 Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com> AMD-Internal: [CPUPL-818]	2020-04-24 01:55:25 -04:00
Meghana	b846059bcf	Added opt kernels for SWAPV Details: -Added SIMD kernels for SWAPV for both single and double precisions. -Modified cntx_init file for zen and zen2 configurations to choose opt kernels for SWAPV. -Added test_swapv.c in test folder. -Modified test/Makefile to include test_swapv.c Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-847]	2020-04-20 01:21:44 -05:00
Meghana	e56cf63a3f	Optimized "bli_dotv_zen_int10" kernels Details: - Fixed issues in "bli_dotv_zen_int10" kernels and optimized them. - Changed cntx_init file to choose "bli_dotv_zen_int10" kernel for dotv API call. Change-Id: Iee8d7519f3a22a2d41166390be6047e9cb37557f Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-824]	2020-04-14 09:52:57 +05:30
Meghana	c20c96d9c0	Made some critical changes to small_gemm kernels Details: - In case of GEMM, whenever beta is zero, we need to perform C = alpha (A B) instead of C = beta * C + alpha * (A * B) Added conditions to check the value of beta at different levels inside small_gemm kernels and decide whether to perform scaling C with beta or not. -Modified small_gemm kernels to use BLIS specific functions to retrieve different fields of objects. -Calling bli_gemm_check before entering bli_gemm_small to facilitate early return in case of invalid inputs. -For corner cases inside small_gemm kernels, a buffer called f_temp is used to load and store data to and from registers. populating the buffer with zeroes before use. -In bli_gemm_front, datatypes of status and return value from bli_gemm_small are not matching. Corrected the datatype of the variable 'status' inside bli_gemm_front to err_t. Change-Id: I8b52ad55008f028d6c8b7e0d20f746a869d9daea Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-689,SWLCSG-104]	2020-03-19 16:30:04 +05:30
Nallani Bhaskar	83745c7ffc	Beta Zero Check for sgemm small. Core Software Group SWLCSG-137 BLIS-ST validation failures Change-Id: I21d5eae6ec390438be847f2dca42350b97059d6e	2020-03-09 02:55:51 -04:00
Nallani Bhaskar	e0c95d77e1	Beta Zero Checks for sgemm_small Change-Id: I111b66ad54a27b1977d155904738a55a351e6689	2020-03-09 02:55:25 -04:00


337 Commits