amd/blis - blis - Public git mirror

mirror of https://github.com/amd/blis.git synced 2026-05-21 17:08:17 +00:00

Author	SHA1	Message	Date
Field G. Van Zee	e9da6425e2	Allow use of 1m with mixing of row/col-pref ukrs. Details: - Fixed a bug that broke the use of 1m for dcomplex when the single-precision real and double-precision real ukernels had opposing I/O preferences (row-preferential sgemm ukernel + column-preferential dgemm ukernel, or vice versa). The fix involved adjusting the API to bli_cntx_set_ind_blkszs() so that the induced method context init function (e.g., bli_cntx_init_<subconfig>_ind()) could call that function for only one datatype at a time. This allowed the blocksize scaling (which varies depending on whether we're doing 1m_r or 1m_c) to happen on a per-datatype basis. This fixes issue #557. Thanks to Devin Matthews and RuQing Xu for helping discover and report this bug. - The aforementioned 1m fix required moving the 1m_r/1m_c logic from bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is called from each level-3 _front() function. The pack_t schemas in the cntx_t were also removed entirely, along with the associated accessor functions. This in turn required updating the trsm1m-related virtual ukernels to read the pack schema for B from the auxinfo_t struct rather than the context. This also required slight tweaks to bli_gemm_md.c. - Repositioned the logic for transposing the operation to accommodate the microkernel IO preference. This mostly only affects gemm. Thanks to Devin Matthews for his help with this. - Updated dpackm pack ukernels in the 'armsve' kernel set to avoid querying pack_t schemas from the context. - Removed the num_t dt argument from the ind_cntx_init_ft type defined in bli_gks.c. The context initialization functions for induced methods were previously passed a dt argument, but I can no longer figure out why they were passed this value. To reduce confusion, I've removed the dt argument (including also from the function definition + prototype). - Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This breaks high-level implementations of 3m and 4m, but this is okay since those implementations will be removed very soon. - Removed some older blocks of preprocessor-disabled code. - Comment update to test_libblis.c.	2021-10-13 14:15:38 -05:00
Devin Matthews	32a6d93ef6	Merge pull request #543 from xrq-phys/armsve-packm-fix ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9	2021-10-09 15:53:54 -05:00
RuQing Xu	ccf16289d2	Arm SVE C/ZGEMM Fix FMOV 0 Mistake FMOV [hsd]M, #imm does not allow zero immediate. Use wzr, xzr instead.	2021-10-08 12:34:14 +09:00
RuQing Xu	82b61283b2	SH Kernel Unused Either	2021-10-08 12:17:29 +09:00
RuQing Xu	1749dfa493	Arm SVE C/ZGEMM Support *beta==0	2021-10-08 12:13:08 +09:00
RuQing Xu	66a018e6ad	Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0	2021-10-08 12:13:08 +09:00
RuQing Xu	9e1e781cb5	Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0	2021-10-08 12:13:08 +09:00
RuQing Xu	e4cabb977d	Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg	2021-10-08 12:13:08 +09:00
RuQing Xu	b677e0d61b	Arm SVE Add SGEMM 2Vx10 Unindexed	2021-10-08 12:13:07 +09:00
RuQing Xu	3f68e8309f	Arm SVE ZGEMM Support Gather Load / Scatt. St.	2021-10-08 12:13:07 +09:00
RuQing Xu	c19db2ff82	Arm SVE Add ZGEMM 2Vx10 Unindexed	2021-10-08 12:13:07 +09:00
RuQing Xu	e13abde30b	Arm SVE Add ZGEMM 2Vx7 Unindexed	2021-10-08 12:13:06 +09:00
RuQing Xu	49b9d7998e	Arm SVE Add ZGEMM 2Vx8 Unindexed	2021-10-08 12:12:48 +09:00
RuQing Xu	2604f40713	Config ArmSVE Unregister 12xk. Move 12xk to Old	2021-10-07 02:39:00 +09:00
RuQing Xu	1e3200326b	Revert __has_include(). Distinguish w/ BLIS_FAMILY_**	2021-10-07 02:37:14 +09:00
Devin Matthews	80c5366e4a	Move unused ARM SVE kernels to "old" directory.	2021-10-04 15:40:28 -05:00
Devin Matthews	13dbd5b5d3	Apply patch from @xrq-phys.	2021-10-02 16:08:05 -05:00
Devin Matthews	ae0eeeaf77	Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.	2021-09-29 16:43:38 -05:00
RuQing Xu	30c29b256e	Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Affected configs: a64fx.	2021-09-16 05:01:03 +09:00
RuQing Xu	bffa85be59	Arm SVE: Correct PACKM Ker Name: Intrinsic Kers SVE-Intrinsic-based kernels ought not to use asm in their names.	2021-09-16 04:31:45 +09:00
RuQing Xu	61584deddf	Added 512b SVE-based a64fx subconfig + SVE kernels. Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block size by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FMLA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR #424.	2021-05-19 09:52:29 -05:00
Guodong Xu	f032d5d4a6	New kernel set for Arm SVE using assembly (#396) This adds two kernels for Arm SVE vector extensions. 1. a gemm kernel for double at sizes 8x8. 2. a packm kernel for double at dimension 8xk. To achieve best performance, vector length agnostic programming is not used. Vector length (VL) of 256 bits is mandated in both kernels. Kernels to support other VLs can be added later. "SVE is a vector extension for AArch64 execution mode for the A64 instruction set of the Armv8 architecture. Unlike other SIMD architectures, SVE does not define the size of the vector registers, but constrains into a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. Instructions are provided specifically to query an implementation for its register size, to guarantee that the applications can run on different implementations of the ISA without the need to recompile the code." [1] [1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning Signed-off-by: Guodong Xu <guodong.xu@linaro.org>	2020-04-29 12:08:46 -05:00

22 Commits