This library ported on Windows 10 using CMake scripts and Visual Studio 2019 with clang compiler
AMD internal:[CPUPL-657]
Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
Multiple trace levels will allow user to set the nested call levels
up to which the traces to be limited. It will also reduce file size
requirements.
Also optimized auto trace output to reduce file size by removing
thread ID's from individual lines.
AMD Internal: [CPUPL-806]
Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd
Added traces from blas/cblas API's till kernels for dgemm and sgemm.
By default the traces will be disabled, user need to enable them
in their local workspace, please check aocl_dtl/aocldtlcf.h file.
AMD Internal : CPUPL-806
Change-Id: I83b310509fb1a599c114387192bcf882ef0480f9
Details:
- Optimized saxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of sgemv to remove dependency on cntx
variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified BLAS interface call for SGEMV to reduce framework overhread.
- Currently these changes are applicable for zen2 configuration.
Change-Id: Iabc36ae640e82e65f8764f3c6dee513ad64b22fd
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-707]
Failure was seen in libflame function (FLASH_UDdate_UT_inc)
Due to typecasting double complex pointer as double pointer
Change-Id: If6e2f4663575450a13a9a07dddd5622628f5c6b0
Details:
Using of ymm registers storing 8 float values than 4 floats values
Changed register from ymm to xmm in required places. This can be found
only when leading dimension is greater than the actual dimension.
Change-Id: I39f04eac18c4fa3a8c93048c977d6a83aa92b800
Details
Added Support of N SUP kernel for complex float and complex double
Removed prefetching in M SUP kernels for complex float and complex double
Removed all warnings
Change-Id: I05ffde0f0613681927fe7576db7f5f1a4486fd05
Details:
Added two new kernels bli_sgemmsup_rd_zen_asm_6x16m and bli_sgemmsup_rd_zen_asm_6x16n
to support dot product in Row Major (A * Tranpose(B)) and in Column Major (Tranpose(A) * B)
Change-Id: I264fd75c4c4b68fb7dc4fd229eaa44d09e9f3432
Removed conditional check for (*kappa_cast == 1.0) because its always 1.0 in
DGEMM packing kernels.
[CPUPL-636]
Change-Id: Ib04f2a3cdbb0f138036a8b0486d1dec073e40407
[CPUPL-858] Packing kernels for dgemm 6x8 kernel are added explicitly
for zen2 configuration. Apart from generic packing kernels used by level-3
routines and for all combinations of the input parameters, introduced DGEMM
specific packing kernels for the case op(A) & op(B) is no transpose. This
helps us to vectorize these packing kernels and eliminate un-necessary branch
conditional checks. The packed kernels are also optimized at the boundary.
These boundary condition optimization help when the input matrix dimensions
"m" and "n" are not multiples of register block-sizes "MR & NR".
Typical DGEMM operation is C = beta*C + alpha *op(A) * op(B). Kindly note
the multiplication with alpha is handled inside kernel, hence in these dgemm
packing routines alpha is always consider 1.0. These routines are
"bli_dpackm_8xk_nn_zen" & "bli_dpackm_6xk_nn_zen". The generic packing
routines
are "bli_dpackm_6xk_gen_zen" & bli_dpackm_8xk_gen_zen". These routines are
enabled from "bli_cntx_init_zen2()" through bli_cntx_set_packm_kers(). In this
checkout wthe generic packing kernels are enabled by default". Later will
introduce run-time mechanism to change these packing kernels based on the
DGEMM input parameters.
Change-Id: I079b4dce0757d558224cb8c55d024bfea6a4de91
This bug happens at a corner case, when k_iter == 0 and we jump to
CONSIDERKLEFT.
In current design, first row/col. of a and b are loaded twice.
The fix is to rearrange a and b (first row/col.) loading instructions.
Change-Id: I4a985a3abf9b1e7a0ee29e17c7d39a4a27138c4c
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
Here adds two kernels for Arm SVE vector extensions.
1. a gemm kernel for double at sizes 8x8.
2. a packm kernel for double at dimension 8xk.
To achive best performance, variable length agonostic programming
is not used. Vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.
"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code." [1]
[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
Details:
- Renamed two bli_thread_*() APIs:
bli_thread_obarrier() -> bli_thread_barrier()
bli_thread_obroadcast() -> bli_thread_broadcast()
The 'o' was a leftover from when thrcomm_t objects tracked both
"inner" and "outer" communicators. They have long since been
simplified to only support the latter, and thus the 'o' is
superfluous.
Change-Id: If9ec9a2383dfb02e1cfc74918f87a1fabddbd55b
Details:
SUP support added for ZGEMM for different storage formats in M direction
SUP kernels and sub kernels are implemented to cover all dimensions of square matrix
SUP kernels supports RRR, RCR, CRR, CCR storage formats
Change-Id: I2c846a430dfcf356cac8ebf62015b1f743157381
Details:
- Optimized daxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of dgemv to remove dependency on cntx variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified BLAS interface call for DGEMV to reduce framework overhread.
- Currently these changes are applicable for zen2 configuration.
They will be enabled for zen family processors in future.
- Changed naming convention for new BLAS macros to indicate their use.
- Added new optimized kernel for axpyf under zen2 folder.
- Implemented basic GEMV kernel without using axpyv or axpyf.
This kernel is chosen for small sizes.
Change-Id: I4278d37e494854879c71499b8b9da8c5dbe3bf5b
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-885]
Details
Added SUP support for cgemm in M direction
SUP kernels are 3x8m, 3x4m, 3x2m is implemeted
Sub kernels are implemented to support various dimenions
SUP CGEMM supports matrix C & A row/col major and Matrix B is row major matrix
Change-Id: Ia6854b929d3b5741a4900422d05df1257f5d014d
Description:
Pannel strides are updated using variables rather than constant values to
support selecive packing in sgemm sup kernels
Change-Id: Ic098eb70592d12d7d2174a1166aebf3bc749140c
Details:
-Kernel is called directly from API call to avoid framework
overhead in case of single and double precisions.
-Currently these changes are applicable only for zen2 configuration.
They will be enabled for zen family processors in future.
-These changes improve performance of BLAS and CBLAS interfaces of API.
They do not affect BLIS-specific APIs.
-setv simd kernel is added for single and double precision elements
Change-Id: I1b343aa232f2571717c2b01ada5914f869883e1a
Signed-off-by: Kiran ND <Kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-817]
Details:
- Separate kernel for copyv function added to improve performance.
- Modified cntx_init file in zen and zen2 configuration
- Added test_copyv.c in test folder
- Modified test/Makefile to include test_copyv.c
Change-Id: I297f539f2ddd2d71997b127a71a460991cd07b41
Signed-off-by: Kiran N D <kiran.Devrajegowda@amd.com>
AMD-Internal: [CPUPL-818]
Details:
-Added SIMD kernels for SWAPV for both single and double precisions.
-Modified cntx_init file for zen and zen2 configurations to choose opt kernels for
SWAPV.
-Added test_swapv.c in test folder.
-Modified test/Makefile to include test_swapv.c
Change-Id: Ida786eec722e634aee0dacdd51c327823c80f01a
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-847]
Details:
- In case of GEMM, whenever beta is zero, we need to perform C = alpha
*(A * B) instead of C = beta * C + alpha * (A * B)
Added conditions to check the value of beta at different levels inside
small_gemm kernels and decide whether to perform scaling C with beta or
not.
-Modified small_gemm kernels to use BLIS specific functions to retrieve
different fields of objects.
-Calling bli_gemm_check before entering bli_gemm_small to facilitate
early return in case of invalid inputs.
-For corner cases inside small_gemm kernels, a buffer called f_temp
is used to load and store data to and from registers.
populating the buffer with zeroes before use.
-In bli_gemm_front, datatypes of status and return value from
bli_gemm_small are not matching.
Corrected the datatype of the variable 'status' inside bli_gemm_front
to err_t.
Change-Id: I8b52ad55008f028d6c8b7e0d20f746a869d9daea
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-689,SWLCSG-104]
For the kernel of size 4x8, cs_b is used instead of cs_a to calculate address of diagonal elements of matrix A.
Correcting the mistake.
Change-Id: Ie74e0f6a397fcd32fefb5804cd00f1e90bfe5523
Interchanged some loops to favour column-major storage.
Added check condiion to identify last column and load it using a 'for' loop to avoid memory accesses out of buffer
Change-Id: Id5d2e16c65017a7f4b641d33228d23903efd09ac