1. Replaced bli_malloc with normal malloc plus address alignment within 3m_sqp.
2. Function added to pack the real, imaginary, and sum parts of A.
3. Function added to pack the real, imaginary, and sum parts of B.
4. Function added to pack the real and imaginary parts of C, with beta handling.
5. Sum and subtraction operations vectorized.
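The packed real, imaginary, and sum buffers above are what the 3m scheme consumes: a complex product can be formed from three real products instead of four. A minimal scalar sketch of the idea (illustrative only, not the actual 3m_sqp code):

```c
#include <math.h>

/* 3m scheme: (ar + i*ai)*(br + i*bi) using three real products.
 * m1 = ar*br, m2 = ai*bi, m3 = (ar+ai)*(br+bi)  <- the "sum" buffers
 * real = m1 - m2, imag = m3 - m1 - m2 */
void cmul_3m(double ar, double ai, double br, double bi,
             double *cr, double *ci)
{
    double m1 = ar * br;
    double m2 = ai * bi;
    double m3 = (ar + ai) * (br + bi);
    *cr = m1 - m2;
    *ci = m3 - m1 - m2;
}
```

Packing A and B as real, imaginary, and sum planes turns the three scalar products into three real matrix multiplies.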
AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
Details:
- This implementation performs a transpose while packing the 16xk buffer of
A and passes it to the 16x3-nn kernel.
- The same implementation works for the case where B has transpose.
AMD-Internal: [CPUPL-1376]
Change-Id: I81f74deb609926598f62c30f5bd6fc80fb1b9a17
Details:
- The decision logic for choosing small_gemm has been moved to the BLAS interface.
- Redirected all calls to small_gemm from gemm_front to the native
implementation.
AMD-Internal: [CPUPL-1376]
Change-Id: I6490f67113e9f7c272269f441c86f2a0b3c89a53
Details:
- This kernel works best for cases where k = 1.
- This implementation is called directly from the BLAS interface when the A and B
matrices have no transpose and k = 1.
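With k = 1 and no transposes, GEMM degenerates to a rank-1 update, C = beta*C + alpha*a*b^T, which is what makes a dedicated kernel attractive. A scalar reference sketch (column-major, hypothetical names, not the optimized kernel):

```c
/* k = 1 GEMM: C(m x n) = beta*C + alpha * a(m x 1) * b(1 x n).
 * Column-major C with leading dimension ldc. Illustrative sketch. */
void gemm_k1_ref(int m, int n, double alpha,
                 const double *a, const double *b,
                 double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            c[i + j*ldc] = beta * c[i + j*ldc] + alpha * a[i] * b[j];
}
```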
AMD-Internal: [CPUPL-1376]
Change-Id: I3b31673a28290c81d4a4cb64c8605d56e50b5d3d
Details:
- These kernels are implemented by Field G. Van Zee as part of TRSM SUP
implementation with commit-ID 9e31f5e8553f8ae99cfe8a80052fc63499e0891a.
AMD-Internal: [CPUPL-1376]
Change-Id: Ib39a87fc20571ae9aeff82c9b87516ac583093c2
1. The SquarePacked algorithm targets an efficient zgemm/dgemm implementation for square matrix sizes (m = k = n).
2. A variation of the 3m algorithm (3m_sqp) is implemented to allow a single load and store of the C matrix in the kernel.
3. Currently the method supports only m a multiple of 8; residue cases are to be implemented later.
4. The dgemm real kernel (dgemm_sqp) is implemented without alpha/beta multiplication,
since real alpha and beta scaling are handled in the 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.
Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]
Modified dgemm_ to be able to call the small_gemm 16x3 kernel.
small_gemm is called if ((m + n - k) < 2000 && (m + k - n) < 2000 && (n + k - m) < 2000 && n > 2).
In the small_gemm kernel, if m, n, or k = 0 we return immediately and the case is handled by the sup or native kernels.
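The dispatch rule above can be sketched as a predicate (illustrative, not the exact BLIS source):

```c
#include <stdbool.h>

/* Sketch of the small_gemm dispatch condition described above. */
bool use_small_gemm(int m, int n, int k)
{
    if (m == 0 || n == 0 || k == 0)
        return false; /* handled by the sup or native kernels */
    return (m + n - k) < 2000 &&
           (m + k - n) < 2000 &&
           (n + k - m) < 2000 &&
           n > 2;
}
```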
[CPUPL - 1376]
Change-Id: I61c2b36ad0ae4fb3dd23bc37c2b6c78556b3105b
TRSM API: solves AX = B, where the solution X overwrites B.
Case 1: Call TRSV when B is a vector and A is a matrix,
i.e., when n = 1 for the left side and m = 1 for the right side.
Case 2: Divide B by A when B is a vector and A is a scalar (diagonal element),
i.e., when m = 1 for the left side and n = 1 for the right side.
For the right side, transpose the complete operation, changing upper to lower and
vice versa when A is transposed.
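Case 2 on the left side reduces to a scalar division: with m = 1, A is its single diagonal element, so solving A*X = B is just X = B / alpha11. A minimal sketch (hypothetical names and stride):

```c
/* TRSM left-side case with m = 1: A reduces to its single diagonal
 * element alpha11, so X = B / alpha11. Illustrative sketch. */
void trsm_left_m1(int n, double alpha11, double *b, int incb)
{
    for (int j = 0; j < n; ++j)
        b[j * incb] /= alpha11;
}
```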
Change-Id: Ib020f2a568f04a6e8d8f75bfc38adbfd7c5d175a
1. Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2. Change made in bli_zgemmsup_rv_zen_asm_3x4n.
3. Change made in bli_zgemmsup_rv_zen_asm_3x4m.
4. Change made in bli_zgemm_haswell_asm_3x4.
Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
Case 1: Call TRSV when matrices C and B are vectors and A is a matrix,
i.e., when n = 1 for the left side and m = 1 for the right side.
Case 2: Divide B by A when C and B are vectors and A is a scalar (diagonal element),
i.e., when m = 1 for the left side and n = 1 for the right side.
For the right side, transpose the complete operation, changing upper to lower and
vice versa when A is transposed.
Change-Id: Ie87e4a263c287ba554832ccc56b629f982e3ac4c
Details:
- Added a new AXPYF kernel with fuse_factor = 4 and iter_unroll = 4.
- Modified the BLAS interface of GEMM to call GEMV whenever m = 1 or n = 1.
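When n = 1, GEMM degenerates exactly to a GEMV, y = alpha*A*x + beta*y, with x as B's single column and y as C's single column (the m = 1 case is the transposed analogue). A reference sketch (column-major A, illustrative names):

```c
/* GEMV reference: y = alpha*A(m x k)*x + beta*y, column-major A with
 * leading dimension lda. Illustrative sketch, not the BLIS kernel. */
void gemv_ref(int m, int k, double alpha, const double *a, int lda,
              const double *x, double beta, double *y)
{
    for (int i = 0; i < m; ++i) {
        double dot = 0.0;
        for (int p = 0; p < k; ++p)
            dot += a[i + p*lda] * x[p];
        y[i] = beta * y[i] + alpha * dot;
    }
}
```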
Change-Id: I3f5acd37b009f53cf63f462cec79fd3e73676dbc
Description:
[AMD Internal]: CPUPL-1336
Removed extra/unnecessary loads in dgemmsup kernels that were
accessing memory beyond the buffer boundaries and causing segmentation
faults.
Kernels:
bli_dgemmsup_rd_haswell_asm_1x4
bli_dgemmsup_rv_haswell_asm_1x6
Change-Id: Idaeed36ebd9f13550943394a37e372b8d015b2d3
Details:
- Added SIMD code
- Processing 5 rows at a time in SIMD loop to improve performance
AMD-Internal: [CPUPL-1054]
Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
Details
- Added framework optimizations for the BLAS and CBLAS interfaces of caxpyv_ (cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv).
- Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family.
AMD-Internal: [CPUPL-1231]
Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7
Details:
- The kernel is called directly from the API call to avoid framework overhead for the complex float and complex double precisions.
- Added SIMD code for complex float and complex double and unrolled the loop 5 times to improve performance.
AMD-Internal: [CPUPL-1057]
Change-Id: I3b9d202398cacc0168882c9d6da2b450c27466a0
Details:
- Introduced a new macro 'BLIS_CONFIG_EPYC' to enable blas and cblas
framework optimizations for zen family configurations.
- The macro needs to be defined in family.h files of respective arch
configs.
- Moved zen2-specific optimized kernels to zen folder, in order to be
accessible to all zen family architectures.
Change-Id: I8da2db6b7ab22ef350a01d86c214006e812eb06d
Details:
- Added debug trace support for DGEMMT and DTRSM APIs.
- Added log support for gemmt, trsm APIs.
- Modified gemm dump_sizes function to dump transpose parameters.
AMD-Internal: [CPUPL-1210]
Change-Id: Ice1effe27ec349203ce5def030a6b85b204bd91e
Details:
- For GEMV, whenever beta = 0 we should not scale vector 'y' with beta;
instead, overwrite the 'y' vector with zeroes before carrying out the
operation.
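The reason for overwriting rather than scaling: multiplying by beta = 0 would let a pre-existing NaN or Inf in y propagate into the result (NaN * 0 = NaN). A sketch of the fixed behavior (illustrative):

```c
#include <math.h>

/* When beta == 0, overwrite y with zeros instead of multiplying by
 * beta, so a stale NaN/Inf in y cannot leak into the result.
 * Illustrative sketch. */
void gemv_scale_y(int m, double beta, double *y)
{
    if (beta == 0.0) {
        for (int i = 0; i < m; ++i)
            y[i] = 0.0;          /* overwrite, do not scale */
    } else {
        for (int i = 0; i < m; ++i)
            y[i] *= beta;
    }
}
```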
Change-Id: I159afba6c6ac3b72b74718fab7a4f4ec293012c5
- Bug fix in the sgemmsup 1x16 kernel for beta = 0 with column-stored C:
the rcx register increment was missing, because of which 4 output
values were overwritten.
Change-Id: Ia3028040dce3e615f1db5a331498d86faadcf916
Details:
- Problem:
If row major, the first four elements of the last column of output matrix C were not updated.
If col major, the first four elements of the last row of output matrix C were not updated.
- Solution:
Update the elements at the right offset after computation in bli_dgemmsup_rv_haswell_asm_5x8().
Change-Id: I588c60f2f3cd5f51e475cfc140e3bf0e9d5a4dae
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
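The acceptance rule described above can be sketched as a predicate (illustrative; the real handler consults the cntx_t thresholds):

```c
#include <stdbool.h>

/* Default sup acceptance rule: accept if at least one dimension is
 * <= its threshold; reject (fall back to the conventional path) if
 * all dimensions exceed theirs. With all thresholds zero, the sup
 * path is effectively disabled for nonzero problem sizes.
 * Illustrative sketch. */
bool sup_thresh_accept(int m, int n, int k, int mt, int nt, int kt)
{
    return m <= mt || n <= nt || k <= kt;
}
```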
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
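The structure of such a reference gemmsup microkernel can be sketched as follows: runtime (not constant) loop bounds, the k loop innermost, and A, B, C accessed in place through general strides (illustrative only, not the file's actual contents):

```c
/* Reference-style gemmsup microkernel sketch: runtime loop bounds
 * and the k loop innermost, reading unpacked A, B, C via general
 * row/column strides. Illustrative only. */
void gemmsup_ref_sketch(int m, int n, int k, double alpha,
                        const double *a, int rs_a, int cs_a,
                        const double *b, int rs_b, int cs_b,
                        double beta, double *c, int rs_c, int cs_c)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double ab = 0.0;
            for (int p = 0; p < k; ++p)      /* inner loop over k */
                ab += a[i*rs_a + p*cs_a] * b[p*rs_b + j*cs_b];
            c[i*rs_c + j*cs_c] = beta * c[i*rs_c + j*cs_c] + alpha * ab;
        }
}
```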
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
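The lazy-stride convention above amounts to a small adjustment step; a sketch of the logic (illustrative, not the actual bli_adjust_strides() source):

```c
/* Lazy stride interpretation for an m x n object: "-1, -1" requests
 * row storage (rs = n, cs = 1) and "0, 0" requests column storage
 * (rs = 1, cs = m). Illustrative sketch. */
void adjust_strides_sketch(int m, int n, int *rs, int *cs)
{
    if (*rs == -1 && *cs == -1) { *rs = n; *cs = 1; }    /* row-major */
    else if (*rs == 0 && *cs == 0) { *rs = 1; *cs = m; } /* col-major */
}
```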
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
various n = 6 edge cases with a single sup kernel call. Previously,
only n = {4,2,1} were handled explicitly as single kernel calls;
that is, cases where n = 6 were previously being executed via two
kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
Details:
- Fixed a few not-really-bugs:
- Previously, the d6x8m kernels were still prefetching the next upanel
of A using MR*rs_a instead of ps_a (same for prefetching of next
upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
that the upanels might be packed, using ps_a or ps_b is the correct
way to compute the prefetch address.
- Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
executed as intended even though it was based on a faulty pointer
management. Basically, in the rd_d6x8m kernel, the pointer for B
(stored in rdx) was loaded only once, outside of the jj loop, and in
the second iteration its new position was calculated by incrementing
rdx by the *absolute* offset (four columns), which happened to be the
same as the relative offset (also four columns) that was needed. It
worked only because that loop only executed twice. A similar issue
was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
- Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
that it is loaded only once outside of the loops rather than
multiple times inside the loops.
- Changed outer loop in rd kernels so that the jump/comparison and
loop bounds more closely mimic what you'd see in higher-level source
code. That is, something like:
for( i = 0; i < 6; i+=3 )
rather than something like:
for( i = 0; i <= 3; i+=3 )
- Switched row-based IO to use byte offsets instead of byte column
strides (e.g. via rsi register), which were known to be 8 anyway
since otherwise that conditional branch wouldn't have executed.
- Cleaned up and homogenized prefetching a bit.
- Updated the comments that show the before and after of the
in-register transpositions.
- Added comments to column-based IO cases to indicate which columns
are being accessed/updated.
- Added rbp register to clobber lists.
- Removed some dead (commented out) code.
- Fixed some copy-paste typos in comments in the rv_6x8n kernels.
- Cleaned up whitespace (including leading ws -> tabs).
- Moved edge case (non-milli) kernels to their own directory, d6x8,
and split them into separate files based on the "NR" value of the
kernels (Mx8, Mx4, Mx2, etc.).
- Moved config-specific reference Mx1 kernels into their own file
(e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
- Added rd_dMx1 assembly kernels, which seem marginally faster than
the corresponding reference kernels.
- Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
the row-oriented reference kernels for all storage combos.
- Users can now specify the zen3 configuration;
currently it reuses block sizes and kernels from zen2.
- Auto-configuration can detect zen3 and enable the config if needed.
- Added support for an amd64 bundle which contains all Zen platforms.
- Moved the existing amd bundle to amd64 legacy.
AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
This library was ported to Windows 10 using CMake scripts and Visual Studio 2019 with the clang compiler.
AMD internal:[CPUPL-657]
Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
Multiple trace levels allow the user to set the nested call level
up to which traces are limited. This also reduces file size
requirements.
Also optimized auto-trace output to reduce file size by removing
thread IDs from individual lines.
AMD Internal: [CPUPL-806]
Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd
Added traces from the BLAS/CBLAS APIs down to the kernels for dgemm and sgemm.
By default the traces are disabled; users need to enable them
in their local workspace (see the aocl_dtl/aocldtlcf.h file).
AMD Internal : CPUPL-806
Change-Id: I83b310509fb1a599c114387192bcf882ef0480f9
Details:
- Optimized saxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of sgemv to remove dependency on cntx
variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified the BLAS interface call for SGEMV to reduce framework overhead.
- Currently these changes are applicable for zen2 configuration.
Change-Id: Iabc36ae640e82e65f8764f3c6dee513ad64b22fd
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-707]
A failure was seen in the libflame function FLASH_UDdate_UT_inc,
due to typecasting a double complex pointer as a double pointer.
Change-Id: If6e2f4663575450a13a9a07dddd5622628f5c6b0
Details:
ymm registers were being used, storing 8 float values instead of 4;
changed the registers from ymm to xmm in the required places. The issue
shows up only when the leading dimension is greater than the actual dimension.
Change-Id: I39f04eac18c4fa3a8c93048c977d6a83aa92b800
Details:
- Added support for N SUP kernels for complex float and complex double.
- Removed prefetching in M SUP kernels for complex float and complex double.
- Removed all warnings.
Change-Id: I05ffde0f0613681927fe7576db7f5f1a4486fd05
Details:
Added two new kernels, bli_sgemmsup_rd_zen_asm_6x16m and bli_sgemmsup_rd_zen_asm_6x16n,
to support the dot product in row major (A * Transpose(B)) and in column major (Transpose(A) * B).
Change-Id: I264fd75c4c4b68fb7dc4fd229eaa44d09e9f3432
Removed the conditional check for (*kappa_cast == 1.0) because it is always 1.0 in
the DGEMM packing kernels.
[CPUPL-636]
Change-Id: Ib04f2a3cdbb0f138036a8b0486d1dec073e40407