Details:
- Previously, the entry for shiftd in the Operation index section of
BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
helping find this incorrect link.
Details:
- Fixed a few not-really-bugs:
- Previously, the d6x8m kernels were still prefetching the next upanel
of A using MR*rs_a instead of ps_a (same for prefetching of next
upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
that the upanels might be packed, using ps_a or ps_b is the correct
way to compute the prefetch address.
- Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
executed as intended even though it was based on faulty pointer
management. Basically, in the rd_d6x8m kernel, the pointer for B
(stored in rdx) was loaded only once, outside of the jj loop, and in
the second iteration its new position was calculated by incrementing
rdx by the *absolute* offset (four columns), which happened to be the
same as the relative offset (also four columns) that was needed. It
worked only because that loop only executed twice. A similar issue
was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
- Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
that it is loaded only once outside of the loops rather than
multiple times inside the loops.
- Changed outer loop in rd kernels so that the jump/comparison and
loop bounds more closely mimic what you'd see in higher-level source
code. That is, something like:
for( i = 0; i < 6; i+=3 )
rather than something like:
for( i = 0; i <= 3; i+=3 )
- Switched row-based IO to use byte offsets instead of byte column
strides (e.g. via rsi register), which were known to be 8 anyway
since otherwise that conditional branch wouldn't have executed.
- Cleaned up and homogenized prefetching a bit.
- Updated the comments that show the before and after of the
in-register transpositions.
- Added comments to column-based IO cases to indicate which columns
are being accessed/updated.
- Added rbp register to clobber lists.
- Removed some dead (commented out) code.
- Fixed some copy-paste typos in comments in the rv_6x8n kernels.
- Cleaned up whitespace (including leading ws -> tabs).
- Moved edge case (non-milli) kernels to their own directory, d6x8,
and split them into separate files based on the "NR" value of the
kernels (Mx8, Mx4, Mx2, etc.).
- Moved config-specific reference Mx1 kernels into their own file
(e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
- Added rd_dMx1 assembly kernels, which seem marginally faster than
the corresponding reference kernels.
- Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
the row-oriented reference kernels for all storage combos.
Details:
- Reran all existing single-threaded performance experiments comparing
BLIS sup to other implementations (including the conventional code
path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
(Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
address is equal to either &BLIS_GEMM_SINGLE_THREADED or
&BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
bli_l3_sup_decor_single.c that (by default) disables code that
creates and frees the thrinfo_t tree and instead passes
&BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
sup implementation.
- The net effect of the above changes is that a small amount of
thrinfo_t overhead is avoided when running small/skinny dgemm
problems when BLIS is compiled with multithreading disabled.
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
Details:
- Fixed an error that manifests only when using C++ (specifically,
modern versions of g++) to compile drivers in 'test' (and likely most
other application code that #includes blis.h). Thanks to Ajay Panyala
for reporting this issue (#374).
Details:
- Since C is triangular, in order to maintain load balance among
threads, we need to use weighted range partitioning.
Change-Id: I03d8ff71ac7af843acd787f1389b5907b56453ee
- User can now specify the zen3 configuration; currently it reuses
block sizes and kernels from zen2.
- Auto-configuration can detect zen3 hardware and enable the zen3
config if needed.
- Added support for an amd64 bundle which contains all zen platforms.
- Moved the existing amd bundle to amd64 legacy.
AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
Details:
- Since the GEMM kernel prefers row storage, if the input C matrix is in
column-major order, the entire operation is transposed. In that case
uplo(c) needs to be toggled before kernel-variant selection.
- Disabled "bli_gemmsup_ref_var1n2m_opt_cases" inside gemmtsup.
- Updated version number to 2.2.1
Change-Id: I0a85df1141fc4a98d98ea4e0c3d42db8602fa69b
1) Added a dcomplex-based zdotc_ version as a function with an
additional parameter.
2) The single, double, and complex datatype functions are retained as
macros.
3) This modification handles ZDOTC_ invocation from Fortran-based
applications for the 'double complex' datatype.
4) The modifications are placed under the macro 'AOCL_F2C'.
5) The BLIS and BLAS test suites were verified ALL PASS with GCC and
Flang, with and without the 'AOCL_F2C' macro, on an Ubuntu machine.
6) Added BLIS_EXPORT_BLAS to make the APIs visible when linking a DLL.
Change-Id: I4ada39a73f416e3794708f5b55e947342c261117
Signed-off-by: Meghana <Meghana.Vankadari@amd.com>, Nagendra <Nagendra.PrasadM@amd.com>
AMD-Internal: [SWLCSG-177]
Details:
- Added framework code for GEMMT SUP.
- Implemented SUP for GEMMT using similar techniques as native path.
- Moved update routines to frame/util folder.
- Ported update routines for complex datatypes.
Change-Id: I17adfd0586d07f5a23dca6a07b2d48f4c9fcf71c
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>,
Dipal M Zambare <DipalMadhukar.Zambare@amd.com>,
Mangala V <managala.v@amd.com>
Details:
- Added a new API which computes a matrix-matrix product with general
matrices but updates only the upper or lower triangular part of the
result matrix: cblas_?gemmt() and ?gemmt_().
- These routines are similar to the ?gemm routines, but they only access
and update a triangular part of the square result matrix.
- Added DGEMMT functionality by reusing GEMM kernels.
- Created a new folder for GEMMT under l3, and added GEMMT specific
framework code.
- Modified cntl_create routine to choose different macro kernel for
GEMMT.
- Added routines to copy lower/upper triangular part of a block to the
buffer.
- Defined BLIS, BLAS and CBLAS interface APIs for GEMMT.
- Added test_gemmt.c to the test folder and updated the Makefile.
- Added a macro 'CBLAS' in test_gemm.c to call CBLAS APIs.
Change-Id: Ie00c1a15b9c654b65c687a9ca781cbc6f9641791
Details:
- Fixed an innocuous bug that manifested when running the testsuite on
extremely small matrices with randomization via the "powers of 2 in
narrow precision range" option enabled. When the randomization
function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
then compute 0.0/0.0 during the normalization process, which leads to
NaN residuals. The solution entails smarter implementations of randv,
randnv, randm, and randnm, each of which will compute the 1-norm of
the vector or matrix in question. If the object has a 1-norm of 0.0,
the object is re-randomized until the 1-norm is not 0.0. Thanks to
Kiran Varaganti for reporting this issue (#413).
- Updated the implementation of randm_unb_var1() so that it loops over
a call to the randv_unb_var1() implementation directly rather than
calling it indirectly via randv(). This was done to avoid the overhead
of multiple calls to norm1v() when randomizing the rows/columns of a
matrix.
- Updated comments.
Change-Id: I0e3d65ff97b26afde614da746e17ed33646839d1
This library was ported to Windows 10 using CMake scripts and Visual
Studio 2019 with the clang compiler.
AMD internal:[CPUPL-657]
Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
Added a BLIS-specific extension to AOCL DTL; as part of this, added
support for printing the input matrix sizes from the BLIS library.
AMD Internal: [CPUPL-806]
Change-Id: I80ed779d65f9b1c48466137fc2f05629fa2fb561
Multiple trace levels allow the user to limit traces to a given depth
of nested calls. This also reduces file size requirements.
Also optimized the auto-trace output to reduce file size by removing
thread IDs from individual lines.
AMD Internal: [CPUPL-806]
Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd
Added traces from the BLAS/CBLAS APIs down to the kernels for dgemm and
sgemm. By default the traces are disabled; users need to enable them
in their local workspace. Please check the aocl_dtl/aocldtlcf.h file.
AMD Internal : CPUPL-806
Change-Id: I83b310509fb1a599c114387192bcf882ef0480f9
Details:
- Optimized saxpyf kernel with fuse_factor=5 and iter_unroll=2.
- Modified framework files of sgemv to remove dependency on cntx
variable.
- Updated cntx_init file of zen2 to choose optimized kernels.
- Modified the BLAS interface call for SGEMV to reduce framework overhead.
- Currently these changes are applicable for zen2 configuration.
Change-Id: Iabc36ae640e82e65f8764f3c6dee513ad64b22fd
Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>
AMD-Internal: [CPUPL-707]
This ensures an early return when full gemm processing is not needed.
Depending on which dimension is found to be zero, the following actions
are taken:
- If 'c' has a zero dimension, no further processing is required.
- If alpha is zero, or if 'a' or 'b' has a zero dimension, we perform
a scalm operation (c = beta*c) instead of gemm.
Change-Id: Icc031944fc4e80138adf991974547f2d57ab570b
AMD-Internal: [CPUPL-904]
A failure was seen in the libflame function FLASH_UDdate_UT_inc,
due to typecasting a double complex pointer as a double pointer.
Change-Id: If6e2f4663575450a13a9a07dddd5622628f5c6b0
Details:
ymm registers store 8 float values rather than 4, so using them could
touch memory beyond the intended elements. Changed registers from ymm
to xmm in the required places. This issue manifests only when the
leading dimension is greater than the actual dimension.
Change-Id: I39f04eac18c4fa3a8c93048c977d6a83aa92b800
Details:
- Added support for N SUP kernels for complex float and complex double.
- Removed prefetching in M SUP kernels for complex float and complex
double.
- Removed all warnings.
Change-Id: I05ffde0f0613681927fe7576db7f5f1a4486fd05
Details:
Added two new kernels, bli_sgemmsup_rd_zen_asm_6x16m and
bli_sgemmsup_rd_zen_asm_6x16n, to support the dot product in row major
(A * Transpose(B)) and in column major (Transpose(A) * B).
Change-Id: I264fd75c4c4b68fb7dc4fd229eaa44d09e9f3432
Removed the conditional check for (*kappa_cast == 1.0) because it is
always 1.0 in DGEMM packing kernels.
[CPUPL-636]
Change-Id: Ib04f2a3cdbb0f138036a8b0486d1dec073e40407
[CPUPL-858] Packing kernels for the dgemm 6x8 kernel are added
explicitly for the zen2 configuration. Apart from the generic packing
kernels used by level-3 routines for all combinations of the input
parameters, we introduced DGEMM-specific packing kernels for the case
where op(A) and op(B) are both no-transpose. This helps us vectorize
these packing kernels and eliminate unnecessary conditional branch
checks. The packing kernels are also optimized at the boundary. These
boundary-condition optimizations help when the input matrix dimensions
"m" and "n" are not multiples of the register block sizes "MR" and "NR".
The typical DGEMM operation is C = beta*C + alpha*op(A)*op(B). Kindly
note that the multiplication with alpha is handled inside the kernel,
hence in these dgemm packing routines alpha is always considered 1.0.
These routines are "bli_dpackm_8xk_nn_zen" and "bli_dpackm_6xk_nn_zen".
The generic packing routines are "bli_dpackm_6xk_gen_zen" and
"bli_dpackm_8xk_gen_zen". These routines are enabled from
"bli_cntx_init_zen2()" through bli_cntx_set_packm_kers(). In this
checkout the generic packing kernels are enabled by default. Later we
will introduce a run-time mechanism to choose between these packing
kernels based on the DGEMM input parameters.
Change-Id: I079b4dce0757d558224cb8c55d024bfea6a4de91
This bug happens at a corner case, when k_iter == 0 and we jump to
CONSIDERKLEFT.
In the current design, the first row/col. of a and b are loaded twice.
The fix is to rearrange a and b (first row/col.) loading instructions.
Change-Id: I4a985a3abf9b1e7a0ee29e17c7d39a4a27138c4c
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
Details:
- Added a function definition for xerbla_array_(), which largely mirrors
its netlib implementation. Thanks to Isuru Fernando for suggesting the
addition of this function.
Change-Id: Ie9c619f5604e60a32edfda2db2b66f0c762581d3
Details:
- Added Perl to list of prerequisites for building BLIS. This is in part
(and perhaps completely?) due to some substitution commands used at
the end of configure that include '\n' characters that are not
properly interpreted by the version of sed included on some versions
of OS X. This new documentation addresses issue #398.
This adds two kernels for the Arm SVE vector extension:
1. a gemm kernel for double at size 8x8.
2. a packm kernel for double at dimension 8xk.
To achieve the best performance, vector-length-agnostic programming is
not used. A vector length (VL) of 256 bits is mandated in both kernels.
Kernels supporting other VLs can be added later.
"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code." [1]
[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>