Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c that resulted from
changing the function interface of the thread entry point function
(of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
a memory leak in the sba at bli_finalize() time. It turns out that,
due to the new multithreading-capable variant code using thrinfo_t
objects--specifically, its calls to bli_thrinfo_grow()--we
have to pass in a real thrinfo_t object rather than the global
objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
Thus, I inserted the appropriate logic from the OpenMP and pthreads
versions so that single-threaded execution would work as intended
with the newly upgraded variants.
Change-Id: I2bfff849abf3fa30c73e0c5876128400854bbcb5
Details:
- This commit optimizes single-threaded and multithreaded DTRSM
performance on zen2.
- The optimization uses different MC, KC and NC values for TRSM than
those used by other level-3 routines such as DGEMM.
- Changed the TRSM framework code to choose these blocksizes for TRSM
on zen family configurations.
- Added a new field, "trsm_blkszs", to the cntx structure in order to
store TRSM-specific block sizes.
- Implemented routines to initialize, set and query the TRSM-specific
block sizes.
- Defined a new macro, "AOCL_BLIS_ZEN", in the configure script.
This macro is automatically defined for zen family architectures.
It lets us choose different cache block sizes for TRSM instead of the
common level-3 block sizes.
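As a rough illustration of the idea of keeping TRSM-specific block sizes next to the common level-3 ones, here is a minimal sketch. Every name in it (toy_cntx_t, toy_bszid_t, the toy_cntx_* functions) is hypothetical and is not the BLIS cntx API, and the fallback to the common values when no TRSM values are set is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block-size identifiers (not the BLIS bszid_t enum). */
typedef enum { TOY_MC = 0, TOY_KC, TOY_NC, TOY_NUM_BLKSZS } toy_bszid_t;

/* Hypothetical, simplified context: one table shared by level-3
   operations plus an optional TRSM-specific table. */
typedef struct
{
    size_t l3_blkszs[TOY_NUM_BLKSZS];
    size_t trsm_blkszs[TOY_NUM_BLKSZS];
    bool   trsm_blkszs_set;
} toy_cntx_t;

/* Initialize the common block sizes and clear the TRSM-specific ones. */
void toy_cntx_init(toy_cntx_t *cntx, size_t mc, size_t kc, size_t nc)
{
    cntx->l3_blkszs[TOY_MC] = mc;
    cntx->l3_blkszs[TOY_KC] = kc;
    cntx->l3_blkszs[TOY_NC] = nc;
    for (int i = 0; i < TOY_NUM_BLKSZS; ++i)
        cntx->trsm_blkszs[i] = 0;
    cntx->trsm_blkszs_set = false;
}

/* Register TRSM-specific block sizes. */
void toy_cntx_set_trsm_blkszs(toy_cntx_t *cntx, size_t mc, size_t kc, size_t nc)
{
    cntx->trsm_blkszs[TOY_MC] = mc;
    cntx->trsm_blkszs[TOY_KC] = kc;
    cntx->trsm_blkszs[TOY_NC] = nc;
    cntx->trsm_blkszs_set = true;
}

/* Query used by gemm-like operations. */
size_t toy_cntx_get_l3_blksz(const toy_cntx_t *cntx, toy_bszid_t id)
{
    return cntx->l3_blkszs[id];
}

/* Query used by TRSM. Assumption: fall back to the common values
   when no TRSM-specific values were registered. */
size_t toy_cntx_get_trsm_blksz(const toy_cntx_t *cntx, toy_bszid_t id)
{
    return cntx->trsm_blkszs_set ? cntx->trsm_blkszs[id]
                                 : cntx->l3_blkszs[id];
}
```

The numeric block sizes passed to these functions would come from tuning; none are implied by this message.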
Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6
Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com>
AMD-Internal: [CPUPL-656]
In the 4x8 kernel, cs_b was used instead of cs_a to calculate the addresses of the diagonal elements of matrix A.
This commit corrects the mistake.
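For reference, a minimal sketch of the addressing involved, assuming the usual general-stride convention in which element (i, j) of A lives at a + i*rs_a + j*cs_a. The helper name diag_ptr is hypothetical and does not appear in the kernel.

```c
#include <stddef.h>

/* Address of the i-th diagonal element of A under general strides.
   Both strides must belong to A; substituting B's column stride
   (cs_b) only gives the right address when cs_b happens to equal
   cs_a. */
const double *diag_ptr(const double *a, size_t i, size_t rs_a, size_t cs_a)
{
    return a + i * rs_a + i * cs_a;
}
```

With a column-major 3x3 matrix (rs_a = 1, cs_a = 3), the diagonal elements sit at offsets 0, 4 and 8 from the base pointer.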
Change-Id: Ie74e0f6a397fcd32fefb5804cd00f1e90bfe5523
Interchanged some loops to favour column-major storage.
Added a check to identify the last column and load it using a 'for' loop, avoiding memory accesses outside the buffer.
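A minimal sketch of both ideas, under stated assumptions: the matrix is column-major with leading dimension lda, the buffer holds exactly lda*(n-1)+m elements, and a 4-element memcpy stands in for a 4-wide vector load. The function names are hypothetical, and this is one reading of the change rather than the kernel code itself.

```c
#include <stddef.h>
#include <string.h>

/* Loop order that favours column-major storage: columns in the outer
   loop, rows in the inner loop, so the inner loop is unit-stride. */
void scale_colmajor(double *a, size_t m, size_t n, size_t lda, double alpha)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            a[i + j * lda] *= alpha;
}

/* Load 4 consecutive elements starting at (i, j). If a 4-wide read
   would run past the end of the buffer (with lda >= 4 this can only
   happen in the last column), fall back to an element-wise 'for'
   loop and zero-fill the lanes that lie outside the matrix.
   In the 4-wide path, lanes with i + k >= m hold neighbouring data
   and must be ignored by the caller. */
void load4_guarded(const double *a, size_t m, size_t n, size_t lda,
                   size_t i, size_t j, double out[4])
{
    size_t start   = i + j * lda;
    size_t buf_len = lda * (n - 1) + m;

    if (start + 4 <= buf_len) {
        memcpy(out, &a[start], 4 * sizeof(double));
    } else {
        for (size_t k = 0; k < 4; ++k)
            out[k] = (i + k < m) ? a[start + k] : 0.0;
    }
}
```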
Change-Id: Id5d2e16c65017a7f4b641d33228d23903efd09ac
For matrix sizes which are not multiples of 4, the trsm_small kernels access memory outside the allocated buffers, which causes a segmentation fault.
This is fixed by handling each of the corner cases separately.
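The general pattern behind such a fix can be sketched on a much simpler operation than TRSM. The function below (a hypothetical axpy_blocked4, not part of trsm_small) processes full blocks of 4 in the main loop and handles the 1 to 3 leftover elements in a separate loop, so it never touches memory past index n-1.

```c
#include <stddef.h>

/* y := y + alpha * x, processed in blocks of 4 with a separate
   remainder loop. */
void axpy_blocked4(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;

    /* Main loop: only runs while a full block of 4 fits, so the
       accesses at i+3 are always in bounds. */
    for (; i + 4 <= n; i += 4) {
        y[i + 0] += alpha * x[i + 0];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }

    /* Corner case: the remaining n % 4 elements, one at a time. */
    for (; i < n; ++i)
        y[i] += alpha * x[i];
}
```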
Change-Id: Ia7cfad5d65339a209a7376cc1654382593c933af
The amd64 family supports all the architectures before zen.
Assigned (BLIS_ARCH_GENERIC+1) to BLIS_NUM_ARCHS so that it does not need to be updated for every new architecture.
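The technique can be sketched with a toy enum. The identifiers below are hypothetical stand-ins, not the real BLIS arch_t values; the point is only that the count is derived from the last enumerator, so adding an architecture before the generic entry updates the count automatically.

```c
/* Hypothetical architecture IDs; the generic entry is kept last. */
typedef enum
{
    TOY_ARCH_ZEN2 = 0,
    TOY_ARCH_ZEN,
    TOY_ARCH_HASWELL,
    /* New architectures are added here, above the generic entry. */
    TOY_ARCH_GENERIC
} toy_arch_t;

/* Derived from the last enumerator rather than hard-coded. */
#define TOY_NUM_ARCHS (TOY_ARCH_GENERIC + 1)

/* A table sized by the derived count. */
static const char *toy_arch_names[TOY_NUM_ARCHS] =
{
    "zen2", "zen", "haswell", "generic"
};

/* Map an ID to its name, guarding against out-of-range values. */
const char *toy_arch_name(toy_arch_t id)
{
    int i = (int)id;
    return (i >= 0 && i < TOY_NUM_ARCHS) ? toy_arch_names[i] : "unknown";
}
```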
Change-Id: I8241e643f6dfd0ebe272e053ca8b6a9c1463d9dc
Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP
and pthreads so that those builds don't fail when performing shared
library linking (especially for Windows DLLs via AppVeyor). For now,
these dummy implementations of bli_l3_sup_thread_decorator() are
merely carbon copies of the implementation provided for
single-threaded execution (i.e., the one found in
bli_l3_sup_decor_single.c).
Thus, an OpenMP or pthreads build will be able to use the gemmsup
code (including the new selective packing functionality), as it did
before 39fa7136, even though it will not actually employ any
multithreaded parallelism.
Details:
- Implemented optional packing for A or B (or both) within the sup
framework (which currently only supports gemm). The request for
packing either matrix A or matrix B can be made via setting
environment variables BLIS_PACK_A or BLIS_PACK_B (to any
non-zero value; if set, zero means "disable packing"). It can also
be made globally at runtime via bli_pack_set_pack_a() and
bli_pack_set_pack_b() or with individual rntm_t objects via
bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
interface of either the BLIS typed or object APIs. (If using the
BLAS API, environment variables are the only way to communicate the
packing request.)
- One caveat (for now) with the current implementation of selective
packing is that any blocksize extension registered in the _cntx_init
function (such as is currently used by haswell and zen subconfigs)
will be ignored if the affected matrix is packed. The reason is
simply that I didn't get around to implementing the necessary logic
to pack a larger edge-case micropanel, though this is entirely
possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
with corresponding headers, in which higher-level packm-related
functions are defined for use within the sup framework. The actual
packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
always NULL), and pointer to a thrinfo_t* (which for now is the address
of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
the millikernel can query the panel stride of the packed matrix and
step through it accordingly. If the matrix isn't packed, the panel
stride of interest for the given millikernel will be set to the
appropriate value so that the mkernel may step through the unpacked
matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
panel strides (ps_a and ps_b, respectively) instead of computing them
on the fly.
- Spun off the environment variable getting and setting functions into
a new file, bli_env.c (with a corresponding prototype header). These
functions are now used by the threading infrastructure (e.g.
BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
This means that the function bli_thread_init_rntm() was renamed to
bli_rntm_init_from_global() and relocated accordingly.
- Added a new file, bli_pack.c, which serves as the home for
functions that manage the pack_a and pack_b fields of the global
rntm_t, including from environment variables, just as we have
functions to manage the threading fields of the global rntm_t in
bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
spinning off the bli_l3_thread_decorator() functions into their own
files. This change makes more sense when considering the further
addition of bli_l3_sup_thread_decorator() functions (for now limited
only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
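The environment-variable convention described for BLIS_PACK_A and BLIS_PACK_B (any non-zero value requests packing, zero disables it) can be sketched as follows. This is not the code in bli_env.c: the function names are hypothetical, and treating an unset or non-numeric variable as "packing disabled" is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Interpret a string as a packing request: any non-zero integer
   means "pack", zero means "do not pack". Assumption: NULL, empty,
   or non-numeric input also means "do not pack". */
bool pack_flag_from_string(const char *s)
{
    if (s == NULL || *s == '\0')
        return false;

    char *end;
    long  v = strtol(s, &end, 10);

    /* No digits were consumed: treat as not requested. */
    if (end == s)
        return false;

    return v != 0;
}

/* Look up an environment variable such as BLIS_PACK_A and apply
   the convention above. */
bool pack_requested(const char *env_name)
{
    return pack_flag_from_string(getenv(env_name));
}
```

A caller would evaluate pack_requested("BLIS_PACK_A") once at initialization and cache the result, rather than querying the environment on every call.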
Replaced the global buffer used for packing with buffers provided by
the memory pools. These buffers are checked out at the beginning of each
call and returned to the pool once done.
Please check the comments in the above functions for details.
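The checkout/return pattern can be sketched with a toy fixed-size pool. This is a single-threaded illustration only: the toy_pool_* names are hypothetical, and it does not reproduce the BLIS memory pool or its locking.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_BLOCKS 4

/* A toy pool of equally sized blocks with an in-use flag per block. */
typedef struct
{
    void *blocks[TOY_POOL_BLOCKS];
    int   in_use[TOY_POOL_BLOCKS];
} toy_pool_t;

/* Allocate all blocks up front. Returns 0 on success, -1 on failure. */
int toy_pool_init(toy_pool_t *p, size_t block_size)
{
    for (int i = 0; i < TOY_POOL_BLOCKS; ++i) {
        p->blocks[i] = malloc(block_size);
        if (p->blocks[i] == NULL) {
            while (i-- > 0)
                free(p->blocks[i]);
            return -1;
        }
        p->in_use[i] = 0;
    }
    return 0;
}

/* Check out the first free block, or NULL if the pool is exhausted. */
void *toy_pool_checkout(toy_pool_t *p)
{
    for (int i = 0; i < TOY_POOL_BLOCKS; ++i) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return p->blocks[i];
        }
    }
    return NULL;
}

/* Return a previously checked-out block to the pool. */
void toy_pool_return(toy_pool_t *p, void *buf)
{
    for (int i = 0; i < TOY_POOL_BLOCKS; ++i) {
        if (p->blocks[i] == buf) {
            p->in_use[i] = 0;
            return;
        }
    }
}

/* Release all blocks. */
void toy_pool_finalize(toy_pool_t *p)
{
    for (int i = 0; i < TOY_POOL_BLOCKS; ++i)
        free(p->blocks[i]);
}
```

A routine using this pattern would call toy_pool_checkout() on entry, pack into the returned buffer, and call toy_pool_return() on every exit path.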
Change-Id: I76b3560f7efcc621a4455e834fce06f629c38f50
Even though the configure script checks for the availability of a correct
version of python, this information is not passed to the makefiles. As a
result, python scripts are invoked without an interpreter. This normally
works fine because the scripts rely on the path in their shebang lines;
however, it doesn't work if the command specified by the shebang is an
alias.
This also causes confusion: even though configure has found python, we
end up with a "python not found" error during the build.
This fix passes the detected python interpreter to the makefiles, which
solves both issues mentioned above.
Change-Id: Ic04da77601ff8ad2a461e9f2f936470109cda22c
Details:
- Fixed a subtle and complicated bug that only manifested via the BLAS
test drivers in the generic subconfiguration, and possibly any other
subconfiguration that did not register complex-domain gemm ukernels,
or registered ONLY real-domain ukernels as row-preferential. This is
a long story, but it boils down to an exception to the "transpose the
operation to bring storage of C into agreement with ukernel pref"
optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
proper functioning of the 1m method, but only when the imaginary
component of beta is zero. See the comments in issue #342 for more
details. Thanks to Dave Love for identifying the commit in which this
bug was introduced, and other feedback related to this bug.
For matrix sizes which are not multiples of 4, the trsm_small kernels access memory outside the allocated buffers, which causes a segmentation fault.
This is fixed by handling each of the corner cases separately.
Change-Id: I267e69ee095a8ca3e8ce2a3ada5f48bfefcc2219