Details:
1. Added aocl-dynamic for the dtrsm native path.
When (m,n)<512, better performance is observed with nthreads=4.
2. Updated the trsm_small threshold so that trsm_small is used when
(m+n)<320, where it does better than native irrespective of the
number of threads.
Change-Id: Ic2c50f14db257a05e323cc97c5d1c9b73b68f487
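The two thresholds above can be sketched roughly as follows. This is a minimal illustration rather than the BLIS source: both function names are hypothetical, and reading "(m,n)<512" as "both m and n are below 512" is an assumption.

```c
#include <stdbool.h>

/* Hypothetical helper: true when the trsm_small path is preferred.
   The commit message states that trsm_small does better than native
   whenever (m + n) < 320, irrespective of the number of threads. */
static bool sketch_use_trsm_small(long m, long n)
{
    return (m + n) < 320;
}

/* Hypothetical helper: thread count for the native dtrsm path under
   aocl-dynamic. Assumption: "(m,n) < 512" means both dimensions are
   below 512, and the cap never raises the caller's request. */
static long sketch_dtrsm_native_nt(long m, long n, long nt_requested)
{
    if (m < 512 && n < 512 && nt_requested > 4)
        return 4;
    return nt_requested;
}
```

For example, under these assumptions a 256x256 problem requested with 16 threads would run with 4, while a 1024x256 problem would keep all 16.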
Details:
- When AOCL dynamic is enabled, the decision to choose ST vs. MT
to solve SYRK is based on the dimensions of the matrices.
- Decisions to choose optimum number of threads will be updated in
the subsequent commits.
- Only local copy of rntm is modified by AOCL Dynamic feature.
global_rntm data structure remains unchanged in order to keep
track of original number of threads set by application.
- Added an early-exit condition in bli_nthreads_optimum when nt=1
or nt=-1. This ensures that the AOCL dynamic feature is not used when
threading is set using BLIS_IC_NT or BLIS_JC_NT.
Change-Id: I8bb0d123e006f82b321ba47fe230ab9039742ce0
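A minimal sketch of the early exit and the "local copy only" rule described above. The struct, the function names, and the size test are all hypothetical; the commit message does not state the actual SYRK thresholds.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the runtime object; the real rntm_t
   carries more fields than this. */
typedef struct { long num_threads; } sketch_rntm_t;

/* Placeholder size test (assumption): the real thresholds for SYRK
   are not given in the commit message. */
static bool sketch_syrk_is_small(long n, long k)
{
    return n < 64 && k < 64;
}

/* Returns an adjusted local copy; the caller's (global) object is
   only read, so the originally requested thread count is preserved. */
static sketch_rntm_t sketch_syrk_nthreads_optimum(const sketch_rntm_t *global,
                                                  long n, long k)
{
    sketch_rntm_t local = *global;

    /* Early exit: nt == 1 or nt == -1 leaves the settings untouched. */
    if (local.num_threads == 1 || local.num_threads == -1)
        return local;

    /* ST vs. MT decision based on the matrix dimensions. */
    if (sketch_syrk_is_small(n, k))
        local.num_threads = 1;

    return local;
}
```

Returning the copy by value keeps the sketch free of shared state, which mirrors the intent that global_rntm stays unchanged.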
Details:
- Added decision logic to choose between SUP and native implementations
of SYRK for zen2 architectures.
- For architectures other than zen2 it will be redirected to gemm
threshold function.
Change-Id: I350578cc4f930e85b9581e4d9aed220e71a2171d
Details:
- Introduced new feature called AOCL_DYNAMIC.
- When this macro is defined, the optimum number of threads to solve DGEMM
is estimated based on the dimensions (M,N,K).
- The range of the optimum number of threads will be [1, num_threads],
where "num_threads" is the number of threads set by the application.
- num_threads is derived from either the environment variable "OMP_NUM_THREADS"
or "BLIS_NUM_THREADS", or from the bli_set_num_threads() API.
- Only local copy of rntm is modified by AOCL_DYNAMIC feature.
global_rntm data structure remains unchanged in order to keep track of
original number of threads set by application.
- Optimum number of threads calculation is done only for SUP.
- Since 'native' code path handles larger problem sizes, we use max
number of threads recommended by the application.
AMD-Internal: [CPUPL-1376]
Change-Id: I665ce14543d6719857d70325c4a9f959c08e66e3
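The range rule above, [1, num_threads] for SUP and the full request for native, might look like the following sketch. The work-based estimate is a placeholder assumption (the commit does not describe the real heuristic), and the function names are hypothetical.

```c
/* Hypothetical sketch: pick a thread count for DGEMM.
   - Native path: use the full count requested by the application.
   - SUP path: estimate from (m, n, k), then clamp to [1, num_threads]. */
static long sketch_dgemm_nt(long m, long n, long k,
                            long num_threads, int is_sup)
{
    if (!is_sup)
        return num_threads;

    /* Placeholder heuristic (assumption): one thread per 64*64*64
       multiply-adds of work. */
    long long estimate = ((long long)m * n * k) / (64LL * 64 * 64);

    /* Clamp into the documented range [1, num_threads]. */
    if (estimate < 1)
        estimate = 1;
    if (estimate > num_threads)
        estimate = num_threads;

    return (long)estimate;
}
```

Whatever heuristic is used, the clamp guarantees the feature never runs more threads than the application asked for.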
Details:
- Added bli_syrksup function that internally uses gemmt implementation.
- Modified OAPI of syrk to call SUP before proceeding to the
conventional implementation.
- Copied gemmsup threshold function for syrk temporarily. Thresholds are
yet to be derived for syrk.
Change-Id: I751c6bd62bc76a3e4717f77c5cb33f19b759151d
Fixed merge conflicts that arose while downstreaming BLIS code from master to the milan-3.1 branch.
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (i.e., the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to fall back to the conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-Internal: [CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
(in terms of strerror_s) from bli_thread.h to bli_env.c. It was
likely left behind in bli_thread.h in a previous commit, when code
that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
find any other instance of strerror_r being used in BLIS, so I moved
the #define directly to bli_env.c rather than place it in bli_env.h.)
The code that uses strerror_r is currently disabled, though, so this
commit should have no effect on BLIS.
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.
AMD-Internal: [CPUPL-1422]
Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
1. Improved performance when zgemm's alpha and beta are real and equal to +/-1.
2. Changes made in bli_zgemmsup_rv_zen_asm_3x4n.
3. Changes made in bli_zgemmsup_rv_zen_asm_3x4m.
4. Changes made in bli_zgemm_haswell_asm_3x4.
Change-Id: Ic14d8507b264c24a8748febf6bc73eb60e476430
AMD-Internal: [CPUPL-1352]
Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
guts were factored out into "fast" and "slow" variants. Then added
logic to the "fast" variant that allows for more optimal thread
factorizations in some situations where there is at least one factor
of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
added comments to that file describing BLIS_THREAD_RATIO_? and
BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
macros not used in vanilla BLIS and removed the unused macro
BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
and bli_trsm_front.c. (These branches of small matrix handling have
not been reviewed by vanilla BLIS developers.)
- Added commented-out calls to printf() in bli_rntm.c.
- Whitespace changes to bli_thread.c.
Details:
- When requesting multithreaded parallelism by specifying the total
number of threads (whether it be via environment variable, globally at
runtime, or locally at runtime), reduce the number of threads actually
used by one if the original value (a) is prime and (b) exceeds a
minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
to 11 by default. If, when specifying the total number of threads (and
not the individual ways of parallelism for each loop), prime numbers
of threads are desired, this feature may be overridden by defining the
BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
corresponds to the configuration family targeted at configure-time.
(For now, there is no configure option(s) to control this feature.)
Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
bool that determines whether an integer is prime. This function is
implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
with unrelated minor edits.
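The prime-thread-count rule above can be sketched as follows. This is an illustration under stated assumptions: the primality test here is plain trial division, whereas the commit says bli_is_prime() is built from existing functions in bli_thread.c, and the names prefixed with sketch_ or SKETCH_ are hypothetical.

```c
#include <stdbool.h>

/* Mirrors the documented default of BLIS_NT_MAX_PRIME. */
#define SKETCH_NT_MAX_PRIME 11

/* Trial-division primality test (a swapped-in technique, not the
   BLIS implementation). */
static bool sketch_is_prime(long n)
{
    if (n < 2)
        return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return false;
    return true;
}

/* Reduce the thread count by one when it is prime and exceeds the
   threshold; otherwise return it unchanged. */
static long sketch_adjust_num_threads(long nt)
{
    if (nt > SKETCH_NT_MAX_PRIME && sketch_is_prime(nt))
        return nt - 1;
    return nt;
}
```

With the default threshold, a request for 13 threads becomes 12, which factors into a 2D thread grid, while 11 and smaller primes are left alone.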
Details:
- Added a configure option, --[enable|disable]-system, which determines
whether the modest operating system dependencies in BLIS are included.
The most notable example of this on Linux and BSD/OSX is the use of
POSIX threads to ensure thread safety for when application-level
threads call BLIS. When --disable-system is given, the bli_pthreads
implementation is dummied out entirely, allowing the calling code
within BLIS to remain unchanged. Why would anyone want to build BLIS
like this? The motivating example was submitted via #454 in which a
user wanted to build BLIS for a simulator such as gem5 where thread
safety may not be a concern (and where the operating system is largely
absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
the implementation of bli_clock() unconditionally returns 0.0 instead
of the time elapsed since some fixed point in the past. The reasoning
for this is that if the operating system is truly minimal, the system
function call upon which bli_clock() would normally be implemented
(e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
to remove redundancies.
- Removed old comments and commented #include of "bli_pthread_wrap.h"
from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
and BLISTypedAPI.md, with a note that both are non-functional when
BLIS is configured with --disable-system.
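The bli_clock() behavior under --disable-system can be pictured with a sketch like this one. The macro name SKETCH_DISABLE_SYSTEM is a hypothetical stand-in for whatever configure actually defines, and the enabled branch assumes a POSIX clock_gettime().

```c
#include <time.h>

/* Hypothetical macro (assumption): set to 1 to model a build
   configured with --disable-system. */
#define SKETCH_DISABLE_SYSTEM 1

static double sketch_clock(void)
{
#if SKETCH_DISABLE_SYSTEM
    /* No operating system services assumed: report a constant. */
    return 0.0;
#else
    /* Normal case: seconds elapsed since some fixed point in the past. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
#endif
}
```

Callers keep compiling unchanged in both modes, but any timing difference computed from two calls is always zero in the disabled mode.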
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertently being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
Windows systems.
- Various whitespace changes.
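The beta = 0 issue described in the list above comes from IEEE arithmetic: NaN times zero is still NaN, so multiplying uninitialized C by beta = 0 does not clear it. A minimal sketch of an update that handles beta = 0 explicitly (the function name and the flat-array layout are hypothetical, not the armv7a kernel):

```c
#include <stddef.h>

/* Sketch of the update C := beta * C + alpha * AB on a flat array.
   When beta is zero, C is overwritten instead of scaled, so stale
   NaN or Inf values in C cannot leak into the result. */
static void sketch_update_c(double *c, const double *ab,
                            double alpha, double beta, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (beta == 0.0)
            c[i] = alpha * ab[i];            /* ignore old contents of C */
        else
            c[i] = beta * c[i] + alpha * ab[i];
    }
}
```

A verification routine that passes beta = 0 with an uninitialized C relies on exactly this branch.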
Details:
- Implemented support for the user manually overriding the automatic
subconfiguration selection that happens at runtime. This override
can be requested by setting the BLIS_ARCH_TYPE environment variable.
The variable must be set to the arch_t id (as enumerated in
bli_type_defs.h) corresponding to the desired subconfiguration. If a
value outside this enumerated range is given, BLIS will abort with an
error message. If the value is in the valid range but corresponds to a
subconfiguration that was not activated at configure-time/compile-time,
BLIS will abort with a (different) error message. Thanks to decandia50
for suggesting this feature via issue #451.
- Defined a new function bli_gks_lookup_id to return the address of an
internal data structure within the gks. If this address is NULL, then
it indicates that the subconfig corresponding to the arch_t id passed
into the function was not compiled into BLIS. This function is used
in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
is returned for the latter of the two abort scenarios mentioned above,
along with a corresponding error message and a function to perform
the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
auto-detect.x executable during configure-time. This cpp branch is
similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
going forward. Also added a convenient debug switch that outputs the
compilation command for the auto-detect.x executable and exits.
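The two abort scenarios for the BLIS_ARCH_TYPE override can be sketched as a pair of checks. Everything below is a hypothetical miniature: the enum values, the lookup table, and the error codes are illustrative stand-ins, and the sketch returns an error code where BLIS would abort with a message.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical miniature of the arch_t enum; the real ids are
   enumerated in bli_type_defs.h. */
typedef enum { SK_ARCH_A = 0, SK_ARCH_B, SK_ARCH_C, SK_NUM_ARCHS } sketch_arch_t;

typedef enum {
    SK_OK = 0,
    SK_ERR_OUT_OF_RANGE,     /* id is not a valid arch id            */
    SK_ERR_NOT_COMPILED_IN   /* id is valid but subconfig is absent  */
} sketch_err_t;

/* Hypothetical lookup table: a NULL entry means the subconfiguration
   was not activated at configure time. */
static const char *const sketch_gks[SK_NUM_ARCHS] = { "a", NULL, "c" };

static sketch_err_t sketch_check_arch_id(long id)
{
    if (id < 0 || id >= SK_NUM_ARCHS)
        return SK_ERR_OUT_OF_RANGE;
    if (sketch_gks[id] == NULL)
        return SK_ERR_NOT_COMPILED_IN;
    return SK_OK;
}

/* Reads an integer override from the environment; returns -1 when
   the variable is unset (a simplification for this sketch). */
static long sketch_read_arch_override(const char *name)
{
    const char *s = getenv(name);
    if (s == NULL)
        return -1;
    return strtol(s, NULL, 10);
}
```

In use, the value read from the environment would be passed to the check before any automatic subconfiguration selection is skipped.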
- User can now specify the zen3 configuration;
currently it reuses block sizes and kernels from zen2.
- Auto configuration can detect and enable the zen3 config when needed.
- Added support for the amd64 bundle, which contains all zen platforms.
- Moved the existing amd bundle to amd64 legacy.
AMD-Internal: [CPUPL-500, CPUPL-1013]
Change-Id: I60b0b8abc6d2821c27ff0f5f6e032e889194b957
Details:
- This commit addresses the performance optimization (single-thread and
multi-thread) for DTRSM on zen2.
- This new optimization employs different MC, KC & NC values for TRSM than
those used in other Level-3 routines like DGEMM.
- Changed TRSM framework code to choose these block sizes for TRSM
on zen family configurations.
- Added a new field called "trsm_blkszs" to the cntx structure in order to
store TRSM-specific block sizes.
- Implemented routines to initialize, set and query the TRSM-specific
block sizes.
- Defined a new macro "AOCL_BLIS_ZEN" in the configure script.
This macro is automatically defined for zen family architectures.
It enables us to choose different cache block sizes for TRSM instead of the common level-3 block sizes.
Change-Id: Id8557b1c962a316b1edecca9cd582675eaf35fe6
Signed-off-by: Meghana Vankadari <meghana.vankadari@amd.com>
AMD-Internal: [CPUPL-656]
Details:
- Added export annotations to additional function prototypes in order to
accommodate the testsuite.
- Disabled calling bli_amaxv_check() from within the testsuite's
test_amaxv.c.
Details:
- After merging PR #303, at Isuru's request, I removed the use of
BLIS_EXPORT_BLIS from all function prototypes *except* those that we
potentially wish to be exported in shared/dynamic libraries. In other
words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
functions that can be considered private or for internal use only.
This is likely the last big modification along the path towards
implementing the functionality spelled out in issue #248. Thanks
again to Isuru Fernando for his initial efforts of sprinkling the
export macros throughout BLIS, which made removing them where
necessary relatively painless. Also, I'd like to thank Tony Kelman,
Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
participating in the initial discussion in issue #37 that was later
summarized and restated in issue #248.
- CREDITS file update.
* Revert "restore bli_extern_defs exporting for now"
This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8.
* Remove symbols not intended to be public
* No need of def file anymore
* Fix whitespace
* No need of configure option
* Remove export macro from definitions
* Remove blas export macro from definitions
Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue #420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.
Details:
- Fixed various typecasts in
frame/base/bli_cntx.h
frame/base/bli_mbool.h
frame/base/bli_rntm.h
frame/include/bli_misc_macro_defs.h
frame/include/bli_obj_macro_defs.h
frame/include/bli_param_macro_defs.h
that were missing or being done improperly/incompletely. For example,
many return values were being typecast as
(bool_t)x && y
rather than
(bool_t)(x && y)
Thankfully, none of these deficiencies had manifested as actual bugs
at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
This reflects the fact that bli_env_get_var() needs to be able to
return a signed integer, and even though dim_t is currently defined
as a signed integer, it does not intuitively appear to necessarily be
signed by inspection (i.e., an integer named "dim_t" for matrix
"dimension"). Also, updated use of bli_env_get_var() within
bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
and added comments to the bli_thrcomm_*.h files that will explain a
planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
'bool' for 'bool_t', which will eliminate the namespace conflict with
arm_sve.h as reported in issue #420. This commit implements the first
phase of that transition. Thanks to RuQing Xu for reporting this
issue.
- CREDITS file update.
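The typecast issue in the first bullet above is a precedence matter: a cast binds more tightly than any binary operator, so `(T)x op y` converts only `x`. With `&&` the two forms often yield the same value, which is consistent with the note that no actual bugs had surfaced. The sketch below uses `==` and a deliberately narrow type so the difference becomes visible; it is illustrative only and not taken from the BLIS headers.

```c
#include <stdint.h>

/* Cast applies to x alone: x is truncated to 8 bits, then compared. */
static int sketch_eq_cast_operand(int x, int y)
{
    return (uint8_t)x == y;
}

/* Cast applies to the whole expression: compare first, then convert
   the 0/1 result. */
static int sketch_eq_cast_result(int x, int y)
{
    return (uint8_t)(x == y);
}
```

For x = 256 and y = 0 the first form reports equality because 256 truncates to 0, while the second correctly reports inequality.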
Details:
- Updated all static function definitions to use the cpp macro
BLIS_INLINE instead of the static keyword. This allows blis.h to
use a different keyword (inline) to define these functions when
compiling with C++, which might otherwise trigger "defined but
not used" warning messages. Thanks to Giorgos Margaritis for
reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
hardware auto-detection facility, to unconditionally #define
BLIS_INLINE to the static keyword (since we know BLIS will be
compiled with C, not C++):
build/detect/config/config_detect.c
frame/base/bli_arch.c
frame/base/bli_cpuid.c
- CREDITS file update.
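The keyword selection described above can be sketched as a small preprocessor branch. The macro name and the two keywords come from the commit message; the exact guard and the helper function sketch_is_zero() are assumptions made for illustration.

```c
/* Choose the keyword used for header-defined helper functions:
   'inline' when the header is consumed by a C++ compiler, 'static'
   when it is compiled as C. */
#ifdef __cplusplus
  #define BLIS_INLINE inline
#else
  #define BLIS_INLINE static
#endif

/* Hypothetical helper defined through the macro. */
BLIS_INLINE int sketch_is_zero(int x)
{
    return x == 0;
}
```

Files used by configure's auto-detection can skip the branch and define the macro to `static` directly, since they are known to be compiled as C.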
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
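Factorizing threads "in units of micropanels", as the first bullet above puts it, can be illustrated with a small exhaustive search. This is not the algorithm in bli_thread_partition_2x2(); the function names and the balance criterion are assumptions chosen to keep the sketch short.

```c
/* Number of micropanels needed to cover 'dim' elements when each
   panel holds 'panel' elements (ceiling division). */
static long sketch_num_panels(long dim, long panel)
{
    return (dim + panel - 1) / panel;
}

/* Pick ic * jc == nt so that panels per thread are as balanced as
   possible across the two dimensions. Because ic * jc is constant,
   comparing |m_units/ic - n_units/jc| is equivalent to comparing
   |m_units*jc - n_units*ic|, which avoids division. */
static void sketch_partition_2x2(long nt, long m_units, long n_units,
                                 long *ic, long *jc)
{
    long best_ic = 1, best_jc = nt;
    long long best_cost = -1;

    for (long f = 1; f <= nt; ++f) {
        if (nt % f != 0)
            continue;                      /* f must divide nt */
        long g = nt / f;
        long long cost = (long long)m_units * g - (long long)n_units * f;
        if (cost < 0)
            cost = -cost;
        if (best_cost < 0 || cost < best_cost) {
            best_cost = cost;
            best_ic = f;
            best_jc = g;
        }
    }
    *ic = best_ic;
    *jc = best_jc;
}
```

As an example with illustrative register block sizes of 6 and 8, a 960x64 problem has 160 by 8 micropanels, and 8 threads are all assigned to the m dimension.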
Details:
- Fixed an error that manifests only when using C++ (specifically,
modern versions of g++) to compile drivers in 'test' (and likely most
other application code that #includes blis.h). Thanks to Ajay Panyala
for reporting this issue (#374).
Ported this library to Windows 10 using CMake scripts and Visual Studio 2019 with the clang compiler.
AMD-Internal: [CPUPL-657]
Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
Multiple trace levels allow the user to limit traces to a chosen
depth of nested calls. This also reduces file size
requirements.
Also optimized auto trace output to reduce file size by removing
thread IDs from individual lines.
AMD-Internal: [CPUPL-806]
Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd
Details:
- Fixed a missing argument (conjy) in the function signatures of
bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert
van de Geijn for reporting this omission.
Change-Id: Ifd1e01d5d7f943db4b1d67b467eb57e4a5c44165
Details:
- Removed a line of code in common.mk that passed LDFLAGS through the
sort function. The purpose was not to sort the contents, but rather
to remove duplicates. However, there is valid syntax in a string of
linker flags that, when sorted, yields different/broken behavior.
So I've removed the line in common.mk that sorts LDFLAGS. Also, for
future use, I've added a new function, rm-dupls, that removes
duplicates without sorting. (This function was based on code from a
stackoverflow thread that is linked to in the comments for that
code.) Thanks to Isuru Fernando for reporting this issue (#373).
Change-Id: Ie355cc111fd2c6669f0c3088e8fa5dc7c407a3b9
* Fix parsing in vpu_count on workstation SKX
* Document Skylake-X as Haswell for single FMA
* Update vpu_count for Skylake and Cascade Lake models
* Support printing the configuration selected, controlled by the environment
Intended particularly for diagnosing mis-selection of SKX through
unknown, or incorrect, number of VPUs.
* Move bli_log outside the cpp condition, and use it where intended
* Add Fixme comment (Skylake D)
* Mostly superficial edits to commits towards #351.
Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
relates to single-VPU Skylake-Xs.
* Fix comment typos
Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>
Details:
- Relocated the #include "cpuid.h" directive from bli_cpuid.h to
bli_cpuid.c. This was done because cpuid.h (which is pulled into
the post-build blis.h developer header) doesn't protect its
definitions with a preprocessor guard of the form:
#ifndef FOOBAR_H
#define FOOBAR_H
// header contents.
#endif
and as a result, applications (previously) could not #include both
blis.h and cpuid.h (since the former was already including the
latter). Thanks to Bhaskar Nallani for raising this issue via #393
and to Devin Matthews for suggesting this fix.
- CREDITS file update.
Details:
- Changed the behavior of bli_rntm_init() as well as the static
initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t
objects by default specify the disabling of packing for A and B.
Packing of A/B was already disabled by default when calling non-expert
APIs (and enabled only when the user set environment variables
BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of
using user-initialized rntm_t objects with expert APIs comes into line
with the default behavior of non-expert APIs--that is, they now both
lead to the avoidance of packing in the sup code path. (Note: The
conventional code path is unaffected by the environment variables
BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t
object when calling an expert API.) This addresses issue #392. Thanks
to Kiran Varaganti for bringing this inconsistency to our attention.
- The above change was accomplished by changing the definitions of
static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b()
in bli_rntm.h, which are both for internal use only.
Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
address is equal to either &BLIS_GEMM_SINGLE_THREADED or
&BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
bli_l3_sup_decor_single.c that (by default) disables code that
creates and frees the thrinfo_t tree and instead passes
&BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
sup implementation.
- The net effect of the above changes is that a small amount of
thrinfo_t overhead is avoided when running small/skinny dgemm
problems when BLIS is compiled with multithreading disabled.
Change-Id: Ia1066752849f1dfc0cd98f8ac0302e2f7b0f8bf0
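The early return described above amounts to an address comparison against shared sentinel objects. A minimal sketch with hypothetical types and names (the real thrinfo_t and its grow logic are more involved):

```c
#include <stddef.h>

/* Hypothetical, simplified thread-info node. */
typedef struct sketch_thrinfo {
    struct sketch_thrinfo *sub_node;
    int n_way;
} sketch_thrinfo_t;

/* Hypothetical global sentinels standing in for
   BLIS_GEMM_SINGLE_THREADED and BLIS_PACKM_SINGLE_THREADED. */
static sketch_thrinfo_t SKETCH_GEMM_SINGLE_THREADED  = { NULL, 1 };
static sketch_thrinfo_t SKETCH_PACKM_SINGLE_THREADED = { NULL, 1 };

/* Attaches 'child' below 't' unless 't' is one of the sentinels.
   Returns 1 if the tree was grown, 0 if the call returned early. */
static int sketch_thrinfo_grow(sketch_thrinfo_t *t, sketch_thrinfo_t *child)
{
    if (t == &SKETCH_GEMM_SINGLE_THREADED ||
        t == &SKETCH_PACKM_SINGLE_THREADED)
        return 0;   /* shared sentinel: never modify or extend it */

    if (t->sub_node == NULL)
        t->sub_node = child;
    return 1;
}
```

Because the sentinels are never modified, a single-threaded build can pass one of them straight into the sup implementation without creating or freeing a tree per call.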
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]
Change-Id: I9536648e7befac4d2dc17805e44ef34470961662