- Updated Makefile to include DTL files in library build
- Updated Makefile to include cpp header file installation
- Updated test/makefile to include extra API added by AMD team.
AMD-Internal: [CPUPL-1559]
Change-Id: I249c6935d5ff5fb645f9deec7e0218575484be13
Details:
1. Fixed reading leading dimenstions in test_gemm.c based on row/col major
2. Reduced redundent code and adjusted alignment
Change-Id: I8ca8c81223386fc21c6cc7c1d8f8a2109c9f5343
Merge conflicts araised has been fixed while downstreaming BLIS code from master to milan-3.1 branch
Implemented an automatic reduction in the number of threads when the user requests parallelism via a single number (ie: the automatic way) and (a) that number of threads is prime, and (b) that number exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which defaults to 11. If prime numbers are really desired, this feature may be suppressed by defining the macro BLIS_ENABLE_AUTO_PRIME_NUM_THREADS in the appropriate configuration family's bli_family_*.h. (Jeff Diamond)
Changed default value of BLIS_THREAD_RATIO_M from 2 to 1, which leads to slightly different automatic thread factorizations.
Enable the 1m method only if the real domain microkernel is not a reference kernel. BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations.
Relocated the general stride handling for gemmsup. This fixed an issue whereby gemm would fail to trigger to conventional code path for cases that use general stride even after gemmsup rejected the problem. (RuQing Xu)
Fixed an incorrect function signature (and prototype) of bli_?gemmt(). (RuQing Xu)
Redefined BLIS_NUM_ARCHS to be part of the arch_t enum, which means it will be updated automatically when defining future subconfigs.
Minor code consolidation in all level-3 _front() functions.
Reorganized Windows cpp branch of bli_pthreads.c.
Implemented bli_pthread_self() and _equals(), but left them commented out (via cpp guards) due to issues with getting the Windows versions working. Thankfully, these functions aren't yet needed by BLIS.
Allow disabling of trsm diagonal pre-inversion at compile time via --disable-trsm-preinversion.
Fixed obscure testsuite bug for the gemmt test module that relates to its dependency on gemv.
AMD-internal-[CPUPL-1523]
Change-Id: I0d1df018e2df96a23dc4383d01d98b324d5ac5cd
Details:
1. Added CBLAS API in test_gemm.c and test_trsm.c
2. Creating matrixes with memory aligned leading dimensions irrespective of dimension.
3. Shifting powers of 2 aligned memory to non-powers of 2 by adding extra cache line size at the end.
AMD-Internal: [CPUPL-1500] [CPUPL-1450]
Change-Id: I4180cec29b62d0388b974abee3e9b699cce3af6a
Details:
- Fixed an issue where the wrong string was being passed in for the
vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.
When BLIS is built with --complex-return=intel, the zdotu_ and cdotu_ function prototype changes.
Now "return parameter" will become the first argument of these functions and these functions return void.
This fix addresses the change in the function declarations when --complex-return=intel is enabled.
Missing Trace and Log statements for this configuration are now added.
[CPUPL-1376]
Change-Id: Ib420989da71839211c16088bf431a2ad775a3978
Details:
- Modified the octave scripts in test/3 so that the script does not
choke when one or more of the expected OpenBLAS, Eigen, or vendor data
files is missing. (The BLIS data set, however, must be complete.) When
a file is missing, that data series is simply not included on that
particular graph. Also factored out a lot of the redundant logic from
plot_panel_4x5.m into a separate function in read_data.m.
1. CMake script changes for build with Clang compiler.
2. CMake script changes for build test and testsuite based on the lib type ST/MT
3. CMake script changes for testcpp and blastest
4. Added python scripts to support library build and testsuite build.
AMD Internal : [CPUPL-1422]
Change-Id: Ie34c3e60e9f8fbf7ea69b47fd1b50ee90099c898
Added traces in cblas layer for these API's.
These test drivers didn't have calls for complex data
types, the drivers are updated to support them.
AMD-Internal : [CPUPL-1315]
Change-Id: Ia52ecca68ea17314315d626b57c46a2f5973985b
Fixed test driver code for her, her2
Support added to handle complex and double complex data type in test driver.
Change-Id: If65939e99d8cf77e0fb70561166d84bf67d0321d
AMD-Internal: [CPUPL-1326]
Fixed test driver code for her, her2, herk and her2k function.
Above functions supports only complex and double complex data type, test code is updated accordingly.
Change-Id: Iee7b79abda4a2959a265c420d23879bf47f2c38d
AMD-Internal: [CPUPL-1313]
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatiblity layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH is set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling in for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertantly being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on
Windows systems.
- Various whitespace changes.
Details:
- Added SIMD code
- Processing 5 rows at a time in SIMD loop to improve performance
AMD-Internal: [CPUPL-1054]
Change-Id: I2ac93f25895dccfc42e14be0689e6d4e655d6a0a
When library is built as single thread and trace is enabled, the test
applications in test folder fail to compile. In the file aoclos.c the function
AOCL_gettid() uses "omp_get_thread_num() to get thread_id, which is only
enabled when OpenMP based parallel BLIS library is generated. To fix this in
single thread case we now return zero for thread id, openmp function is used
only when BLIS_ENABLE_OPENMP macro is defined. However this is not a complete
fix. If library is built with pthread, AOCL_gettid() always return 0, which is
not the intended behaviour.
Change-Id: I5b79ed57d27d0022d3dcab0e2a3a557c8e4ff8ee
Details:
- Amin api returns index of minimum absolute value in a vector.
- Added amin reference blis kernel.
- Added blas and cblas interface for amin.
AMD-Internal: [CPUPL-1155]
Change-Id: I89c1e37e86950a4582bba70a5d8fc70ac915bd3c
Details
- Added Framework optimizations for BLAS and CBLAS interfaces for caxpyv_(cblas_caxpyv) and zaxpyv_ (cblas_zaxpyv).
- Added new axpyv AVX2 kernels for c and z data types for AMD EPYC family.
AMD-Internal: [CPUPL-1231]
Change-Id: I9bc0c21fef9da84533adcef76427977430b27ea7
Details:
- Added single-threaded and multithreaded sup performance results to
docs/PerformanceSmall.md for both sgemm and dgemm. These results were
gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
microarchitecture. Special thanks to Jeff Diamond for facilitating
access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
I'm done tweaking those this time. Really.
- Added missing line regarding eigen header installation for each
microarchitecture section.
Details:
- Registered full suite of sgemm and dgemm sup millikernels, blocksizes,
and crossover thresholds in bli_cntx_init_zen2.c.
- Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742
system.
Fixed AOCL DTL logs printing incorrect alpha and beta values for single
precision. Added missing info like data-types, lower or upper traingular and
Side parameter in the case of TRSM.
Code cleanup and formatting the files test_cabs1.c, test_axpbyv.c and
test_gemm.c. In dumping trsm parameters replaced 'side' with bli_is_right(side).
Change-Id: Ic81503ae696956eb074ec208f7109d1a394183d7
Details:
- Renamed test/3/matlab to test/3/octave.
- Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m
files for use with octave (which is free and doesn't crash on me
mid-way through my use of subplot).
- Updated runthese.m scratchpad for zen2 invocations.
- Added Nikolay S.'s subplot_tight() function, along with its license.
Details:
- added cblas extension cblas_?cabs1.
- Functionality : res=|Re(z)|+|Im(z)|, z is a complex number, and res is a value containing the absolute value of a complex number z.
AMD-Internal: [CPUPL-1129]
Change-Id: I4a3c265c89527c8fd3060c5d2ed38b1953ce6343
Details:
- Corrected "#if" directive in line 89
- Commented out "#define print" to disable printing the vectors
Change-Id: I9ec3cbfb716540dd3e2264f5c3925d9e0c0c294a
Modified Makefile in test folder to enable calling BLAS interfaces for BLIS as
well. This is possible by replacing -DBLIS with -DBLAS=\"aocl\" in the
makefile. Also added linking to multi-threaded MKL library.
Change-Id: Iccf2ec99b48bb35da985b69218bc680f678ff7c9
Details:
- The axpby routines perform a vector-vector operation defined as
y = a*x + b*y where a, b are scalars and x, y are vectors.
- This API is part of BLAS-like extension APIs
Change-Id: I17a53b03bba97de7ae1995a9f086084bd241bcdc
AMD-Internal: [CPUPL-1118]
Details:
- Created a set of single-precision real millikernels and microkernels
comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
file in test/sup. This included edits that allow for separate "small"
dimensions for single- and double-precision as well as for single-
vs. multithreaded execution.
Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.
Change-Id: Ia230fc65185fd9a34eec714721004aa9e0bd40ed
Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
AMD-Internal: [CPUPL-713]
Change-Id: I9536648e7befac4d2dc17805e44ef34470961662
Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
containing the mpnpkp results do not need to be preprocessed in order
to plot half the problem size range (ie: up to 400 instead of the
800 range of the other shape cases).
- Trivial updates to runme.m.
Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
docs/Performance.md.
Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
publishes new performance graphs for Kaby Lake and Epyc showcasing
the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
document.
Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimension are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
Details:
- Adjusted test/3/Makefile so that the test drivers are linked against
Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do
this since Eigen's headers don't define implementations to the
standard BLAS APIs.
- Simplified #included headers in hemm, herk, trmm, and trsm source
driver files, since nothing specific to Eigen is needed at
compile-time for those operations.
Details:
- Use compile-time implementations of Eigen in test_gemm.c via new
EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS
library is not necessary.) However, as of Eigen 3.3.7, Eigen only
parallelizes the gemm operation and not hemm, herk, trmm, trsm, or
any other level-3 operation.
- Fixed a bug in trmm and trsm drivers whereby the wrong function
(bli_does_trans()) was being called to determine whether the object
for matrix A should be created for a left- or right-side case. This
was corrected by changing the function to bli_is_left(), as is done
in the hemm driver.
- Added support for running Eigen test drivers from runme.sh.
Details:
- Minor updates to matlab graph-generating scripts.
- Added a plot_all.m script that is more of a scratchpad for copying and
pasting function invocations into matlab to generate plots that are
presently of interest to us.