Details:
- Renamed herk macrokernels and supporting files and functions to gemmt,
which is possible since at the macrokernel level they are identical.
Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
functions rather than cpp macros that instantiate multiple functions.
Thanks to Devin Matthews for his efforts on this issue (#531).
- Check that the maximum stack buffer size is sufficiently large
relative to the register blocksizes for each datatype, and do so when
the context is initialized rather than when an operation is called.
Note that with this change, users who pass in their own contexts into
the expert interfaces currently will *not* have any checks performed.
Thanks to Devin Matthews for suggesting this change.
Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
precision real and double-precision real ukernels had opposing I/O
preferences (row-preferential sgemm ukernel + column-preferential
dgemm ukernel, or vice versa). The fix involved adjusting the API
to bli_cntx_set_ind_blkszs() so that the induced method context init
function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
function for only one datatype at a time. This allowed the blocksize
scaling (which varies depending on whether we're doing 1m_r or 1m_c)
to happen on a per-datatype basis. This fixes issue #557. Thanks to
Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
called from each level-3 _front() function. The pack_t schemas in the
cntx_t were also removed entirely, along with the associated accessor
functions. This in turn required updating the trsm1m-related virtual
ukernels to read the pack schema for B from the auxinfo_t struct
rather than the context. This also required slight tweaks to
bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
the microkernel IO preference. This mostly only affects gemm. Thanks
to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
in bli_gks.c. The context initialization functions for induced methods
were previously passed a dt argument, but I can no longer figure out
*why* they were passed this value. To reduce confusion, I've removed
the dt argument (including also from the function defintion +
prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
breaks high-leve implementations of 3m and 4m, but this is okay since
those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.
- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
it's not called by any upper routine.
Ref cannot handle panel strides (packed cases) thus cannot be called
from the beginning of `gemmsup` (i.e. cannot be dispatch target of
gemmsup to other sizes.)