Files
blis/ref_kernels
Field G. Van Zee 0651b466c2 Bugfixes, cleanup of sup dgemm ukernels.
Details:
- Fixed a few not-really-bugs:
  - Previously, the d6x8m kernels were still prefetching the next upanel
    of A using MR*rs_a instead of ps_a (same for prefetching of next
    upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
    that the upanels might be packed, using ps_a or ps_b is the correct
    way to compute the prefetch address.
  - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
    executed as intended even though it was based on a faulty pointer
    management. Basically, in the rd_d6x8m kernel, the pointer for B
    (stored in rdx) was loaded only once, outside of the jj loop, and in
    the second iteration its new position was calculated by incrementing
    rdx by the *absolute* offset (four columns), which happened to be the
    same as the relative offset (also four columns) that was needed. It
    worked only because that loop only executed twice. A similar issue
    was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
  - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
    that it is loaded only once outside of the loops rather than
    multiple times inside the loops.
  - Changed outer loop in rd kernels so that the jump/comparison and
    loop bounds more closely mimic what you'd see in higher-level source
    code. That is, something like:
      for( i = 0; i < 6; i+=3 )
    rather than something like:
      for( i = 0; i <= 3; i+=3 )
  - Switched row-based IO to use byte offsets instead of byte column
    strides (e.g. via rsi register), which were known to be 8 anyway
    since otherwise that conditional branch wouldn't have executed.
  - Cleaned up and homogenized prefetching a bit.
  - Updated the comments that show the before and after of the
    in-register transpositions.
  - Added comments to column-based IO cases to indicate which columns
    are being accessed/updated.
  - Added rbp register to clobber lists.
  - Removed some dead (commented out) code.
  - Fixed some copy-paste typos in comments in the rv_6x8n kernels.
  - Cleaned up whitespace (including leading ws -> tabs).
  - Moved edge case (non-milli) kernels to their own directory, d6x8,
    and split them into separate files based on the "NR" value of the
    kernels (Mx8, Mx4, Mx2, etc.).
  - Moved config-specific reference Mx1 kernels into their own file
    (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
  - Added rd_dMx1 assembly kernels, which seems marginally faster than
    the corresponding reference kernels.
  - Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
    the row-oriented reference kernels for all storage combos.
2020-08-03 11:22:32 +05:30
..
2020-06-16 18:29:00 +05:30
2020-06-16 18:29:00 +05:30
2020-06-16 18:29:00 +05:30
2020-06-16 18:29:00 +05:30
2020-06-16 18:29:00 +05:30