Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresonding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision complex domain (c and z) and housed them in the 'haswell'
kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), upon which these complex kernels are
partially based.
Details:
- Implemented assembly-based packm kernels for single- and double-
precision real domain (s and d) and housed them in the 'haswell'
kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), which I have now tweaked and used to
create comparable single-precision real kernels (s6xk and s16xk).