mirror of
https://github.com/amd/blis.git
synced 2026-04-20 07:38:53 +00:00
Details:
- In s8 APIs with symmetric quantization, existing kernels are reused to avoid duplicating the reorder code.
- Because the existing kernels assume that the entire KCxNC block is packed at once, handling grouping in symmetric quantization requires adding the JR loop and the group loop outside the call to the existing packB function.
- Although this was already being done, cases where n_rem < 64 were not handled properly.
- The reorder and pack code now first divides the n_fringe part into a multiple-of-16 part and an n_lt_16 part, then calls the pack kernel twice to handle the two parts separately.
- All strides used to access the reordered/packed buffer are updated accordingly.
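
The loop structure described above can be sketched roughly as follows. This is only an illustration, not the actual BLIS code: `pack_stub`, `pack_b_grouped`, `pack_call_t` and `pack_log_t` are hypothetical names, NR = 64 is inferred from the "n_rem < 64" remark, and the nesting order (group loop outside, JR loop inside) is an assumption. The stub only records which sub-blocks would be packed; the real kernels and the stride updates are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed JR step (NR = 64) and a small cap on recorded calls. */
enum { NR = 64, MAX_CALLS = 64 };

/* One recorded call: K offset/extent and N offset/extent of the sub-block. */
typedef struct { size_t k0, kc, n0, nc; } pack_call_t;

typedef struct {
    pack_call_t calls[MAX_CALLS];
    size_t count;
} pack_log_t;

/* Hypothetical stand-in for the existing packB kernel: it only logs
 * the sub-block it was asked to pack. */
static void pack_stub(pack_log_t *log, size_t k0, size_t kc,
                      size_t n0, size_t nc)
{
    assert(log->count < MAX_CALLS);
    log->calls[log->count++] = (pack_call_t){ k0, kc, n0, nc };
}

/* Group loop and JR loop hoisted outside the pack call. The n fringe
 * (n_rem < NR) is split into a multiple-of-16 part and an n_lt_16 part,
 * and the pack routine is invoked once for each non-empty part. */
static void pack_b_grouped(pack_log_t *log, size_t k, size_t n,
                           size_t group_size)
{
    assert(group_size > 0);
    for (size_t k0 = 0; k0 < k; k0 += group_size) {
        size_t kc = (k - k0 < group_size) ? (k - k0) : group_size;

        /* Full NR-wide panels. */
        size_t jr = 0;
        for (; jr + NR <= n; jr += NR)
            pack_stub(log, k0, kc, jr, NR);

        /* Fringe handling: n_rem is always < NR here. */
        size_t n_rem = n - jr;
        size_t n_lt_16 = n_rem % 16;
        size_t n_mult16 = n_rem - n_lt_16;

        if (n_mult16 > 0)
            pack_stub(log, k0, kc, jr, n_mult16);
        if (n_lt_16 > 0)
            pack_stub(log, k0, kc, jr + n_mult16, n_lt_16);
    }
}
```

For example, with k = 256, n = 108 and a group size of 128, each of the two groups would see one full 64-column panel, then a 32-column call (the multiple-of-16 part of the 44-column fringe), then a 12-column call (the n_lt_16 part).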