Description:
1. Written 6x64 main and other fringe kernels for WoQ where scaling s4
weights into bf16 performed in the kernel itself to reduce bandwidth.
2. These kernels are performing better compared to bf16 weights when m
is small and n is large.
3. Established a threshold to do quantization support at packing of
B (KCXNC) level or WoQ kernel level.
Change-Id: I4f8265b8b58c276ff2590cc948d1f920aa0bb289