* Fix bug in iqk_mul_mat
I recently added support for a matrix multiplication kernel
that processes 16 columns of the right matrix per iteration.
This introduced a bug that shows up when the batch size is greater
than 16, is not a multiple of 16, and the remainder is not a multiple
of the maximum number of columns processed by the regular kernels
(which is why it never showed up in my testing with TG-128 and PP-512).
This commit fixes the issue.
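To illustrate the failing case, here is a minimal sketch of how the columns of the right matrix can be split between a 16-column kernel and narrower kernels so that every column is covered exactly once. It is not the actual iqk_mul_mat code: the function `plan_column_tiles` and the constants `kWideTile` and `kMaxRegular` (set to 8 here) are hypothetical names and values chosen for illustration.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical tile widths, for illustration only.
constexpr int kWideTile   = 16;  // columns per iteration of the wide kernel
constexpr int kMaxRegular = 8;   // assumed maximum columns of the regular kernels

// Returns (first column, column count) tiles that cover [0, n_cols) exactly once.
std::vector<std::pair<int, int>> plan_column_tiles(int n_cols) {
    std::vector<std::pair<int, int>> tiles;
    int col = 0;
    // Full 16-column tiles go to the wide kernel.
    for (; col + kWideTile <= n_cols; col += kWideTile) {
        tiles.emplace_back(col, kWideTile);
    }
    // The remainder is split into chunks of at most kMaxRegular columns.
    // The last chunk may be narrower and must not be skipped.
    while (col < n_cols) {
        const int width = std::min(kMaxRegular, n_cols - col);
        tiles.emplace_back(col, width);
        col += width;
    }
    return tiles;
}
```

With these assumed widths, a batch of 29 columns is split into tiles of 16, 8, and 5 columns. The final 5-column tile is the kind of leftover that batch sizes of 1 or 512 never produce.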
* Make sure rows per thread is a multiple of 4 for MoE as well when using _r4 quants
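A rough sketch of the rounding involved, assuming the total row count is itself a multiple of 4. The helpers `rows_per_thread` and `thread_row_range` and the constant `kRowGroup` are hypothetical names, not taken from the actual code.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical constant: rows must be handed out in groups of 4.
constexpr int kRowGroup = 4;

// Rows per thread, rounded up to a multiple of kRowGroup.
int rows_per_thread(int n_rows, int n_threads) {
    const int per_thread = (n_rows + n_threads - 1) / n_threads;  // ceiling division
    return kRowGroup * ((per_thread + kRowGroup - 1) / kRowGroup);
}

// Returns (first row, row count) for thread `ith`; trailing threads may get 0 rows.
// Assumes n_rows is a multiple of kRowGroup, so every count is one as well.
std::pair<int, int> thread_row_range(int n_rows, int n_threads, int ith) {
    const int chunk = rows_per_thread(n_rows, n_threads);
    const int first = std::min(ith * chunk, n_rows);
    const int last  = std::min(first + chunk, n_rows);
    return {first, last - first};
}
```

Rounding up rather than down means the last threads may receive fewer rows or none, but no thread ever starts in the middle of a group of 4 rows.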
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>