Currently, when m is small compared to n, the number of threads
assigned to the m dimension (ic ways) is 1, even if the number of MR
blocks (m / MR) is greater than 1 and the total number of work blocks
(MR blocks * NR blocks) is less than the number of available threads.
This results in subpar performance in bandwidth-bound cases. To address
this, the thread factorization is updated to increase ic ways for these
cases.
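
A minimal sketch of the condition described above, for illustration
only. The function name pick_ways, its parameters, and the rule of
choosing the largest divisor of the thread count that does not exceed
the MR block count are all assumptions; the actual heuristic in this
change may differ.

```c
#include <assert.h>

/* Hypothetical helper: ceiling division for positive integers. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/*
 * Hypothetical sketch of a small-m thread factorization.
 * m, n   : matrix dimensions
 * mr, nr : micro-kernel block sizes (MR, NR)
 * nt     : available threads
 * Outputs the ways of parallelism for the m (ic) and n (jc) dimensions.
 */
static void pick_ways(long m, long n, long mr, long nr, long nt,
                      long *ic_ways, long *jc_ways)
{
    assert(m > 0 && n > 0 && mr > 0 && nr > 0 && nt > 0);

    long mr_blks = ceil_div(m, mr);
    long nr_blks = ceil_div(n, nr);

    /* Baseline for small m: every thread goes to the n dimension. */
    *ic_ways = 1;
    *jc_ways = nt;

    /* More than one MR block, but fewer work blocks than threads. */
    if (mr_blks > 1 && mr_blks * nr_blks < nt) {
        /* Assumed rule: largest divisor of nt not exceeding mr_blks. */
        long ic = (mr_blks < nt) ? mr_blks : nt;
        while (ic > 1 && nt % ic != 0) {
            ic--;
        }
        *ic_ways = ic;
        *jc_ways = nt / ic;
    }
}
```

With m = 12, n = 64, MR = 6, NR = 16 and 16 threads, there are 2 MR
blocks and 4 NR blocks (8 work blocks), so this sketch picks ic ways = 2
and jc ways = 8 instead of 1 and 16.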
AMD-Internal: [SWLCSG-3333]
Change-Id: Ife3eafc282a2b62eb212af615edb7afa40d09ae9