-Currently, the scale factor is loaded without a mask in the downscale
and matrix add/mul ops of the F32 eltwise kernels. This results in
out-of-bounds memory reads when n is not a multiple of NR (64).
-These loads are now masked loads, so elements beyond n are not read.
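
For illustration only, a minimal scalar sketch of the masking idea,
assuming the NR = 64 tile is covered by four 16-lane F32 registers.
The names lane_mask and masked_load_f32 are hypothetical and are not
taken from the kernels; the actual change uses masked vector loads
rather than this scalar emulation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR    64  /* columns per tile, as stated in this message */
#define LANES 16  /* assumed F32 lanes per 512-bit register */

/* Hypothetical helper: bit i of the result is set when lane i of
 * register `reg` (0..3) maps to a valid column, given that only
 * n_left (0..NR) columns remain in the tile. */
static uint16_t lane_mask(size_t n_left, size_t reg)
{
    assert(reg < NR / LANES);
    size_t start = reg * LANES;
    if (n_left <= start)
        return 0;                       /* register lies fully past n */
    size_t valid = n_left - start;
    if (valid >= LANES)
        return 0xFFFF;                  /* register lies fully inside n */
    return (uint16_t)((1u << valid) - 1u);
}

/* Scalar emulation of a zero-masking load: a lane whose mask bit is
 * clear is never read from src and is set to 0.0f in dst. */
static void masked_load_f32(float dst[LANES], const float *src,
                            uint16_t mask)
{
    for (size_t i = 0; i < LANES; ++i)
        dst[i] = ((mask >> i) & 1u) ? src[i] : 0.0f;
}
```

With n_left = 20, register 0 gets a full mask, register 1 reads only
its first four lanes, and registers 2 and 3 read nothing.
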
AMD-Internal: [SWLCSG-3390]
Change-Id: Ib2fc555555861800c591344dc28ac0e3f63fd7cb