* Use vectorized stores for dropout randvals
When kPadSeqLenK is not set, the kernel uses 2 buffer_store_dwordx2
instructions instead of 16 buffer_store_byte instructions. This requires
fewer registers and reduces spilling.
* Calculate dropout randvals for storing and applying only once
Although this may add a small overhead when storing is not required,
it uses significantly fewer registers and hence avoids spilling.
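
For illustration only, the two changes above can be sketched on the host
side in plain C++. This is not the kernel code: make_randvals,
store_bytewise, store_vectorized and apply_dropout are hypothetical names,
the LCG is a stand-in for the real random number generator, and the
"keep if randval <= threshold" rule is an assumption. The sketch shows 16
random bytes being generated once, written with two 8-byte stores instead
of 16 one-byte stores, and then reused to apply dropout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using RandVals = std::array<std::uint8_t, 16>;

// Hypothetical stand-in for the per-thread generator: a simple LCG that
// yields 16 random bytes. The real kernel uses a different generator.
inline RandVals make_randvals(std::uint32_t seed) {
    RandVals r{};
    std::uint32_t s = seed;
    for (auto& b : r) {
        s = s * 1664525u + 1013904223u;
        b = static_cast<std::uint8_t>(s >> 24);
    }
    return r;
}

// Byte-wise variant: 16 separate one-byte writes.
inline void store_bytewise(const RandVals& r, std::uint8_t* dst) {
    for (std::size_t i = 0; i < r.size(); ++i) {
        dst[i] = r[i];
    }
}

// Vectorized variant: the same 16 bytes written as two 8-byte words.
inline void store_vectorized(const RandVals& r, std::uint8_t* dst) {
    std::uint64_t words[2];
    std::memcpy(words, r.data(), sizeof(words));
    std::memcpy(dst, &words[0], sizeof(std::uint64_t));
    std::memcpy(dst + sizeof(std::uint64_t), &words[1], sizeof(std::uint64_t));
}

// Applies dropout with the randvals that were already generated, so they
// are computed once and shared between storing and applying.
// Assumption: an element is kept when its randval <= keep_threshold.
inline void apply_dropout(const RandVals& r, std::uint8_t keep_threshold,
                          float scale, float* p) {
    assert(p != nullptr);
    for (std::size_t i = 0; i < r.size(); ++i) {
        p[i] = (r[i] <= keep_threshold) ? p[i] * scale : 0.0f;
    }
}
```

Both store variants produce identical memory contents; only the number of
store operations differs.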