In this way we can avoid the Q, K, V copies being made after multiplication with the QKV tensor in, e.g., Phi-3.5-mini. This results in a 6-7% speedup of PP-512(Phi-3.5-mini) on CUDA (RTX-4080)