mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-30 19:57:40 +00:00
* root cause: fmha_bwd does not support hdim > 256 because its LDS usage exceeds the hardware limit.
* solution:
  1. split the dqdkdv kernel into 2 kernels:
     1) QGrad
     2) KGrad & VGrad
  2. reuse LDS memory:
     1) K and K^T use the same LDS memory in the dq kernel
     2) OGrad and OGrad^T use the same LDS memory in the dq kernel
  3. to avoid or reduce VGPR spills, the calculation order has been readjusted and prefetch has been disabled.
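
The effect of the LDS reuse in the dq kernel can be illustrated with a rough budget calculation. This is only a sketch and is not taken from the kernel source. The 64 KiB LDS limit, the 32-row tile, the 2-byte element size, and hdim = 384 are assumptions chosen for illustration. The model counts only the K and OGrad tiles and ignores every other buffer the real kernel keeps in LDS.

```python
# Hypothetical LDS budget model; all sizes below are illustrative assumptions,
# not values read from the composable_kernel source.
LDS_LIMIT_BYTES = 64 * 1024  # assumed LDS capacity available to one workgroup


def tile_bytes(rows, cols, elem_bytes=2):
    """Bytes needed to hold a rows x cols tile (2-byte elements assumed)."""
    return rows * cols * elem_bytes


def lds_usage(tile_rows, hdim, reuse):
    """LDS bytes for the K and OGrad tiles of a simplified dq kernel.

    reuse=False: K, K^T, OGrad and OGrad^T each get their own buffer.
    reuse=True:  each transposed operand is read from the buffer of its
                 non-transposed counterpart, so only two buffers are needed.
    """
    k = tile_bytes(tile_rows, hdim)
    ograd = tile_bytes(tile_rows, hdim)
    if reuse:
        return k + ograd
    return 2 * k + 2 * ograd


def fits(num_bytes):
    """True if the given usage stays within the assumed LDS limit."""
    return num_bytes <= LDS_LIMIT_BYTES


if __name__ == "__main__":
    for reuse in (False, True):
        used = lds_usage(tile_rows=32, hdim=384, reuse=reuse)
        print(f"reuse={reuse}: {used} bytes, fits={fits(used)}")
```

Under these assumptions the four separate buffers need 98304 bytes and exceed the limit, while the shared layout needs 49152 bytes and fits.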