mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-30 03:37:38 +00:00
This should allow decoupling the MFMA from the FMA scaling by using two c_thread_buf_per_scale buffers, together with look-ahead fetching of the a/b thread buffers. Performance is still about the same as without double buffering.
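
As a rough host-side illustration of the idea (not the kernel code from this change), the sketch below models the two stages on scalars: `mfma_block` is a hypothetical stand-in for the MFMA accumulation of one K-block, and the `c += scale * c_per_scale` line stands in for the FMA scaling. The function names, the flat data layout and the one-scale-per-K-block assumption are all made up for the example, and the look-ahead fetch of the a/b thread buffers is not modeled. With two per-scale buffers, the accumulation for block k+1 no longer has to wait for block k to be scaled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the MFMA stage: the unscaled partial dot product
// of one K-block. The real kernel accumulates tiles with MFMA instructions.
inline float mfma_block(const std::vector<float>& a, const std::vector<float>& b,
                        std::size_t block, std::size_t block_size)
{
    float partial = 0.0f;
    for (std::size_t i = 0; i < block_size; ++i)
        partial += a[block * block_size + i] * b[block * block_size + i];
    return partial;
}

// Single buffer: each block's scaling must finish before the next block's
// accumulation can reuse c_per_scale.
inline float scaled_dot_single(const std::vector<float>& a, const std::vector<float>& b,
                               const std::vector<float>& scale, std::size_t block_size)
{
    assert(a.size() == b.size() && a.size() == scale.size() * block_size);
    float c = 0.0f;
    for (std::size_t k = 0; k < scale.size(); ++k)
    {
        const float c_per_scale = mfma_block(a, b, k, block_size);
        c += scale[k] * c_per_scale; // FMA scaling
    }
    return c;
}

// Double buffer: block k+1 accumulates into one buffer while block k is
// scaled out of the other, so the two stages have no dependency per iteration.
inline float scaled_dot_double(const std::vector<float>& a, const std::vector<float>& b,
                               const std::vector<float>& scale, std::size_t block_size)
{
    assert(a.size() == b.size() && a.size() == scale.size() * block_size);
    const std::size_t num_blocks = scale.size();
    if (num_blocks == 0)
        return 0.0f;

    std::array<float, 2> c_per_scale{};
    float c = 0.0f;

    c_per_scale[0] = mfma_block(a, b, 0, block_size); // prologue: first block
    for (std::size_t k = 0; k + 1 < num_blocks; ++k)
    {
        c_per_scale[(k + 1) % 2] = mfma_block(a, b, k + 1, block_size); // next block
        c += scale[k] * c_per_scale[k % 2];                             // current block
    }
    // Epilogue: scale the last block.
    c += scale[num_blocks - 1] * c_per_scale[(num_blocks - 1) % 2];
    return c;
}
```

Both functions perform the same arithmetic in the same order, so they return the same value; only the dependency structure between the two stages differs. On real hardware, whether that extra freedom improves performance depends on how the compiler schedules the instructions.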