mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-01 20:21:23 +00:00
* Observed a 2x perf improvement with kBlockSize = 256
* Using 512 threads may lead to redundant computations