mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-14 02:02:46 +00:00
* use another instance to check the efficiency
* optimize group layer norm
* 1. coalesce load/store data for gridwise layer norm Welford. 2. move a sqrt and division into an outer static loop
* add more instances to layernorm
* add 2 more test cases
* remove ignore in generating tuple of vector
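
The sqrt/division change above can be illustrated with a minimal host-side sketch. This is not the gridwise kernel from this commit. `WelfordState` and `layernorm_row` are hypothetical names used only for illustration, and the sketch assumes a single row normalized without gamma/beta. It shows Welford's running mean/variance update, and the reciprocal standard deviation being computed once outside the per-element loop, so that the inner loop only subtracts and multiplies.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a composable_kernel type): running mean and
// sum of squared deviations, updated one element at a time (Welford).
struct WelfordState
{
    float mean        = 0.f;
    float m2          = 0.f;
    std::size_t count = 0;

    void update(float x)
    {
        ++count;
        const float delta = x - mean;
        mean += delta / static_cast<float>(count);
        m2 += delta * (x - mean);
    }

    // Population variance, as used by layer norm.
    float variance() const
    {
        return count ? m2 / static_cast<float>(count) : 0.f;
    }
};

// Hypothetical function: normalizes one row, no gamma/beta.
std::vector<float> layernorm_row(const std::vector<float>& x, float eps)
{
    WelfordState w;
    for(float v : x)
        w.update(v);

    // sqrt and division are done once here, outside the element loop.
    const float inv_std = 1.0f / std::sqrt(w.variance() + eps);

    std::vector<float> y(x.size());
    for(std::size_t i = 0; i < x.size(); ++i)
        y[i] = (x[i] - w.mean) * inv_std; // inner loop: subtract and multiply only
    return y;
}
```

Calling `layernorm_row({1.f, 2.f, 3.f, 4.f}, 1e-5f)` returns a row with mean close to zero and variance close to one.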
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 40942b9098]