mirror of
https://github.com/amd/blis.git
synced 2026-04-20 07:38:53 +00:00
Replace fused multiply-add (FMA) intrinsics with explicit multiply and add/subtract operations in bli_cscalv_zen_int to resolve incorrect results with GCC 12 and later.

The original code used a register-reuse pattern with _mm256_fmaddsub_ps() that causes the GCC 12+ instruction scheduler to generate assembly with corrupted intermediate values due to register allocation conflicts. GCC 11 and earlier handled the same pattern correctly.

Changes:
- Replace _mm256_fmaddsub_ps() with _mm256_mul_ps() + _mm256_addsub_ps()
- Eliminate temp register reuse to fix instruction scheduling conflicts

AMD-Internal: [CPUPL-6445]
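
For illustration only, the arithmetic behind the change can be sketched as a portable scalar model. This is not the code from bli_cscalv_zen_int: the helper names (mul_lanes, addsub_lanes, swap_pairs, cscal_model) are hypothetical, and plain float arrays stand in for 256-bit registers so the sketch builds without AVX. It assumes the usual interleaved (real, imag) layout and shows a separate multiply followed by an add/subtract step, with each intermediate kept in its own temporary rather than reused.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 }; /* one 256-bit register holds 8 floats = 4 complex values */

/* Models _mm256_mul_ps: lane-wise product. */
static void mul_lanes(const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < LANES; ++i)
        out[i] = a[i] * b[i];
}

/* Models _mm256_addsub_ps: subtract in even lanes, add in odd lanes. */
static void addsub_lanes(const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < LANES; ++i)
        out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Swaps the real and imaginary parts within each (re, im) pair. */
static void swap_pairs(const float *x, float *out)
{
    for (size_t i = 0; i < LANES; i += 2) {
        out[i]     = x[i + 1];
        out[i + 1] = x[i];
    }
}

/* Hypothetical model: x := alpha * x over n_blocks blocks of 4 complex values. */
void cscal_model(float alpha_r, float alpha_i, float *x, size_t n_blocks)
{
    assert(x != NULL || n_blocks == 0);

    float ar[LANES], ai[LANES]; /* broadcast real and imaginary parts of alpha */
    for (size_t i = 0; i < LANES; ++i) {
        ar[i] = alpha_r;
        ai[i] = alpha_i;
    }

    for (size_t b = 0; b < n_blocks; ++b) {
        float *xv = x + b * LANES;
        /* Distinct temporaries for every intermediate: nothing is overwritten
           while a later step still needs it. */
        float x_swapped[LANES], prod_real[LANES], prod_imag[LANES], result[LANES];

        swap_pairs(xv, x_swapped);                  /* [xi, xr, ...]           */
        mul_lanes(ar, xv, prod_real);               /* [ar*xr, ar*xi, ...]     */
        mul_lanes(ai, x_swapped, prod_imag);        /* [ai*xi, ai*xr, ...]     */
        addsub_lanes(prod_real, prod_imag, result); /* [ar*xr - ai*xi,
                                                        ar*xi + ai*xr, ...]    */
        for (size_t i = 0; i < LANES; ++i)
            xv[i] = result[i];
    }
}
```

Unlike a fused multiply-add, the split form rounds the product before the add/subtract step, so results may differ from the fused version in the last bit.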