mirror of
https://github.com/amd/blis.git
synced 2026-06-29 02:37:05 +00:00
This PR optimizes the complex scalar vector multiplication kernels by replacing intrinsics with inline assembly and leveraging FMA (Fused Multiply-Add) instructions for improved performance.

Changes:
- Replaced intrinsic-based implementation with inline assembly
- Utilizes `vfmaddsub231ps`, `vfmadd231ss`, and `vfmsub231ss` FMA instructions
- Improved instruction scheduling and register usage
- Handles both unit-stride (vectorized) and non-unit-stride (scalar) cases
- Processes up to 16 complex elements per iteration in the main loop
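
For reviewers who want a reference to compare the kernel against, the arithmetic described above can be sketched in portable C. This is only an illustrative sketch, not the code in this PR: the type `cfloat_t`, the function `cscal_sketch`, and the constant `SKETCH_UNROLL` are hypothetical names, the stride is assumed to be positive, and plain C expressions stand in for the inline assembly. The comments note which FMA instruction each expression roughly corresponds to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved single-precision complex type (real, imag). */
typedef struct { float real; float imag; } cfloat_t;

/* Hypothetical unroll factor mirroring "16 complex elements per iteration". */
#define SKETCH_UNROLL 16

/* Scale one element in place: x := alpha * x.
 * real = ar*xr - ai*xi  (subtract: even lane of vfmaddsub231ps, or vfmsub231ss)
 * imag = ar*xi + ai*xr  (add:      odd lane of vfmaddsub231ps, or vfmadd231ss) */
static void scale_one(cfloat_t alpha, cfloat_t *x)
{
    float xr = x->real;
    float xi = x->imag;
    x->real = alpha.real * xr - alpha.imag * xi;
    x->imag = alpha.real * xi + alpha.imag * xr;
}

/* x[i*incx] := alpha * x[i*incx] for i in [0, n). Assumes incx > 0. */
static void cscal_sketch(size_t n, cfloat_t alpha, cfloat_t *x, ptrdiff_t incx)
{
    assert(x != NULL || n == 0);
    assert(incx > 0);

    if (incx == 1) {
        size_t i = 0;
        /* Main loop: blocks of SKETCH_UNROLL contiguous elements, the part
         * a vectorized kernel would handle with packed instructions. */
        for (; i + SKETCH_UNROLL <= n; i += SKETCH_UNROLL) {
            for (size_t j = 0; j < SKETCH_UNROLL; ++j) {
                scale_one(alpha, &x[i + j]);
            }
        }
        /* Tail: leftover elements that do not fill a whole block. */
        for (; i < n; ++i) {
            scale_one(alpha, &x[i]);
        }
    } else {
        /* Non-unit stride: one element at a time, as in the scalar path. */
        cfloat_t *p = x;
        for (size_t i = 0; i < n; ++i, p += incx) {
            scale_one(alpha, p);
        }
    }
}
```

Calling `cscal_sketch(n, alpha, x, 1)` exercises the blocked path, while any larger stride falls through to the element-by-element path. A true FMA rounds once per fused operation, so the assembly kernel may differ from this sketch in the last bit for inputs that are not exactly representable.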