S, Hari Govind cdd181a7d7 Optimize complex scalv kernels with inline assembly and FMA instructions
This PR optimizes the complex scalar vector multiplication kernels by replacing
intrinsics with inline assembly and leveraging FMA (Fused Multiply-Add) instructions
for improved performance.

Changes:
- Replaced intrinsic-based implementation with inline assembly
- Uses the `vfmaddsub231ps`, `vfmadd231ss`, and `vfmsub231ss` FMA instructions
- Improved instruction scheduling and register usage
- Handles both unit-stride (vectorized) and non-unit-stride (scalar) cases
- Processes up to 16 complex elements per iteration in the main loop
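For reference, the arithmetic these kernels perform can be sketched in portable C.
This is a minimal illustration, not the code in this commit: the names `scomplex_t`,
`fmaddsub_pair`, and `cscalv_sketch` are hypothetical, the stride is assumed to be
positive, and the real kernel uses inline assembly over wide vector registers instead
of a scalar loop. The helper mirrors the lane behaviour of `vfmaddsub231ps` on one
interleaved (real, imag) pair: the even lane subtracts and the odd lane adds.

```c
#include <stddef.h>  /* size_t, ptrdiff_t */

/* Hypothetical single-precision complex type, stored as (real, imag). */
typedef struct { float real; float imag; } scomplex_t;

/*
 * Scalar emulation of what vfmaddsub231ps does to one complex element:
 *   even lane (real): a*b - c
 *   odd lane  (imag): a*b + c
 * The expressions are written as a*b -/+ c; a compiler targeting
 * FMA-capable hardware may contract each one into a single FMA.
 */
static scomplex_t fmaddsub_pair(scomplex_t alpha, scomplex_t x)
{
    /* alpha.imag times the swapped element (imag, real). */
    float t_even = alpha.imag * x.imag;
    float t_odd  = alpha.imag * x.real;

    scomplex_t r;
    r.real = alpha.real * x.real - t_even;  /* ar*xr - ai*xi */
    r.imag = alpha.real * x.imag + t_odd;   /* ar*xi + ai*xr */
    return r;
}

/*
 * x[i*incx] := alpha * x[i*incx] for i in [0, n).
 * Assumes incx > 0 (an assumption of this sketch).
 */
static void cscalv_sketch(size_t n, scomplex_t alpha,
                          scomplex_t *x, ptrdiff_t incx)
{
    if (incx == 1) {
        /* Unit stride: contiguous data, the case a kernel can vectorize. */
        for (size_t i = 0; i < n; ++i)
            x[i] = fmaddsub_pair(alpha, x[i]);
    } else {
        /* Non-unit stride: one element at a time. */
        scomplex_t *p = x;
        for (size_t i = 0; i < n; ++i, p += incx)
            *p = fmaddsub_pair(alpha, *p);
    }
}
```

In this sketch both branches share the same arithmetic; the split only marks where a
vectorized kernel would diverge from its scalar fallback.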
2026-04-13 16:06:02 +05:30