S, Hari Govind ec6f4e96cd Replace intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5
GCC aggressively reorders and interleaves the instructions it generates
from intrinsics code, which makes correctness hard to verify and can
cause accuracy issues in certain cases. This change replaces the
intrinsics-based implementations with inline assembly to ensure a
one-to-one mapping between the source and the generated assembly.

Changes:
- bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly
  * Processes blocks of 128, 64, 32, 16, and 8 elements
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides
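
The block structure described above can be sketched in portable C. This is only an illustration of the control flow, not the kernel itself: the name saxpyv_sketch, the helper axpy_block, and the choice to loop at every block width are assumptions, and the plain C loops stand in for the AVX-512 inline assembly and the masked fringe operation.

```c
#include <stddef.h>

/* Hypothetical helper: y[0..w) += alpha * x[0..w).
   In the real kernel each block width would be a fixed, hand-written
   sequence of AVX-512 instructions rather than a C loop. */
static void axpy_block(size_t w, float alpha, const float *x, float *y)
{
    for (size_t k = 0; k < w; ++k)
        y[k] += alpha * x[k];
}

/* Sketch of y := y + alpha * x with the block cascade 128/64/32/16/8. */
void saxpyv_sketch(size_t n, float alpha,
                   const float *x, ptrdiff_t incx,
                   float *y, ptrdiff_t incy)
{
    if (n == 0)
        return;

    /* Non-unit strides: plain scalar path, one element at a time. */
    if (incx != 1 || incy != 1) {
        for (ptrdiff_t i = 0; i < (ptrdiff_t)n; ++i)
            y[i * incy] += alpha * x[i * incx];
        return;
    }

    /* Unit strides: consume the largest block that still fits. */
    static const size_t widths[] = { 128, 64, 32, 16, 8 };
    size_t i = 0;
    for (size_t b = 0; b < sizeof widths / sizeof widths[0]; ++b) {
        while (n - i >= widths[b]) {
            axpy_block(widths[b], alpha, x + i, y + i);
            i += widths[b];
        }
    }

    /* Fringe: fewer than 8 elements remain. The commit handles this
       with masked operations; here it is an ordinary short loop. */
    axpy_block(n - i, alpha, x + i, y + i);
}
```

For n = 300, for example, this sketch runs two 128-element blocks, one 32-element block, one 8-element block, and a 4-element fringe.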

- bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly
  * Processes blocks of 16 and 8 elements with 5-way fusion
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides
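
To make "5-way fusion" concrete, here is a minimal scalar sketch, assuming the usual axpyf definition y := y + alpha * A * x with five columns in A. The name saxpyf5_sketch and the column-major layout with leading dimension lda are assumptions for illustration. The point is that each element of y is loaded and stored once while five scaled columns are accumulated into it; the actual kernel does this on AVX2 vectors in blocks of 16 and 8 rows.

```c
#include <stddef.h>

/* Sketch of a fused axpy over 5 columns:
   y[i] += alpha * sum_{j<5} a[i + j*lda] * x[j],  for i in [0, m). */
void saxpyf5_sketch(size_t m, float alpha,
                    const float *a, size_t lda,
                    const float *x, float *y)
{
    /* Pre-scale the five x entries once, outside the row loop. */
    float s[5];
    for (size_t j = 0; j < 5; ++j)
        s[j] = alpha * x[j];

    for (size_t i = 0; i < m; ++i) {
        float yi = y[i];                 /* one load of y[i]        */
        for (size_t j = 0; j < 5; ++j)   /* five fused column terms */
            yi += s[j] * a[i + j * lda];
        y[i] = yi;                       /* one store of y[i]       */
    }
}
```

Compared with five separate axpyv calls, the fused form touches y once instead of five times.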

Benefits:
- Predictable code generation with no compiler reordering
- More consistent numerical accuracy, since the compiler cannot apply unexpected transformations
- Easier verification of generated assembly against specifications
- Explicit control over instruction sequence and register allocation
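
As a small illustration of that explicit control, the sketch below fixes the instruction order and registers for a 4-element axpy using GNU extended inline assembly. It is not taken from the commit: the function name axpy4_asm_sketch is hypothetical, and it deliberately uses baseline SSE instructions (available on every x86-64 CPU) instead of the AVX-512 and AVX2 instructions the real kernels use. On other targets it falls back to plain C.

```c
#include <stddef.h>

/* y[0..4) += alpha * x[0..4), with a hand-fixed instruction sequence. */
void axpy4_asm_sketch(float alpha, const float *x, float *y)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile(
        "movss   %[alpha], %%xmm0    \n\t" /* xmm0 lane 0 = alpha        */
        "shufps  $0, %%xmm0, %%xmm0  \n\t" /* broadcast alpha to 4 lanes */
        "movups  (%[x]), %%xmm1      \n\t" /* xmm1 = x[0..3]             */
        "mulps   %%xmm0, %%xmm1      \n\t" /* xmm1 = alpha * x           */
        "movups  (%[y]), %%xmm2      \n\t" /* xmm2 = y[0..3]             */
        "addps   %%xmm1, %%xmm2      \n\t" /* xmm2 = y + alpha * x       */
        "movups  %%xmm2, (%[y])      \n\t" /* store result to y[0..3]    */
        : /* no register outputs; y is written through memory */
        : [alpha] "m"(alpha), [x] "r"(x), [y] "r"(y)
        : "xmm0", "xmm1", "xmm2", "memory");
#else
    /* Portable fallback for non-x86-64 targets or other compilers. */
    for (size_t k = 0; k < 4; ++k)
        y[k] += alpha * x[k];
#endif
}
```

The clobber list names the three vector registers used, and the "memory" clobber tells the compiler that the block reads x and writes y, so the compiler neither reorders memory accesses across the block nor changes the instructions inside it.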
2026-01-29 11:48:47 +05:30