mirror of
https://github.com/amd/blis.git
synced 2026-07-02 13:17:16 +00:00
GCC over-optimizes intrinsics code by reordering and interleaving instructions, which makes correctness difficult to verify and can cause accuracy issues in certain cases. This change replaces the intrinsics-based implementations with inline assembly to ensure a one-to-one mapping between the source and the generated assembly.

Changes:
- bli_saxpyv_zen4_int: Converted AVX-512 intrinsics to inline assembly
  * Processes blocks of 128, 64, 32, 16, and 8 elements
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides
- bli_saxpyf_zen_int_5: Converted AVX2 intrinsics to inline assembly
  * Processes blocks of 16 and 8 elements with 5-way fusion
  * Handles fringe cases with masked operations
  * Preserves scalar path for non-unit strides

Benefits:
- Predictable code generation with no compiler reordering
- Better numerical accuracy by preventing unexpected transformations
- Easier verification of generated assembly against specifications
- Explicit control over instruction sequence and register allocation
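
The control flow described for the saxpyv kernel (descending block sizes, a masked fringe, and a scalar path for non-unit strides) can be sketched roughly as below. This is an illustrative sketch, not the BLIS kernel: the names `axpy8_sketch` and `saxpyv_sketch` are hypothetical, the signature is simplified, and the inner step uses one 8-wide AVX2 FMA in GCC extended inline assembly (with a plain C fallback when AVX2/FMA are not enabled) rather than the AVX-512 sequence in the actual change. The fringe mask is emulated with a bit test instead of a hardware mask register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y[0..7] += alpha * x[0..7] for unit-stride data.
 * With AVX2 and FMA enabled, the instruction sequence is written out
 * explicitly so the compiler cannot reorder or interleave it. */
static void axpy8_sketch(float alpha, const float *x, float *y)
{
#if defined(__GNUC__) && defined(__x86_64__) && defined(__AVX2__) && defined(__FMA__)
    __asm__ volatile(
        "vbroadcastss %[alpha], %%ymm0       \n\t" /* ymm0 = alpha in all lanes */
        "vmovups      (%[x]), %%ymm1         \n\t" /* ymm1 = x[0..7]            */
        "vmovups      (%[y]), %%ymm2         \n\t" /* ymm2 = y[0..7]            */
        "vfmadd231ps  %%ymm1, %%ymm0, %%ymm2 \n\t" /* ymm2 += ymm0 * ymm1       */
        "vmovups      %%ymm2, (%[y])         \n\t" /* store back to y[0..7]     */
        :
        : [alpha] "m"(alpha), [x] "r"(x), [y] "r"(y)
        : "ymm0", "ymm1", "ymm2", "memory");
#else
    /* Portable fallback so the sketch builds without AVX2/FMA. */
    for (int i = 0; i < 8; ++i)
        y[i] += alpha * x[i];
#endif
}

/* Hypothetical driver: y := y + alpha * x over n elements. */
void saxpyv_sketch(size_t n, float alpha,
                   const float *x, size_t incx,
                   float *y, size_t incy)
{
    if (n == 0)
        return;

    /* Scalar path for non-unit strides. */
    if (incx != 1 || incy != 1) {
        for (size_t i = 0; i < n; ++i)
            y[i * incy] += alpha * x[i * incx];
        return;
    }

    /* Unit stride: consume blocks of 128, 64, 32, 16, then 8 elements. */
    static const size_t blocks[] = {128, 64, 32, 16, 8};
    size_t i = 0;
    for (size_t b = 0; b < sizeof blocks / sizeof blocks[0]; ++b) {
        while (n - i >= blocks[b]) {
            for (size_t j = 0; j < blocks[b]; j += 8)
                axpy8_sketch(alpha, x + i + j, y + i + j);
            i += blocks[b];
        }
    }

    /* Fringe: fewer than 8 elements remain. Build a lane mask with the
     * low `rem` bits set and update only the selected lanes. */
    size_t rem = n - i;
    assert(rem < 8);
    unsigned mask = (1u << rem) - 1u;
    for (size_t lane = 0; lane < 8; ++lane)
        if (mask & (1u << lane))
            y[i + lane] += alpha * x[i + lane];
}
```

Calling `saxpyv_sketch(301, 2.0f, x, 1, y, 1)` exercises every stage: two blocks of 128, one of 32, one of 8, and a 5-element fringe. Because the asm body is `volatile` with explicit clobbers, the five instructions appear in the generated assembly in the order written, which is the property the commit relies on.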