ik_llama.cpp/github-data/pull_requests/546 - Faster ARM_NEON GEMM implementation for legacy quants.md
2025-07-23 13:31:53 +02:00

🔀 #546 - Faster ARM_NEON GEMM implementation for legacy quants

Author ikawrakow
State Closed
Created 2025-06-21
Updated 2025-06-22

Description

It is time to give some attention to the ARM_NEON back-end, which has fallen behind quite a bit.

This PR is the ARM_NEON counterpart of PRs #531, #533 and #534, applying the same on-the-fly repacking technique to Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0 and IQ4_NL.
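The kernel source is not included in this archive, but the idea behind on-the-fly repacking can be shown schematically: instead of processing one row at a time, the kernel gathers the quants of several consecutive rows and interleaves them so that each SIMD load pulls a contiguous group belonging to one row of a multi-row tile. A minimal Python model of that interleaving (block, tile and chunk sizes are illustrative, not the actual ik_llama.cpp layout):

```python
# Schematic model of on-the-fly row interleaving for a NEON GEMM micro-kernel.
# The real ik_llama.cpp kernels operate on quantized C++ block structs; here
# plain lists stand in for the 32 quants of a Q8_0-style block.

BLOCK = 32   # quants per block (as in Q8_0)
TILE = 4     # rows processed together by the micro-kernel (illustrative)
CHUNK = 4    # quants per lane group (stand-in for one NEON vector load)

def repack_tile(rows):
    """Interleave TILE rows of one block so the packed buffer holds CHUNK
    quants of row 0, then CHUNK of row 1, ..., then the next CHUNK of
    row 0, and so on.  The kernel can then walk the buffer linearly,
    feeding a TILE-row dot-product accumulator with contiguous loads."""
    assert len(rows) == TILE and all(len(r) == BLOCK for r in rows)
    packed = []
    for pos in range(0, BLOCK, CHUNK):
        for row in rows:
            packed.extend(row[pos:pos + CHUNK])
    return packed

# Demo: row r holds values r*100 + column index, so the layout is visible.
rows = [[r * 100 + i for i in range(BLOCK)] for r in range(TILE)]
packed = repack_tile(rows)
print(packed[:16])  # 4 quants of row 0, then 4 of row 1, ...
```

The payoff is that the repacked data is consumed with unit-stride loads across all rows of the tile, which is what makes the batched (prompt-processing) path faster.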

Here is a PP-512 performance comparison between the main branch and this PR for Llama-3.1-8B-Instruct on M2-Max:

| type   | t/s (main) | t/s (PR) | Speedup |
|--------|-----------:|---------:|--------:|
| Q4_0   |      83.58 |   128.41 |   1.536 |
| Q5_0   |      74.20 |   128.57 |   1.733 |
| Q6_0   |      74.25 |   128.79 |   1.735 |
| Q8_0   |      84.45 |   128.63 |   1.523 |
| IQ4_NL |      84.47 |   128.09 |   1.516 |
| Q4_1   |      74.44 |   115.36 |   1.550 |
| Q5_1   |      64.16 |   114.89 |   1.791 |
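The Speedup column is simply t/s (PR) divided by t/s (main); the reported ratios can be reproduced directly from the table:

```python
# Reproduce the Speedup column: speedup = t/s (PR) / t/s (main),
# rounded to three decimal places as in the table.
pp512 = {                 # type: (t/s main, t/s PR)
    "Q4_0":   (83.58, 128.41),
    "Q5_0":   (74.20, 128.57),
    "Q6_0":   (74.25, 128.79),
    "Q8_0":   (84.45, 128.63),
    "IQ4_NL": (84.47, 128.09),
    "Q4_1":   (74.44, 115.36),
    "Q5_1":   (64.16, 114.89),
}
speedup = {t: round(pr / main, 3) for t, (main, pr) in pp512.items()}
for t, s in speedup.items():
    print(f"{t}: {s}")
```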

💬 Conversation

👤 zhouwg commented on 2025-06-22 at 07:22:29:

I tried ik_llama.cpp today on an Android phone with a Qualcomm Snapdragon 8 Elite (one of the most advanced mobile SoCs at the moment); the performance of your excellent ik_llama.cpp is impressive (faster than upstream llama.cpp).

Both were built with "-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only", because "-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only" does not work with ik_llama.cpp due to compile errors in inline assembly code.

upstream llama.cpp: llama-bench: Screenshot from 2025-06-22 12-58-28; llama-cli: Screenshot from 2025-06-22 15-12-04

ik_llama.cpp: llama-bench: Screenshot from 2025-06-22 13-08-16, Screenshot from 2025-06-22 15-09-01

llama-cli (the inference result is incorrect and I don't know why): Screenshot from 2025-06-22 15-12-20
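For reference, the quoted flag sets can be fed to an Android cross-build through CMake. A minimal sketch, assuming NDK r28 is installed at $ANDROID_NDK and an arm64-v8a target with platform level android-34 (all assumptions, not stated in the thread):

```shell
# Hypothetical Android NDK cross-build of ik_llama.cpp with the flag set
# that worked above; adjust ANDROID_NDK, ABI and platform for your setup.
FLAGS="-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-34 \
  -DCMAKE_C_FLAGS="$FLAGS" \
  -DCMAKE_CXX_FLAGS="$FLAGS"
cmake --build build-android -j
```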


👤 zhouwg commented on 2025-06-22 at 10:58:12:

Comparison of llama-bench on an Android phone with a Qualcomm Snapdragon 8 Elite (one of the most advanced mobile SoCs at the moment) + Android NDK r28 (the following benchmark data may depend on the load on the Android OS):

  1. both built with "-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"

upstream llama.cpp with latest codes: llama-bench: Screenshot from 2025-06-22 12-58-28 llama-cli: Screenshot from 2025-06-22 15-12-04

ik_llama.cpp with latest codes:

Screenshot from 2025-06-22 13-08-16

Screenshot from 2025-06-22 15-09-01

llama-cli (the inference result is incorrect): Screenshot from 2025-06-22 15-12-20

  2. both built with "-O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"

upstream llama.cpp with latest codes:

Screenshot from 2025-06-22 15-55-05

ik_llama.cpp with latest codes: Screenshot from 2025-06-22 15-47-34

  3. both built with "-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"

upstream llama.cpp with latest codes: Screenshot from 2025-06-22 16-16-13

ik_llama.cpp with latest codes: Screenshot from 2025-06-22 16-22-37

  4. both built with "-O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only"

upstream llama.cpp with latest codes:

Screenshot from 2025-06-22 17-30-43

The following is a screenshot from when I helped troubleshoot a performance regression in the upstream llama.cpp project. With so many PRs being approved upstream, some of them may introduce regressions, and sometimes I cannot reproduce the same benchmark result with the latest upstream code.

455784182-f30ce0c8-5528-44fe-8be3-213ebaf4e730

ik_llama.cpp with latest codes: Screenshot from 2025-06-22 17-45-34

After enabling GGML_IQK_FLASH_ATTENTION:

built with "-O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only": Screenshot from 2025-06-22 18-09-57

built with "-O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only":

Screenshot from 2025-06-22 18-18-55

built with "-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only":

Screenshot from 2025-06-22 18-24-55

build failed with "-O3 -march=armv8.7-a -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"

built with "-O3 -march=armv8.7-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only": Screenshot from 2025-06-22 18-33-45

built with "-O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only": Screenshot from 2025-06-22 18-46-51

built with "-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only": Screenshot from 2025-06-22 18-56-27
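GGML_IQK_FLASH_ATTENTION is referenced above as a build-time switch; assuming it is an ordinary CMake option in ik_llama.cpp (an assumption, not verified here), enabling it would look like:

```shell
# Hypothetical: turn on the flash-attention path at configure time and
# rebuild; combine with the cross-compile flags from the comments above.
cmake -B build-android -DGGML_IQK_FLASH_ATTENTION=ON
cmake --build build-android -j
```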