Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-05-01 11:51:53 +00:00

### 🔀 [#546](https://github.com/ikawrakow/ik_llama.cpp/pull/546) - Faster ARM_NEON GEMM implementation for legacy quants
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-06-21 |
| **Updated** | 2025-06-22 |

---

#### Description

It is time to give some attention to the `ARM_NEON` back-end, which has fallen behind quite a bit.

This PR is the `ARM_NEON` counterpart of PRs #531, #533, and #534: it applies the same on-the-fly repacking technique to `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q6_0`, `Q8_0`, and `IQ4_NL`.
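A minimal sketch of the idea, with hypothetical type and function names (not the PR's actual kernels): several rows of quantized blocks are gathered into an interleaved tile right before the multiplication, so a NEON kernel can reuse each loaded activation block across four rows instead of reloading it for every row.

```cpp
// Illustrative sketch of on-the-fly repacking (hypothetical names, not the
// PR's actual code). Four rows of Q8_0 blocks are gathered into one tile so
// the GEMM inner loop can accumulate four rows per loaded activation block.
#include <cstdint>
#include <cstring>

constexpr int QK8_0 = 32;          // elements per Q8_0 block, as in ggml

struct block_q8_0 {                // mirrors ggml's block_q8_0 layout
    uint16_t d;                    // block scale (fp16 bit pattern)
    int8_t   qs[QK8_0];            // 32 quantized values
};

struct block_q8_0x4 {              // hypothetical 4-row tile
    uint16_t d[4];                 // one scale per row
    int8_t   qs[4 * QK8_0];        // four rows' quants, back-to-back
};

// Repack one block-column of four consecutive rows into a tile. A real
// kernel would pick an interleaving pattern that matches the NEON register
// width; the rows are stored back-to-back here to keep the example short.
inline void repack_q8_0x4(const block_q8_0 *rows[4], block_q8_0x4 *tile) {
    for (int r = 0; r < 4; ++r) {
        tile->d[r]  = rows[r]->d;
        std::memcpy(tile->qs + r * QK8_0, rows[r]->qs, QK8_0);
    }
}
```

The payoff is in the GEMM inner loop: each activation block is loaded once and dotted against all four rows of the tile rather than reloaded per row, which is where speedups like the ones reported below come from.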

Here is a PP-512 performance comparison between the main branch and this PR for Llama-3.1-8B-Instruct on an M2-Max:

| type | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: |
| Q4_0 | 83.58 | 128.41 | 1.536 |
| Q5_0 | 74.20 | 128.57 | 1.733 |
| Q6_0 | 74.25 | 128.79 | 1.735 |
| Q8_0 | 84.45 | 128.63 | 1.523 |
| IQ4_NL | 84.47 | 128.09 | 1.516 |
| Q4_1 | 74.44 | 115.36 | 1.550 |
| Q5_1 | 64.16 | 114.89 | 1.791 |
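For context, PP-512 numbers like these are typically collected with `llama-bench`; a minimal sketch of such a run (the model path is a placeholder):

```bash
# Prompt processing only: 512-token prompt, generation disabled (-n 0).
./bin/llama-bench -m models/Llama-3.1-8B-Instruct-Q4_0.gguf -p 512 -n 0
```
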
---

#### 💬 Conversation

👤 **zhouwg** commented on **2025-06-22** at **07:22:29**:<br>

I tried your ik_llama.cpp today on an Android phone equipped with a Qualcomm Snapdragon 8 Elite (one of the most advanced mobile SoCs on the planet at the moment). **The performance of your excellent ik_llama.cpp is impressive (faster than upstream llama.cpp).**

Both were built with `-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`, because `-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only` does not work with ik_llama.cpp due to compile errors in the inline assembly code.
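A minimal sketch of how such flags are typically passed when cross-compiling either project with the Android NDK's CMake toolchain (the NDK path, API level, and build directory are placeholders):

```bash
# Hypothetical paths and API level; adjust to the local NDK install.
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-34 \
  -DCMAKE_C_FLAGS="-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only" \
  -DCMAKE_CXX_FLAGS="-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
cmake --build build-android -j
```
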
upstream llama.cpp:

llama-bench:



llama-cli:



ik_llama.cpp:

llama-bench:





llama-cli (the inference result is incorrect, and I don't know why):



---

👤 **zhouwg** commented on **2025-06-22** at **07:24:04**:<br>

I tried ik_llama.cpp today on an Android phone equipped with a Qualcomm Snapdragon 8 Elite (one of the most advanced mobile SoCs on the planet at the moment). **The performance of your excellent ik_llama.cpp is impressive.**

Both were built with `-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`, because `-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only` does not work with ik_llama.cpp due to compile errors in the inline assembly code.

upstream llama.cpp (latest code):

llama-bench:



llama-cli:



ik_llama.cpp (latest code):

llama-bench:





llama-cli (the inference result is incorrect, and I don't know why):



---

👤 **zhouwg** commented on **2025-06-22** at **08:36:03**:<br>

Comparison of llama-bench on an Android phone with a Qualcomm Snapdragon 8 Elite (one of the most advanced mobile SoCs on the planet at the moment) + Android NDK r28:

1. Both built with `-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):

llama-bench:



llama-cli:



ik_llama.cpp (latest code):





llama-cli (the inference result is incorrect):



2. Both built with `-O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



ik_llama.cpp (latest code):



3. Both built with `-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



ik_llama.cpp (latest code):



4. Both built with `-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):

The following is a screenshot from when I helped troubleshoot a performance regression in the upstream llama.cpp project. As is well known, upstream llama.cpp approves a great many PRs, and some of them may introduce regressions; sometimes I cannot reproduce the same benchmark results with the latest upstream code.



ik_llama.cpp (latest code):

---

👤 **zhouwg** commented on **2025-06-22** at **09:46:28**:<br>

Comparison of llama-bench on an Android phone with a Qualcomm Snapdragon 8 Elite (one of the most advanced mobile SoCs on the planet at the moment) + Android NDK r28:

1. Both built with `-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):

llama-bench:



llama-cli:



ik_llama.cpp (latest code):





llama-cli (the inference result is incorrect):



2. Both built with `-O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



ik_llama.cpp (latest code):



3. Both built with `-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



ik_llama.cpp (latest code):



4. Both built with `-O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



The following is a screenshot from when I helped troubleshoot a performance regression in the upstream llama.cpp project. As is well known, upstream llama.cpp approves a great many PRs, and some of them may introduce regressions; sometimes I cannot reproduce the same benchmark results with the latest upstream code.



ik_llama.cpp (latest code):



---

👤 **zhouwg** commented on **2025-06-22** at **10:58:12**:<br>

Comparison of llama-bench on an Android phone with a Qualcomm Snapdragon 8 Elite (one of the most advanced mobile SoCs on the planet at the moment) + Android NDK r28 (the benchmark numbers below may depend on the Android OS's background load):

1. Both built with `-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):

llama-bench:



llama-cli:



ik_llama.cpp (latest code):





llama-cli (the inference result is incorrect):



2. Both built with `-O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



ik_llama.cpp (latest code):



3. Both built with `-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



ik_llama.cpp (latest code):



4. Both built with `-O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only`

upstream llama.cpp (latest code):



The following is a screenshot from when I helped troubleshoot a performance regression in the upstream llama.cpp project. As is well known, upstream llama.cpp approves a great many PRs, and some of them may introduce regressions; sometimes I cannot reproduce the same benchmark results with the latest upstream code.



ik_llama.cpp (latest code):


After enabling `GGML_IQK_FLASH_ATTENTION`:
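Presumably this is a ggml CMake option; a hypothetical configure line to switch it on (combine with the toolchain arguments shown earlier):

```bash
# Hypothetical: enable the option at configure time, then rebuild.
cmake -B build-android -DGGML_IQK_FLASH_ATTENTION=ON
cmake --build build-android -j
```
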
Built with `-O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only`:



Built with `-O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`:



Built with `-O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`:



Build failed with `-O3 -march=armv8.7-a -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`.

Built with `-O3 -march=armv8.7-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`:



Built with `-O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`:



Built with `-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only`:

