🗣️ #556 - ik_llama.cpp for Armv8.0
| | |
|---|---|
| Author | NotAHero04 |
| Created | 2025-06-25 |
| Updated | 2025-06-26 |
Description
I managed to port ik_llama.cpp to my phone, which has a Snapdragon 680 CPU. Even though it runs under heavy emulation, it's still much faster than mainline llama.cpp. All tests were done using the Qwen 3 0.6B model.
What works:
- Quants: legacy quants (tested Q4_0, Q8_0), i-quants (IQ4_XS), k-quants (Q4_K_M), iqk-quants (IQ4_KS, IQ5_K).
- Flash attention.
What doesn't work:
- Trellis quants (tested IQ4_KT), though the failure might be specific to the model or to my quantization. I'll test it more tonight.
- Repacking (both online repacking and pre-repacked quants; tested Q4_0_R8 and Q8_0_R8).
If anyone is interested, I'll publish a fork. It just adds emulation for some NEON dot-product and float16 arithmetic intrinsics (mainline also has some level of emulation for v8.0); a sketch of the idea is shown below.
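Roughly, the idea looks like this (a minimal sketch, assuming baseline AArch64 NEON; the `_emulated` names are just for illustration here, not the actual code from the fork):

```c
#include <arm_neon.h>

// vdotq_s32 needs the +dotprod extension (Armv8.2). On baseline Armv8.0
// AArch64 it can be emulated with widening multiplies and pairwise adds:
// lane i of the result accumulates a[4i..4i+3] . b[4i..4i+3].
static inline int32x4_t vdotq_s32_emulated(int32x4_t acc, int8x16_t a, int8x16_t b) {
    int16x8_t lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));   // products of bytes 0..7
    int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));  // products of bytes 8..15
    int32x4_t p0 = vpaddlq_s16(lo);              // {0+1, 2+3, 4+5, 6+7}
    int32x4_t p1 = vpaddlq_s16(hi);              // {8+9, 10+11, 12+13, 14+15}
    return vaddq_s32(acc, vpaddq_s32(p0, p1));   // per-lane 4-byte group sums
}

// float16 arithmetic (e.g. vaddq_f16) needs FEAT_FP16; the f16<->f32
// conversions are baseline, so do the arithmetic in f32 and convert back.
static inline float16x8_t vaddq_f16_emulated(float16x8_t a, float16x8_t b) {
    float32x4_t lo = vaddq_f32(vcvt_f32_f16(vget_low_f16(a)),  vcvt_f32_f16(vget_low_f16(b)));
    float32x4_t hi = vaddq_f32(vcvt_f32_f16(vget_high_f16(a)), vcvt_f32_f16(vget_high_f16(b)));
    return vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi));
}
```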
🗣️ Discussion
👤 ikawrakow replied the 2025-06-25 at 07:52:27:
Nice 😄
Do the repacked variants not work because the emulation for vdotq_laneq_s32 is incorrect, or is there some other issue? But I guess it may not be worth putting too much effort into this, as one would need to use vgetq_lane_X, which would make the dot products quite slow, I think.
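For reference, one way such a fallback might look (a sketch, not tested here; it broadcasts the selected 4-byte group with vdupq_laneq_s32 instead of extracting lanes, and reuses the hypothetical vdotq_s32_emulated helper from above):

```c
// Hypothetical fallback for vdotq_laneq_s32: broadcast the lane-th 4-byte
// group of b to all lanes, then reuse the plain dot-product emulation.
// (lane must be a compile-time constant for vdupq_laneq_s32, hence a macro.)
#define vdotq_laneq_s32_emulated(acc, a, b, lane) \
    vdotq_s32_emulated((acc), (a), \
        vreinterpretq_s8_s32(vdupq_laneq_s32(vreinterpretq_s32_s8((b)), (lane))))
```

Whether that ends up faster than going through vgetq_lane_X probably depends on the surrounding kernel.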
👤 NotAHero04 replied the 2025-06-25 at 14:37:21:
I did a fresh recompile and repacking works now! Unfortunately IQ4_KT still doesn't work :(
👤 ikawrakow replied the 2025-06-25 at 15:30:22:
The *_KT quants are very slow on my M2-Max CPU, so it may not be worth putting in the effort to make them work on a v8.0 phone.
👤 NotAHero04 replied the 2025-06-26 at 09:18:15:
So the KT quants do work after all; I just had to get the model from my PC. And yes, it is unbearably slow (Q4_0 is 3x faster in TG).
👤 ikawrakow replied the 2025-06-26 at 16:57:03:
Yes, the *_KT quants' performance is very competitive on a GPU, nearly competitive on the two x86_64 CPUs I have available, 2X slower than corresponding-size quants on the M2-Max CPU, and ridiculously slow on the M2-Max GPU.
But it's nice that you have made all this work!