🗣️ #556 - ik_llama.cpp for Armv8.0
| | |
|---|---|
| Author | NotAHero04 |
| Created | 2025-06-25 |
| Updated | 2025-06-26 |
Description
I managed to port ik_llama.cpp to my phone, which has a Snapdragon 680 CPU. Even though it runs under heavy emulation, it's still much faster than mainline llama.cpp. All tests were done using the Qwen 3 0.6B model.
What works:
- Quants: legacy quants (tested Q4_0, Q8_0), i-quants (IQ4_XS), k-quants (Q4_K_M), iqk-quants (IQ4_KS, IQ5_K).
- Flash attention.
What doesn't work:
- Trellis quants (tested IQ4_KT), though the failure might be specific to the model or to my quantization. I'll test it more tonight.
- Repacking (both online repacking and pre-repacked quants; tested Q4_0_R8 and Q8_0_R8).
If anyone is interested, I'll publish a fork. It just adds emulation for some NEON dot-product and float16 arithmetic intrinsics (mainline also has some level of emulation for v8.0); a sketch of the idea is shown below.
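Roughly, the idea looks like this (a minimal sketch, assuming baseline AArch64 NEON; the `_emulated` names are just for illustration here, not the actual code from the fork):

```c
#include <arm_neon.h>

// vdotq_s32 needs the +dotprod extension (Armv8.2). On baseline Armv8.0
// AArch64 it can be emulated with widening multiplies and pairwise adds:
// lane i of the result accumulates a[4i..4i+3] . b[4i..4i+3].
static inline int32x4_t vdotq_s32_emulated(int32x4_t acc, int8x16_t a, int8x16_t b) {
    int16x8_t lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));   // products of bytes 0..7
    int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));  // products of bytes 8..15
    int32x4_t p0 = vpaddlq_s16(lo);              // {0+1, 2+3, 4+5, 6+7}
    int32x4_t p1 = vpaddlq_s16(hi);              // {8+9, 10+11, 12+13, 14+15}
    return vaddq_s32(acc, vpaddq_s32(p0, p1));   // per-lane 4-byte group sums
}

// float16 arithmetic (e.g. vaddq_f16) needs FEAT_FP16; the f16<->f32
// conversions are baseline, so do the arithmetic in f32 and convert back.
static inline float16x8_t vaddq_f16_emulated(float16x8_t a, float16x8_t b) {
    float32x4_t lo = vaddq_f32(vcvt_f32_f16(vget_low_f16(a)),  vcvt_f32_f16(vget_low_f16(b)));
    float32x4_t hi = vaddq_f32(vcvt_f32_f16(vget_high_f16(a)), vcvt_f32_f16(vget_high_f16(b)));
    return vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi));
}
```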
🗣️ Discussion
👤 ikawrakow replied the 2025-06-25 at 07:52:27:
Nice 😄
Do the repacked variants not work because the emulation for vdotq_laneq_s32 is incorrect, or is there some other issue? But I guess it may not be worth putting too much effort into this, as one would need to use vgetq_lane_X, which would make the dot products quite slow, I think.
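For reference, one way such a fallback might look (a sketch, not tested here; it broadcasts the selected 4-byte group with vdupq_laneq_s32 instead of extracting lanes, and reuses the hypothetical vdotq_s32_emulated helper from above):

```c
// Hypothetical fallback for vdotq_laneq_s32: broadcast the lane-th 4-byte
// group of b to all lanes, then reuse the plain dot-product emulation.
// (lane must be a compile-time constant for vdupq_laneq_s32, hence a macro.)
#define vdotq_laneq_s32_emulated(acc, a, b, lane) \
    vdotq_s32_emulated((acc), (a), \
        vreinterpretq_s8_s32(vdupq_laneq_s32(vreinterpretq_s32_s8((b)), (lane))))
```

Whether that ends up faster than going through vgetq_lane_X probably depends on the surrounding kernel.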
👤 NotAHero04 replied the 2025-06-25 at 14:37:21:
I did a fresh recompile and repacking works now! Unfortunately IQ4_KT still doesn't work :(
👤 ikawrakow replied the 2025-06-25 at 15:30:22:
The *_KT quants are very slow on my M2-Max CPU, so it may not be worth putting in the effort to make them work on a v8.0 phone.
👤 NotAHero04 replied the 2025-06-26 at 09:18:15:
So the KT quants do work after all; I just had to get the model from my PC. And yes, it is unbearably slow (Q4_0 is 3x faster in TG).
👤 ikawrakow replied the 2025-06-26 at 16:57:03:
Yes, the *_KT quants' performance is very competitive on a GPU, nearly competitive on the two x86_64 CPUs I have available, 2X slower than corresponding-size quants on the M2-Max CPU, and ridiculously slow on the M2-Max GPU.
But it's nice that you have made all this work!