### 🗣️ [#556](https://github.com/ikawrakow/ik_llama.cpp/discussions/556) - ik_llama.cpp for Armv8.0
| **Author** | `NotAHero04` |
| :--- | :--- |
| **Created** | 2025-06-25 |
| **Updated** | 2025-06-26 |
---
#### Description
I managed to port ik_llama.cpp to my phone, which has a Snapdragon 680 CPU. Even though it runs under heavy emulation, it is still much faster than mainline llama.cpp. All tests were done with the Qwen 3 0.6B model.

What works:
- Quants: legacy quants (tested Q4_0, Q8_0), i-quants (IQ4_XS), k-quants (Q4_K_M), iqk-quants (IQ4_KS, IQ5_K).
- Flash attention.

What doesn't work:
- Trellis quants (tested IQ4_KT), though the failure might be specific to the model or to my quantization. I'll test it more tonight.
- Repacking (both online repacking and models quantized directly to repacked types; tested Q4_0_R8 and Q8_0_R8).

If anyone is interested, I'll publish a fork. It just adds emulation for some NEON dot-product and float16 arithmetic intrinsics (mainline llama.cpp also has some level of emulation for v8.0).
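For illustration, here is a minimal sketch of what such shims can look like; this is my own sketch, not code from the fork, and the `_emu` names are invented. The two patterns shown (widening multiplies plus pairwise adds to stand in for the ARMv8.2+dotprod `sdot` instruction, and round-tripping through f32 to stand in for ARMv8.2-FP16 arithmetic) use only intrinsics available on baseline Armv8.0 AArch64:

```c
#include <arm_neon.h>

// Sketch of a dot-product shim: vdotq_s32 is an ARMv8.2+dotprod intrinsic,
// but the same result can be computed on baseline Armv8.0 AArch64 with
// widening multiplies and pairwise additions.
static inline int32x4_t vdotq_s32_emu(int32x4_t acc, int8x16_t a, int8x16_t b) {
    // 16 int8 x int8 products, widened to two int16x8 vectors
    const int16x8_t p_lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));
    const int16x8_t p_hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
    // vpaddlq_s16 sums adjacent pairs (int16x8 -> int32x4); vpaddq_s32 pairs
    // once more, leaving one sum of four products per 32-bit lane
    return vaddq_s32(acc, vpaddq_s32(vpaddlq_s16(p_lo), vpaddlq_s16(p_hi)));
}

// Sketch of a float16 arithmetic shim: Armv8.0 AArch64 can convert f16<->f32
// but has no f16 arithmetic (that needs ARMv8.2-FP16), so operations
// round-trip through f32.
static inline float16x8_t vaddq_f16_emu(float16x8_t a, float16x8_t b) {
    const float32x4_t lo = vaddq_f32(vcvt_f32_f16(vget_low_f16(a)),
                                     vcvt_f32_f16(vget_low_f16(b)));
    const float32x4_t hi = vaddq_f32(vcvt_f32_f16(vget_high_f16(a)),
                                     vcvt_f32_f16(vget_high_f16(b)));
    return vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi));
}
```

A native `sdot` does the integer dot product in a single instruction, while the shim needs about six, which gives a rough sense of where the emulation overhead comes from.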
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-06-25** at **07:52:27**:
Nice 😄
Do the repacked variants fail because the emulation for `vdotq_laneq_s32` is incorrect, or is there some other issue? But I guess it may not be worth putting too much effort into this, as one would need to use `vgetq_lane_X`, which will make the dot products quite slow, I think.
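For context, one conceivable vector-only shim for `vdotq_laneq_s32` (again a sketch of mine, not necessarily what the fork does) broadcasts the selected 32-bit lane of `b` with `vdupq_laneq_s32` and reuses a plain dot-product emulation such as the `vdotq_s32_emu` sketch above, rather than extracting scalars with `vgetq_lane_X`:

```c
// lane selects one group of four int8 values in b; it must be a
// compile-time constant, which is why this is a macro rather than a function
#define vdotq_laneq_s32_emu(acc, a, b, lane)                 \
    vdotq_s32_emu((acc), (a),                                \
        vreinterpretq_s8_s32(                                \
            vdupq_laneq_s32(vreinterpretq_s32_s8(b), (lane))))
```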
---
👤 **NotAHero04** replied the **2025-06-25** at **14:37:21**:
I did a fresh recompile and repacking works now! Unfortunately IQ4_KT still doesn't work :(

---
👤 **ikawrakow** replied the **2025-06-25** at **15:30:22**:
The `*_KT` quants are very slow on my M2-Max CPU, so it may not be worth the effort to make them work on a v8.0 phone.
> 👤 **NotAHero04** replied the **2025-06-26** at **09:18:15**:
> So the KT quants do work after all; I just had to get the model from my PC. And yes, they are unbearably slow (Q4_0 is 3x faster in TG, i.e. token generation).
> 
---
👤 **ikawrakow** replied the **2025-06-26** at **16:57:03**:
Yes, the performance of the `*_kt` quants is very competitive on a GPU, nearly competitive on the two `x86_64` CPUs I have available, 2X slower than quants of corresponding size on the M2-Max CPU, and ridiculously slow on the M2-Max GPU.
But it's nice that you have made all this work!