mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-01-26 09:09:50 +00:00
🔀 #125 - R4 improvements on ARM_NEON
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-08 |
| Updated | 2024-12-08 |
Description
This PR accomplishes two things:
- Reduces bloat by using a template for the `ARM_NEON` matrix multiplication implementation of the interleaved-rows quants `Q4_0_R4`, `Q5_0_R4`, `Q6_0_R4`, `IQ4_NL_X4`, `IQ4_XS_R4`, `Q8_0_R4` (and I should do the same for `AVX2`/`Zen4`)
- Achieves a ~7% PP speedup for all `R4` quants except `IQ4_XS_R4`. With this
  - `Q4_0_R4` now outperforms the hand-written assembly in mainline `llama.cpp` by a small margin (125 t/s vs 123 t/s)
  - `Q8_0_R4` becomes the fastest type for prompt processing on `ARM_NEON` (PP-512 = 128 t/s for LLaMA-3.1-8B on M2-Max)
  - All `R4` quants achieve PP-512 > 100 t/s for LLaMA-3.1-8B on M2-Max
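
To illustrate the first point, here is a minimal sketch of how one templated kernel can serve several interleaved-row formats. It is not the code from this PR: `ToyQ8R4`, `ToyQ4R4` and `dot_r4` are hypothetical names, the byte layouts are invented toy formats rather than the real `ggml` block layouts, there are no per-block scales, and the loop is scalar C++ rather than `ARM_NEON` intrinsics. The only thing it is meant to show is that the format-specific part shrinks to a small unpacking helper while the accumulation loop is written once.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy formats for illustration only (NOT the real ggml layouts).
// Each format knows how to unpack the weights of 4 interleaved rows
// for a single column into 4 integers.

// Toy 8-bit format: one signed byte per row, 4 rows stored back to back.
struct ToyQ8R4 {
    static constexpr std::size_t bytes_per_column = 4;
    static std::array<int, 4> unpack(const std::uint8_t* p) {
        return { static_cast<std::int8_t>(p[0]), static_cast<std::int8_t>(p[1]),
                 static_cast<std::int8_t>(p[2]), static_cast<std::int8_t>(p[3]) };
    }
};

// Toy 4-bit format: two rows per byte (low nibble first), stored with a +8 offset.
struct ToyQ4R4 {
    static constexpr std::size_t bytes_per_column = 2;
    static std::array<int, 4> unpack(const std::uint8_t* p) {
        return { (p[0] & 0xF) - 8, (p[0] >> 4) - 8,
                 (p[1] & 0xF) - 8, (p[1] >> 4) - 8 };
    }
};

// The shared kernel: dot products of 4 interleaved rows with one activation
// vector y. Only Dequantizer::unpack differs between formats.
template <typename Dequantizer>
std::array<int, 4> dot_r4(const std::vector<std::uint8_t>& rows,
                          const std::vector<std::int8_t>& y) {
    assert(rows.size() == y.size() * Dequantizer::bytes_per_column);
    std::array<int, 4> acc{0, 0, 0, 0};
    for (std::size_t j = 0; j < y.size(); ++j) {
        const auto w = Dequantizer::unpack(rows.data() + j * Dequantizer::bytes_per_column);
        for (int r = 0; r < 4; ++r) acc[r] += w[r] * y[j];
    }
    return acc;
}
```

Under these assumptions, supporting another format means adding one more small struct with an `unpack` function; `dot_r4<ToyQ8R4>` and `dot_r4<ToyQ4R4>` are instantiated from the same loop and give identical results when the packed bytes encode the same weights.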