mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-01-26 17:20:01 +00:00
1.3 KiB
1.3 KiB
🔀 #46 - IQ1_TN Metal implementation
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-09-10 |
| Updated | 2024-09-10 |
Description
IQ1_BN stores a scale at the beginning of each row, followed by IQ1_BN packing of the ternary quants. The existing Metal implementation does not allow for that sort of thing, so some changes were necessary (apart from adding the necessary additions in ggml-metal.m):
- We modify the
kernel_mul_mmandkernel_mul_mm_id_impltemplates to have a dequantizer type as a template parameter (instead of a dequantization function) - We provide a default dequantizer that does what the existing implementation does. This is used for all existing quants
- We add a dequantizer for
IQ1_BN. It simply gets the scale from the first two bytes of a row, uses the existingIQ1_BNimplementation to convert the ternary bits tofloat4x4orhalf4x4, and then multiplies the result with the row scale before returning it to the caller. - We also add a dequantization kernel that takes a dequantizer as a template parameter (heeded for
get_rows)
With this, the IQ1_TN implementation is complete for all supported platforms (Zen4, AVX2, ARM_NEON, CUDA, Metal).