
🔀 #46 - IQ1_TN Metal implementation

| Field | Value |
| :--- | :--- |
| Author | `ikawrakow` |
| State | ❌ Closed |
| Created | 2024-09-10 |
| Updated | 2024-09-10 |

Description

`IQ1_TN` stores a scale at the beginning of each row, followed by `IQ1_BN` packing of the ternary quants. The existing Metal implementation has no provision for per-row data of this kind, so some changes were necessary (apart from the required additions to `ggml-metal.m`):

* We modify the `kernel_mul_mm` and `kernel_mul_mm_id_impl` templates to take a dequantizer type as a template parameter (instead of a dequantization function).
* We provide a default dequantizer that does what the existing implementation does. This is used for all existing quants.
* We add a dequantizer for `IQ1_TN`. It reads the scale from the first two bytes of a row, uses the existing `IQ1_BN` implementation to convert the ternary bits to `float4x4` or `half4x4`, and then multiplies the result by the row scale before returning it to the caller.
* We also add a dequantization kernel that takes a dequantizer as a template parameter (needed for `get_rows`).
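
The dequantizer-as-template-parameter idea can be sketched in plain C++ (not Metal). This is an illustration under stated assumptions, not the code from the PR: `DefaultDequantizer`, `RowScaleDequantizer`, `dequantize_row`, `dequantize_ternary`, and `make_scaled_row` are hypothetical names; the ternary values are stored one per byte instead of using the real `IQ1_BN` packing; and the row scale is a 4-byte `float` rather than the two-byte value described above, to stay within standard C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical packing for illustration: one int8_t per ternary value (-1, 0, +1).
// The real IQ1_BN packing is denser; only the control flow matters here.
inline void dequantize_ternary(const uint8_t* data, int i, float out[4]) {
    const int8_t* q = reinterpret_cast<const int8_t*>(data) + 4 * i;
    for (int k = 0; k < 4; ++k) out[k] = static_cast<float>(q[k]);
}

// Default dequantizer: wraps a plain dequantization function, so quants
// without per-row data behave exactly as they would with a function parameter.
template <void (*Func)(const uint8_t*, int, float*)>
struct DefaultDequantizer {
    explicit DefaultDequantizer(const uint8_t* row) : data(row) {}
    void operator()(int i, float out[4]) const { Func(data, i, out); }
    const uint8_t* data;
};

// Row-scale dequantizer: reads the scale stored at the start of the row once,
// then multiplies every dequantized block by it.
// Assumption: the scale is a 4-byte float here (the format above uses two bytes).
struct RowScaleDequantizer {
    explicit RowScaleDequantizer(const uint8_t* row) : data(row + sizeof(float)) {
        std::memcpy(&scale, row, sizeof(float));
    }
    void operator()(int i, float out[4]) const {
        dequantize_ternary(data, i, out);
        for (int k = 0; k < 4; ++k) out[k] *= scale;
    }
    const uint8_t* data;
    float scale;
};

// Generic routine parameterized by the dequantizer type; it stands in for the
// kernel templates, which construct the dequantizer per row and call it per block.
template <typename Dequantizer>
std::vector<float> dequantize_row(const uint8_t* row, int n) {
    assert(n % 4 == 0);  // blocks of 4 values in this sketch
    Dequantizer deq(row);
    std::vector<float> result(n);
    for (int i = 0; i < n / 4; ++i) deq(i, result.data() + 4 * i);
    return result;
}

// Builds a row in the hypothetical layout: [float scale][int8 ternary values...].
inline std::vector<uint8_t> make_scaled_row(float scale, const std::vector<int8_t>& q) {
    std::vector<uint8_t> row(sizeof(float) + q.size());
    std::memcpy(row.data(), &scale, sizeof(float));
    std::memcpy(row.data() + sizeof(float), q.data(), q.size());
    return row;
}
```

With this shape, `dequantize_row<RowScaleDequantizer>(row.data(), n)` handles rows that carry a leading scale, while `dequantize_row<DefaultDequantizer<dequantize_ternary>>(data, n)` reproduces the function-based behavior. A type can hold per-row state such as the scale, which a bare function parameter cannot.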

With this, the IQ1_TN implementation is complete for all supported platforms (Zen4, AVX2, ARM_NEON, CUDA, Metal).
