* iq4_k: basics
* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16.
In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.
* iq4_k: TG now works on CUDA
* iq4_k: AVX512 implementation
For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.
* iq4_k: AVX2 implementation
For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.
* iq4_k: NEON implementation
For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.
* iq4_k: Metal implementation
For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.
* iq4_k: scalar dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
It seemed Gemma-2 performance is lower than expected for its size.
Looking at the architecture, I noticed that tanh is used in each layer,
and then at the end for softcaping the final output. ggml had tanh
set to be computed with a single thread. Combined with tanh(x) being a
pretty expensive operation, this resulted in a significant fraction
of the time being spent in the tanh operation.
After multi-threading ggml_vec_soft_max_f32 and simd-ifying the
tanh computation, I observe a 33% gain in prompt processing speed (!!!)
TG is of course memory bound, but despite this, we still get a
~2% boost at 4 threads (which gives max TG performance on my
Ryzen-7950X).
Simd-ifying:
We have
tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1)
so we can just use Justine Tunney's SIMD exp implementation.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
OK, I should have checked how it was done for Gemma and do
the same for Bitnet. But better late than never.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* bitnet: put token embeddings on the GPU
* Update README with the new CUDA/Meat performance
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so removed test column from the performance tables.
For x86 slaren was genereous enough to add _mm_pause() to the busy
spin wait loop in ggml_barrier(), but everything else just busy
spins, loading an atomic int on every iteration, thus forcing cache
sync between the cores. This results in a massive drop in performance
on my M2-Max laptop when using 8 threads. The closest approximation
to _mm_pause() on Arm seems to be
__asm__ __volatile__("isb\n");
After adding this to the busy spin loop, performance for 8 threads
recovers back to expected levels.
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications.
Before this PR, this meant n_head calls to iqk_mul_mat to perform
n_kv_embed x n_token 2D multiplications, each using nth threads.
Instead, in this PR, if n_head is a multiple of nth, each thread
does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices.
This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from
409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B,
we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from
139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
I was trying to understand where the Bitnet bottleneck is, and at
some point noticed the Q*K matrixt multiplication where Q and K
have the shape of 100 x n_token x 32 x 1. The existing iqk_mul_mat for
floats rerquiers that the row size is a multiple of the SIMD vector size
(so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this
matrix multiiplication was getting done with ggml. Changing the iqk_mul_mat
float kernel to handle row sizes that are a multiple of 4 (via __m128
for the last values in a row) resulted in nearly a 20% performance boost
for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance
increases by nearly 70%!
At the end of the day, lookup is still better when not using simd.
This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X
with 16 threads (up from 10.5 t/s).
Instead of shuffling quant data into a 128-bit register containing
8-bit ints, and then converting to 16 bit, we directly shuffle into
a 256-bit register containing 16 bit ints.
TG-128 @ 2 threads goes from 18.3 to 21.6 t/s.
TG-128 performance now saturates already at 8 threads getting 60.4 t/s.
There is almost no impact on PP-512 (322 -> 323 t/s). I guess,
we amortize dequantization cost pretty well, so we don't gain much
there.
We get close to 100 GB/s single-threaded float32 throuput:
./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
iq1_bn
vec_dot_q
4096 values (0.02 MB)
min cycles/32 vals : 3.87
avg cycles/32 vals : 4.40
float32 throughput : 98.27 GB/s
quantized throughput : 4.99 GB/s
We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16'th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we are dequantizing 8 quants there, and we have
different encoding for the 1st and 2nd group of 8 in a group of 16).
Ths achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:
| model | size | params | backend | threads | test | t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | pp512 | 9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | tg128 | 229.85 ± 0.33 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | pp512 | 322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | tg128 | 59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 8 | tg128 | 57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 4 | tg128 | 33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 2 | tg128 | 18.30 ± 0.01 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | pp512 | 698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | tg128 | 68.88 ± 0.24 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | pp512 | 196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | tg128 | 51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 4 | tg128 | 30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 2 | tg128 | 16.89 ± 0.01 |
It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
In summary, compared to lookup, the multiplication based approach is
* Much better on AVX2
* Slightly better on CUDA
* Slightly worse on Metal
* Much worse on NEON
For some models the same tensor is used for token embeddings and
output. This tensor tends to be named token_embedding.weight rather
than output.weight, which prevernts us from collecting imatrix data
for this tensor. With this commit we can tell the name of the
output tensor to the imatrix tool.
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.
With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.