Commit Graph

3512 Commits

Author SHA1 Message Date
Iwan Kawrakow
4608f0cc6d iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
0ffc9b435c iq3_kt: CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9e5879b94 iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c59830dafb iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
8f0d075f5e iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dfcc8a9cf3 iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
386d139e13 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
f1fb59b44b iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive with, or slightly better than, other quants.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
435eb9bdd3 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
08503cec7d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
977f94b3e0 Forgotten change 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4774788136 Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
590f47278b Minor 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7bf6e158a9 iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7cafafc69e iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b354392c77 iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
aed3910dfa iq2_kt: very slightly faster CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
d2331b9287 iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b3dfe9984b iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
36e9c922b8 iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
766fa600c8 WIP - try larger blocks
With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster.
2024-11-21 08:16:41 +02:00
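The clustering trick described in the commit above can be sketched as follows. This is an illustrative Python model only: `build_clusters` and `nearest_point` are invented names, and the crude k-means here stands in for whatever clustering the actual code uses. The key point is the last step of `build_clusters`, which copies each cluster's members to contiguous memory so the inner search runs over a small, cache-friendly array.

```python
import numpy as np

def build_clusters(points, n_clusters, iters=10):
    """Crude k-means partition of the codebook points (illustrative)."""
    rng = np.random.default_rng(0)
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(n_clusters):
            sel = points[assign == c]
            if len(sel):
                centroids[c] = sel.mean(0)
    # copy each cluster's members to contiguous memory (the speedup trick)
    members = [points[assign == c].copy() for c in range(n_clusters)]
    return centroids, members

def nearest_point(x, centroids, members):
    """Find the nearest cluster, then search only inside that cluster."""
    order = ((centroids - x) ** 2).sum(-1).argsort()
    for c in order:                      # skip empty clusters if any
        if len(members[c]):
            cluster = members[c]
            i = ((cluster - x) ** 2).sum(-1).argmin()
            return cluster[i]
```

The search is approximate (the true nearest point may live in a neighboring cluster), which is the usual trade-off accepted for the large speedup.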
Iwan Kawrakow
86948f9c5d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
a961a48e88 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
426a6e685f iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
a4f1ac8da4 iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
2024-11-21 08:16:40 +02:00
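The weighting difference mentioned in the commit above can be written out as a minimal sketch: iq2_xxs measures quantization error with an importance weight of sigma^2/4 + x^2 per value, rather than a uniform weight of 1. The function name is invented, and the assumption here is that sigma^2 is the variance of the block being quantized.

```python
import numpy as np

def weighted_rmse(x, q):
    """Weighted rmse with w = sigma^2/4 + x^2 (illustrative sketch)."""
    w = x.var() / 4.0 + x * x          # sigma^2/4 + x^2
    err = (x - q) ** 2
    return np.sqrt((w * err).sum() / w.sum())
```

The x^2 term makes errors on large-magnitude weights count more, which is why switching both methods to the same weight changed the rmse comparison.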
Iwan Kawrakow
f1df1b7e15 Testing Trellis quantization: playing with scales and generators 2024-11-21 08:16:40 +02:00
Iwan Kawrakow
9ec145550d Testing Trellis quantization: 4-bit quantized block scales
rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.
2024-11-21 08:16:40 +02:00
Iwan Kawrakow
f21dd3fb15 Testing Trellis quantization
Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.
2024-11-21 08:16:40 +02:00
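For readers unfamiliar with the approach being tested, the core idea can be sketched as follows: each group of 8 weights is represented by a short code (12 bits in this commit), the dequantized values are produced by a deterministic generator seeded from that code, and quantization is a brute-force search over all codes for the one whose generated values best match the weights. This is a hedged stand-in: the real code uses a specific generator and AVX2 SIMD, whereas here numpy's PRNG serves as a placeholder and the function names are invented.

```python
import numpy as np

def decode(code, n=8):
    """Expand a small integer code into n deterministic values in [-1, 1)."""
    rng = np.random.default_rng(code)   # placeholder generator
    return rng.uniform(-1.0, 1.0, n)

def quantize_group(x, bits=12):
    """Brute-force search for the code minimizing squared error."""
    best_code, best_err = 0, np.inf
    for code in range(1 << bits):
        err = ((x - decode(code)) ** 2).sum()
        if err < best_err:
            best_code, best_err = code, err
    return best_code, best_err
```

At 12 bits per group of 8 this is 4096 candidates per group, which is why SIMDifying the search (and later, the clustering trick) matters so much for quantization time.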
Iwan Kawrakow
afe9db7143 WIP 2024-11-21 08:16:40 +02:00
Iwan Kawrakow
c578478911 WIP 2024-11-21 08:16:40 +02:00
Iwan Kawrakow
798f93ce40 WIP 2024-11-21 08:16:40 +02:00
Kawrakow
4d2fbde0cb MMQ for Q6_0 (#115)
* MMQ for Q6_0

* Add Q6_0 MMQ to template generator

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-11-21 07:12:11 +01:00
Kawrakow
52874c5d21 Faster MoE inference (#112)
* multi_add: WIP

* multi_add: CPU works

* multi_add: CUDA

* multi_add: simplify

* multi_add: Metal

* Metal: speed up mul_mat_id

For the Granite-1B MoE model PP-512 goes from
156 t/s to 890 t/s, so nearly a 6X speedup!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-31 12:05:27 +01:00
Kawrakow
5ad6439486 Use fused mul - unary op also for MoE models (#111)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 18:23:54 +02:00
Kawrakow
2e5f6db5de Bitnet: use the fused mul-silu in the FFN network (#110)
I had forgotten that build_bitnet() does not use the standard
llm_build_ffn function, so the fused mul-silu didn't get used
for Bitnet when I added it to llm_build_ffn.

This gives us another ~1% speedup for TG-128.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 17:40:32 +02:00
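For context, the fused op referenced in the commit above combines the SiLU gate activation with the elementwise multiply of a SwiGLU-style FFN into a single kernel, saving one pass over memory. A plain numpy sketch of the math only (the actual kernel lives in ggml, and the function name here is invented):

```python
import numpy as np

def fused_mul_silu(gate, up):
    """silu(gate) * up in one pass: silu(x) = x / (1 + exp(-x))."""
    return (gate / (1.0 + np.exp(-gate))) * up
```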
Kawrakow
bd309cb782 Bitnet CUDA improvements (#109)
* iq1_bn: improve CUDA TG

On RTX-3080 TG-128(Bitnet-1.58b-3B) goes from 318 t/s to 340 t/s.
I see I have on the front page 301 t/s, so pretty nice improvement
since then.

* iq2_bn(CUDA): quants are not 4-byte aligned

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 16:26:04 +02:00
Kawrakow
3805c84686 Improve Bitnet PP on Metal (#108)
iq1_bn goes from 702 t/s to 716 t/s
iq2_bn goes from 714 t/s to 743 t/s

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 15:13:45 +02:00
Kawrakow
f7b05a09dd Faster IQ1_BN Metal implementation (#107)
* iq1_bn: faster Metal dot product

82 t/s -> 87.9 t/s

* iq1_bn(Metal): 87.9 -> 89.0 t/s for TG-128

* iq1_bn(Metal): 89.0 -> 94.7 t/s for TG-128

So, total improvement is ~15%. Not bad.

* iq1_bn(Metal): 686 -> 702 t/s for PP-512

* iq2_bn(Metal): 710 -> 714 t/s for PP-512

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 10:59:59 +02:00
Iwan Kawrakow
19cc3329bf Remove forgotten IQ1_TN, IQ2_TN enum values 2024-10-25 14:14:56 +03:00
Kawrakow
6b968f3894 Bitnet changes (#106)
* Adapting iq2_bn to work without separate scale tensors

Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I think it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

Dequantize works, but there is still something wrong
with the dot products.

* WIP

Absolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Bitnet ternary models
(and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (#98)"

This reverts commit f2d315b46f.
As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25 13:08:43 +02:00
Kawrakow
9114078959 Fix quantized k-cache without FA (#105)
* Added Johannes' changes, still getting NaNs with quantized k-cache.

Also getting NaNs on Johannes's mainline branch.

* This fixes it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-24 12:20:30 +02:00
Kawrakow
b61cf7d0d7 Add support for Granite and GraniteMoE models (#102)
* Add Granite and GraniteMoE models

* Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-22 17:28:14 +02:00
Kawrakow
462c6cd7b1 Enable q6_0 for flash attention (#101)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-22 11:34:49 +02:00
Kawrakow
dbf951df15 Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99)
* Enable IQ4_NL for V-cache in token generation

* We don't need these

* Update printout of allowed quantized KV-cache combinations

* Add IQ4_NL + IQ4_NL to FA

This is a better alternative than Q4_0 + Q4_0 for the VRAM poor.

* Remove file added by mistake

* Fix typo, which is not really a bug

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-21 12:16:54 +02:00
agray3
f2d315b46f Avoid rebuild of GGML graph for each token (#98)
Introduces caching of GGML graph to avoid unnecessary full rebuild between each token.
KV cache parameters, which change with each token, are updated directly in cached GGML
graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.
2024-10-20 08:36:16 +02:00
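The caching scheme described in this commit can be sketched as follows. This is an illustrative Python model of the idea, not the actual C implementation in ggml: `build_fn` and the dict-based graph are made-up stand-ins, while the `GGML_DISABLE_GRAPH_CACHING` variable name comes from the commit message itself.

```python
import os

class GraphCache:
    """Toy model of GGML graph caching: build once, patch per-token params."""

    def __init__(self):
        self.graph = None
        # the commit message says caching can be disabled via this env var
        self.enabled = os.environ.get("GGML_DISABLE_GRAPH_CACHING") is None

    def get(self, build_fn, kv_pos):
        if self.graph is None or not self.enabled:
            self.graph = build_fn()       # full graph (re)build
        self.graph["kv_pos"] = kv_pos     # cheap in-place KV-cache update
        return self.graph
```

With caching enabled, `build_fn` runs once and every subsequent token only patches the KV-cache parameters in the cached graph.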
Kawrakow
afbf2ef3e2 Bitnet: make the scale tensors optional (#97)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-19 18:52:58 +02:00
Nexes the Elder
a077f09bcb Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96)
* attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S

Pattern worth testing on more quants and on L3 8B.
PPL 512 = -0.024 for 70b ; - 0.005 for 8b
Size = - 640MiB for 70b ; - 64MiB for 8b

70b Q5_K_S now beats Q5_K_M by -0.012 ppl

I suspect that it goes for L3 as well, which was quite insensitive to attn_q quantization.

* indent
2024-10-19 17:24:43 +02:00
Kawrakow
7b886ae3d8 Attempt to blindly fix Windows build failure (#93)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-19 11:43:04 +02:00
Nexes the Elder
03cabe1540 CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)
To complement token_embd.weight and output.weight:

attn_v.weight
attn_k.weight
attn_q.weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up
2024-10-18 09:48:15 +02:00
Kawrakow
76b97c8064 Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 15:18:26 +03:00