Commit Graph

3377 Commits

Author SHA1 Message Date
Kawrakow
904fdbcfb7 iq2/3_k: tiny bit faster Metal dot products 2024-08-01 09:38:06 +02:00
Kawrakow
088a8360a1 iq3_k: slightly faster Metal dequantize kernel
PP-512 goes to 473 t/s up from 452 t/s.
2024-08-01 09:38:06 +02:00
Kawrakow
606f02ae89 iq3_k: Metal dot product
Quite slow: 43 t/s for a 7B model
2024-08-01 09:38:06 +02:00
Kawrakow
95a6820d79 iq2_k: Metal dot product finally works
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs,
or 63.3 t/s for q2_K_S.
2024-08-01 09:38:06 +02:00
Kawrakow
033299c9f9 iq3_k: Metal dequantize 2024-08-01 09:38:06 +02:00
Kawrakow
2927d4f841 iq3_k: NEON 2024-08-01 09:38:06 +02:00
Kawrakow
9c1eea6048 iq3_k: AVX2 iqk_mul_mat
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01 09:38:06 +02:00
Kawrakow
a9fa3b1563 iq3_k: AVX512 iqk_mul_mat
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
2024-08-01 09:38:06 +02:00
Kawrakow
a4371b7842 iq3_k: faster CUDA dot product
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
2024-08-01 09:38:06 +02:00
Kawrakow
81f15c0ba8 iq3_k: CUDA dot product
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
2024-08-01 09:38:06 +02:00
Kawrakow
fb4cff3458 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Kawrakow
7dcd64c9bd iq2_k: very slightly better CUDA dot product
169.2 t/s vs 167.8 t/s before.
2024-08-01 09:38:06 +02:00
Kawrakow
0c1d7383a5 iq2_k: better CUDA dot product
Almost on par with iq2_xs (168 t/s vs 172 t/s).
2024-08-01 09:38:06 +02:00
Kawrakow
f30bcc1e17 iq2_k: CUDA dot product finally works
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs
172 t/s for iq2_xs.
2024-08-01 09:38:06 +02:00
Kawrakow
53fdb30ca6 iq5_k: CUDA dot product finally works 2024-08-01 09:38:06 +02:00
Kawrakow
8654a425ae Factor out iqk CUDA dot products
I cannot possibly wait for a 5-minute nvcc compilation
each time I touch vecdotq.cuh.

Also, cmake was adding --options-file X.rsp to the nvcc
compile commands, which confuses clangd, so I have turned
that off.
2024-08-01 09:38:06 +02:00
Kawrakow
99456e2e94 iq5_k: CUDA dot product still not working 2024-08-01 09:38:06 +02:00
Kawrakow
b591023479 iq5_k: Metal
Performance is roughly on par with q5_0.
2024-08-01 09:38:06 +02:00
Kawrakow
0ab3f0ff86 iq5_k: NEON 2024-08-01 09:38:06 +02:00
Kawrakow
daf608e227 iq5_k: AVX512 2024-08-01 09:38:06 +02:00
Kawrakow
e9c3ebcbe9 iq5_k: AVX2 2024-08-01 09:38:06 +02:00
Kawrakow
e5cd93b4b7 iq5_k: Basics
Quantize/dequantize, CUDA dequantize
2024-08-01 09:38:06 +02:00
Kawrakow
ace8f921bb iq2_k: Metal. Dot product is wrong 2024-08-01 09:38:06 +02:00
Kawrakow
f7ab9a13df iq2_k: NEON 2024-08-01 09:38:06 +02:00
Kawrakow
cc8e351b68 iq2_k: slightly faster AVX512 2024-08-01 09:38:06 +02:00
Kawrakow
764d4675b8 iq2_k: simplify AVX512 2024-08-01 09:38:06 +02:00
Kawrakow
21319d6fca iq2_k: AVX2 2024-08-01 09:38:06 +02:00
Kawrakow
3f7dad3000 iq2_k: Basics
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00
Kawrakow
007d2a56b3 IQ4_K: SOTA 4-bit quantization (#6)
* iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
  products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.

* iq4_k: TG now works on CUDA

* iq4_k: AVX512 implementation

For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.

* iq4_k: AVX2 implementation

For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975WX.

* iq4_k: NEON implementation

For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

* iq4_k: Metal implementation

For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.

* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-28 12:11:59 +02:00
Kawrakow
8963f383c0 Simdify and multi-thread tanh (#4)
It seemed Gemma-2 performance is lower than expected for its size.
Looking at the architecture, I noticed that tanh is used in each layer,
and then at the end for softcapping the final output. ggml had tanh
set to be computed with a single thread. Combined with tanh(x) being a
pretty expensive operation, this resulted in a significant fraction
of the time being spent in the tanh operation.

After multi-threading ggml_vec_soft_max_f32 and simd-ifying the
tanh computation, I observe a 33% gain in prompt processing speed (!!!).
TG is of course memory bound, but despite this, we still get a
~2% boost at 4 threads (which gives max TG performance on my
Ryzen-7950X).

Simd-ifying:
We have
   tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1)
so we can just use Justine Tunney's SIMD exp implementation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 08:44:18 +02:00
Kawrakow
0ceeb11721 Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00
Kawrakow
afd9fd274e Offload Bitnet token embeddings to the GPU - the right way (#2)
OK, I should have checked how it was done for Gemma and done
the same for Bitnet. But better late than never.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26 12:57:23 +02:00
Kawrakow
a14a9426ec Offload Bitnet token embeddings to the GPU (#1)
* bitnet: put token embeddings on the GPU

* Update README with the new CUDA/Metal performance

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26 09:41:04 +02:00
Kawrakow
4673de8cbe iqk_mul_mat(NEON): adding forgotten fp16 matrix x vector implementation 2024-07-25 08:37:13 +02:00
Kawrakow
5626b09e4b Update README.md 2024-07-24 19:55:06 +02:00
Kawrakow
ddaae42194 Update README.md
Trying to avoid line breaks in table
2024-07-24 19:44:52 +02:00
Kawrakow
914b7ef460 Update README.md 2024-07-24 19:20:46 +02:00
Kawrakow
010466af1e Add copyright notices
Only on the files where I have contributed in a significant way,
or the files I wrote myself.
2024-07-24 20:11:42 +03:00
Kawrakow
e0b2dd511c Remove unused file 2024-07-24 19:33:19 +03:00
Kawrakow
6fd0a92cb0 Remove security 2024-07-24 19:25:21 +03:00
Kawrakow
28b4229295 Correct spelling in README 2024-07-24 19:22:43 +03:00
Kawrakow
b84d0c1744 Update README.md
Adding some more details
2024-07-24 17:38:37 +02:00
Kawrakow
de43999de5 Update README.md
Adding MoE and Bitnet performance tables
2024-07-24 16:49:00 +02:00
Kawrakow
cd77618324 Update README.md
I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so removed test column from the performance tables.
2024-07-24 11:18:50 +02:00
Kawrakow
4bb58ea8f8 Update README.md
Added performance comparison tables
2024-07-24 11:01:16 +02:00
Kawrakow
73b94e5c3f iqk_mul_mat(NEON): special case for n not divisible by 8
Else fp16 PP performance drops by nearly a factor of 2 compared to
what we had before.
2024-07-24 08:04:47 +02:00
Kawrakow
5992d2652b ggml: thread synchronization on Arm
For x86 slaren was generous enough to add _mm_pause() to the busy
spin wait loop in ggml_barrier(), but everything else just busy
spins, loading an atomic int on every iteration, thus forcing cache
sync between the cores. This results in a massive drop in performance
on my M2-Max laptop when using 8 threads. The closest approximation
to _mm_pause() on Arm seems to be
     __asm__ __volatile__("isb\n");
After adding this to the busy spin loop, performance for 8 threads
recovers back to expected levels.
2024-07-24 08:04:47 +02:00
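The busy-spin hint described in commit 5992d2652b can be sketched as follows. This is a minimal illustration, not ggml's `ggml_barrier()`: the helper names `cpu_relax` and `spin_until_equal` are hypothetical, and only the `_mm_pause()` call and the `isb` instruction come from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Tell the core we are busy-waiting, so it can back off briefly instead of
   re-reading the shared cache line at full speed. */
static inline void cpu_relax(void) {
#if defined(__x86_64__)
    _mm_pause();
#elif defined(__aarch64__)
    __asm__ __volatile__("isb\n");  /* closest Arm approximation, per the commit */
#else
    /* No hint on this target: fall back to a plain busy spin. */
#endif
}

/* Spin until *flag holds the expected value (hypothetical helper). */
static void spin_until_equal(atomic_int *flag, int expected) {
    assert(flag);
    while (atomic_load_explicit(flag, memory_order_acquire) != expected) {
        cpu_relax();
    }
}
```

A waiting thread would call `spin_until_equal(&counter, n_threads)` while the other threads increment `counter`; the hint changes only how aggressively the waiter polls, not the result.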
Kawrakow
005674cecc Fix "make it work for row sizes that are multiple of 4 on NEON" 2024-07-24 08:04:47 +02:00
Kawrakow
847588cc92 Update README.md 2024-07-23 18:05:05 +02:00
Kawrakow
97680f602c Update README.md 2024-07-23 12:23:06 +02:00