ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-08 23:40:10 +00:00

Author	SHA1	Message	Date
Kawrakow	904fdbcfb7	iq2/3_k: tiny bit faster Metal dot products	2024-08-01 09:38:06 +02:00
Kawrakow	088a8360a1	iq3_k: slightly faster Metal dequantize kernel PP-512 goes to 473 t/s up from 452 t/s.	2024-08-01 09:38:06 +02:00
Kawrakow	606f02ae89	iq3_k: Metal dot product Quite slow: 43 t/s for a 7B model	2024-08-01 09:38:06 +02:00
Kawrakow	95a6820d79	iq2_k: Metal dot product finally works It is slow: 45.4 t/s for 7B model vs 50 t/s for iq2_xs, or 63.3 t/s for q2_K_S.	2024-08-01 09:38:06 +02:00
Kawrakow	033299c9f9	iq3_k: Metal dequantize	2024-08-01 09:38:06 +02:00
Kawrakow	2927d4f841	iq3_k: NEON	2024-08-01 09:38:06 +02:00
Kawrakow	9c1eea6048	iq3_k: AVX2 iqk_mul_mat We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.	2024-08-01 09:38:06 +02:00
Kawrakow	a9fa3b1563	iq3_k: AVX512 iqk_mul_mat We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X for LLaMA-3.1-8B. In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp	2024-08-01 09:38:06 +02:00
Kawrakow	a4371b7842	iq3_k: faster CUDA dot product 138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.	2024-08-01 09:38:06 +02:00
Kawrakow	81f15c0ba8	iq3_k: CUDA dot product Slightly slower than iq3_s - 132 t/s vs 138 t/s for LLaMA-3.1-8B.	2024-08-01 09:38:06 +02:00
Kawrakow	fb4cff3458	iq3_k: Basics Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.	2024-08-01 09:38:06 +02:00
Kawrakow	7dcd64c9bd	iq2_k: very slightly better CUDA dot product 169.2 t/s vs 167.8 t/s before.	2024-08-01 09:38:06 +02:00
Kawrakow	0c1d7383a5	iq2_k: better CUDA dot product Almost on par with iq2_xs (168 t/s vs 172 t/s).	2024-08-01 09:38:06 +02:00
Kawrakow	f30bcc1e17	iq2_k: CUDA dot product finally works Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs 172 t/s for iq2_xs.	2024-08-01 09:38:06 +02:00
Kawrakow	53fdb30ca6	iq5_k: CUDA dot product finally works	2024-08-01 09:38:06 +02:00
Kawrakow	8654a425ae	Factor out iqk CUDA dot products I cannot possibly wait for a 5 minutes nvcc compilation each time I touch vecdotq.cuh. Also, cmake was adding --options-file X.rsp to the nvcc compile commands, which confuses clangd, so I have turned that off.	2024-08-01 09:38:06 +02:00
Kawrakow	99456e2e94	iq5_k: CUDA dot product still not working	2024-08-01 09:38:06 +02:00
Kawrakow	b591023479	iq5_k: Metal Performance is roughly on par with q5_0.	2024-08-01 09:38:06 +02:00
Kawrakow	0ab3f0ff86	iq5_k: NEON	2024-08-01 09:38:06 +02:00
Kawrakow	daf608e227	iq5_k: AVX512	2024-08-01 09:38:06 +02:00
Kawrakow	e9c3ebcbe9	iq5_k: AVX2	2024-08-01 09:38:06 +02:00
Kawrakow	e5cd93b4b7	iq5_k: Basics Quantize/dequantize, CUDA dequantize	2024-08-01 09:38:06 +02:00
Kawrakow	ace8f921bb	iq2_k: Metal. Dot product is wrong	2024-08-01 09:38:06 +02:00
Kawrakow	f7ab9a13df	iq2_k: NEON	2024-08-01 09:38:06 +02:00
Kawrakow	cc8e351b68	iq2_k: slightly faster AVX512	2024-08-01 09:38:06 +02:00
Kawrakow	764d4675b8	iq2_k: simplify AVX512	2024-08-01 09:38:06 +02:00
Kawrakow	21319d6fca	iq2_k: AVX2	2024-08-01 09:38:06 +02:00
Kawrakow	3f7dad3000	iq2_k: Basics Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.	2024-08-01 09:38:06 +02:00
Kawrakow	007d2a56b3	IQ4_K: SOTA 4-bit quantization (#6) * iq4_k: basics * quantize/dequantize works * CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16. * TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so need to sort out what he had in mind * iqk_mul_mat is not implemented. * iq4_k: TG now works on CUDA * iq4_k: AVX512 implementation For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S. * iq4_k: AVX2 implementation For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X. * iq4_k: NEON implementation For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower. * iq4_k: Metal implementation For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S. * iq4_k: scalar dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-28 12:11:59 +02:00
Kawrakow	8963f383c0	Simdify and multi-thread tanh (#4) It seemed Gemma-2 performance is lower than expected for its size. Looking at the architecture, I noticed that tanh is used in each layer, and then at the end for softcapping the final output. ggml had tanh set to be computed with a single thread. Combined with tanh(x) being a pretty expensive operation, this resulted in a significant fraction of the time being spent in the tanh operation. After multi-threading ggml_vec_soft_max_f32 and simd-ifying the tanh computation, I observe a 33% gain in prompt processing speed (!!!) TG is of course memory bound, but despite this, we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X). Simd-ifying: We have tanh(x) = (exp(2x) - 1)/(exp(2x) + 1) so we can just use Justine Tunney's SIMD exp implementation. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 08:44:18 +02:00
Kawrakow	0ceeb11721	Merge mainline llama.cpp (#3) * Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 07:55:01 +02:00
Kawrakow	afd9fd274e	Offload Bitnet token embeddings to the GPU - the right way (#2) OK, I should have checked how it was done for Gemma and do the same for Bitnet. But better late than never. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-26 12:57:23 +02:00
Kawrakow	a14a9426ec	Offload Bitnet token embeddings to the GPU (#1) * bitnet: put token embeddings on the GPU * Update README with the new CUDA/Metal performance --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-26 09:41:04 +02:00
Kawrakow	4673de8cbe	iqk_mul_mat(NEON): adding forgotten fp16 matrix x vector implementation	2024-07-25 08:37:13 +02:00
Kawrakow	5626b09e4b	Update README.md	2024-07-24 19:55:06 +02:00
Kawrakow	ddaae42194	Update README.md Trying to avoid line breaks in table	2024-07-24 19:44:52 +02:00
Kawrakow	914b7ef460	Update README.md	2024-07-24 19:20:46 +02:00
Kawrakow	010466af1e	Add copyright notices Only on the files where I have contributed in a significant way, or the files I wrote myself.	2024-07-24 20:11:42 +03:00
Kawrakow	e0b2dd511c	Remove unused file	2024-07-24 19:33:19 +03:00
Kawrakow	6fd0a92cb0	Remove security	2024-07-24 19:25:21 +03:00
Kawrakow	28b4229295	Correct spelling in README	2024-07-24 19:22:43 +03:00
Kawrakow	b84d0c1744	Update README.md Adding some more details	2024-07-24 17:38:37 +02:00
Kawrakow	de43999de5	Update README.md Adding MoE and Bitnet performance tables	2024-07-24 16:49:00 +02:00
Kawrakow	cd77618324	Update README.md I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so removed test column from the performance tables.	2024-07-24 11:18:50 +02:00
Kawrakow	4bb58ea8f8	Update README.md Added performance comparison tables	2024-07-24 11:01:16 +02:00
Kawrakow	73b94e5c3f	iqk_mul_mat(NEON): special case for n not divisible by 8 Else fp16 PP performance drops by nearly a factor of 2 compared to what we had before.	2024-07-24 08:04:47 +02:00
Kawrakow	5992d2652b	ggml: thread synchronization on Arm For x86 slaren was generous enough to add _mm_pause() to the busy spin wait loop in ggml_barrier(), but everything else just busy spins, loading an atomic int on every iteration, thus forcing cache sync between the cores. This results in a massive drop in performance on my M2-Max laptop when using 8 threads. The closest approximation to _mm_pause() on Arm seems to be __asm__ __volatile__("isb\n"); After adding this to the busy spin loop, performance for 8 threads recovers back to expected levels.	2024-07-24 08:04:47 +02:00
Kawrakow	005674cecc	Fix "make it work for row sizes that are multiple of 4 on NEON"	2024-07-24 08:04:47 +02:00
Kawrakow	847588cc92	Update README.md	2024-07-23 18:05:05 +02:00
Kawrakow	97680f602c	Update README.md	2024-07-23 12:23:06 +02:00

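Commit 5992d2652b in the log above describes adding a pause hint to a busy spin wait loop: `_mm_pause()` on x86 and `__asm__ __volatile__("isb\n");` on Arm. The sketch below shows the general pattern under stated assumptions. The names `cpu_relax` and `spin_until_equal` are hypothetical, and this is not the code of `ggml_barrier()`; only the two pause instructions come from the commit message.

```c
#include <stdatomic.h>

#if defined(__x86_64__)
#include <immintrin.h>
/* x86: the pause hint named in the commit message. */
static inline void cpu_relax(void) { _mm_pause(); }
#elif defined(__aarch64__)
/* Arm: the commit's closest approximation to _mm_pause(). */
static inline void cpu_relax(void) { __asm__ __volatile__("isb\n"); }
#else
/* Fallback for other targets (assumption): plain busy spin. */
static inline void cpu_relax(void) { }
#endif

/* Hypothetical helper: spin until *counter equals target, issuing a
 * pause hint on every iteration instead of reloading the atomic
 * back to back. */
static void spin_until_equal(atomic_int *counter, int target) {
    while (atomic_load_explicit(counter, memory_order_acquire) != target) {
        cpu_relax();
    }
}
```

The hint does not change the loop's semantics. It only slows the rate at which the waiting core polls the shared atomic, which is the effect the commit credits for the recovered 8-thread performance.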

3377 Commits