Commit Graph

3229 Commits

Author SHA1 Message Date
Kawrakow
b4ecd2dce6 iqk_mul_mat: fp16 implementation cleanup
It turns out that, on my Ryzen-7950X CPU, using
AVX512 is slower.
2024-06-22 12:02:50 +03:00
Kawrakow
e0b52e14a6 iqk_mul_mat: fp16 implementation for AVX2
This simple implementation beats jart's tinyBLAS by a
small margin (143 t/s vs 137 t/s for PP-512; TG is
4.75 t/s, so exactly the same as ggml).
2024-06-22 12:02:50 +03:00
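
For context, an AVX2 fp16 matmul kernel comes down to widening fp16 lanes to fp32 with F16C and accumulating with FMA. A minimal sketch of that pattern (illustrative only, not the actual iqk_mul_mat kernel; assumes F16C+FMA support and n divisible by 8):

```cpp
#include <immintrin.h>
#include <stdint.h>

// Minimal sketch of an fp16 dot product on AVX2 (requires F16C + FMA).
// Illustrates the convert-then-FMA pattern only; assumes n is a multiple of 8.
static float fp16_dot_avx2(const uint16_t * x, const uint16_t * y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        // Convert 8 fp16 values to 8 fp32 lanes.
        __m256 vx = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x + i)));
        __m256 vy = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(y + i)));
        acc = _mm256_fmadd_ps(vx, vy, acc);   // acc += vx * vy
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```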
Kawrakow
2328da1aa7 iqk_mul_mat: multi-thread quantization also for MoE models 2024-06-22 12:02:50 +03:00
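
Quantizing the activations row by row is embarrassingly parallel, so it can be spread over the worker threads; MoE models simply have more such rows to cover. A minimal sketch of the idea, with a toy int8 stand-in for the real quantize_row_* kernels (all names here are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

// Toy per-row int8 quantizer; stands in for ggml's quantize_row_* kernels.
static void quantize_row(const float * src, int8_t * dst, float * scale, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(src[i]));
    *scale = amax / 127.0f;
    const float id = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (int i = 0; i < n; ++i) dst[i] = (int8_t)std::lround(src[i] * id);
}

// Minimal sketch: spread row quantization evenly over nth threads.
static void quantize_rows_mt(const float * src, int8_t * dst, float * scales,
                             int nrows, int n_per_row, int nth) {
    std::vector<std::thread> workers;
    for (int t = 0; t < nth; ++t) {
        workers.emplace_back([=]() {
            const int first = (nrows *  t     ) / nth;
            const int last  = (nrows * (t + 1)) / nth;
            for (int r = first; r < last; ++r) {
                quantize_row(src + (size_t)r * n_per_row,
                             dst + (size_t)r * n_per_row, scales + r, n_per_row);
            }
        });
    }
    for (auto & w : workers) w.join();
}
```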
Kawrakow
ea239f8572 iqk_mul_mat: make it independent of sgemm 2024-06-22 12:02:50 +03:00
Kawrakow
5039ea8930 iqk_mul_mat: minor improvements
Current performance:
| model             |       size |  threads |    test |              t/s |
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ3_S    |   2.75 GiB |       16 |   pp512 |    100.21 ± 0.32 |
| llama 7B IQ3_XXS  |   2.41 GiB |       16 |   pp512 |    105.25 ± 0.75 |
| llama 7B IQ2_M    |   2.20 GiB |       16 |   pp512 |    117.88 ± 0.15 |
| llama 7B IQ2_XS   |   1.89 GiB |       16 |   pp512 |    136.38 ± 0.24 |
| llama 7B IQ2_XXS  |   1.73 GiB |       16 |   pp512 |    128.47 ± 0.39 |
                                                     mean: 117.64
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ2_XXS  |   1.73 GiB |        8 |   tg128 |     23.94 ± 0.04 |
| llama 7B IQ2_XS   |   1.89 GiB |        8 |   tg128 |     23.27 ± 0.03 |
| llama 7B IQ2_M    |   2.20 GiB |        8 |   tg128 |     18.88 ± 0.03 |
| llama 7B IQ3_XXS  |   2.41 GiB |        8 |   tg128 |     19.07 ± 0.04 |
| llama 7B IQ3_S    |   2.75 GiB |        8 |   tg128 |     15.44 ± 0.05 |
                                                     mean:  20.12
2024-06-22 12:02:50 +03:00
Kawrakow
e85753e1ad iqk_mul_mat: no more templates in the IQ dequantizers
Also moved the quant-specific code from the EvenSignHelper
into the corresponding dequantizers.

These two changes had a tiny performance benefit (much too small
compared to what I was expecting/hoping for).
2024-06-22 12:02:50 +03:00
Kawrakow
b8556267cd iqk_mul_mat: remove template on one of the prepare() functions 2024-06-22 12:02:49 +03:00
Kawrakow
44b1b4fb97 iqk_mul_mat: experimenting with zen4
Nope, we cannot have good performance for iq2_xxs and
iq3_xxs at the same time. If I don't force-inline
the sign functions, I get better performance for iq2_xxs
and bad performance for iq3_xxs. If I force-inline them,
it is the other way around. Anyway, this is what we have
now on Zen4 for all quants with force-inlined EvenSignHelper
methods:

| model            |       size | threads |   test |           t/s |
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |      16 |  pp512 | 100.91 ± 0.26 |
| llama 7B IQ3_XXS |   2.41 GiB |      16 |  pp512 | 106.08 ± 0.78 |
| llama 7B IQ2_M   |   2.20 GiB |      16 |  pp512 | 116.41 ± 0.25 |
| llama 7B IQ2_XS  |   1.89 GiB |      16 |  pp512 | 132.54 ± 1.07 |
| llama 7B IQ2_XXS |   1.73 GiB |      16 |  pp512 | 125.53 ± 0.06 |
                                    arithmetic mean: 116.29
                                    geometric  mean: 115.70
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |       8 |  tg128 |  15.69 ± 0.04 |
| llama 7B IQ3_XXS |   2.41 GiB |       8 |  tg128 |  18.02 ± 0.04 |
| llama 7B IQ2_M   |   2.20 GiB |       8 |  tg128 |  18.94 ± 0.03 |
| llama 7B IQ2_XS  |   1.89 GiB |       8 |  tg128 |  23.29 ± 0.02 |
| llama 7B IQ2_XXS |   1.73 GiB |       8 |  tg128 |  22.96 ± 0.09 |
                                    arithmetic mean:  19.78
                                    geometric  mean:  19.56

Without force-inlining, PP(iq3_xxs) drops to 98 t/s while
PP(iq2_xxs) increases to 137 t/s.
2024-06-22 12:02:49 +03:00
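
The force-inlining discussed above is normally done with a compiler attribute rather than the plain inline keyword, since the compiler is otherwise free to keep the helper as a call. A minimal sketch of the pattern (macro and function names are illustrative, not the actual source):

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#  define ALWAYS_INLINE __forceinline
#else
#  define ALWAYS_INLINE inline __attribute__((always_inline))
#endif

// Force-inlined scalar stand-in for the SIMD sign helper: flip the sign of
// value i when bit i of `signs` is set. Whether the compiler inlines such
// helpers is exactly the knob being traded off above.
ALWAYS_INLINE void apply_signs(float * values, uint32_t signs, int n) {
    for (int i = 0; i < n; ++i) {
        if (signs & (1u << i)) values[i] = -values[i];
    }
}
```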
Kawrakow
eb9e2b628a iqk_mul_mat: experimenting with zen4 (iq2_xxs)
Observing again the weirdness of a performance drop
in one quant because of a change in another quant.
After I added FANCY_SIMD implementations for
iq3_s, iq2_s and iq2_xs, I'm observing that
iq2_xxs PP performance dropped to 130 t/s from 139 t/s.
Adding a FANCY_SIMD implementation for applying the signs
brings it back to 137 t/s and gives a small boost
for TG as well (23.4 vs 23.0 t/s).
2024-06-22 12:02:49 +03:00
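
"FANCY_SIMD" here roughly means the AVX512VL/AVX512BW instructions available on Zen4; applying packed signs can then be a single masked subtract instead of building a separate ±1 multiplier vector. A minimal sketch of that idea (illustrative, not the actual iq2_xxs kernel):

```cpp
#include <immintrin.h>
#include <stdint.h>

// Negate the bytes of v whose corresponding bit in `signs` is set, using an
// AVX-512 (VL+BW) masked subtract: result = signs ? 0 - v : v.
// On plain AVX2 the same effect needs an extra +/-1 vector and a multiply.
static inline __m256i apply_signs_epi8(__m256i v, uint32_t signs) {
    return _mm256_mask_sub_epi8(v, (__mmask32)signs, _mm256_setzero_si256(), v);
}
```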
Kawrakow
2c8d3dad1f iqk_mul_mat: experimenting with zen4 (iq2_xs) 2024-06-22 12:02:49 +03:00
Kawrakow
0d9027fe74 iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m) 2024-06-22 12:02:49 +03:00
Kawrakow
ed8f1fe490 iqk_mul_mat: small improvement for iq3_s
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @  4 threads
         14.4 t/s @  8 threads
         16.3 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
01d55dcbf0 iqk_mul_mat: better AVX2 implementation for iq2_xxs
From here on switching to GCC 12.

PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @  4 threads
          23.0 t/s @  8 threads
          25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
d4e9e595f9 iqk_mul_mat: better AVX2 implementation for iq2_xxs
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
2024-06-22 12:02:49 +03:00
Kawrakow
41391ff4b0 iqk_mul_mat: AVX2 implementation for iq2_xxs
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Kawrakow
be132341f5 iqk_mul_mat: AVX2 implementation for iq2_xs
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
3c448906bf iqk_mul_mat: AVX2 implementation for iq2_s
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
f31200bde1 Separate templates for TG and PP for i-quants on AVX2 2024-06-22 12:02:49 +03:00
Kawrakow
3f90520d1f iqk_mul_mat: AVX2 implementation for iq3_xxs
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Kawrakow
24ccf42a4f iqk_mul_mat: AVX2 implementation for iq3_s
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
Kawrakow
32f20a1b9b Cleanup - Arm i-quants should be good now
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Kawrakow
7235135c3e iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Kawrakow
482dd30382 Simplify 2024-06-22 12:02:49 +03:00
Kawrakow
6aa7ac9cd3 iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
d041c81b1d iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
3fe4e1b27c iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Kawrakow
4c0920cb1b Add Q8_0 2024-06-22 12:02:49 +03:00
Kawrakow
62122c1950 Cosmetics 2024-06-22 12:02:49 +03:00
Kawrakow
fb8bc26dc5 iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
We get a ~5% speedup for TG-128 and 3X for PP-512.
2024-06-22 12:02:49 +03:00
Kawrakow
a18a564e54 iqk_mul_mat: faster q3_K TG
We get 31 t/s, up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
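
The PP/TG split boils down to the number of right-hand-side columns: prompt processing multiplies against many activation columns at once, while token generation is effectively a matrix-vector product. A minimal sketch of dispatching on that at compile time (names and structure are illustrative, not the actual iqk_mul_mat templates):

```cpp
// Illustrative only: a kernel templated on the number of RHS columns, so the
// nrc_y == 1 (TG) case can be specialized without hurting the PP path.
template <int nrc_y>
static void mul_mat_qX_K(int n, const void * vx, const float * vy, float * dst) {
    if constexpr (nrc_y == 1) {
        // token generation: single activation column, keep everything in registers
    } else {
        // prompt processing: nrc_y columns, trade registers for data reuse
    }
    (void)n; (void)vx; (void)vy; (void)dst;
}

// Runtime entry point picks the specialization once per call.
static void mul_mat_dispatch(int n, const void * vx, const float * vy, float * dst, int nrc_y) {
    switch (nrc_y) {
        case 1:  mul_mat_qX_K<1>(n, vx, vy, dst); break;
        case 8:  mul_mat_qX_K<8>(n, vx, vy, dst); break;
        default: /* fall back to a generic path */ break;
    }
}
```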
Kawrakow
d434b4751a iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00
Clint Herron
9fa7946997 JSON Schema to GBNF integration tests (#7790)
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in #7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.
2024-06-21 23:18:36 -04:00
k.h.lai
d34e2e8860 vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)
* vulkan: detect multiple devices by deviceUUID instead of deviceID

* vulkan: remove unneeded variables

* vulkan: fix id query
2024-06-21 10:28:20 +02:00
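
VkPhysicalDeviceProperties::deviceID can be identical for two GPUs of the same model, whereas the 16-byte deviceUUID from VkPhysicalDeviceIDProperties is unique per physical device. A minimal sketch of that query (device enumeration and error handling omitted; not the exact code from the PR):

```cpp
#include <vulkan/vulkan.h>
#include <array>

// Return the 16-byte deviceUUID of a physical device. Unlike deviceID, the
// UUID differs between two otherwise identical GPUs, so it can be used to
// de-duplicate devices. Requires Vulkan 1.1.
static std::array<uint8_t, VK_UUID_SIZE> get_device_uuid(VkPhysicalDevice dev) {
    VkPhysicalDeviceIDProperties id_props = {};
    id_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &id_props;

    vkGetPhysicalDeviceProperties2(dev, &props2);

    std::array<uint8_t, VK_UUID_SIZE> uuid{};
    for (size_t i = 0; i < VK_UUID_SIZE; ++i) uuid[i] = id_props.deviceUUID[i];
    return uuid;
}
```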
Eve
7ccc0cb46d ggml : AVX IQ quants (#7845)
* initial iq4_xs

* fix ci

* iq4_nl

* iq1_m

* iq1_s

* iq2_xxs

* iq3_xxs

* iq2_s

* iq2_xs

* iq3_s before sllv

* iq3_s

* iq3_s small fix

* iq3_s sllv can be safely replaced with sse multiply
2024-06-21 08:57:36 +03:00
Georgi Gerganov
46e0320612 llama : optimize long word tokenization with WPM (#8034)
ggml-ci
2024-06-21 08:51:28 +03:00
Douglas Hanley
a895a1b78e llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow specifying attention_type; add last token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling
2024-06-21 08:38:22 +03:00
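
Pooling collapses the per-token embeddings into a single vector per sequence: mean pooling averages over tokens, last-token pooling keeps only the final row. A minimal CPU-side sketch of both, assuming a row-major [n_tokens x n_embd] buffer (the actual change builds ggml graph ops instead):

```cpp
#include <vector>
#include <cstddef>

// embd is row-major: n_tokens rows of n_embd floats, n_tokens > 0.
static std::vector<float> pool_mean(const float * embd, int n_tokens, int n_embd) {
    std::vector<float> out(n_embd, 0.0f);
    for (int t = 0; t < n_tokens; ++t) {
        for (int i = 0; i < n_embd; ++i) out[i] += embd[(size_t)t * n_embd + i];
    }
    for (int i = 0; i < n_embd; ++i) out[i] /= n_tokens;
    return out;
}

static std::vector<float> pool_last(const float * embd, int n_tokens, int n_embd) {
    const float * last = embd + (size_t)(n_tokens - 1) * n_embd;
    return std::vector<float>(last, last + n_embd);
}
```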
Shuichi Tsutsumi
7ab016f973 swiftui : enable stream updating (#7754) 2024-06-21 08:30:58 +03:00
Hamdoud Hakem
4fb22fa139 requirements : Bump torch and numpy for python3.12 (#8041) 2024-06-20 22:01:15 +02:00
Hamdoud Hakem
e767e20fc6 convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040) 2024-06-20 21:59:59 +02:00
Johannes Gäßler
5b4e0a2a38 common: fix warning (#8036)
* common: fix warning

* Update common/common.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-20 16:40:13 +02:00
luoyu-intel
20a2d77aa2 [SYCL] Fix windows build and inference (#8003)
* add sycl preset

* fix debug link error. fix windows crash

* update README
2024-06-20 21:19:05 +08:00
Johannes Gäßler
24dfdbb1a3 CUDA: stream-k decomposition for MMQ (#8018)
* CUDA: stream-k decomposition for MMQ

* fix undefined memory reads for small matrices
2024-06-20 14:39:21 +02:00
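
Stream-k splits the flattened tile-by-k-iteration space evenly across the SMs instead of handing out whole output tiles, so the final wave no longer leaves most of the GPU idle; slices that end mid-tile produce partial sums that a fix-up pass reduces. A minimal host-side sketch of the partitioning arithmetic only (the MMQ kernels themselves are far more involved):

```cpp
#include <cstdio>

// Sketch of stream-k work partitioning: the unit of work is one k-iteration of
// one output tile. Each of the n_sm workers gets a contiguous slice of the
// flattened (tile, k_iter) space; slices that end mid-tile produce partial
// results that a fix-up pass must reduce.
int main() {
    const int n_tiles = 10;   // output tiles of the GEMM
    const int k_iters = 16;   // k-loop iterations per tile
    const int n_sm    = 4;    // workers (SMs / thread blocks)

    const long total = (long)n_tiles * k_iters;
    for (int s = 0; s < n_sm; ++s) {
        const long begin = total *  s      / n_sm;
        const long end   = total * (s + 1) / n_sm;
        printf("worker %d: tile %ld (from k-iter %ld) .. tile %ld (up to k-iter %ld)\n",
               s, begin / k_iters, begin % k_iters,
               (end - 1) / k_iters, (end - 1) % k_iters + 1);
    }
    return 0;
}
```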
Michael de Gans
4f46967577 metal : fix ggml_metal_supports_op for BF16 (#8021)
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks if the first few source types are BF16 and returns false if that's the case.
2024-06-20 08:32:01 +03:00
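
The guard amounts to rejecting an op whose inputs are BF16 before it reaches a Metal kernel with no BF16 variant. A minimal sketch of such a check, assuming ggml's public tensor layout (op->src[] and type); the real logic lives inside ggml_metal_supports_op:

```cpp
#include "ggml.h"

// Sketch of the guard: report an op as unsupported if any of its inputs is
// BF16, since the Metal kernels here have no BF16 variants.
static bool op_inputs_are_bf16_free(const ggml_tensor * op) {
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        if (op->src[i] && op->src[i]->type == GGML_TYPE_BF16) {
            return false;
        }
    }
    return true;
}
```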
sasha0552
c7d9dd7634 server : fix smart slot selection (#8020) 2024-06-20 09:57:10 +10:00
Michael de Gans
9d63d2b978 un-ignore build-info.cmake and build-info.sh (#7996)
* un-ignore `build-info.cmake` and `build-info.sh`

I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent, even if they're committed, for the purpose of publishing. This leads to the build failing in such cases.

* un-ignore `build-info.cpp.in`

For the same reason as the previous two files.

* Reorganize `.gitignore`

* Add exceptions for files mentioned by @slaren

I did leave .clang-tidy since it was explicitly ignored before.

* Add comments for organization
* Sort some lines for prettiness
* Test with `make` and `cmake` builds to ensure no build artifacts might be committed

* Remove `.clang-tidy` from `.gitignore`

Per comment by @ggerganov

* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
2024-06-19 22:10:42 +02:00
slaren
028d6b31c6 ggml : synchronize threads using barriers (#7993) 2024-06-19 15:04:15 +02:00
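
A barrier makes every compute thread wait at the same point before moving on to the next graph node, instead of each thread spinning on its own per-node state. A minimal sketch of a counter-plus-generation spin barrier of the kind used for CPU thread pools (illustrative, not the actual ggml code):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Counter + generation spin barrier: the last thread to arrive bumps the
// generation, releasing everyone else.
struct spin_barrier {
    explicit spin_barrier(int n) : n_threads(n) {}

    void wait() {
        const int gen = generation.load(std::memory_order_relaxed);
        if (arrived.fetch_add(1, std::memory_order_acq_rel) + 1 == n_threads) {
            arrived.store(0, std::memory_order_relaxed);
            generation.fetch_add(1, std::memory_order_release); // release the others
        } else {
            while (generation.load(std::memory_order_acquire) == gen) {
                std::this_thread::yield();
            }
        }
    }

    const int n_threads;
    std::atomic<int> arrived{0};
    std::atomic<int> generation{0};
};

int main() {
    spin_barrier barrier(4);
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&, t]() {
            // ... work for the current graph node ...
            barrier.wait();          // no thread starts the next node early
            std::printf("thread %d moved to the next node\n", t);
        });
    }
    for (auto & th : threads) th.join();
    return 0;
}
```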
Georgi Gerganov
efc3d09e43 codecov : remove (#8004) 2024-06-19 13:04:36 +03:00
Meng, Hengyu
ce37982f07 [SYCL] refactor (#6408)
* separate lower-precision GEMM from the main files

* fix workgroup size hardcode
2024-06-19 09:11:51 +08:00
jaime-m-p
b8114be2fd tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
2024-06-18 18:40:52 +02:00
Sigbjørn Skjæret
083d5edc87 Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists

* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00