ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-09 16:00:12 +00:00

Author	SHA1	Message	Date
Daniel Bevenius	787dbf4d0f	llava : update surgery script to not remove tensors (#5536 ) This commit updates the surgery script to not remove the tensors from the model file. For this to work the `--skip-unknown` flag is added as an argument to the convert.py script in README.md. The motivation for this change is that the surgery script currently removes the projector tensors from the model file. If the model was checked out from a repository, the model file will have been updated and have to be checked out again to reset this effect. If this can be avoided I think it would be preferable. I did not perform this change for BakLLaVA models as I am not sure how that part works.	2024-02-18 18:19:23 +02:00
Kawrakow	fa40433c9d	1.5 bit quantization (#5453 ) * iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-18 18:16:55 +02:00
github-actions[bot]	b02abf3383	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07) → 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)	2024-02-18 06:39:58 -08:00
Georgi Gerganov	a5c73a0e9d	ggml : add ALiBi support for ggml_soft_max_ext (#5488 ) * ggml : avoid recomputing alibi slopes (CPU) * llama : reuse hparams.f_max_alibi_bias in all cases ggml-ci * ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal) ggml-ci * ggml : handle all SRCs (do not break on first null) ggml-ci * tests : do not use slope for large soft_max accumulates too much error ggml-ci * ggml : alternative ALiBi without extra tensor We compute the slopes in the kernel ggml-ci * cuda : add ALiBi support in ggml_soft_max_ext ggml-ci * ggml : deprecate ggml_alibi * ggml : support multi-sequence ALiBi (Metal) ggml-ci * cuda : add multi-seq ALiBi + remote F16 soft_max ggml-ci * ggml : update deprecation message * ggml : fix pos ptr when no ALiBi ggml-ci * cuda : fix performance (pow -> powf) * cuda : precompute ALiBi constants * metal : pre-compute ALiBi slopes ggml-ci * llama : init kq_pos only if needed ggml-ci * test-backend-ops : add null pos test to soft_max test-backend-ops : replace soft_max tests ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-17 23:04:16 +02:00
Ananta Bastola	27a984488c	ci : add an option to fail on compile warning (#3952 ) * feat(ci): add an option to fail on compile warning * Update CMakeLists.txt * minor : fix compile warnings ggml-ci * ggml : fix unreachable code warnings ggml-ci * ci : disable fatal warnings for windows, ios and tvos * ggml : fix strncpy warning * ci : disable fatal warnings for MPI build * ci : add fatal warnings to ggml-ci ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-17 23:03:14 +02:00
clibdev	5c46b0e0bc	gitignore : update for CLion IDE (#5544 )	2024-02-17 18:28:37 +02:00
Georgi Gerganov	8f20d4fb3e	cmake : fix VULKAN and ROCm builds (#5525 ) * cmake : fix VULKAN and ROCm builds * cmake : fix (cont) * vulkan : fix compile warnings ggml-ci * cmake : fix ggml-ci * cmake : minor ggml-ci	2024-02-16 19:05:56 +02:00
Georgi Gerganov	b6ebf5b6ff	scripts : add helpers script for bench comparing commits (#5521 ) * scripts : add helpers script for bench comparing commits * scripts : detect CUDA * set flags after checking the command line * fix make flags --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-16 15:14:40 +02:00
Herman Semenov	26bfe98833	llava : removed excess free(NULL) operation (#5531 )	2024-02-16 14:43:23 +02:00
Herman Semenov	f54ed222bf	llama : minor fixed return int value (#5529 )	2024-02-16 13:45:48 +02:00
Alexey Parfenov	14c96c2c4c	server : add "samplers" param to control the samplers order (#5494 )	2024-02-16 13:33:25 +02:00
Rőczey Barnabás	f183700c25	server : fix system prompt cli (#5516 )	2024-02-16 12:00:56 +02:00
bmwl	4bc5d852a2	ggml : add numa options (#5377 ) * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverted Makefile * Fixed include * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables * removed trailing whitespace * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverting Makefile * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet * Removing MIRROR_MODE code for this PR * Removing last bit of MIRROR_MODE code for this PR * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static * Fixed lingering init_llama_backend() bool calls in tests and examples * Remote enum llama_numa_strategies * Revert bad merge with dynatemp flags * add missing enum ggml_numa_strategies declaration and revert sync problem with master * add missing enum ggml_numa_strategies declaration * fixed ggml_init_numa variable * Update ggml.h Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges * split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples * Fix up some boolean vs enum comparisons * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype * Update ggml.h Align enum values Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c Remove whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c align paremeters Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/common.cpp Remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example * Update ggml.c simplified return for platforms without NUMA support Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * removed redundant else from cli argument processing of --numa * whitespace --------- Co-authored-by: root <root@nenya.lothlorien.ca> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-16 11:31:07 +02:00
Daniel Bevenius	384e6bbbe7	llava : fix clip-model-is-vision flag in README.md (#5509 ) * llava: fix clip-model-is-vision flag in README.md This commit fixes the flag `--clip_model_is_vision` in README.md which is does not match the actual flag: ```console $ python convert-image-encoder-to-gguf.py --help ... --clip-model-is-vision The clip model is a pure vision model (ShareGPT4V vision extract for example) ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llava: update link to vit config in README.md Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-16 11:24:39 +02:00
Georgi Gerganov	adfe875b17	ci : fix BERT model download and convert	2024-02-16 09:57:55 +02:00
Douglas Hanley	9bc28075c1	Use correct type of pooling for embedding models (#5500 ) Use correct type of pooling for embedding models	2024-02-15 12:21:49 -05:00
Georgi Gerganov	fc7f903883	clip : fix wrong loop condition	2024-02-15 18:49:08 +02:00
slaren	a8d3e145cc	cuda : print message when initialization fails (#5512 ) * cuda : print message when initialization fails * use CUDA_NAME both times	2024-02-15 16:49:01 +01:00
Georgi Gerganov	31e77c5029	scripts : add hf.sh helper script (#5501 ) * scripts : add hf.sh helper scripts * hf : add error logs * hf : add support for --repo and --file	2024-02-15 15:41:15 +02:00
Michaël de Vries	e56fe7b5ca	fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false (#5487 ) * fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false * fix(gguf-py): added missing cls and mask token ids to the gguf metadata	2024-02-15 14:14:37 +01:00
Elbios	e4880f18f3	llava : fix memory management bug (#5491 ) * Fix memory management in llava and server code Fixes this error: llama_new_context_with_model: graph splits (measure): 3 Available slots: -> Slot 0 - max context: 6000 {"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"} all slots are idle and system prompt is empty, clear the KV cache slot 0 - loaded image slot 0 is processing [task id: 0] slot 0 : kv cache rm - [0, end) slot 0 - encoding image [id: 1] munmap_chunk(): invalid pointer Aborted * Make it cleaner by checking size in batch free wrapper	2024-02-15 10:01:57 +02:00
John	66faa7f5d2	llaba : hotfix for llava-1.6 image number (#5495 ) Co-authored-by: John <cmt-nct@users.noreply.github.com>	2024-02-15 09:59:18 +02:00
Neuman Vong	a5579e4f93	vulkan: Find optimal memory type but with fallback (#5381 ) * @0cc4m feedback * More feedback @0cc4m	2024-02-15 07:11:15 +01:00
Rune	de06c28a15	readme : fix typo (#5490 ) executabhle -> executable	2024-02-14 17:15:49 +02:00
John	07687d7350	llava : update README.md (#5489 ) * Update README.md * Update README.md * Update examples/llava/README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-14 16:49:42 +02:00
Michael Podvitskiy	f5c9547e67	cmake : ARM intrinsics detection for MSVC (#5401 )	2024-02-14 10:49:01 +02:00
John	4351148229	llava : support v1.6 (#5267 ) * Create llava-survery-v2.py * Update convert-image-encoder-to-gguf.py * Update convert-image-encoder-to-gguf.py * Rename llava-survery-v2.py to llava-surgery-v2.py * Update convert-image-encoder-to-gguf.py will now search for projector * Update convert-image-encoder-to-gguf.py whoops * Update llava-surgery-v2.py * Clip: Bugfix for normalization (it did not loat the 3 std and mean values) Clip: bicubic resize function Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6) Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final convert-image-encoder: fixed image-grid flattening * whitespace corrections * ws * Tensors are now properly permuted. Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference. * ws * added verbose_prompt support into cli added stopwords for llava-1.6 into cli * moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed * ws * convert : skip unknown tensors (need for LLaVA) * llava : update readme * llava : fix compile warnings * llava : style * convert : add --skip-unknown CLI arg * server : remove clip structs * bugfix for non llava-1.6 It should now work with llava-1.5 as well * clip : minor code rearrange * llava : update readme a bit --------- Co-authored-by: John <cmt-nct@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-14 09:38:35 +02:00
AT	0aa506cb97	Early return for zero size calls to get_tensor. (#5482 ) * Early return for zero size calls to get_tensor. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add an early return to the get/set tensor when the size is null. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Early return after the assertions. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Since we do the early return in the generic backend now no reason to do so here as well. Signed-off-by: Adam Treat <treat.adam@gmail.com> --------- Signed-off-by: Adam Treat <treat.adam@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-13 22:44:25 +01:00
John	3872f9b23f	gguf : add python reader example (#5216 ) * Update CMakeLists.txt * Create reader.py * Update reader.py * Update reader.py another whitespace :| * Update reader.py * lintlintlint	2024-02-13 19:56:38 +02:00
Jared Van Bortel	f79cb016f4	llama : add support for Nomic Embed (#5468 )	2024-02-13 12:03:53 -05:00
Aarni Koskela	d03fb522e5	llama : allow raw byte in SPM vocabs; don't crash on nl 404 (#5478 ) * common : don't crash if newline token is not found * common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs	2024-02-13 18:18:16 +02:00
Aarni Koskela	604cdd2d78	llama : make load error reporting more granular (#5477 ) Makes it easier to pinpoint where e.g. `unordered_map::at: key not found` comes from.	2024-02-13 15:24:50 +02:00
Daniel Bevenius	49cc8b2c60	finetune : rename feed-forward tensors (w1/w2/w3) (#4839 ) * finetune: rename feed-forward tensors (w1/w2/w3) This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be inline with the names used in the llama_layer struct in llama.cpp. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * train-text-from-scratch: rename ff tensors This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be inline with the names used in the llama_layer struct in llama.cpp Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-13 15:15:42 +02:00
Georgi Gerganov	b8667a8447	tests : multi-thread the tokenizer tests (#5474 ) * tests : multi-thread the tokenizer tests ggml-ci * unicode : fix data race for unidentified codepoints ggml-ci * unicode : minor style fixes ggml-ci	2024-02-13 15:14:22 +02:00
Douglas Hanley	f54932ce66	llama : support batched embeddings (#5466 ) * batched embedding: pool outputs by sequence id. updated embedding example * bring back non-causal attention * embd : minor improvements * llama : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-13 14:06:58 +02:00
Johannes Gäßler	990dfea10c	make: add error message for bad CUDA version (#5444 ) * make: add error message for bad CUDA version * Update Makefile Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-13 12:38:37 +01:00
Georgi Gerganov	a6747dccc5	bert : add tests + fix quantization (#5475 ) * llama : do not quantize pos embd and token type tensors * ci : add BERT tests ggml-ci * ci : do not do BERT tests on low-perf nodes ggml-ci	2024-02-13 13:01:29 +02:00
Georgi Gerganov	fea77814c7	tests : disable moe test (#5473 )	2024-02-13 11:20:24 +02:00
Kawrakow	a0074519b4	ggml-quants : fix compiler warnings (shadow variable) (#5472 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-13 09:07:57 +02:00
Georgi Gerganov	fc01cc08e5	llama : fix quantization when tensors are missing (#5423 )	2024-02-12 20:14:39 +02:00
Georgi Gerganov	6858a4fa83	swift : package no longer use ggml dependency (#5465 ) * Revert "swift : update Package.swift to use ggml as dependency (#4691)" This reverts commit `ece9a45e8f`. * spm : add ggml headers	2024-02-12 19:54:29 +02:00
Lee	6a83825ce0	py : fix persimmon `n_rot` conversion (#5460 ) * convert : fix persimmon offical weight conversion to write correct n_rot. * Update convert-persimmon-to-gguf.py --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-12 19:29:57 +02:00
Abhilash Majumder	16dba45e9c	ggml-sycl: Replace 3d ops with macro (#5458 ) * use macro * use macro * fix format	2024-02-12 20:22:05 +05:30
Daniel Bevenius	5270994e6d	llava : remove prog parameter from ArgumentParser (#5457 ) * llava: remove prog parameter from ArgumentParser This commit removes the `prog` parameter from `ArgumentParser` so that it uses the default value which is the name of the script. The motivation for this change is that currently the usage output looks like this: ```console $ python examples/llava/convert-image-encoder-to-gguf.py --help usage: convert_hf_to_gguf.py [-h] ... ``` And with this change it will look like this: ```console $ python examples/llava/convert-image-encoder-to-gguf.py --help usage: convert-image-encoder-to-gguf.py [-h] ... ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * ci: add W503 to flake8 ignore list This commit adds W503 to the ignore list for flake8. This is done to avoid the following error: W503 line break before binary operator Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-12 10:38:44 +02:00
Georgi Gerganov	0ca4e0c14c	sync : ggml (#5452 ) * ggml-alloc : v3 (ggml/727) * ggml-alloc v3 ggml-ci * fix ci ggml-ci * whisper : check for backend buffer allocation failures * whisper : avoid leaks when initialization fails * cleanup ggml-ci * style fixes ggml-ci * sync : ggml * update llama.cpp, clip.cpp, export-lora.cpp * update finetune.cpp, train-text-from-scratch.cpp ggml-ci * ggml-backend : reduce alignment to 32 to match gguf and fix mmap --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-12 09:16:06 +02:00
Johannes Gäßler	8377d606b8	CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434 ) * CUDA: mul_mat_vec_q tiling, refactor mul mat logic Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-11 19:08:39 +01:00
Douglas Hanley	2dc6eef6da	Add support for BERT embedding models (#5423 ) * BERT model graph construction (build_bert) * WordPiece tokenizer (llm_tokenize_wpm) * Add flag for non-causal attention models * Allow for models that only output embeddings * Support conversion of BERT models to GGUF * Based on prior work by @xyzhang626 and @skeskinen --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 11:21:38 -05:00
github-actions[bot]	713ce99422	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31) → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)	2024-02-11 07:50:41 -08:00
Sergio López	68e7918281	vulkan: only use M-sized matmul on Apple GPUs (#5412 ) * vulkan: refactor guess_matmul_pipeline for vendor Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals. Signed-off-by: Sergio Lopez <slp@redhat.com> * vulkan: only use M-sized matmul on Apple GPUs L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor. Signed-off-by: Sergio Lopez <slp@redhat.com> --------- Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-02-11 15:12:00 +01:00
Alexey Parfenov	f2f1af418b	common : use enums for sampler types (#5418 ) * common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * minor : spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:43:31 +02:00

2173 Commits