ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-01 20:19:52 +00:00

Author	SHA1	Message	Date
slaren	cbfef1f00c	cuda, metal : fix nans in soft_max (#5574 ) * cuda : fix nans in soft_max * metal : fix nans in soft_max --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-19 10:04:45 +02:00
Mirko185	568decfaa7	readme : update (#5572 ) Added 1.5-bit on README.md	2024-02-19 09:39:31 +02:00
bmwl	72c0e5290e	ggml : android and old glibc NUMA incompatibility bugfixes (#5557 ) * #ifdef out some code NUMA blocks for Android due to lack of support * added in some __ANDROID__ if def gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper * Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc * harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways --------- Co-authored-by: root <root@nenya.lothlorien.ca>	2024-02-19 09:38:32 +02:00
Jared Van Bortel	6b6f1ee292	build : pass all warning flags to nvcc via -Xcompiler (#5570 ) * build : pass all warning flags to nvcc via -Xcompiler * make : fix apparent mis-merge from #3952 * make : fix incorrect GF_CC_VER for CUDA host compiler	2024-02-18 16:21:52 -05:00
Georgi Gerganov	e0282dbdd6	ggml : restore vec dot stride arg names (#5453 )	2024-02-18 22:58:57 +02:00
Georgi Gerganov	a889a4a4c6	ci : fix wikitext url + compile warnings (#5569 ) ggml-ci	2024-02-18 22:39:30 +02:00
Georgi Gerganov	b3c5c440bc	metal : fix unused warnings (#0 )	2024-02-18 21:39:58 +02:00
Robey Holderith	fbc3ee16c2	common, server : surface min_keep as its own parameter (#5567 ) * Feature - surface min_keep as its own parameter * Updated README with min_keep param	2024-02-18 21:11:16 +02:00
Pierrick Hymbert	c0c9caa18f	server : slots monitoring endpoint (#5550 )	2024-02-18 19:39:57 +02:00
Georgi Gerganov	1441b588ab	sampling : do not set min_keep to n_probs (#5564 )	2024-02-18 19:38:06 +02:00
Georgi Gerganov	67c5f360ce	cmake : fix GGML_USE_SYCL typo (#5555 )	2024-02-18 19:17:00 +02:00
Pierrick Hymbert	af5d2d4d3d	server : enhanced health endpoint (#5548 ) * server: enrich health endpoint with available slots, return 503 if not slots are available * server: document new status no slot available in the README.md	2024-02-18 18:31:28 +02:00
Pierrick Hymbert	f01cb6dac9	server : --n-predict option document and cap to max value (#5549 ) * server: document --n-predict * server: ensure client request cannot override n_predict if set * server: fix print usage LF in new --n-predict option	2024-02-18 18:30:09 +02:00
Daniel Hiltgen	42c5518ad5	server : graceful server shutdown (#5244 ) This updates the server queue to support graceful shutdown of the server on signals.	2024-02-18 18:23:16 +02:00
Georgi Gerganov	3c8f28b4d9	common : fix ub (#5530 )	2024-02-18 18:21:52 +02:00
Herman Semenov	d5d86073dc	ggml, common, examples, tests : fixed type arguments in printf (#5528 )	2024-02-18 18:20:12 +02:00
Daniel Bevenius	787dbf4d0f	llava : update surgery script to not remove tensors (#5536 ) This commit updates the surgery script to not remove the tensors from the model file. For this to work the `--skip-unknown` flag is added as an argument to the convert.py script in README.md. The motivation for this change is that the surgery script currently removes the projector tensors from the model file. If the model was checked out from a repository, the model file will have been updated and have to be checked out again to reset this effect. If this can be avoided I think it would be preferable. I did not perform this change for BakLLaVA models as I am not sure how that part works.	2024-02-18 18:19:23 +02:00
Kawrakow	fa40433c9d	1.5 bit quantization (#5453 ) * iq1_s: WIP basics * iq1_s: CUDA is working * iq1_s: scalar CPU dot product * iq1_s: WIP AVX2 dot product - something is not right * Fix tests * Fix shadow warnings * Fix after merge with latest master * iq1_s: AVX2 finally works * iq1_s: ARM_NEON dot product. Works, but not very fast * iq1_s: better grid * iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL. * iq1_s: Metal basics Dequantize works, but not dot product * iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write. * iq1_s: Tests * iq1_s: slightly faster dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-18 18:16:55 +02:00
github-actions[bot]	b02abf3383	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07) → 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)	2024-02-18 06:39:58 -08:00
Georgi Gerganov	a5c73a0e9d	ggml : add ALiBi support for ggml_soft_max_ext (#5488 ) * ggml : avoid recomputing alibi slopes (CPU) * llama : reuse hparams.f_max_alibi_bias in all cases ggml-ci * ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal) ggml-ci * ggml : handle all SRCs (do not break on first null) ggml-ci * tests : do not use slope for large soft_max accumulates too much error ggml-ci * ggml : alternative ALiBi without extra tensor We compute the slopes in the kernel ggml-ci * cuda : add ALiBi support in ggml_soft_max_ext ggml-ci * ggml : deprecate ggml_alibi * ggml : support multi-sequence ALiBi (Metal) ggml-ci * cuda : add multi-seq ALiBi + remote F16 soft_max ggml-ci * ggml : update deprecation message * ggml : fix pos ptr when no ALiBi ggml-ci * cuda : fix performance (pow -> powf) * cuda : precompute ALiBi constants * metal : pre-compute ALiBi slopes ggml-ci * llama : init kq_pos only if needed ggml-ci * test-backend-ops : add null pos test to soft_max test-backend-ops : replace soft_max tests ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-17 23:04:16 +02:00
Ananta Bastola	27a984488c	ci : add an option to fail on compile warning (#3952 ) * feat(ci): add an option to fail on compile warning * Update CMakeLists.txt * minor : fix compile warnings ggml-ci * ggml : fix unreachable code warnings ggml-ci * ci : disable fatal warnings for windows, ios and tvos * ggml : fix strncpy warning * ci : disable fatal warnings for MPI build * ci : add fatal warnings to ggml-ci ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-17 23:03:14 +02:00
clibdev	5c46b0e0bc	gitignore : update for CLion IDE (#5544 )	2024-02-17 18:28:37 +02:00
Georgi Gerganov	8f20d4fb3e	cmake : fix VULKAN and ROCm builds (#5525 ) * cmake : fix VULKAN and ROCm builds * cmake : fix (cont) * vulkan : fix compile warnings ggml-ci * cmake : fix ggml-ci * cmake : minor ggml-ci	2024-02-16 19:05:56 +02:00
Georgi Gerganov	b6ebf5b6ff	scripts : add helpers script for bench comparing commits (#5521 ) * scripts : add helpers script for bench comparing commits * scripts : detect CUDA * set flags after checking the command line * fix make flags --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-16 15:14:40 +02:00
Herman Semenov	26bfe98833	llava : removed excess free(NULL) operation (#5531 )	2024-02-16 14:43:23 +02:00
Herman Semenov	f54ed222bf	llama : minor fixed return int value (#5529 )	2024-02-16 13:45:48 +02:00
Alexey Parfenov	14c96c2c4c	server : add "samplers" param to control the samplers order (#5494 )	2024-02-16 13:33:25 +02:00
Rőczey Barnabás	f183700c25	server : fix system prompt cli (#5516 )	2024-02-16 12:00:56 +02:00
bmwl	4bc5d852a2	ggml : add numa options (#5377 ) * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverted Makefile * Fixed include * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables * removed trailing whitespace * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h * Reverting Makefile * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet * Removing MIRROR_MODE code for this PR * Removing last bit of MIRROR_MODE code for this PR * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static * Fixed lingering init_llama_backend() bool calls in tests and examples * Remote enum llama_numa_strategies * Revert bad merge with dynatemp flags * add missing enum ggml_numa_strategies declaration and revert sync problem with master * add missing enum ggml_numa_strategies declaration * fixed ggml_init_numa variable * Update ggml.h Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges * split numa init out from llama_backend_init and created llama_numa_init. 
Updated all code paths and samples * Fix up some boolean vs enum comparisons * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype * Update ggml.h Align enum values Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c Remove whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml.c align paremeters Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/common.cpp Remove whitespace and align brace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example * Update ggml.c simplified return for platforms without NUMA support Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * removed redundant else from cli argument processing of --numa * whitespace --------- Co-authored-by: root <root@nenya.lothlorien.ca> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-16 11:31:07 +02:00
Daniel Bevenius	384e6bbbe7	llava : fix clip-model-is-vision flag in README.md (#5509 ) * llava: fix clip-model-is-vision flag in README.md This commit fixes the flag `--clip_model_is_vision` in README.md which is does not match the actual flag: ```console $ python convert-image-encoder-to-gguf.py --help ... --clip-model-is-vision The clip model is a pure vision model (ShareGPT4V vision extract for example) ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llava: update link to vit config in README.md Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-16 11:24:39 +02:00
Georgi Gerganov	adfe875b17	ci : fix BERT model download and convert	2024-02-16 09:57:55 +02:00
Douglas Hanley	9bc28075c1	Use correct type of pooling for embedding models (#5500 ) Use correct type of pooling for embedding models	2024-02-15 12:21:49 -05:00
Georgi Gerganov	fc7f903883	clip : fix wrong loop condition	2024-02-15 18:49:08 +02:00
slaren	a8d3e145cc	cuda : print message when initialization fails (#5512 ) * cuda : print message when initialization fails * use CUDA_NAME both times	2024-02-15 16:49:01 +01:00
Georgi Gerganov	31e77c5029	scripts : add hf.sh helper script (#5501 ) * scripts : add hf.sh helper scripts * hf : add error logs * hf : add support for --repo and --file	2024-02-15 15:41:15 +02:00
Michaël de Vries	e56fe7b5ca	fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false (#5487 ) * fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false * fix(gguf-py): added missing cls and mask token ids to the gguf metadata	2024-02-15 14:14:37 +01:00
Elbios	e4880f18f3	llava : fix memory management bug (#5491 ) * Fix memory management in llava and server code Fixes this error: llama_new_context_with_model: graph splits (measure): 3 Available slots: -> Slot 0 - max context: 6000 {"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"} all slots are idle and system prompt is empty, clear the KV cache slot 0 - loaded image slot 0 is processing [task id: 0] slot 0 : kv cache rm - [0, end) slot 0 - encoding image [id: 1] munmap_chunk(): invalid pointer Aborted * Make it cleaner by checking size in batch free wrapper	2024-02-15 10:01:57 +02:00
John	66faa7f5d2	llaba : hotfix for llava-1.6 image number (#5495 ) Co-authored-by: John <cmt-nct@users.noreply.github.com>	2024-02-15 09:59:18 +02:00
Neuman Vong	a5579e4f93	vulkan: Find optimal memory type but with fallback (#5381 ) * @0cc4m feedback * More feedback @0cc4m	2024-02-15 07:11:15 +01:00
Rune	de06c28a15	readme : fix typo (#5490 ) executabhle -> executable	2024-02-14 17:15:49 +02:00
John	07687d7350	llava : update README.md (#5489 ) * Update README.md * Update README.md * Update examples/llava/README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-14 16:49:42 +02:00
Michael Podvitskiy	f5c9547e67	cmake : ARM intrinsics detection for MSVC (#5401 )	2024-02-14 10:49:01 +02:00
John	4351148229	llava : support v1.6 (#5267 ) * Create llava-survery-v2.py * Update convert-image-encoder-to-gguf.py * Update convert-image-encoder-to-gguf.py * Rename llava-survery-v2.py to llava-surgery-v2.py * Update convert-image-encoder-to-gguf.py will now search for projector * Update convert-image-encoder-to-gguf.py whoops * Update llava-surgery-v2.py * Clip: Bugfix for normalization (it did not loat the 3 std and mean values) Clip: bicubic resize function Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6) Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final convert-image-encoder: fixed image-grid flattening * whitespace corrections * ws * Tensors are now properly permuted. Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference. 
* ws * added verbose_prompt support into cli added stopwords for llava-1.6 into cli * moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed * ws * convert : skip unknown tensors (need for LLaVA) * llava : update readme * llava : fix compile warnings * llava : style * convert : add --skip-unknown CLI arg * server : remove clip structs * bugfix for non llava-1.6 It should now work with llava-1.5 as well * clip : minor code rearrange * llava : update readme a bit --------- Co-authored-by: John <cmt-nct@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-14 09:38:35 +02:00
AT	0aa506cb97	Early return for zero size calls to get_tensor. (#5482 ) * Early return for zero size calls to get_tensor. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add an early return to the get/set tensor when the size is null. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Early return after the assertions. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Since we do the early return in the generic backend now no reason to do so here as well. Signed-off-by: Adam Treat <treat.adam@gmail.com> --------- Signed-off-by: Adam Treat <treat.adam@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-13 22:44:25 +01:00
John	3872f9b23f	gguf : add python reader example (#5216 ) * Update CMakeLists.txt * Create reader.py * Update reader.py * Update reader.py another whitespace :\| * Update reader.py * lintlintlint	2024-02-13 19:56:38 +02:00
Jared Van Bortel	f79cb016f4	llama : add support for Nomic Embed (#5468 )	2024-02-13 12:03:53 -05:00
Aarni Koskela	d03fb522e5	llama : allow raw byte in SPM vocabs; don't crash on nl 404 (#5478 ) * common : don't crash if newline token is not found * common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs	2024-02-13 18:18:16 +02:00
Aarni Koskela	604cdd2d78	llama : make load error reporting more granular (#5477 ) Makes it easier to pinpoint where e.g. `unordered_map::at: key not found` comes from.	2024-02-13 15:24:50 +02:00
Daniel Bevenius	49cc8b2c60	finetune : rename feed-forward tensors (w1/w2/w3) (#4839 ) * finetune: rename feed-forward tensors (w1/w2/w3) This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be inline with the names used in the llama_layer struct in llama.cpp. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * train-text-from-scratch: rename ff tensors This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate, ffn_down and ffn_up respectively. The motivation for this change is to make it easier to understand the purpose of the tensors. This also seems to be inline with the names used in the llama_layer struct in llama.cpp Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-13 15:15:42 +02:00
Georgi Gerganov	b8667a8447	tests : multi-thread the tokenizer tests (#5474 ) * tests : multi-thread the tokenizer tests ggml-ci * unicode : fix data race for unidentified codepoints ggml-ci * unicode : minor style fixes ggml-ci	2024-02-13 15:14:22 +02:00

2639 Commits