Commit Graph

509 Commits

Maximilian Winter
9342050de3 examples : add pydantic models to GBNF grammar generator (#4883)
* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added an example and usage instructions.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed a grammar generator issue.
2024-01-12 21:46:45 +02:00
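For a sense of what the generator added in #4883 does, here is a much-simplified Python sketch that maps a flat pydantic model to a GBNF grammar. The type table and rule layout are illustrative assumptions only; the real `examples/pydantic_models_to_grammar.py` module is far more complete.

```python
# Toy illustration of mapping a flat pydantic (v2) model to a GBNF grammar.
# This is a simplified sketch, not the actual generator from the PR.
from pydantic import BaseModel

class Flight(BaseModel):
    origin: str
    destination: str
    passengers: int

# Hypothetical minimal mapping from Python types to GBNF fragments.
GBNF_TYPES = {str: '"\\"" [^"]* "\\""', int: "[0-9]+"}

def model_to_gbnf(model: type) -> str:
    parts = []
    for i, (name, field) in enumerate(model.model_fields.items()):
        sep = '"," ' if i else ""
        parts.append(f'{sep}"\\"{name}\\":" {GBNF_TYPES[field.annotation]}')
    return 'root ::= "{" ' + " ".join(parts) + ' "}"'

print(model_to_gbnf(Flight))
```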
slaren
882a16a127 llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
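The new `--split-mode` parameter (none, layer or row) controls how model weights are distributed across devices. A toy Python sketch of the difference between the two multi-GPU strategies, with all names and shapes hypothetical (the real logic lives in ggml-backend and llama.cpp):

```python
# Toy illustration of "layer" vs "row" tensor splitting across GPUs.
n_layers, n_gpus = 32, 2
tensor_split = [0.5, 0.5]  # per-GPU proportions, as with --tensor-split

def split_by_layer(n_layers, split):
    """Layer mode: each GPU gets a contiguous range of whole layers."""
    assign, start = {}, 0
    for gpu, frac in enumerate(split):
        count = round(frac * n_layers)
        assign[gpu] = range(start, start + count)
        start += count
    return assign

def split_by_row(n_rows, split):
    """Row mode: every weight matrix is sliced row-wise across all GPUs."""
    assign, start = {}, 0
    for gpu, frac in enumerate(split):
        count = round(frac * n_rows)
        assign[gpu] = (start, start + count)
        start += count
    return assign

print(split_by_layer(n_layers, tensor_split))  # GPU0 -> layers 0..15, GPU1 -> 16..31
print(split_by_row(4096, tensor_split))        # each matmul split 2048 / 2048 rows
```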
Daniel Bevenius
e55262208d export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12 19:54:53 +02:00
Zay
9905daaaa3 llama.swiftui : update models layout (#4826)
* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout
2024-01-12 14:48:00 +02:00
Kawrakow
b8e769ee21 Importance Matrix calculation (#4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 06:59:57 +01:00
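The importance matrix tracks, over a calibration run, how strongly each weight column is activated, so quantization error can be weighted by how much each column actually matters. A toy Python sketch of that accumulation, assuming (as an illustration, not taken from the PR) a per-column mean of squared activations; the real `examples/imatrix` hooks into ggml's graph evaluation:

```python
# Toy sketch of importance-matrix accumulation: for each matmul, track the
# mean of squared activations per input column over a calibration run.
from collections import defaultdict

sums = defaultdict(lambda: None)   # tensor name -> per-column sum of x^2
counts = defaultdict(int)          # tensor name -> number of rows seen

def observe(name: str, activations: list[list[float]]) -> None:
    """activations: rows of inputs fed to weight tensor `name`."""
    for row in activations:
        if sums[name] is None:
            sums[name] = [0.0] * len(row)
        for j, x in enumerate(row):
            sums[name][j] += x * x
        counts[name] += 1

def importance(name: str) -> list[float]:
    return [s / counts[name] for s in sums[name]]

observe("blk.0.ffn_up", [[0.1, 2.0], [0.2, 1.5]])
print(importance("blk.0.ffn_up"))  # columns with larger activations matter more
```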
Georgi Gerganov
599288bf8a server : fix infill when prompt is empty (#4833) 2024-01-11 23:23:49 +02:00
Georgi Gerganov
d12d72c0a2 main : better name for variable n_print (#4874) 2024-01-11 22:46:26 +02:00
Georgi Gerganov
2e19c63956 main : disable token count by default (#4874) 2024-01-11 22:43:05 +02:00
Kawrakow
a9d5db805b llama : restore intended k-quants mixes for MoE models (#4872)
* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant descriptions does not
make much sense today, but let's leave that update for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 21:43:15 +02:00
Laura
026f72d14b server : implement credentialed CORS (#4514)
* Implement credentialed CORS according to MDN

* Fix syntax error

* Move validate_api_key up so it is defined before its first usage
2024-01-11 20:02:48 +02:00
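Per MDN, a credentialed CORS response may not use the wildcard origin: the server has to echo the specific request `Origin` and set `Access-Control-Allow-Credentials`. A minimal Python sketch of that rule (the server's actual handler is C++):

```python
# Minimal sketch of credentialed CORS header logic per MDN: with
# credentials, Access-Control-Allow-Origin must echo the request's
# Origin (never "*"), alongside Access-Control-Allow-Credentials: true.
def cors_headers(request_origin: str | None) -> dict[str, str]:
    if request_origin is None:
        return {}
    return {
        "Access-Control-Allow-Origin": request_origin,
        "Access-Control-Allow-Credentials": "true",
        # Vary tells caches that the response depends on the Origin header.
        "Vary": "Origin",
    }

print(cors_headers("https://example.com"))
```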
Michael Coppola
2fce0d62ba server : support for multiple api keys (#4864)
* server: added support for multiple API keys and loading API keys from a file

* minor: fix whitespace

* added file error handling to --api-key-file and changed code to better
reflect the current style

* server: update README.md for --api-key-file

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11 19:51:17 +02:00
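A sketch of the idea in Python: load one key per line from a file and accept a request only if its bearer token is in the set. The file name is hypothetical, and the server's real `validate_api_key` is C++:

```python
# Sketch of multi-API-key validation: one key per line in a file,
# checked against the request's "Authorization: Bearer <key>" header.
def load_api_keys(path: str) -> set[str]:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def validate_api_key(auth_header: str | None, keys: set[str]) -> bool:
    if not keys:                     # no keys configured -> open server
        return True
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    return auth_header.removeprefix("Bearer ") in keys

keys = load_api_keys("api_keys.txt")          # hypothetical file name
print(validate_api_key("Bearer secret123", keys))
```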
Behnam M
1760ce4a1d server : add LOG_INFO when model is successfully loaded (#4881)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model loading fails, the server state becomes `ERROR`; otherwise it becomes `READY`. The `/health` endpoint now provides more granular messages according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and collapsed double blank lines into single ones

* updated `server` readme to document the `/health` endpoint too

* used LOG_INFO after successful model loading
2024-01-11 19:41:39 +02:00
pudepiedj
948a870c14 main : print total token count and tokens consumed so far (#4874)
* Token count changes

* Add show token count

* Updating before PR

* Two requested changes

* Move param definition position
2024-01-11 18:14:52 +02:00
Isaac McFadyen
d08d46765f server : fix typo in model name (#4876) 2024-01-11 16:33:26 +02:00
Behnam M
751a33212c server : update readme to document the new /health endpoint (#4866)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model loading fails, the server state becomes `ERROR`; otherwise it becomes `READY`. The `/health` endpoint now provides more granular messages according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and collapsed double blank lines into single ones

* updated `server` readme to document the `/health` endpoint too
2024-01-11 09:12:05 +02:00
Georgi Gerganov
a2e0602ac0 server : fix build + rename enums (#4870) 2024-01-11 09:10:34 +02:00
Behnam M
fe3d53f647 server : add a /health endpoint (#4860)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model loading fails, the server state becomes `ERROR`; otherwise it becomes `READY`. The `/health` endpoint now provides more granular messages according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and collapsed double blank lines into single ones
2024-01-10 21:56:05 +02:00
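A small client-side sketch of how the endpoint can be used: poll `/health` until the server leaves the loading state. The JSON field name and status strings are assumptions based on the state machine described above, not verified against the server source:

```python
# Sketch: poll /health until the server reports it is ready.
# Assumes a JSON body with a "status" field such as "loading model" /
# "error" / "ok" -- the exact field names are an assumption here.
import json
import time
import urllib.request

def wait_until_ready(base_url: str, timeout_s: float = 120.0) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health") as resp:
                status = json.load(resp).get("status")
            if status == "ok":
                return True
        except OSError:
            pass                      # server not accepting connections yet
        time.sleep(0.5)
    return False

print(wait_until_ready("http://localhost:8080"))
```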
John
239c336728 clip : support more quantization types (#4846)
Uses ggml functions instead of hardcoded names and adds support for quantizing into the modern K-quant variants.
This is just the bare minimum to get k-types working; a more refined choice of types would be needed to get the best quality at low quantization levels.

I ran a few tests; it doesn't break anything I could notice, and a Q6_K ViT works almost as well as Q8_0 but with 3 times the inference speed.
2024-01-10 15:37:09 +02:00
Justine Tunney
acc4879612 llava-cli : don't crash if --image flag is invalid (#4835)
This change fixes an issue where supplying `--image missing-file` would
result in a segfault due to a null pointer being dereferenced. This can
result in distracting info being printed if robust crash analysis tools
are being used.
2024-01-09 19:59:14 +02:00
Behnam M
b680381cfd server : update readme about token probs (#4777)
* updated server readme to reflect the gg/server-token-probs-4088 commit

added an explanation for the API's completion result, which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`.

* simplified the `completion_probabilities` JSON schema 

It's now easier to understand what the structure of `completion_probabilities` looks like.

* minor : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09 12:02:05 +02:00
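To make the documented structure concrete, here is the assumed shape written out as a Python literal; the field names follow the PR description above but should be checked against the actual server readme:

```python
# Sketch of the completion_probabilities structure documented in the
# server readme (field names assumed from the PR description).
response = {
    "content": "Hello",
    "completion_probabilities": [
        {
            "content": "Hello",               # the token actually chosen
            "probs": [
                {"tok_str": "Hello", "prob": 0.91},
                {"tok_str": "Hi",    "prob": 0.05},
            ],
        },
    ],
}

# Print the top alternatives considered for each generated token.
for tok in response["completion_probabilities"]:
    alts = ", ".join(f"{p['tok_str']}:{p['prob']:.2f}" for p in tok["probs"])
    print(f"{tok['content']!r} <- {alts}")
```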
Zsapi
2d0d38f5e0 server : add api-key flag to documentation (#4832)
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-09 11:12:43 +02:00
Georgi Gerganov
3e86f86432 llama.swiftui : update readme 2024-01-08 15:57:36 +02:00
Georgi Gerganov
7ddf5857e7 main : add self-extend support (#4815)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* llama : "self-extend"-like context extension

* passkey : add comment

* main : add Self-Extend support

* llama : add comment about llama_kv_cache_seq_div
2024-01-08 11:18:32 +02:00
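Self-Extend stretches the trained context window by keeping exact positions for recent tokens while mapping older tokens onto shared, integer-divided positions. One possible way to picture the mapping, as a toy Python sketch; the names `g_width` and `g_factor` are illustrative stand-ins for main's group-attention parameters, not the actual implementation:

```python
# Toy sketch of Self-Extend-style grouped positions: tokens inside a
# local window keep their exact positions; more distant tokens are
# compressed by integer division so they share positions.
def self_extend_pos(pos: int, cur: int, g_width: int, g_factor: int) -> int:
    dist = cur - pos
    if dist < g_width:               # recent tokens: normal positions
        return pos
    # distant tokens: collapse every g_factor positions into one
    return cur - g_width - (dist - g_width) // g_factor

cur = 100
for p in (99, 96, 60, 0):
    print(p, "->", self_extend_pos(p, cur, g_width=8, g_factor=4))
```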
Georgi Gerganov
a386b0dd63 examples : add passkey test (#3856)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* make : add passkey target

* passkey : add "self-extend"-like context extension (#4810)

* llama : "self-extend"-like context extension

* passkey : add comment

* passkey : add readme
2024-01-08 11:14:04 +02:00
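The passkey test hides a number inside long filler text and checks whether the model can retrieve it from beyond its trained context. A sketch of how such a prompt can be built; the filler sentence and phrasing are assumptions, not the example's exact strings:

```python
# Sketch of a passkey-retrieval prompt: bury a secret number in filler
# text at a chosen depth, then ask for it back. Wording is illustrative.
import random

def build_passkey_prompt(n_junk: int, insert_at: int) -> tuple[str, int]:
    passkey = random.randint(1, 99999)
    junk = "The grass is green. The sky is blue. The sun is yellow. "
    parts = [junk] * n_junk
    parts.insert(insert_at, f"The pass key is {passkey}. Remember it. ")
    parts.append("What is the pass key?")
    return "".join(parts), passkey

prompt, key = build_passkey_prompt(n_junk=250, insert_at=125)
print(len(prompt.split()), "words; expecting", key)
```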
slaren
d513cfc4b5 llama-bench : add no-kv-offload parameter (#4812) 2024-01-07 17:59:01 +01:00
Alex Azarov
30df691a96 llama.swiftui : use llama.cpp as SPM package (#4804) 2024-01-07 10:20:50 +02:00
Alex Azarov
8c36aaf5a8 llama.swiftui : add visionOS target (#4805) 2024-01-07 09:46:55 +02:00
Georgi Gerganov
003f85d7ea server : fix n_predict check (#4798) 2024-01-07 08:45:26 +02:00
Daniel Illescas Romero
34d18eff4c llama.swiftui : use correct pointer for llama_token_eos (#4797) 2024-01-06 17:12:59 +02:00
Georgi Gerganov
33c9d849fd examples : improve base-translate.sh script (#4783) 2024-01-06 11:40:24 +02:00
Georgi Gerganov
7e27e37f26 metal : switch back to default.metallib (ggml/681)
ggml-ci
2024-01-05 18:02:06 +02:00
Georgi Gerganov
41ced5ce3c examples : add few-shot translation example (#4783) 2024-01-05 15:11:10 +02:00
Daniel Bevenius
0c4cb7138c finetune : remove unused includes (#4756)
This commit removes unused includes from finetune.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-04 21:45:37 +02:00
Georgi Gerganov
82e82f484d server : send token probs for "stream == false" (#4714) 2024-01-04 19:56:33 +02:00
singularity
2d08e99f47 llama.swiftui : support loading custom model from file picker (#4767)
* swiftui: support loading a model from the file picker

* swiftui: remove trailing whitespace
2024-01-04 10:22:38 +02:00
Michael Coppola
85648efa9e server : fix options in README.md (#4765)
* fix examples/server/README.md

* minor : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04 10:17:09 +02:00
singularity
c399a87c6b llama.swiftui : fix build of ggml.metallib (#4754)
* metal: fix metal backend init failure in swiftui

* metal: build ggml.metallib instead of copying the source

* llama.swift : remove debug flags from metallib build

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04 09:58:16 +02:00
Justin Parker
5b56760f5c server : throw an error when slot unavailable (#4741) 2024-01-03 10:43:19 +02:00
Phil H
421b0da133 server : add token counts to html footer (#4738)
* server: add token counts to stats

* server: generate hpp

---------

Co-authored-by: phiharri <ph@got-root.co.uk>
2024-01-02 17:48:49 +02:00
Georgi Gerganov
8243feab46 editorconfig : fix whitespace and indentation #4710 2024-01-02 13:28:15 +02:00
minarchist
37b6fbf892 server : add --override-kv parameter (#4710)
* Changes to server to allow metadata override

* documentation

* flake.nix: expose full scope in legacyPackages

* flake.nix: rocm not yet supported on aarch64, so hide the output

* flake.nix: expose checks

* workflows: nix-ci: init; build flake outputs

* workflows: nix-ci: add a job for eval

* workflows: weekly `nix flake update`

* workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

* workflows: nix-ci: add a qemu job for jetsons

* flake.nix: suggest the binary caches

* flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

---------

Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
2024-01-02 12:38:15 +02:00
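A metadata override lets the user replace a GGUF key-value pair at load time. As a sketch, assuming an argument of the form `KEY=TYPE:VALUE` (the exact syntax and supported types belong to the server/common code and should be checked there):

```python
# Sketch of parsing a metadata override of the assumed form
# KEY=TYPE:VALUE, e.g. "llama.expert_used_count=int:3".
def parse_kv_override(arg: str):
    key, rest = arg.split("=", 1)
    typ, value = rest.split(":", 1)
    casts = {"int": int, "float": float,
             "bool": lambda v: v.lower() == "true"}
    return key, casts[typ](value)

print(parse_kv_override("llama.expert_used_count=int:3"))
```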
Daniel Bevenius
ffcf2ca432 finetune: fix typo in README.md (#4733)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-02 10:16:55 +01:00
Georgi Gerganov
2db1c8f6f2 clip : refactor + bug fixes (#4696)
* clip : refactor + bug fixes

ggml-ci

* server : add log message
2023-12-30 23:24:42 +02:00
Georgi Gerganov
1e12070633 clip : use ggml_backend_buffer_is_host (#4205) 2023-12-29 18:53:34 +02:00
Steward Garcia
3226acb20c clip : enable gpu backend (#4205)
* clip: enable CUDA backend

* add missing kernels

* add enough padding for alignment

* remove ggml_repeat from clip.cpp

* add metal backend

* llava : fixes

- avoid ggml_repeat
- use GGML_USE_ instead of CLIP_USE_ macros
- remove unused vars

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-29 18:52:15 +02:00
Cuong Trinh Manh
5f8aa28f03 cmake : fix ld warning duplicate libraries libllama.a (#4671)
* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'"

* fix warning in example.
2023-12-29 16:39:15 +02:00
Justine Tunney
4e00adf486 llava-cli : refactor to use sampling library (#4669)
This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup, deleting a
long-standing TODO comment.

This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.

See Mozilla-Ocho/llamafile@1cd334f
2023-12-29 16:38:38 +02:00
Justine Tunney
a2a1f7333e server : replace sleep with condition variables (#4673)
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency, since most sleep implementations round up to
the system scheduling quantum (usually 10ms). Other libc sleep
implementations spin for smaller time intervals, which results in the
server's busy loop consuming all available CPU. Having explicit
notify() / wait() code also aids the readability of the server code.

See Mozilla-Ocho/llamafile@711344b
2023-12-29 16:24:12 +02:00
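The pattern, in Python terms: instead of sleeping in a loop and re-checking, the worker blocks on a condition that task producers notify. A minimal sketch of the scheduling change described above (the server's real implementation is C++ with std::condition_variable):

```python
# Minimal sketch of replacing a sleep-poll loop with condition variables:
# the worker blocks in wait_for() and wakes immediately on notify(),
# instead of paying sleep-quantum latency or spinning on the CPU.
import threading
import time
from collections import deque

tasks: deque = deque()
cond = threading.Condition()

def post_task(task) -> None:
    with cond:
        tasks.append(task)
        cond.notify()                      # wake the worker right away

def worker() -> None:
    while True:
        with cond:
            cond.wait_for(lambda: tasks)   # no polling, no fixed sleep
            task = tasks.popleft()
        print("handled task", task)

threading.Thread(target=worker, daemon=True).start()
for i in range(3):
    post_task(i)
time.sleep(0.1)                            # let the daemon worker drain the queue
```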
SakuraUmi
bb6f9cfce2 server : fix OpenAI server sampling w.r.t. penalty. (#4675) 2023-12-29 16:22:44 +02:00
Karthik Sethuraman
be677135fb server : allow to generate multimodal embeddings (#4681) 2023-12-29 16:22:10 +02:00