ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-20 13:14:09 +00:00

Author	SHA1	Message	Date
Behnam M	751a33212c	server : update readme to document the new `/health` endpoint (#4866 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too	2024-01-11 09:12:05 +02:00
Georgi Gerganov	a2e0602ac0	server : fix build + rename enums (#4870 )	2024-01-11 09:10:34 +02:00
Behnam M	fe3d53f647	server : add a `/health` endpoint (#4860 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line	2024-01-10 21:56:05 +02:00
John	239c336728	clip : support more quantization types (#4846 ) Uses ggml functions instead of hardcoded names and adds support to quantize into the modern Q-K variants. This is just the bare minimum to get k-types working - a more refined choice of types would be needed to get best quality on low quantizations. I ran a few tests, it doesn't break anything I could notice and a Q6_K ViT works almost as well as Q8_0 but 3 times the inference speed.	2024-01-10 15:37:09 +02:00
Justine Tunney	acc4879612	llava-cli : don't crash if --image flag is invalid (#4835 ) This change fixes an issue where supplying `--image missing-file` would result in a segfault due to a null pointer being dereferenced. This can result in distracting info being printed if robust crash analysis tools are being used.	2024-01-09 19:59:14 +02:00
Behnam M	b680381cfd	server : update readme about token probs (#4777 ) * updated server readme to reflect the gg/server-token-probs-4088 commit added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`. * simplified the `completion_probabilities` JSON schema It's now easier to understand what the structure of `completion_probabilities` looks like. * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-09 12:02:05 +02:00
Zsapi	2d0d38f5e0	server : add api-key flag to documentation (#4832 ) Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441	2024-01-09 11:12:43 +02:00
Georgi Gerganov	3e86f86432	llama.swiftui : update readme	2024-01-08 15:57:36 +02:00
Georgi Gerganov	7ddf5857e7	main : add self-extend support (#4815 ) * examples : add passkey test * passkey : better prints * passkey : select pass key pos from CLI * passkey : simplify n_past logic * llama : "self-extend"-like context extension * passkey : add comment * main : add Self-Extend support * llama : add comment about llama_kv_cache_seq_div	2024-01-08 11:18:32 +02:00
Georgi Gerganov	a386b0dd63	examples : add passkey test (#3856 ) * examples : add passkey test * passkey : better prints * passkey : select pass key pos from CLI * passkey : simplify n_past logic * make : add passkey target * passkey : add "self-extend"-like context extension (#4810) * llama : "self-extend"-like context extension * passkey : add comment * passkey : add readme	2024-01-08 11:14:04 +02:00
slaren	d513cfc4b5	llama-bench : add no-kv-offload parameter (#4812 )	2024-01-07 17:59:01 +01:00
Alex Azarov	30df691a96	llama.swiftui : use llama.cpp as SPM package (#4804 )	2024-01-07 10:20:50 +02:00
Alex Azarov	8c36aaf5a8	llama.swiftui : add visionOS target (#4805 )	2024-01-07 09:46:55 +02:00
Georgi Gerganov	003f85d7ea	server : fix n_predict check (#4798 )	2024-01-07 08:45:26 +02:00
Daniel Illescas Romero	34d18eff4c	llama.swiftui : use correct pointer for llama_token_eos (#4797 )	2024-01-06 17:12:59 +02:00
Georgi Gerganov	33c9d849fd	examples : improve base-translate.sh script (#4783 )	2024-01-06 11:40:24 +02:00
Georgi Gerganov	7e27e37f26	metal : switch back to default.metallib (ggml/681) ggml-ci	2024-01-05 18:02:06 +02:00
Georgi Gerganov	41ced5ce3c	examples : add few-shot translation example (#4783 )	2024-01-05 15:11:10 +02:00
Daniel Bevenius	0c4cb7138c	finetune : remove unused includes (#4756 ) This commit removes unused includes from finetune.cpp. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-04 21:45:37 +02:00
Georgi Gerganov	82e82f484d	server : send token probs for "stream == false" (#4714 )	2024-01-04 19:56:33 +02:00
singularity	2d08e99f47	llama.swiftui : support loading custom model from file picker (#4767 ) * swiftui: support load model from file picker * swiftui: remove trailing whitespace	2024-01-04 10:22:38 +02:00
Michael Coppola	85648efa9e	server : fix options in README.md (#4765 ) * fix examples/server/README.md * minor : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-04 10:17:09 +02:00
singularity	c399a87c6b	llama.swiftui : fix build of ggml.metallib (#4754 ) * metal: fix metal backend init failure in swiftui * metal: build ggml.metallib instead of copy src * llama.swift : remove debug flags from metallib build --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-04 09:58:16 +02:00
Justin Parker	5b56760f5c	server : throw an error when `slot unavailable` (#4741 )	2024-01-03 10:43:19 +02:00
Phil H	421b0da133	server : add token counts to html footer (#4738 ) * server: add token counts to stats * server: generate hpp --------- Co-authored-by: phiharri <ph@got-root.co.uk>	2024-01-02 17:48:49 +02:00
Georgi Gerganov	8243feab46	editorconfig : fix whitespace and indentation #4710	2024-01-02 13:28:15 +02:00
minarchist	37b6fbf892	server : add --override-kv parameter (#4710 ) * Changes to server to allow metadata override * documentation * flake.nix: expose full scope in legacyPackages * flake.nix: rocm not yet supported on aarch64, so hide the output * flake.nix: expose checks * workflows: nix-ci: init; build flake outputs * workflows: nix-ci: add a job for eval * workflows: weekly `nix flake update` * workflows: nix-flakestry: drop tag filters ...and add a job for flakehub.com * workflows: nix-ci: add a qemu job for jetsons * flake.nix: suggest the binary caches * flake.lock: update to a commit recently cached by nixpkgs-cuda-ci --------- Co-authored-by: John <john@jLap.lan> Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>	2024-01-02 12:38:15 +02:00
Daniel Bevenius	ffcf2ca432	finetune: fix typo in README.md (#4733 ) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-01-02 10:16:55 +01:00
Georgi Gerganov	2db1c8f6f2	clip : refactor + bug fixes (#4696 ) * clip : refactor + bug fixes ggml-ci * server : add log message	2023-12-30 23:24:42 +02:00
Georgi Gerganov	1e12070633	clip : use ggml_backend_buffer_is_host (#4205 )	2023-12-29 18:53:34 +02:00
Steward Garcia	3226acb20c	clip : enable gpu backend (#4205 ) * clip: enable CUDA backend * add missing kernels * add enough padding for alignment * remove ggml_repeat of clip.cpp * add metal backend * llava : fixes - avoid ggml_repeat - use GGML_USE_ instead of CLIP_USE_ macros - remove unused vars --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-29 18:52:15 +02:00
Cuong Trinh Manh	5f8aa28f03	cmake : fix ld warning duplicate libraries libllama.a (#4671 ) * fix "ld: warning: ignoring duplicate libraries: '../libllama.a'" * fix warning in example.	2023-12-29 16:39:15 +02:00
Justine Tunney	4e00adf486	llava-cli : refactor to use sampling library (#4669 ) This change makes it possible to use flags like `--grammar` when using the `llava-cli` program. The rest is just code cleanup deleting a long standing TODO comment. This change also ensures that logging information is emitted to stderr which helps the `llava-cli` command be more friendly to shell scripts. See Mozilla-Ocho/llamafile@1cd334f	2023-12-29 16:38:38 +02:00
Justine Tunney	a2a1f7333e	server : replace sleep with condition variables (#4673 ) The server currently schedules tasks using a sleep(5ms) busy loop. This adds unnecessary latency since most sleep implementations do a round up to the system scheduling quantum (usually 10ms). Other libc sleep impls spin for smaller time intervals which results in the server's busy loop consuming all available cpu. Having the explicit notify() / wait() code also helps aid in the readability of the server code. See mozilla-Ocho/llamafile@711344b	2023-12-29 16:24:12 +02:00
SakuraUmi	bb6f9cfce2	server : fix OpenAI server sampling w.r.t. penalty. (#4675 )	2023-12-29 16:22:44 +02:00
Karthik Sethuraman	be677135fb	server : allow to generate multimodal embeddings (#4681 )	2023-12-29 16:22:10 +02:00
andrijdavid	fe6e204f91	main-cmake-pkg : fix build issue (#4665 ) * Fix main-cmake-pkg compilation * Use glob to load common files * cmake : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-29 16:18:20 +02:00
Peter Sugihara	0f60ba09ce	llama.swiftui : fix infinite loop, output timings, buff UI (#4674 ) * fix infinite loop * slight UI simplification, clearer UX * clearer UI text, add timings to completion log	2023-12-29 15:58:56 +02:00
Justine Tunney	ca7d2aabab	Fix OpenAI server sampling w.r.t. temp and seed (#4668 ) The default values for tfs_z and typical_p were being set to zero, which caused the token candidates array to get shrunk down to one element thus preventing any sampling. Note this only applies to OpenAI API compatible HTTP server requests. The solution is to use the default values that OpenAI documents, as well as ensuring we use the llama.cpp defaults for the rest. I've tested this change still ensures deterministic output by default. If a "temperature" greater than 0 is explicitly passed, then output is unique each time. If "seed" is specified in addition to "temperature" then the output becomes deterministic once more. See mozilla-Ocho/llamafile#117 See mozilla-Ocho/llamafile@9e4bf29	2023-12-28 15:20:00 -04:00
Daniel Bevenius	766ccb2615	finetune : fix output formatting in print_params (#4653 ) This commit fixes the output formatting in the print_params function which currently looks like this: ```console print_params: n_vocab: 32000 print_params: n_ctx: 128 print_params: n_embd: 4096 print_params: n_ff: 11008 print_params: n_head: 32 print_params: n_head_kv: 32 print_params: n_layer: 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` With this commit the output will look like this: ```console print_params: n_vocab : 32000 print_params: n_ctx : 128 print_params: n_embd : 4096 print_params: n_ff : 11008 print_params: n_head : 32 print_params: n_head_kv : 32 print_params: n_layer : 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2023-12-27 16:16:55 +02:00
Alexey Parfenov	593a2e1be5	server : allow to specify custom prompt for penalty calculation (#3727 )	2023-12-23 11:31:49 +02:00
LeonEricsson	4f3f1b832f	lookup : add prompt lookup decoding example (#4484 ) * initial commit, going through initializations * main loop finished, starting to debug * BUG: generates gibberish/repeating tokens after a while * kv_cache management * Added colors to distinguish drafted tokens (--color). Updated README * lookup : fix token positions in the draft batch * lookup : use n_draft from CLI params * lookup : final touches --------- Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 18:05:56 +02:00
Georgi Gerganov	f330ea5c2e	ggml : change ggml_scale to take a float instead of tensor (#4573 ) * ggml : change ggml_scale to take a float instead of tensor * ggml : fix CPU implementation * tests : fix test-grad0 ggml-ci	2023-12-21 23:20:49 +02:00
Georgi Gerganov	4c919cc3e8	gguf : simplify example dependencies	2023-12-21 23:08:14 +02:00
Georgi Gerganov	7a72042b8f	llama.swiftui : add tinyllama 1.1B F16	2023-12-18 20:17:43 +02:00
Georgi Gerganov	8e9f54e3e2	llama.swiftui : add more models	2023-12-18 20:05:12 +02:00
Georgi Gerganov	6851c8fb39	llama.swiftui : add bench functionality (#4483 ) * llama.swiftui : add bench button * llama.swiftui : initial bench functionality * force to use n_gpu_layers on simulator * add download buttons & expose llamaState.loadModel * update project.pbxproj * comment #Preview & fix editorconfig check * gitignore : xcode stuff * llama.swiftui : UX improvements * llama.swiftui : avoid data copy via "downloadTask" * llama.swiftui : remove model from project * llama : remove "mostly" from model infos * llama.swiftui : improve bench --------- Co-authored-by: jhen <developer@jhen.me>	2023-12-17 19:38:41 +02:00
slaren	4994747b7f	finetune : keep allocs alive until all allocations are done (#4486 )	2023-12-17 16:05:56 +01:00
olexiyb	1f6c89aa4e	server : disable llm logs if SERVER_VERBOSE is off (#3792 )	2023-12-17 17:02:16 +02:00
AdithyanI	25469cab7f	server : fix grammar being ignored (#4494 ) Fix bug in identifying the grammar.	2023-12-17 16:57:56 +02:00


495 Commits