ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-31 03:29:52 +00:00

Author	SHA1	Message	Date
Justine Tunney	a2a1f7333e	server : replace sleep with condition variables (#4673 ) The server currently schedules tasks using a sleep(5ms) busy loop. This adds unnecessary latency since most sleep implementations do a round up to the system scheduling quantum (usually 10ms). Other libc sleep impls spin for smaller time intervals which results in the server's busy loop consuming all available cpu. Having the explicit notify() / wait() code also helps aid in the readability of the server code. See mozilla-Ocho/llamafile@711344b	2023-12-29 16:24:12 +02:00
SakuraUmi	bb6f9cfce2	server : fix OpenAI server sampling w.r.t. penalty. (#4675 )	2023-12-29 16:22:44 +02:00
Karthik Sethuraman	be677135fb	server : allow to generate multimodal embeddings (#4681 )	2023-12-29 16:22:10 +02:00
andrijdavid	fe6e204f91	main-cmake-pkg : fix build issue (#4665 ) * Fix main-cmake-pkg compilation * Use glob to load common files * cmake : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-29 16:18:20 +02:00
Peter Sugihara	0f60ba09ce	llama.swiftui : fix infinite loop, output timings, buff UI (#4674 ) * fix infinite loop * slight UI simplification, clearer UX * clearer UI text, add timings to completion log	2023-12-29 15:58:56 +02:00
Justine Tunney	ca7d2aabab	Fix OpenAI server sampling w.r.t. temp and seed (#4668 ) The default values for tfs_z and typical_p were being set to zero, which caused the token candidates array to get shrunk down to one element thus preventing any sampling. Note this only applies to OpenAI API compatible HTTP server requests. The solution is to use the default values that OpenAI documents, as well as ensuring we use the llama.cpp defaults for the rest. I've tested this change still ensures deterministic output by default. If a "temperature" greater than 0 is explicitly passed, then output is unique each time. If "seed" is specified in addition to "temperature" then the output becomes deterministic once more. See mozilla-Ocho/llamafile#117 See mozilla-Ocho/llamafile@9e4bf29	2023-12-28 15:20:00 -04:00
Daniel Bevenius	766ccb2615	finetune : fix output formatting in print_params (#4653 ) This commit fixes the output formatting in the print_params function which currently looks like this: ```console print_params: n_vocab: 32000 print_params: n_ctx: 128 print_params: n_embd: 4096 print_params: n_ff: 11008 print_params: n_head: 32 print_params: n_head_kv: 32 print_params: n_layer: 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` With this commit the output will look like this: ```console print_params: n_vocab : 32000 print_params: n_ctx : 128 print_params: n_embd : 4096 print_params: n_ff : 11008 print_params: n_head : 32 print_params: n_head_kv : 32 print_params: n_layer : 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2023-12-27 16:16:55 +02:00
Alexey Parfenov	593a2e1be5	server : allow to specify custom prompt for penalty calculation (#3727 )	2023-12-23 11:31:49 +02:00
LeonEricsson	4f3f1b832f	lookup : add prompt lookup decoding example (#4484 ) * initial commit, going through initializations * main loop finished, starting to debug * BUG: generates gibberish/repeating tokens after a while * kv_cache management * Added colors to distinguish drafted tokens (--color). Updated README * lookup : fix token positions in the draft batch * lookup : use n_draft from CLI params * lookup : final touches --------- Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-22 18:05:56 +02:00
Georgi Gerganov	f330ea5c2e	ggml : change ggml_scale to take a float instead of tensor (#4573 ) * ggml : change ggml_scale to take a float instead of tensor * ggml : fix CPU implementation * tests : fix test-grad0 ggml-ci	2023-12-21 23:20:49 +02:00
Georgi Gerganov	4c919cc3e8	gguf : simplify example dependencies	2023-12-21 23:08:14 +02:00
Georgi Gerganov	7a72042b8f	llama.swiftui : add tinyllama 1.1B F16	2023-12-18 20:17:43 +02:00
Georgi Gerganov	8e9f54e3e2	llama.swiftui : add more models	2023-12-18 20:05:12 +02:00
Georgi Gerganov	6851c8fb39	llama.swiftui : add bench functionality (#4483 ) * llama.swiftui : add bench button * llama.swiftui : initial bench functionality * force to use n_gpu_layers on simulator * add download buttons & expose llamaState.loadModel * update project.pbxproj * comment #Preview & fix editorconfig check * gitignore : xcode stuff * llama.swiftui : UX improvements * llama.swiftui : avoid data copy via "downloadTask" * llama.swiftui : remove model from project * llama : remove "mostly" from model infos * llama.swiftui : improve bench --------- Co-authored-by: jhen <developer@jhen.me>	2023-12-17 19:38:41 +02:00
slaren	4994747b7f	finetune : keep allocs alive until all allocations are done (#4486 )	2023-12-17 16:05:56 +01:00
olexiyb	1f6c89aa4e	server : disable llm logs if SERVER_VERBOSE is off (#3792 )	2023-12-17 17:02:16 +02:00
AdithyanI	25469cab7f	server : fix grammar being ignored (#4494 ) Fix bug in identifying the grammar.	2023-12-17 16:57:56 +02:00
Alexey Parfenov	6c9141e794	server : fix possible ambiguity in content type charset (#4501 )	2023-12-17 16:56:09 +02:00
mzcu	690e6659ed	server : allow requests larger than 8K (#4500 )	2023-12-17 16:54:37 +02:00
ShadovvBeast	81cae950cd	server : add optional API Key Authentication example (#4441 ) * Add API key authentication for enhanced server-client security * server : to snake_case --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-15 13:49:01 +02:00
slaren	ca79db8866	ggml : remove n_dims from ggml_tensor (#4469 ) ggml-ci	2023-12-14 16:52:08 +01:00
LostRuins	e6ddbb28d3	ggml : add ggml_row_size() (fixes llama out of space) (#4461 ) * Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values * do not cast to size_t, instead just use doubles * ggml : add ggml_row_size(), deprecate ggml_type_sizef() * ggml : fix row size compute to avoid overflows * tests : fix sizey -> sizez --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-14 14:13:33 +02:00
shibe2	92c5a97cf2	server : fix handling of characters that span multiple tokens when streaming (#4446 )	2023-12-13 21:57:15 +02:00
kalomaze	bf286d0526	server : tweak default sampling parameters (#4367 ) * Set a more typical Top P setting as the default * Update temp max	2023-12-12 12:12:35 +02:00
Richard Kiss	1249c107d3	english : use `typos` to fix comments and logs (#4354 )	2023-12-12 11:53:36 +02:00
Vladimir Zorin	93aa8edb9f	server : fix local model name in server (#4420 )	2023-12-12 11:25:29 +02:00
Yueh-Po Peng	a43b1ac59b	Update README.md (#4388 ) Fix small typo.	2023-12-10 23:27:38 +01:00
Georgi Gerganov	86bfbd3afc	llama : per-layer KV cache + quantum K cache (#4309 ) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 13:03:17 +02:00
Hongyu Ouyang	c1d99cafb7	train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351 ) On commit b1108 (`44c117f4`) xaedes added ggml_allocr * alloc = NULL; ... (many lines in between) if (alloc) { ggml_allocr_free(alloc); } Which is correct, but it's easy to lose context after many lines in between. On commit b1287 (`0e76a899`) xaedes made a big change. From here on, alloc is freed eagerly. alloc = ggml_allocr_new(...) ... (short lines of code) ggml_allocr_free(alloc) This happens a few times, but alloc is never set to NULL, and many lines below, we still have if (alloc) { ggml_allocr_free(alloc); } which causes a double-free.	2023-12-07 12:25:22 +02:00
Georgi Gerganov	602cf4fd3a	server : recognize cache_prompt parameter in OAI API (#4347 )	2023-12-06 20:21:59 +02:00
stduhpf	f098f30b29	speculative : support `--color` (#4343 ) * speculative: add some colors * minor : add braces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-06 10:08:17 +02:00
MaggotHATE	6a3e340d1e	sampling : custom samplers order (#4285 ) * Samplers sequence order w parameter * Cleaned commented code * Fixed formatting * Rewrote with unordered_map * Revert and rewrite, too many problems and safeguards would be needed * Fixed code style * Code style fixes according to review * More readable samplers input string, fixed help * Style fix in sampler_queue * Formatting fixes * Fixing whitespaces	2023-12-05 12:05:51 +02:00
Daniel Bevenius	0e6885a7ac	simple : update error message for KV cache check (#4324 ) This commit updates the error message that is printed when the KV cache is not big enough to hold all the prompt and generated tokens. Specifically it removes the reference to n_parallel and replaces it with n_len. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2023-12-04 18:04:21 +02:00
Miwa / Ensan	ca44b588eb	swift : fix concatenation method to avoid invalid UTF8 stringification (#4325 )	2023-12-04 18:03:49 +02:00
Miwa / Ensan	79d6bdf363	swift : fix prompt tokenization logic (#4321 )	2023-12-04 15:43:45 +02:00
Ed Lee	748ed27910	server : fix OpenAI API `stop` field to be optional (#4299 ) (cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)	2023-12-03 11:10:43 +02:00
Rickard Edén	1999041306	py : add grammar to oai like api (#4294 )	2023-12-03 11:03:25 +02:00
Georgi Gerganov	f742a36093	llama : support optional tensors (#4283 )	2023-12-01 20:35:47 +02:00
Miwa / Ensan	620a06de72	swift : fix token_to_piece implementation (#4278 ) * Fix token_to_piece implementation in Swift * Fix errors	2023-12-01 20:19:45 +02:00
Georgi Gerganov	88dbc5a1e9	ggml : add ggml_soft_max_ext (#4256 ) * metal : implement soft_max_ext * cuda : implement soft_max_ext * ggml : implement soft_max_ext (CPU) * batched-bench : print threads ggml-ci * metal : simplify soft_max encoding ggml-ci * cuda : use 512 threads for soft_max instead of 32 * ggml : update soft max cpu * cuda : do warp-based block reduce * cuda : increase max block size to 1024 * cuda : fix warp reduction initialization of shared mem * metal : warp-based reduction for soft max kernel * metal : warp-based reduce for rms_norm * metal : simplify soft max kernel ggml-ci * alloc : fix build with debug	2023-12-01 10:51:24 +02:00
Ziad Ben Hadj-Alouane	49e1009f75	server : add --log-disable to disable logging to file (#4260 ) * * add --log-disable to disable logging to file in the server example * * typo fix	2023-12-01 00:25:49 +02:00
Ziad Ben Hadj-Alouane	66ee2cd128	server : add single-client multi-prompt support (#4232 ) * * add multiprompt support * * cleanup * * more cleanup * * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests * * remove all references to mutex_multitasks * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * * change to set --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2023-12-01 00:25:04 +02:00
John	53efe4ab1d	llava : ShareGPT4V compatibility (vision encoder only loading) (#4172 ) * ShareGPT4 compatibility (vision encoder only loading) Load only a CLIP vision encoder (as supplied by ShareGPT finetunes) Corrects the argument parsing for --img_mean and --img_std (which were previously not parsed but attempted to access) Defines defaults for img_mean and img_std which are equal to the llava 1.5 CLIP encoder, so you do not have to provide them * Update convert-image-encoder-to-gguf.py	2023-11-30 23:11:14 +01:00
Andrew Godfrey	10ba0288a3	main : pass LOG_TEE callback to llama.cpp log (#4033 ) * main : Call llama_log_set to use LOG_TEE * tabs to spaces	2023-11-30 23:56:19 +02:00
Miwa / Ensan	27645bac96	batched.swift : update README.md (#4214 ) docs: update how to run	2023-11-30 23:45:17 +02:00
rhjdvsgsgks	36a0f98394	py : fix oai proxy (#3972 ) * fix oai proxy fix generation not stopped while bot stop talking in chat mode fix possible `slot_id` not exist response for cors (and pre flight) * oai proxy: workaround for some client (such as Chatbox) * use stop as separator to replace hardcoded `\n`	2023-11-30 22:50:40 +02:00
Georgi Gerganov	65125b5b15	examples : add readme files	2023-11-29 11:00:17 +02:00
Bailey Chittle	a6a660c556	examples : iOS example with swift ui (#4159 ) * copy to llama.cpp as subdir * attempt enabling metal, fails * ggml metal compiles! * Update README.md * initial conversion to new format, utf8 errors? * bug fixes, but now has an invalid memory access :( * added O3, now has insufficient memory access * begin sync with master * update to match latest code, new errors * fixed it! * fix for loop conditionals, increase result size * fix current workflow errors * attempt a llama.swiftui workflow * Update .github/workflows/build.yml Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-11-27 16:56:52 +02:00
Georgi Gerganov	03d295f72b	lookahead : support `-n -1` infinite generation	2023-11-26 21:52:23 +02:00
Georgi Gerganov	538b538dec	lookahead : add example for lookahead decoding (#4207 ) * lookahead : init * lookahead : generate and store n-grams * lookahead : use loop instead recursion to generate n-grams * lookahead : initial working implementation * lookahead : filter repeating n-grams * lookahead : use deterministic init * lookahead : add to Makefile * lookahead : fix a bug in the seq_id of the lookahead tokens * lookahead : add comments --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-11-26 20:33:07 +02:00


762 Commits