Commit Graph

27 Commits

Author SHA1 Message Date
anzz1
e0522e5dd3 feat: '--in-prefix STRING' option (#426)
Prefix user inputs with a string
2023-03-25 14:03:19 +02:00
Georgi Gerganov
92dc17b275 Reduce memory usage and allocate enough memory for largest context (#473)
* Reduce memory usage and allocate enough memory for large contexts

* Simpler scratch buffer usage

* Reenable BLAS for quantized mul_mat

* Fix number of layers in 30B and 65B

* Fix KV cache size for F32
2023-03-24 23:17:37 +02:00
comex
d86b7f08ad Support calling mlock() on loaded model data on Linux and macOS (#453)
* Support calling mlock() on loaded model data on Linux and macOS

This is enabled by a new --mlock command line option.

Using mlock() disables swapping and memory compression for the model
data.  Doing so can be useful on systems where the model takes up a
large fraction of system RAM.  In my experience, macOS is quite eager to
start compressing llama.cpp's memory, which then makes it halt for a few
seconds while it decompresses, even with a model that uses "only" 25GB
out of 32GB.

Of course, this comes at the cost of forcing the system to swap or
compress other processes' memory instead, so it needs to be used with
care and shouldn't be enabled by default.

In theory it should be possible to support this on Windows as well using
VirtualLock(), but I'm not much of a Windows user.

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-24 17:19:05 +02:00
Luciano
605a3aaef3 Add embedding mode with arg flag. Currently working (#282)
* working but ugly

* add arg flag, not working on embedding mode

* typo

* Working! Thanks to @nullhook

* make params argument instead of hardcoded boolean. remove useless time check

* start doing the instructions but not finished. This probably doesnt compile

* Embeddings extraction support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-24 17:05:13 +02:00
tjohnman
815b60c690 Don't force immediate interactive without -i (#354)
* Don't force immediate interactive without -i

Sometimes we might want to use a reverse prompt but we want to let the
model generate tokens right after the initial prompt. So we don't force
user input mode if the -i flag wasn't specified and instead let it run
until we encounter the reverse prompt.

This gives use some more flexibility, since it doesn't force the user to
enter a newline if they want to let the model generate text right after
the initial prompt and only be asked for input if the reverse prompt is
encountered.

The `--interactive-first` flag is reintroduced to force the old
behavior. `-r` behaves like `-i` plus introduces a reverse prompt (it
can be specified more than once).

* Update help output.

---------

Co-authored-by: Johnman <tjohnman@github>
2023-03-22 19:16:35 +02:00
Georgi Gerganov
c40b5b3d59 Introduce C-style API (#370)
* Major refactoring - introduce C-style API

* Clean up

* Add <cassert>

* Add <iterator>

* Add <algorithm> ....

* Fix timing reporting and accumulation

* Measure eval time only for single-token calls

* Change llama_tokenize return meaning
2023-03-22 07:32:36 +02:00
Fabio R. Sluzala
edb2b84e60 We could use std::unordered_map over std::map (#305)
* Improve performance by changing std::map to std::unordered_map and std::map<id, token> id_to_token; to std::vector<token> id_to_token;

* fix last commit on gpt_vocab_init add vocab.id_to_token.resize(vocab.token_to_id.size());

* Removed include <map>

* Nest struct token score inside gpt_vocab

* renamed token to tok
2023-03-21 19:21:50 +02:00
Gary Linscott
4458de7520 Compute perplexity over prompt (#270)
* Compute perplexity over prompt

* More accurate perplexity calculation - over all logits in the context window (so 512x more tokens!)

* Output all perplexitiies

* Add timing/ETA
2023-03-21 18:27:42 +02:00
anzz1
a2c6b64d8d cmdline option for custom amount of model parts (--n_parts N) (#348)
* cmdline option for custom amount of model parts (--n_parts N)

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 17:42:43 +02:00
Georgi Gerganov
de8241445c Change default repeat_penalty to 1.0
I feel this penalty is not really helping.
Especially for the example from the README it makes results pretty bad
2023-03-21 17:32:14 +02:00
Georgi Gerganov
26d3089d76 Add tokenizer test + revert to C++11 (#355)
* Add test-tokenizer-0 to do a few tokenizations - feel free to expand
* Added option to convert-pth-to-ggml.py script to dump just the vocabulary
* Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)
* Added utility to load vocabulary file from previous point (temporary implementation)
* Avoid using std::string_view and drop back to C++11 (hope I didn't break something)
* Rename gpt_vocab -> llama_vocab
* All CMake binaries go into ./bin/ now
2023-03-21 17:29:41 +02:00
Mack Straight
1130f96c88 move file magic/version to header, print expected version (#319) 2023-03-20 19:26:01 +00:00
Mack Straight
60d93896be sentencepiece bpe compatible tokenizer (#252)
* potential out of bounds read

* fix quantize

* style

* Update convert-pth-to-ggml.py

* mild cleanup

* don't need the space-prefixing here rn since main.cpp already does it

* new file magic + version header field

* readme notice

* missing newlines

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-03-20 03:17:23 -07:00
tjohnman
3a9d372c08 Support for multiple reverse prompts. (#299)
Co-authored-by: Johnman <>
Co-authored-by: Johnman <tjohnman@github>
2023-03-19 21:33:06 +02:00
tjohnman
eacfb91e66 Make prompt randomization optional. (#300)
Co-authored-by: Johnman <>
2023-03-19 20:36:19 +02:00
slaren
d51eb6a70f Add --ignore-eos parameter (#181)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-19 20:22:48 +02:00
Erik Scholz
30ea62e25d Command line switch to use F16 for memory_k and memory_v (refactor of #154) (#294)
* Use F16 for memory_k and memory_v

* add command line switch to use f16 instead of f32 for memory k+v

---------

Co-authored-by: Ty Everett <ty@tyweb.us>
2023-03-19 19:57:00 +02:00
Georgi Gerganov
2ad520e892 Add "--instruct" argument for usage with Alpaca (#240)
Also start adding prompts in "./prompts"
2023-03-19 18:37:02 +02:00
Georgi Gerganov
e9cd7a260b Default to 4 threads (#243) 2023-03-17 21:46:46 +02:00
Stephan Walter
45113b2f42 Don't tell users to use a bad number of threads (#243)
The readme tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
2023-03-17 19:47:35 +02:00
Justin Suess
a4d17b7096 added ctx_size parameter (#148)
* added ctx_size parameter

* added it in more places

* Apply suggestions from code review

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-15 21:42:40 +02:00
Matvey Soloviev
d35528087e Add interactive mode (#61)
* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
2023-03-12 23:13:28 +02:00
beiller
c763dc1bc2 Add back top_k (#56)
* Add back top_k

* Update utils.cpp

* Update utils.h

---------

Co-authored-by: Bill Hamilton <bill.hamilton@shopify.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-12 22:23:15 +02:00
beiller
a63a748bba Add repetition penalty (#20)
* Adding repeat penalization

* Update utils.h

* Update utils.cpp

* Numeric fix

Should probably still scale by temp even if penalized

* Update comments, more proper application

I see that numbers can go negative so a fix from a referenced commit

* Minor formatting

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-12 11:27:42 +02:00
Georgi Gerganov
8453184bb2 Fix a bug in the rope calculation 2023-03-10 23:46:57 +02:00
Georgi Gerganov
3cda59d04e Final touches 2023-03-10 21:50:46 +02:00
Georgi Gerganov
4b5b86d6ee Initial release 2023-03-10 20:56:40 +02:00