Commit Graph

52 Commits

Author SHA1 Message Date
tjohnman
1b4b61fb60 Fix instruct mode broken by PR #354 (#409)
Co-authored-by: Johnman <tjohnman@github>
2023-03-23 01:30:23 +01:00
tjohnman
815b60c690 Don't force immediate interactive without -i (#354)
* Don't force immediate interactive without -i

Sometimes we might want to use a reverse prompt but we want to let the
model generate tokens right after the initial prompt. So we don't force
user input mode if the -i flag wasn't specified and instead let it run
until we encounter the reverse prompt.

This gives us some more flexibility, since it doesn't force the user to
enter a newline if they want to let the model generate text right after
the initial prompt and only be asked for input if the reverse prompt is
encountered.

The `--interactive-first` flag is reintroduced to force the old
behavior. `-r` behaves like `-i` plus introduces a reverse prompt (it
can be specified more than once).

* Update help output.

---------

Co-authored-by: Johnman <tjohnman@github>
2023-03-22 19:16:35 +02:00
Erik Scholz
48c8ad5bcf fix perplexity after c-api refactor (#390)
* preallocate a buffer of fitting size for tokenization (utils.cpp)

* don't create a new std::string (especially here, where it's usually large)
2023-03-22 18:09:38 +02:00
Georgi Gerganov
dc2abfbdf0 When seed <= 0 - use the clock to generate one 2023-03-22 07:47:15 +02:00
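
A minimal sketch of the clock-based seed fallback described in dc2abfbdf0. The helper name `resolve_seed` is hypothetical and not taken from the repository; only the rule "seed <= 0 means use the clock" comes from the commit title.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical helper: returns the seed to use for sampling.
// A positive seed is kept as given; zero or a negative value
// is replaced by the current wall-clock time in seconds.
int32_t resolve_seed(int32_t seed) {
    if (seed <= 0) {
        seed = static_cast<int32_t>(std::time(nullptr));
    }
    return seed;
}
```

With this rule, passing an explicit positive seed keeps runs reproducible, while omitting it gives a different sequence on each run.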
Georgi Gerganov
5d0d8f903c Init llama_context_params properly from CLI (#370) 2023-03-22 07:45:14 +02:00
Georgi Gerganov
c40b5b3d59 Introduce C-style API (#370)
* Major refactoring - introduce C-style API

* Clean up

* Add <cassert>

* Add <iterator>

* Add <algorithm> ....

* Fix timing reporting and accumulation

* Measure eval time only for single-token calls

* Change llama_tokenize return meaning
2023-03-22 07:32:36 +02:00
Fabio R. Sluzala
edb2b84e60 We could use std::unordered_map over std::map (#305)
* Improve performance by changing std::map to std::unordered_map and std::map<id, token> id_to_token; to std::vector<token> id_to_token;

* fix last commit on gpt_vocab_init add vocab.id_to_token.resize(vocab.token_to_id.size());

* Removed include <map>

* Nest struct token score inside gpt_vocab

* renamed token to tok
2023-03-21 19:21:50 +02:00
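
The container change in edb2b84e60 can be illustrated roughly as follows. This is a simplified sketch, not the repository's code: `vocab_sketch` and `build_vocab` are hypothetical names, and the input is assumed to be a list of unique tokens whose position is the token id.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified vocabulary; field names follow the commit message.
struct vocab_sketch {
    using id    = int32_t;
    using token = std::string;

    std::unordered_map<token, id> token_to_id; // hashed lookup, O(1) on average
    std::vector<token>            id_to_token; // ids are dense, so index directly
};

// Build both directions from a token list (assumed unique) where the
// position of each token is its id.
vocab_sketch build_vocab(const std::vector<std::string> & tokens) {
    vocab_sketch vocab;
    for (size_t i = 0; i < tokens.size(); ++i) {
        vocab.token_to_id[tokens[i]] = static_cast<vocab_sketch::id>(i);
    }
    // Size the vector before writing by index, as the follow-up fix in the commit does.
    vocab.id_to_token.resize(vocab.token_to_id.size());
    for (const auto & kv : vocab.token_to_id) {
        vocab.id_to_token[kv.second] = kv.first;
    }
    return vocab;
}
```

Because token ids form a contiguous range starting at zero, a vector gives constant-time id lookups without the node allocations of `std::map`.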
Matvey Soloviev
5b330f1320 Fix color codes emitting mid-UTF8 code. (#312) 2023-03-21 19:11:01 +02:00
comex
bbc3cd7558 Importer for GPTQ quantized LLaMA models (#301)
* [WIP, broken] Importer for GPTQ quantized LLaMA models

Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa

Current status: Something is busted.  The output starts out decent, but
quickly degrades into gibberish.  This doesn't happen with either the
original GPTQ-for-LLaMa using the same weights, or llama.cpp when using
weights quantized by its own quantizer.  Is there a bug in the
conversion script that somehow only comes into play with a large context
size?

I did notice one potential issue.  It's clearly not the main cause of
the gibberish, since it doesn't happen when using q4_1 weights quantized
by llama.cpp itself, but it seems concerning.  When doing a matrix
multiplication of f16 * f32 => f32 or q4_1 * f32 => f32, at least when
the multiplication is not done with BLAS, the intermediate results are
stored in the smaller format rather than f32.  This seems like an
unnecessary waste of precision, especially in the q4_1 case.

I was originally hoping to validate the results by matching the Python
implementation's output exactly, but precision and non-associativity
issues make this very difficult, including when performing matrix
multiplications and, especially, computing norms.

Anyway, design details:

The models being imported store per-layer weights in essentially q4_1
format, although the addend and scale are shared across an entire row
rather than every group of 32 weights.  This script duplicates the
addend and scale to match ggml's expectations, at the cost of wasting
some memory.

However, there are two differences which I accommodated by changing the
output format (and adding corresponding support to main.cpp) rather than
having the script match the existing one:

- The tok_embeddings and output weights (i.e. the weights that aren't
  per-layer) are f16 instead of q4_1.  They could be converted to q4_1,
  and the impact of the loss of precision would probably be low, but
  this would rule out exactly matching the Python implementation's
  output for validation.

- There is no sharding, since the input doesn't have it, and for a
  CPU-only implementation it seems more useful to avoid having to deal
  with multiple files.

The new format is differentiated from existing q4_1 format by changing
the 'f16' header flag to a new value, 4.  That said, I think a cleaner
approach would be to change main.cpp to support loading each tensor with
an arbitrary sharding configuration and type rather than hardcoding
specific combinations of types.  So far I've wasted too much time
debugging to try implementing this...

* Add missing permutation.  Now it works.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 18:42:25 +02:00
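
The duplication of row-wide parameters described in bbc3cd7558 might be sketched like this. All names here (`block_params`, `expand_row_params`, `dequantize`) are hypothetical, and the sketch assumes a q4_1-style reconstruction of the form `scale * q + addend` with 32 weights per group; the actual importer is a Python script and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-group parameters in the spirit of q4_1: weight ~= scale * q + addend.
struct block_params {
    float scale;
    float addend;
};

const size_t GROUP_SIZE = 32; // weights per group, as stated in the commit message

// Hypothetical helper: copy one row-wide (scale, addend) pair into every
// group of the row, trading some memory for a uniform per-group layout.
std::vector<block_params> expand_row_params(float row_scale, float row_addend, size_t n_cols) {
    assert(n_cols % GROUP_SIZE == 0); // assumes rows hold a whole number of groups
    return std::vector<block_params>(n_cols / GROUP_SIZE, block_params{row_scale, row_addend});
}

// Reconstruct one 4-bit value (0..15) using its group's parameters.
float dequantize(unsigned q, const block_params & p) {
    return p.scale * static_cast<float>(q) + p.addend;
}
```

A row of 128 weights, for example, ends up with four identical parameter pairs, which is the memory overhead the commit message mentions.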
Gary Linscott
4458de7520 Compute perplexity over prompt (#270)
* Compute perplexity over prompt

* More accurate perplexity calculation - over all logits in the context window (so 512x more tokens!)

* Output all perplexities

* Add timing/ETA
2023-03-21 18:27:42 +02:00
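
For reference, perplexity as computed in 4458de7520 is conventionally defined as the exponential of the mean negative log-likelihood of the observed tokens. The sketch below shows that standard formula only; the function name and the choice to pass probabilities directly are assumptions, and the commit's own code works from logits over the context window.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Perplexity = exp(-(1/N) * sum(log p_i)), where p_i is the probability
// the model assigned to the i-th token that actually occurred.
double perplexity(const std::vector<double> & token_probs) {
    assert(!token_probs.empty());
    double nll = 0.0; // accumulated negative log-likelihood
    for (double p : token_probs) {
        nll += -std::log(p);
    }
    return std::exp(nll / static_cast<double>(token_probs.size()));
}
```

A model that assigns probability 0.25 to every observed token has a perplexity of 4, which matches the intuition of choosing uniformly among four options.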
anzz1
1afe9491f6 Enable ANSI colors on Windows 10+ (#311)
* Enable ANSI colors on Windows 10+

On older versions the function will silently fail without any ill effects

* Do not call SetConsoleMode if the mode is already set

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 18:14:46 +02:00
tjohnman
f571f4b61c Check for reverse prompt by characters instead of tokens (#292) (#330)
* Check for reverse prompt by characters instead of tokens (#292)

* Update main.cpp

Wording.

* Cleanup.

* Remove unnecessary use of std::stringstream.

---------

Co-authored-by: Johnman <tjohnman@github>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 18:04:43 +02:00
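
A character-level reverse prompt check, as named in f571f4b61c, could look roughly like the sketch below. The helper `ends_with_reverse_prompt` is hypothetical. Presumably the motivation is that the same text can be split into tokens in more than one way, so comparing token ids can miss a match that is visible in the decoded output.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if the generated text ends with any of the
// configured reverse prompts, compared as characters rather than token ids.
bool ends_with_reverse_prompt(const std::string & output,
                              const std::vector<std::string> & reverse_prompts) {
    for (const std::string & rp : reverse_prompts) {
        if (!rp.empty() && output.size() >= rp.size() &&
            output.compare(output.size() - rp.size(), rp.size(), rp) == 0) {
            return true;
        }
    }
    return false;
}
```

Taking a list of strings also covers the case of several reverse prompts being active at once.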
Georgi Gerganov
614b1afa1c Fix convert script, warnings alpaca instructions, default params 2023-03-21 17:59:16 +02:00
anzz1
a2c6b64d8d cmdline option for custom amount of model parts (--n_parts N) (#348)
* cmdline option for custom amount of model parts (--n_parts N)

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 17:42:43 +02:00
Georgi Gerganov
26d3089d76 Add tokenizer test + revert to C++11 (#355)
* Add test-tokenizer-0 to do a few tokenizations - feel free to expand
* Added option to convert-pth-to-ggml.py script to dump just the vocabulary
* Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)
* Added utility to load vocabulary file from previous point (temporary implementation)
* Avoid using std::string_view and drop back to C++11 (hope I didn't break something)
* Rename gpt_vocab -> llama_vocab
* All CMake binaries go into ./bin/ now
2023-03-21 17:29:41 +02:00
Mack Straight
1130f96c88 move file magic/version to header, print expected version (#319) 2023-03-20 19:26:01 +00:00
Mack Straight
60d93896be sentencepiece bpe compatible tokenizer (#252)
* potential out of bounds read

* fix quantize

* style

* Update convert-pth-to-ggml.py

* mild cleanup

* don't need the space-prefixing here rn since main.cpp already does it

* new file magic + version header field

* readme notice

* missing newlines

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-03-20 03:17:23 -07:00
cocktailpeanut
b80cecdcc4 bugfix: default should not be interactive (#304) 2023-03-19 23:44:20 +02:00
Rickey Bowers Jr
fa0645a55c fix coloring of last n_batch of prompt, and refactor line input (#221)
* fix coloring of last `n_batch` of prompt, and refactor line input
* forgot the newline that needs to be sent to the model
* (per #283) try to force flush of color reset in SIGINT handler
2023-03-19 19:44:30 +00:00
tjohnman
3a9d372c08 Support for multiple reverse prompts. (#299)
Co-authored-by: Johnman <>
Co-authored-by: Johnman <tjohnman@github>
2023-03-19 21:33:06 +02:00
tjohnman
eacfb91e66 Make prompt randomization optional. (#300)
Co-authored-by: Johnman <>
2023-03-19 20:36:19 +02:00
tjohnman
c98f6cd405 Respect the maximum number of tokens in interactive. (#298)
Co-authored-by: Johnman <johnman@github>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-19 20:31:17 +02:00
slaren
d51eb6a70f Add --ignore-eos parameter (#181)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-19 20:22:48 +02:00
Qingyou Meng
d0eef77580 interactive mode: print '\n' in sigint_handler; this flushes stdout, thus ensuring color reset. (#283) 2023-03-19 20:10:00 +02:00
Erik Scholz
30ea62e25d Command line switch to use F16 for memory_k and memory_v (refactor of #154) (#294)
* Use F16 for memory_k and memory_v

* add command line switch to use f16 instead of f32 for memory k+v

---------

Co-authored-by: Ty Everett <ty@tyweb.us>
2023-03-19 19:57:00 +02:00
Georgi Gerganov
9b87f28d8d Fix off-by-one bug (#115) 2023-03-19 19:46:32 +02:00
Georgi Gerganov
5ff8d6f48e Drop trailing new line from file prompts (#80) 2023-03-19 19:05:04 +02:00
Georgi Gerganov
2ad520e892 Add "--instruct" argument for usage with Alpaca (#240)
Also start adding prompts in "./prompts"
2023-03-19 18:37:02 +02:00
Ronsor
0334548e49 Warn user if a context size greater than 2048 tokens is specified (#274)
LLaMA doesn't support more than 2048 token context sizes, and going above that produces terrible results.
2023-03-18 20:10:47 -04:00
Alex Nguyen
2e061e8283 Remove unused code since n_vocab is model.hparams.n_vocab (#262) 2023-03-18 13:51:49 +00:00
Justin Suess
62c6897bc1 fixed warning with std::ignore about unused function result (#151)
2023-03-18 11:44:09 +00:00
thement
95eab2152b Implement non-greedy tokenizer that tries to maximize token lengths (#242)
* Implement non-greedy tokenizer that tries to maximize token lengths

* Insert single space in front of the prompt

- this is to match original llama tokenizer behavior

---------

Co-authored-by: Jakub Horak <jakub.horak@ibawizard.net>
2023-03-17 21:05:58 +01:00
hoangmit
12b9bd9b13 Add RMS norm and use it (#187)
* add ggml_rms_norm

* update op num
2023-03-16 00:41:38 +02:00
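
As background for 12b9bd9b13, RMS normalization divides each element by the root mean square of the vector, without subtracting the mean as layer normalization does. The sketch below is a standalone illustration on `std::vector<float>`, not the `ggml_rms_norm` implementation; the function signature and the epsilon default are assumptions, and any learned gain is left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMS normalization: y_i = x_i / sqrt(mean(x^2) + eps).
// The eps default is an assumed value for numerical stability.
std::vector<float> rms_norm(const std::vector<float> & x, float eps = 1e-6f) {
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += static_cast<double>(v) * v;
    }
    const double mean_sq = x.empty() ? 0.0 : sum_sq / static_cast<double>(x.size());
    const float scale = static_cast<float>(1.0 / std::sqrt(mean_sq + eps));

    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale;
    }
    return y;
}
```

After normalization the mean of the squared outputs is approximately 1, whatever the scale of the input.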
Rickey Bowers Jr
f88e2693cc add SIGINT support for _WIN32 environments (#120)
* add SIGINT support for _WIN32 environments

* perhaps more consistent
2023-03-15 21:56:24 +02:00
Justin Suess
a4d17b7096 added ctx_size parameter (#148)
* added ctx_size parameter

* added it in more places

* Apply suggestions from code review

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-15 21:42:40 +02:00
Justin Suess
3d4b93a8d4 fixed color reset on exit (#149)
* fixed color reset on exit

* added sigint handler for ansi_color_reset

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-15 21:39:38 +02:00
Georgi Gerganov
222ee5f918 Print system information 2023-03-13 19:15:08 +02:00
Pavol Rusnak
e429f5b9e0 Use fprintf for diagnostic output (#48)
keep printf only for printing model output

one can now use ./main ... 2>/dev/null to suppress any diagnostic output
2023-03-13 18:39:56 +02:00
uint256_t
a81c113197 Reduce model loading time (#43)
* Use buffering

* Use vector

* Minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-13 18:33:43 +02:00
Val Kharitonov
d35d36dff8 Fix UTF-8 handling (including colors) (#79) 2023-03-13 18:24:18 +02:00
Matvey Soloviev
00be0e42e4 Gate signal support on being on a unixoid system. (#74) 2023-03-13 04:08:01 +01:00
Matvey Soloviev
a30749e299 Fix token count accounting 2023-03-13 01:04:41 +01:00
Matvey Soloviev
fedc405b41 Fix color getting reset before prompt output done (#65)
(cherry picked from commit 7eb2987619feee04c40eff69b604017d09919cb6)
2023-03-13 00:07:34 +02:00
Matvey Soloviev
d35528087e Add interactive mode (#61)
* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
2023-03-12 23:13:28 +02:00
beiller
c763dc1bc2 Add back top_k (#56)
* Add back top_k

* Update utils.cpp

* Update utils.h

---------

Co-authored-by: Bill Hamilton <bill.hamilton@shopify.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-12 22:23:15 +02:00
Sebastián A
fde84afbed Windows fixes (#31)
* Apply fixes suggested to build on windows

Issue: https://github.com/ggerganov/llama.cpp/issues/22

* Remove unsupported VLAs

* MSVC: Remove features that are only available on MSVC C++20.

* Fix zero initialization of the other fields.

* Change the use of vector for stack allocations.
2023-03-12 22:15:00 +02:00
beiller
a63a748bba Add repetition penalty (#20)
* Adding repeat penalization

* Update utils.h

* Update utils.cpp

* Numeric fix

Should probably still scale by temp even if penalized

* Update comments, more proper application

I see that numbers can go negative so a fix from a referenced commit

* Minor formatting

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-12 11:27:42 +02:00
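
The notes in a63a748bba about still scaling by temperature and about negative values suggest logic along the following lines. This is a sketch under assumptions: the helper name, its signature, and the linear search over recent tokens are illustrative and not copied from utils.cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: scale every logit by 1/temp and additionally
// penalize tokens that appear in the recent-token window.
std::vector<float> apply_repeat_penalty(const std::vector<float> & logits,
                                        const std::vector<int> & last_n_tokens,
                                        float repeat_penalty, float temp) {
    const float scale = 1.0f / temp;
    std::vector<float> out(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        const bool seen = std::find(last_n_tokens.begin(), last_n_tokens.end(),
                                    static_cast<int>(i)) != last_n_tokens.end();
        if (!seen) {
            out[i] = logits[i] * scale;
        } else if (logits[i] < 0.0f) {
            // Dividing a negative logit would raise it, so multiply instead.
            out[i] = logits[i] * scale * repeat_penalty;
        } else {
            out[i] = logits[i] * scale / repeat_penalty;
        }
    }
    return out;
}
```

With a penalty greater than 1, a repeated token's logit always moves toward less likely, regardless of its sign.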
Georgi Gerganov
5afe16962e Bump memory buffer 2023-03-11 12:45:01 +02:00
Georgi Gerganov
a2799521b9 Support all LLaMA models + change Q4_0 quantization storage 2023-03-11 11:28:30 +02:00
Georgi Gerganov
8453184bb2 Fix a bug in the rope calculation 2023-03-10 23:46:57 +02:00