ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-11 16:40:16 +00:00

Author	SHA1	Message	Date
Ron Evans	66990fbcc9	examples : add llama_init_from_gpt_params() common function (#1290) Signed-off-by: deadprogram <ron@hybridgroup.com>	2023-05-02 23:39:51 +03:00
Georgi Gerganov	922abdc9c8	llama : fix compile warnings	2023-05-02 23:09:08 +03:00
Ron Evans	ae3d46ba30	examples : improve vertical alignment of a few variables (#1286) Signed-off-by: deadprogram <ron@hybridgroup.com>	2023-05-02 20:53:52 +03:00
Robert Brisita	19855ef65a	llama : allow 0 as a seed number. (#1275)	2023-05-02 19:23:44 +03:00
Ron Evans	890730a4de	main : switch input_noecho to input_echo to remove negation (#979) Signed-off-by: deadprogram <ron@hybridgroup.com>	2023-05-02 19:13:26 +03:00
DannyDaemonic	8f29a9ed07	Add git-based build information for better issue tracking (#1232) * Add git-based build information for better issue tracking * macOS fix * "build (hash)" and "CMAKE_SOURCE_DIR" changes * Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages * Fix conditional dependency on missing target * Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile * 4 space indenting for cmake, attempt to clean up my mess in Makefile * Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it	2023-05-01 18:23:47 +02:00
Georgi Gerganov	e1103bdf91	llama : fix session load / save (#1263)	2023-05-01 14:54:59 +03:00
jon-chuang	90c41e9a61	common : better default number of threads (#934) * commit * fix * try-catch * apply code review * improve * improve * add macos headers * done * remove color * fix windows * minor * fix * Apply suggestions from code review Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com> * remove * minor * minor --------- Co-authored-by: jon-chuang <jon-chuang@users.noreply.github.com> Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>	2023-04-30 21:41:35 +03:00
Stephan Walter	8ceb2442b7	Various fixes to mat_mul benchmark (#1253)	2023-04-30 12:32:37 +00:00
Georgi Gerganov	91dc074524	build : fix reference to old llama_util.h	2023-04-29 13:53:12 +03:00
Georgi Gerganov	d483156c17	examples : fix save-load-state + rename llama-util.h	2023-04-29 13:48:11 +03:00
Georgi Gerganov	1fb89c9913	common : change default parameters to pre-#1126 (#1223)	2023-04-29 09:51:06 +03:00
Ivan Stepanov	d00422bd62	llama : new sampling algorithms (#1126 ) * Sample interface, new samplers. New samplers: - locally typical sampling - tail free sampling - frequency and presence penalty - mirostat Ignore EOS fix: -inf should be used. * mirostat * Added --logit-bias and --no-penalize-nl, removed std::span * Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and k) Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and k) * Save and load example adjust * Tests * Windows build fix * Windows test fix	2023-04-29 08:34:41 +03:00
Stephan Walter	4aeea780bc	Remove Q4_3 which is no better than Q5 (#1218)	2023-04-28 23:10:43 +00:00
CRD716	c1343a8d6b	examples : add Jeopardy example (#1168) * Basic Setup * Prevent Results.txt from coming up * Prefixes, Line separators, etc * editorcheck * introduction to give more consistent results * Basic graph thing * Grading, ready for testing! * Y'all ready to get funky? * fix column removal stuff * missed a few	2023-04-28 19:13:33 +03:00
Evan Jones	66ad009c5e	llama : add session file format and saved sessions in main (#1169)	2023-04-28 18:59:37 +03:00
Georgi Gerganov	8e0ec92ace	ggml : add Q5_0 and Q5_1 quantization (#1187 ) * ggml : add Q5_0 quantization (cuBLAS only) * ggml : fix Q5_0 qh -> uint32_t * ggml : fix q5_0 histogram stats * ggml : q5_0 scalar dot product * ggml : q5_0 ARM NEON dot * ggml : q5_0 more efficient ARM NEON using uint64_t masks * ggml : rename Q5_0 -> Q5_1 * ggml : adding Q5_0 mode * quantize : add Q5_0 and Q5_1 to map * ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195) --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-26 23:14:13 +03:00
Pavol Rusnak	9e63abecf7	quantize : use `map` to assign quantization type from `string` (#1191 ) instead of `int` (while `int` option still being supported) This allows the following usage: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0` instead of: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`	2023-04-26 18:43:27 +02:00
Georgi Gerganov	17cded0679	ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179 ) * ggml : add Q8_0 quantization format (rename the old one to Q8_1) * tests : fix test-quantize-fns * ggml : finalize Q8_0 implementation * ggml : use q4_0_q8_0 and q4_2_q8_0 * ggml : fix Q8_0 dot product bug (ARM) * ggml : Q8_0 unroll x2 * ggml : fix bug - using wrong block type * ggml : extend quantize_fns_t with "vec_dot_type" * ggml : fix Q8_0 to use 255 values out of 256 * ggml : fix assert using wrong QK4_2 instead of QK4_3	2023-04-25 23:40:51 +03:00
xaedes	0cad6d7c3a	examples : add save_load_state example (#1150) * add save_load_state example * use <cstdio> instead of <iostream> and fprintf / printf instead of cout * renamed save-load-state example files replacing underscores by dashes	2023-04-24 19:23:31 +03:00
mgroeber9110	094250bc21	examples/main README improvements and some light refactoring (#1131)	2023-04-24 15:45:32 +00:00
slaren	7de8c169ec	Fix LoRA acronym (#1145)	2023-04-23 23:03:44 +02:00
DannyDaemonic	2e00da6d52	Added README.md for main with examples and explanations (#1139)	2023-04-23 15:37:02 +00:00
Stephan Walter	949ca5ce05	Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122)	2023-04-22 10:54:13 +00:00
wbpxre150	74cec44258	llama : print timings on ctrl+c exit (#1021) * print timings on ctrl+c exit * remove redundant free memory call. * add global pointer to ctx.	2023-04-22 11:56:35 +03:00
eiery	0f230cce20	llama : have n_batch default to 512 (#1091) * set default n_batch to 512 when using BLAS * spacing * alternate implementation of setting different n_batch for BLAS * set n_batch to 512 for all cases	2023-04-22 11:27:05 +03:00
Clint Herron	81fc78220f	examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107) * Moving parameters to separate lines for readability. * Increasing repeate_penalty to 1.1 to make alpaca more usable by default. * Adding trailing newline.	2023-04-22 09:54:33 +03:00
Alex Klinkhamer	f6c87bbcaa	main : evaluate tokens in batches after swapping context (#1014) * examples : evaluate tokens in batches after swapping context * Update examples/main/main.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-21 21:18:09 +03:00
slaren	b9387e79c4	Show perplexity ETA in hours and minutes (#1096)	2023-04-21 14:57:57 +02:00
Kawrakow	6e34a4c7c8	llama : multi-threaded quantization (#1075 ) * Multi-threading quantization. Not much gain for simple quantizations, bit it will be important for quantizations that require more CPU cycles. * Multi-threading for quantize-stats It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2. * Reviewer comments * Avoiding compiler confusion After changing chunk_size to const int as suggested by @ggerganov, clang and GCC starting to warn me that I don't need to capture it in the lambda. So, I removed it from the capture list. But that makes the MSVC build fail. So, making it a constexpr to make every compiler happy. * Still fighting with lambda captures in MSVC --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-20 20:42:27 +03:00
Georgi Gerganov	0a8cdb2ea1	ggml : add Q4_3 quantization (#1082)	2023-04-20 20:35:53 +03:00
Georgi Gerganov	4207b4b129	ggml : add new Q4_2 quantization (ARM only) (#1046 ) * ggml : Q4_2 ARM * ggml : add ggml_is_quantized() * llama : update llama_type_name() with Q4_2 entry * ggml : speed-up q4_2 - 4 threads: ~100ms -> ~90ms - 8 threads: ~55ms -> ~50ms * ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32	2023-04-18 23:54:57 +03:00
slaren	dc0fa95077	Add LoRA support (#820)	2023-04-17 17:28:55 +02:00
Georgi Gerganov	9bed0fc823	quantize-stats : fix bug in --type argument	2023-04-17 17:31:06 +03:00
Pavol Rusnak	c5cb4f71c6	examples: add missing <ctime> include for time() (#1011)	2023-04-16 10:13:00 +00:00
Ivan Komarov	ae921afa4a	benchmark : fix result validation in benchmark-q4_0-matmult (#987)	2023-04-15 08:51:54 +03:00
Pavol Rusnak	87bb0f74bd	Revert "main : alternative instruct mode (Vicuna support, etc.) (#863)" (#982) This reverts commit `f4d277ae17`.	2023-04-14 22:58:43 +03:00
Pavol Rusnak	5de4cdb1fb	Expose type name from ggml (#970) Avoid duplication of type names in utils Co-authored-by: Håkon H. Hitland <haakon@likedan.net>	2023-04-14 20:05:37 +02:00
Tomáš Pazdiora	e5fe7fa65c	main : alternative instruct mode (Vicuna support, etc.) (#863 ) * Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt * Add multiline mode, update text input. * bugfix * update implementation * typos * Change --multiline implementation to be toggled by EOF. * bugfix * default multiline mode * add more configs * update formating * update formatting * apply suggestions	2023-04-14 18:19:17 +03:00
Gary Linscott	5d44c13ecb	perplexity : add support for batch size to `--perplexity` (#407 ) * Add support to batch size for perplexity * Revert "Fix memory allocation issues and seg faults" This reverts commit `4870e455b3`. * update from merge * Remove perplexity from main * updates * Update batch size for efficiency	2023-04-14 00:50:42 +03:00
CRD716	83af18bc18	common : remove unnecessary includes (#947)	2023-04-13 18:39:25 +03:00
Georgi Gerganov	d9dff86873	llama : merge llama_internal.h into llama.h Hide it behind an #ifdef	2023-04-13 18:04:45 +03:00
CRD716	62addfa732	fix whitespace (#944)	2023-04-13 16:03:57 +02:00
niansa/tuxifan	9a5c5d1e92	examples : add -n to alpaca and gpt4all scripts (#706)	2023-04-13 16:03:39 +03:00
SebastianApel	45a86141bb	benchmark : add tool for timing q4_0 matrix multiplication (#653 ) * Initial version of q4_0 matrix multiplication benchmark * Bugfix: Added dependency to ggml.o to benchmark * Reviewer requests: added parameter for threads, switched to ggml_time_us() * Reviewer input: removed rtsc, use epsilon for check * Review comment: Removed set_locale * Feature: Param for numer of iterations, Bugfix for use of parameter threads * Reviewer suggestion: Moved to examples * Reviewer feedback: Updated clean: and benchmark: sections --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-13 15:46:23 +03:00
Pavol Rusnak	e4d3b4b251	Fix whitespace, add .editorconfig, add GitHub workflow (#883)	2023-04-11 19:45:44 +00:00
Stephan Walter	c8296315db	Add enum llama_ftype, sync ggml_type to model files (#709)	2023-04-11 15:03:51 +00:00
comex	74ad81e8a0	Windows fixes (#890 ) Mostly for msys2 and mingw64 builds, which are different from each other and different from standard Visual Studio builds. Isn't Windows fun? - Define _GNU_SOURCE in more files (it's already used in ggml.c for Linux's sake). - Don't use PrefetchVirtualMemory if not building for Windows 8 or later (mingw64 doesn't by default). But warn the user about this situation since it's probably not intended. - Check for NOMINMAX already being defined, which it is on mingw64. - Actually use the `increment` variable (bug in my `pizza` PR). - Suppress unused variable warnings in the fake pthread_create and pthread_join implementations for Windows. - (not Windows-related) Remove mention of `asprintf` from comment; `asprintf` is no longer used. Fixes #871.	2023-04-11 15:19:54 +02:00
comex	84cfa98c43	Rewrite loading code to try to satisfy everyone: - Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.) - Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported). - Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`. - Improve validation and error checking. - Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front). - Support VirtualLock on Windows (using the same `--mlock` option as on Unix). - Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...) - To help implement this, move mlock support from ggml to the loading code. - madvise/PrefetchVirtualMemory support (based on #740) - Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap). - Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way'). Implementation notes: I tried to factor the code into more discrete pieces than before. Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly: - Destructors to make it easier to ensure everything gets cleaned up. - Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. The exceptions are converted to error codes at the API boundary.) Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)	2023-04-10 01:10:46 +02:00
Tomáš Pazdiora	ecc4a042a0	fix for windows utf-8 input (#840) Use UTF-16 as input on Windows, since UTF-8 does not work and reads multibyte characters as zeros	2023-04-08 17:49:39 +02:00

79 Commits