ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-12 00:50:22 +00:00

Author	SHA1	Message	Date
源文雨	b6c1bfc960	fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080 )	2023-04-20 15:28:43 +02:00
Stephan Walter	091a53228c	AVX2 optimization for vec_dot_q4_2_q8_0 (#1068 )	2023-04-20 08:45:41 +02:00
slaren	881ecfb4ef	Improve cuBLAS performance by dequantizing on the GPU (#1065 )	2023-04-20 03:14:14 +02:00
CRD716	7ecc2d9e42	Minor: Readme fixed grammar, spelling, and misc updates (#1071 )	2023-04-19 19:52:14 +00:00
Kawrakow	e0e10251a3	Q4_2 quantization with rmse-optimized scale and quants (#1062 ) * Q4_2 quantization with rmse-optimized scale and quants For quantize-stats we get q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012 For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks. Quantization is slow (~90 seconds on my Mac for 7B) as not multi-threaded as in PR #896. * ggml : satisfy the sanitizer builds Not sure why this makes them fail * Better follow ggml conventions for function names * Fixed type as per reviewer comment --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-19 20:20:14 +02:00
Georgi Gerganov	73a59affb2	ggml : use 8-bit precision for Q4_1 intermediate results (#1047 ) * ggml : use 8-bit precision for Q4_1 intermediate results (ARM) * ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmlaq_n_f32 56 ms/token with Q4_1 ! * ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051) * gitignore : ignore ppl-*.txt files --------- Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>	2023-04-19 20:10:08 +03:00
Georgi Gerganov	068083ca76	readme : add warning about Q4_2 and Q4_3	2023-04-19 19:07:54 +03:00
Stephan Walter	ec0e355be1	ggml : Q4 cleanup - remove 4-bit dot product code (#1061 ) * Q4 cleanup * Remove unused AVX512 Q4_0 code	2023-04-19 19:06:37 +03:00
slaren	bc5977cc90	Add NVIDIA cuBLAS support (#1044 )	2023-04-19 11:22:45 +02:00
slaren	dee44e099f	Multi-threaded ggml_cpy (#1035 ) * Multi-threaded ggml_cpy * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Also fix wdata offset in ggml_compute_forward_add_q_f32 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-19 00:53:24 +02:00
Georgi Gerganov	4207b4b129	ggml : add new Q4_2 quantization (ARM only) (#1046 ) * ggml : Q4_2 ARM * ggml : add ggml_is_quantized() * llama : update llama_type_name() with Q4_2 entry * ggml : speed-up q4_2 - 4 threads: ~100ms -> ~90ms - 8 threads: ~55ms -> ~50ms * ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32	2023-04-18 23:54:57 +03:00
Georgi Gerganov	5eaa6d25cf	ggml : scratch that - vmlaq_n_f32 is always better Had a background process that was messing with the timings	2023-04-18 23:11:23 +03:00
Georgi Gerganov	998b1d59c8	gitignore : vdot	2023-04-18 23:00:08 +03:00
Georgi Gerganov	47aacf4239	ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators	2023-04-18 22:59:17 +03:00
Kawrakow	684aeeb3e0	Adding a simple program to measure speed of dot products (#1041 ) On my Mac, the direct Q4_1 product is marginally slower (~69 vs ~55 us for Q4_0). The SIMD-ified ggml version is now almost 2X slower (~121 us). On a Ryzen 7950X CPU, the direct product for Q4_1 quantization is faster than the AVX2 implementation (~60 vs ~62 us). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-04-18 19:00:14 +00:00
Georgi Gerganov	426d0c45f4	readme : update hot topics about new LoRA functionality	2023-04-18 20:10:26 +03:00
Georgi Gerganov	b2ef9f4eae	ci : do not run on drafts	2023-04-18 19:57:06 +03:00
Ivan Komarov	b1f527be59	Do not close file after mmap (Windows version) (#1034 )	2023-04-18 03:15:50 +02:00
Atsushi Tatsuma	3c9a24cc72	readme : add Ruby bindings (#1029 )	2023-04-17 22:34:35 +03:00
Cameron	5e5d6ffdaa	add 4_0 to default outfile namestr dict (#1031 ) this came up when trying to convert the gpt4all-lora-unfiltered-quantized.bin file	2023-04-17 20:26:23 +02:00
slaren	dc0fa95077	Add LoRA support (#820 )	2023-04-17 17:28:55 +02:00
Arik Poznanski	368d63e55f	llama : well-defined static initialization of complex objects (#927 ) * Replaced static initialization of complex objects with an initialization on first use. This prevents undefined behavior at program run time, for example, a crash in a Release build that does not occur in a Debug build * replaced use of auto with exact type to avoid using -std=c++14 * Made the accessor functions for static maps be static const	2023-04-17 17:41:53 +03:00
Georgi Gerganov	9bed0fc823	quantize-stats : fix bug in --type argument	2023-04-17 17:31:06 +03:00
Georgi Gerganov	42ea22af13	ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c	2023-04-17 16:16:23 +03:00
Ivan Komarov	fb550a0f64	Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933 )	2023-04-17 15:10:57 +02:00
slaren	b5fefdd2a8	Fix: do not close file on mmap (#1017 )	2023-04-16 21:27:38 +02:00
Georgi Gerganov	aa84f3b5d5	stdout : vertical align outputs for better readability	2023-04-16 13:59:27 +03:00
Pavol Rusnak	c5cb4f71c6	examples: add missing <ctime> include for time() (#1011 )	2023-04-16 10:13:00 +00:00
nanahi	598810e9c4	Fix msys2 build error and warnings (#1009 )	2023-04-16 11:13:42 +02:00
comex	a0909d9b15	convert.py: Fix loading safetensors and ggml format on Windows (#991 ) Calling `mmap.mmap` on Windows apparently resets the file offset of the raw file object (and makes the BufferedReader return a negative file offset). For safetensors, avoid using the file offset after calling mmap. For GGML format, explicitly save and restore the offset. Fixes #966.	2023-04-15 23:53:21 +02:00
Stephan Walter	0363b17de2	Fix potential int8 overflow in non-SIMD vec_dot (#986 )	2023-04-15 18:28:56 +00:00
Stephan Walter	378ffbab0e	Refactor ggml.c for future tensor types (#1001 )	2023-04-15 16:25:38 +00:00
Georgi Gerganov	053915a751	ggml : add Q8_0 quantization for intermediate results (#951 ) * ggml : add Q8_0 quantization for intermediate results * quantize-stats : fix test + add it to Makefile default * Q8: use int8_t, AVX/AVX2 optimizations * ggml : fix quantize_row_q8_0() ARM_NEON rounding * minor : updates after rebase to latest master * quantize-stats : delete obsolete strings * ggml : fix q4_1 dot func --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-15 17:53:22 +03:00
Georgi Gerganov	a15576393c	ggml : use posix_memalign on non-Windows env	2023-04-15 14:25:45 +03:00
Ivan Komarov	ae921afa4a	benchmark : fix result validation in benchmark-q4_0-matmult (#987 )	2023-04-15 08:51:54 +03:00
katsu560	05d97008d3	cmake : add finding the OpenBLAS header file (#992 )	2023-04-15 08:51:11 +03:00
Pavol Rusnak	87bb0f74bd	Revert "main : alternative instruct mode (Vicuna support, etc.) (#863 )" (#982 ) This reverts commit `f4d277ae17`.	2023-04-14 22:58:43 +03:00
Pavol Rusnak	66cf09af08	py : bump sentencepiece to 0.1.98 to support Python 3.11 (#976 )	2023-04-14 19:46:49 +00:00
Stephan Walter	50879f9f5b	make : fix dependencies, use auto variables (#983 )	2023-04-14 22:39:48 +03:00
Pavol Rusnak	5de4cdb1fb	Expose type name from ggml (#970 ) Avoid duplication of type names in utils Co-authored-by: Håkon H. Hitland <haakon@likedan.net>	2023-04-14 20:05:37 +02:00
Tomáš Pazdiora	e5fe7fa65c	main : alternative instruct mode (Vicuna support, etc.) (#863 ) * Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt * Add multiline mode, update text input. * bugfix * update implementation * typos * Change --multiline implementation to be toggled by EOF. * bugfix * default multiline mode * add more configs * update formating * update formatting * apply suggestions	2023-04-14 18:19:17 +03:00
Kerfuffle	0c6e3a6e6f	ggml : add unary and binary map operations (#874 ) * GGML map ops proof of concept. * Various cleanups. Add handling for task setting. Add handling for ggml_compute_backward. Rename functions to ggml_map_unary_f32 and ggml_map_binary_f32 Fix compiler warnings related to casting function pointers and `void *` Reorder functions and definitions based on the GGML op number. Use typedefs for map op function pointer types. Fix position of map ops cases in ggml_compute_forward	2023-04-14 17:43:55 +03:00
Pavol Rusnak	147f0b769c	py : cleanup dependencies (#962 ) after #545 we do not need torch, tqdm and requests in the dependencies	2023-04-14 15:37:11 +02:00
Pavol Rusnak	dd9b2450a6	py : fix flake8 and isort nitpicks (#960 )	2023-04-14 14:23:21 +02:00
Georgi Gerganov	64179095f2	ggml : minor	2023-04-14 13:31:29 +03:00
Georgi Gerganov	ebc6e99a4a	ggml : always allocate buffers with size multiple of GGML_MEM_ALIGN	2023-04-14 13:31:15 +03:00
comex	3573ed90b8	py : new conversion script (#545 ) Current status: Working, except for the latest GPTQ-for-LLaMa format that includes `g_idx`. This turns out to require changes to GGML, so for now it only works if you use the `--outtype` option to dequantize it back to f16 (which is pointless except for debugging). I also included some cleanup for the C++ code. This script is meant to replace all the existing conversion scripts (including the ones that convert from older GGML formats), while also adding support for some new formats. Specifically, I've tested with: - [x] `LLaMA` (original) - [x] `llama-65b-4bit` - [x] `alpaca-native` - [x] `alpaca-native-4bit` - [x] LLaMA converted to 'transformers' format using `convert_llama_weights_to_hf.py` - [x] `alpaca-native` quantized with `--true-sequential --act-order --groupsize 128` (dequantized only) - [x] same as above plus `--save_safetensors` - [x] GPT4All - [x] stock unversioned ggml - [x] ggmh There's enough overlap in the logic needed to handle these different cases that it seemed best to move to a single script. I haven't tried this with Alpaca-LoRA because I don't know where to find it. Useful features: - Uses multiple threads for a speedup in some cases (though the Python GIL limits the gain, and sometimes it's disk-bound anyway). - Combines split models into a single file (both the intra-tensor split of the original and the inter-tensor split of 'transformers' format files). Single files are more convenient to work with and more friendly to future changes to use memory mapping on the C++ side. To accomplish this without increasing memory requirements, it has some custom loading code which avoids loading whole input files into memory at once. - Because of the custom loading code, it no longer depends on PyTorch, which might make installing dependencies slightly easier or faster... although it still depends on NumPy and sentencepiece, so I don't know if there's any meaningful difference. In any case, I also added a requirements.txt file to lock the dependency versions in case of any future breaking changes. - Type annotations checked with mypy. - Some attempts to be extra user-friendly: - The script tries to be forgiving with arguments, e.g. you can specify either the model file itself or the directory containing it. - The script doesn't depend on config.json / params.json, just in case the user downloaded files individually and doesn't have those handy. But you still need tokenizer.model and, for Alpaca, added_tokens.json. - The script tries to give a helpful error message if added_tokens.json is missing.	2023-04-14 10:03:03 +03:00
Georgi Gerganov	d7f330d1c4	ggml : fix q4_1 dot product types	2023-04-14 09:45:42 +03:00
Howard Su	e0dbf8218f	ggml : optimize rope function to avoid call powf in the tight loop (#807 )	2023-04-14 09:24:52 +03:00
Gary Linscott	5d44c13ecb	perplexity : add support for batch size to `--perplexity` (#407 ) * Add support to batch size for perplexity * Revert "Fix memory allocation issues and seg faults" This reverts commit `4870e455b3`. * update from merge * Remove perplexity from main * updates * Update batch size for efficiency	2023-04-14 00:50:42 +03:00

2639 Commits