Commit Graph

397 Commits

Author SHA1 Message Date
Georgi Gerganov
9e824bf15c ggml : sync ggml (add GPT-NeoX RoPE implementation) 2023-04-20 23:32:59 +03:00
Georgi Gerganov
f04613a668 ggml : fix bug in ggml_compute_forward_dup_f32() 2023-04-20 21:58:38 +03:00
slaren
dc2fb22941 Add Q4_3 support to cuBLAS (#1086) 2023-04-20 20:49:53 +02:00
Georgi Gerganov
7a693926b8 ggml : do not break cuBLAS build (Q4_3 is not yet implemented) 2023-04-20 21:43:50 +03:00
Georgi Gerganov
c3aa2316ac ggml : fix Q4_3 quantization
Broke it during conflict resolution in last PR
2023-04-20 20:44:05 +03:00
Kawrakow
6e34a4c7c8 llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, bit it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC starting to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-20 20:42:27 +03:00
Georgi Gerganov
0a8cdb2ea1 ggml : add Q4_3 quantization (#1082) 2023-04-20 20:35:53 +03:00
Ivan Komarov
59f4d32a01 ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074)
[Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed.

This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).
2023-04-20 18:15:18 +03:00
源文雨
b6c1bfc960 fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080) 2023-04-20 15:28:43 +02:00
Stephan Walter
091a53228c AVX2 optimization for vec_dot_q4_2_q8_0 (#1068) 2023-04-20 08:45:41 +02:00
slaren
881ecfb4ef Improve cuBLAS performance by dequantizing on the GPU (#1065) 2023-04-20 03:14:14 +02:00
CRD716
7ecc2d9e42 Minor: Readme fixed grammar, spelling, and misc updates (#1071) 2023-04-19 19:52:14 +00:00
Kawrakow
e0e10251a3 Q4_2 quantization with rmse-optimized scale and quants (#1062)
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) as not
multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 20:20:14 +02:00
Georgi Gerganov
73a59affb2 ggml : use 8-bit precision for Q4_1 intermediate results (#1047)
* ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

* ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmalq_n_f32

56 ms/token with Q4_1 !

* ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

* gitignore : ignore ppl-*.txt files

---------

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-04-19 20:10:08 +03:00
Georgi Gerganov
068083ca76 readme : add warning about Q4_2 and Q4_3 2023-04-19 19:07:54 +03:00
Stephan Walter
ec0e355be1 ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
2023-04-19 19:06:37 +03:00
slaren
bc5977cc90 Add NVIDIA cuBLAS support (#1044) 2023-04-19 11:22:45 +02:00
slaren
dee44e099f Multi-threaded ggml_cpy (#1035)
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-19 00:53:24 +02:00
Georgi Gerganov
4207b4b129 ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
2023-04-18 23:54:57 +03:00
Georgi Gerganov
5eaa6d25cf ggml : scratch that - vmlaq_n_f32 is always better
Had a background process that was messing with the timings
2023-04-18 23:11:23 +03:00
Georgi Gerganov
998b1d59c8 gitignore : vdot 2023-04-18 23:00:08 +03:00
Georgi Gerganov
47aacf4239 ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators 2023-04-18 22:59:17 +03:00
Kawrakow
684aeeb3e0 Adding a simple program to measure speed of dot products (#1041)
On my Mac, the direct Q4_1 product is marginally slower
(~69 vs ~55 us for Q4_0). The SIMD-ified ggml version
is now almost 2X slower (~121 us).

On a Ryzen 7950X CPU, the direct product for Q4_1 quantization
is faster than the AVX2 implementation (~60 vs ~62 us).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-04-18 19:00:14 +00:00
Georgi Gerganov
426d0c45f4 readme : update hot topics about new LoRA functionality 2023-04-18 20:10:26 +03:00
Georgi Gerganov
b2ef9f4eae ci : do not run on drafts 2023-04-18 19:57:06 +03:00
Ivan Komarov
b1f527be59 Do not close file after mmap (Windows version) (#1034) 2023-04-18 03:15:50 +02:00
Atsushi Tatsuma
3c9a24cc72 readme : add Ruby bindings (#1029) 2023-04-17 22:34:35 +03:00
Cameron
5e5d6ffdaa add 4_0 to default outfile namestr dict (#1031)
this came up when trying to convert the gpt4all-lora-unfiltered-quantized.bin file
2023-04-17 20:26:23 +02:00
slaren
dc0fa95077 Add LoRA support (#820) 2023-04-17 17:28:55 +02:00
Arik Poznanski
368d63e55f llama : well-defined static initialization of complex objects (#927)
* Replaced static initialization of complex objects with a initialization on first use. This prevents an undefined behavior on program run, for example, crash in Release build, works in Debug build

* replaced use of auto with exact type to avoid using -std=c++14

* Made the assessors functions for static maps be static const
2023-04-17 17:41:53 +03:00
Georgi Gerganov
9bed0fc823 quantize-stats : fix bug in --type argument 2023-04-17 17:31:06 +03:00
Georgi Gerganov
42ea22af13 ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c 2023-04-17 16:16:23 +03:00
Ivan Komarov
fb550a0f64 Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933) 2023-04-17 15:10:57 +02:00
slaren
b5fefdd2a8 Fix: do not close file on mmap (#1017) 2023-04-16 21:27:38 +02:00
Georgi Gerganov
aa84f3b5d5 stdout : vertical align outputs for better readibility 2023-04-16 13:59:27 +03:00
Pavol Rusnak
c5cb4f71c6 examples: add missing <ctime> include for time() (#1011) 2023-04-16 10:13:00 +00:00
nanahi
598810e9c4 Fix msys2 build error and warnings (#1009) 2023-04-16 11:13:42 +02:00
comex
a0909d9b15 convert.py: Fix loading safetensors and ggml format on Windows (#991)
Calling `mmap.mmap` on Windows apparently resets the file offset of the
raw file object (and makes the BufferedReader return a *negative* file
offset).  For safetensors, avoid using the file offset after calling
mmap.  For GGML format, explicitly save and restore the offset.

Fixes #966.
2023-04-15 23:53:21 +02:00
Stephan Walter
0363b17de2 Fix potential int8 overflow in non-SIMD vec_dot (#986) 2023-04-15 18:28:56 +00:00
Stephan Walter
378ffbab0e Refactor ggml.c for future tensor types (#1001) 2023-04-15 16:25:38 +00:00
Georgi Gerganov
053915a751 ggml : add Q8_0 quantization for intermediate results (#951)
* ggml : add Q8_0 quantization for intermediate results

* quantize-stats : fix test + add it to Makefile default

* Q8: use int8_t, AVX/AVX2 optimizations

* ggml : fix quantize_row_q8_0() ARM_NEON rounding

* minor : updates after rebase to latest master

* quantize-stats : delete obsolete strings

* ggml : fix q4_1 dot func

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-15 17:53:22 +03:00
Georgi Gerganov
a15576393c ggml : use posix_memalign on non-Windows env 2023-04-15 14:25:45 +03:00
Ivan Komarov
ae921afa4a benchmark : fix result validation in benchmark-q4_0-matmult (#987) 2023-04-15 08:51:54 +03:00
katsu560
05d97008d3 cmake : add finding the OpenBLAS header file (#992) 2023-04-15 08:51:11 +03:00
Pavol Rusnak
87bb0f74bd Revert "main : alternative instruct mode (Vicuna support, etc.) (#863)" (#982)
This reverts commit f4d277ae17.
2023-04-14 22:58:43 +03:00
Pavol Rusnak
66cf09af08 py : bump sentencepiece to 0.1.98 to support Python 3.11 (#976) 2023-04-14 19:46:49 +00:00
Stephan Walter
50879f9f5b make : fix dependencies, use auto variables (#983) 2023-04-14 22:39:48 +03:00
Pavol Rusnak
5de4cdb1fb Expose type name from ggml (#970)
Avoid duplication of type names in utils

Co-authored-by: Håkon H. Hitland <haakon@likedan.net>
2023-04-14 20:05:37 +02:00
Tomáš Pazdiora
e5fe7fa65c main : alternative instruct mode (Vicuna support, etc.) (#863)
* Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt

* Add multiline mode, update text input.

* bugfix

* update implementation

* typos

* Change --multiline implementation to be toggled by EOF.

* bugfix

* default multiline mode

* add more configs

* update formating

* update formatting

* apply suggestions
2023-04-14 18:19:17 +03:00
Kerfuffle
0c6e3a6e6f ggml : add unary and binary map operations (#874)
* GGML map ops proof of concept.

* Various cleanups.

Add handling for task setting.

Add handling for ggml_compute_backward.

Rename functions to ggml_map_unary_f32 and ggml_map_binary_f32

Fix compiler warnings related to casting function pointers and `void *`

Reorder functions and definitions based on the GGML op number.

Use typedefs for map op function pointer types.

* Fix position of map ops cases in ggml_compute_forward
2023-04-14 17:43:55 +03:00