Commit Graph

374 Commits

Author SHA1 Message Date
Cameron Kaiser
5571dc71c4 additional optimizations for POWER9 (#454) 2023-03-24 17:19:26 +02:00
comex
d86b7f08ad Support calling mlock() on loaded model data on Linux and macOS (#453)
* Support calling mlock() on loaded model data on Linux and macOS

This is enabled by a new --mlock command line option.

Using mlock() disables swapping and memory compression for the model
data.  Doing so can be useful on systems where the model takes up a
large fraction of system RAM.  In my experience, macOS is quite eager to
start compressing llama.cpp's memory, which then makes it halt for a few
seconds while it decompresses, even with a model that uses "only" 25GB
out of 32GB.

Of course, this comes at the cost of forcing the system to swap or
compress other processes' memory instead, so it needs to be used with
care and shouldn't be enabled by default.

In theory it should be possible to support this on Windows as well using
VirtualLock(), but I'm not much of a Windows user.

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-24 17:19:05 +02:00
Stephan Walter
43a021a260 Deduplicate q4 quantization functions (#383)
* Deduplicate q4 quantization functions

* Use const; add basic test

* Re-enable quantization test

* Disable AVX2 flags in CI

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-22 19:29:06 +02:00
Valentyn Bezshapkin
f520f9be86 fix: add POSIX functionality for Linux compilation (#51)
* fix: add POSIX functionality for Linux compilation

* fix: older standard for compatibility
2023-03-22 19:20:25 +02:00
Georgi Gerganov
c40b5b3d59 Introduce C-style API (#370)
* Major refactoring - introduce C-style API

* Clean up

* Add <cassert>

* Add <iterator>

* Add <algorithm> ....

* Fix timing reporting and accumulation

* Measure eval time only for single-token calls

* Change llama_tokenize return meaning
2023-03-22 07:32:36 +02:00
Kevin Lo
f2222d00b1 Add OpenBSD support (#314) 2023-03-21 17:50:09 +02:00
Casey Primozic
de601697f3 Add initial AVX512 support for dot product on Linux (#320)
* Update Makefile to detect AVX512 support and add compiler flags if it's available
 * Based on existing AVX2 implementation, dot product on one 32-value block of 4-bit quantized ints at a time
 * Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at time instead of 16
 * Use built-in AVX512 horizontal reduce add to get sum at the end
 * Manual unrolling on inner dot product loop to reduce loop counter overhead
2023-03-21 15:35:42 +01:00
Georgi Gerganov
960c6bfb09 Change RMSNorm eps to 1e-6 (#173)
I think this is what is used in the Python code
2023-03-19 17:30:00 +02:00
Stephan Walter
45113b2f42 Don't tell users to use a bad number of threads (#243)
The readme tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
2023-03-17 19:47:35 +02:00
Matvey Soloviev
03c8e88515 Q4_1 quantization (#193)
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix #152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul
2023-03-17 06:48:39 +02:00
Nebula
1b96142bae Fix RMS norm in GGML (#191) 2023-03-15 19:29:25 -04:00
hoangmit
12b9bd9b13 Add RMS norm and use it (#187)
* add ggml_rms_norm

* update op num
2023-03-16 00:41:38 +02:00
hoangmit
735b1a2aaa inline -> static inline for "bytesFromNibbles" (#161)
Without "static" prefix, it fails to compile in clang
2023-03-15 21:05:14 +02:00
Ronsor
55f8043b2f Don't use vdotq_s32 if it's not available (#139)
* Don't use vdotq_s32 if it's not available

`dotprod` extensions aren't available on some ARM CPUs (e.g. Raspberry Pi 4), so check for them and only use them if they're available.

Reintroduces the code removed in 84d9015 if `__ARM_FEATURE_DOTPROD` isn't defined.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-14 21:34:37 +02:00
Thomas Klausner
d3ed019b74 Add NetBSD support. (#90) 2023-03-13 18:40:54 +02:00
Georgi Gerganov
c1eebc2a25 Use vdotq_s32 to improve performance (#67)
* 10% performance boost on ARM

* Back to original change
2023-03-13 18:36:44 +02:00
Georgi Gerganov
49a8c7675b Revert "10% performance boost on ARM"
This reverts commit 113a9e83eb.

There are some reports for illegal instruction.
Moved this stuff to vdotq_s32 branch until resolve
2023-03-13 01:28:08 +02:00
Georgi Gerganov
c47fa0ea5e Check for vdotq_s32 availability 2023-03-13 01:21:03 +02:00
Georgi Gerganov
c00675331e Ammend to previous commit - forgot to update non-QRDMX branch 2023-03-13 01:05:24 +02:00
Georgi Gerganov
f48b7628ea 10% performance boost on ARM 2023-03-13 00:56:10 +02:00
Sebastián A
fde84afbed Windows fixes (#31)
* Apply fixes suggested to build on windows

Issue: https://github.com/ggerganov/llama.cpp/issues/22

* Remove unsupported VLAs

* MSVC: Remove features that are only available on MSVC C++20.

* Fix zero initialization of the other fields.

* Change the use of vector for stack allocations.
2023-03-12 22:15:00 +02:00
Georgi Gerganov
cc0f26bef3 Add AVX2 support for x86 architectures thanks to @Const-me ! 2023-03-11 18:04:25 +02:00
Georgi Gerganov
a2799521b9 Support all LLaMA models + change Q4_0 quantization storage 2023-03-11 11:28:30 +02:00
Georgi Gerganov
4b5b86d6ee Initial release 2023-03-10 20:56:40 +02:00