Commit Graph

124 Commits

Author SHA1 Message Date
Gustavo Rocha Dias
0db3668c98 readme : alternative way to build for Android with CLBlast. (#1828) 2023-06-17 12:01:06 +03:00
Aisuko
5e19207f69 doc : fix wrong address of BLIS.md (#1772)
Signed-off-by: Aisuko <urakiny@gmail.com>
2023-06-10 17:08:11 +03:00
Georgi Gerganov
e115880e31 readme : add June roadmap 2023-06-07 07:15:08 +03:00
Yuval Peled
739a917df4 docs : add performance troubleshoot + example benchmark documentation (#1674)
* test anchor link

* test table

* add benchmarks

* Add performance troubleshoot & benchmark

* add benchmarks

* remove unneeded line

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 23:32:36 +03:00
Foul-Tarnished
ae6585dec5 readme : fix typo (#1700)
Fix a typo in a command in README.md
2023-06-05 23:28:37 +03:00
Georgi Gerganov
595b95c725 readme : update hot topics 2023-06-04 23:38:19 +03:00
Georgi Gerganov
442db16478 llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + suport scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77fe811082f672538fc91bcfc9072d63.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
Henri Vasserman
a65a9c322b Add info about CUDA_VISIBLE_DEVICES (#1682) 2023-06-03 16:35:20 +03:00
Henri Vasserman
c31487b41a Add documentation about CLBlast (#1604)
Installing, compiling and using.
2023-05-27 18:47:55 +03:00
Evan Jones
5da20efb48 readme : add docs for chat-persistent.sh (#1568)
* readme : add docs for chat-persistent.sh

* Update README.md
2023-05-24 09:24:01 +03:00
Zenix
3d683d65fc feature : support blis and other blas implementation (#1536)
* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix: blas changes on ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 17:58:31 +03:00
Georgi Gerganov
ff47d6030f Revert "feature : add blis and other BLAS implementation support (#1502)"
This reverts commit 07e9ace0f9.
2023-05-20 12:03:48 +03:00
Zenix
1a287d274c feature : add blis and other BLAS implementation support (#1502)
* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 12:02:48 +03:00
Georgi Gerganov
744804ef91 ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics
2023-05-19 22:17:18 +03:00
David Kennedy
1917bffd34 readme : adds WizardLM to the list of supported models (#1485) 2023-05-19 20:16:30 +03:00
Georgi Gerganov
490afcc5d0 readme : update Q4_0 perplexities
I think these were affected by the removal of the `round` during quantization
2023-05-13 09:12:44 +03:00
Rinne
440c83c83d readme : add C#/.NET bindings repo (#1409) 2023-05-12 08:39:40 +03:00
Georgi Gerganov
65f22db997 ggml : remove bit shuffling (#1405)
* ggml : remove Q4_0 bit shufling (ARM NEON)

* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)

* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)

* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)

* ggml : remove Q5_0 bit shuffling (ARM NEON)

* ggml : 2x faster scalar implementations

* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)

* ggml : simplify scalar dot

* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit

* ggml : fix Q4_1 quantization

* ggml : update cuBLAS + normalize variable names

* ggml : remove Q4_2 mode

* ggml : minor formatting

* ggml : fix Q5_0 quantization

* scripts : add script for measuring the time per token

* AVX implementations (#1370)

* ggml : uniform 5th bit extraction

* llama : produce error upon loading old model files

* llama : fix model magic/version write

* ggml : speed-up Q5_0 + Q5_1 at 4 threads

* ggml : preserve old Q4 and Q5 formats

* ggml : simplify Q8_1 - no need for low / high sums anymore

* ggml : fix Q8_0 and Q8_1 rounding

* Revert "AVX implementations (#1370)"

This reverts commit 948d124837f9d287d8490f41338e0e4cceb0814f.

* ggml : fix AVX2 implementation

* sha : update hashes for 7B and 13B

* readme : update timings + remove warning banner

* llama : update v2 PR number to 1405

* ggml : fix WASM comments

* ggml : back to original bit order

* readme : add note that Q4 and Q5 have been changed

* llama : fix return for unknown version

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
Georgi Gerganov
b778f64c54 readme : add notice about upcoming breaking change 2023-05-08 22:52:18 +03:00
AlpinDale
bcfad0f2bc readme : add TOC and Pygmalion instructions (#1359) 2023-05-08 19:33:30 +03:00
Georgi Gerganov
e5c1a67530 llama : require first token to be BOS (#1303)
* llama : require first token to be BOS

* scripts : add ppl-run-all.sh

* perplexity : add BOS for each chunk

* readme : update perplexity values after BOS fix

* perplexity : add clarifying comments
2023-05-08 17:41:54 +03:00
Johannes Gäßler
2e69d4ed6e Documented CUDA reproducibility, added warning (#1346) 2023-05-08 02:42:01 +02:00
DaniAndTheWeb
fe86ad336b makefile: automatic Arch Linux detection (#1332)
This commit is a port of a detection method used in koboldcpp's Makefile in order to automatically set the -lcblas option on Arch Linux
2023-05-05 23:57:14 +02:00
Pavol Rusnak
f415e2f7e5 readme: add missing info (#1324) 2023-05-05 16:43:36 +02:00
44670
38022bc8e5 readme : add OpenBuddy link (#1321) 2023-05-04 19:33:31 +03:00
Georgi Gerganov
aaa2ee9e37 minor : fix whitespaces (#1302) 2023-05-03 20:09:42 +03:00
KASR
2290898428 scripts : platform independent script to verify sha256 checksums (#1203)
* python script to verify the checksum of the llama models

Added Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability.

* Update README.md

update to the readme for improved readability and to explain the usage of the python checksum verification script

* update the verification script

I've extended the script based on suggestions by @prusnak

The script now checks the available RAM, is there is enough to check the file at once it will do so. If not the file is read in chunks.

* minor improvment

small change so that the available ram is checked and not the total ram

* remove the part of the code that reads the file at once if enough ram is available

based on suggestions from @prusnak i removed the part of the code that checks whether the user had enough ram to read the entire model at once. the file is now always read in chunks.

* Update verify-checksum-models.py

quick fix to pass the git check
2023-05-03 18:31:28 +03:00
Stephan Walter
4aeea780bc Remove Q4_3 which is no better than Q5 (#1218) 2023-04-28 23:10:43 +00:00
Georgi Gerganov
3a8a3891b3 readme : update hot topics 2023-04-28 21:32:52 +03:00
Folko-Ven
afe6563a58 Correcting link to w64devkit (#1214)
Correcting link to w64devkit (change seeto to skeeto).
2023-04-28 16:22:48 +02:00
Georgi Gerganov
f9318ab76d readme : add quantization info 2023-04-26 23:24:42 +03:00
DaniAndTheWeb
8ad378c494 Updating build instructions to include BLAS support (#1183)
* Updated build information

First update to the build instructions to include BLAS.

* Update README.md

* Update information about BLAS

* Better BLAS explanation

Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.

* Better BLAS explanation

* BLAS for Mac

Specifying that BLAS is already supported on Macs using the Accelerate Framework.

* Clarify the effect of BLAS

* Windows Make instructions

Added the instructions to build with Make on Windows

* Fixing typo

* Fix trailing whitespace
2023-04-26 22:03:03 +02:00
Pavol Rusnak
9e63abecf7 quantize : use map to assign quantization type from string (#1191)
instead of `int` (while `int` option still being supported)

This allows the following usage:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0`

instead of:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`
2023-04-26 18:43:27 +02:00
mgroeber9110
094250bc21 examples/main README improvements and some light refactoring (#1131) 2023-04-24 15:45:32 +00:00
Pavol Rusnak
0ba63e81ff readme : update gpt4all instructions (#980) 2023-04-23 10:21:26 +02:00
CRD716
7ecc2d9e42 Minor: Readme fixed grammar, spelling, and misc updates (#1071) 2023-04-19 19:52:14 +00:00
Georgi Gerganov
068083ca76 readme : add warning about Q4_2 and Q4_3 2023-04-19 19:07:54 +03:00
Georgi Gerganov
426d0c45f4 readme : update hot topics about new LoRA functionality 2023-04-18 20:10:26 +03:00
Atsushi Tatsuma
3c9a24cc72 readme : add Ruby bindings (#1029) 2023-04-17 22:34:35 +03:00
comex
3573ed90b8 py : new conversion script (#545)
Current status: Working, except for the latest GPTQ-for-LLaMa format
  that includes `g_idx`.  This turns out to require changes to GGML, so
  for now it only works if you use the `--outtype` option to dequantize it
  back to f16 (which is pointless except for debugging).

  I also included some cleanup for the C++ code.

  This script is meant to replace all the existing conversion scripts
  (including the ones that convert from older GGML formats), while also
  adding support for some new formats.  Specifically, I've tested with:

  - [x] `LLaMA` (original)
  - [x] `llama-65b-4bit`
  - [x] `alpaca-native`
  - [x] `alpaca-native-4bit`
  - [x] LLaMA converted to 'transformers' format using
        `convert_llama_weights_to_hf.py`
  - [x] `alpaca-native` quantized with `--true-sequential --act-order
        --groupsize 128` (dequantized only)
  - [x] same as above plus `--save_safetensors`
  - [x] GPT4All
  - [x] stock unversioned ggml
  - [x] ggmh

  There's enough overlap in the logic needed to handle these different
  cases that it seemed best to move to a single script.

  I haven't tried this with Alpaca-LoRA because I don't know where to find
  it.

  Useful features:

  - Uses multiple threads for a speedup in some cases (though the Python
    GIL limits the gain, and sometimes it's disk-bound anyway).

  - Combines split models into a single file (both the intra-tensor split
    of the original and the inter-tensor split of 'transformers' format
    files).  Single files are more convenient to work with and more
    friendly to future changes to use memory mapping on the C++ side.  To
    accomplish this without increasing memory requirements, it has some
    custom loading code which avoids loading whole input files into memory
    at once.

  - Because of the custom loading code, it no longer depends in PyTorch,
    which might make installing dependencies slightly easier or faster...
    although it still depends on NumPy and sentencepiece, so I don't know
    if there's any meaningful difference.  In any case, I also added a
    requirements.txt file to lock the dependency versions in case of any
    future breaking changes.

  - Type annotations checked with mypy.

  - Some attempts to be extra user-friendly:

      - The script tries to be forgiving with arguments, e.g. you can
        specify either the model file itself or the directory containing
        it.

      - The script doesn't depend on config.json / params.json, just in
        case the user downloaded files individually and doesn't have those
        handy.  But you still need tokenizer.model and, for Alpaca,
        added_tokens.json.

      - The script tries to give a helpful error message if
        added_tokens.json is missing.
2023-04-14 10:03:03 +03:00
CRD716
d74ce11c25 readme : remove python 3.10 warning (#929) 2023-04-13 16:59:53 +03:00
Genkagaku.GPT
c720fa4877 readme : llama node binding (#911)
* chore: add nodejs binding

* chore: add nodejs binding
2023-04-13 16:54:27 +03:00
Judd
b9a4538eaa zig : update build.zig (#872)
* update

* update readme

* minimize the changes.

---------

Co-authored-by: zjli2019 <zhengji.li@ingchips.com>
2023-04-13 16:43:22 +03:00
Georgi Gerganov
be7082caef readme : change "GPU support" link to discussion 2023-04-12 14:48:57 +03:00
Georgi Gerganov
9b68f0ee36 readme : update hot topics with link to "GPU support" issue 2023-04-12 14:31:12 +03:00
Nicolai Weitkemper
e610019c01 readme: link to sha256sums file (#902)
This is to emphasize that these do not need to be obtained from elsewhere.
2023-04-12 08:46:20 +02:00
Pavol Rusnak
e4d3b4b251 Fix whitespace, add .editorconfig, add GitHub workflow (#883) 2023-04-11 19:45:44 +00:00
qouoq
4b0adc70d7 Add BAIR's Koala to supported models (#877) 2023-04-10 22:41:53 +02:00
Pavol Rusnak
944d161986 Make docker instructions more explicit (#785) 2023-04-06 08:56:58 +02:00
Georgi Gerganov
0470adafd0 Update README.md 2023-04-05 19:54:30 +03:00