Commit Graph

800 Commits

Author SHA1 Message Date
Howard Su
b29cd6ba6d Fix opencl by wrap #if-else-endif with \n (#2086) 2023-07-07 05:34:18 +02:00
Georgi Gerganov
0949e52280 ggml : fix restrict usage 2023-07-06 19:41:31 +03:00
Judd
56e5233320 convert : update for baichuan (#2081)
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.

Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06 19:23:49 +03:00
tslmy
6815028f35 alpaca.sh : update model file name (#2074)
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `*ggmlv3*.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and less people would waste time downloading the old Alpaca model.
2023-07-06 19:17:50 +03:00
Tobias Lütke
2978dd92ec Expose generation timings from server & update completions.js (#2116)
* use javascript generators as much cleaner API

Also add ways to access completion as promise and EventSource

* export llama_timings as struct and expose them in server

* update readme, update baked includes

* llama : uniform variable names + struct init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson
8a3fecd99b Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
Georgi Gerganov
bd32bac50c ggml : fix bug introduced in #1237 2023-07-05 20:44:11 +03:00
Georgi Gerganov
5ac2fe8ccf tests : fix test-grad0 2023-07-05 20:20:25 +03:00
Stephan Walter
e0a5b08cdc ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson
2dc0a82cd0 Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
Johannes Gäßler
aa060f64de Quantized dot products for CUDA mul mat vec (#2067) 2023-07-05 14:19:42 +02:00
Howard Su
928a2061d8 llama: Don't double count the sampling time (#2107) 2023-07-05 18:31:23 +08:00
Johannes Gäßler
bfbd12322e Fixed OpenCL offloading prints (#2082) 2023-07-05 08:58:05 +02:00
Nigel Bosch
fdc61bd755 embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
Georgi Gerganov
d9c33ac749 readme : add link web chat PR 2023-07-04 22:25:22 +03:00
Georgi Gerganov
40c6a525ea ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
2023-07-04 21:54:11 +03:00
jwj7140
d7d540c04d Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke
111f1b47ab Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

because web browsers send a lot of garbage requests we want the server
to multithread when serving 404s for favicon's etc. To avoid blowing up
llama we just take a mutex when it's invoked.


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
Henri Vasserman
f1533445e7 Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
2023-07-04 15:38:04 +03:00
ZhouYuChen
7342fb172e Update Makefile: clean simple (#2097) 2023-07-04 14:15:16 +02:00
Erik Scholz
bdf2652c83 CI: make the brew update temporarily optional. (#2092)
until they decide to fix the brew installation in the macos runners.
see the open issues. eg https://github.com/actions/runner-images/pull/7710
2023-07-04 01:50:12 +02:00
Govlzkoy
1d97415a70 [ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00
Henri Vasserman
8e9801d5b0 fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
Howard Su
b4acd69f66 Fix crash of test-tokenizer-0 under Debug build (#2064)
* Fix crash of test-tokenizer-0 under Debug build

* Change per comment
2023-07-03 20:43:55 +02:00
Howard Su
1a58c36ca8 [llama] No need to check file version when loading vocab score (#2079) 2023-07-03 19:58:58 +08:00
WangHaoranRobin
552a05c6dc server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
2023-07-03 00:38:44 +03:00
Georgi Gerganov
8fe97102e5 ggml : fix build with OpenBLAS (close #2066) 2023-07-02 09:46:46 +03:00
Johannes Gäßler
1e9814bf70 Better CUDA synchronization logic (#2057) 2023-07-01 21:49:44 +02:00
Johannes Gäßler
2764eaeaf7 Test-based VRAM scratch size + context adjustment (#2056) 2023-07-01 21:47:26 +02:00
Daniel Drake
abbce9e3f9 cmake : don't force -mcpu=native on aarch64 (#2063)
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.

-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.

If there is not a specific CPU-tuning target for aarch64 then -mcpu
should be omitted completely. I think that makes sense, there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, then there is
already the LLAMA_NATIVE option available.

Fixes #495.
2023-07-01 21:31:44 +03:00
Aaron Miller
b67aea635d metal : release buffers when freeing metal context (#2062) 2023-07-01 21:14:59 +03:00
Judd
94e1a0ab7d convert : add support of baichuan-7b (#2055)
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
Georgi Gerganov
fcce8eb52b llama : fix return value of llama_load_session_file_internal (#2022) 2023-07-01 19:05:09 +03:00
Rand Xie
f44d5638b8 llama : catch llama_load_session_file_internal exceptions (#2022)
* convert checks in llama_load_session_file to throw and handle them

* make llama_load_session_file_internal static

* address feedbacks to avoid using exceptions
2023-07-01 19:02:58 +03:00
Georgi Gerganov
0f42fa7e17 embd-input : fix returning ptr to temporary 2023-07-01 18:46:00 +03:00
Georgi Gerganov
080d870d99 train : fix compile warning 2023-07-01 18:45:44 +03:00
Qingyou Meng
6fcaeb4790 ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)
Will not be scheduled unless explicitly enabled.
2023-07-01 18:42:43 +03:00
Howard Su
41ce2a335b Use unsigned for random seed (#2006)
* Use unsigned for random seed. Keep -1 as the value to use a time based seed.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-29 06:15:15 -07:00
LostRuins
f3aefd46e3 Porting the improved K-Quant CUDA kernels to OpenCL (#1966)
* Added broken new q4k quant

* xx + ib0

* Fix q2_k fast kernel

* Use preprocessor for QK_K

* Add q6_k fast matmul kernel

* ported q3k speedup successfully

* ported q2k and q5k speedups

* remove old dot kernels and template

* fixed global const struct types

* fixing address spaces

* fixed string too long CI issue

---------

Co-authored-by: 0cc4m <picard12@live.de>
2023-06-29 05:56:43 +02:00
m3ndax
f3a12da865 llama : replacing auto &kv with const auto &kv (#2041)
* Replacing auto &kv with const auto &kv

* Create codacy.yml

* Delete codacy.yml
2023-06-28 21:39:08 +03:00
Salvador E. Tropea
1070867381 cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)
- Not used
2023-06-28 20:27:31 +03:00
Salvador E. Tropea
55112aa1db cuda : fix missing const qualifier in casts (#2027) 2023-06-28 20:26:26 +03:00
Howard Su
6d8b07691b llama : remove shards weight file support (#2000)
* Remove multiple shards

* Remove multiple file loaders

* Remove llama_load_tensor_shard class

* Simplify load logic

* Remove dead code guess_n_parts function

* Remove vocab_only from constructor of llama_model_loader

* Remove alignment_prevents_mmap which is not more needed.

* Remove useless check
2023-06-28 20:13:02 +03:00
Johannes Gäßler
85f06e26fe CUDA GPU acceleration for LoRAs + f16 models (#1970) 2023-06-28 18:35:54 +02:00
ningshanwutuobang
24b1895f76 llama : support input embeddings directly (#1910)
* add interface for float input

* fixed inpL shape and type

* add examples of input floats

* add test example for embd input

* fixed sampling

* add free for context

* fixed add end condition for generating

* add examples for llava.py

* add READMD for llava.py

* add READMD for llava.py

* add example of PandaGPT

* refactor the interface and fixed the styles

* add cmake build for embd-input

* add cmake build for embd-input

* Add MiniGPT-4 example

* change the order of the args of llama_eval_internal

* fix ci error
2023-06-28 18:53:37 +03:00
Erik Scholz
67de6d1ed6 fix pthreads setaffinity usage on android (#2020) 2023-06-27 19:06:33 +02:00
Howard Su
0dacdf3f52 baby-llama : fix build after ggml_rope change (#2016) 2023-06-27 08:07:13 +03:00
Georgi Gerganov
49d212967e llama : fix rope usage after ChatGLM change 2023-06-27 00:37:33 +03:00
Georgi Gerganov
6ac63cb561 ggml : add support for ChatGLM RoPE 2023-06-27 00:06:51 +03:00
Roman Parykin
a0289f4d24 readme : add Scala 3 bindings repo (#2010) 2023-06-26 22:47:59 +03:00