Commit Graph

601 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| turboderp | f7c89f4c51 | Add Q4 test results (draft) | 2024-03-09 05:59:26 +01:00 |
| turboderp | 02edfb9f2f | Fix HumanEval test for Gemma models | 2024-03-09 05:59:00 +01:00 |
| turboderp | 222c8465bc | Rework HumanEval test | 2024-03-07 17:54:07 +01:00 |
| turboderp | c60ac6e9fd | Bump to 0.0.15 (tag: v0.0.15) | 2024-03-07 03:10:45 +01:00 |
| turboderp | 082a9fe9df | Fix Q4 cache in chat example | 2024-03-06 19:13:21 +01:00 |
| turboderp | 0b05686e76 | Refactor, clean up and consolidate architecture logic | 2024-03-06 02:46:47 +01:00 |
| turboderp | eb8269726f | Update examples | 2024-03-06 02:41:23 +01:00 |
| turboderp | dce84866e1 | Support for StarCoder2, initial | 2024-03-05 21:20:29 +01:00 |
| turboderp | 28609c7d29 | Fix cache.cu compile on ROCm, consolidate ROCm compatibility functions | 2024-03-05 00:32:21 +01:00 |
| turboderp | d09f97aedc | Add Q4 option to chat example | 2024-03-05 00:29:12 +01:00 |
| turboderp | bafe539728 | Add Q4 cache mode | 2024-03-03 23:34:11 +01:00 |
| turboderp | b4e6c5e9c9 | Remove debug code | 2024-03-02 00:46:49 +01:00 |
| turboderp | 61637c5da5 | Fix for some Yi tokenizers | 2024-02-27 09:43:18 +01:00 |
| turboderp | 68daf0b0f6 | Merge pull request #354 from seanlynch/fix_filters: Fix a couple of filter bugs | 2024-02-27 08:45:43 +01:00 |
| turboderp | a1c23b16fe | Merge pull request #352 from ParisNeo/master: Added a mention of lollms-webui as another possible webui that can be used with exllamav2 as a backend | 2024-02-27 08:43:50 +01:00 |
| Sean Lynch | 55000aa6b7 | Use sampler's filter_end output as default value for eos. Previously the generator was just ignoring this, which breaks the Select filter because it tries to keep going even after there are no more options left. | 2024-02-26 14:53:47 -05:00 |
| Sean Lynch | 7cd560da92 | Accept None as the argument to Select.begin(). When there is no healed token, the sampler passes None, not the empty string. | 2024-02-26 14:52:15 -05:00 |
| turboderp | 1de4cdd70b | Add skew sampling | 2024-02-25 15:53:31 +01:00 |
| Saifeddine ALOUI | 1de788fb85 | Updated README.md to add lollms-webui to supported webuis: added a line about lollms-webui, which supports exllamav2 via the exllamav2 binding. | 2024-02-25 11:14:37 +01:00 |
| Saifeddine ALOUI | 0d432c97a9 | Merge branch 'turboderp:master' into master | 2024-02-25 11:11:14 +01:00 |
| turboderp | d6fb70ab41 | Faster dequant for 6-bit groups | 2024-02-24 16:37:46 +01:00 |
| turboderp | 2586566840 | Fix ROCm compile | 2024-02-24 08:00:30 +01:00 |
| turboderp | 67af1d101a | Bump to 0.0.14 (tag: v0.0.14) | 2024-02-24 06:39:39 +01:00 |
| turboderp | cc1094a41b | Support Gemma | 2024-02-22 14:45:27 +01:00 |
| turboderp | a19a2eccb4 | Add option to force BOS for ppl test | 2024-02-22 14:44:27 +01:00 |
| turboderp | 69fba75225 | Add Gemma prompt format to example chatbot | 2024-02-22 14:43:42 +01:00 |
| turboderp | 2044f8a31c | Set inference_mode when compiling model | 2024-02-22 10:48:44 +01:00 |
| turboderp | 983a229913 | Add BOS token by default in test_inference.py, option to override | 2024-02-22 09:42:43 +01:00 |
| turboderp | d7fdfe7f0d | Fix for GPTQ models with zero bias tensors | 2024-02-21 08:23:17 +01:00 |
| turboderp | c8e2bf4594 | Fix small mistake in example | 2024-02-19 14:20:17 +01:00 |
| turboderp | 229019d86e | Add lm-format-enforcer JSON example | 2024-02-19 00:56:06 +01:00 |
| turboderp | f194d9d7b0 | Add filter_prefer_eos option | 2024-02-19 00:14:14 +01:00 |
| turboderp | daf7844d18 | Add prefix filter | 2024-02-18 23:58:25 +01:00 |
| turboderp | 26f4bf8997 | Make sure first_token is always set when beginning stream (bugfix) | 2024-02-18 22:16:01 +01:00 |
| turboderp | 8c3b30dc4b | Fix tokenizer decoding for Qwen | 2024-02-16 22:53:24 +01:00 |
| turboderp | 7af6494afa | Drop device tensors for head layer during conversion | 2024-02-16 17:31:19 +01:00 |
| turboderp | 5967a29eb4 | Fix architecture detection | 2024-02-16 01:52:26 +01:00 |
| turboderp | cedeb616ce | Support Qwen2 | 2024-02-15 20:50:24 +01:00 |
| turboderp | 1bc7c85a27 | Disambiguate sampling params | 2024-02-15 20:04:47 +01:00 |
| turboderp | 702dd9740a | VRAM optimizations during quant | 2024-02-15 20:03:47 +01:00 |
| turboderp | 75f969a6d3 | Disable cudaMallocAsync for post2 release (tag: 0.0.13.post2) | 2024-02-15 00:07:47 +01:00 |
| turboderp | 0535783ad3 | Bump to 0.0.13.post2 | 2024-02-14 23:54:59 +01:00 |
| turboderp | 3424e70cae | Only change allocator if Torch is not already imported | 2024-02-14 23:54:48 +01:00 |
| turboderp | c29f42626e | Use cudaMallocAsync allocator by default | 2024-02-14 23:18:07 +01:00 |
| turboderp | 69bfbea7b1 | Allow autosplit to work with cudaMallocAsync backend | 2024-02-14 20:19:28 +01:00 |
| turboderp | 9c37d64d74 | Remove TODO items | 2024-02-14 20:00:10 +01:00 |
| turboderp | b0dc588d9b | Remove return values from load_gen | 2024-02-14 19:41:59 +01:00 |
| turboderp | 1c67f97f3d | New API for streaming generator | 2024-02-11 20:31:58 +01:00 |
| turboderp | 944e523109 | Merge pull request #324 from flying-x/master: 2 minor changes | 2024-02-11 10:09:55 +01:00 |
| turboderp | d7eddbaee0 | Optimize typical sampling | 2024-02-10 20:13:09 +01:00 |