turboderp
f7c89f4c51
Add Q4 test results (draft)
2024-03-09 05:59:26 +01:00
turboderp
02edfb9f2f
Fix HumanEval test for Gemma models
2024-03-09 05:59:00 +01:00
turboderp
222c8465bc
Rework HumanEval test
2024-03-07 17:54:07 +01:00
turboderp
c60ac6e9fd
Bump to 0.0.15
v0.0.15
2024-03-07 03:10:45 +01:00
turboderp
082a9fe9df
Fix Q4 cache in chat example
2024-03-06 19:13:21 +01:00
turboderp
0b05686e76
Refactor, clean up and consolidate architecture logic
2024-03-06 02:46:47 +01:00
turboderp
eb8269726f
Update examples
2024-03-06 02:41:23 +01:00
turboderp
dce84866e1
Initial support for StarCoder2
2024-03-05 21:20:29 +01:00
turboderp
28609c7d29
Fix cache.cu compile on ROCm, consolidate ROCm compatibility functions
2024-03-05 00:32:21 +01:00
turboderp
d09f97aedc
Add Q4 option to chat example
2024-03-05 00:29:12 +01:00
turboderp
bafe539728
Add Q4 cache mode
2024-03-03 23:34:11 +01:00
turboderp
b4e6c5e9c9
Remove debug code
2024-03-02 00:46:49 +01:00
turboderp
61637c5da5
Fix for some Yi tokenizers
2024-02-27 09:43:18 +01:00
turboderp
68daf0b0f6
Merge pull request #354 from seanlynch/fix_filters
Fix a couple of filter bugs
2024-02-27 08:45:43 +01:00
turboderp
a1c23b16fe
Merge pull request #352 from ParisNeo/master
Added a mention of lollms-webui as another possible webui that can be used with exllamav2 as a backend
2024-02-27 08:43:50 +01:00
Sean Lynch
55000aa6b7
Use sampler's filter_end output as default value for eos
Previously the generator was just ignoring this, which breaks the
Select filter because it tries to keep going even after there are no
more options left.
2024-02-26 14:53:47 -05:00
Sean Lynch
7cd560da92
Accept None as the argument to Select.begin()
When there is no healed token, the sampler passes None, not the empty
string.
2024-02-26 14:52:15 -05:00
turboderp
1de4cdd70b
Add skew sampling
2024-02-25 15:53:31 +01:00
Saifeddine ALOUI
1de788fb85
Updated README.md to add lollms-webui to supported webuis
I added a line about lollms-webui, which supports exllamav2 via the exllamav2 binding.
2024-02-25 11:14:37 +01:00
Saifeddine ALOUI
0d432c97a9
Merge branch 'turboderp:master' into master
2024-02-25 11:11:14 +01:00
turboderp
d6fb70ab41
Faster dequant for 6-bit groups
2024-02-24 16:37:46 +01:00
turboderp
2586566840
Fix ROCm compile
2024-02-24 08:00:30 +01:00
turboderp
67af1d101a
Bump to 0.0.14
v0.0.14
2024-02-24 06:39:39 +01:00
turboderp
cc1094a41b
Support Gemma
2024-02-22 14:45:27 +01:00
turboderp
a19a2eccb4
Add option to force BOS for ppl test
2024-02-22 14:44:27 +01:00
turboderp
69fba75225
Add Gemma prompt format to example chatbot
2024-02-22 14:43:42 +01:00
turboderp
2044f8a31c
Set inference_mode when compiling model
2024-02-22 10:48:44 +01:00
turboderp
983a229913
Add BOS token by default in test_inference.py, option to override
2024-02-22 09:42:43 +01:00
turboderp
d7fdfe7f0d
Fix for GPTQ models with zero bias tensors
2024-02-21 08:23:17 +01:00
turboderp
c8e2bf4594
Fix small mistake in example
2024-02-19 14:20:17 +01:00
turboderp
229019d86e
Add lm-format-enforcer JSON example
2024-02-19 00:56:06 +01:00
turboderp
f194d9d7b0
Add filter_prefer_eos option
2024-02-19 00:14:14 +01:00
turboderp
daf7844d18
Add prefix filter
2024-02-18 23:58:25 +01:00
turboderp
26f4bf8997
Make sure first_token is always set when beginning stream (bugfix)
2024-02-18 22:16:01 +01:00
turboderp
8c3b30dc4b
Fix tokenizer decoding for Qwen
2024-02-16 22:53:24 +01:00
turboderp
7af6494afa
Drop device tensors for head layer during conversion
2024-02-16 17:31:19 +01:00
turboderp
5967a29eb4
Fix architecture detection
2024-02-16 01:52:26 +01:00
turboderp
cedeb616ce
Support Qwen2
2024-02-15 20:50:24 +01:00
turboderp
1bc7c85a27
Disambiguate sampling params
2024-02-15 20:04:47 +01:00
turboderp
702dd9740a
VRAM optimizations during quant
2024-02-15 20:03:47 +01:00
turboderp
75f969a6d3
Disable cudaMallocAsync for post2 release
0.0.13.post2
2024-02-15 00:07:47 +01:00
turboderp
0535783ad3
Bump to 0.0.13.post2
2024-02-14 23:54:59 +01:00
turboderp
3424e70cae
Only change allocator if Torch is not already imported
2024-02-14 23:54:48 +01:00
turboderp
c29f42626e
Use cudaMallocAsync allocator by default
2024-02-14 23:18:07 +01:00
turboderp
69bfbea7b1
Allow autosplit to work with cudaMallocAsync backend
2024-02-14 20:19:28 +01:00
turboderp
9c37d64d74
Remove TODO items
2024-02-14 20:00:10 +01:00
turboderp
b0dc588d9b
Remove return values from load_gen
2024-02-14 19:41:59 +01:00
turboderp
1c67f97f3d
New API for streaming generator
2024-02-11 20:31:58 +01:00
turboderp
944e523109
Merge pull request #324 from flying-x/master
2 minor changes
2024-02-11 10:09:55 +01:00
turboderp
d7eddbaee0
Optimize typical sampling
2024-02-10 20:13:09 +01:00