turboderp
f7c89f4c51
Add Q4 test results (draft)
2024-03-09 05:59:26 +01:00
turboderp
02edfb9f2f
Fix HumanEval test for Gemma models
2024-03-09 05:59:00 +01:00
turboderp
222c8465bc
Rework HumanEval test
2024-03-07 17:54:07 +01:00
turboderp
c60ac6e9fd
Bump to 0.0.15
v0.0.15
2024-03-07 03:10:45 +01:00
turboderp
082a9fe9df
Fix Q4 cache in chat example
2024-03-06 19:13:21 +01:00
turboderp
0b05686e76
Refactor, clean up and consolidate architecture logic
2024-03-06 02:46:47 +01:00
turboderp
eb8269726f
Update examples
2024-03-06 02:41:23 +01:00
turboderp
dce84866e1
Initial support for StarCoder2
2024-03-05 21:20:29 +01:00
turboderp
28609c7d29
Fix cache.cu compile on ROCm, consolidate ROCm compatibility functions
2024-03-05 00:32:21 +01:00
turboderp
d09f97aedc
Add Q4 option to chat example
2024-03-05 00:29:12 +01:00
turboderp
bafe539728
Add Q4 cache mode
2024-03-03 23:34:11 +01:00
turboderp
b4e6c5e9c9
Remove debug code
2024-03-02 00:46:49 +01:00
turboderp
61637c5da5
Fix for some Yi tokenizers
2024-02-27 09:43:18 +01:00
turboderp
68daf0b0f6
Merge pull request #354 from seanlynch/fix_filters
Fix a couple of filter bugs
2024-02-27 08:45:43 +01:00
turboderp
a1c23b16fe
Merge pull request #352 from ParisNeo/master
Added a mention of lollms-webui as another possible webui that can be used with exllamav2 as a backend
2024-02-27 08:43:50 +01:00
Sean Lynch
55000aa6b7
Use sampler's filter_end output as default value for eos
Previously the generator was just ignoring this, which breaks the
Select filter because it tries to keep going even after there are no
more options left.
2024-02-26 14:53:47 -05:00
Sean Lynch
7cd560da92
Accept None as the argument to Select.begin()
When there is no healed token, the sampler passes None, not the empty
string.
2024-02-26 14:52:15 -05:00
turboderp
1de4cdd70b
Add skew sampling
2024-02-25 15:53:31 +01:00
Saifeddine ALOUI
1de788fb85
Updated README.md to add lollms-webui to supported webuis
I added a line about lollms-webui, which supports exllamav2 via the exllamav2 binding.
2024-02-25 11:14:37 +01:00
Saifeddine ALOUI
0d432c97a9
Merge branch 'turboderp:master' into master
2024-02-25 11:11:14 +01:00
turboderp
d6fb70ab41
Faster dequant for 6-bit groups
2024-02-24 16:37:46 +01:00
turboderp
2586566840
Fix ROCm compile
2024-02-24 08:00:30 +01:00
turboderp
67af1d101a
Bump to 0.0.14
v0.0.14
2024-02-24 06:39:39 +01:00
turboderp
cc1094a41b
Support Gemma
2024-02-22 14:45:27 +01:00
turboderp
a19a2eccb4
Add option to force BOS for ppl test
2024-02-22 14:44:27 +01:00
turboderp
69fba75225
Add Gemma prompt format to example chatbot
2024-02-22 14:43:42 +01:00
turboderp
2044f8a31c
Set inference_mode when compiling model
2024-02-22 10:48:44 +01:00
turboderp
983a229913
Add BOS token by default in test_inference.py, option to override
2024-02-22 09:42:43 +01:00
turboderp
d7fdfe7f0d
Fix for GPTQ models with zero bias tensors
2024-02-21 08:23:17 +01:00
turboderp
c8e2bf4594
Fix small mistake in example
2024-02-19 14:20:17 +01:00
turboderp
229019d86e
Add lm-format-enforcer JSON example
2024-02-19 00:56:06 +01:00
turboderp
f194d9d7b0
Add filter_prefer_eos option
2024-02-19 00:14:14 +01:00
turboderp
daf7844d18
Add prefix filter
2024-02-18 23:58:25 +01:00
turboderp
26f4bf8997
Make sure first_token is always set when beginning stream (bugfix)
2024-02-18 22:16:01 +01:00
turboderp
8c3b30dc4b
Fix tokenizer decoding for Qwen
2024-02-16 22:53:24 +01:00
turboderp
7af6494afa
Drop device tensors for head layer during conversion
2024-02-16 17:31:19 +01:00
turboderp
5967a29eb4
Fix architecture detection
2024-02-16 01:52:26 +01:00
turboderp
cedeb616ce
Support Qwen2
2024-02-15 20:50:24 +01:00
turboderp
1bc7c85a27
Disambiguate sampling params
2024-02-15 20:04:47 +01:00
turboderp
702dd9740a
VRAM optimizations during quant
2024-02-15 20:03:47 +01:00
turboderp
75f969a6d3
Disable cudaMallocAsync for post2 release
0.0.13.post2
2024-02-15 00:07:47 +01:00
turboderp
0535783ad3
Bump to 0.0.13.post2
2024-02-14 23:54:59 +01:00
turboderp
3424e70cae
Only change allocator if Torch is not already imported
2024-02-14 23:54:48 +01:00
turboderp
c29f42626e
Use cudaMallocAsync allocator by default
2024-02-14 23:18:07 +01:00
turboderp
69bfbea7b1
Allow autosplit to work with cudaMallocAsync backend
2024-02-14 20:19:28 +01:00
turboderp
9c37d64d74
Remove TODO items
2024-02-14 20:00:10 +01:00
turboderp
b0dc588d9b
Remove return values from load_gen
2024-02-14 19:41:59 +01:00
turboderp
1c67f97f3d
New API for streaming generator
2024-02-11 20:31:58 +01:00
turboderp
944e523109
Merge pull request #324 from flying-x/master
2 minor changes
2024-02-11 10:09:55 +01:00
turboderp
d7eddbaee0
Optimize typical sampling
2024-02-10 20:13:09 +01:00