exllamav2

mirror of https://github.com/turboderp-org/exllamav2.git synced 2026-05-05 21:51:21 +00:00

Author	SHA1	Message	Date
turboderp	c2aac982e4	Globally set Torch number of threads to 1	2024-06-17 00:39:16 +02:00
turboderp	5b1b8d4169	Q GEMM: Initialize with bias when possible	2024-06-17 00:37:36 +02:00
turboderp	a2b2684e9a	Paged attn: Skip some flash-attn wrapper code	2024-06-17 00:34:52 +02:00
turboderp	843cec5206	Non-blocking host-device copies in forward pass	2024-06-16 19:18:01 +02:00
turboderp	522cab53fa	QMLP: Skip .view	2024-06-16 19:14:47 +02:00
turboderp	22d6823f98	Only convert blocked_tokens set to list once	2024-06-16 16:41:17 +02:00
turboderp	ec804a0291	Don't apply temperature in AVX2 softmax when temperature == 1	2024-06-16 16:14:58 +02:00
turboderp	67c270c724	Improve AVX2 softmax approximation	2024-06-16 16:13:42 +02:00
turboderp	3f805f511a	Unpin logit/ID buffers (pinning doesn't improve performance and is potentially problematic)	2024-06-16 14:56:52 +02:00
turboderp	4dc5ad127b	Propagate max logit from softmax to top-K sampler to skip search when top_k==1	2024-06-16 14:11:23 +02:00
turboderp	cf864726c4	Use -Ofast for gcc	2024-06-16 14:08:19 +02:00
turboderp	87085771e5	HIP: Add fallback for __stwb and __stcg	2024-06-15 12:35:28 +02:00
turboderp	9a3bbe91f0	Dynamic gen: Fix fallback mode for long prompts	2024-06-15 03:42:57 +02:00
turboderp	a7a751d966	Add bulk inference example	2024-06-14 00:45:46 +02:00
turboderp	5d5d57083e	Increase quant tolerance slightly (for small Qwen2 models esp.)	2024-06-13 20:44:51 +02:00
turboderp	60eedf4622	Add exit status code for quant error	2024-06-13 20:43:49 +02:00
turboderp	9f53341cbc	Q cache: Dequant after QKV projection to increase L2 cache hits in attn	2024-06-10 13:03:14 +02:00
turboderp	8a4e0ce12d	Q cache: Add cache hints	2024-06-10 00:22:41 +02:00
turboderp	f5981e9615	Q cache: Skip kernel launch when no sequences in paged batch have past tokens	2024-06-10 00:19:27 +02:00
turboderp	3a3e69fd16	Bump to v0.1.5	2024-06-09 02:15:09 +02:00
turboderp	675450d845	Add Q6 and Q8 cache options to eval scripts	2024-06-09 02:13:06 +02:00
turboderp	f3596fc0d9	Add Q6 cache mode	2024-06-09 01:23:50 +02:00
turboderp	f6abbba183	Add Q8 cache option to example chatbot	2024-06-08 22:40:12 +02:00
turboderp	6030517a6f	Option to resume conversion job with no other args	2024-06-08 22:15:41 +02:00
turboderp	de05ac696b	Add more sh tags	2024-06-08 20:41:34 +02:00
Timon Käch	95c16a8bc8	Make comments real comments (#491) Co-authored-by: turboderp <11859846+turboderp@users.noreply.github.com>	2024-06-08 20:39:44 +02:00
turboderp	513c030935	Bump wheels from PyTorch 2.3.0 to 2.3.1	2024-06-08 20:32:01 +02:00
turboderp	291ebf5e2f	Update safetensore req	2024-06-08 20:31:32 +02:00
turboderp	713c35b7b4	Merge branch 'refs/heads/master' into dev # Conflicts: # .github/workflows/build-wheels-release.yml # .github/workflows/build_wheels_release_python312test.yml	2024-06-08 20:27:29 +02:00
turboderp	5ca51dd5d8	Dynamic generator writeup	2024-06-08 20:26:52 +02:00
turboderp	e4ef7cfef2	Docs for eval scripts	2024-06-08 15:48:20 +02:00
turboderp	40c037ff16	Merge remote-tracking branch 'origin/master'	2024-06-08 15:39:40 +02:00
turboderp	fb61a817ec	Add Q8 cache mode	2024-06-08 15:33:19 +02:00
Brian Dashore	b1c9020c2d	Update Actions (#497) * Actions: Python 3.12: Override with VS 2022 17.9 Github Actions updated their runner image with VS 17.10 which is incompatible with older versions of CUDA. Force a downgrade to 17.9 and build. Signed-off-by: kingbri <bdashore3@proton.me> * Actions: Update to VS 2022 17.9 Github Actions updated their Windows runner image with VS 17.10 which is incompatible with older versions of CUDA. Force a downgrade to 17.9 and build. Signed-off-by: kingbri <bdashore3@proton.me> * Actions: Add Python 3.12 to releases Python 3.12 ExllamaV2 is stable. So, add it into builds. Signed-off-by: kingbri <bdashore3@proton.me> --------- Signed-off-by: kingbri <bdashore3@proton.me>	2024-06-08 00:51:57 +02:00
turboderp	34677da2b9	Try vs build tools instead	2024-06-07 16:57:37 +02:00
turboderp	cd754389a7	Enable SDPA for torch>=2.3.0 since it now supports lower-right masking	2024-06-07 14:32:34 +02:00
turboderp	90796477c4	Attempt to uninstall VS2022	2024-06-07 12:11:51 +02:00
turboderp	91681732f4	Downgrade VS2022 enterprise	2024-06-07 11:52:00 +02:00
turboderp	afd0853212	Try to downgrade to VS 17.9	2024-06-07 02:26:22 +02:00
turboderp	9902e446a5	Update build_wheels_release_python312test.yml	2024-06-07 01:38:40 +02:00
turboderp	a9d5831264	Update build_wheels_release_python312test.yml	2024-06-07 01:07:55 +02:00
turboderp	99fa4bbda1	Update build_wheels_release_python312test.yml	2024-06-07 00:41:25 +02:00
turboderp	8e098e57ce	upgrade numpy req to 1.26.4	2024-06-07 00:30:28 +02:00
turboderp	dad511afb0	Update build_wheels_release_python312test.yml	2024-06-07 00:29:06 +02:00
turboderp	5b5f162395	Test Python 3.12 build	2024-06-07 00:08:26 +02:00
turboderp	4af022a3c1	Merge branch 'refs/heads/master' into dev	2024-06-06 17:25:01 +02:00
turboderp	01d01a14fe	Merge remote-tracking branch 'origin/master'	2024-06-06 17:24:49 +02:00
RodriMora	7fad4f3ec2	Added steps to benchmark in README (#488) * Added steps to benchmark using the included mmlu script * Added steps to benchmark using the included mmlu script	2024-06-06 17:22:57 +02:00
turboderp	4dea0c2451	Shuffle option for MMLU eval	2024-06-06 11:54:25 +02:00
turboderp	d053e9ea80	Fix defrag for Q4 cache	2024-06-06 03:32:15 +02:00

1030 Commits