Commit Graph

1030 Commits

Author SHA1 Message Date
turboderp
c2aac982e4 Globally set Torch number of threads to 1 2024-06-17 00:39:16 +02:00
turboderp
5b1b8d4169 Q GEMM: Initialize with bias when possible 2024-06-17 00:37:36 +02:00
turboderp
a2b2684e9a Paged attn: Skip some flash-attn wrapper code 2024-06-17 00:34:52 +02:00
turboderp
843cec5206 Non-blocking host-device copies in forward pass 2024-06-16 19:18:01 +02:00
turboderp
522cab53fa QMLP: Skip .view 2024-06-16 19:14:47 +02:00
turboderp
22d6823f98 Only convert blocked_tokens set to list once 2024-06-16 16:41:17 +02:00
turboderp
ec804a0291 Don't apply temperature in AVX2 softmax when temperature == 1 2024-06-16 16:14:58 +02:00
turboderp
67c270c724 Improve AVX2 softmax approximation 2024-06-16 16:13:42 +02:00
turboderp
3f805f511a Unpin logit/ID buffers (pinning doesn't improve performance and is potentially problematic) 2024-06-16 14:56:52 +02:00
turboderp
4dc5ad127b Propagate max logit from softmax to top-K sampler to skip search when top_k==1 2024-06-16 14:11:23 +02:00
turboderp
cf864726c4 Use -Ofast for gcc 2024-06-16 14:08:19 +02:00
turboderp
87085771e5 HIP: Add fallback for __stwb and __stcg 2024-06-15 12:35:28 +02:00
turboderp
9a3bbe91f0 Dynamic gen: Fix fallback mode for long prompts 2024-06-15 03:42:57 +02:00
turboderp
a7a751d966 Add bulk inference example 2024-06-14 00:45:46 +02:00
turboderp
5d5d57083e Increase quant tolerance slightly (for small Qwen2 models esp.) 2024-06-13 20:44:51 +02:00
turboderp
60eedf4622 Add exit status code for quant error 2024-06-13 20:43:49 +02:00
turboderp
9f53341cbc Q cache: Dequant after QKV projection to increase L2 cache hits in attn 2024-06-10 13:03:14 +02:00
turboderp
8a4e0ce12d Q cache: Add cache hints 2024-06-10 00:22:41 +02:00
turboderp
f5981e9615 Q cache: Skip kernel launch when no sequences in paged batch have past tokens 2024-06-10 00:19:27 +02:00
turboderp
3a3e69fd16 Bump to v0.1.5 (tag: v0.1.5) 2024-06-09 02:15:09 +02:00
turboderp
675450d845 Add Q6 and Q8 cache options to eval scripts 2024-06-09 02:13:06 +02:00
turboderp
f3596fc0d9 Add Q6 cache mode 2024-06-09 01:23:50 +02:00
turboderp
f6abbba183 Add Q8 cache option to example chatbot 2024-06-08 22:40:12 +02:00
turboderp
6030517a6f Option to resume conversion job with no other args 2024-06-08 22:15:41 +02:00
turboderp
de05ac696b Add more sh tags 2024-06-08 20:41:34 +02:00
Timon Käch
95c16a8bc8 Make comments real comments (#491)
Co-authored-by: turboderp <11859846+turboderp@users.noreply.github.com>
2024-06-08 20:39:44 +02:00
turboderp
513c030935 Bump wheels from PyTorch 2.3.0 to 2.3.1 2024-06-08 20:32:01 +02:00
turboderp
291ebf5e2f Update safetensors req 2024-06-08 20:31:32 +02:00
turboderp
713c35b7b4 Merge branch 'refs/heads/master' into dev
# Conflicts:
#	.github/workflows/build-wheels-release.yml
#	.github/workflows/build_wheels_release_python312test.yml
2024-06-08 20:27:29 +02:00
turboderp
5ca51dd5d8 Dynamic generator writeup 2024-06-08 20:26:52 +02:00
turboderp
e4ef7cfef2 Docs for eval scripts 2024-06-08 15:48:20 +02:00
turboderp
40c037ff16 Merge remote-tracking branch 'origin/master' 2024-06-08 15:39:40 +02:00
turboderp
fb61a817ec Add Q8 cache mode 2024-06-08 15:33:19 +02:00
Brian Dashore
b1c9020c2d Update Actions (#497)
* Actions: Python 3.12: Override with VS 2022 17.9

Github Actions updated their runner image with VS 17.10 which is
incompatible with older versions of CUDA. Force a downgrade to 17.9
and build.

Signed-off-by: kingbri <bdashore3@proton.me>

* Actions: Update to VS 2022 17.9

Github Actions updated their Windows runner image with VS 17.10 which is
incompatible with older versions of CUDA. Force a downgrade to 17.9
and build.

Signed-off-by: kingbri <bdashore3@proton.me>

* Actions: Add Python 3.12 to releases

Python 3.12 ExllamaV2 is stable. So, add it into builds.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-08 00:51:57 +02:00
turboderp
34677da2b9 Try vs build tools instead 2024-06-07 16:57:37 +02:00
turboderp
cd754389a7 Enable SDPA for torch>=2.3.0 since it now supports lower-right masking 2024-06-07 14:32:34 +02:00
turboderp
90796477c4 Attempt to uninstall VS2022 2024-06-07 12:11:51 +02:00
turboderp
91681732f4 Downgrade VS2022 enterprise 2024-06-07 11:52:00 +02:00
turboderp
afd0853212 Try to downgrade to VS 17.9 2024-06-07 02:26:22 +02:00
turboderp
9902e446a5 Update build_wheels_release_python312test.yml 2024-06-07 01:38:40 +02:00
turboderp
a9d5831264 Update build_wheels_release_python312test.yml 2024-06-07 01:07:55 +02:00
turboderp
99fa4bbda1 Update build_wheels_release_python312test.yml 2024-06-07 00:41:25 +02:00
turboderp
8e098e57ce upgrade numpy req to 1.26.4 2024-06-07 00:30:28 +02:00
turboderp
dad511afb0 Update build_wheels_release_python312test.yml 2024-06-07 00:29:06 +02:00
turboderp
5b5f162395 Test Python 3.12 build 2024-06-07 00:08:26 +02:00
turboderp
4af022a3c1 Merge branch 'refs/heads/master' into dev 2024-06-06 17:25:01 +02:00
turboderp
01d01a14fe Merge remote-tracking branch 'origin/master' 2024-06-06 17:24:49 +02:00
RodriMora
7fad4f3ec2 Added steps to benchmark in README (#488)
* Added steps to benchmark using the included mmlu script
2024-06-06 17:22:57 +02:00
turboderp
4dea0c2451 Shuffle option for MMLU eval 2024-06-06 11:54:25 +02:00
turboderp
d053e9ea80 Fix defrag for Q4 cache 2024-06-06 03:32:15 +02:00