turboderp
c2aac982e4
Globally set Torch number of threads to 1
2024-06-17 00:39:16 +02:00
turboderp
5b1b8d4169
Q GEMM: Initialize with bias when possible
2024-06-17 00:37:36 +02:00
turboderp
a2b2684e9a
Paged attn: Skip some flash-attn wrapper code
2024-06-17 00:34:52 +02:00
turboderp
843cec5206
Non-blocking host-device copies in forward pass
2024-06-16 19:18:01 +02:00
turboderp
522cab53fa
QMLP: Skip .view
2024-06-16 19:14:47 +02:00
turboderp
22d6823f98
Only convert blocked_tokens set to list once
2024-06-16 16:41:17 +02:00
turboderp
ec804a0291
Don't apply temperature in AVX2 softmax when temperature == 1
2024-06-16 16:14:58 +02:00
turboderp
67c270c724
Improve AVX2 softmax approximation
2024-06-16 16:13:42 +02:00
turboderp
3f805f511a
Unpin logit/ID buffers (pinning doesn't improve performance and is potentially problematic)
2024-06-16 14:56:52 +02:00
turboderp
4dc5ad127b
Propagate max logit from softmax to top-K sampler to skip search when top_k==1
2024-06-16 14:11:23 +02:00
turboderp
cf864726c4
Use -Ofast for gcc
2024-06-16 14:08:19 +02:00
turboderp
87085771e5
HIP: Add fallback for __stwb and __stcg
2024-06-15 12:35:28 +02:00
turboderp
9a3bbe91f0
Dynamic gen: Fix fallback mode for long prompts
2024-06-15 03:42:57 +02:00
turboderp
a7a751d966
Add bulk inference example
2024-06-14 00:45:46 +02:00
turboderp
5d5d57083e
Increase quant tolerance slightly (for small Qwen2 models esp.)
2024-06-13 20:44:51 +02:00
turboderp
60eedf4622
Add exit status code for quant error
2024-06-13 20:43:49 +02:00
turboderp
9f53341cbc
Q cache: Dequant after QKV projection to increase L2 cache hits in attn
2024-06-10 13:03:14 +02:00
turboderp
8a4e0ce12d
Q cache: Add cache hints
2024-06-10 00:22:41 +02:00
turboderp
f5981e9615
Q cache: Skip kernel launch when no sequences in paged batch have past tokens
2024-06-10 00:19:27 +02:00
turboderp
3a3e69fd16
Bump to v0.1.5
v0.1.5
2024-06-09 02:15:09 +02:00
turboderp
675450d845
Add Q6 and Q8 cache options to eval scripts
2024-06-09 02:13:06 +02:00
turboderp
f3596fc0d9
Add Q6 cache mode
2024-06-09 01:23:50 +02:00
turboderp
f6abbba183
Add Q8 cache option to example chatbot
2024-06-08 22:40:12 +02:00
turboderp
6030517a6f
Option to resume conversion job with no other args
2024-06-08 22:15:41 +02:00
turboderp
de05ac696b
Add more sh tags
2024-06-08 20:41:34 +02:00
Timon Käch
95c16a8bc8
Make comments real comments (#491)
Co-authored-by: turboderp <11859846+turboderp@users.noreply.github.com>
2024-06-08 20:39:44 +02:00
turboderp
513c030935
Bump wheels from PyTorch 2.3.0 to 2.3.1
2024-06-08 20:32:01 +02:00
turboderp
291ebf5e2f
Update safetensors req
2024-06-08 20:31:32 +02:00
turboderp
713c35b7b4
Merge branch 'refs/heads/master' into dev
# Conflicts:
# .github/workflows/build-wheels-release.yml
# .github/workflows/build_wheels_release_python312test.yml
2024-06-08 20:27:29 +02:00
turboderp
5ca51dd5d8
Dynamic generator writeup
2024-06-08 20:26:52 +02:00
turboderp
e4ef7cfef2
Docs for eval scripts
2024-06-08 15:48:20 +02:00
turboderp
40c037ff16
Merge remote-tracking branch 'origin/master'
2024-06-08 15:39:40 +02:00
turboderp
fb61a817ec
Add Q8 cache mode
2024-06-08 15:33:19 +02:00
Brian Dashore
b1c9020c2d
Update Actions (#497)
* Actions: Python 3.12: Override with VS 2022 17.9
Github Actions updated their runner image with VS 17.10 which is
incompatible with older versions of CUDA. Force a downgrade to 17.9
and build.
Signed-off-by: kingbri <bdashore3@proton.me>
* Actions: Update to VS 2022 17.9
Github Actions updated their Windows runner image with VS 17.10 which is
incompatible with older versions of CUDA. Force a downgrade to 17.9
and build.
Signed-off-by: kingbri <bdashore3@proton.me>
* Actions: Add Python 3.12 to releases
Python 3.12 ExllamaV2 is stable, so add it to builds.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-08 00:51:57 +02:00
turboderp
34677da2b9
Try vs build tools instead
2024-06-07 16:57:37 +02:00
turboderp
cd754389a7
Enable SDPA for torch>=2.3.0 since it now supports lower-right masking
2024-06-07 14:32:34 +02:00
turboderp
90796477c4
Attempt to uninstall VS2022
2024-06-07 12:11:51 +02:00
turboderp
91681732f4
Downgrade VS2022 enterprise
2024-06-07 11:52:00 +02:00
turboderp
afd0853212
Try to downgrade to VS 17.9
2024-06-07 02:26:22 +02:00
turboderp
9902e446a5
Update build_wheels_release_python312test.yml
2024-06-07 01:38:40 +02:00
turboderp
a9d5831264
Update build_wheels_release_python312test.yml
2024-06-07 01:07:55 +02:00
turboderp
99fa4bbda1
Update build_wheels_release_python312test.yml
2024-06-07 00:41:25 +02:00
turboderp
8e098e57ce
Upgrade numpy req to 1.26.4
2024-06-07 00:30:28 +02:00
turboderp
dad511afb0
Update build_wheels_release_python312test.yml
2024-06-07 00:29:06 +02:00
turboderp
5b5f162395
Test Python 3.12 build
2024-06-07 00:08:26 +02:00
turboderp
4af022a3c1
Merge branch 'refs/heads/master' into dev
2024-06-06 17:25:01 +02:00
turboderp
01d01a14fe
Merge remote-tracking branch 'origin/master'
2024-06-06 17:24:49 +02:00
RodriMora
7fad4f3ec2
Added steps to benchmark in README (#488)
* Added steps to benchmark using the included mmlu script
* Added steps to benchmark using the included mmlu script
2024-06-06 17:22:57 +02:00
turboderp
4dea0c2451
Shuffle option for MMLU eval
2024-06-06 11:54:25 +02:00
turboderp
d053e9ea80
Fix defrag for Q4 cache
2024-06-06 03:32:15 +02:00