1418 Commits

Author SHA1 Message Date
turboderp
c050aec764 Bump to 0.2.0 2024-08-28 22:56:01 +02:00
turboderp
1a82283df3 Merge branch 'refs/heads/dev' 2024-08-28 22:55:11 +02:00
turboderp
f1d8909809 Catch all exceptions for nvidia-smi and rocm-smi 2024-08-28 20:25:56 +02:00
turboderp
db14154fee Don't use default stream for logit padding mask after all 2024-08-27 22:06:19 +02:00
turboderp
8d3d4c227e Ensure logit padding happens on default stream 2024-08-27 21:48:30 +02:00
turboderp
d9f0ecc12c TP: Fix vocab split for models with odd vocab sizes 2024-08-27 21:47:49 +02:00
turboderp
7319b6ea31 Fix graph update for MLP with post layernorm 2024-08-27 20:07:21 +02:00
turboderp
4230dab3c1 Fix model_init feedback 2024-08-27 19:16:23 +02:00
turboderp
69291d1333 Fix another possible sync issues with fasttensors (for Windows) 2024-08-25 22:01:53 +02:00
turboderp
e539f7cc28 Fix another possible sync issues with fasttensors 2024-08-25 21:56:02 +02:00
turboderp
d3fe9f25d2 Unmap tensors on CPU to reduce temp VRAM overhead while loading 2024-08-25 21:15:28 +02:00
turboderp
7e15947aa1 Fix possible sync issues with fasttensors 2024-08-25 14:51:46 +02:00
turboderp
57ee846672 Fix ROCm compile 2024-08-22 14:04:20 +02:00
turboderp
b926ed4c7e Merge remote-tracking branch 'origin/master' v0.1.9 2024-08-22 13:50:47 +02:00
turboderp
42e1baeef0 Bump to 0.1.9 2024-08-22 13:48:36 +02:00
turboderp
d58d5ab723 Add some docstrings 2024-08-22 13:43:19 +02:00
turboderp
555c360798 Update TP example 2024-08-22 13:43:19 +02:00
turboderp
4117daa546 Cleanup 2024-08-22 12:49:37 +02:00
turboderp
9917403229 Bulk example: Compute immediate output tokens/second 2024-08-22 12:46:58 +02:00
turboderp
547135cc43 Skip sampling if only one token allowed by filters 2024-08-22 12:33:19 +02:00
turboderp
0978ba5f86 Optimize logit filtering in sampler 2024-08-22 11:17:45 +02:00
Michael Panchenko
f1d79c96d2 Fix in ext.py (index error on system without GPU) (#594) 2024-08-20 17:18:50 +02:00
* Update README.md
* Fix in ext.py (index error on system without GPU)
Co-authored-by: turboderp <11859846+turboderp@users.noreply.github.com>
turboderp
e7053198ba Enable defrag for paged TP cache 2024-08-20 17:10:51 +02:00
turboderp
e89dc5b762 Fix for older Torch versions 2024-08-20 14:08:35 +02:00
turboderp
15d39896f4 Fix Q cache for TP mode 2024-08-20 11:52:23 +02:00
turboderp
4c6dc5812a Disable unused code to fix compilation issues 2024-08-20 11:50:10 +02:00
turboderp
a72b73fc89 Update README.md (#593) 2024-08-20 01:00:49 +02:00
turboderp
f17feb8345 Row split + all_reduce for MLP (not faster, disabled) 2024-08-20 00:48:55 +02:00
turboderp
373bcc187e Update benchmarks 2024-08-20 00:48:55 +02:00
turboderp
b120d89b6a Permute output features of MLP up+gate with input map of MLP down 2024-08-20 00:48:54 +02:00
turboderp
507ce60d7a Add cudaEventDisableTiming for faster sync 2024-08-17 10:05:01 +02:00
turboderp
8a98b435d6 Fix bug 2024-08-17 00:18:03 +02:00
turboderp
e90bf8bc6e Update README.md 2024-08-16 21:00:20 +02:00
turboderp
61e30923e6 Update benchmarks after graphs 2024-08-15 22:59:10 +02:00
turboderp
96ebec1d8b Revert to event sync (barrier kernel is unstable) 2024-08-15 22:47:59 +02:00
turboderp
bef04b9b30 Use non-optimized unpaged TP forward when no cache provided 2024-08-15 18:44:58 +02:00
turboderp
49bf149c8a Fix device allocation for sin/cos tensors 2024-08-15 18:43:09 +02:00
turboderp
8477da8f8c Chatbot: Load draft model first 2024-08-15 18:42:37 +02:00
turboderp
c185d3fc68 Extension function for TP attn (unpaged) 2024-08-15 16:20:45 +02:00
turboderp
aa5fd98ab6 Fix sync issue 2024-08-15 15:04:53 +02:00
turboderp
6a5448407a Fix TP load bug 2024-08-15 13:33:31 +02:00
turboderp
c19e9f2002 Balance automatic tensor split more intelligently 2024-08-15 12:33:24 +02:00
turboderp
8a85865af5 Fix sync issues 2024-08-15 11:43:23 +02:00
turboderp
18157d7e3c Sync to all active devices regardless of split 2024-08-14 23:58:04 +02:00
turboderp
b30f796690 TP mode for attn layer, non-paged 2024-08-14 23:41:10 +02:00
turboderp
65b9e17c4f Add cross-device barrier kernel, remove event-based sync 2024-08-14 11:50:41 +02:00
turboderp
1c5f3de883 TP support for Qwen2 2024-08-14 11:20:14 +02:00
turboderp
b8f655dc02 Bugfix 2024-08-13 20:19:47 +02:00
turboderp
fced82e8bc Re-enable barrier after broadcast 2024-08-12 20:11:38 +02:00
turboderp
c9a5ac6e99 Double-buffered broadcast/gather 2024-08-12 20:11:05 +02:00