Update README.md

turboderp
2024-05-25 22:50:36 +02:00
parent f6e8495e58
commit e6f230bf06
3 changed files with 72 additions and 36 deletions


@@ -3,17 +3,57 @@
ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.
## New in v0.1.0:
- ExLlamaV2 now supports paged attention via [Flash Attention](https://github.com/Dao-AILab/flash-attention) 2.5.7+
- New generator with dynamic batching, smart prompt caching, K/V cache deduplication and simplified API
![alt_text](doc/dynamic_gen.gif)
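Smart prompt caching and K/V cache deduplication build on the paged cache: because the cache is split into fixed-size pages, sequences that share a prefix can point at the same physical pages. The following is a toy sketch of that idea only (not exllamav2's implementation; the class and names here are made up for illustration):

```python
# Toy illustration of page-level K/V cache deduplication: the cache is split
# into fixed-size pages, and a page whose entire preceding prefix matches one
# already stored can be shared between sequences instead of recomputed.

PAGE_SIZE = 4  # real implementations use much larger pages, e.g. 256 tokens

class PagedCacheSketch:
    def __init__(self):
        self.pages = {}      # prefix -> page id
        self.next_page = 0
        self.allocated = 0   # number of pages actually stored

    def assign_pages(self, tokens):
        """Map a token sequence to page ids, reusing any page whose full
        prefix is already cached."""
        page_ids = []
        full_pages = len(tokens) - len(tokens) % PAGE_SIZE
        for i in range(0, full_pages, PAGE_SIZE):
            # Key on the entire prefix, not just this page's tokens: a page is
            # only reusable if everything before it matched as well.
            key = tuple(tokens[:i + PAGE_SIZE])
            if key not in self.pages:
                self.pages[key] = self.next_page
                self.next_page += 1
                self.allocated += 1
            page_ids.append(self.pages[key])
        return page_ids

cache = PagedCacheSketch()
a = cache.assign_pages([1, 2, 3, 4, 5, 6, 7, 8])     # two full pages
b = cache.assign_pages([1, 2, 3, 4, 9, 10, 11, 12])  # shares the first page
print(a, b, cache.allocated)  # the first page id is shared; 3 pages stored
```

The practical effect is that many prompts with a common system prompt or shared history only pay for their divergent suffixes.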
## Dynamic generator examples
The dynamic generator supports all inference, sampling and speculative decoding features of the previous two
generators, consolidated into one API (with the exception of the FP8 cache; the Q4 cache mode is supported instead and
performs better anyway, see [here](doc/qcache_eval.md)).
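For context on why a 4-bit cache holds up well: group-wise quantization keeps a scale and minimum per small group of values, which bounds the rounding error within each group. A toy sketch of the general idea follows; this is an illustration only, not exllamav2's actual kernel or packed storage layout, and the group size is made up:

```python
# Toy sketch of group-wise 4-bit quantization: each group of values is mapped
# to 16 levels (4 bits) spanning that group's own min..max range, so outliers
# in one group do not degrade precision in any other group.

GROUP = 8  # elements per quantization group; real kernels use e.g. 32

def q4_roundtrip(values):
    """Quantize a list of floats to 4-bit codes per group, then dequantize."""
    out = []
    for g in range(0, len(values), GROUP):
        group = values[g:g + GROUP]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 4 bits -> 16 levels (codes 0..15)
        codes = [round((v - lo) / scale) for v in group]
        out.extend(lo + c * scale for c in codes)
    return out

vals = [0.0, 0.3, 1.0, 1.6, 2.0, 2.5, 3.1, 3.75]
deq = q4_roundtrip(vals)
err = max(abs(a - b) for a, b in zip(vals, deq))
print(err)  # worst-case error stays within half a quantization step
```

Because each group's error is bounded by half a step of that group's own range, the reconstruction stays close to the original even at 4 bits.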
- Single generation:
```python
output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)
```
- Batched generation:
```python
outputs = generator.generate(
    prompt = [
        "Hello, my name is",
        "Once upon a time,",
        "Large language models are",
    ],
    max_new_tokens = 200
)
```
- Streamed generation with `asyncio`:
```python
job = ExLlamaV2DynamicJobAsync(
    generator,
    input_ids = tokenizer.encode("You can lead a horse to water"),
    banned_strings = ["make it drink"],
    gen_settings = ExLlamaV2Sampler.Settings.greedy(),
    max_new_tokens = 200
)
async for result in job:
    text = result.get("text", "")
    print(text, end = "")
See the full, updated examples [here](https://github.com/turboderp/exllamav2/tree/master/examples).
- Faster, better kernels
- Cleaner and more versatile codebase
- Support for a new quant format (see below)
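The scheduling win from dynamic batching can be sketched with a toy simulation (not exllamav2 code; job lengths and refill order here are arbitrary): with static batches every slot waits for the longest sequence in its batch, while dynamic batching refills a finished slot immediately from the queue.

```python
# Toy comparison of static vs. dynamic (continuous) batching, counting
# generation steps needed to finish a set of jobs with given output lengths.

def static_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest job finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def dynamic_steps(lengths, batch_size):
    """Dynamic batching: a freed slot is refilled right away from the queue.
    (Refill order is a toy choice here; real schedulers differ.)"""
    queue = sorted(lengths, reverse = True)  # pop() takes shortest first
    slots = []
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop())
        steps += 1
        slots = [r - 1 for r in slots if r > 1]
    return steps

jobs = [5, 100, 7, 3, 90, 4, 6, 80]
print(static_steps(jobs, 4), dynamic_steps(jobs, 4))  # 190 vs. 106 steps
```

Short jobs no longer ride along for the full duration of the longest job in their batch, which is where most of the throughput gain comes from.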
## Performance
Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future,
and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
| Model | Mode | Size | grpsz | act | 3090Ti | 4090 |
|------------|--------------|-------|-------|-----|---------|-------------|
@@ -33,13 +73,11 @@ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
## How to
To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio
on Windows. Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/), then run:
```
git clone https://github.com/turboderp/exllamav2
cd exllamav2
# Optionally, create and activate a new conda environment
pip install -r requirements.txt
pip install .
```
@@ -50,13 +88,11 @@ python test_inference.py -m <path_to_model> -p "Once upon a time,"
A simple console chatbot is included. Run it with:
```
python examples/chat.py -m <path_to_model> -mode llama
# Append the '--gpu_split auto' flag for multi-GPU inference
python examples/chat.py -m <path_to_model> -mode llama -gs auto
```
The `-mode` argument chooses the prompt format to use. `raw` will produce a simple chatlog-style chat that works with base
models and various other finetunes. Run with `-modes` for a list of all available prompt formats. You can also provide
a custom system prompt with `-sp`.
@@ -100,8 +136,11 @@ C++ extension in the process. Instead, the extension will be built the first tim
### Method 2: Install from release (with prebuilt extension)
Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the extension binaries. Make sure to grab
the right version, matching your platform, Python version (`cp`) and CUDA version. Crucially, you must also match
the prebuilt wheel with your PyTorch version, since the Torch C++ extension ABI breaks with every new version of
PyTorch.
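Matching a wheel means checking several tags at once. The sketch below shows the kind of check involved; the filename used is hypothetical (see the release page for real names and their exact format), and `wheel_matches` is a made-up helper, not part of any packaging tool:

```python
# Hedged sketch: verifying that a wheel filename carries the tags matching
# the local environment. The example filename is hypothetical.

def wheel_matches(filename, py_tag, cuda_tag, torch_tag, platform_tag):
    """Return True if every required environment tag appears in the name."""
    return all(t in filename for t in (py_tag, cuda_tag, torch_tag, platform_tag))

name = "exllamav2-0.1.0+cu121.torch2.3.0-cp311-cp311-linux_x86_64.whl"
print(wheel_matches(name, "cp311", "cu121", "torch2.3.0", "linux_x86_64"))  # True
print(wheel_matches(name, "cp310", "cu121", "torch2.3.0", "linux_x86_64"))  # False
```

If any one of the four tags is off, the extension binaries will fail to load, so it is worth checking all of them rather than just the CUDA version.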
Either download an appropriate wheel or install directly from the appropriate URL:
```
@@ -113,15 +152,12 @@ can also be installed this way, and it will build the extension while installing
### Method 3: Install from PyPI
A PyPI package is available as well. This is the same as the JIT version (see above). It can be installed with:
```
pip install exllamav2
```
## EXL2 quantization

BIN doc/dynamic_gen.gif (new binary file, 16 MiB; not shown)

@@ -77,10 +77,10 @@ prompts = [
    "Please guess the next 20 numbers in this sequence: " + ", ".join(str(n) for n in range(700)),
    "Write a short essay about cell membranes.",
    "What's up?",
    # "How do I open a can of beans?",
    # "How do I open a can of soup?",
    # "How do I open a can of strawberry jam?",
    # "How do I open a can of raspberry jam?",
    "What's the tallest building in Paris?",
    "What's the most populous nation on Earth?",
    "What's the most populous nation on Mars?",
@@ -90,25 +90,25 @@ prompts = [
    "Who is Waldo?",
    "Why is Waldo?",
    "Is it legal to base jump off the Eiffel Tower?",
    # "Is it legal to base jump into a volcano?",
    # "Why are cats better than dogs?",
    "Why is the Hulk so angry all the time?",
    "How do I build a time machine?",
    "What seems out of place in this sequence: " + ", ".join(str(n if n != 123 else 69) for n in range(200)),
    "Is it legal to grow your own catnip?",
    "What seems out of place in this sequence: " + ", ".join(str(n if n != 161 else 421) for n in range(400)),
    # "What's inside a black hole?",
    # "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
    # "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
    # "Is there life on Mars?",
    # "Hello!",
    # "Hi!",
    # "Boop!",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
]
term = Terminal()