Mirror of https://github.com/turboderp-org/exllamav3.git, synced 2026-03-15 00:07:24 +00:00
Update README.md
```diff
@@ -17,11 +17,10 @@ The official and recommended backend server for ExLlamaV3 is [TabbyAPI](https://
 
 ### ⚠️ Important
 
-- **Qwen3-Next** support is currently experimental and still requires some optimization, so don't expect
-  optimal performance just yet. [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention) is required
-  and this in turn requires Triton. [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) is supported and
-  recommended but not required.
-- **Qwen3-Next** currently does not support tensor/expert parallelism.
+- **Qwen3-Next** and **Qwen3.5** can take advantage of [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), though this requires
+  Triton, and performance can be shaky due to the sporadic JIT compilation it imposes. [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) is
+  supported and recommended but not required.
+- **Qwen3-Next** and **Qwen3.5** currently do not support tensor/expert parallelism.
 
 ## Architecture support
```
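The README wording above hinges on flash-linear-attention and causal-conv1d being optional dependencies. A minimal sketch of how such optional kernels are commonly probed at import time (this is an illustration, not ExLlamaV3's actual code; the import paths are the ones published by those packages):

```python
# Probe each optional dependency once at import time and fall back to None,
# so call sites can check availability with a cheap "is not None" test.
try:
    # Provided by the flash-linear-attention package (requires Triton).
    from fla.ops.gated_delta_rule import chunk_gated_delta_rule
except ImportError:
    chunk_gated_delta_rule = None

try:
    # Optional fused conv kernel; recommended but not required.
    from causal_conv1d import causal_conv1d_fn
except ImportError:
    causal_conv1d_fn = None

def backend_summary() -> dict:
    """Report which optional kernels were found in this environment."""
    return {
        "flash_linear_attention": chunk_gated_delta_rule is not None,
        "causal_conv1d": causal_conv1d_fn is not None,
    }
```

Either import may fail without breaking model load; only the fast chunked prefill path is lost.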
```diff
@@ -616,8 +616,7 @@ class GatedDeltaNet(Module):
         )
 
         # Use chunked rule when advantageous and available
         # TODO: At least warn if chunked rule (i.e. flash-linear-attention) is not available
         # since performance will tank on prompt ingestion
         # TODO: Replace chunked fn with non-Triton implementation
         if seqlen >= self.num_v_heads and chunk_gated_delta_rule is not None:
             mixed_qkv = mixed_qkv.transpose(1, 2)
```
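The condition in the hunk above dispatches between two execution paths. A hedged, simplified sketch of that dispatch logic (names and return values are illustrative, not ExLlamaV3's implementation): the chunked kernel amortizes work over many positions, so it pays off during prompt ingestion, while single-token decoding, or a missing flash-linear-attention install, falls through to the step-by-step recurrent rule.

```python
def pick_delta_rule(seqlen: int, num_v_heads: int, chunk_fn=None) -> str:
    """Choose the gated-delta-rule path, mirroring the guard in the diff:
    use the chunked kernel only when the sequence is long enough to
    amortize its overhead AND the kernel was actually importable."""
    if seqlen >= num_v_heads and chunk_fn is not None:
        return "chunked"
    return "recurrent"

# Long prompt with the kernel available -> chunked prefill path.
print(pick_delta_rule(1024, 16, chunk_fn=object()))   # chunked
# Single-token decode, or kernel unavailable -> recurrent path.
print(pick_delta_rule(1, 16, chunk_fn=object()))      # recurrent
print(pick_delta_rule(1024, 16, chunk_fn=None))       # recurrent
```

This is why the TODO warns that prompt ingestion "will tank" without flash-linear-attention: every long-sequence call silently takes the recurrent branch.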