Update README.md

turboderp
2026-03-07 20:48:24 +01:00
parent 0e8dd89874
commit 60afe8d983
2 changed files with 5 additions and 7 deletions


@@ -17,11 +17,10 @@ The official and recommended backend server for ExLlamaV3 is [TabbyAPI](https://
 ### ⚠️ Important
-- **Qwen3-Next** support is currently experimental and still requires some optimization, so don't expect
-optimal performance just yet. [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention) is required
-and this in turn requires Triton. [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) is supported and
-recommended but not required.
-- **Qwen3-Next** currently does not support tensor/expert parallelism.
+- **Qwen3-Next** and **Qwen3.5** can take advantage of [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), though this requires
+Triton, and performance can be shaky due to the sporadic JIT compilation it imposes. [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) is
+supported and recommended but not required.
+- **Qwen3-Next** and **Qwen3.5** currently do not support tensor/expert parallelism.
 ## Architecture support
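The README change above makes flash-linear-attention and causal-conv1d optional accelerators rather than hard requirements. A quick way to see which of them a given environment provides is to probe for the modules before loading a model; the module names below (`triton`, `fla`, `causal_conv1d`) are assumptions based on the linked repos, not ExLlamaV3's own dependency checks.

```python
# Probe for the optional acceleration backends discussed in the README.
# Module names are assumptions taken from the linked repositories.
import importlib.util

def have(module: str) -> bool:
    """Return True if `module` can be imported in this environment."""
    return importlib.util.find_spec(module) is not None

status = {name: have(name) for name in ("triton", "fla", "causal_conv1d")}
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing (optional)'}")
```

Running this before loading a Qwen3-Next checkpoint gives an early hint of whether the fast chunked path will be available or the model will fall back to a slower implementation.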


@@ -616,8 +616,7 @@ class GatedDeltaNet(Module):
         )
         # Use chunked rule when advantageous and available
-        # TODO: At least warn if chunked rule (i.e. flash-linear-attention) is not available
-        # since performance will tank on prompt ingestion
+        # TODO: Replace chunked fn with non-Triton implementation
         if seqlen >= self.num_v_heads and chunk_gated_delta_rule is not None:
             mixed_qkv = mixed_qkv.transpose(1, 2)
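The condition in the hunk above dispatches to the chunked gated-delta-rule kernel only when the sequence is long enough to benefit and flash-linear-attention imported successfully. A minimal sketch of that pattern, with the import path and the `pick_kernel` helper as illustrative assumptions rather than ExLlamaV3's actual API:

```python
# Sketch of the dispatch pattern: prefer the chunked (flash-linear-attention)
# kernel when it is importable, otherwise fall back to a recurrent path.
# The import path below is an assumption about the fla package layout.
try:
    from fla.ops.gated_delta_rule import chunk_gated_delta_rule  # needs Triton
except ImportError:
    chunk_gated_delta_rule = None

def pick_kernel(seqlen: int, num_v_heads: int) -> str:
    """Return which implementation would run for a given sequence length."""
    if seqlen >= num_v_heads and chunk_gated_delta_rule is not None:
        return "chunked"    # batch-chunked kernel, fast prompt ingestion
    return "recurrent"      # step-by-step fallback, slower on long prompts
```

Short sequences (below `num_v_heads`) take the recurrent path regardless, since the chunked kernel's setup cost outweighs its throughput advantage there; this also explains the removed TODO about prompt-ingestion performance tanking when flash-linear-attention is absent.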