Mirror of https://github.com/turboderp-org/exllamav3.git, synced 2026-03-15 00:07:24 +00:00
Update README.md
```diff
@@ -17,11 +17,10 @@ The official and recommended backend server for ExLlamaV3 is [TabbyAPI](https://
 
 ### ⚠️ Important
 
-- **Qwen3-Next** support is currently experimental and still requires some optimization, so don't expect
-  optimal performance just yet. [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention) is required
-  and this in turn requires Triton. [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) is supported and
-  recommended but not required.
-- **Qwen3-Next** currently does not support tensor/expert parallelism.
+- **Qwen3-Next** and **Qwen3.5** can take advantage of [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), though this requires
+  Triton, and performance can be shaky due to the sporadic JIT compilation it imposes. [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) is
+  supported and recommended but not required.
+- **Qwen3-Next** and **Qwen3.5** currently do not support tensor/expert parallelism.
 
 ## Architecture support
```
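The README wording above hinges on flash-linear-attention and causal-conv1d being optional dependencies. A minimal sketch of how such optional kernels are commonly probed at import time (this is an illustration, not ExLlamaV3's actual code; the import paths are the ones published by those packages):

```python
# Probe each optional dependency once at import time and fall back to None,
# so call sites can check availability with a cheap "is not None" test.
try:
    # Provided by the flash-linear-attention package (requires Triton).
    from fla.ops.gated_delta_rule import chunk_gated_delta_rule
except ImportError:
    chunk_gated_delta_rule = None

try:
    # Optional fused conv kernel; recommended but not required.
    from causal_conv1d import causal_conv1d_fn
except ImportError:
    causal_conv1d_fn = None

def backend_summary() -> dict:
    """Report which optional kernels were found in this environment."""
    return {
        "flash_linear_attention": chunk_gated_delta_rule is not None,
        "causal_conv1d": causal_conv1d_fn is not None,
    }
```

Either import may fail without breaking model load; only the fast chunked prefill path is lost.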
```diff
@@ -616,8 +616,7 @@ class GatedDeltaNet(Module):
         )
 
         # Use chunked rule when advantageous and available
         # TODO: At least warn if chunked rule (i.e. flash-linear-attention) is not available
         # since performance will tank on prompt ingestion
         # TODO: Replace chunked fn with non-Triton implementation
         if seqlen >= self.num_v_heads and chunk_gated_delta_rule is not None:
             mixed_qkv = mixed_qkv.transpose(1, 2)
```
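The condition in the hunk above dispatches between two execution paths. A hedged, simplified sketch of that dispatch logic (names and return values are illustrative, not ExLlamaV3's implementation): the chunked kernel amortizes work over many positions, so it pays off during prompt ingestion, while single-token decoding, or a missing flash-linear-attention install, falls through to the step-by-step recurrent rule.

```python
def pick_delta_rule(seqlen: int, num_v_heads: int, chunk_fn=None) -> str:
    """Choose the gated-delta-rule path, mirroring the guard in the diff:
    use the chunked kernel only when the sequence is long enough to
    amortize its overhead AND the kernel was actually importable."""
    if seqlen >= num_v_heads and chunk_fn is not None:
        return "chunked"
    return "recurrent"

# Long prompt with the kernel available -> chunked prefill path.
print(pick_delta_rule(1024, 16, chunk_fn=object()))   # chunked
# Single-token decode, or kernel unavailable -> recurrent path.
print(pick_delta_rule(1, 16, chunk_fn=object()))      # recurrent
print(pick_delta_rule(1024, 16, chunk_fn=None))       # recurrent
```

This is why the TODO warns that prompt ingestion "will tank" without flash-linear-attention: every long-sequence call silently takes the recurrent branch.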