mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
71 lines
2.1 KiB
Plaintext
71 lines
2.1 KiB
Plaintext
---
|
|
title: "Transformers Fallback in SGLang"
|
|
---
|
|
|
|
`sglang` can fall back to using models that are available in `transformers`. This works for most decoder-style language models and support for vision-language models is coming soon!
|
|
|
|
## Example Launch Command
|
|
|
|
By default, we will use sglang implementation if it is available. Otherwise, we will fall back to transformers one. However, you can switch the implementation by setting `--model-impl` to `transformers`.
|
|
|
|
<CodeGroup>
|
|
```shell Launch Server
|
|
python3 -m sglang.launch_server \
|
|
--model-path meta-llama/Llama-3.2-1B-Instruct \
|
|
--host 0.0.0.0 \
|
|
--port 30000 \
|
|
--model-impl transformers
|
|
```
|
|
</CodeGroup>
|
|
|
|
## Supported Features
|
|
|
|
### Quantization
|
|
|
|
Transformers fallback has supported most of available quantization in SGLang (except GGUF). See the [Quantization page](../advanced_features/quantization) for more information about supported quantization in SGLang.
|
|
|
|
### Remote Code
|
|
|
|
This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!
|
|
|
|
A model just needs the following two things:
|
|
|
|
<CodeGroup>
|
|
```python Required Implementation
|
|
from transformers import PreTrainedModel
|
|
from torch import nn
|
|
|
|
class MyAttention(nn.Module):
|
|
def forward(self, hidden_states, **kwargs): # <- kwargs are required
|
|
...
|
|
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
|
attn_output, attn_weights = attention_interface(
|
|
self,
|
|
query_states,
|
|
key_states,
|
|
value_states,
|
|
**kwargs,
|
|
)
|
|
...
|
|
|
|
class MyModel(PreTrainedModel):
|
|
_supports_attention_backend = True
|
|
```
|
|
</CodeGroup>
|
|
|
|
Here is what happens in the background:
|
|
|
|
1. **Load the config**
|
|
|
|
The config is loaded.
|
|
|
|
2. **Load the model class**
|
|
|
|
`MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
|
|
|
|
3. **Use the TransformersModel backend**
|
|
|
|
The `TransformersModel` backend is used. See `/srt/models/transformers`, which leverages `self.config._attn_implementation = "sglang"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.
|
|
|
|
That's it!
|