sglang/docs_new/docs/supported-models/transformers-fallback.mdx

---
title: "Transformers Fallback in SGLang"
---

`sglang` can fall back to using models that are available in `transformers`. This works for most decoder-style language models and support for vision-language models is coming soon!

## Example Launch Command

By default, we will use sglang implementation if it is available. Otherwise, we will fall back to transformers one. However, you can switch the implementation by setting `--model-impl` to `transformers`.

<CodeGroup>
```shell Launch Server
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --model-impl transformers
```
</CodeGroup>

## Supported Features

### Quantization

Transformers fallback has supported most of available quantization in SGLang (except GGUF). See the [Quantization page](../advanced_features/quantization) for more information about supported quantization in SGLang.

### Remote Code

This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!

A model just needs the following two things:

<CodeGroup>
```python Required Implementation
from transformers import PreTrainedModel
from torch import nn

class MyAttention(nn.Module):
  def forward(self, hidden_states, **kwargs): # <- kwargs are required
    ...
    attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
    attn_output, attn_weights = attention_interface(
      self,
      query_states,
      key_states,
      value_states,
      **kwargs,
    )
    ...

class MyModel(PreTrainedModel):
  _supports_attention_backend = True
```
</CodeGroup>

Here is what happens in the background:

1. **Load the config**

The config is loaded.

2. **Load the model class**

`MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.

3. **Use the TransformersModel backend**

The `TransformersModel` backend is used. See `/srt/models/transformers`, which leverages `self.config._attn_implementation = "sglang"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.

That's it!