custom_flashinfer/docs/api/logits_processor.rst

.. _apilogitsprocessor:

flashinfer.logits_processor
===========================

A declarative, pluggable framework for building processing pipelines for LLM outputs.

.. currentmodule:: flashinfer.logits_processor

Pipeline Construction
---------------------

Use :class:`LogitsPipe` to create processing pipelines:

.. code-block:: python

    import torch
    from flashinfer.logits_processor import LogitsPipe, Temperature, Softmax, TopP, Sample

    # Create a pipeline
    pipe = LogitsPipe([
        Temperature(),      # Scale logits by temperature
        Softmax(),          # Convert logits to probabilities
        TopP(),             # Apply top-p filtering
        Sample()            # Sample from the distribution
    ])

    # Apply the pipeline
    batch_size = 4
    vocab_size = 5
    logits = torch.randn(batch_size, vocab_size, device="cuda")
    output_ids = pipe(logits, temperature=0.7, top_p=0.9)

Pipeline
--------

.. autosummary::
    :toctree: ../generated

    LogitsPipe

Processors
----------

.. autosummary::
    :toctree: ../generated

    LogitsProcessor
    Temperature
    Softmax
    TopK
    TopP
    MinP
    Sample

Types
-----

.. autosummary::
    :toctree: ../generated

    TensorType
    TaggedTensor

Customization Features
-------------

Custom Logits Processor
^^^^^^^^^^^^^^^^^^^^^^^

You can create your own logits processor by subclassing :class:`LogitsProcessor`:

.. code-block:: python

    class CustomLogitsProcessor(LogitsProcessor):

        def __init__(self, **params: Any):
            super().__init__(**params)

        def legalize(self, input_type: TensorType) -> List["Op"]:
            return [CustomOp(**self.params)]

    class CustomOp(Op):
        # Define the input and output tensor types
        IN = TensorType.LOGITS
        OUT = TensorType.LOGITS

        def __call__(self, tensor: TaggedTensor, **kwargs: Any) -> TaggedTensor:
            pass

    pipe = LogitsPipe([CustomLogitsProcessor()])  # The pipe will be compiled into [CustomOp]

Custom Fusion Rules
^^^^^^^^^^^^^^^^^^^

You can register custom fusion rules to optimize specific processor combinations:

.. code-block:: python

    def custom_fusion_guard(window: List[Op]) -> bool:
        # Whether the fusion should be applied
        return True

    def build_custom_fusion(window: List[Op]) -> Op:
        # Create a fused operator by setting the parameters etc.
        return CustomOp()

    custom_rule = FusionRule(
        pattern=(Temperature, Softmax),
        guard=custom_fusion_guard,
        build=build_custom_fusion,
        prio=20
    )

    pipe = LogitsPipe(
        [Temperature(), Softmax(), Sample()],
        custom_fusion_rules=[custom_rule]
    )   # The compiled ops in the pipeline will be [CustomOp, Sample]