v4.4 tag release update. (#3032)

This commit is contained in:
Junkai-Wu
2026-02-14 12:27:58 +08:00
committed by GitHub
parent 01687cfba1
commit d4bbf728ca
140 changed files with 41624 additions and 3691 deletions

View File

@@ -20,3 +20,4 @@ CuTe DSL
Deprecation Policy <deprecation.rst>
Compile with TVM FFI <cute_dsl_general/compile_with_tvm_ffi.rst>
Ahead-of-Time (AOT) Compilation <cute_dsl_general/dsl_ahead_of_time_compilation.rst>
Talks and Presentations <cute_dsl_general/resources.rst>

View File

@@ -112,9 +112,6 @@ For compiled kernels, the generated PTX/CUBIN/IR can be accessed programmaticall
- ``__cubin__``: The generated CUBIN data of the compiled kernel.
- ``__mlir__``: The generated IR code of the compiled kernel.
These attributes are populated only when the corresponding ``CUTE_DSL_KEEP_*`` environment variable is enabled;
otherwise they return ``None``.
.. code:: python

   compiled_foo = cute.compile(foo, ...)

View File

@@ -236,4 +236,4 @@ For more information, see the section "Exporting Compiled Module" in :doc:`compi
The primary distinction is that, when TVM FFI is enabled, |DSL| generates a dedicated wrapper function on top of the underlying CuTe ABI. This wrapper adheres to the calling conventions defined by TVM FFI.
In contrast, the CuTe ABI entry function is specified directly in the generated header file, which affects how arguments must be provided.
For instance, with the TVM FFI wrapper function, users are able to pass in arguments such as ``torch.Tensor`` directly. However, when calling the CuTe ABI entry function, arguments should be provided as ``cute.Tensor`` types.

View File

@@ -7,11 +7,24 @@ End-to-End Code Generation
==========================
1. Hybrid DSL: Python Metaprogramming, Structured GPU Code
----------------------------------------------------------
|DSL| is a **hybrid DSL** that combines two compilation techniques: *AST rewrite*
and *tracing*. This combination gives you the best of both worlds:
* **Program structure is preserved** — control flow (loops, branches) is
captured via AST rewrite, compiling to proper structured code instead of
flattened traces.
* **Python stays Python** — arithmetic and tensor operations are captured via
tracing, so dynamic shapes, metaprogramming, and Python's rich expression
language work naturally.
To understand why this matters, let's look at each technique.
1.1 AST Rewrite
^^^^^^^^^^^^^^^
The function's abstract syntax tree is analysed **before** execution.
Python control-flow (``for``/``while``, ``if``/``else``) and built-ins are converted to structured |IR|
constructs. Computation inside each region is left untouched at this stage.
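The mechanics can be illustrated with Python's standard ``ast`` module. This is a simplified, hypothetical sketch (not |DSL|'s actual rewriter, and ``emit_loop`` is an invented callback name): each ``for`` loop is replaced by a nested function holding its body plus a call to a structure-capturing callback, while the arithmetic inside the body is left untouched.

```python
import ast
import textwrap

SOURCE = textwrap.dedent("""
    def scale(xs):
        out = []
        for i in range(len(xs)):
            out.append(xs[i] * 2)  # arithmetic: left untouched by the rewrite
        return out
    """)

class LoopRewriter(ast.NodeTransformer):
    """Replace `for <i> in <iter>: <body>` with a nested body function and a
    call to an `emit_loop` callback, so loop *structure* is captured
    explicitly instead of being executed away."""

    def visit_For(self, node):
        self.generic_visit(node)
        # Parse a template, then splice the original body and iterable into it.
        tmpl = ast.parse(
            f"def _loop_body({node.target.id}):\n"
            f"    pass\n"
            f"emit_loop(None, _loop_body)"
        ).body
        body_fn, call = tmpl
        body_fn.body = node.body          # original loop body, verbatim
        call.value.args[0] = node.iter    # original iterable expression
        return tmpl

def emit_loop(iterable, body):
    """Stand-in for the callback that would emit one structured IR loop node."""
    for i in iterable:
        body(i)

tree = LoopRewriter().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
ns = {"emit_loop": emit_loop}
exec(compile(tree, "<rewritten>", "exec"), ns)
print(ns["scale"]([1, 2, 3]))  # the rewritten loop still computes [2, 4, 6]
```

The rewritten function behaves identically, but the loop now passes through a single well-defined hook where a compiler could emit one structured loop node instead of running the iterations.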
@@ -47,11 +60,206 @@ trace that is lowered to |IR|.
* Data-dependent control-flow freezes to a single execution path.
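The branch-loss problem can be reproduced with a toy tracer. This is a hypothetical sketch (``TraceValue`` is an invented proxy class, not a |DSL| type): arithmetic on the proxy is recorded faithfully, but a Python-level branch collapses to whichever side the concrete flag selects.

```python
class TraceValue:
    """A hypothetical proxy value: every arithmetic op is appended to a trace."""
    def __init__(self, name, trace):
        self.name, self.trace = name, trace
    def __add__(self, other):
        self.trace.append(f"add({self.name}, {other})")
        return TraceValue(f"t{len(self.trace)}", self.trace)
    def __mul__(self, other):
        self.trace.append(f"mul({self.name}, {other})")
        return TraceValue(f"t{len(self.trace)}", self.trace)

def kernel(x, flag):
    # `flag` is a plain Python bool at trace time, so only one branch runs;
    # the other side is never seen by the tracer and is lost from the trace.
    if flag:
        return x + 1
    return x * 2

trace = []
kernel(TraceValue("x", trace), flag=True)
print(trace)  # ['add(x, 1)'] -- the `x * 2` branch was never recorded
```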
1.3 The Hybrid Solution
^^^^^^^^^^^^^^^^^^^^^^^
As shown above, neither technique alone is sufficient—but together they
complement each other perfectly.
**Why this works: GPU kernels are simple at runtime**
High-performance GPU kernels are structurally simple at runtime: they avoid
deep call hierarchies, complex branching, and dynamic dispatch. However,
*authoring* such kernels benefits greatly from Python's abstractions—classes,
metaprogramming, and polymorphic patterns improve readability and
maintainability.
The hybrid approach resolves this tension by evaluating Python abstractions at
compile time while emitting simple, optimized code for runtime execution.
**How |DSL| divides the work:**
1. **AST rewrite handles structure** — loops (``for``, ``while``) and branches
(``if``/``else``) are converted to structured |IR| *before* execution.
This solves tracing's control-flow problem.
2. **Tracing handles arithmetic** — inside each structured region, the tracer
records tensor operations exactly as they execute. No need to model Python's
complex semantics—just run Python and record what happens. This solves AST
rewriting's complexity problem.
The result:
* Loops compile to real loops, not unrolled traces.
* All branches are preserved, even if not taken during tracing.
* Dynamic shapes, metaprogramming, and Python idioms work naturally.
* The rewriter only needs to understand control flow, not all of Python.
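The division of labor can be sketched by combining the two toy mechanisms above: a structure callback (here the invented name ``structured_for``, standing in for what an AST rewrite would insert) emits one loop node, while a proxy value traces the arithmetic of the body exactly once.

```python
class Tracer:
    """Hypothetical op recorder shared by proxy values and structure callbacks."""
    def __init__(self):
        self.ops = []

class Val:
    """Hypothetical traced value with a symbolic name."""
    def __init__(self, name, tracer):
        self.name, self.tracer = name, tracer
    def __mul__(self, other):
        rhs = other.name if isinstance(other, Val) else other
        self.tracer.ops.append(f"mul({self.name}, {rhs})")
        return Val(f"t{len(self.tracer.ops)}", self.tracer)

def structured_for(tracer, n, body):
    """Callback a (hypothetical) AST rewrite inserts for `for i in range(n)`:
    the loop becomes one structured node, and the body is traced exactly once
    with a symbolic induction variable, not unrolled n times."""
    tracer.ops.append(f"for i in range({n}):")
    body(Val("i", tracer))
    tracer.ops.append("end_for")

tracer = Tracer()
x = Val("x", tracer)
# Source `for i in range(128): x * i` after the rewrite:
structured_for(tracer, 128, lambda i: x * i)
print(tracer.ops)  # ['for i in range(128):', 'mul(x, i)', 'end_for']
```

The trace stays three entries long no matter how large the trip count, which is exactly what pure tracing cannot achieve.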
2. |DSL| Compilation Flow: Meta-Stage to Object-Stage
------------------------------------------------------
|DSL| bridges Python and GPU hardware through a three-stage pipeline.
.. _fig-dsl-modes:
.. figure:: dsl_modes.png
   :width: 400
   :align: center

   *Left*: tracing mode records only the path that executed.
   *Right*: preprocessor mode emits structured |IR| for every branch and loop
   before tracing the arithmetic.

The default |DSL| compilation pipeline (mode 2): Python source flows through AST
preprocessing and interpreter-driven tracing to produce |IR|, which is then
lowered and compiled to device code.
**Stage 1: Pre-Staging (Python AST)**
Before any code executes, the AST preprocessor rewrites the decorated function.
It inserts *callbacks* around control-flow constructs—loops, branches, and
function boundaries—so that program structure is captured explicitly rather than
lost during execution.
**Stage 2: Meta-Stage (Python Interpreter)**
The rewritten function runs in the Python interpreter with proxy tensor
arguments. As execution proceeds:
* Callbacks fire at control-flow boundaries, emitting structured |IR| (loops,
branches, etc.).
* Tensor operations are traced: each operator invocation records the
corresponding operation.
* Compile-time constants are *partially evaluated*—values known at JIT time
fold directly into the |IR|, enabling aggressive specialization.
The result is a complete representation of the kernel, with both high-level
structure and low-level arithmetic intact.
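Partial evaluation falls out of ordinary Python execution during tracing. In this hypothetical sketch (``Dyn`` is an invented proxy class), plain Python numbers are compile-time values: they fold through normal arithmetic before any operation is recorded, so only ops touching dynamic values reach the trace.

```python
class Dyn:
    """Hypothetical dynamic (runtime-only) value; plain Python numbers act as
    compile-time constants and fold through ordinary arithmetic."""
    def __init__(self, name, ops):
        self.name, self.ops = name, ops
    def __add__(self, other):
        self.ops.append(f"add({self.name}, {other})")
        return Dyn(f"t{len(self.ops)}", self.ops)

def kernel(a, b, c):
    return a + (b + c)  # if b and c are constants, `b + c` folds before tracing

ops = []
# b and c are known at JIT time: Python evaluates 3 + 4 = 7 immediately,
# so the emitted trace contains a single add against the folded constant.
kernel(Dyn("a", ops), 3, 4)
print(ops)  # ['add(a, 7)']
```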
**Stage 3: Object-Stage (Compiler Backend)**
The internal representation passes through a lowering pipeline:
1. High-level operations are progressively lowered toward hardware-specific
representations.
2. Optimization passes (tiling, vectorization, memory promotion) reshape the
code for the target architecture.
3. The final code is translated to PTX/SASS (for NVIDIA GPUs) and assembled
into a device binary.
At runtime, the compiled kernel is loaded and launched on the accelerator.
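The lowering pipeline can be pictured as a sequence of passes, each rewriting the IR in place. This toy sketch uses a list of strings as "IR" and invented op and pass names; it only illustrates the progressive-lowering shape, not |DSL|'s actual passes.

```python
def lower_matmul(ops):
    """Lowering pass: expand a high-level op into hardware-shaped ops."""
    out = []
    for op in ops:
        if op == "matmul":
            out += ["load_tile_a", "load_tile_b", "mma"]
        else:
            out.append(op)
    return out

def vectorize(ops):
    """Optimization pass: widen tile loads to 4-element vector accesses."""
    return [op.replace("load_tile", "load_tile_vec4") for op in ops]

# Passes run in order; each consumes the previous pass's output.
PIPELINE = [lower_matmul, vectorize]

ir = ["matmul", "store"]
for pass_fn in PIPELINE:
    ir = pass_fn(ir)
print(ir)  # ['load_tile_vec4_a', 'load_tile_vec4_b', 'mma', 'store']
```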
3. Meta-Programming vs Runtime: Two Worlds in One Function
----------------------------------------------------------
A key insight for understanding |DSL| is that **your Python code runs twice**,
in two very different contexts:
1. **Meta-programming time (compilation)** — Python executes to *build* the
kernel. This happens on the host CPU when you call a ``@jit`` function.
2. **Runtime (execution)** — The compiled kernel runs on the GPU with actual
tensor data.
This distinction determines what you can observe and when.
``print()`` vs ``cute.printf()``: Meta-Stage vs Object-Stage Output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|DSL| provides two ways to print values, each operating at a different stage:
* **Python's** ``print()`` — executes during the **meta-stage** (compilation).
Use it to inspect what the compiler sees.
* ``cute.printf()`` — compiles into the kernel and executes at **runtime** on
the GPU. Use it to observe actual tensor values during execution.
The following examples demonstrate how the same ``result`` variable appears
differently depending on when and how you print it.
**Example 1: Dynamic variables (both** ``a`` **and** ``b`` **are runtime values)**
.. code-block:: python

   @cute.jit
   def add_dynamicexpr(b: cutlass.Float32):
       a = cutlass.Float32(2.0)
       result = a + b
       print("[meta-stage] result =", result)               # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_dynamicexpr(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = <Float32 proxy>
   [object-stage] result = 7.000000
At meta-stage, ``result`` is a proxy—its value is unknown until the kernel runs.
At runtime, ``cute.printf()`` prints the actual GPU-computed value.
**Example 2: Compile-time constants (both** ``a`` **and** ``b`` **are Constexpr)**
.. code-block:: python

   @cute.jit
   def add_constexpr(b: cutlass.Constexpr):
       a = 2.0
       result = a + b
       print("[meta-stage] result =", result)               # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_constexpr(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = 7.0
   [object-stage] result = 7.000000
Both values are known at compile time, so Python evaluates ``2.0 + 5.0 = 7.0``
during tracing. The constant is baked into the compiled kernel.
**Example 3: Hybrid (** ``a`` **is dynamic,** ``b`` **is Constexpr)**
.. code-block:: python

   @cute.jit
   def add_hybrid(b: cutlass.Constexpr):
       a = cutlass.Float32(2.0)
       result = a + b
       print("[meta-stage] result =", result)               # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_hybrid(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = <Float32 proxy>
   [object-stage] result = 7.000000
The constant ``b = 5.0`` is folded in, but since ``a`` is dynamic, the result
remains a proxy at meta-stage. The GPU computes the final answer at runtime.
Practical Implications
^^^^^^^^^^^^^^^^^^^^^^
* **Use** ``print()`` **to debug your meta-program** — inspect shapes, strides,
tile sizes, and compile-time decisions.
* **Constexpr parameters enable specialization** — the compiler can generate
tighter code when values are known at JIT time.
* **Dynamic parameters preserve generality** — a single compiled kernel can
handle varying input sizes without recompilation.
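The specialization-versus-generality trade-off can be mimicked in plain Python. This hypothetical sketch uses ``functools.lru_cache`` as a stand-in for the JIT cache: each distinct compile-time ``tile`` value produces its own cached function with the constant baked in, while the dynamic argument ``n`` varies freely without "recompilation".

```python
import functools

@functools.lru_cache(maxsize=None)
def specialize(tile: int):
    """Hypothetical stand-in for JIT specialization: each distinct compile-time
    `tile` value yields (and caches) its own function with the constant baked in."""
    def kernel(n: int) -> int:
        # `tile` acts like a Constexpr: a closure constant, not a runtime argument.
        return (n + tile - 1) // tile  # tiles needed to cover n elements
    return kernel

k128 = specialize(128)
print(k128(1000))               # 8
print(specialize(128) is k128)  # True: cached, no recompilation
print(k128(4096))               # 32 -- dynamic `n` varies without respecializing
```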
4. |DSL| Code-Generation Modes
------------------------------
CuTe's Python front-end combines the techniques above into **two mutually
exclusive modes** (see :ref:`fig-dsl-modes`), selectable with the ``preprocessor`` flag of the
``@jit`` decorator:
1. Tracing mode: ``@jit(preprocess=False)``, tracing only.
@@ -64,23 +272,3 @@ optimisation problems of pure tracing; tracing then fills in the arithmetic.
This hybrid “preprocessor” pipeline is unique to |DSL| and was designed
specifically to overcome the disadvantages identified above.
Why Tracing-Only Is Insufficient for Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* **Branch loss**: the untaken side of an ``if``/``else`` is never lowered.
* **Loop unrolling**: loops are flattened to the iteration count observed,
  destroying structure needed for parallel mapping and tiling.
* **Data-dependent paths**: control flow that depends on tensor values freezes
  to a single execution path at trace time.
The preprocessor mode fixes all of these by lowering control-flow first and delegating
only the arithmetic to the tracer.

Binary file not shown.


View File

@@ -5,20 +5,30 @@
Introduction
============
Overview
--------
|DSL| is a Python-based domain-specific language (DSL) designed for |DC| of
high-performance GPU kernels. It evolved from the C++ CUTLASS library and is
now available as a decorator-based DSL.
Its primary goals are:
- **Zero-cost abstraction**, thanks to the hybrid DSL approach.
- **Consistent with CuTe C++**, allowing users to express GPU kernels with full
control of the hardware.
- **JIT compilation** for both host and GPU execution.
- `DLPack <https://github.com/dmlc/dlpack>`_ **integration**, enabling seamless
interop with frameworks (e.g., PyTorch, JAX).
- **JIT caching**, so that repeated calls to the same function benefit from
cached |IR| modules.
- **Native types and type inference** to reduce boilerplate and improve
performance.
- **Optional lower-level control**, offering direct access to GPU backends or
specialized |IR| dialects.
Decorators
----------

View File

@@ -0,0 +1,28 @@
.. _talks_and_presentations:
.. |DSL| replace:: CuTe DSL
Talks and Presentations
=======================
This page collects talks, presentations, and other resources related to |DSL|
and CUTLASS Python infrastructure.
Conference Talks
----------------
**CuTeDSL: CUTLASS Python DSL Infrastructure** (*LLVM 2025*)
An introduction to the |DSL| architecture, covering the hybrid AST-rewrite and
tracing approach, MLIR code generation, and integration with CUTLASS.
* `LLVM Video <https://www.youtube.com/watch?v=5NXd6MbKYNQ>`_
* `Slides (PDF) <https://llvm.org/devmtg/2025-10/slides/technical_talks/ozen.pdf>`_
----
**Enable Tensor Core Programming in Python with CUTLASS 4.0** (*GTC 2025*)
Learn how to leverage Tensor Cores directly from Python using CUTLASS 4.0's
new DSL front-end, enabling rapid kernel development without writing CUDA C++.
* `GTC Video <https://www.nvidia.com/en-us/on-demand/session/gtc25-s74639/>`_

View File

@@ -105,4 +105,4 @@ You can:
- Propose support for additional data types or kernel variants
- Help prioritize roadmap features by upvoting GitHub issues
Thank you for helping shape the future of CUTLASS DSLs!