mirror of
https://github.com/NVIDIA/cutlass.git
synced 2026-05-12 09:15:56 +00:00
v4.4 tag release update. (#3032)
This commit is contained in:
@@ -20,3 +20,4 @@ CuTe DSL
   Deprecation Policy <deprecation.rst>
   Compile with TVM FFI <cute_dsl_general/compile_with_tvm_ffi.rst>
   Ahead-of-Time (AOT) Compilation <cute_dsl_general/dsl_ahead_of_time_compilation.rst>
   Talks and Presentations <cute_dsl_general/resources.rst>
@@ -112,9 +112,6 @@ For compiled kernels, the generated PTX/CUBIN/IR can be accessed programmatically
- ``__cubin__``: The generated CUBIN data of the compiled kernel.
- ``__mlir__``: The generated IR code of the compiled kernel.

These attributes are populated only when the corresponding ``CUTE_DSL_KEEP_*`` environment variable is enabled;
otherwise they return ``None``.

.. code:: python

   compiled_foo = cute.compile(foo, ...)

   # Each attribute is None unless the corresponding CUTE_DSL_KEEP_*
   # environment variable was set before compilation.
   cubin = compiled_foo.__cubin__
   mlir = compiled_foo.__mlir__
@@ -236,4 +236,4 @@ For more information, see the section "Exporting Compiled Module" in :doc:`compi
The primary distinction is that, when TVM FFI is enabled, |DSL| generates a dedicated wrapper function on top of the underlying CuTe ABI. This wrapper adheres to the calling conventions defined by TVM FFI.
In contrast, the CuTe ABI entry function is specified directly in the generated header file, which affects how arguments must be provided.

For instance, with the TVM FFI wrapper function, users can pass arguments such as ``torch.Tensor`` directly. However, when calling the CuTe ABI entry function, arguments must be provided as ``cute.Tensor`` types.
@@ -7,11 +7,24 @@ End-to-End Code Generation
==========================

1. Hybrid DSL: Python Metaprogramming, Structured GPU Code
----------------------------------------------------------
|DSL| is a **hybrid DSL** that combines two compilation techniques: *AST rewrite*
and *tracing*. This combination gives you the best of both worlds:

* **Program structure is preserved** — control flow (loops, branches) is
  captured via AST rewrite, compiling to proper structured code instead of
  flattened traces.
* **Python stays Python** — arithmetic and tensor operations are captured via
  tracing, so dynamic shapes, metaprogramming, and Python's rich expression
  language work naturally.

To understand why this matters, let's look at each technique.

1.1 AST Rewrite
^^^^^^^^^^^^^^^

The function’s abstract-syntax tree is analysed **before** execution.
Python control-flow (``for``/``while``, ``if``/``else``) and built-ins are converted to structured |IR|
constructs. Computation inside each region is left untouched at this stage.
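The idea can be illustrated with Python's standard-library ``ast`` module (a simplified sketch, not the actual |DSL| preprocessor): loops and branches are visible as nodes in the syntax tree before the function ever executes, which is exactly the information an AST rewriter lowers into structured |IR| regions.

```python
import ast
import textwrap

src = textwrap.dedent("""
    def kernel(n):
        acc = 0
        for i in range(n):
            if i % 2 == 0:
                acc += i
        return acc
""")

tree = ast.parse(src)
# Every loop and branch is a discoverable AST node before execution;
# a preprocessor can rewrite these into structured constructs.
structure = [type(node).__name__ for node in ast.walk(tree)
             if isinstance(node, (ast.For, ast.While, ast.If))]
print(structure)  # ['For', 'If']
```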
@@ -47,11 +60,206 @@ trace that is lowered to |IR|.

* Data-dependent control-flow freezes to a single execution path.
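A toy tracer makes the branch-loss drawback concrete (an illustrative sketch; ``Proxy`` is not a |DSL| API). Because Python's ``if`` demands a concrete boolean, the tracer must commit to one path, and the other branch never enters the trace:

```python
class Proxy:
    """Records arithmetic symbolically instead of computing it."""
    def __init__(self, expr):
        self.expr = expr
    def __add__(self, other):
        rhs = other.expr if isinstance(other, Proxy) else repr(other)
        return Proxy(f"({self.expr} + {rhs})")
    def __gt__(self, other):
        # Python's `if` needs a real bool, so the tracer must pick a path.
        return True

def f(x):
    if x > 0:
        return x + 1   # traced
    else:
        return x + 2   # silently dropped from the trace

print(f(Proxy("x")).expr)  # (x + 1)
```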
1.3 The Hybrid Solution
^^^^^^^^^^^^^^^^^^^^^^^

As shown above, neither technique alone is sufficient—but together they
complement each other perfectly.

**Why this works: GPU kernels are simple at runtime**

High-performance GPU kernels are structurally simple at runtime: they avoid
deep call hierarchies, complex branching, and dynamic dispatch. However,
*authoring* such kernels benefits greatly from Python's abstractions—classes,
metaprogramming, and polymorphic patterns improve readability and
maintainability.

The hybrid approach resolves this tension by evaluating Python abstractions at
compile time while emitting simple, optimized code for runtime execution.

**How |DSL| divides the work:**

1. **AST rewrite handles structure** — loops (``for``, ``while``) and branches
   (``if``/``else``) are converted to structured |IR| *before* execution.
   This solves tracing's control-flow problem.

2. **Tracing handles arithmetic** — inside each structured region, the tracer
   records tensor operations exactly as they execute. No need to model Python's
   complex semantics—just run Python and record what happens. This solves AST
   rewriting's complexity problem.

The result:

* Loops compile to real loops, not unrolled traces.
* All branches are preserved, even if not taken during tracing.
* Dynamic shapes, metaprogramming, and Python idioms work naturally.
* The rewriter only needs to understand control flow, not all of Python.
2. |DSL| Compilation Flow: Meta-Stage to Object-Stage
-----------------------------------------------------

|DSL| bridges Python and GPU hardware through a three-stage pipeline.

.. _fig-dsl-modes:

.. figure:: dsl_modes.png
   :width: 400
   :align: center

   *Left*: tracing mode records only the path that executed.
   *Right*: preprocessor mode emits structured |IR| for every branch and loop
   before tracing the arithmetic.

The default |DSL| compilation pipeline (mode 2): Python source flows through AST preprocessing
and interpreter-driven tracing to produce |IR|, which is then lowered and
compiled to device code.
**Stage 1: Pre-Staging (Python AST)**

Before any code executes, the AST preprocessor rewrites the decorated function.
It inserts *callbacks* around control-flow constructs—loops, branches, and
function boundaries—so that program structure is captured explicitly rather than
lost during execution.
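Conceptually, the rewrite turns implicit Python control flow into explicit region callbacks. The following is a hand-written sketch with made-up names (``emit_for_region`` is not a |DSL| hook); the real preprocessor emits different callbacks, but the shape of the transformation is the same:

```python
def kernel_original(n):
    acc = 0
    for i in range(n):
        acc = acc + i
    return acc

def emit_for_region(iterable, body, state):
    # A real compiler would emit a structured IR loop here and trace
    # `body` once with proxy values; this sketch simply executes it.
    for i in iterable:
        state = body(i, state)
    return state

def kernel_rewritten(n):
    # The loop is now an explicit call the compiler can intercept.
    acc = 0
    acc = emit_for_region(range(n), lambda i, acc: acc + i, acc)
    return acc

print(kernel_rewritten(5) == kernel_original(5))  # True
```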
**Stage 2: Meta-Stage (Python Interpreter)**

The rewritten function runs in the Python interpreter with proxy tensor
arguments. As execution proceeds:

* Callbacks fire at control-flow boundaries, emitting structured |IR| (loops,
  branches, etc.).
* Tensor operations are traced: each operator invocation records the
  corresponding operation.
* Compile-time constants are *partially evaluated*—values known at JIT time
  fold directly into the |IR|, enabling aggressive specialization.

The result is a complete representation of the kernel, with both high-level
structure and low-level arithmetic intact.
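Partial evaluation can be sketched with a toy symbolic proxy (illustrative only; not a |DSL| type): arithmetic between plain Python numbers evaluates immediately and folds into the recorded expression, while anything touching the proxy defers to the trace:

```python
class Proxy:
    def __init__(self, expr):
        self.expr = expr
    def __add__(self, other):
        rhs = other.expr if isinstance(other, Proxy) else repr(other)
        return Proxy(f"({self.expr} + {rhs})")

def build(x, a, b):
    # `a + b` is ordinary int addition, evaluated right now (folded);
    # only the proxy-tainted addition is recorded in the trace.
    return x + (a + b)

print(build(Proxy("x"), 2, 5).expr)  # (x + 7)
```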
**Stage 3: Object-Stage (Compiler Backend)**

The internal representation passes through a lowering pipeline:

1. High-level operations are progressively lowered toward hardware-specific
   representations.
2. Optimization passes (tiling, vectorization, memory promotion) reshape the
   code for the target architecture.
3. The final code is translated to PTX/SASS (for NVIDIA GPUs) and assembled
   into a device binary.

At runtime, the compiled kernel is loaded and launched on the accelerator.
3. Meta-Programming vs Runtime: Two Worlds in One Function
----------------------------------------------------------

A key insight for understanding |DSL| is that **your Python code runs twice**,
in two very different contexts:

1. **Meta-programming time (compilation)** — Python executes to *build* the
   kernel. This happens on the host CPU when you call a ``@jit`` function.
2. **Runtime (execution)** — The compiled kernel runs on the GPU with actual
   tensor data.

This distinction determines what you can observe and when.
``print()`` vs ``cute.printf()``: Meta-Stage vs Object-Stage Output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

|DSL| provides two ways to print values, each operating at a different stage:

* **Python's** ``print()`` — executes during the **meta-stage** (compilation).
  Use it to inspect what the compiler sees.
* ``cute.printf()`` — compiles into the kernel and executes at **runtime** on
  the GPU. Use it to observe actual tensor values during execution.

The following examples demonstrate how the same ``result`` variable appears
differently depending on when and how you print it.
**Example 1: Dynamic variables (both** ``a`` **and** ``b`` **are runtime values)**

.. code-block:: python

   @cute.jit
   def add_dynamicexpr(b: cutlass.Float32):
       a = cutlass.Float32(2.0)
       result = a + b
       print("[meta-stage] result =", result)  # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_dynamicexpr(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = <Float32 proxy>
   [object-stage] result = 7.000000

At meta-stage, ``result`` is a proxy—its value is unknown until the kernel runs.
At runtime, ``cute.printf()`` prints the actual GPU-computed value.
**Example 2: Compile-time constants (both** ``a`` **and** ``b`` **are Constexpr)**

.. code-block:: python

   @cute.jit
   def add_constexpr(b: cutlass.Constexpr):
       a = 2.0
       result = a + b
       print("[meta-stage] result =", result)  # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_constexpr(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = 7.0
   [object-stage] result = 7.000000

Both values are known at compile time, so Python evaluates ``2.0 + 5.0 = 7.0``
during tracing. The constant is baked into the compiled kernel.
**Example 3: Hybrid (** ``a`` **is dynamic,** ``b`` **is Constexpr)**

.. code-block:: python

   @cute.jit
   def add_hybrid(b: cutlass.Constexpr):
       a = cutlass.Float32(2.0)
       result = a + b
       print("[meta-stage] result =", result)  # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_hybrid(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = <Float32 proxy>
   [object-stage] result = 7.000000

The constant ``b = 5.0`` is folded in, but since ``a`` is dynamic, the result
remains a proxy at meta-stage. The GPU computes the final answer at runtime.
Practical Implications
^^^^^^^^^^^^^^^^^^^^^^

* **Use** ``print()`` **to debug your meta-program** — inspect shapes, strides,
  tile sizes, and compile-time decisions.
* **Constexpr parameters enable specialization** — the compiler can generate
  tighter code when values are known at JIT time.
* **Dynamic parameters preserve generality** — a single compiled kernel can
  handle varying input sizes without recompilation.
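The specialization-plus-caching pattern can be mimicked in plain Python (a sketch under stated assumptions; ``build_kernel`` is a hypothetical name, not a |DSL| API): the compile-time value is baked into the generated function, and repeated requests with the same value hit the cache instead of rebuilding:

```python
import functools

@functools.lru_cache(maxsize=None)
def build_kernel(tile: int):
    # `tile` plays the role of a Constexpr: it is fixed when the
    # kernel is built, so the body specializes on it.
    def kernel(xs):
        return [x * tile for x in xs]
    return kernel

k8 = build_kernel(8)
print(k8([1, 2, 3]))           # [8, 16, 24]
print(build_kernel(8) is k8)   # True: cached, no rebuild
```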
4. |DSL| Code-Generation Modes
------------------------------

CuTe's Python front-end combines the techniques above into **two mutually
exclusive modes** (see :ref:`fig-dsl-modes`), selectable with the ``preprocess`` flag of the
``@jit`` decorator:

1. Tracing mode ``@jit(preprocess=False)`` – tracing only.
@@ -64,23 +272,3 @@ optimisation problems of pure tracing; tracing then fills in the arithmetic.
This hybrid “preprocessor” pipeline is unique to |DSL| and was designed
specifically to overcome the disadvantages identified above.

Why Tracing-Only Is Insufficient for Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Branch loss** – The untaken side of an ``if``/``else`` is never lowered.
* **Loop unrolling** – Loops are flattened to the iteration count observed,
  destroying structure needed for parallel mapping and tiling.
* **Data-dependent paths** – Control-flow that depends on tensor values freezes
  to a single execution path at trace time.

The preprocessor mode fixes all of these by lowering control-flow first and delegating
only the arithmetic to the tracer.
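The loop-unrolling drawback is easy to demonstrate with a toy symbolic tracer (illustrative only; ``Proxy`` is not a |DSL| type): the trace contains one recorded add per observed iteration, and the loop structure itself is gone:

```python
class Proxy:
    def __init__(self, expr):
        self.expr = expr
    def __add__(self, other):
        rhs = other.expr if isinstance(other, Proxy) else repr(other)
        return Proxy(f"({self.expr} + {rhs})")

def kernel(x, n):
    acc = x
    for i in range(n):      # `n` must be concrete at trace time
        acc = acc + i
    return acc

# Three iterations flatten into three recorded adds, not a loop:
print(kernel(Proxy("x"), 3).expr)  # (((x + 0) + 1) + 2)
```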
BIN
media/docs/pythonDSL/cute_dsl_general/dsl_compilation.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 282 KiB
@@ -5,20 +5,30 @@

Introduction
============

Overview
--------

|DSL| is a Python-based domain-specific language (DSL) designed for |DC| of
high-performance GPU kernels. It evolved from the C++ CUTLASS library and is
now available as a decorator-based DSL.

Its primary goals are:

- **Zero-cost abstraction**, achieved through the hybrid DSL approach.
- **Consistent with CuTe C++**, allowing users to express GPU kernels with full
  control of the hardware.
- **JIT compilation** for both host and GPU execution.
- `DLPack <https://github.com/dmlc/dlpack>`_ **integration**, enabling seamless
  interop with frameworks (e.g., PyTorch, JAX).
- **JIT caching**, so that repeated calls to the same function benefit from
  cached |IR| modules.
- **Native types and type inference** to reduce boilerplate and improve
  performance.
- **Optional lower-level control**, offering direct access to GPU backends or
  specialized |IR| dialects.

Decorators
----------
28
media/docs/pythonDSL/cute_dsl_general/resources.rst
Normal file
@@ -0,0 +1,28 @@
.. _talks_and_presentations:
.. |DSL| replace:: CuTe DSL

Talks and Presentations
=======================

This page collects talks, presentations, and other resources related to |DSL|
and CUTLASS Python infrastructure.

Conference Talks
----------------

**CuTeDSL: CUTLASS Python DSL Infrastructure** — *LLVM 2025*

An introduction to the |DSL| architecture, covering the hybrid AST-rewrite and
tracing approach, MLIR code generation, and integration with CUTLASS.

* `LLVM Video <https://www.youtube.com/watch?v=5NXd6MbKYNQ>`_
* `Slides (PDF) <https://llvm.org/devmtg/2025-10/slides/technical_talks/ozen.pdf>`_

----

**Enable Tensor Core Programming in Python with CUTLASS 4.0** — *GTC 2025*

Learn how to leverage Tensor Cores directly from Python using CUTLASS 4.0's
new DSL front-end, enabling rapid kernel development without writing CUDA C++.

* `GTC Video <https://www.nvidia.com/en-us/on-demand/session/gtc25-s74639/>`_
@@ -105,4 +105,4 @@ You can:
- Propose support for additional data types or kernel variants
- Help prioritize roadmap features by upvoting GitHub issues

Thank you for helping shape the future of CUTLASS DSLs!