v4.4 tag release update. (#3032)

This commit is contained in:
Junkai-Wu
2026-02-14 12:27:58 +08:00
committed by GitHub
parent 01687cfba1
commit d4bbf728ca
140 changed files with 41624 additions and 3691 deletions

View File

@@ -20,3 +20,4 @@ CuTe DSL
Deprecation Policy <deprecation.rst>
Compile with TVM FFI <cute_dsl_general/compile_with_tvm_ffi.rst>
Ahead-of-Time (AOT) Compilation <cute_dsl_general/dsl_ahead_of_time_compilation.rst>
Talks and Presentations <cute_dsl_general/resources.rst>

View File

@@ -112,9 +112,6 @@ For compiled kernels, the generated PTX/CUBIN/IR can be accessed programmaticall
- ``__cubin__``: The generated CUBIN data of the compiled kernel.
- ``__mlir__``: The generated IR code of the compiled kernel.
These attributes are populated only when the corresponding ``CUTE_DSL_KEEP_*`` environment variable is enabled;
otherwise they return ``None``.
.. code:: python

   compiled_foo = cute.compile(foo, ...)

View File

@@ -236,4 +236,4 @@ For more information, see the section "Exporting Compiled Module" in :doc:`compi
The primary distinction is that, when TVM FFI is enabled, |DSL| generates a dedicated wrapper function on top of the underlying CuTe ABI. This wrapper adheres to the calling conventions defined by TVM FFI.
In contrast, the CuTe ABI entry function is specified directly in the generated header file, which affects how arguments must be provided.
For instance, with the TVM FFI wrapper function, users are able to pass in arguments such as ``torch.Tensor`` directly. However, when calling the CuTe ABI entry function, arguments should be provided as ``cute.Tensor`` types.

View File

@@ -7,11 +7,24 @@ End-to-End Code Generation
==========================
1. Hybrid DSL: Python Metaprogramming, Structured GPU Code
----------------------------------------------------------
|DSL| is a **hybrid DSL** that combines two compilation techniques: *AST rewrite*
and *tracing*. This combination gives you the best of both worlds:
* **Program structure is preserved** — control flow (loops, branches) is
captured via AST rewrite, compiling to proper structured code instead of
flattened traces.
* **Python stays Python** — arithmetic and tensor operations are captured via
tracing, so dynamic shapes, metaprogramming, and Python's rich expression
language work naturally.
To understand why this matters, let's look at each technique.
1.1 AST Rewrite
^^^^^^^^^^^^^^^
The function's abstract syntax tree is analysed **before** execution.
Python control-flow (``for``/``while``, ``if``/``else``) and built-ins are converted to structured |IR|
constructs. Computation inside each region is left untouched at this stage.
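The mechanics can be illustrated with Python's standard ``ast`` module. This is a simplified, hypothetical sketch (not |DSL|'s actual rewriter, and ``emit_loop`` is an invented callback name): each ``for`` loop is replaced by a nested function holding its body plus a call to a structure-capturing callback, while the arithmetic inside the body is left untouched.

```python
import ast
import textwrap

SOURCE = textwrap.dedent("""
    def scale(xs):
        out = []
        for i in range(len(xs)):
            out.append(xs[i] * 2)  # arithmetic: left untouched by the rewrite
        return out
    """)

class LoopRewriter(ast.NodeTransformer):
    """Replace `for <i> in <iter>: <body>` with a nested body function and a
    call to an `emit_loop` callback, so loop *structure* is captured
    explicitly instead of being executed away."""

    def visit_For(self, node):
        self.generic_visit(node)
        # Parse a template, then splice the original body and iterable into it.
        tmpl = ast.parse(
            f"def _loop_body({node.target.id}):\n"
            f"    pass\n"
            f"emit_loop(None, _loop_body)"
        ).body
        body_fn, call = tmpl
        body_fn.body = node.body          # original loop body, verbatim
        call.value.args[0] = node.iter    # original iterable expression
        return tmpl

def emit_loop(iterable, body):
    """Stand-in for the callback that would emit one structured IR loop node."""
    for i in iterable:
        body(i)

tree = LoopRewriter().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
ns = {"emit_loop": emit_loop}
exec(compile(tree, "<rewritten>", "exec"), ns)
print(ns["scale"]([1, 2, 3]))  # the rewritten loop still computes [2, 4, 6]
```

The rewritten function behaves identically, but the loop now passes through a single well-defined hook where a compiler could emit one structured loop node instead of running the iterations.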
@@ -47,11 +60,206 @@ trace that is lowered to |IR|.
* Data-dependent control-flow freezes to a single execution path.
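The branch-loss problem can be reproduced with a toy tracer. This is a hypothetical sketch (``TraceValue`` is an invented proxy class, not a |DSL| type): arithmetic on the proxy is recorded faithfully, but a Python-level branch collapses to whichever side the concrete flag selects.

```python
class TraceValue:
    """A hypothetical proxy value: every arithmetic op is appended to a trace."""
    def __init__(self, name, trace):
        self.name, self.trace = name, trace
    def __add__(self, other):
        self.trace.append(f"add({self.name}, {other})")
        return TraceValue(f"t{len(self.trace)}", self.trace)
    def __mul__(self, other):
        self.trace.append(f"mul({self.name}, {other})")
        return TraceValue(f"t{len(self.trace)}", self.trace)

def kernel(x, flag):
    # `flag` is a plain Python bool at trace time, so only one branch runs;
    # the other side is never seen by the tracer and is lost from the trace.
    if flag:
        return x + 1
    return x * 2

trace = []
kernel(TraceValue("x", trace), flag=True)
print(trace)  # ['add(x, 1)'] -- the `x * 2` branch was never recorded
```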
1.3 The Hybrid Solution
^^^^^^^^^^^^^^^^^^^^^^^
As shown above, neither technique alone is sufficient—but together they
complement each other perfectly.
**Why this works: GPU kernels are simple at runtime**
High-performance GPU kernels are structurally simple at runtime: they avoid
deep call hierarchies, complex branching, and dynamic dispatch. However,
*authoring* such kernels benefits greatly from Python's abstractions—classes,
metaprogramming, and polymorphic patterns improve readability and
maintainability.
The hybrid approach resolves this tension by evaluating Python abstractions at
compile time while emitting simple, optimized code for runtime execution.
**How |DSL| divides the work:**
1. **AST rewrite handles structure** — loops (``for``, ``while``) and branches
(``if``/``else``) are converted to structured |IR| *before* execution.
This solves tracing's control-flow problem.
2. **Tracing handles arithmetic** — inside each structured region, the tracer
records tensor operations exactly as they execute. No need to model Python's
complex semantics—just run Python and record what happens. This solves AST
rewriting's complexity problem.
The result:
* Loops compile to real loops, not unrolled traces.
* All branches are preserved, even if not taken during tracing.
* Dynamic shapes, metaprogramming, and Python idioms work naturally.
* The rewriter only needs to understand control flow, not all of Python.
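The division of labor can be sketched by combining the two toy mechanisms above: a structure callback (here the invented name ``structured_for``, standing in for what an AST rewrite would insert) emits one loop node, while a proxy value traces the arithmetic of the body exactly once.

```python
class Tracer:
    """Hypothetical op recorder shared by proxy values and structure callbacks."""
    def __init__(self):
        self.ops = []

class Val:
    """Hypothetical traced value with a symbolic name."""
    def __init__(self, name, tracer):
        self.name, self.tracer = name, tracer
    def __mul__(self, other):
        rhs = other.name if isinstance(other, Val) else other
        self.tracer.ops.append(f"mul({self.name}, {rhs})")
        return Val(f"t{len(self.tracer.ops)}", self.tracer)

def structured_for(tracer, n, body):
    """Callback a (hypothetical) AST rewrite inserts for `for i in range(n)`:
    the loop becomes one structured node, and the body is traced exactly once
    with a symbolic induction variable, not unrolled n times."""
    tracer.ops.append(f"for i in range({n}):")
    body(Val("i", tracer))
    tracer.ops.append("end_for")

tracer = Tracer()
x = Val("x", tracer)
# Source `for i in range(128): x * i` after the rewrite:
structured_for(tracer, 128, lambda i: x * i)
print(tracer.ops)  # ['for i in range(128):', 'mul(x, i)', 'end_for']
```

The trace stays three entries long no matter how large the trip count, which is exactly what pure tracing cannot achieve.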
2. |DSL| Compilation Flow: Meta-Stage to Object-Stage
------------------------------------------------------
|DSL| bridges Python and GPU hardware through a three-stage pipeline.
.. _fig-dsl-modes:
.. figure:: dsl_modes.png
   :width: 400
   :align: center

   *Left*: tracing mode records only the path that executed.
   *Right*: preprocessor mode emits structured |IR| for every branch and loop
   before tracing the arithmetic.

The default |DSL| compilation pipeline (mode 2): Python source flows through AST
preprocessing and interpreter-driven tracing to produce |IR|, which is then
lowered and compiled to device code.
**Stage 1: Pre-Staging (Python AST)**
Before any code executes, the AST preprocessor rewrites the decorated function.
It inserts *callbacks* around control-flow constructs—loops, branches, and
function boundaries—so that program structure is captured explicitly rather than
lost during execution.
**Stage 2: Meta-Stage (Python Interpreter)**
The rewritten function runs in the Python interpreter with proxy tensor
arguments. As execution proceeds:
* Callbacks fire at control-flow boundaries, emitting structured |IR| (loops,
branches, etc.).
* Tensor operations are traced: each operator invocation records the
corresponding operation.
* Compile-time constants are *partially evaluated*—values known at JIT time
fold directly into the |IR|, enabling aggressive specialization.
The result is a complete representation of the kernel, with both high-level
structure and low-level arithmetic intact.
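Partial evaluation falls out of ordinary Python execution during tracing. In this hypothetical sketch (``Dyn`` is an invented proxy class), plain Python numbers are compile-time values: they fold through normal arithmetic before any operation is recorded, so only ops touching dynamic values reach the trace.

```python
class Dyn:
    """Hypothetical dynamic (runtime-only) value; plain Python numbers act as
    compile-time constants and fold through ordinary arithmetic."""
    def __init__(self, name, ops):
        self.name, self.ops = name, ops
    def __add__(self, other):
        self.ops.append(f"add({self.name}, {other})")
        return Dyn(f"t{len(self.ops)}", self.ops)

def kernel(a, b, c):
    return a + (b + c)  # if b and c are constants, `b + c` folds before tracing

ops = []
# b and c are known at JIT time: Python evaluates 3 + 4 = 7 immediately,
# so the emitted trace contains a single add against the folded constant.
kernel(Dyn("a", ops), 3, 4)
print(ops)  # ['add(a, 7)']
```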
**Stage 3: Object-Stage (Compiler Backend)**
The internal representation passes through a lowering pipeline:
1. High-level operations are progressively lowered toward hardware-specific
representations.
2. Optimization passes (tiling, vectorization, memory promotion) reshape the
code for the target architecture.
3. The final code is translated to PTX/SASS (for NVIDIA GPUs) and assembled
into a device binary.
At runtime, the compiled kernel is loaded and launched on the accelerator.
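The lowering pipeline can be pictured as a sequence of passes, each rewriting the IR in place. This toy sketch uses a list of strings as "IR" and invented op and pass names; it only illustrates the progressive-lowering shape, not |DSL|'s actual passes.

```python
def lower_matmul(ops):
    """Lowering pass: expand a high-level op into hardware-shaped ops."""
    out = []
    for op in ops:
        if op == "matmul":
            out += ["load_tile_a", "load_tile_b", "mma"]
        else:
            out.append(op)
    return out

def vectorize(ops):
    """Optimization pass: widen tile loads to 4-element vector accesses."""
    return [op.replace("load_tile", "load_tile_vec4") for op in ops]

# Passes run in order; each consumes the previous pass's output.
PIPELINE = [lower_matmul, vectorize]

ir = ["matmul", "store"]
for pass_fn in PIPELINE:
    ir = pass_fn(ir)
print(ir)  # ['load_tile_vec4_a', 'load_tile_vec4_b', 'mma', 'store']
```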
3. Meta-Programming vs Runtime: Two Worlds in One Function
----------------------------------------------------------
A key insight for understanding |DSL| is that **your Python code runs twice**,
in two very different contexts:
1. **Meta-programming time (compilation)** — Python executes to *build* the
kernel. This happens on the host CPU when you call a ``@jit`` function.
2. **Runtime (execution)** — The compiled kernel runs on the GPU with actual
tensor data.
This distinction determines what you can observe and when.
``print()`` vs ``cute.printf()``: Meta-Stage vs Object-Stage Output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|DSL| provides two ways to print values, each operating at a different stage:
* **Python's** ``print()`` — executes during the **meta-stage** (compilation).
Use it to inspect what the compiler sees.
* ``cute.printf()`` — compiles into the kernel and executes at **runtime** on
the GPU. Use it to observe actual tensor values during execution.
The following examples demonstrate how the same ``result`` variable appears
differently depending on when and how you print it.
**Example 1: Dynamic variables (both** ``a`` **and** ``b`` **are runtime values)**
.. code-block:: python

   @cute.jit
   def add_dynamicexpr(b: cutlass.Float32):
       a = cutlass.Float32(2.0)
       result = a + b
       print("[meta-stage] result =", result)               # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_dynamicexpr(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = <Float32 proxy>
   [object-stage] result = 7.000000
At meta-stage, ``result`` is a proxy—its value is unknown until the kernel runs.
At runtime, ``cute.printf()`` prints the actual GPU-computed value.
**Example 2: Compile-time constants (both** ``a`` **and** ``b`` **are Constexpr)**
.. code-block:: python

   @cute.jit
   def add_constexpr(b: cutlass.Constexpr):
       a = 2.0
       result = a + b
       print("[meta-stage] result =", result)               # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_constexpr(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = 7.0
   [object-stage] result = 7.000000
Both values are known at compile time, so Python evaluates ``2.0 + 5.0 = 7.0``
during tracing. The constant is baked into the compiled kernel.
**Example 3: Hybrid (** ``a`` **is dynamic,** ``b`` **is Constexpr)**
.. code-block:: python

   @cute.jit
   def add_hybrid(b: cutlass.Constexpr):
       a = cutlass.Float32(2.0)
       result = a + b
       print("[meta-stage] result =", result)               # runs at compile time
       cute.printf("[object-stage] result = %f\n", result)  # runs on GPU

   add_hybrid(5.0)

.. code-block:: text

   $> python myprogram.py
   [meta-stage] result = <Float32 proxy>
   [object-stage] result = 7.000000
The constant ``b = 5.0`` is folded in, but since ``a`` is dynamic, the result
remains a proxy at meta-stage. The GPU computes the final answer at runtime.
Practical Implications
^^^^^^^^^^^^^^^^^^^^^^
* **Use** ``print()`` **to debug your meta-program** — inspect shapes, strides,
tile sizes, and compile-time decisions.
* **Constexpr parameters enable specialization** — the compiler can generate
tighter code when values are known at JIT time.
* **Dynamic parameters preserve generality** — a single compiled kernel can
handle varying input sizes without recompilation.
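The specialization-versus-generality trade-off can be mimicked in plain Python. This hypothetical sketch uses ``functools.lru_cache`` as a stand-in for the JIT cache: each distinct compile-time ``tile`` value produces its own cached function with the constant baked in, while the dynamic argument ``n`` varies freely without "recompilation".

```python
import functools

@functools.lru_cache(maxsize=None)
def specialize(tile: int):
    """Hypothetical stand-in for JIT specialization: each distinct compile-time
    `tile` value yields (and caches) its own function with the constant baked in."""
    def kernel(n: int) -> int:
        # `tile` acts like a Constexpr: a closure constant, not a runtime argument.
        return (n + tile - 1) // tile  # tiles needed to cover n elements
    return kernel

k128 = specialize(128)
print(k128(1000))               # 8
print(specialize(128) is k128)  # True: cached, no recompilation
print(k128(4096))               # 32 -- dynamic `n` varies without respecializing
```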
4. |DSL| Code-Generation Modes
------------------------------
CuTe's Python front-end combines the techniques above into **two mutually
exclusive modes** (see :ref:`fig-dsl-modes`), selectable with the ``preprocessor`` flag of the
``@jit`` decorator:
1. Tracing mode: ``@jit(preprocess=False)``, tracing only.
@@ -64,23 +272,3 @@ optimisation problems of pure tracing; tracing then fills in the arithmetic.
This hybrid “preprocessor” pipeline is unique to |DSL| and was designed
specifically to overcome the disadvantages identified above.
Why Tracing-Only Is Insufficient for Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* **Branch loss**: the untaken side of an ``if``/``else`` is never lowered.
* **Loop unrolling**: loops are flattened to the iteration count observed,
  destroying structure needed for parallel mapping and tiling.
* **Data-dependent paths**: control flow that depends on tensor values freezes
  to a single execution path at trace time.
The preprocessor mode fixes all of these by lowering control-flow first and delegating
only the arithmetic to the tracer.

Binary file not shown.


View File

@@ -5,20 +5,30 @@
Introduction
============
Overview
--------
|DSL| is a Python-based domain-specific language (DSL) designed for |DC| of
high-performance GPU kernels. It evolved from the C++ CUTLASS library and is
now available as a decorator-based DSL.
Its primary goals are:
- **Zero-cost abstraction**, thanks to the hybrid DSL approach.
- **Consistent with CuTe C++**, allowing users to express GPU kernels with full
control of the hardware.
- **JIT compilation** for both host and GPU execution.
- `DLPack <https://github.com/dmlc/dlpack>`_ **integration**, enabling seamless
interop with frameworks (e.g., PyTorch, JAX).
- **JIT caching**, so that repeated calls to the same function benefit from
cached |IR| modules.
- **Native types and type inference** to reduce boilerplate and improve
performance.
- **Optional lower-level control**, offering direct access to GPU backends or
specialized |IR| dialects.
Decorators
----------

View File

@@ -0,0 +1,28 @@
.. _talks_and_presentations:
.. |DSL| replace:: CuTe DSL
Talks and Presentations
=======================
This page collects talks, presentations, and other resources related to |DSL|
and CUTLASS Python infrastructure.
Conference Talks
----------------
**CuTeDSL: CUTLASS Python DSL Infrastructure** (*LLVM 2025*)
An introduction to the |DSL| architecture, covering the hybrid AST-rewrite and
tracing approach, MLIR code generation, and integration with CUTLASS.
* `LLVM Video <https://www.youtube.com/watch?v=5NXd6MbKYNQ>`_
* `Slides (PDF) <https://llvm.org/devmtg/2025-10/slides/technical_talks/ozen.pdf>`_
----
**Enable Tensor Core Programming in Python with CUTLASS 4.0** (*GTC 2025*)
Learn how to leverage Tensor Cores directly from Python using CUTLASS 4.0's
new DSL front-end, enabling rapid kernel development without writing CUDA C++.
* `GTC Video <https://www.nvidia.com/en-us/on-demand/session/gtc25-s74639/>`_

View File

@@ -105,4 +105,4 @@ You can:
- Propose support for additional data types or kernel variants
- Help prioritize roadmap features by upvoting GitHub issues
Thank you for helping shape the future of CUTLASS DSLs!