Files
2026-06-22 22:07:29 -04:00
..
2026-06-22 22:07:29 -04:00
2026-06-22 22:07:29 -04:00
2026-06-01 08:58:58 +08:00
2026-06-22 22:07:29 -04:00
2026-05-05 20:55:27 -04:00
2026-05-05 20:55:27 -04:00
2026-05-05 20:55:27 -04:00
2026-06-22 22:07:29 -04:00
2026-05-05 20:55:27 -04:00
2026-05-05 20:55:27 -04:00
2026-05-05 20:55:27 -04:00
2026-06-15 23:23:20 -04:00
2026-05-05 20:55:27 -04:00
2026-05-05 20:55:27 -04:00

DSL Feature Examples

This directory demonstrates CuTe DSL capabilities beyond kernel authoring itself: exporting compiled kernels for deployment, integrating with ML frameworks, using foreign function interfaces, and accessing low-level DSL features like inline PTX and shared memory allocation.


Directory Structure

dsl/
  export/                        Exporting kernels to C shared libraries
    export_to_c.py                 Compile a kernel and export as .so/.dylib
    load_in_python.py              Load and call the exported library from Python
    run_with_dynamic_loading.cpp   C++ driver using dlopen
    run_with_dynamic_loading.sh    Build/run script for dynamic loading
    run_with_static_linking.cpp    C++ driver using static linking
    run_with_static_linking.sh     Build/run script for static linking
  ffi/                           Foreign function interface
    jit_argument.py                JIT compilation with argument passing
    tensor.cpp                     C++ tensor interop implementation
    CMakeLists.txt                 CMake build for FFI examples
  jax/                           JAX integration
    cutlass_call_basic.py          Basic CUTLASS kernel call from JAX
    cutlass_call_export.py         Export a CUTLASS kernel for JAX
    cutlass_call_sharding.py       Multi-device sharding with CUTLASS kernels
    elementwise_apply_example.py   Elementwise apply via JAX
  tvm_ffi/                       TVM FFI integration
    jit_and_use_in_torch.py        JIT compile and call from PyTorch
    jit_and_use_in_jax.py          JIT compile and call from JAX
    aot_export.py                  Ahead-of-time export
    aot_use_in_torch.py            Use AOT-exported kernel in PyTorch
    aot_use_in_jax.py              Use AOT-exported kernel in JAX
    aot_use_in_cpp_bundle.cpp      Use AOT-exported kernel in C++
    aot_use_in_cpp_bundle.sh       Build/run script for C++ AOT usage
    compile_with_fake_tensor.py    Compile using fake tensors
    compile_with_symint_arg.py     Compile with symbolic integer arguments
    ampere_gemm_with_fake_tensor.py  Ampere GEMM with fake tensor compilation
    error_reporting.py             Error reporting and diagnostics
  call_bypass_dlpack.py          Calling kernels bypassing DLPack
  call_from_jit.py               Calling conventions from JIT-compiled code
  cooperative_launch.py          Cooperative kernel launch (multi-CTA)
  dynamic_smem_size.py           Dynamic shared memory allocation
  inline_ptx.py                  Embedding inline PTX assembly
  launch_completion_and_programmatic_events.py
                                 Launch completion / programmatic events with cudaEvent_t and CUevent
  pointer.py                     Pointer manipulation in DSL
  print_latex.py                 LaTeX rendering of CuTe layouts
  programmatic_dependent_launch.py  Programmatic dependent launch (PDL)
  smem_allocator.py              Shared memory allocator usage
  torch_fake_tensor.py           PyTorch fake tensor integration
  torch_fp4.py                   PyTorch FP4 tensor support

Subdirectory Guides

export/ -- Kernel Export

Shows how to compile a CuTe DSL kernel into a standalone C shared library (.so) that can be loaded and called from C++ or Python without any CuTe DSL dependency at runtime. Includes complete examples for both dynamic loading (dlopen) and static linking workflows.

ffi/ -- Foreign Function Interface

Demonstrates how to pass arguments between Python/CuTe DSL and C++ code using the FFI layer. Useful for integrating CuTe DSL kernels into existing C++ applications.

jax/ -- JAX Integration

Shows how to call CuTe DSL kernels from JAX using cutlass_call, including basic invocation, kernel export for JAX, multi-device sharding, and elementwise application patterns.

tvm_ffi/ -- TVM FFI Integration

Comprehensive examples for using CuTe DSL kernels through TVM's foreign function interface. Covers both JIT and AOT (ahead-of-time) compilation workflows, with usage examples for PyTorch, JAX, and C++. Also demonstrates fake-tensor compilation (no GPU required at compile time) and symbolic integer arguments.


Top-Level Files

The top-level Python files demonstrate individual DSL features:

  • call_bypass_dlpack.py / call_from_jit.py -- Kernel calling conventions
  • inline_ptx.py -- Embedding inline PTX assembly in CuTe DSL kernels
  • launch_completion_and_programmatic_events.py -- Examples of launch_completion_event and programmatic_event launch attributes, using events created via torch.cuda.Event(enable_timing=False) and presented as either cudaEvent_t (cuda.bindings.runtime) or CUevent (cuda.bindings.driver). The stream is passed as a cudaStream_t (cuda.bindings.runtime)
  • programmatic_dependent_launch.py -- Programmatic dependent launch for chaining kernels with data dependencies
  • cooperative_launch.py -- Cooperative launch for multi-CTA kernels
  • dynamic_smem_size.py / smem_allocator.py -- Shared memory allocation
  • torch_fake_tensor.py / torch_fp4.py -- PyTorch integration features
  • pointer.py -- Pointer manipulation within DSL kernels
  • print_latex.py -- Render CuTe layouts as LaTeX for visualization