This directory demonstrates CuTe DSL capabilities beyond kernel authoring itself: exporting compiled kernels for deployment, integrating with ML frameworks, using foreign function interfaces, and accessing low-level DSL features like inline PTX and shared memory allocation.

Directory Structure

dsl/
  export/                        Exporting kernels to C shared libraries
    export_to_c.py                 Compile a kernel and export as .so/.dylib
    load_in_python.py              Load and call the exported library from Python
    run_with_dynamic_loading.cpp   C++ driver using dlopen
    run_with_dynamic_loading.sh    Build/run script for dynamic loading
    run_with_static_linking.cpp    C++ driver using static linking
    run_with_static_linking.sh     Build/run script for static linking
  ffi/                           Foreign function interface
    jit_argument.py                JIT compilation with argument passing
    tensor.cpp                     C++ tensor interop implementation
    CMakeLists.txt                 CMake build for FFI examples
  jax/                           JAX integration
    cutlass_call_basic.py          Basic CUTLASS kernel call from JAX
    cutlass_call_export.py         Export a CUTLASS kernel for JAX
    cutlass_call_sharding.py       Multi-device sharding with CUTLASS kernels
    elementwise_apply_example.py   Elementwise apply via JAX
  tvm_ffi/                       TVM FFI integration
    jit_and_use_in_torch.py        JIT compile and call from PyTorch
    jit_and_use_in_jax.py          JIT compile and call from JAX
    aot_export.py                  Ahead-of-time export
    aot_use_in_torch.py            Use AOT-exported kernel in PyTorch
    aot_use_in_jax.py              Use AOT-exported kernel in JAX
    aot_use_in_cpp_bundle.cpp      Use AOT-exported kernel in C++
    aot_use_in_cpp_bundle.sh       Build/run script for C++ AOT usage
    compile_with_fake_tensor.py    Compile using fake tensors
    compile_with_symint_arg.py     Compile with symbolic integer arguments
    ampere_gemm_with_fake_tensor.py  Ampere GEMM with fake tensor compilation
    error_reporting.py             Error reporting and diagnostics
  call_bypass_dlpack.py          Calling kernels bypassing DLPack
  call_from_jit.py               Calling conventions from JIT-compiled code
  cooperative_launch.py          Cooperative kernel launch (multi-CTA)
  dynamic_smem_size.py           Dynamic shared memory allocation
  inline_ptx.py                  Embedding inline PTX assembly
  launch_completion_and_programmatic_events.py
                                 Launch completion / programmatic events with cudaEvent_t and CUevent
  pointer.py                     Pointer manipulation in DSL
  print_latex.py                 LaTeX rendering of CuTe layouts
  programmatic_dependent_launch.py  Programmatic dependent launch (PDL)
  smem_allocator.py              Shared memory allocator usage
  torch_fake_tensor.py           PyTorch fake tensor integration
  torch_fp4.py                   PyTorch FP4 tensor support

Subdirectory Guides

`export/` -- Kernel Export

Shows how to compile a CuTe DSL kernel into a standalone C shared library (.so) that can be loaded and called from C++ or Python without any CuTe DSL dependency at runtime. Includes complete examples for both dynamic loading (dlopen) and static linking workflows.

`ffi/` -- Foreign Function Interface

Demonstrates how to pass arguments between Python/CuTe DSL and C++ code using the FFI layer. Useful for integrating CuTe DSL kernels into existing C++ applications.

`jax/` -- JAX Integration

Shows how to call CuTe DSL kernels from JAX using cutlass_call, including basic invocation, kernel export for JAX, multi-device sharding, and elementwise application patterns.

`tvm_ffi/` -- TVM FFI Integration

Comprehensive examples for using CuTe DSL kernels through TVM's foreign function interface. Covers both JIT and AOT (ahead-of-time) compilation workflows, with usage examples for PyTorch, JAX, and C++. Also demonstrates fake-tensor compilation (no GPU required at compile time) and symbolic integer arguments.

Top-Level Files

The top-level Python files demonstrate individual DSL features:

call_bypass_dlpack.py / call_from_jit.py -- Kernel calling conventions
inline_ptx.py -- Embedding inline PTX assembly in CuTe DSL kernels
launch_completion_and_programmatic_events.py -- Examples of launch_completion_event and programmatic_event launch attributes, using events created via torch.cuda.Event(enable_timing=False) and presented as either cudaEvent_t (cuda.bindings.runtime) or CUevent (cuda.bindings.driver). The stream is passed as a cudaStream_t (cuda.bindings.runtime)
programmatic_dependent_launch.py -- Programmatic dependent launch for chaining kernels with data dependencies
cooperative_launch.py -- Cooperative launch for multi-CTA kernels
dynamic_smem_size.py / smem_allocator.py -- Shared memory allocation
torch_fake_tensor.py / torch_fp4.py -- PyTorch integration features
pointer.py -- Pointer manipulation within DSL kernels
print_latex.py -- Render CuTe layouts as LaTeX for visualization

README.md

DSL Feature Examples

Directory Structure

Subdirectory Guides

export/ -- Kernel Export

ffi/ -- Foreign Function Interface

jax/ -- JAX Integration

tvm_ffi/ -- TVM FFI Integration

Top-Level Files

`export/` -- Kernel Export

`ffi/` -- Foreign Function Interface

`jax/` -- JAX Integration

`tvm_ffi/` -- TVM FFI Integration