127 Commits

Author SHA1 Message Date
Kaining Zhong
12ff513cea [CuTeDSL] Add a render function hook to allow render layout natively (#3135)
* [CuTeDSL] Add a render function hook to allow render layout natively

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>

* nit

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>

---------

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
2026-06-26 15:14:55 -04:00
Junkai-Wu
8f50b052e1 Fix license. (#3328) 2026-06-22 22:07:29 -04:00
xiangg-nv
becfce08cd Enable tcgen05 blockscaled ops on Thor SM110 (#3283)
Edge-LLM NvFP4 MoE CuTeDSL kernels on Thor use tcgen05 blockscaled MMA and SMEM-to-TMEM scale-factor copies. The existing checks only admitted the SM100/SM103 paths, so source-built CuTeDSL rejected SM110.

Admit Thor's blockscaled MMA arch aliases sm_101a and sm_110a, and allow the SM110f family for S2T tcgen05 copy ops.

Validation:

- git diff --check

- python3 -m py_compile python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py

- DKG grouped_blockscaled_gemm.py documented 4-group example on Thor SM110: PASS

- Edge-LLM nvfp4_moe AOT for sm_110/aarch64: 12/12 variants PASS
2026-06-16 15:01:25 +08:00
Junkai-Wu
39b352fa93 v4.6 dev update. (#3315)
* v4.6 dev update.

* Remove CUTLASS_HOST_DEVICE from CudaHostAdapater::memsetDevice (#3286)

* [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280)

* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM

Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).

New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
  CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
  Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
  Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
  Supports F8F6F4 MMA with TMA loads for both A and B operands.

- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
  CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
  Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
  from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
  dispatch policy, produces correctly-typed CollectiveOp.

Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
  (they now route to sm120_array_mma_builder.inl) and drop the
  IsPtrArrayKernel static_assert that enforced the restriction

Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). Previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).

Closes #3263

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>

* test: add SM120 ptr-array grouped GEMM unit tests

Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.

Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com>
Co-authored-by: Tyler <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-15 23:23:20 -04:00
George Karpenkov
1732ed7da3 [CuTeDSL] Make @cute.struct instances flattenable across scf.if / scf.while (#3270)
Fixes https://github.com/NVIDIA/cutlass/issues/3268

A `@cute.struct` instance captured into an `scf.if` branch or `scf.while`
body fails the DSL trace with:

    DSLRuntimeError: The 'if' statement encountered a user-defined Python
    object, which cannot be automatically converted into an dynamic
    expression.

This blocks the natural warp-specialization pattern, where each
`if warp_idx == <role>:` branch reads its tile from a shared storage
struct.

A struct instance is fully described by its `base` pointer (already
DynamicExpression-aware via `_Pointer`); every field instance is
re-derived from `base + static offsets` on construction. Implement the
DynamicExpression protocol on each decorated class by forwarding
`__get_mlir_types__` / `__extract_mlir_values__` to `base`, and
`__new_from_mlir_values__` to a fresh decorator invocation that
re-derives the fields from a rebuilt base pointer.

Tested in Docker on cutlass-dsl 4.5.1 with six new unit tests in
test/python/CuTeDSL/test_struct_in_if.py covering:
  * the original failing case (storage.get_tensor inside dynamic if),
  * regression: plain non-branched struct usage still works,
  * nested struct (struct-of-struct) inside a dynamic if,
  * if/else with both branches accessing the struct,
  * if/elif/elif/else (the actual warp-specialization shape),
  * scf.while body capturing the struct.
2026-05-29 08:34:48 +08:00
Junkai-Wu
5c54bee12b v4.5.2 update. (#3264) 2026-05-26 22:32:26 -04:00
Junkai-Wu
982cb9e718 v4.5.1 update. (#3237) 2026-05-18 22:35:08 -04:00
Haicheng Wu
ef120d0d09 update to 4.5 (#3228) 2026-05-12 14:44:22 +08:00
TungtungQia
1d9e1f6d7a [CuTeDSL] Fix loop carried target scope (#3200)
* [CuTeDSL] Bug fix for scf.for's write_args analysis

* [CuTeDSL] Add for loop test
2026-05-11 16:02:26 +08:00
Junkai-Wu
cb37157db5 v4.5 tag update (#3202)
* Python DSL examples reorganization.

* v4.5 tag update.
2026-05-05 20:55:27 -04:00
dePaul Miller
b46b16d003 Small Tile N BlockScaled GEMM + Grouped GEMM (#3176)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-04-21 12:32:40 -04:00
Nandor Licker
ea46e277d2 Add absf and floor to cute.math (#3156)
The ops are already exposed by the underlying dialect.
2026-04-17 08:54:24 +08:00
Nandor Licker
3f3db08a0a Add support for empty dataclass arguments (#3152)
A dataclass with no fields exposed a bug in `extract_dataclass_members`:

```
@dataclass
class Dummy:
  pass
```

The type/return path was inconsistent. This PR fixes the function to
support empty dataclasses, which are useful in unions.
2026-04-17 08:47:47 +08:00
Junkai-Wu
a221da7ccf v4.5 dev update. (#3153) 2026-04-07 12:16:05 -04:00
Johnsonms
982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091)
Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.

Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
  per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
  block to avoid illegal @P0 R2UR on sm_90a

Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
  dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
2026-03-18 00:40:15 -04:00
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
dePaul Miller
73c59c055c Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Johnsonms
e5fcd125a5 [fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087)
Before this fix, combining two Boolean (i1) DSL values with Python `and`
triggered a verbose i1→i32→i1 round-trip in __dsl_and__:
  arith.extui  (×3), arith.select, arith.cmpi ne (×2) — 6 extra MLIR ops.

Add a fast path: when both operands are Boolean, delegate directly to
__and__, emitting a single arith.andi %a, %b : i1 — identical to `&`.

Both operators were already semantically equivalent; this fix makes the
generated MLIR identical as well.

Includes:
- repro_dsl_and_bool.py  — minimal standalone reproducer / bug-report script
- test_dsl_and_fix.py    — pytest tests verifying the fixed behaviour
2026-03-05 17:20:26 +08:00
David W.H. Swenson
49e54f2b23 fix: add_help=False in temporary parser (#2721) 2026-03-02 15:33:42 +08:00
drazi
b9847690c5 Merge pull request #3028 from SzymonOzog/patch-3
Add option to not suffix prints with new line
2026-02-28 10:11:05 +08:00
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Gabriel Wu
fc5bbc2dab Fix typo in cute.nvgpu.warpgroup.mma doc (#2548) 2026-02-27 16:22:55 +08:00
Haicheng Wu
954503d44c Bump version to 4.4.0 2026-02-25 00:04:04 -05:00
Haicheng Wu
6c4200f1bc Bump version from 4.3.5 to 4.4.0 2026-02-25 00:03:23 -05:00
Haicheng Wu
de93e8a4ac Bump version from 4.3.5 to 4.4.0 2026-02-25 00:03:04 -05:00
Haicheng Wu
b92b9f0d37 Bump version from 4.3.5 to 4.4.0 2026-02-25 00:02:41 -05:00
Yuan Xiaolan
395ab575f6 Merge branch 'main' into tvm-ffi 2026-02-14 13:35:28 +08:00
Junkai-Wu
d4bbf728ca v4.4 tag release update. (#3032) 2026-02-13 23:27:58 -05:00
Szymon Ożóg
beb80e04e1 Add option to not suffix prints with new line 2026-02-13 15:56:50 +01:00
drazi
01687cfba1 Merge pull request #3004 from tridao/add-sub-packed-f32x2
[CuTeDSL] Add sub_packed_f32x2 operation
2026-02-13 20:46:26 +08:00
drazi
5c42d0f28c Merge pull request #3021 from tridao/clc_no_multicast
[Cute-DSL] Add option for issue_clc_query without multicast
2026-02-13 20:45:52 +08:00
Tri Dao
244e8d00d5 [Cute-DSL] Add cute.arch.fmin by calling nvvm 2026-02-11 14:23:09 -05:00
Tri Dao
5b83b34afd [Cute-DSL] Add option for issue_clc_query without multicast 2026-02-11 14:19:29 -05:00
Tri Dao
51935551fb [CuTeDSL] Add sub_packed_f32x2 operation
Add subtraction operation for packed f32x2 values, following the same
pattern as the existing add_packed_f32x2 and mul_packed_f32x2 operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:18:46 +07:00
Junkai-Wu
6b3e607b85 v4.4 release update v2. (#2999) 2026-02-03 20:48:31 -05:00
yuanxiaolan
de161925a5 pass in stream=-1 2026-02-03 11:59:14 +08:00
yuanxiaolan
de198b2419 fix tvm-ffi path in from_dlpack 2026-02-03 11:59:13 +08:00
Xiao Song
acb45938e9 Update nvvm API call from nvvm enum to str (#2985) 2026-01-27 17:28:29 +08:00
Junkai-Wu
9fba3195f9 v4.4 update. (#2979) 2026-01-24 11:46:17 -05:00
Aidan Do
3f5bafb326 [Cutlass profiler] Fix SM100 FP8 nosmem epilogue shape_div “Divisibility Condition” for non‑multiple‑of‑64 N tiles (#2946)
* .

* .

* .

* .

* .

* .

* .
2026-01-20 15:27:34 +08:00
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Wenxuan Tan
f86feb0aa8 Fix idx2crd docstring (#2914)
* fix idx2crd docstring

* fix

* fix
2026-01-07 13:11:38 -05:00
Junkai-Wu
7f5fe3edf1 v4.3.4 update. (#2892) 2025-12-21 11:49:12 -05:00
Haicheng Wu
d4e16f5d4e Bump version from 4.2.1 to 4.3.3 2025-12-11 23:58:38 -05:00
Junkai-Wu
d3a5492381 v4.3.3 update. (#2868) 2025-12-11 00:26:58 -05:00
Haicheng Wu
c4744f706e Bump version from 4.2.1 to 4.3.2 2025-12-05 13:45:16 -05:00
Junkai-Wu
bc680c7f67 v4.3.2 update. (#2839) 2025-12-04 10:14:32 -05:00
Haicheng Wu
5e847d37c4 Bump version from 4.2.1 to 4.3.1 2025-12-01 22:13:19 -05:00
Haicheng Wu
f16068b4db Bump version from 4.2.0 to 4.3.1 2025-12-01 22:12:20 -05:00
Haicheng Wu
1acfe141af Bump version from 4.2.1 to 4.3.1 2025-12-01 22:11:13 -05:00