Commit Graph

810 Commits

Author SHA1 Message Date
yuanxiaolan
de161925a5 pass in stream=-1 2026-02-03 11:59:14 +08:00
yuanxiaolan
de198b2419 fix tvm-ffi path in from_dlpack 2026-02-03 11:59:13 +08:00
Hua Huang
1cfbb53a23 [CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995)
* Fix: SM100 block-scale gemm overlapping accumulator

Signed-off-by: Hua Huang <huah@nvidia.com>

* Also include threads_per_warp fix

Signed-off-by: Hua Huang <huah@nvidia.com>

---------

Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6 fix performance issues in cute-dsl examples for 4.4-ctk13.1 release (#2988)
* fix grouped gemm

* fix mixed input gemm

* fix mixed input grouped gemm

* fix version checking

* use advanced compiler options

* fix comment

* rename advanced compiler configs to advanced compiler control

* fix comment

* fix name

* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300 fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990)
* fix regression with cu13.1

* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9 Update nvvm API call from nvvm enum to str (#2985) 2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776 update api usage (#2969) 2026-01-27 15:33:22 +08:00
drazi
51f82812ec Merge pull request #2891 from ColinPeppler/main
docs: note when DSL dumps are populated
2026-01-26 17:38:27 -08:00
Junkai-Wu
9fba3195f9 v4.4 update. (#2979) 2026-01-24 11:46:17 -05:00
Qi Yuhang
2fafefb7b9 [Bug Fix]Set NumSplitsM to 1 when TileShapeM < 128 in sm90 fp8 blockwise scaling CollectiveMma (#2965)
* Fix NumSplitsM when TileShapeM < 128.

* Use cute::conditional_t to replace std::conditional_t.
2026-01-23 15:56:52 +08:00
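Note on 2fafefb7b9: the clamp described in this commit can be sketched at compile time roughly as follows. This is only an illustration, not the CUTLASS code: `SplitsM` is a hypothetical name, `std::conditional_t` and `std::integral_constant` stand in for `cute::conditional_t` and CuTe's integral types so that the snippet compiles without CUTLASS, and the "one split per 128 rows" rule for larger tiles is an assumption that the commit message does not state.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the split computation in the collective.
// Assumption (not from the commit): large tiles use one split per 128 rows.
template <int TileShapeM>
struct SplitsM {
  // For TileShapeM < 128 the integer division below would yield 0,
  // so the small-tile branch pins the result to 1 instead.
  using type = std::conditional_t<
      (TileShapeM < 128),
      std::integral_constant<int, 1>,
      std::integral_constant<int, TileShapeM / 128>>;
  static constexpr int value = type::value;
};

// Compile-time checks of the sketch.
static_assert(SplitsM<64>::value == 1, "small tiles use a single split");
static_assert(SplitsM<128>::value == 1, "128 rows form exactly one split");
static_assert(SplitsM<256>::value == 2, "256 rows form two splits");
```

Selecting between two types with a conditional alias keeps the result a compile-time constant, which is presumably why the commit only needed to swap `std::conditional_t` for `cute::conditional_t` rather than restructure the code.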
Johnsonms
0edaa6e47d Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial (#2945) 2026-01-23 12:54:12 +08:00
Aidan Do
431d070fcb [docs] Add additional tip for generating fewer kernels in blockwise (#2940)
- Running without this generates a lot of kernels
- Clarified CMake configuration for selecting GEMM kernels and added details on kernel generation granularity.
2026-01-23 12:53:51 +08:00
Qi Yuhang
667446a9dd [Doc]Fix Mode Name and Stride in 0t_mma_atom.md (#2910)
* Fix wrong stride in 0t_mma_atom.md

* Fix mode name.

* Update media/docs/cpp/cute/0t_mma_atom.md

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

---------

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2026-01-23 12:53:30 +08:00
Aidan Do
3f5bafb326 [Cutlass profiler] Fix SM100 FP8 nosmem epilogue shape_div “Divisibility Condition” for non‑multiple‑of‑64 N tiles (#2946)
* .

* .

* .

* .

* .

* .

* .
2026-01-20 15:27:34 +08:00
Tianqi Chen
1e6da09275 [DOCS] Update docs to precisely describe env stream scenario (#2824)
The term "synchronize" may be confusing, since users could take it to mean stream synchronization; it simply means that we pass the correct current stream as the env stream.
2026-01-20 09:16:37 +08:00
Benjamin Leff
8debf77437 fix: 2305 omissions (#2957) 2026-01-14 00:55:05 -05:00
Brian K. Ryu
147f5673d0 New RMS Norm example with unit tests (#2917)
* Add rmsnorm example

* Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly

* Separate out reduce.py

* Change copyright notice years
2026-01-13 09:05:31 +08:00
Johnsonms
8c52459504 Fix incorrect tensor layout strides in Blackwell MMA tutorial comments (#2921) 2026-01-09 01:02:41 -05:00
kf-zhang
0deda34b9f fix typo (#2884) 2026-01-09 00:57:06 -05:00
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Wenxuan Tan
f86feb0aa8 Fix idx2crd docstring (#2914)
* fix idx2crd docstring

* fix

* fix
2026-01-07 13:11:38 -05:00
Andrew Yooeun Chun
eb61c91147 Fix CUDA version checking in examples (#2894)
* examples: update CUDA version requirements in Blackwell examples

* examples: fix comments to specify the correct CUDA version requirement
2026-01-07 00:20:37 -05:00
Aidan Do
670480df3a Fix SFB Layout scale granularity representation (#2924) 2026-01-06 23:55:21 -05:00
veritas-Qiu
61b560983a remove useless line (#2926)
The parameter workspace is marked as unused, as in other kernels, but it has actually been used since 3.3.0, so the code that marks it as unused can be removed.
2026-01-06 23:54:08 -05:00
dePaul Miller
7127592069 Replace CUDA driver API with runtime API (#2928)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-01-05 13:50:44 -05:00
questa-quan-wang
2aee73922c Minor fix for testing of blockscaled dense GEMM with TMA prefetch (#2930)
* new example with TMA prefetch feature targeting DRAM-latency-bound cases

* minor fix to restrict to 100a arch

* typo

* apply arch for whole pytest

---------

Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
Co-authored-by: Questa Wang <questaw@umbriel-b200-145.ipp4a1.colossus.nvidia.com>
2026-01-05 16:36:03 +08:00
tsu-bin
3d9de19bb7 add constexpr specifier to make_tiled_copy (#2875) 2026-01-03 15:39:43 -05:00
Haicheng Wu
853ad93d60 Update README.md 2025-12-24 00:21:59 -05:00
Haicheng Wu
34a81f0497 Update driver bug workaround description in CHANGELOG
Clarified the description of a driver bug workaround in the CHANGELOG.
2025-12-24 00:20:21 -05:00
questa-quan-wang
3f4c086d09 new example with TMA prefetch feature targeting for DRAM latency bound cases (#2881)
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2025-12-23 15:29:48 +08:00
Junkai-Wu
b7ecaa605d v4.3.4 update v2. (#2898) 2025-12-22 22:28:26 -05:00
Junkai-Wu
7f5fe3edf1 v4.3.4 update. (#2892) 2025-12-21 11:49:12 -05:00
Colin Peppler
4b52d37ecd docs: note when DSL dumps are populated 2025-12-19 17:05:12 -08:00
dongxiao
331e2f451c add missing condition for sync (#2889) 2025-12-19 11:00:30 +08:00
Qi Yuhang
ebf3165efb [Bug Fix]Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler (#2865)
* Delete unused #ifdef/#endif. Bypass sm120 case.

* Add todo.

* Fix pingpong.

* Revert "Add todo."

This reverts commit 246cb42091.

* Refine name.

Refine name again.

* Apply suggestions from code review

Skip `is_last_tile` for all sm120 kernels.

Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>

* Skip early stop for sm120 kernel.

* Fix typo.

---------

Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-12-18 08:51:38 +08:00
Haicheng Wu
d4e16f5d4e Bump version from 4.2.1 to 4.3.3 2025-12-11 23:58:38 -05:00
Junkai-Wu
d3a5492381 v4.3.3 update. (#2868) 2025-12-11 00:26:58 -05:00
Amin Sedaghat
49bd6bf1ba fix print_layout printf format in device code (#2688)
* fix print_layout printf format in device code

* Replace %.*s format specifier with explicit loop
* Remove unused delim variable

The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.

Fixes #2496

* Update include/cute/util/print_tensor.hpp

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

* Update include/cute/util/print_tensor.hpp

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

---------

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-12-10 08:57:56 +08:00
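Note on 49bd6bf1ba: the replacement of `%.*s` with an explicit loop can be illustrated on the host as below. This is a minimal sketch and not the code from `print_tensor.hpp`: both helper names are hypothetical, the helpers return strings rather than printing so that the two approaches can be compared, and the count `n` is assumed to be non-negative with inputs shorter than the 64-byte buffer.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Host-side reference: in standard C, "%.*s" emits at most `n` characters
// of `s`. Assumes n >= 0 and that the result fits in the local buffer.
std::string prefix_via_printf(const char* s, int n) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.*s", n, s);
  return std::string(buf);
}

// Loop-based equivalent: copy characters one at a time, stopping after `n`
// characters or at the terminating NUL, whichever comes first. The same
// pattern avoids a dynamic precision in the format string entirely.
std::string prefix_via_loop(const char* s, int n) {
  std::string out;
  for (int i = 0; i < n && s[i] != '\0'; ++i) {
    out.push_back(s[i]);
  }
  return out;
}
```

In device code the loop body would emit each character with a fixed format such as `printf("%c", s[i])`, so no dynamic precision is required.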
Junkai-Wu
c6dfdf8375 Merge pull request #2719 from HydraQYH/dev_support_pdl_for_sm90_gemm_array_tma_ws
Support PDL for SM90 Array TMA GEMM
2025-12-09 18:32:15 +08:00
HydraQYH
95f8beb44c Revert "Remove unnecessary #ifdef #endif for general gemm."
This reverts commit 17ffd56dfe.
2025-12-09 11:52:23 +08:00
HydraQYH
17ffd56dfe Remove unnecessary #ifdef #endif for general gemm. 2025-12-06 10:20:35 +08:00
HydraQYH
ff7f2dcdfb Remove duplicated cutlass::arch::wait_on_dependent_grids(); 2025-12-06 10:20:35 +08:00
HydraQYH
929e1e0259 Remove unnecessary #ifdef / #endif for launch_dependent_grids. 2025-12-06 10:20:35 +08:00
HydraQYH
b6ad6db219 Delete unnecessary #ifdef / #endif. 2025-12-06 10:20:35 +08:00
HydraQYH
e1b2ec57e3 Hoist waits above the warp specialized region. 2025-12-06 10:20:35 +08:00
HydraQYH
1e5f95cbbe Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative 2025-12-06 10:20:35 +08:00
HydraQYH
acf5990cc2 Refine position for wait_on_dependent_grids. 2025-12-06 10:20:35 +08:00
HydraQYH
91de7891a5 Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp 2025-12-06 10:20:35 +08:00
Haicheng Wu
1e7d3e030b Update CHANGELOG for CuTe DSL enhancements
Added new environment variable for CuTe DSL cache directory.
2025-12-05 13:49:57 -05:00
Haicheng Wu
c4744f706e Bump version from 4.2.1 to 4.3.2 2025-12-05 13:45:16 -05:00