yuanxiaolan
de161925a5
pass in stream=-1
2026-02-03 11:59:14 +08:00
yuanxiaolan
de198b2419
fix tvm-ffi path in from_dlpack
2026-02-03 11:59:13 +08:00
Hua Huang
1cfbb53a23
[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator ( #2995 )
...
* Fix: SM100 block-scale gemm overlapping accumulator
Signed-off-by: Hua Huang <huah@nvidia.com>
* Also include threads_per_warp fix
Signed-off-by: Hua Huang <huah@nvidia.com>
---------
Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6
fix performance issues in cute-dsl examples for 4.4-ctk13.1 release ( #2988 )
...
* fix grouped gemm
* fix mixed input gemm
* fix mixed input grouped gemm
* fix version checking
* use advanced compiler options
* fix comment
* rename advanced compiler configs to advanced compiler control
* fix comment
* fix name
* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300
fix performance regression in cute-dsl examples for 4.4-ctk13.1 release ( #2990 )
...
* fix regression with cu13.1
* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9
Update nvvm API call from nvvm enum to str ( #2985 )
2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776
update api usage ( #2969 )
2026-01-27 15:33:22 +08:00
drazi
51f82812ec
Merge pull request #2891 from ColinPeppler/main
...
docs: note when DSL dumps are populated
2026-01-26 17:38:27 -08:00
Junkai-Wu
9fba3195f9
v4.4 update. ( #2979 )
2026-01-24 11:46:17 -05:00
Qi Yuhang
2fafefb7b9
[Bug Fix] Set NumSplitsM to 1 when TileShapeM < 128 in sm90 fp8 blockwise scaling CollectiveMma ( #2965 )
...
* Fix NumSplitsM when TileShapeM < 128.
* Use cute::conditional_t to replace std::conditional_t.
2026-01-23 15:56:52 +08:00
Johnsonms
0edaa6e47d
Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial ( #2945 )
2026-01-23 12:54:12 +08:00
Aidan Do
431d070fcb
[docs] Add additional tip for generating fewer kernels in blockwise ( #2940 )
...
- Running without this generates a lot of kernels
- Clarified CMake configuration for selecting GEMM kernels and added details on kernel generation granularity.
2026-01-23 12:53:51 +08:00
Qi Yuhang
667446a9dd
[Doc] Fix Mode Name and Stride in 0t_mma_atom.md ( #2910 )
...
* Fix wrong stride in 0t_mma_atom.md
* Fix mode name.
* Update media/docs/cpp/cute/0t_mma_atom.md
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2026-01-23 12:53:30 +08:00
Aidan Do
3f5bafb326
[Cutlass profiler] Fix SM100 FP8 nosmem epilogue shape_div “Divisibility Condition” for non‑multiple‑of‑64 N tiles ( #2946 )
...
* .
* .
* .
* .
* .
* .
* .
2026-01-20 15:27:34 +08:00
Tianqi Chen
1e6da09275
[DOCS] Update docs to precisely describe env stream scenario ( #2824 )
...
The term "synchronize" may mislead users into thinking it means a stream sync; it simply means we pass the correct current stream as the env stream.
2026-01-20 09:16:37 +08:00
Benjamin Leff
8debf77437
fix: 2305 omissions ( #2957 )
2026-01-14 00:55:05 -05:00
Brian K. Ryu
147f5673d0
New RMS Norm example with unit tests ( #2917 )
...
* Add rmsnorm example
* Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly
* Separate out reduce.py
* Change copyright notice years
2026-01-13 09:05:31 +08:00
Johnsonms
8c52459504
Fix incorrect tensor layout strides in Blackwell MMA tutorial comments ( #2921 )
2026-01-09 01:02:41 -05:00
kf-zhang
0deda34b9f
fix typo ( #2884 )
2026-01-09 00:57:06 -05:00
Junkai-Wu
0d2b201e8c
v4.3.5 update. ( #2934 )
...
* v4.3.5 update.
* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Wenxuan Tan
f86feb0aa8
Fix idx2crd docstring ( #2914 )
...
* fix idx2crd docstring
* fix
* fix
2026-01-07 13:11:38 -05:00
Andrew Yooeun Chun
eb61c91147
Fix CUDA version checking in examples ( #2894 )
...
* examples: update CUDA version requirements in Blackwell examples
* examples: fix comments to specify the correct CUDA version requirement
2026-01-07 00:20:37 -05:00
Aidan Do
670480df3a
Fix SFB Layout scale granularity representation ( #2924 )
2026-01-06 23:55:21 -05:00
veritas-Qiu
61b560983a
remove useless line ( #2926 )
...
The parameter workspace is marked as unused, as in other kernels, but it has actually been used since 3.3.0, so the code that marks it as unused can be removed.
2026-01-06 23:54:08 -05:00
dePaul Miller
7127592069
Replace CUDA driver API with runtime API ( #2928 )
...
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-01-05 13:50:44 -05:00
questa-quan-wang
2aee73922c
Minor fix for testing of blockscaled dense GEMM with TMA prefetch ( #2930 )
...
* new example with TMA prefetch feature targeting DRAM-latency-bound cases
* minor fix to restrict to 100a arch
* typo
* apply arch for whole pytest
---------
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
Co-authored-by: Questa Wang <questaw@umbriel-b200-145.ipp4a1.colossus.nvidia.com>
2026-01-05 16:36:03 +08:00
tsu-bin
3d9de19bb7
add constexpr specifier to make_tiled_copy ( #2875 )
2026-01-03 15:39:43 -05:00
Haicheng Wu
853ad93d60
Update README.md
2025-12-24 00:21:59 -05:00
Haicheng Wu
34a81f0497
Update driver bug workaround description in CHANGELOG
...
Clarified the description of a driver bug workaround in the CHANGELOG.
2025-12-24 00:20:21 -05:00
questa-quan-wang
3f4c086d09
new example with TMA prefetch feature targeting DRAM-latency-bound cases ( #2881 )
...
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2025-12-23 15:29:48 +08:00
Junkai-Wu
b7ecaa605d
v4.3.4 update v2. ( #2898 )
2025-12-22 22:28:26 -05:00
Junkai-Wu
7f5fe3edf1
v4.3.4 update. ( #2892 )
2025-12-21 11:49:12 -05:00
Colin Peppler
4b52d37ecd
docs: note when DSL dumps are populated
2025-12-19 17:05:12 -08:00
dongxiao
331e2f451c
add missing condition for sync ( #2889 )
2025-12-19 11:00:30 +08:00
Qi Yuhang
ebf3165efb
[Bug Fix] Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler ( #2865 )
...
* Delete unused #ifdef/#endif. Bypass sm120 case.
* Add todo.
* Fix pingpong.
* Revert "Add todo."
This reverts commit 246cb42091.
* Refine name.
Refine name again.
* Apply suggestions from code review
Skip `is_last_tile` for all sm120 kernels.
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
* Skip early stop for sm120 kernel.
* Fix typo.
---------
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-12-18 08:51:38 +08:00
Haicheng Wu
d4e16f5d4e
Bump version from 4.2.1 to 4.3.3
2025-12-11 23:58:38 -05:00
Junkai-Wu
d3a5492381
v4.3.3 update. ( #2868 )
2025-12-11 00:26:58 -05:00
Amin Sedaghat
49bd6bf1ba
fix print_layout printf format in device code ( #2688 )
...
* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.
Fixes #2496
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-12-10 08:57:56 +08:00
Junkai-Wu
c6dfdf8375
Merge pull request #2719 from HydraQYH/dev_support_pdl_for_sm90_gemm_array_tma_ws
...
Support PDL for SM90 Array TMA GEMM
2025-12-09 18:32:15 +08:00
HydraQYH
95f8beb44c
Revert "Remove unnecessary #ifdef #endif for general gemm."
...
This reverts commit 17ffd56dfe.
2025-12-09 11:52:23 +08:00
HydraQYH
17ffd56dfe
Remove unnecessary #ifdef #endif for general gemm.
2025-12-06 10:20:35 +08:00
HydraQYH
ff7f2dcdfb
Remove duplicated cutlass::arch::wait_on_dependent_grids();
2025-12-06 10:20:35 +08:00
HydraQYH
929e1e0259
Remove unnecessary #ifdef / #endif for launch_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
b6ad6db219
Delete unnecessary #ifdef / #endif.
2025-12-06 10:20:35 +08:00
HydraQYH
e1b2ec57e3
Hoist waits above the warp specialized region.
2025-12-06 10:20:35 +08:00
HydraQYH
1e5f95cbbe
Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative
2025-12-06 10:20:35 +08:00
HydraQYH
acf5990cc2
Refine position for wait_on_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
91de7891a5
Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp
2025-12-06 10:20:35 +08:00
Haicheng Wu
1e7d3e030b
Update CHANGELOG for CuTe DSL enhancements
...
Added new environment variable for CuTe DSL cache directory.
2025-12-05 13:49:57 -05:00
Haicheng Wu
c4744f706e
Bump version from 4.2.1 to 4.3.2
2025-12-05 13:45:16 -05:00