Qi Yuhang
ebf3165efb
[Bug Fix] Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler (#2865)
...
* Delete unused #ifdef/#endif. Bypass sm120 case.
* Add todo.
* Fix pingpong.
* Revert "Add todo."
This reverts commit 246cb42091.
* Refine name.
Refine name again.
* Apply suggestions from code review
Skip `is_last_tile` for all sm120 kernels.
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
* Skip early stop for sm120 kernel.
* Fix typo.
---------
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-12-18 08:51:38 +08:00
Haicheng Wu
d4e16f5d4e
Bump version from 4.2.1 to 4.3.3
2025-12-11 23:58:38 -05:00
Junkai-Wu
d3a5492381
v4.3.3 update. (#2868)
2025-12-11 00:26:58 -05:00
Amin Sedaghat
49bd6bf1ba
fix print_layout printf format in device code (#2688)
...
* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.
Fixes #2496
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-12-10 08:57:56 +08:00
Junkai-Wu
c6dfdf8375
Merge pull request #2719 from HydraQYH/dev_support_pdl_for_sm90_gemm_array_tma_ws
...
Support PDL for SM90 Array TMA GEMM
2025-12-09 18:32:15 +08:00
HydraQYH
95f8beb44c
Revert "Remove unnecessary #ifdef #endif for general gemm."
...
This reverts commit 17ffd56dfe.
2025-12-09 11:52:23 +08:00
HydraQYH
17ffd56dfe
Remove unnecessary #ifdef #endif for general gemm.
2025-12-06 10:20:35 +08:00
HydraQYH
ff7f2dcdfb
Remove duplicated cutlass::arch::wait_on_dependent_grids();
2025-12-06 10:20:35 +08:00
HydraQYH
929e1e0259
Remove unnecessary #ifdef / #endif for launch_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
b6ad6db219
Delete unnecessary #ifdef / #endif.
2025-12-06 10:20:35 +08:00
HydraQYH
e1b2ec57e3
Hoist waits above the warp specialized region.
2025-12-06 10:20:35 +08:00
HydraQYH
1e5f95cbbe
Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative
2025-12-06 10:20:35 +08:00
HydraQYH
acf5990cc2
Refine position for wait_on_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
91de7891a5
Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp
2025-12-06 10:20:35 +08:00
Haicheng Wu
1e7d3e030b
Update CHANGELOG for CuTe DSL enhancements
...
Added new environment variable for CuTe DSL cache directory.
2025-12-05 13:49:57 -05:00
Haicheng Wu
c4744f706e
Bump version from 4.2.1 to 4.3.2
2025-12-05 13:45:16 -05:00
Linfeng Zheng
f6402fcd5e
add pytest support for tutorial gemm (#2826)
...
* add pytest support for tutorial gemm
* add license
2025-12-05 08:45:01 -05:00
bangyu shen
7252a2d17e
remove internal comment (#2841)
...
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-04 10:36:21 -05:00
Junkai-Wu
bc680c7f67
v4.3.2 update. (#2839)
2025-12-04 10:14:32 -05:00
bangyu shen
52ae719eda
[examples][CuTeDSL] init commit for distributed examples (#2806)
...
* init commit for distributed examples
* better OOB protection
* add a try-import of nvshmem for a better error message, and a README.md introducing nvshmem and multimem instructions
* add some lamport explanation
* enhance f8 output and warn that f8 output can contain NaN
* tell user why we need complicated data conversions in ref check part
* tell user we don't support nvshmem device functions
---------
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-01 22:25:40 -05:00
Haicheng Wu
5e847d37c4
Bump version from 4.2.1 to 4.3.1
2025-12-01 22:13:19 -05:00
Haicheng Wu
f16068b4db
Bump version from 4.2.0 to 4.3.1
2025-12-01 22:12:20 -05:00
Haicheng Wu
1acfe141af
Bump version from 4.2.1 to 4.3.1
2025-12-01 22:11:13 -05:00
Haicheng Wu
f11375bf91
Bump CUTLASS patch version to 1
2025-12-01 22:08:52 -05:00
Shreya Gaur
af8d5dfa54
bug fix for example 92 (#2830)
...
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
Co-authored-by: Shreya Gaur <shgaur@2u2g-spr-0015.ipp4a1.colossus.nvidia.com>
2025-12-01 22:02:59 -05:00
drazi
ec8daf642d
Merge pull request #2809 from whatdhack/patch-1
...
Update notebook title from 'Tour to' to 'Tour of'
2025-11-28 18:07:34 +08:00
drazi
5016493cc0
Merge pull request #2813 from fengxie/ftse/fix/example
...
Refactor TVM FFI examples and update doc
2025-11-28 09:07:15 +08:00
Fung Xie
8588d099e4
refactored doc
2025-11-27 17:04:20 -08:00
Fung Xie
8fc9bc5dda
update doc
2025-11-27 17:03:51 -08:00
Fung Xie
f71892b824
update doc
2025-11-27 17:03:03 -08:00
Fung Xie
03aa211310
update doc
2025-11-27 17:02:59 -08:00
Fung Xie
286781a1fb
add requirements.txt
2025-11-27 17:02:27 -08:00
Fung Xie
2664cac685
enhanced the example for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
b9154d65b3
update examples for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
afe2f71522
reorganize examples for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
739fffce27
fix TVM FFI doc and update example
2025-11-27 17:02:26 -08:00
Junkai-Wu
1de3a576cc
v4.3.1 update. (#2817)
2025-11-27 09:49:30 -05:00
Shreya Gaur
2052fd3885
Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790)
...
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-26 20:16:49 -05:00
whatdhack
4a55379686
Update notebook title from 'Tour to' to 'Tour of'
...
Grammar check. LLMs can quickly clean up such issues.
2025-11-24 20:11:14 -08:00
Haicheng Wu
e67e63c331
Bump version from 4.2.1 to 4.3.0
v4.3.0
2025-11-24 16:36:06 -05:00
Haicheng Wu
ddaf12c1b1
Bump version from 4.2.0 to 4.3.0
2025-11-24 16:35:27 -05:00
Haicheng Wu
7967ce5f83
Bump version to 4.3.0
2025-11-24 16:34:45 -05:00
Junkai-Wu
8cd5bef43a
v4.3 tag release update. (#2789)
2025-11-20 20:49:44 -05:00
Linfeng Zheng
406e078b29
add a notebook for tour to sol gemm (#2780)
...
* add tour to sol gemm notebook
* change some typos
* change some typos
2025-11-20 09:41:01 -05:00
Zekun Fan
a2439551c7
Fixed editable install to depend on CuTeDSL/requirements.txt (#2768)
...
To guarantee wheel version alignment of the source code.
2025-11-14 15:31:49 -08:00
drazi
bd96096d58
Merge pull request #2758 from limin2021/delete-pdl-example
...
[cute-dsl][fix] remove cute dsl pdl example.
2025-11-10 22:56:57 +08:00
Mindy Li
06b6bd7d7b
remove cute dsl pdl example.
2025-11-09 21:47:00 -08:00
Linfeng Zheng
2252254ce2
Add tutorial fp16_gemm_1 (#2750)
...
* Add tutorial fp16_gemm_1
* refine
* refine
* refine
* revert changes in fp16_gemm_0.py
2025-11-06 22:40:09 -05:00
Ali Hassani
d1ef0e87f2
DistGEMM bug fixes (#2713)
...
* Blackwell DistGEMM bug fixes
1. If using preferred cluster, there needs to be a branch so that
the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
and DistGEMM was previously using the per-device shape to evaluate
workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in
Hopper example as well).
4. Preferred and fallback cluster args need to be set explicitly,
otherwise if someone modifies the example to use preferred cluster,
it will just fail.
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00
ANIKET SHIVAM
020c700e97
support for K=0 for sm100 GG (#2746)
2025-11-04 11:25:39 -05:00