Linfeng Zheng
772fbb264e
[CLI] add cutedsl fp16 gemm tutorial from 2 to 6 ( #3106 )
...
* [CLI] add fp16 gemm tutorial from 2 to 6
* [CLI] refine comments
2026-03-17 10:11:55 +08:00
Blake Ledden
087c84df83
docs: Fix float16 documentation in elementwise_add notebook ( #2949 ) ( #3047 )
...
The notebook uses float16 tensors but the vectorized kernel documentation
incorrectly describes elements as 32-bit and uses 4-element vectorization.
Updated to correctly state 16-bit elements with 8-element vectorization
for proper 128-bit loads/stores.
Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 10:29:46 +08:00
dePaul Miller
73c59c055c
Support for Group GEMM in CUTLASS Profiler for Geforce and Spark ( #3092 )
...
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Junkai-Wu
3bb6e28d3c
v4.4.1 update ( #3079 )
2026-02-27 13:59:21 -05:00
Tianqi Zhang (张天启)
c651d660d2
fix typo ( #3012 )
2026-02-27 16:25:35 +08:00
mnehete32
79345359a7
Fix debug typo in sgemm_2.cu and sgemm_sm70.cu ( #2678 )
2026-02-27 16:23:59 +08:00
Junkai-Wu
057635de5c
Remove redundant dsl example. ( #3074 )
2026-02-26 08:10:59 -05:00
Junkai-Wu
c213bfdfc1
Remove redundant dsl examples. ( #3071 )
2026-02-25 22:42:01 -05:00
Linfeng Zheng
3476ddb7bd
remove mixed_input_fmha_prefill ( #3041 )
2026-02-18 07:59:01 -05:00
Yihan Chen
291300ffff
[CuTeDSL] implement a cta-level norm example (both layernorm and rmsnorm) ( #3009 )
...
* kernel impl
* add copyright
2026-02-14 17:54:03 +08:00
aragorn-guan
f9a5f76b7a
Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py ( #3027 )
2026-02-14 17:51:20 +08:00
Junkai-Wu
d4bbf728ca
v4.4 tag release update. ( #3032 )
2026-02-13 23:27:58 -05:00
aragorn-guan
8dbce01473
[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store ( #2970 )
2026-02-11 11:54:00 +08:00
drazi
71aa7a0abc
Merge pull request #2919 from pbelevich/patch-1
...
Refactor binary_op functions to remove unused result parameter
2026-02-11 11:48:58 +08:00
Junkai-Wu
6b3e607b85
v4.4 release update v2. ( #2999 )
2026-02-03 20:48:31 -05:00
Hua Huang
1cfbb53a23
[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator ( #2995 )
...
* Fix: SM100 block-scale gemm overlapping accumulator
Signed-off-by: Hua Huang <huah@nvidia.com>
* Also include threads_per_warp fix
Signed-off-by: Hua Huang <huah@nvidia.com>
---------
Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6
fix performance issues in cute-dsl examples for 4.4-ctk13.1 release ( #2988 )
...
* fix grouped gemm
* fix mixed input gemm
* fix mixed input grouped gemm
* fix version checking
* use advanced compiler options
* fix comment
* rename advanced compiler configs to advanced compiler control
* fix comment
* fix name
* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300
fix performance regression in cute-dsl examples for 4.4-ctk13.1 release ( #2990 )
...
* fix regression with cu13.1
* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9
Update nvvm API call from nvvm enum to str ( #2985 )
2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776
update api usage ( #2969 )
2026-01-27 15:33:22 +08:00
Junkai-Wu
9fba3195f9
v4.4 update. ( #2979 )
2026-01-24 11:46:17 -05:00
Johnsonms
0edaa6e47d
Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial ( #2945 )
2026-01-23 12:54:12 +08:00
Aidan Do
431d070fcb
[docs] Add additional tip for generating fewer kernels in blockwise ( #2940 )
...
- Running without this generates a lot of kernels
- Clarified CMake configuration for selecting GEMM kernels and added details on kernel generation granularity.
2026-01-23 12:53:51 +08:00
Brian K. Ryu
147f5673d0
New RMS Norm example with unit tests ( #2917 )
...
* Add rmsnorm example
* Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly
* Separate out reduce.py
* Change copyright notice years
2026-01-13 09:05:31 +08:00
Johnsonms
8c52459504
Fix incorrect tensor layout strides in Blackwell MMA tutorial comments ( #2921 )
2026-01-09 01:02:41 -05:00
Junkai-Wu
0d2b201e8c
v4.3.5 update. ( #2934 )
...
* v4.3.5 update.
* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Andrew Yooeun Chun
eb61c91147
Fix CUDA version checking in examples ( #2894 )
...
* examples: update CUDA version requirements in Blackwell examples
* examples: fix comments to specify the correct CUDA version requirement
2026-01-07 00:20:37 -05:00
Aidan Do
670480df3a
Fix SFB Layout scale granularity representation ( #2924 )
2026-01-06 23:55:21 -05:00
Pavel Belevich
b6d7703e02
Refactor binary_op functions to remove unused result parameter
2026-01-02 11:23:43 -05:00
Pavel Belevich
f9bedd9096
Fix print statement for floor division result
2026-01-02 11:15:15 -05:00
questa-quan-wang
3f4c086d09
new example with TMA prefetch feature targeting for DRAM latency bound cases ( #2881 )
...
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2025-12-23 15:29:48 +08:00
Junkai-Wu
7f5fe3edf1
v4.3.4 update. ( #2892 )
2025-12-21 11:49:12 -05:00
dongxiao
331e2f451c
add missing condition for sync ( #2889 )
2025-12-19 11:00:30 +08:00
Linfeng Zheng
f6402fcd5e
add pytest support for tutorial gemm ( #2826 )
...
* add pytest support for tutorial gemm
* add license
2025-12-05 08:45:01 -05:00
bangyu shen
7252a2d17e
remove internal comment ( #2841 )
...
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-04 10:36:21 -05:00
Junkai-Wu
bc680c7f67
v4.3.2 update. ( #2839 )
2025-12-04 10:14:32 -05:00
bangyu shen
52ae719eda
[examples][CuTeDSL] init commit for distributed examples ( #2806 )
...
* init commit for distributed examples
* better OOB protection
* add a try-import for nvshmem for a better error message, and a README.md to introduce nvshmem and multimem instructions
* add some lamport explanation
* enhance f8 output and warn that f8 output can have nan in it
* tell user why we need complicated data conversions in ref check part
* tell user we don't support nvshmem device function
---------
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-01 22:25:40 -05:00
drazi
ec8daf642d
Merge pull request #2809 from whatdhack/patch-1
...
Update notebook title from 'Tour to' to 'Tour of'
2025-11-28 18:07:34 +08:00
Fung Xie
286781a1fb
add requirements.txt
2025-11-27 17:02:27 -08:00
Fung Xie
2664cac685
enhanced the example for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
b9154d65b3
update examples for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
afe2f71522
reorganize examples for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
739fffce27
fix TVM FFI doc and update example
2025-11-27 17:02:26 -08:00
Junkai-Wu
1de3a576cc
v4.3.1 update. ( #2817 )
2025-11-27 09:49:30 -05:00
Shreya Gaur
2052fd3885
Blockscaled Ragged Contiguous Grouped Gemm for MoEs ( #2790 )
...
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-26 20:16:49 -05:00
whatdhack
4a55379686
Update notebook title from 'Tour to' to 'Tour of'
...
Grammar check. LLMs can quickly clean up such issues.
2025-11-24 20:11:14 -08:00
Junkai-Wu
8cd5bef43a
v4.3 tag release update. ( #2789 )
2025-11-20 20:49:44 -05:00
Linfeng Zheng
406e078b29
add a notebook for tour to sol gemm ( #2780 )
...
* add tour to sol gemm notebook
* change some typos
* change some typos
2025-11-20 09:41:01 -05:00
Mindy Li
06b6bd7d7b
remove cute dsl pdl example.
2025-11-09 21:47:00 -08:00
Linfeng Zheng
2252254ce2
Add tutorial fp16_gemm_1 ( #2750 )
...
* Add tutorial fp16_gemm_1
* refine
* refine
* refine
* revert changes in fp16_gemm_0.py
2025-11-06 22:40:09 -05:00