Junkai-Wu
c213bfdfc1
Remove redundant dsl examples. ( #3071 )
2026-02-25 22:42:01 -05:00
Linfeng Zheng
3476ddb7bd
remove mixed_input_fmha_prefill ( #3041 )
2026-02-18 07:59:01 -05:00
Yihan Chen
291300ffff
[CuTeDSL] implement a cta-level norm example (both layernorm and rmsnorm) ( #3009 )
...
* kernel impl
* add copyright
2026-02-14 17:54:03 +08:00
aragorn-guan
f9a5f76b7a
Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py ( #3027 )
2026-02-14 17:51:20 +08:00
Junkai-Wu
d4bbf728ca
v4.4 tag release update. ( #3032 )
2026-02-13 23:27:58 -05:00
aragorn-guan
8dbce01473
[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store ( #2970 )
2026-02-11 11:54:00 +08:00
drazi
71aa7a0abc
Merge pull request #2919 from pbelevich/patch-1
...
Refactor binary_op functions to remove unused result parameter
2026-02-11 11:48:58 +08:00
Junkai-Wu
6b3e607b85
v4.4 release update v2. ( #2999 )
2026-02-03 20:48:31 -05:00
Hua Huang
1cfbb53a23
[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator ( #2995 )
...
* Fix: SM100 block-scale gemm overlapping accumulator
Signed-off-by: Hua Huang <huah@nvidia.com>
* Also include threads_per_warp fix
Signed-off-by: Hua Huang <huah@nvidia.com>
---------
Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6
fix performance issues in cute-dsl examples for 4.4-ctk13.1 release ( #2988 )
...
* fix grouped gemm
* fix mixed input gemm
* fix mixed input grouped gemm
* fix version checking
* use advanced compiler options
* fix comment
* rename advanced compiler configs to advanced compiler control
* fix comment
* fix name
* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300
fix performance regression in cute-dsl examples for 4.4-ctk13.1 release ( #2990 )
...
* fix regression with cu13.1
* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9
Update nvvm API call from nvvm enum to str ( #2985 )
2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776
update api usage ( #2969 )
2026-01-27 15:33:22 +08:00
Junkai-Wu
9fba3195f9
v4.4 update. ( #2979 )
2026-01-24 11:46:17 -05:00
Johnsonms
0edaa6e47d
Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial ( #2945 )
2026-01-23 12:54:12 +08:00
Aidan Do
431d070fcb
[docs] Add additional tip for generating less kernels in blockwise ( #2940 )
...
- Running without this generates a lot of kernels
- Clarified CMake configuration for selecting GEMM kernels and added details on kernel generation granularity.
2026-01-23 12:53:51 +08:00
Brian K. Ryu
147f5673d0
New RMS Norm example with unit tests ( #2917 )
...
* Add rmsnorm example
* Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly
* Separate out reduce.py
* Change copyright notice years
2026-01-13 09:05:31 +08:00
Johnsonms
8c52459504
Fix incorrect tensor layout strides in Blackwell MMA tutorial comments ( #2921 )
2026-01-09 01:02:41 -05:00
Junkai-Wu
0d2b201e8c
v4.3.5 update. ( #2934 )
...
* v4.3.5 update.
* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Andrew Yooeun Chun
eb61c91147
Fix CUDA version checking in examples ( #2894 )
...
* examples: update CUDA version requirements in Blackwell examples
* examples: fix comments to specify the correct CUDA version requirement
2026-01-07 00:20:37 -05:00
Aidan Do
670480df3a
Fix SFB Layout scale granularity representation ( #2924 )
2026-01-06 23:55:21 -05:00
Pavel Belevich
b6d7703e02
Refactor binary_op functions to remove unused result parameter
2026-01-02 11:23:43 -05:00
Pavel Belevich
f9bedd9096
Fix print statement for floor division result
2026-01-02 11:15:15 -05:00
questa-quan-wang
3f4c086d09
new example with TMA prefetch feature targeting for DRAM latency bound cases ( #2881 )
...
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2025-12-23 15:29:48 +08:00
Junkai-Wu
7f5fe3edf1
v4.3.4 update. ( #2892 )
2025-12-21 11:49:12 -05:00
dongxiao
331e2f451c
add missing condition for sync ( #2889 )
2025-12-19 11:00:30 +08:00
Linfeng Zheng
f6402fcd5e
add pytest support for tutorial gemm ( #2826 )
...
* add pytest support for tutorial gemm
* add license
2025-12-05 08:45:01 -05:00
bangyu shen
7252a2d17e
remove internal comment ( #2841 )
...
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-04 10:36:21 -05:00
Junkai-Wu
bc680c7f67
v4.3.2 update. ( #2839 )
2025-12-04 10:14:32 -05:00
bangyu shen
52ae719eda
[examples][CuTeDSL] init commit for distributed examples ( #2806 )
...
* init commit for distributed examples
* better OOB protection
* add a try-import for nvshmem for a better error message, and a README.md to introduce nvshmem and multimem instructions
* add some lamport explanation
* enhance f8 output and warn that f8 output can have nan in it
* tell user why we need complicated data conversions in ref check part
* tell user we don't support nvshmem device function
---------
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-01 22:25:40 -05:00
drazi
ec8daf642d
Merge pull request #2809 from whatdhack/patch-1
...
Update notebook title from 'Tour to' to 'Tour of'
2025-11-28 18:07:34 +08:00
Fung Xie
286781a1fb
add requirements.txt
2025-11-27 17:02:27 -08:00
Fung Xie
2664cac685
enhanced the example for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
b9154d65b3
update examples for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
afe2f71522
reorganize examples for tvm-ffi
2025-11-27 17:02:26 -08:00
Fung Xie
739fffce27
fix TVM FFI doc and update example
2025-11-27 17:02:26 -08:00
Junkai-Wu
1de3a576cc
v4.3.1 update. ( #2817 )
2025-11-27 09:49:30 -05:00
Shreya Gaur
2052fd3885
Blockscaled Ragged Contiguous Grouped Gemm for MoEs ( #2790 )
...
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-26 20:16:49 -05:00
whatdhack
4a55379686
Update notebook title from 'Tour to' to 'Tour of'
...
Grammar check. LLMs can quickly clean up such issues.
2025-11-24 20:11:14 -08:00
Junkai-Wu
8cd5bef43a
v4.3 tag release update. ( #2789 )
2025-11-20 20:49:44 -05:00
Linfeng Zheng
406e078b29
add a notebook for tour to sol gemm ( #2780 )
...
* add tour to sol gemm notebook
* change some typos
* change some typos
2025-11-20 09:41:01 -05:00
Mindy Li
06b6bd7d7b
remove cute dsl pdl example.
2025-11-09 21:47:00 -08:00
Linfeng Zheng
2252254ce2
Add tutorial fp16_gemm_1 ( #2750 )
...
* Add tutorial fp16_gemm_1
* refine
* refine
* refine
* revert changes in fp16_gemm_0.py
2025-11-06 22:40:09 -05:00
Ali Hassani
d1ef0e87f2
DistGEMM bug fixes ( #2713 )
...
* Blackwell DistGEMM bug fixes
1. If using preferred cluster, there needs to be a branch so that
the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
and DistGEMM was previously using the per-device shape to evaluate
workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in
Hopper example as well)
4. Preferred and fallback cluster args need to be set explicitly,
otherwise if someone modifies the example to use preferred cluster,
it will just fail.
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00
ANIKET SHIVAM
020c700e97
support for K=0 for sm100 GG ( #2746 )
2025-11-04 11:25:39 -05:00
Junkai-Wu
b1d6e2c9b3
v4.3 update. ( #2709 )
...
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Aya Z. Ibrahim
64579189ec
Feature/add bottom causal mask ( #2480 )
...
* Rebase to latest
* update
* upd
* Update fmha_fusion.hpp
* Update fmha_fusion.hpp
fixed flipped logic for isQBegin
* Update fmha_fusion.hpp
* Avoid use of booleans
The current expression is confusing
* fmt
* Update fmha_fusion.hpp
Reproduce error/fix with:
./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend
* add test, format
---------
Co-authored-by: Richard Cai <ricai@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-18 17:11:23 -04:00
Junkai-Wu
74825181f2
Remove old-version dsl examples. ( #2644 )
2025-09-17 22:23:30 -04:00
Junkai-Wu
6a35b4d22f
v4.2 tag release. ( #2638 )
2025-09-15 12:21:53 -04:00
Richard Cai
56f0718a97
ex77 backwards GQA ( #2556 )
...
* bwd GQA init
* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu
* ref kernel type conversion fix
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-09 12:53:28 -04:00