Mindy Li
06b6bd7d7b
remove cute dsl pdl example.
2025-11-09 21:47:00 -08:00
Linfeng Zheng
2252254ce2
Add tutorial fp16_gemm_1 (#2750)
...
* Add tutorial fp16_gemm_1
* refine
* refine
* refine
* revert changes in fp16_gemm_0.py
2025-11-06 22:40:09 -05:00
Ali Hassani
d1ef0e87f2
DistGEMM bug fixes (#2713)
...
* Blackwell DistGEMM bug fixes
1. If using preferred cluster, there needs to be a branch so that
the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
and DistGEMM was previously using the per-device shape to evaluate
workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in
Hopper example as well)
4. Preferred and fallback cluster args need to be set explicitly,
otherwise if someone modifies the example to use preferred cluster,
it will just fail.
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00
ANIKET SHIVAM
020c700e97
support for K=0 for sm100 GG (#2746)
2025-11-04 11:25:39 -05:00
Haicheng Wu
8afb19d904
update CITATION.cff
2025-10-28 23:42:37 -04:00
Qi Yuhang
b2ca083d2b
Fixed compilation error when using StreamK scheduler + PDL. (#2686)
2025-10-21 23:11:14 -04:00
Junkai-Wu
b1d6e2c9b3
v4.3 update. (#2709)
...
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Lain
e6e2cc29f5
fix (#2684)
2025-10-15 14:46:38 -04:00
Haicheng Wu
c6aeb9179c
Update pyproject.toml
...
update version to 4.2.1
2025-09-24 01:18:51 -04:00
Haicheng Wu
95a5ff14c0
Update CHANGELOG.md
...
format change
2025-09-23 17:33:00 -04:00
ANIKET SHIVAM
fb8b43ef05
Merge pull request #2669 from NVIDIA/421_update
...
4.2.1 update
2025-09-23 14:02:29 -07:00
Haicheng Wu
f874df19ac
4.2.1 update
2025-09-23 13:45:13 -07:00
Junkai-Wu
7a6d4ee099
v4.2.1 update. (#2666)
2025-09-23 13:25:43 -04:00
GTO
2b8dff1f90
Fix bfloat16 epsilon (#2607)
...
* Fix bfloat16 epsilon
* just use constants
---------
Co-authored-by: Konstantin <konstantin@MacBook-Air.local>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-09-21 23:43:59 -04:00
103yiran
fd0312ddf6
Remove duplicate function calls (#1584)
2025-09-21 23:16:59 -04:00
Aya Z. Ibrahim
64579189ec
Feature/add bottom causal mask (#2480)
...
* Rebase to latest
* update
* upd
* Update fmha_fusion.hpp
* Update fmha_fusion.hpp
fixed flipped logic for isQBegin
* Update fmha_fusion.hpp
* Avoid use of booleans
The current expression is confusing
* fmt
* Update fmha_fusion.hpp
Reproduce error/fix with:
./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend
* add test, format
---------
Co-authored-by: Richard Cai <ricai@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-18 17:11:23 -04:00
Jack Kosaian
b234a8c024
Rename python/cutlass to python/cutlass_cppgen (#2652)
2025-09-18 14:26:57 -04:00
Junkai-Wu
74825181f2
Remove old-version dsl examples. (#2644)
2025-09-17 22:23:30 -04:00
Junkai-Wu
8825e8be4f
Add required changes for github pipeline. (#2648)
2025-09-17 22:22:45 -04:00
wbn
7817e47154
Fxied a typo in pipeline descript docs. (#2623)
2025-09-15 22:32:27 -04:00
Asuka
25ccb875b8
Fix: a calculation error in the example of dividing out in the 02_layout_algebra doc (#2635)
2025-09-15 22:31:33 -04:00
Wanshe
29c1ad704a
Fix doc cute 03_tensor.md link typo (#2627)
...
* Update 03_tensor.md fix link typo
change path to relative path
* Update 03_tensor.md
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-15 22:26:43 -04:00
Haicheng Wu
57e3cfb47a
doc change for 4.2 (#2639)
...
* doc change
* fix broken links
* ragged gemm doc update
* move around texts about moe gemm
2025-09-15 22:02:45 -04:00
Haicheng Wu
e7e0adddac
Update version.h
...
change version number to 4.2
2025-09-15 12:40:58 -04:00
Junkai-Wu
6a35b4d22f
v4.2 tag release. (#2638)
2025-09-15 12:21:53 -04:00
Richard Cai
56f0718a97
ex77 backwards GQA (#2556)
...
* bwd GQA init
* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu
* ref kernel type conversion fix
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-09 12:53:28 -04:00
Lifu Huang
76c96b0be3
Fix incorrect shapes in copy_atom doc comments. (#2575)
2025-09-04 16:57:24 -07:00
ao jia
d98e7bf7ce
Fix comment in mma_atom.hpp (#2579)
2025-09-04 16:56:39 -07:00
Lifu Huang
b6ccf34aef
Fix Copy_Atom type mismatch in sgemm_sm80.cu (#2582)
2025-09-04 16:56:17 -07:00
Andrei Alexandrescu
2288c0c901
Fix bugs in matrix.h (#2598)
2025-09-04 16:55:11 -07:00
Harrison Barclay
b2dd65dc86
more robust imports in heuristics.py and heuristics_provider.py (#2596)
2025-08-28 22:32:55 -04:00
Javier
496654bf2c
Fix sm100 gemm wrong static constexpr that breaks compilation on Windows (#2167)
...
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows
* More Windows fixes
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
* Revert "More Windows fixes"
This reverts commit 2e8cfc1382.
---------
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
2025-08-28 22:13:00 -04:00
Linfeng Zheng
9ca7e877b2
fix gqa issue for blackwell fmha.py (#2599)
2025-08-28 11:15:20 -04:00
Junkai-Wu
a49a78ffef
v4.2 release. (#2587)
...
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.
* v4.2 release.
2025-08-22 18:11:24 -04:00
qqwqqw689
11cad1f67b
fix a typo. (#2561)
2025-08-19 22:23:09 -04:00
zkyue
931359cec1
Fix typo in functional.h (#2571)
2025-08-19 22:22:31 -04:00
Inoday Yadav
42e7c546c4
Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) (#2562)
2025-08-19 22:22:02 -04:00
melonedo
ec18e8043b
Make swizzle in pycute work (#2553)
2025-08-19 22:21:00 -04:00
Srinath Kailasa
5b76420d6a
[DOC] Add more exposition to composition example (#2536)
...
* Add more exposition to composition example
* Apply suggestions from code review
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-08-11 22:20:36 -04:00
Horace He
19772cd63e
Fix typo in smem_allocator.py (#2517)
2025-08-10 22:44:22 -04:00
zkyue
052afcd314
fix typo (#2529)
2025-08-10 22:44:02 -04:00
Srinath Kailasa
86cf63e2d4
NIT: Grammar (#2537)
2025-08-10 22:42:45 -04:00
Tarun Paparaju
a267d47f9b
Update batched_gemm.cu (#2538)
2025-08-10 22:42:21 -04:00
starwang1024
9e6ab77d27
Fix a copy error in the SM70 main loop when loading data from smem to rmem (#2540)
2025-08-10 22:42:01 -04:00
Robert Maynard
d0eada85a3
Support both CUDA 12 and 13 cccl header locations (#2543)
2025-08-10 22:41:25 -04:00
Lifu Huang
23139309e9
Fix incorrect K dim in CuTe MMA Atom doc. (#2544)
2025-08-10 22:40:56 -04:00
Wenxin Cheng
6dd13d4278
Facebook:This commit makes its files safe for use with -Wimplicit-fallthrough. (#2324)
2025-07-31 20:55:19 -04:00
Srinath Kailasa
3b054767b3
Fix typo ( #2514 )
2025-07-30 22:14:54 -04:00
Ali Hassani
6fb5e667c1
[Doc fix] incorrect compute cap. for Blackwell RTX (#2511)
...
Blackwell RTX is compute capability 12.0 (SM120) but incorrectly listed
as SM100 in the README.
2025-07-30 22:14:13 -04:00
Wenbo Yang
6c891db9f6
Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. (#2333)
2025-07-30 22:12:53 -04:00