Commit Graph

730 Commits

Author SHA1 Message Date
Mindy Li
06b6bd7d7b remove cute dsl pdl example. 2025-11-09 21:47:00 -08:00
Linfeng Zheng
2252254ce2 Add tutorial fp16_gemm_1 (#2750)
* Add tutorial fp16_gemm_1

* refine

* refine

* refine

* revert changes in fp16_gemm_0.py
2025-11-06 22:40:09 -05:00
Ali Hassani
d1ef0e87f2 DistGEMM bug fixes (#2713)
* Blackwell DistGEMM bug fixes

1. If using preferred cluster, there needs to be a branch so that
   the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
   and DistGEMM was previously using the per-device shape to evaluate
   workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in
   Hopper example as well)
4. Preferred and fallback cluster args need to be set explicitly,
   otherwise if someone modifies the example to use preferred cluster,
   it will just fail.

* Fix example runtimes

* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00
ANIKET SHIVAM
020c700e97 support for K=0 for sm100 GG (#2746) 2025-11-04 11:25:39 -05:00
Haicheng Wu
8afb19d904 update CITATION.cff 2025-10-28 23:42:37 -04:00
Qi Yuhang
b2ca083d2b Fixed compilation error when using StreamK scheduler + PDL. (#2686) 2025-10-21 23:11:14 -04:00
Junkai-Wu
b1d6e2c9b3 v4.3 update. (#2709)
* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Lain
e6e2cc29f5 fix (#2684) 2025-10-15 14:46:38 -04:00
Haicheng Wu
c6aeb9179c Update pyproject.toml
update version to 4.2.1
2025-09-24 01:18:51 -04:00
Haicheng Wu
95a5ff14c0 Update CHANGELOG.md
format change
2025-09-23 17:33:00 -04:00
ANIKET SHIVAM
fb8b43ef05 Merge pull request #2669 from NVIDIA/421_update
4.2.1 update
2025-09-23 14:02:29 -07:00
Haicheng Wu
f874df19ac 4.2.1 update 2025-09-23 13:45:13 -07:00
Junkai-Wu
7a6d4ee099 v4.2.1 update. (#2666) 2025-09-23 13:25:43 -04:00
GTO
2b8dff1f90 Fix bfloat16 epsilon (#2607)
* Fix bfloat16 epsilon

* just use constants

---------

Co-authored-by: Konstantin <konstantin@MacBook-Air.local>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-09-21 23:43:59 -04:00
103yiran
fd0312ddf6 Remove duplicate function calls (#1584) 2025-09-21 23:16:59 -04:00
Aya Z. Ibrahim
64579189ec Feature/add bottom causal mask (#2480)
* Rebase to latest

* update

* upd

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Update fmha_fusion.hpp

* Update fmha_fusion.hpp

fixed flipped logic for isQBegin

* Update fmha_fusion.hpp

* Avoid use of booleans

The current expression is confusing

* fmt

* Update fmha_fusion.hpp

Reproduce error/fix with: 
./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend

* add test, format

---------

Co-authored-by: Richard Cai <ricai@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-18 17:11:23 -04:00
Jack Kosaian
b234a8c024 Rename python/cutlass to python/cutlass_cppgen (#2652) 2025-09-18 14:26:57 -04:00
Junkai-Wu
74825181f2 Remove old-version dsl examples. (#2644) 2025-09-17 22:23:30 -04:00
Junkai-Wu
8825e8be4f Add required changes for github pipeline. (#2648) 2025-09-17 22:22:45 -04:00
wbn
7817e47154 Fxied a typo in pipeline descript docs. (#2623) 2025-09-15 22:32:27 -04:00
Asuka
25ccb875b8 Fix: a calculation error in the example of dividing out in the 02_layout_algebra doc (#2635) 2025-09-15 22:31:33 -04:00
Wanshe
29c1ad704a Fix doc cute 03_tensor.md link typo (#2627)
* Update 03_tensor.md fix link typo

change path to relative path

* Update 03_tensor.md

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-15 22:26:43 -04:00
Haicheng Wu
57e3cfb47a doc change for 4.2 (#2639)
* doc change

* fix broken links

* ragged gemm doc update

* move around texts about moe gemm
2025-09-15 22:02:45 -04:00
Haicheng Wu
e7e0adddac Update version.h
change version number to 4.2
2025-09-15 12:40:58 -04:00
Junkai-Wu
6a35b4d22f v4.2 tag release. (#2638) 2025-09-15 12:21:53 -04:00
Richard Cai
56f0718a97 ex77 backwards GQA (#2556)
* bwd GQA init

* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu

* ref kernel type conversion fix

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-09 12:53:28 -04:00
Lifu Huang
76c96b0be3 Fix incorrect shapes in copy_atom doc comments. (#2575) 2025-09-04 16:57:24 -07:00
ao jia
d98e7bf7ce Fix comment in mma_atom.hpp (#2579) 2025-09-04 16:56:39 -07:00
Lifu Huang
b6ccf34aef Fix Copy_Atom type mismatch in sgemm_sm80.cu (#2582) 2025-09-04 16:56:17 -07:00
Andrei Alexandrescu
2288c0c901 Fix bugs in matrix.h (#2598) 2025-09-04 16:55:11 -07:00
Harrison Barclay
b2dd65dc86 more robust imports in heuristics.py and heuristics_provider.py (#2596) 2025-08-28 22:32:55 -04:00
Javier
496654bf2c Fix sm100 gemm wrong static constexpr that breaks compilation on Windows (#2167)
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows

* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows

* More Windows fixes

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>

* Revert "More Windows fixes"

This reverts commit 2e8cfc1382.

---------

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
2025-08-28 22:13:00 -04:00
Linfeng Zheng
9ca7e877b2 fix gqa issue for blackwell fmha.py (#2599) 2025-08-28 11:15:20 -04:00
Junkai-Wu
a49a78ffef v4.2 release. (#2587)
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.

* v4.2 release.
2025-08-22 18:11:24 -04:00
qqwqqw689
11cad1f67b fix a typo. (#2561) 2025-08-19 22:23:09 -04:00
zkyue
931359cec1 Fix typo in functional.h (#2571) 2025-08-19 22:22:31 -04:00
Inoday Yadav
42e7c546c4 Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) (#2562) 2025-08-19 22:22:02 -04:00
melonedo
ec18e8043b Make swizzle in pycute work (#2553) 2025-08-19 22:21:00 -04:00
Srinath Kailasa
5b76420d6a [DOC] Add more exposition to composition example (#2536)
* Add more exposition to composition example

* Apply suggestions from code review

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-08-11 22:20:36 -04:00
Horace He
19772cd63e Fix typo in smem_allocator.py (#2517) 2025-08-10 22:44:22 -04:00
zkyue
052afcd314 fix typo (#2529) 2025-08-10 22:44:02 -04:00
Srinath Kailasa
86cf63e2d4 NIT: Grammar (#2537) 2025-08-10 22:42:45 -04:00
Tarun Paparaju
a267d47f9b Update batched_gemm.cu (#2538) 2025-08-10 22:42:21 -04:00
starwang1024
9e6ab77d27 Fix a copy error in the SM70 main loop when loading data from smem to rmem (#2540) 2025-08-10 22:42:01 -04:00
Robert Maynard
d0eada85a3 Support both CUDA 12 and 13 cccl header locations (#2543) 2025-08-10 22:41:25 -04:00
Lifu Huang
23139309e9 Fix incorrect K dim in CuTe MMA Atom doc. (#2544) 2025-08-10 22:40:56 -04:00
Wenxin Cheng
6dd13d4278 Facebook:This commit makes its files safe for use with -Wimplicit-fallthrough. (#2324) 2025-07-31 20:55:19 -04:00
Srinath Kailasa
3b054767b3 Fix typo (#2514) 2025-07-30 22:14:54 -04:00
Ali Hassani
6fb5e667c1 [Doc fix] incorrect compute cap. for Blackwell RTX (#2511)
Blackwell RTX is compute capability 12.0 (SM120) but incorrectly listed
as SM100 in the README.
2025-07-30 22:14:13 -04:00
Wenbo Yang
6c891db9f6 Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. (#2333) 2025-07-30 22:12:53 -04:00