Commit Graph

344 Commits

Author SHA1 Message Date
HydraQYH
e1b2ec57e3 Hoist waits above the warp specialized region. 2025-12-06 10:20:35 +08:00
HydraQYH
1e5f95cbbe Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative 2025-12-06 10:20:35 +08:00
HydraQYH
acf5990cc2 Refine position for wait_on_dependent_grids. 2025-12-06 10:20:35 +08:00
HydraQYH
91de7891a5 Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp 2025-12-06 10:20:35 +08:00
Junkai-Wu
bc680c7f67 v4.3.2 update. (#2839) 2025-12-04 10:14:32 -05:00
Haicheng Wu
f11375bf91 Bump CUTLASS patch version to 1 2025-12-01 22:08:52 -05:00
Shreya Gaur
af8d5dfa54 bug fix for example 92 (#2830)
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
Co-authored-by: Shreya Gaur <shgaur@2u2g-spr-0015.ipp4a1.colossus.nvidia.com>
2025-12-01 22:02:59 -05:00
Junkai-Wu
1de3a576cc v4.3.1 update. (#2817) 2025-11-27 09:49:30 -05:00
Shreya Gaur
2052fd3885 Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790)
* Adding blockscaled ragged contiguous grouped gemm for MoEs

* cleaning up the example

* introduction to example improved

---------

Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-26 20:16:49 -05:00
Junkai-Wu
8cd5bef43a v4.3 tag release update. (#2789) 2025-11-20 20:49:44 -05:00
Ali Hassani
d1ef0e87f2 DistGEMM bug fixes (#2713)
* Blackwell DistGEMM bug fixes

1. If using preferred cluster, there needs to be a branch so that
   the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
   and DistGEMM was previously using the per-device shape to evaluate
   workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in
   Hopper example as well).
4. Preferred and fallback cluster args need to be set explicitly,
   otherwise if someone modifies the example to use preferred cluster,
   it will just fail.

* Fix example runtimes

* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00
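Item 3 of the commit message above (flattened host tensor size overflowing) can be illustrated with a minimal sketch. The helper names `flattened_size` and `overflows_int32` are hypothetical and are not taken from the DistGEMM examples. The sketch only shows why the element count should be computed in 64-bit rather than 32-bit arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of CUTLASS): element count of an M x N x L
// host tensor. The first operand is widened to 64-bit before multiplying, so
// the product does not wrap for realistic GEMM extents.
constexpr std::int64_t flattened_size(int m, int n, int l) {
  return static_cast<std::int64_t>(m) * n * l;
}

// True when the element count no longer fits in a signed 32-bit integer,
// i.e. when a plain `int` product would overflow.
constexpr bool overflows_int32(int m, int n, int l) {
  return flattened_size(m, n, l) > std::numeric_limits<std::int32_t>::max();
}

// 16384 * 16384 * 8 = 2^31, one past the largest 32-bit signed value.
static_assert(flattened_size(16384, 16384, 8) == 2147483648LL, "2^31 elements");
static_assert(overflows_int32(16384, 16384, 8), "needs 64-bit arithmetic");
static_assert(!overflows_int32(4096, 4096, 8), "2^27 elements fit in int");
```

The extents used here are illustrative values chosen to land exactly on 2^31; they are not the problem shapes from the examples.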
ANIKET SHIVAM
020c700e97 support for K=0 for sm100 GG (#2746) 2025-11-04 11:25:39 -05:00
Qi Yuhang
b2ca083d2b Fixed compilation error when using StreamK scheduler + PDL. (#2686) 2025-10-21 23:11:14 -04:00
Junkai-Wu
b1d6e2c9b3 v4.3 update. (#2709)
* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Lain
e6e2cc29f5 fix (#2684) 2025-10-15 14:46:38 -04:00
Haicheng Wu
f874df19ac 4.2.1 update 2025-09-23 13:45:13 -07:00
Junkai-Wu
7a6d4ee099 v4.2.1 update. (#2666) 2025-09-23 13:25:43 -04:00
GTO
2b8dff1f90 Fix bfloat16 epsilon (#2607)
* Fix bfloat16 epsilon

* just use constants

---------

Co-authored-by: Konstantin <konstantin@MacBook-Air.local>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-09-21 23:43:59 -04:00
103yiran
fd0312ddf6 Remove duplicate function calls (#1584) 2025-09-21 23:16:59 -04:00
Haicheng Wu
57e3cfb47a doc change for 4.2 (#2639)
* doc change

* fix broken links

* ragged gemm doc update

* move around texts about moe gemm
2025-09-15 22:02:45 -04:00
Haicheng Wu
e7e0adddac Update version.h
change version number to 4.2
2025-09-15 12:40:58 -04:00
Junkai-Wu
6a35b4d22f v4.2 tag release. (#2638) 2025-09-15 12:21:53 -04:00
Lifu Huang
76c96b0be3 Fix incorrect shapes in copy_atom doc comments. (#2575) 2025-09-04 16:57:24 -07:00
ao jia
d98e7bf7ce Fix comment in mma_atom.hpp (#2579) 2025-09-04 16:56:39 -07:00
Andrei Alexandrescu
2288c0c901 Fix bugs in matrix.h (#2598) 2025-09-04 16:55:11 -07:00
Javier
496654bf2c Fix sm100 gemm wrong static constexpr that breaks compilation on Windows (#2167)
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows

* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows

* More Windows fixes

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>

* Revert "More Windows fixes"

This reverts commit 2e8cfc1382.

---------

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
2025-08-28 22:13:00 -04:00
Junkai-Wu
a49a78ffef v4.2 release. (#2587)
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.

* v4.2 release.
2025-08-22 18:11:24 -04:00
zkyue
931359cec1 Fix typo in functional.h (#2571) 2025-08-19 22:22:31 -04:00
Inoday Yadav
42e7c546c4 Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) (#2562) 2025-08-19 22:22:02 -04:00
zkyue
052afcd314 fix typo (#2529) 2025-08-10 22:44:02 -04:00
starwang1024
9e6ab77d27 Fix a copy error in the SM70 main loop when loading data from smem to rmem (#2540) 2025-08-10 22:42:01 -04:00
Wenxin Cheng
6dd13d4278 Facebook: This commit makes its files safe for use with -Wimplicit-fallthrough. (#2324) 2025-07-31 20:55:19 -04:00
Wenbo Yang
6c891db9f6 Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. (#2333) 2025-07-30 22:12:53 -04:00
kf-zhang
26b7450023 support fp16 accumulator for sm89 fp8 mma (#2378)
* add support for sm89 in cute and the unit tests

* support fp16 accumulator for sm89 fp8 mma

* format code
2025-07-30 22:12:08 -04:00
Aditya Kane
f09045d660 Corrected minor nit in mma_traits.hpp (#2447)
* Corrected minor nit in mma_traits.hpp

The entry and descriptions were jumbled up.

* Update mma_traits.hpp

* Update mma_traits.hpp
2025-07-30 22:11:23 -04:00
Haicheng Wu
664c4f7b3e Update CUTLASS version to 4.1
Update CUTLASS version to 4.1.
2025-07-26 20:11:04 -04:00
Junkai-Wu
fd6cfe1ed0 v4.1 release update v2. (#2481) 2025-07-21 22:03:55 -04:00
Junkai-Wu
a1aaf2300a v4.1 release 2025-07-03 08:07:53 -04:00
Junkai-Wu
8bdbfca682 v4.0 update. (#2371) 2025-06-06 02:39:20 -04:00
Kihiro Bando
f115c3f854 Release v4.0.0 (#2294) 2025-05-13 15:55:29 -04:00
Haicheng Wu
ad7b2f5e84 3.9.2 doc/version (#2279)
* 3.9.2 doc/version

* whitespace
2025-05-04 00:00:15 -04:00
Jiazhen Han
89f6bf2739 Fix group scale gemm when K==128 (#2275)
Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com>
2025-05-02 15:41:18 -04:00
Haicheng Wu
f535c33634 3.9.1 doc/version change (#2273) 2025-05-01 00:27:00 -04:00
Qi Yuhang
e5b810bed1 Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. (#2256)
Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com>
2025-04-30 15:28:05 -04:00
Lain
2b78c2fe31 cherry-pick feature/hopper-blockwise-generalization-optimization (#2270) 2025-04-29 16:47:22 -04:00
Haicheng Wu
697126019e fix blackwell grouped groupwise hang (#2267) 2025-04-29 11:54:20 -04:00
Yujia Zhai
331a1f5b3f cutlass 3.9 update (#2255)
* cutlass 3.9 update

* rebase

* fixes out of shared memory for blockwise Blackwell

* doc format

* fix issue 2253

* disable host ref by default

* fix sm120 smem capacity

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-24 15:42:40 -04:00
吴坎
8e345c5c5b fix_missing_stdint (#2199)
* Update config.hpp

* Update config.hpp

* Update config.hpp
2025-04-23 22:21:22 -04:00
Tri Dao
81a43e6d92 Set EpiTile correctly when TileN is not divisible by 32 (#2220)
If TileN is not divisible by 32 (e.g., 208), by default EpiTile would be set
to 128 x 32, which does not compile because EpiTileN is required to divide TileN
2025-04-21 00:02:51 -04:00
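The divisibility constraint described in the commit message above can be sketched as follows. The selection rule in `pick_epi_tile_n` (prefer 32, fall back to 16) is an assumption made for illustration and is not taken from the CUTLASS source. It only covers tile widths that are multiples of 16.

```cpp
#include <cassert>

// Hypothetical rule, not copied from CUTLASS: prefer an epilogue tile width
// of 32 and fall back to 16 when 32 does not divide TileN.
constexpr int pick_epi_tile_n(int tile_n) {
  return (tile_n % 32 == 0) ? 32 : 16;
}

// The constraint from the commit message: EpiTileN must divide TileN.
constexpr bool divides(int epi_tile_n, int tile_n) {
  return tile_n % epi_tile_n == 0;
}

// 208 = 6 * 32 + 16, so an epilogue tile width of 32 is rejected.
static_assert(!divides(32, 208), "32 does not divide 208");
// 208 = 13 * 16, so the fallback width satisfies the constraint.
static_assert(pick_epi_tile_n(208) == 16, "fallback width for TileN = 208");
static_assert(divides(pick_epi_tile_n(208), 208), "16 divides 208");
// Widths that are multiples of 32 keep the default.
static_assert(pick_epi_tile_n(256) == 32, "default width for TileN = 256");
```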
Tri Dao
ade6376fa0 [SM90] Change register allocation for TileN=208 to avoid spills (#2219)
With the usual register allocation (producer 40, consumer 232), compiling Gemm
with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) shows lots of
register spilling (e.g., ~3000 bytes of spills). For this case we can change
the register allocation to producer 24, consumer 240, which avoids spills.
2025-04-21 00:02:30 -04:00
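The arithmetic behind the two register splits in the commit message above can be checked with a short sketch. It assumes one producer warp group and two consumer warp groups per CTA, 128 threads per warp group, and 65536 32-bit registers per SM. None of these figures are stated in the commit message, and `total_registers` is a hypothetical helper.

```cpp
#include <cassert>

// Assumptions (not stated in the commit message): 128 threads per warp group
// and a register file of 65536 32-bit registers per SM.
constexpr int kThreadsPerWarpGroup = 128;
constexpr int kRegistersPerSM = 65536;

// Registers used by one CTA, assuming one producer warp group and
// `consumer_groups` consumer warp groups (two by default).
constexpr int total_registers(int producer_regs, int consumer_regs,
                              int consumer_groups = 2) {
  return (producer_regs + consumer_groups * consumer_regs) * kThreadsPerWarpGroup;
}

// Both splits sum to 504 registers per thread triple (1 producer + 2 consumers),
// so the total budget is unchanged: 504 * 128 = 64512.
static_assert(total_registers(40, 232) == 64512, "usual split");
static_assert(total_registers(24, 240) == 64512, "split for TileN = 208");
static_assert(total_registers(24, 240) <= kRegistersPerSM, "fits the register file");
```

Under these assumptions the change moves 16 registers per thread away from the producer and gives each of the two consumer warp groups 8 more per thread, without changing the total.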