HydraQYH
95f8beb44c
Revert "Remove unnecessary #ifdef #endif for general gemm."
...
This reverts commit 17ffd56dfe.
2025-12-09 11:52:23 +08:00
HydraQYH
17ffd56dfe
Remove unnecessary #ifdef #endif for general gemm.
2025-12-06 10:20:35 +08:00
HydraQYH
ff7f2dcdfb
Remove duplicated cutlass::arch::wait_on_dependent_grids();
2025-12-06 10:20:35 +08:00
HydraQYH
929e1e0259
Remove unnecessary #ifdef / #endif for launch_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
b6ad6db219
Delete unnecessary #ifdef / #endif.
2025-12-06 10:20:35 +08:00
HydraQYH
e1b2ec57e3
Hoist waits above the warp specialized region.
2025-12-06 10:20:35 +08:00
HydraQYH
1e5f95cbbe
Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative
2025-12-06 10:20:35 +08:00
HydraQYH
acf5990cc2
Refine position for wait_on_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
91de7891a5
Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp
2025-12-06 10:20:35 +08:00
Junkai-Wu
bc680c7f67
v4.3.2 update. ( #2839 )
2025-12-04 10:14:32 -05:00
Haicheng Wu
f11375bf91
Bump CUTLASS patch version to 1
2025-12-01 22:08:52 -05:00
Shreya Gaur
af8d5dfa54
bug fix for example 92 ( #2830 )
...
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
Co-authored-by: Shreya Gaur <shgaur@2u2g-spr-0015.ipp4a1.colossus.nvidia.com>
2025-12-01 22:02:59 -05:00
Junkai-Wu
1de3a576cc
v4.3.1 update. ( #2817 )
2025-11-27 09:49:30 -05:00
Shreya Gaur
2052fd3885
Blockscaled Ragged Contiguous Grouped Gemm for MoEs ( #2790 )
...
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-26 20:16:49 -05:00
Junkai-Wu
8cd5bef43a
v4.3 tag release update. ( #2789 )
2025-11-20 20:49:44 -05:00
Ali Hassani
d1ef0e87f2
DistGEMM bug fixes ( #2713 )
...
* Blackwell DistGEMM bug fixes
1. If using preferred cluster, there needs to be a branch so that the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell, and DistGEMM was previously using the per-device shape to evaluate workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in Hopper example as well).
4. Preferred and fallback cluster args need to be set explicitly, otherwise if someone modifies the example to use preferred cluster, it will just fail.
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00
ANIKET SHIVAM
020c700e97
support for K=0 for sm100 GG ( #2746 )
2025-11-04 11:25:39 -05:00
Qi Yuhang
b2ca083d2b
Fixed compilation error when using StreamK scheduler + PDL. ( #2686 )
2025-10-21 23:11:14 -04:00
Junkai-Wu
b1d6e2c9b3
v4.3 update. ( #2709 )
...
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Lain
e6e2cc29f5
fix ( #2684 )
2025-10-15 14:46:38 -04:00
Haicheng Wu
f874df19ac
4.2.1 update
2025-09-23 13:45:13 -07:00
Junkai-Wu
7a6d4ee099
v4.2.1 update. ( #2666 )
2025-09-23 13:25:43 -04:00
GTO
2b8dff1f90
Fix bfloat16 epsilon ( #2607 )
...
* Fix bfloat16 epsilon
* just use constants
---------
Co-authored-by: Konstantin <konstantin@MacBook-Air.local>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-09-21 23:43:59 -04:00
103yiran
fd0312ddf6
Remove duplicate function calls ( #1584 )
2025-09-21 23:16:59 -04:00
Haicheng Wu
57e3cfb47a
doc change for 4.2 ( #2639 )
...
* doc change
* fix broken links
* ragged gemm doc update
* move around texts about moe gemm
2025-09-15 22:02:45 -04:00
Haicheng Wu
e7e0adddac
Update version.h
...
change version number to 4.2
2025-09-15 12:40:58 -04:00
Junkai-Wu
6a35b4d22f
v4.2 tag release. ( #2638 )
2025-09-15 12:21:53 -04:00
Lifu Huang
76c96b0be3
Fix incorrect shapes in copy_atom doc comments. ( #2575 )
2025-09-04 16:57:24 -07:00
ao jia
d98e7bf7ce
Fix comment in mma_atom.hpp ( #2579 )
2025-09-04 16:56:39 -07:00
Andrei Alexandrescu
2288c0c901
Fix bugs in matrix.h ( #2598 )
2025-09-04 16:55:11 -07:00
Javier
496654bf2c
Fix sm100 gemm wrong static constexpr that breaks compilation on Windows ( #2167 )
...
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows
* More Windows fixes
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
* Revert "More Windows fixes"
This reverts commit 2e8cfc1382.
---------
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
2025-08-28 22:13:00 -04:00
Junkai-Wu
a49a78ffef
v4.2 release. ( #2587 )
...
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set on the command line.
* v4.2 release.
2025-08-22 18:11:24 -04:00
zkyue
931359cec1
Fix typo in functional.h ( #2571 )
2025-08-19 22:22:31 -04:00
Inoday Yadav
42e7c546c4
Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) ( #2562 )
2025-08-19 22:22:02 -04:00
zkyue
052afcd314
fix typo ( #2529 )
2025-08-10 22:44:02 -04:00
starwang1024
9e6ab77d27
Fix a copy error in the SM70 main loop when loading data from smem to rmem ( #2540 )
2025-08-10 22:42:01 -04:00
Wenxin Cheng
6dd13d4278
Facebook: This commit makes its files safe for use with -Wimplicit-fallthrough. ( #2324 )
2025-07-31 20:55:19 -04:00
Wenbo Yang
6c891db9f6
Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. ( #2333 )
2025-07-30 22:12:53 -04:00
kf-zhang
26b7450023
support fp16 accumulator for sm89 fp8 mma ( #2378 )
...
* add support for sm89 in cute and the unit tests
* support fp16 accumulator for sm89 fp8 mma
* format code
2025-07-30 22:12:08 -04:00
Aditya Kane
f09045d660
Corrected minor nit in mma_traits.hpp ( #2447 )
...
* Corrected minor nit in mma_traits.hpp
The entry and descriptions were jumbled up.
* Update mma_traits.hpp
* Update mma_traits.hpp
2025-07-30 22:11:23 -04:00
Haicheng Wu
664c4f7b3e
Update CUTLASS version to 4.1
...
Update CUTLASS version to 4.1.
2025-07-26 20:11:04 -04:00
Junkai-Wu
fd6cfe1ed0
v4.1 release update v2. ( #2481 )
2025-07-21 22:03:55 -04:00
Junkai-Wu
a1aaf2300a
v4.1 release
2025-07-03 08:07:53 -04:00
Junkai-Wu
8bdbfca682
v4.0 update. ( #2371 )
2025-06-06 02:39:20 -04:00
Kihiro Bando
f115c3f854
Release v4.0.0 ( #2294 )
2025-05-13 15:55:29 -04:00
Haicheng Wu
ad7b2f5e84
3.9.2 doc/version ( #2279 )
...
* 3.9.2 doc/version
* whitespace
2025-05-04 00:00:15 -04:00
Jiazhen Han
89f6bf2739
Fix group scale gemm when K==128 ( #2275 )
...
Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com>
2025-05-02 15:41:18 -04:00
Haicheng Wu
f535c33634
3.9.1 doc/version change ( #2273 )
2025-05-01 00:27:00 -04:00
Qi Yuhang
e5b810bed1
Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. ( #2256 )
...
Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com>
2025-04-30 15:28:05 -04:00
Lain
2b78c2fe31
cherry-pick feature/hopper-blockwise-generalization-optimization ( #2270 )
2025-04-29 16:47:22 -04:00