cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-11 17:00:05 +00:00

Author	SHA1	Message	Date
HydraQYH	95f8beb44c	Revert "Remove unnecessary #ifdef #endif for general gemm." This reverts commit `17ffd56dfe`.	2025-12-09 11:52:23 +08:00
HydraQYH	17ffd56dfe	Remove unnecessary #ifdef #endif for general gemm.	2025-12-06 10:20:35 +08:00
HydraQYH	ff7f2dcdfb	Remove duplicated cutlass::arch::wait_on_dependent_grids();	2025-12-06 10:20:35 +08:00
HydraQYH	929e1e0259	Remove unnecessary #ifdef / #endif for launch_dependent_grids.	2025-12-06 10:20:35 +08:00
HydraQYH	b6ad6db219	Delete unnecessary #ifdef / #endif.	2025-12-06 10:20:35 +08:00
HydraQYH	e1b2ec57e3	Hoist waits above the warp specialized region.	2025-12-06 10:20:35 +08:00
HydraQYH	1e5f95cbbe	Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative	2025-12-06 10:20:35 +08:00
HydraQYH	acf5990cc2	Refine position for wait_on_dependent_grids.	2025-12-06 10:20:35 +08:00
HydraQYH	91de7891a5	Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp	2025-12-06 10:20:35 +08:00
Junkai-Wu	bc680c7f67	v4.3.2 update. (#2839 )	2025-12-04 10:14:32 -05:00
Haicheng Wu	f11375bf91	Bump CUTLASS patch version to 1	2025-12-01 22:08:52 -05:00
Shreya Gaur	af8d5dfa54	bug fix for example 92 (#2830 ) Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com> Co-authored-by: Shreya Gaur <shgaur@2u2g-spr-0015.ipp4a1.colossus.nvidia.com>	2025-12-01 22:02:59 -05:00
Junkai-Wu	1de3a576cc	v4.3.1 update. (#2817 )	2025-11-27 09:49:30 -05:00
Shreya Gaur	2052fd3885	Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790 ) * Adding blockscaled ragged contiguous grouped gemm for MoEs * cleaning up the example * introduction to example improved --------- Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>	2025-11-26 20:16:49 -05:00
Junkai-Wu	8cd5bef43a	v4.3 tag release update. (#2789 )	2025-11-20 20:49:44 -05:00
Ali Hassani	d1ef0e87f2	DistGEMM bug fixes (#2713 ) * Blackwell DistGEMM bug fixes 1. If using preferred cluster, there needs to be a branch so that the universal GEMM wrapper finds the correct base params. 2. Workspace sizes can change depending on problem shape in Blackwell, and DistGEMM was previously using the per-device shape to evaluate workspace size instead of the per-gemm shape. 3. Flattened size used to initialize host tensors can overflow (in Hopper example as well) 4. Preferred and fallback cluster args need to be set explicitly, otherwise if someone modifies the example to use preferred cluster, it will just fail. * Fix example runtimes * Set default fallback cluster shapes to the static ones	2025-11-06 13:31:24 -05:00
ANIKET SHIVAM	020c700e97	support for K=0 for sm100 GG (#2746 )	2025-11-04 11:25:39 -05:00
Qi Yuhang	b2ca083d2b	Fixed compilation error when using StreamK scheduler + PDL. (#2686 )	2025-10-21 23:11:14 -04:00
Junkai-Wu	b1d6e2c9b3	v4.3 update. (#2709 ) * v4.3 update. * Update the cute_dsl_api changelog's doc link * Update version to 4.3.0 * Update the example link * Update doc to encourage user to install DSL from requirements.txt --------- Co-authored-by: Larry Wu <larwu@nvidia.com>	2025-10-21 14:26:30 -04:00
Lain	e6e2cc29f5	fix (#2684 )	2025-10-15 14:46:38 -04:00
Haicheng Wu	f874df19ac	4.2.1 update	2025-09-23 13:45:13 -07:00
Junkai-Wu	7a6d4ee099	v4.2.1 update. (#2666 )	2025-09-23 13:25:43 -04:00
GTO	2b8dff1f90	Fix bfloat16 epsilon (#2607 ) * Fix bfloat16 epsilon * just use constants --------- Co-authored-by: Konstantin <konstantin@MacBook-Air.local> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-09-21 23:43:59 -04:00
103yiran	fd0312ddf6	Remove duplicate function calls (#1584 )	2025-09-21 23:16:59 -04:00
Haicheng Wu	57e3cfb47a	doc change for 4.2 (#2639 ) * doc change * fix broken links * ragged gemm doc update * move around texts about moe gemm	2025-09-15 22:02:45 -04:00
Haicheng Wu	e7e0adddac	Update version.h change version number to 4.2	2025-09-15 12:40:58 -04:00
Junkai-Wu	6a35b4d22f	v4.2 tag release. (#2638 )	2025-09-15 12:21:53 -04:00
Lifu Huang	76c96b0be3	Fix incorrect shapes in copy_atom doc comments. (#2575 )	2025-09-04 16:57:24 -07:00
ao jia	d98e7bf7ce	Fix comment in mma_atom.hpp (#2579 )	2025-09-04 16:56:39 -07:00
Andrei Alexandrescu	2288c0c901	Fix bugs in matrix.h (#2598 )	2025-09-04 16:55:11 -07:00
Javier	496654bf2c	Fix sm100 gemm wrong static constexpr that breaks compilation on Windows (#2167 ) * Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows * Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows * More Windows fixes Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com> * Revert "More Windows fixes" This reverts commit `2e8cfc1382`. --------- Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>	2025-08-28 22:13:00 -04:00
Junkai-Wu	a49a78ffef	v4.2 release. (#2587 ) * Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line. * v4.2 release.	2025-08-22 18:11:24 -04:00
zkyue	931359cec1	Fix typo in functional.h (#2571 )	2025-08-19 22:22:31 -04:00
Inoday Yadav	42e7c546c4	Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) (#2562 )	2025-08-19 22:22:02 -04:00
zkyue	052afcd314	fix typo (#2529 )	2025-08-10 22:44:02 -04:00
starwang1024	9e6ab77d27	Fix a copy error in the SM70 main loop when loading data from smem to rmem (#2540 )	2025-08-10 22:42:01 -04:00
Wenxin Cheng	6dd13d4278	Facebook:This commit makes its files safe for use with -Wimplicit-fallthrough. (#2324 )	2025-07-31 20:55:19 -04:00
Wenbo Yang	6c891db9f6	Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. (#2333 )	2025-07-30 22:12:53 -04:00
kf-zhang	26b7450023	support fp16 accmulator for sm89 fp8 mma (#2378 ) * add support for sm89 in cute and the unit tests * support fp16 accmulator for sm89 fp8 mma * format code	2025-07-30 22:12:08 -04:00
Aditya Kane	f09045d660	Corrected minor nit in mma_traits.hpp (#2447 ) * Corrected minor nit in mma_traits.hpp The entry and descriptions were jumbled up. * Update mma_traits.hpp * Update mma_traits.hpp	2025-07-30 22:11:23 -04:00
Haicheng Wu	664c4f7b3e	Update CUTLASS version to 4.1 Update CUTLASS version to 4.1.	2025-07-26 20:11:04 -04:00
Junkai-Wu	fd6cfe1ed0	v4.1 release update v2. (#2481 )	2025-07-21 22:03:55 -04:00
Junkai-Wu	a1aaf2300a	v4.1 release	2025-07-03 08:07:53 -04:00
Junkai-Wu	8bdbfca682	v4.0 update. (#2371 )	2025-06-06 02:39:20 -04:00
Kihiro Bando	f115c3f854	Release v4.0.0 (#2294 )	2025-05-13 15:55:29 -04:00
Haicheng Wu	ad7b2f5e84	3.9.2 doc/version (#2279 ) * 3.9.2 doc/version * whitespace	2025-05-04 00:00:15 -04:00
Jiazhen Han	89f6bf2739	Fix group scale gemm when K==128 (#2275 ) Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com>	2025-05-02 15:41:18 -04:00
Haicheng Wu	f535c33634	3.9.1 doc/version change (#2273 )	2025-05-01 00:27:00 -04:00
Qi Yuhang	e5b810bed1	Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. (#2256 ) Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com>	2025-04-30 15:28:05 -04:00
Lain	2b78c2fe31	cherry-pick feature/hopper-blockwise-generalization-optimization (#2270 )	2025-04-29 16:47:22 -04:00

349 Commits