cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-12 09:15:56 +00:00

Author	SHA1	Message	Date
Qi Yuhang	ebf3165efb	[Bug Fix]Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler (#2865 ) * Delete unused #ifdef/#endif. Bypass sm120 case. * Add todo. * Fix pingpong. * Revert "Add todo." This reverts commit `246cb42091`. * Refine name. Refine name again. * Apply suggestions from code review Skip `is_last_tile` for all sm120 kernels. Co-authored-by: Junkai-Wu <junkaiw@nvidia.com> * Skip early stop for sm120 kernel. * Fix typo. --------- Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>	2025-12-18 08:51:38 +08:00
Haicheng Wu	d4e16f5d4e	Bump version from 4.2.1 to 4.3.3	2025-12-11 23:58:38 -05:00
Junkai-Wu	d3a5492381	v4.3.3 update. (#2868 )	2025-12-11 00:26:58 -05:00
Amin Sedaghat	49bd6bf1ba	fix print_layout printf format in device code (#2688 ) * fix print_layout printf format in device code * Replace %.s format specifier with explicit loop Remove unused delim variable The printf format %.s with dynamic width does not work correctly in CUDA device code, causing literal %.s to appear in output. Fixes #2496 * Update include/cute/util/print_tensor.hpp Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com> * Update include/cute/util/print_tensor.hpp Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com> --------- Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>	2025-12-10 08:57:56 +08:00
Junkai-Wu	c6dfdf8375	Merge pull request #2719 from HydraQYH/dev_support_pdl_for_sm90_gemm_array_tma_ws Support PDL for SM90 Array TMA GEMM	2025-12-09 18:32:15 +08:00
HydraQYH	95f8beb44c	Revert "Remove unnecessary #ifdef #endif for general gemm." This reverts commit `17ffd56dfe`.	2025-12-09 11:52:23 +08:00
HydraQYH	17ffd56dfe	Remove unnecessary #ifdef #endif for general gemm.	2025-12-06 10:20:35 +08:00
HydraQYH	ff7f2dcdfb	Remove duplicated cutlass::arch::wait_on_dependent_grids();	2025-12-06 10:20:35 +08:00
HydraQYH	929e1e0259	Remove unnecessary #ifdef / #endif for launch_dependent_grids.	2025-12-06 10:20:35 +08:00
HydraQYH	b6ad6db219	Delete unnecessary #ifdef / #endif.	2025-12-06 10:20:35 +08:00
HydraQYH	e1b2ec57e3	Hoist waits above the warp specialized region.	2025-12-06 10:20:35 +08:00
HydraQYH	1e5f95cbbe	Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative	2025-12-06 10:20:35 +08:00
HydraQYH	acf5990cc2	Refine position for wait_on_dependent_grids.	2025-12-06 10:20:35 +08:00
HydraQYH	91de7891a5	Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp	2025-12-06 10:20:35 +08:00
Haicheng Wu	1e7d3e030b	Update CHANGELOG for CuTe DSL enhancements Added new environment variable for CuTe DSL cache directory.	2025-12-05 13:49:57 -05:00
Haicheng Wu	c4744f706e	Bump version from 4.2.1 to 4.3.2	2025-12-05 13:45:16 -05:00
Linfeng Zheng	f6402fcd5e	add pytest support for tutorial gemm (#2826 ) * add pytest support for tutorial gemm * add license	2025-12-05 08:45:01 -05:00
bangyu shen	7252a2d17e	remove internal comment (#2841 ) Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-04 10:36:21 -05:00
Junkai-Wu	bc680c7f67	v4.3.2 update. (#2839 )	2025-12-04 10:14:32 -05:00
bangyu shen	52ae719eda	[examples][CuTeDSL] init commit for distributed examples (#2806) * init commit for distributed examples * better OOB protection * and try import to nvshmem for better error message and a README.md to introduce nvshmem and multimem instructions * add some lamport explanation * enhance f8 output and warn that f8 output can have nan in it * tell user why we need complicate data conversions in ref check part * tell user we don't support nvshmem device function --------- Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-01 22:25:40 -05:00
Haicheng Wu	5e847d37c4	Bump version from 4.2.1 to 4.3.1	2025-12-01 22:13:19 -05:00
Haicheng Wu	f16068b4db	Bump version from 4.2.0 to 4.3.1	2025-12-01 22:12:20 -05:00
Haicheng Wu	1acfe141af	Bump version from 4.2.1 to 4.3.1	2025-12-01 22:11:13 -05:00
Haicheng Wu	f11375bf91	Bump CUTLASS patch version to 1	2025-12-01 22:08:52 -05:00
Shreya Gaur	af8d5dfa54	bug fix for example 92 (#2830 ) Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com> Co-authored-by: Shreya Gaur <shgaur@2u2g-spr-0015.ipp4a1.colossus.nvidia.com>	2025-12-01 22:02:59 -05:00
drazi	ec8daf642d	Merge pull request #2809 from whatdhack/patch-1 Update notebook title from 'Tour to' to 'Tour of'	2025-11-28 18:07:34 +08:00
drazi	5016493cc0	Merge pull request #2813 from fengxie/ftse/fix/example Refactor TVM FFI examples and update doc	2025-11-28 09:07:15 +08:00
Fung Xie	8588d099e4	refactored doc	2025-11-27 17:04:20 -08:00
Fung Xie	8fc9bc5dda	update doc	2025-11-27 17:03:51 -08:00
Fung Xie	f71892b824	update doc	2025-11-27 17:03:03 -08:00
Fung Xie	03aa211310	update doc	2025-11-27 17:02:59 -08:00
Fung Xie	286781a1fb	add requirements.txt	2025-11-27 17:02:27 -08:00
Fung Xie	2664cac685	enhanced the example for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	b9154d65b3	update examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	afe2f71522	reorganize examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	739fffce27	fix TVM FFI doc and update example	2025-11-27 17:02:26 -08:00
Junkai-Wu	1de3a576cc	v4.3.1 update. (#2817 )	2025-11-27 09:49:30 -05:00
Shreya Gaur	2052fd3885	Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790 ) * Adding blockscaled ragged contiguous grouped gemm for MoEs * cleaning up the example * introduction to example improved --------- Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>	2025-11-26 20:16:49 -05:00
whatdhack	4a55379686	Update notebook title from 'Tour to' to 'Tour of' Grammar check. LLMs can quickly clean up such issues.	2025-11-24 20:11:14 -08:00
Haicheng Wu	e67e63c331	Bump version from 4.2.1 to 4.3.0 v4.3.0	2025-11-24 16:36:06 -05:00
Haicheng Wu	ddaf12c1b1	Bump version from 4.2.0 to 4.3.0	2025-11-24 16:35:27 -05:00
Haicheng Wu	7967ce5f83	Bump version to 4.3.0	2025-11-24 16:34:45 -05:00
Junkai-Wu	8cd5bef43a	v4.3 tag release update. (#2789 )	2025-11-20 20:49:44 -05:00
Linfeng Zheng	406e078b29	add a notebook for tour to sol gemm (#2780 ) * add tour to sol gemm notebook * change some typos * change some typos	2025-11-20 09:41:01 -05:00
Zekun Fan	a2439551c7	Fixed editable install to depend on CuTeDSL/requirements.txt (#2768 ) To guarantee wheel version alignment of the source code.	2025-11-14 15:31:49 -08:00
drazi	bd96096d58	Merge pull request #2758 from limin2021/delete-pdl-example [cute-dsl][fix]remove cute dsl pdl example.	2025-11-10 22:56:57 +08:00
Mindy Li	06b6bd7d7b	remove cute dsl pdl example.	2025-11-09 21:47:00 -08:00
Linfeng Zheng	2252254ce2	Add tutorial fp16_gemm_1 (#2750 ) * Add tutorial fp16_gemm_1 * refine * refine * refine * revert changes in fp16_gemm_0.py	2025-11-06 22:40:09 -05:00
Ali Hassani	d1ef0e87f2	DistGEMM bug fixes (#2713 ) * Blackwell DistGEMM bug fixes 1. If using preferred cluster, there needs to be a branch so that the universal GEMM wrapper finds the correct base params. 2. Workspace sizes can change depending on problem shape in Blackwell, and DistGEMM was previously using the per-device shape to evaluate workspace size instead of the per-gemm shape. 3. Flattened size used to initialize host tensors can overflow (in Hopper example as well) 4. Preferred and fallback cluster args need to be set explicitly, otherwise if someone modifies the example to use preferred cluster, it will just fail. * Fix example runtimes * Set default fallback cluster shapes to the static ones	2025-11-06 13:31:24 -05:00
ANIKET SHIVAM	020c700e97	support for K=0 for sm100 GG (#2746 )	2025-11-04 11:25:39 -05:00

776 Commits