cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-04-19 22:38:56 +00:00

Author	SHA1	Message	Date
yuanxiaolan	de161925a5	pass in stream=-1	2026-02-03 11:59:14 +08:00
yuanxiaolan	de198b2419	fix tvm-ffi path in from_dlpack	2026-02-03 11:59:13 +08:00
Hua Huang	1cfbb53a23	[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995 ) * Fix: SM100 block-scale gemm overlapping accumulator Signed-off-by: Hua Huang <huah@nvidia.com> * Also include threads_per_warp fix Signed-off-by: Hua Huang <huah@nvidia.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com>	2026-02-03 11:01:41 +08:00
dongxiao	a4eb0e05f6	fix performance issues in cute-dsl examples for 4.4-ctk13.1 release (#2988 ) * fix grouped gemm * fix mixed input gemm * fix mixed input grouped gemm * fix version checking * use advanced compiler options * fix comment * rename advanced compiler configs to advanced compiler control * fix comment * fix name * fix name	2026-01-30 13:31:04 +08:00
myu-guo	d252b01300	fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990 ) * fix regression with cu13.1 * update	2026-01-30 13:30:49 +08:00
Xiao Song	acb45938e9	Update nvvm API call from nvvm enum to str (#2985 )	2026-01-27 17:28:29 +08:00
Xiao Song	7a14467776	update api usage (#2969 )	2026-01-27 15:33:22 +08:00
drazi	51f82812ec	Merge pull request #2891 from ColinPeppler/main docs: note when DSL dumps are populated	2026-01-26 17:38:27 -08:00
Junkai-Wu	9fba3195f9	v4.4 update. (#2979 )	2026-01-24 11:46:17 -05:00
Qi Yuhang	2fafefb7b9	[Bug Fix]Set NumSplitsM to 1 when TileShapeM < 128 in sm90 fp8 blockwise scaling CollectiveMma (#2965 ) * Fix NumSplitsM when TileShapeM < 128. * Use cute::conditional_t to replace std::conditional_t.	2026-01-23 15:56:52 +08:00
Johnsonms	0edaa6e47d	Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial (#2945 )	2026-01-23 12:54:12 +08:00
Aidan Do	431d070fcb	[docs] Add additional tip for generating less kernels in blockwise (#2940 ) - Running without this generates a lot of kernels - Clarified CMake configuration for selecting GEMM kernels and added details on kernel generation granularity.	2026-01-23 12:53:51 +08:00
Qi Yuhang	667446a9dd	[Doc]Fix Mode Name and Stride in 0t_mma_atom.md (#2910 ) * Fix wrong stride in 0t_mma_atom.md * Fix mode name. * Update media/docs/cpp/cute/0t_mma_atom.md Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com> --------- Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>	2026-01-23 12:53:30 +08:00
Aidan Do	3f5bafb326	[Cutlass profiler] Fix SM100 FP8 nosmem epilogue shape_div “Divisibility Condition” for non‑multiple‑of‑64 N tiles (#2946 ) * . * . * . * . * . * . * .	2026-01-20 15:27:34 +08:00
Tianqi Chen	1e6da09275	[DOCS] Update docs to precisely describe env stream scenario (#2824 ) Since the term synchronize may cause confusion that user thought it means stream sync, it simply means we pass the right current stream as env stream	2026-01-20 09:16:37 +08:00
Benjamin Leff	8debf77437	fix: 2305 omissions (#2957 )	2026-01-14 00:55:05 -05:00
Brian K. Ryu	147f5673d0	New RMS Norm example with unit tests (#2917 ) * Add rmsnorm example * Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly * Separate out reduce.py * Change copyright notice years	2026-01-13 09:05:31 +08:00
Johnsonms	8c52459504	Fix incorrect tensor layout strides in Blackwell MMA tutorial comments (#2921 )	2026-01-09 01:02:41 -05:00
kf-zhang	0deda34b9f	fix typo (#2884 )	2026-01-09 00:57:06 -05:00
Junkai-Wu	0d2b201e8c	v4.3.5 update. (#2934 ) * v4.3.5 update. * Update copyright to 2026	2026-01-08 15:02:56 -05:00
Wenxuan Tan	f86feb0aa8	Fix idx2crd docstring (#2914 ) * fix idx2crd docstring * fix * fix	2026-01-07 13:11:38 -05:00
Andrew Yooeun Chun	eb61c91147	Fix CUDA version checking in examples (#2894 ) * examples: update CUDA version requirements in Blackwell examples * examples: fix comments to specify the correct CUDA version requirement	2026-01-07 00:20:37 -05:00
Aidan Do	670480df3a	Fix SFB Layout scale granularity representation (#2924 )	2026-01-06 23:55:21 -05:00
veritas-Qiu	61b560983a	remove useless line (#2926 ) the parameter workspace is marked as unused like other kernels, but it is actually used after 3.3.0, so the code which mark it as unused could be removed.	2026-01-06 23:54:08 -05:00
dePaul Miller	7127592069	Replace CUDA driver API with runtime API (#2928 ) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2026-01-05 13:50:44 -05:00
questa-quan-wang	2aee73922c	Minor fix for testing of blockscaled dense GEMM with TMA prefetch (#2930 ) * new example with TMA prefetch feature targeting for DRAM latency bound cases * minor fix to restrict as 100a arch * typo * apply arch for whole pytest --------- Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com> Co-authored-by: Questa Wang <questaw@umbriel-b200-145.ipp4a1.colossus.nvidia.com>	2026-01-05 16:36:03 +08:00
tsu-bin	3d9de19bb7	add constexpr specifier to make_tiled_copy (#2875 )	2026-01-03 15:39:43 -05:00
Haicheng Wu	853ad93d60	Update README.md	2025-12-24 00:21:59 -05:00
Haicheng Wu	34a81f0497	Update driver bug workaround description in CHANGELOG Clarified the description of a driver bug workaround in the CHANGELOG.	2025-12-24 00:20:21 -05:00
questa-quan-wang	3f4c086d09	new example with TMA prefetch feature targeting for DRAM latency bound cases (#2881 ) Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>	2025-12-23 15:29:48 +08:00
Junkai-Wu	b7ecaa605d	v4.3.4 update v2. (#2898 )	2025-12-22 22:28:26 -05:00
Junkai-Wu	7f5fe3edf1	v4.3.4 update. (#2892 )	2025-12-21 11:49:12 -05:00
Colin Peppler	4b52d37ecd	docs: note when DSL dumps are populated	2025-12-19 17:05:12 -08:00
dongxiao	331e2f451c	add missing condition for sync (#2889 )	2025-12-19 11:00:30 +08:00
Qi Yuhang	ebf3165efb	[Bug Fix]Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler (#2865 ) * Delete unused #ifdef/#endif. Bypass sm120 case. * Add todo. * Fix pingpong. * Revert "Add todo." This reverts commit `246cb42091`. * Refine name. Refine name again. * Apply suggestions from code review Skip `is_last_tile` for all sm120 kernels. Co-authored-by: Junkai-Wu <junkaiw@nvidia.com> * Skip early stop for sm120 kernel. * Fix typo. --------- Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>	2025-12-18 08:51:38 +08:00
Haicheng Wu	d4e16f5d4e	Bump version from 4.2.1 to 4.3.3	2025-12-11 23:58:38 -05:00
Junkai-Wu	d3a5492381	v4.3.3 update. (#2868 )	2025-12-11 00:26:58 -05:00
Amin Sedaghat	49bd6bf1ba	fix print_layout printf format in device code (#2688 ) * fix print_layout printf format in device code * Replace %.s format specifier with explicit loop Remove unused delim variable The printf format %.s with dynamic width does not work correctly in CUDA device code, causing literal %.s to appear in output. Fixes #2496 * Update include/cute/util/print_tensor.hpp Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com> * Update include/cute/util/print_tensor.hpp Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com> --------- Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>	2025-12-10 08:57:56 +08:00
Junkai-Wu	c6dfdf8375	Merge pull request #2719 from HydraQYH/dev_support_pdl_for_sm90_gemm_array_tma_ws Support PDL for SM90 Array TMA GEMM	2025-12-09 18:32:15 +08:00
HydraQYH	95f8beb44c	Revert "Remove unnecessary #ifdef #endif for general gemm." This reverts commit `17ffd56dfe`.	2025-12-09 11:52:23 +08:00
HydraQYH	17ffd56dfe	Remove unnecessary #ifdef #endif for general gemm.	2025-12-06 10:20:35 +08:00
HydraQYH	ff7f2dcdfb	Remove duplicated cutlass::arch::wait_on_dependent_grids();	2025-12-06 10:20:35 +08:00
HydraQYH	929e1e0259	Remove unnecessary #ifdef / #endif for launch_dependent_grids.	2025-12-06 10:20:35 +08:00
HydraQYH	b6ad6db219	Delete unnecessary #ifdef / #endif.	2025-12-06 10:20:35 +08:00
HydraQYH	e1b2ec57e3	Hoist waits above the warp specialized region.	2025-12-06 10:20:35 +08:00
HydraQYH	1e5f95cbbe	Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative	2025-12-06 10:20:35 +08:00
HydraQYH	acf5990cc2	Refine position for wait_on_dependent_grids.	2025-12-06 10:20:35 +08:00
HydraQYH	91de7891a5	Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp	2025-12-06 10:20:35 +08:00
Haicheng Wu	1e7d3e030b	Update CHANGELOG for CuTe DSL enhancements Added new environment variable for CuTe DSL cache directory.	2025-12-05 13:49:57 -05:00
Haicheng Wu	c4744f706e	Bump version from 4.2.1 to 4.3.2	2025-12-05 13:45:16 -05:00

810 Commits