Introduces cutlass::epilogue::thread::Snake, a two-operand activation
functor implementing Snake_a(x) = x + (1/a) * sin^2(a*x) from
Ziyin et al. 2020 (arXiv:2006.08195). The per-channel learnable
frequency `a` flows through an EVT child (e.g. Sm90RowBroadcast),
composing into Sm90EVT<Sm90Compute<Snake, ...>, x_node, alpha_node>
for fused GEMM+Snake epilogues used in neural vocoders.
Adds unit tests in test/unit/epilogue/thread/activation.cu covering
f32 and bf16 paths, validated against float64 reference goldens.
Closes #3141
The store(frag, tile_offset) method computed the pointer offset
without dividing by kElementsPerAccess, while the matching load(frag,
tile_offset) method includes this division. Both load_with_pointer_offset
and store_with_pointer_offset apply the same byte conversion, so the
tile_offset -> pointer_offset calculation must also match.
When kElementsPerAccess > 1, load and store therefore referenced
different memory locations for the same logical tile offset.
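A minimal sketch of the mismatch (constants and function names are illustrative, not the iterator's actual members):

```cpp
// Both load() and store() forward to *_with_pointer_offset(), which expect
// the offset in access units, so tile_offset must be scaled the same way.
constexpr int kElementsPerAccess = 4;     // example value
constexpr int kTileElements      = 128;   // elements advanced per tile step

int load_pointer_offset(int tile_offset) {
  return tile_offset * kTileElements / kElementsPerAccess;  // existing load() behavior
}

int store_pointer_offset_before(int tile_offset) {
  return tile_offset * kTileElements;                       // bug: missing the division
}

int store_pointer_offset_after(int tile_offset) {
  return tile_offset * kTileElements / kElementsPerAccess;  // now matches load()
}
```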
Fixes #3017
Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
The workspace parameter is marked as unused, as in other kernels, but it is actually used after 3.3.0, so the code that marks it as unused can be removed.
* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing the literal %.*s to appear in the output.
Fixes #2496
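A sketch of the workaround, assuming a hypothetical print_prefix helper rather than the exact print_tensor.hpp change:

```cpp
#include <cstdio>

// Device-side printf mishandles the dynamic-width "%.*s" specifier, so emit
// at most n characters of s with an explicit character loop instead.
__host__ __device__
inline void print_prefix(char const* s, int n) {
  for (int i = 0; i < n && s[i] != '\0'; ++i) {
    printf("%c", s[i]);
  }
}
```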
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Add blockscaled ragged contiguous grouped GEMM for MoEs
* Clean up the example
* Improve the introduction to the example
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
* Blackwell DistGEMM bug fixes
1. If using a preferred cluster, there needs to be a branch so that
the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape on Blackwell,
and DistGEMM was previously using the per-device shape to evaluate
the workspace size instead of the per-GEMM shape.
3. The flattened size used to initialize host tensors can overflow (in
the Hopper example as well); see the sketch after this list.
4. Preferred and fallback cluster args need to be set explicitly;
otherwise, if someone modifies the example to use the preferred
cluster, it will simply fail.
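For item 3, a sketch of the overflow-safe size computation (the helper name and extents are illustrative, not the example's actual code):

```cpp
#include <cstdint>

// Accumulate the flattened host-tensor size in 64-bit arithmetic so the
// product of the per-GEMM extents cannot wrap a 32-bit int.
inline int64_t flattened_size(int m, int n, int l) {
  return static_cast<int64_t>(m) * n * l;
}
```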
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update docs to encourage users to install the DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>