Junkai-Wu
a221da7ccf
v4.5 dev update. ( #3153 )
2026-04-07 12:16:05 -04:00
Junkai-Wu
1b741cabaa
v4.4.2 update. ( #3104 )
2026-03-17 00:58:19 -04:00
dePaul Miller
73c59c055c
Support for Group GEMM in CUTLASS Profiler for GeForce and Spark ( #3092 )
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Junkai-Wu
3bb6e28d3c
v4.4.1 update ( #3079 )
2026-02-27 13:59:21 -05:00
Neil Kichler
edf2f82c00
Fix register index bug in mma.sync.aligned.m16n8k16 ( #2740 )
2026-02-27 16:24:18 +08:00
Haicheng Wu
2aedca6f5e
Bump CUTLASS version to 4.4.0
2026-02-25 00:01:56 -05:00
Junkai-Wu
d4bbf728ca
v4.4 tag release update. ( #3032 )
2026-02-13 23:27:58 -05:00
Junkai-Wu
6b3e607b85
v4.4 release update v2. ( #2999 )
2026-02-03 20:48:31 -05:00
Junkai-Wu
9fba3195f9
v4.4 update. ( #2979 )
2026-01-24 11:46:17 -05:00
Qi Yuhang
2fafefb7b9
[Bug Fix] Set NumSplitsM to 1 when TileShapeM < 128 in sm90 fp8 blockwise scaling CollectiveMma ( #2965 )
* Fix NumSplitsM when TileShapeM < 128.
* Use cute::conditional_t to replace std::conditional_t.
2026-01-23 15:56:52 +08:00
kf-zhang
0deda34b9f
fix typo ( #2884 )
2026-01-09 00:57:06 -05:00
Junkai-Wu
0d2b201e8c
v4.3.5 update. ( #2934 )
* v4.3.5 update.
* Update copyright to 2026
2026-01-08 15:02:56 -05:00
veritas-Qiu
61b560983a
remove useless line ( #2926 )
The `workspace` parameter is marked as unused, as in other kernels, but it has actually been used since 3.3.0, so the code that marks it as unused can be removed.
2026-01-06 23:54:08 -05:00
dePaul Miller
7127592069
Replace CUDA driver API with runtime API ( #2928 )
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-01-05 13:50:44 -05:00
tsu-bin
3d9de19bb7
add constexpr specifier to make_tiled_copy ( #2875 )
2026-01-03 15:39:43 -05:00
Junkai-Wu
b7ecaa605d
v4.3.4 update v2. ( #2898 )
2025-12-22 22:28:26 -05:00
Junkai-Wu
7f5fe3edf1
v4.3.4 update. ( #2892 )
2025-12-21 11:49:12 -05:00
Qi Yuhang
ebf3165efb
[Bug Fix] Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler ( #2865 )
* Delete unused #ifdef/#endif. Bypass sm120 case.
* Add todo.
* Fix pingpong.
* Revert "Add todo."
This reverts commit 246cb42091.
* Refine name.
Refine name again.
* Apply suggestions from code review
Skip `is_last_tile` for all sm120 kernels.
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
* Skip early stop for sm120 kernel.
* Fix typo.
---------
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-12-18 08:51:38 +08:00
Junkai-Wu
d3a5492381
v4.3.3 update. ( #2868 )
2025-12-11 00:26:58 -05:00
Amin Sedaghat
49bd6bf1ba
fix print_layout printf format in device code ( #2688 )
* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.
Fixes #2496
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-12-10 08:57:56 +08:00
HydraQYH
95f8beb44c
Revert "Remove unnecessary #ifdef #endif for general gemm."
This reverts commit 17ffd56dfe.
2025-12-09 11:52:23 +08:00
HydraQYH
17ffd56dfe
Remove unnecessary #ifdef #endif for general gemm.
2025-12-06 10:20:35 +08:00
HydraQYH
ff7f2dcdfb
Remove duplicated cutlass::arch::wait_on_dependent_grids();
2025-12-06 10:20:35 +08:00
HydraQYH
929e1e0259
Remove unnecessary #ifdef / #endif for launch_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
b6ad6db219
Delete unnecessary #ifdef / #endif.
2025-12-06 10:20:35 +08:00
HydraQYH
e1b2ec57e3
Hoist waits above the warp specialized region.
2025-12-06 10:20:35 +08:00
HydraQYH
1e5f95cbbe
Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative
2025-12-06 10:20:35 +08:00
HydraQYH
acf5990cc2
Refine position for wait_on_dependent_grids.
2025-12-06 10:20:35 +08:00
HydraQYH
91de7891a5
Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp
2025-12-06 10:20:35 +08:00
Junkai-Wu
bc680c7f67
v4.3.2 update. ( #2839 )
2025-12-04 10:14:32 -05:00
Haicheng Wu
f11375bf91
Bump CUTLASS patch version to 1
2025-12-01 22:08:52 -05:00
Shreya Gaur
af8d5dfa54
bug fix for example 92 ( #2830 )
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
Co-authored-by: Shreya Gaur <shgaur@2u2g-spr-0015.ipp4a1.colossus.nvidia.com>
2025-12-01 22:02:59 -05:00
Junkai-Wu
1de3a576cc
v4.3.1 update. ( #2817 )
2025-11-27 09:49:30 -05:00
Shreya Gaur
2052fd3885
Blockscaled Ragged Contiguous Grouped Gemm for MoEs ( #2790 )
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-26 20:16:49 -05:00
Junkai-Wu
8cd5bef43a
v4.3 tag release update. ( #2789 )
2025-11-20 20:49:44 -05:00
Ali Hassani
d1ef0e87f2
DistGEMM bug fixes ( #2713 )
* Blackwell DistGEMM bug fixes
1. If using preferred cluster, there needs to be a branch so that the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell, and DistGEMM was previously using the per-device shape to evaluate workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in Hopper example as well).
4. Preferred and fallback cluster args need to be set explicitly, otherwise if someone modifies the example to use preferred cluster, it will just fail.
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00
ANIKET SHIVAM
020c700e97
support for K=0 for sm100 GG ( #2746 )
2025-11-04 11:25:39 -05:00
Qi Yuhang
b2ca083d2b
Fixed compilation error when using StreamK scheduler + PDL. ( #2686 )
2025-10-21 23:11:14 -04:00
Junkai-Wu
b1d6e2c9b3
v4.3 update. ( #2709 )
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Lain
e6e2cc29f5
fix ( #2684 )
2025-10-15 14:46:38 -04:00
Haicheng Wu
f874df19ac
4.2.1 update
2025-09-23 13:45:13 -07:00
Junkai-Wu
7a6d4ee099
v4.2.1 update. ( #2666 )
2025-09-23 13:25:43 -04:00
GTO
2b8dff1f90
Fix bfloat16 epsilon ( #2607 )
* Fix bfloat16 epsilon
* just use constants
---------
Co-authored-by: Konstantin <konstantin@MacBook-Air.local>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-09-21 23:43:59 -04:00
103yiran
fd0312ddf6
Remove duplicate function calls ( #1584 )
2025-09-21 23:16:59 -04:00
Haicheng Wu
57e3cfb47a
doc change for 4.2 ( #2639 )
* doc change
* fix broken links
* ragged gemm doc update
* move around texts about moe gemm
2025-09-15 22:02:45 -04:00
Haicheng Wu
e7e0adddac
Update version.h
change version number to 4.2
2025-09-15 12:40:58 -04:00
Junkai-Wu
6a35b4d22f
v4.2 tag release. ( #2638 )
2025-09-15 12:21:53 -04:00
Lifu Huang
76c96b0be3
Fix incorrect shapes in copy_atom doc comments. ( #2575 )
2025-09-04 16:57:24 -07:00
ao jia
d98e7bf7ce
Fix comment in mma_atom.hpp ( #2579 )
2025-09-04 16:56:39 -07:00
Andrei Alexandrescu
2288c0c901
Fix bugs in matrix.h ( #2598 )
2025-09-04 16:55:11 -07:00