96 Commits

Author SHA1 Message Date
Junkai-Wu
a221da7ccf v4.5 dev update. (#3153) 2026-04-07 12:16:05 -04:00
Junkai-Wu
9fba3195f9 v4.4 update. (#2979) 2026-01-24 11:46:17 -05:00
kf-zhang
0deda34b9f fix typo (#2884) 2026-01-09 00:57:06 -05:00
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
dePaul Miller
7127592069 Replace CUDA driver API with runtime API (#2928)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-01-05 13:50:44 -05:00
tsu-bin
3d9de19bb7 add constexpr specifier to make_tiled_copy (#2875) 2026-01-03 15:39:43 -05:00
Junkai-Wu
b7ecaa605d v4.3.4 update v2. (#2898) 2025-12-22 22:28:26 -05:00
Amin Sedaghat
49bd6bf1ba fix print_layout printf format in device code (#2688)
* fix print_layout printf format in device code

* Replace %.*s format specifier with explicit loop
* Remove unused delim variable

The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.

Fixes #2496

* Update include/cute/util/print_tensor.hpp

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

* Update include/cute/util/print_tensor.hpp

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

---------

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-12-10 08:57:56 +08:00
Junkai-Wu
8cd5bef43a v4.3 tag release update. (#2789) 2025-11-20 20:49:44 -05:00
Junkai-Wu
b1d6e2c9b3 v4.3 update. (#2709)
* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Junkai-Wu
6a35b4d22f v4.2 tag release. (#2638) 2025-09-15 12:21:53 -04:00
Lifu Huang
76c96b0be3 Fix incorrect shapes in copy_atom doc comments. (#2575) 2025-09-04 16:57:24 -07:00
ao jia
d98e7bf7ce Fix comment in mma_atom.hpp (#2579) 2025-09-04 16:56:39 -07:00
Junkai-Wu
a49a78ffef v4.2 release. (#2587)
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.

* v4.2 release.
2025-08-22 18:11:24 -04:00
Inoday Yadav
42e7c546c4 Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) (#2562) 2025-08-19 22:22:02 -04:00
Wenxin Cheng
6dd13d4278 Facebook: This commit makes its files safe for use with -Wimplicit-fallthrough. (#2324) 2025-07-31 20:55:19 -04:00
kf-zhang
26b7450023 support fp16 accumulator for sm89 fp8 mma (#2378)
* add support for sm89 in cute and the unit tests

* support fp16 accumulator for sm89 fp8 mma

* format code
2025-07-30 22:12:08 -04:00
Aditya Kane
f09045d660 Corrected minor nit in mma_traits.hpp (#2447)
* Corrected minor nit in mma_traits.hpp

The entry and descriptions were jumbled up.

* Update mma_traits.hpp

* Update mma_traits.hpp
2025-07-30 22:11:23 -04:00
Junkai-Wu
a1aaf2300a v4.1 release 2025-07-03 08:07:53 -04:00
Junkai-Wu
8bdbfca682 v4.0 update. (#2371) 2025-06-06 02:39:20 -04:00
Kihiro Bando
f115c3f854 Release v4.0.0 (#2294) 2025-05-13 15:55:29 -04:00
Lain
2b78c2fe31 cherry-pick feature/hopper-blockwise-generalization-optimization (#2270) 2025-04-29 16:47:22 -04:00
Yujia Zhai
331a1f5b3f cutlass 3.9 update (#2255)
* cutlass 3.9 update

* rebase

* fixes out of shared memory for blockwise Blackwell

* doc format

* fix issue 2253

* disable host ref by default

* fix sm120 smem capacity

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-24 15:42:40 -04:00
吴坎
8e345c5c5b fix_missing_stdint (#2199)
* Update config.hpp

* Update config.hpp

* Update config.hpp
2025-04-23 22:21:22 -04:00
reed
9e1b649827 fix-left-inverse-for-nvcc114 (#2196) 2025-04-10 14:48:46 -04:00
reed
5120b21cc3 suppress compilation warnings (#2195) 2025-04-10 14:48:01 -04:00
kf-zhang
19cc2a5feb add support for sm89 in cute and the unit tests (#2177)
* add support for sm89 in cute and the unit tests

* rebase v3.9 and format code

* minor fix

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 14:16:36 -04:00
liujshi
df8a550d39 Update mma_atom.hpp (#2159)
remove useless code
2025-04-03 11:42:10 -04:00
Yujia Zhai
6f4921858b v3.9 update (#2203)
* v3.9 update

* voidD

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-02 15:11:18 -04:00
Yujia Zhai
62750a2b75 v3.9 (#2185)
* v3.8 update x

* fix blackwell gg

* doc change

* doc change

* doc change

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:52:23 -04:00
Yujia Zhai
b84e9802d8 update 3.8 v2 (#2112)
* update 3.8 v2

* update 3.8

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
Yujia Zhai
833f6990e0 v3.8.0 update (#2082)
* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
Haicheng Wu
47daa33c61 fix cuda 12.6 issues (#2066) 2025-01-28 17:28:29 -05:00
mihir-awatramani
389e493055 CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36fe8.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
Yujia Zhai
b78588d163 CUTLASS 3.7 (#2045)
* CUTLASS 3.7

* clean up changelog

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
Lei Mao
375e284e6a Add Line Break (#2020) 2025-01-08 23:46:59 -05:00
Yujia Zhai
3d261a5974 3.6.0 update (#2005)
* 3.6.0 update

* doc and swap stuff

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
Lei Mao
e1cd8c7866 Fix Typo (#1962) 2024-12-10 22:07:37 -05:00
LiYu Lu
d656afbd2a fix undefined in device code error (#1880) 2024-11-06 14:56:54 -05:00
azhurkevich
e8a8b69365 Refactor some GroupedGEMM logic (#1899) 2024-10-25 20:14:01 -04:00
LiYu Lu
08a49953a0 Add a print for the uint{x}b_t type. (#1871) 2024-10-24 14:39:22 -04:00
Caleb_Du
a424ca6cf9 fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation

* add "print" template for  subbyte_reference<T>
2024-10-24 14:38:35 -04:00
Sergey Klevtsov
d65266a868 Add all supported GMMA shapes (#1890) 2024-10-22 18:13:36 -04:00
Tri Dao
5b50a8faaf Add GMMA shape m64n40k16 (#1864) 2024-10-21 20:41:47 -04:00
Sergey Klevtsov
08101d9d0c Improve sm90 mixed dtype kernel (#1883) 2024-10-17 20:06:38 -04:00
Yujia Zhai
cc3c29a81a CUTLASS 3.6.0 (#1850)
* v3.6

* update changelog

* update readme

* fix typo

* fixing typos

* hopper gemm with weight prefetch

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
reed
2991ce18d3 Add print_svg for mma (#1733)
* add print_svg for mma

* correct the code indentation
2024-09-18 10:37:24 -04:00
Junkai-Wu
dbdae514e0 Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795) 2024-09-11 00:07:31 -04:00
Sean Xiaowen Zhang
21d0534167 fix assertion (#1790) 2024-09-09 14:05:27 -04:00
shunfan-shao
4dbf5dbed2 Use CUDA runtime API to retrieve function pointer to driver API (#1700)
* Query pfn to driver api

* use default for older toolkits

---------

Co-authored-by: shunfans <shunfans@nvidia.com>
2024-08-19 13:26:09 -04:00