Haicheng Wu
1afc6d355b
port nvrtc change to version.h
...
update to 4.3.6
alignment-related miscalculation for pipeline stages
Allow larger library on 64bit platform
add more changelog items
remove whitespace
2026-06-20 06:42:52 -07:00
Junkai-Wu
4faf1a1568
v4.3.5 update. (#2935)
...
* v4.3.5 update.
* Update copyright to 2026.
2026-01-08 15:02:14 -05:00
Junkai-Wu
7233a05f24
v4.3.4 update. (#2893)
2025-12-21 11:49:35 -05:00
Junkai-Wu
5873443bb6
v4.3.3 update (#2869)
...
* v4.3.3 update.
* fix print_layout printf format in device code (#2688)
* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.
Fixes #2496
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Support PDL for SM90 Array TMA GEMM
* Update changelog
---------
Co-authored-by: Amin Sedaghat <35748194+Aminsed@users.noreply.github.com>
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-12-11 00:26:17 -05:00
Junkai-Wu
ff35fa561d
v4.3.2 update. (#2840)
2025-12-04 10:14:50 -05:00
Haicheng Wu
10d4651439
Bump version from 4.2.0 to 4.3.1
2025-12-01 19:17:19 -08:00
Junkai-Wu
5fd9685dce
v4.3.1 update (#2818)
...
* Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790)
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
* v4.3.1 update.
---------
Co-authored-by: Shreya Gaur <48754356+Shreya-gaur@users.noreply.github.com>
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-27 09:48:55 -05:00
Haicheng Wu
ddaf12c1b1
Bump version from 4.2.0 to 4.3.0
2025-11-24 16:35:27 -05:00
Junkai-Wu
8cd5bef43a
v4.3 tag release update. (#2789)
2025-11-20 20:49:44 -05:00
Junkai-Wu
b1d6e2c9b3
v4.3 update. (#2709)
...
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Junkai-Wu
6a35b4d22f
v4.2 tag release. (#2638)
2025-09-15 12:21:53 -04:00
Junkai-Wu
a49a78ffef
v4.2 release. (#2587)
...
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set on the command line.
* v4.2 release.
2025-08-22 18:11:24 -04:00
Yujia Zhai
b78588d163
CUTLASS 3.7 (#2045)
...
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
ANIKET SHIVAM
751eb9a885
Update license year (#1306)
2024-01-16 14:37:22 -05:00
ANIKET SHIVAM
2f589ffa76
Updates for 3.4 release. (#1305)
2024-01-16 13:42:51 -05:00
Pradeep Ramani
8236f30675
CUTLASS 3.4.0 (#1286)
...
* CUTLASS 3.4.0
* Update CHANGELOG.md
---------
Co-authored-by: Pradeep Ramani <prramani@nvidia.com>
2023-12-29 15:21:31 -05:00
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 (#1167)
...
* Release 3.3.0
Adds support for mixed precision GEMMs on Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
2023-11-02 11:09:05 -04:00