Commit Graph

75 Commits

Author SHA1 Message Date
dePaul Miller
ca4fdbea70 Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper (#2139)
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.

Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-02-26 12:44:58 -05:00
Yujia Zhai
b84e9802d8 update 3.8 v2 (#2112)
* update 3.8 v2

* update 3.8

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
Yujia Zhai
833f6990e0 v3.8.0 update (#2082)
* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
mihir-awatramani
389e493055 CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36fe8.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
Yujia Zhai
b78588d163 CUTLASS 3.7 (#2045)
* CUTLASS 3.7

* clean up changelog

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
Manish Gupta
ef5620dd1d Blockwise Scaling for FP8 (#1932)
* F8 Blockwise Scaling

* two more NumProducerThreadEvents

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
Yujia Zhai
3d261a5974 3.6.0 update (#2005)
* 3.6.0 update

* doc and swap stuff

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
azhurkevich
e8a8b69365 Refactor some GroupedGEMM logic (#1899) 2024-10-25 20:14:01 -04:00
Xinyu Yang
b0c09ed077 fix by adding public (#1753) 2024-10-23 12:45:58 -04:00
Haicheng Wu
755194a7bd add is_last_tile 2024-10-17 12:11:02 -07:00
Yujia Zhai
cc3c29a81a CUTLASS 3.6.0 (#1850)
* v3.6

* update changelog

* update readme

* fix typo

* fixing typos

* hopper gemm with weight prefetch

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
Junkai-Wu
dbdae514e0 Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795) 2024-09-11 00:07:31 -04:00
Vijay Thakkar
be60a0b272 CUTLASS 3.5.1 (#1623)
* CUTLASS 3.5.1

* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
Chengquan Jiang
56b46e2d13 Fix grouped gemm invalid memory access to problem shapes (#1543) 2024-07-10 11:55:22 -04:00
Vijay Thakkar
7d49e6c7e2 Updates for CUTLASS 3.5.0 (#1468) 2024-04-11 21:33:40 -04:00
Vijay Thakkar
629f4653c3 CUTLASS 3.5.0 (#1411) 2024-03-19 17:51:04 -04:00
ANIKET SHIVAM
bbe579a9e3 Updates for CUTLASS 3.4.1 (#1346)
* Updates for CUTLASS 3.4.1

* minor epi change
2024-02-15 15:48:34 -05:00
Aleksandar Samardžić
ca37d632c9 Remove sparse GEMM with row broadcasted bias vector (#1302)
This reverts commit d3e72719b4.

Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
2024-01-17 14:06:27 -05:00
ANIKET SHIVAM
751eb9a885 Update license year (#1306) 2024-01-16 14:37:22 -05:00
ANIKET SHIVAM
2f589ffa76 Updates for 3.4 release. (#1305) 2024-01-16 13:42:51 -05:00
Aleksandar Samardžić
5c756eb774 Add support for sparse GEMM with visitor epilogue (#1189)
* Add support for sparse GEMM with visitor epilogue

* Refactor changes at the kernel level
2024-01-04 12:38:11 -05:00
Pradeep Ramani
8236f30675 CUTLASS 3.4.0 (#1286)
* CUTLASS 3.4.0

* Update CHANGELOG.md

---------

Co-authored-by: Pradeep Ramani <prramani@nvidia.com>
2023-12-29 15:21:31 -05:00
Christian Sigg
e1483d5fa0 Collection of changes to fix clang build. (#1200)
* Remove unused variables

* Qualify calls to make_fragment_? from templated base class.

Fixes clang build error.

* Add missing `#include <cstdio>`

* Various changes to fix clang compile errors.

* More changes to fix clang build.

Remaining issues:

- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.

* Fix remaining clang build issues.

* Qualify `cute::rank()` calls.

* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.

* Double-escape special registers in inline asm.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-08 14:42:12 -05:00
Ali Hassani
f4a0216601 Fix bug in single source GEMM with residual + streamk (#1249)
Followup to #1224.

A change in the stream-k threadblock swizzle ctor since 3.3 breaks
single source GEMM with fused epilogue and stream-k. Multi-source was
already corrected.

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
2023-12-07 11:12:02 -05:00
Pradeep Ramani
e9e30c2304 Updates and Bug fixes to CUTLASS 3.3 (#1232) 2023-12-05 09:50:49 -05:00
Christian Sigg
99c4eebe3b Explicitly cast blockIdx to uint3 (#1192)
This works around a clang issue where blockIdx is of a different type.
2023-11-30 00:26:23 -05:00
Sergey Klevtsov
b5d8a5d9cc Allow SM90 pingpong kernel to use custom tile schedulers (#1194)
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2023-11-15 13:45:17 -05:00
wang-y-z
557be3ab0e Fix several typos (#1169)
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-02 23:54:46 -04:00
Pradeep Ramani
c008b4aea8 CUTLASS 3.3.0 (#1167)
* Release 3.3.0

Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.

* minor doc update
2023-11-02 11:09:05 -04:00
ANIKET SHIVAM
90d3b0fb18 CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.

* Minor fix in gemm op profiler for raster order.

* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
Driss Guessous
88c0d7c726 make only visible on device (#1071) 2023-09-07 13:00:46 -04:00
ANIKET SHIVAM
a88c41cf8d Updates for 3.2 release (#1065) 2023-08-25 23:05:46 -04:00
Haibin Lin
7e5ee8b7bf [doc] fix: fix typos in the comment (#1049) 2023-08-16 11:39:25 -04:00
ANIKET SHIVAM
4575443d44 CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
Aleksandar Samardžić
d3e72719b4 Add support for sparse GEMM with row broadcasted bias vector (#951) 2023-05-24 10:25:05 -04:00
ANIKET SHIVAM
f079619f5e More updates for 3.1 (#958)
* Updates for 3.1

* Minor change

* doc link fix

* Minor updates
2023-05-24 10:17:16 -04:00
Ali Hassani
13f413493a Stream-K with broadcast (#892)
* [WIP] GEMM StreamK w/ Fused Epilogue

* Adds Gemm Streamk with Fused Epilogue kernel level struct.
  * Mostly based on Gemm with Fused Epilogue,
  * Requires a new epilogue
  * Work in progress

* [WIP] StreamK support for GemmUniversalWithBroadcast

* Just based off of how StreamK is allowed in GemmUniversal
  * Untested and a work in progress

* Minor fixes

* [WIP] It compiles!

It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.

* Correction to reference kernel

* Fix typo

* Added MSE measurement

* Switch back to reference kernel + host for loop

Still WIP. Now we're getting even a larger MSE, but it's both on
basic Split-K and Stream-K.

* Fix typos

* Fix broadcast vector + requested changes

* Comment typo

* Small int option and more

* Fix incorrect condition on source needed

* Requested changes

* I think I got it?

* Bias vector should be stride 0

* Two source added!

* Typos

* Merge examples

* Bring back vector row offset

Just to ensure consistency with universal gemm with fused epilogue

* Base arguments and params structs for StreamK

* StreamK epilogue with broadcast now inherits the original

* undo params_streamk_base.h

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-22 19:05:06 -04:00
ANIKET SHIVAM
7c04f95415 Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
ANIKET SHIVAM
d572cc1aab CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00
Alexander Zinoviev
42290f5d1c Fix for dangling pointers (#885) 2023-03-25 01:15:14 -04:00
Jack Kosaian
6116706c96 Set batch_strides on Params::update (#883) 2023-03-20 17:07:47 -04:00
Alexander Pivovarov
7e370c9637 Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-03-09 23:22:56 -05:00
Haicheng Wu
65688c2a87 streamk fix (#836)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-23 16:35:08 -05:00
Haicheng Wu
91b8de8d32 streamk fix (#830)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-20 11:03:16 -05:00
Shuai Shao
ce8597dc14 Fix type bug in conv2d/gemm with broadcast (#796)
add ElementVector

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-09 20:53:25 -05:00
Vijay Thakkar
277bd6e537 CUTLASS 3.0.0 (#786)
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
ANIKET SHIVAM
66d9cddc83 New updates for 2.11 (#775)
* New updates.

* Minor profiler updates

Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-01-20 16:32:57 -05:00
Haicheng Wu
764b840d6f streamk example and performance tuning (#760)
* streamk example and performance tuning

* one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 16:10:02 -05:00
Haicheng Wu
ff6e733fe1 restore the old epilogue for everything except streamk (#749)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-04 11:02:55 -05:00
ANIKET SHIVAM
38193d76e3 Updates for stream-k (#728)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-12-08 23:48:10 -05:00