Yuxi Chi
e9a75581fe
DeepGemm Support - Step 2 (#2142)
* add cpp example for DeepGemm.
* add grouped_gemm_contiguous.
* add groupedgemm_masked.
* add python example and tests.
2025-02-28 10:11:59 -05:00
Yuxi Chi
ac210faef8
DeepGemm Support (#2137)
* add cpp example for DeepGemm.
* add grouped_gemm_contiguous.
* add groupedgemm_masked.
2025-02-26 07:01:12 -05:00
Junkai-Wu
15f5468872
Migrate FlashMLA code to example. (#2135)
2025-02-26 01:29:07 -05:00
myu-guo
af5519d938
Flash MLA Support - Step 2 (#2134)
* initial commit
* initial commit
* fix some error
* update
* bugfix
* bugfix
* change name
* Add input&output process
* minor
* update
2025-02-25 23:18:03 -05:00
myu-guo
415d587ebf
Flash MLA support (#2130)
* initial commit
* initial commit
* fix some error
* update
* bugfix
* bugfix
* change name
2025-02-24 08:31:56 -05:00
Yujia Zhai
b84e9802d8
update 3.8 v2 (#2112)
* update 3.8 v2
* update 3.8
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
Yujia Zhai
833f6990e0
v3.8.0 update (#2082)
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
Tadej Ciglarič
6f55278121
bugfix generic-k code in top-k with softmax (#1993)
* bugfix generic-k code in top-k with softmax
* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
---------
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
Liang
3c28697b9f
Groupwise scaling along M for FP8 gemm (#2037)
* FP8 groupwise scaling along M
* small updates
---------
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
mihir-awatramani
389e493055
CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
This reverts commit b353e36fe8.
* update
* update
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
Yujia Zhai
b78588d163
CUTLASS 3.7 (#2045)
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
Manish Gupta
ef5620dd1d
Blockwise Scaling for FP8 (#1932)
* F8 Blockwise Scaling
* two more NumProducerThreadEvents
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
Lei Mao
52b35e90ce
Fix Typos (#2021)
* Fix Typo
* Fix Typo
2025-01-08 23:46:28 -05:00
Yujia Zhai
3d261a5974
3.6.0 update (#2005)
* 3.6.0 update
* doc and swap stuff
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
Lain
4c42f73fda
Improve mixed dtype GEMM (#1972)
* update
* fix a typo
2024-12-06 13:33:22 -05:00
Lain
8aa95dbb88
Fix the race condition of mixed-input gemm when writing the registers (#1931)
* move two warpgroup_wait
* merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-11-08 13:15:54 -05:00
Sergey Klevtsov
08101d9d0c
Improve sm90 mixed dtype kernel (#1883)
2024-10-17 20:06:38 -04:00
Yujia Zhai
cc3c29a81a
CUTLASS 3.6.0 (#1850)
* v3.6
* update changelog
* update readme
* fix typo
* fixing typos
* hopper gemm with weight prefetch
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
Junkai-Wu
dbdae514e0
Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795)
2024-09-11 00:07:31 -04:00
Vijay Thakkar
be60a0b272
CUTLASS 3.5.1 (#1623)
* CUTLASS 3.5.1
* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
Joe Rowell
843adf0408
Fix SMEM index for C in CuTe examples (#1477)
2024-07-10 11:14:15 -04:00
Vijay Thakkar
7d49e6c7e2
Updates for CUTLASS 3.5.0 (#1468)
2024-04-11 21:33:40 -04:00
Vijay Thakkar
629f4653c3
CUTLASS 3.5.0 (#1411)
2024-03-19 17:51:04 -04:00
ANIKET SHIVAM
bbe579a9e3
Updates for CUTLASS 3.4.1 (#1346)
* Updates for CUTLASS 3.4.1
* minor epi change
2024-02-15 15:48:34 -05:00
xws117
6e3df975a2
Modify comments in code examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu (#1325)
2024-01-31 21:41:30 -05:00
ANIKET SHIVAM
751eb9a885
Update license year (#1306)
2024-01-16 14:37:22 -05:00
ANIKET SHIVAM
2f589ffa76
Updates for 3.4 release. (#1305)
2024-01-16 13:42:51 -05:00
Tianao Ge
acba5beee5
Fix flops calculation and tensor b stride calculation in example 36 (#1278)
* Fix flops calculation and tensor b stride calculation in the example 36
* Fix datatype
* Update gather_scatter_fusion.cu
2024-01-08 17:27:30 -05:00
Jee Li
c9591a694d
fix typo (#1279)
2024-01-04 12:41:39 -05:00
Aleksandar Samardžić
5c756eb774
Add support for sparse GEMM with visitor epilogue (#1189)
* Add support for sparse GEMM with visitor epilogue
* Refactor changes at the kernel level
2024-01-04 12:38:11 -05:00
Pradeep Ramani
8236f30675
CUTLASS 3.4.0 (#1286)
* CUTLASS 3.4.0
* Update CHANGELOG.md
---------
Co-authored-by: Pradeep Ramani <prramani@nvidia.com>
2023-12-29 15:21:31 -05:00
Christian Sigg
e1483d5fa0
Collection of changes to fix clang build. (#1200)
* Remove unused variables
* Qualify calls to make_fragment_? from templated base class.
Fixes clang build error.
* Add missing `#include <cstdio>`
* Various changes to fix clang compile errors.
* More changes to fix clang build.
Remaining issues:
- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.
* Fix remaining clang build issues.
* Qualify `cute::rank()` calls.
* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.
* Double-escape special registers in inline asm.
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-08 14:42:12 -05:00
Christian Sigg
56fc3df03b
Adding missing typename (#1191)
Fixes clang build failures.
2023-11-29 00:20:20 -05:00
Jack Kosaian
8098336d51
Updates to Python interface for PyPI packaging (#1209)
* Updates
* Updates to notebooks
2023-11-28 13:52:12 -05:00
wang-y-z
557be3ab0e
Fix several typos (#1169)
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-02 23:54:46 -04:00
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 (#1167)
* Release 3.3.0
Adds support for mixed precision GEMMs on Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
2023-11-02 11:09:05 -04:00
ANIKET SHIVAM
90d3b0fb18
CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.
* Minor fix in gemm op profiler for raster order.
* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
Vadim Markovtsev
8783c41851
Replace 0x1f with 0xffffffff in __shfl_sync (#1097)
This fixes compatibility with H100 and resolves #1094
2023-09-18 19:58:19 -04:00
Yujia Zhai
6407bcdf0a
fix matrix B indices (#1089)
2023-09-12 14:04:18 -04:00
tpoisonooo
a77b2c9cb8
style(examples): typo (#1080)
* Update ampere_tensorop_conv2dfprop.cu
Fix a typo found while learning CUTLASS.
* Update ampere_gemm_operand_reduction_fusion.cu
2023-09-11 10:13:22 -04:00
ANIKET SHIVAM
a88c41cf8d
Updates for 3.2 release (#1065)
2023-08-25 23:05:46 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
dan_the_3rd
146d314057
Update fMHA kernels (#992)
* Update fMHA kernels
Upstream recent changes to fMHA that we did in xFormers.
Previous version in CUTLASS: facebookresearch/xformers@b6be33a
Updating to: facebookresearch/xformers@55a4798
* minor changes
* make var work
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-07-12 22:30:46 -04:00
Jack Kosaian
87349d3496
Add grouped b2b GEMM (#970)
2023-06-05 17:16:57 -04:00
ANIKET SHIVAM
f079619f5e
More updates for 3.1 (#958)
* Updates for 3.1
* Minor change
* doc link fix
* Minor updates
2023-05-24 10:17:16 -04:00
Ali Hassani
13f413493a
Stream-K with broadcast (#892)
* [WIP] GEMM StreamK w/ Fused Epilogue
* Adds Gemm Streamk with Fused Epilogue kernel level struct.
* Mostly based on Gemm with Fused Epilogue,
* Requires a new epilogue
* Work in progress
* [WIP] StreamK support for GemmUniversalWithBroadcast
* Just based off of how StreamK is allowed in GemmUniversal
* Untested and a work in progress
* Minor fixes
* [WIP] It compiles!
It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.
* Correction to reference kernel
* Fix typo
* Added MSE measurement
* Switch back to reference kernel + host for loop
Still WIP. Now we're getting even a larger MSE, but it's both on
basic Split-K and Stream-K.
* Fix typos
* Fix broadcast vector + requested changes
* Comment typo
* Small int option and more
* Fix incorrect condition on source needed
* Requested changes
* I think I got it?
* Bias vector should be stride 0
* Two source added!
* Typos
* Merge examples
* Bring back vector row offset
Just to ensure consistency with universal gemm with fused epilogue
* Base arguments and params structs for StreamK
* StreamK epilogue with broadcast now inherits the original
* undo params_streamk_base.h
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-22 19:05:06 -04:00
ANIKET SHIVAM
7c04f95415
Updates for 3.1 (#932)
2023-04-29 09:34:27 -04:00
Alexander Zinoviev
e36912f961
Fix for dangling references in the MHA example (#918)
2023-04-19 21:35:46 -04:00
Aleksandr Pivovar
4a68cf748e
added support of b2b bmm (#849)
* added support of b2b bmm
* fixed arguments and params structures
* added batch_count argument
* removed SplitKSerial and added new test case with b2b bmm
* fixed support of Kbatched and added new test case with batch stride
* added batch support for bias and scale
* make test
* small changes
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-14 23:20:02 -04:00
ANIKET SHIVAM
d572cc1aab
CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00