cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-11 17:00:05 +00:00

Author	SHA1	Message	Date
mihir-awatramani	389e493055	CUTLASS 3.8 Release (#2059) * CUTLASS 3.8 Release * update * Update README.md * Revert "Update README.md" This reverts commit `b353e36fe8`. * update * update --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-25 02:44:06 -05:00
Yujia Zhai	b78588d163	CUTLASS 3.7 (#2045) * CUTLASS 3.7 * clean up changelog --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-18 09:53:07 -05:00
Manish Gupta	ef5620dd1d	Blockwise Scaling for FP8 (#1932) * F8 Blockwise Scaling * two more NumProducerThreadEvents --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-09 11:22:09 -05:00
Lei Mao	52b35e90ce	Fix Typos (#2021) * Fix Typo * Fix Typo	2025-01-08 23:46:28 -05:00
Yujia Zhai	3d261a5974	3.6.0 update (#2005) * 3.6.0 update * doc and swap stuff --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-12-25 01:34:40 -05:00
Lain	4c42f73fda	Improve mixed dtype GEMM (#1972) * update * fix a typo	2024-12-06 13:33:22 -05:00
Lain	8aa95dbb88	Fix the racing condition of mixed-input gemm when writing the registers (#1931) * move two warpgroup_wait * merge main --------- Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>	2024-11-08 13:15:54 -05:00
Sergey Klevtsov	08101d9d0c	Improve sm90 mixed dtype kernel (#1883)	2024-10-17 20:06:38 -04:00
Yujia Zhai	cc3c29a81a	CUTLASS 3.6.0 (#1850) * v3.6 * update changelog * update readme * fix typo * fixing typos * hopper gemm with weight prefetch --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-10-09 15:33:27 -04:00
Junkai-Wu	dbdae514e0	Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795)	2024-09-11 00:07:31 -04:00
Vijay Thakkar	be60a0b272	CUTLASS 3.5.1 (#1623) * CUTLASS 3.5.1 * updates, optimizations, fixes	2024-07-29 08:46:24 -04:00
Joe Rowell	843adf0408	Fix SMEM index for C in CuTe examples (#1477)	2024-07-10 11:14:15 -04:00
Vijay Thakkar	7d49e6c7e2	Updates for CUTLASS 3.5.0 (#1468)	2024-04-11 21:33:40 -04:00
Vijay Thakkar	629f4653c3	CUTLASS 3.5.0 (#1411)	2024-03-19 17:51:04 -04:00
ANIKET SHIVAM	bbe579a9e3	Updates for CUTLASS 3.4.1 (#1346) * Updates for CUTLASS 3.4.1 * minor epi change	2024-02-15 15:48:34 -05:00
xws117	6e3df975a2	Modify comments in code examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu (#1325)	2024-01-31 21:41:30 -05:00
ANIKET SHIVAM	751eb9a885	Update license year (#1306)	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305)	2024-01-16 13:42:51 -05:00
Tianao Ge	acba5beee5	Fix flops calculation and tensor b stride calculation in the example 36 (#1278) * Fix flops calculation and tensor b stride calculation in the example 36 * Fix datatype * Update gather_scatter_fusion.cu	2024-01-08 17:27:30 -05:00
Jee Li	c9591a694d	fix typo (#1279)	2024-01-04 12:41:39 -05:00
Aleksandar Samardžić	5c756eb774	Add support for sparse GEMM with visitor epilogue (#1189) * Add support for sparse GEMM with visitor epilogue * Refactor changes at the kernel level	2024-01-04 12:38:11 -05:00
Pradeep Ramani	8236f30675	CUTLASS 3.4.0 (#1286) * CUTLASS 3.4.0 * Update CHANGELOG.md --------- Co-authored-by: Pradeep Ramani <prramani@nvidia.com>	2023-12-29 15:21:31 -05:00
Christian Sigg	e1483d5fa0	Collection of changes to fix clang build. (#1200) * Remove unused variables * Qualify calls to make_fragment_? from templated base class. Fixes clang build error. * Add missing `#include <cstdio>` * Various changes to fix clang compile errors. * More changes to fix clang build. Remaining issues: - `params` initializer of `CollectiveEpilogue`. - `ops` initializer of `Sm90VisitorImplBase`. - `__usAtomicCAS` needs to be added to clang upstream. * Fix remaining clang build issues. * Qualify `cute::rank()` calls. * Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace. * Double-escape special registers in inline asm. * small change --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-12-08 14:42:12 -05:00
Christian Sigg	56fc3df03b	Adding missing `typename` (#1191) Fixes clang build failures.	2023-11-29 00:20:20 -05:00
Jack Kosaian	8098336d51	Updates to Python interface for PyPI packaging (#1209) * Updates * Updates to notebooks	2023-11-28 13:52:12 -05:00
wang-y-z	557be3ab0e	Fix several typos (#1169) Co-authored-by: isaacw <isaacw@nvidia.com>	2023-11-02 23:54:46 -04:00
Pradeep Ramani	c008b4aea8	CUTLASS 3.3.0 (#1167) * Release 3.3.0 Adds support for mixed precision GEMMs On Hopper and Ampere Adds support for < 16B aligned GEMMs on Hopper Enhancements to EVT Enhancements to Python interface Enhancements to Sub-byte type handling in CuTe Several other bug-fixes and performance improvements. * minor doc update	2023-11-02 11:09:05 -04:00
ANIKET SHIVAM	90d3b0fb18	CUTLASS 3.2.1 (#1113) * Updates for 3.2.1 release. * Minor fix in gemm op profiler for raster order. * Add scheduler mapping for raster order in the kernels.	2023-09-26 17:24:26 -04:00
Vadim Markovtsev	8783c41851	Replace 0x1f with 0xffffffff in __shfl_sync (#1097) This fixes compatibility with H100 and resolves #1094	2023-09-18 19:58:19 -04:00
Yujia Zhai	6407bcdf0a	fix matrix B indices (#1089)	2023-09-12 14:04:18 -04:00
tpoisonooo	a77b2c9cb8	style(examples): typo (#1080) * Update ampere_tensorop_conv2dfprop.cu learning cutlass, PR a typo. * Update ampere_gemm_operand_reduction_fusion.cu	2023-09-11 10:13:22 -04:00
ANIKET SHIVAM	a88c41cf8d	Updates for 3.2 release (#1065)	2023-08-25 23:05:46 -04:00
ANIKET SHIVAM	4575443d44	CUTLASS 3.2 (#1024) * CUTLASS 3.2	2023-08-07 20:50:32 -04:00
dan_the_3rd	146d314057	Update fMHA kernels (#992) * Update fMHA kernels Upstream recent changes to fMHA that we did in xFormers. Previous version in CUTLASS: facebookresearch/xformers@b6be33a Updating to: facebookresearch/xformers@55a4798 * minor changes * make var work --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-07-12 22:30:46 -04:00
Jack Kosaian	87349d3496	Add grouped b2b GEMM (#970)	2023-06-05 17:16:57 -04:00
ANIKET SHIVAM	f079619f5e	More updates for 3.1 (#958) * Updates for 3.1 * Minor change * doc link fix * Minor updates	2023-05-24 10:17:16 -04:00
Ali Hassani	13f413493a	Stream-K with broadcast (#892) * [WIP] GEMM StreamK w/ Fused Epilogue * Adds Gemm Streamk with Fused Epilogue kernel level struct. * Mostly based on Gemm with Fused Epilogue, * Requires a new epilogue * Work in progress * [WIP] StreamK support for GemmUniversalWithBroadcast * Just based off of how StreamK is allowed in GemmUniversal * Untested and a work in progress * Minor fixes * [WIP] It compiles! It is almost certainly incorrect, but we're past getting the templates to match, so checkpointing. * Correction to reference kernel * Fix typo * Added MSE measurement * Switch back to reference kernel + host for loop Still WIP. Now we're getting even a larger MSE, but it's both on basic Split-K and Stream-K. * Fix typos * Fix broadcast vector + requested changes * Comment typo * Small int option and more * Fix incorrect condition on source needed * Requested changes * I think I got it? * Bias vector should be stride 0 * Two source added! * Typos * Merge examples * Bring back vector row offset Just to ensure consistency with universal gemm with fused epilogue * Base arguments and params structs for StreamK * StreamK epilogue with broadcast now inherits the original * undo params_streamk_base.h --------- Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-05-22 19:05:06 -04:00
ANIKET SHIVAM	7c04f95415	Updates for 3.1 (#932)	2023-04-29 09:34:27 -04:00
Alexander Zinoviev	e36912f961	Fix for dangling references in the MHA example (#918)	2023-04-19 21:35:46 -04:00
Aleksandr Pivovar	4a68cf748e	added support of b2b bmm (#849) * added support of b2b bmm * fixed arguments and params structures * added batch_count argument * removed SplitKSerial and added new test case with b2b bmm * fixed support of Kbatched and added new test case with batch stride * added batch support for bias and scale * make test * small changes --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-04-14 23:20:02 -04:00
ANIKET SHIVAM	d572cc1aab	CUTLASS 3.1 (#915) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-04-14 23:19:34 -04:00
dan_the_3rd	9b8166e3f0	fMHA: Add backward pass (#844) * fMHA: Add backward pass * Better checks for strides/alignments * Remove fb-internal URL * torch.Tensor.untyped_storage requires pytorch 2.0+ * minor changes * make test --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-04-06 20:44:58 -04:00
Adnan Akhundov	2ba1ef10be	Increase max dynamic SMEM size in GemmSoftmax (#903)	2023-04-03 10:01:12 -04:00
Vijay Thakkar	15d9d31f1f	CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897)	2023-03-29 10:42:40 -04:00
Alexander Pivovarov	7e370c9637	Fix typos 2 (#842) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-03-09 23:22:56 -05:00
dan_the_3rd	f396cdd15c	ex24[gemm_grouped]: Allow to change layout/dtype (#841) * ex24[gemm_grouped]: Allow to change layout/dtype * Address suggestion from @jackkosaian --------- Co-authored-by: danthe3rd <danthe3rd>	2023-03-01 07:13:51 -05:00
dan_the_3rd	f303889ed9	fMHA: Sync FW with xFormers (#828) * fMHA: Add support for bias+dropout in FW * Remove 'getMaximumSharedMemoryPerBlockKb' * fix comments --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-22 23:25:31 -05:00
Haicheng Wu	34bed24af3	Update helper.h copyright banner	2023-02-16 16:50:04 -05:00
Haicheng Wu	8f5c242426	Update dual_gemm_common.h fix the copyright of a new file.	2023-02-13 15:35:33 -05:00
Adnan Akhundov	3c995c7606	Extend DualGemm: support batched mode + decouple B0/B1 layouts (#790) * Fix MHA kernel Summary: ATT Test Plan: Reviewers: Subscribers: Tasks: Tags: * Extend DualGemm to support batched mode (#5) Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set. * Decouple LayoutB0 and LayoutB1 in DualGemm The DualGemm template assumed the same layout, LayoutB, for both right operand matrices B0 and B1. This is problematic if the layout of the two matrices is different. In particular, this may be the case when one of the matrices is row-major, while the other is a (column) vector that has to be broadcasted in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously. In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above. * Remove comment as no longer relevant * Revert Fix MHA kernel --------- Co-authored-by: mikeiovine <mikeiovine@fb.com>	2023-02-13 15:27:13 -05:00

113 Commits