Commit Graph

361 Commits

Author SHA1 Message Date
Adnan Akhundov
2ba1ef10be Increase max dynamic SMEM size in GemmSoftmax (#903) 2023-04-03 10:01:12 -04:00
Adnios
0964bdb64c update gemm and conv2d cmdline --help output (#878) 2023-04-01 11:38:13 -04:00
Gregory Meyer (gregjm)
ecbd24566c Enable shared memory intrinsics and ldmatrix PTX on Clang. (#754)
* Enable shared memory intrinsics and ldmatrix PTX on Clang.

This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Preventing these
intrinsics from being used is a significant latency regression on Clang.

* refine the macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-31 21:42:24 -04:00
Manish Gupta
660a05f581 fix split_k_mode and add reduction kernel for f16 input/accum/output (#896) 2023-03-30 15:31:08 -04:00
Feng Shijie
bc36122c3f [layout] Fix AffineRank2ColumnMajor::packed() (#879)
* [layout] Fix AffineRank2ColumnMajor::packed()

* correct affine2row::packed

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-29 11:59:48 -04:00
Vijay Thakkar
15d9d31f1f CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897) 2023-03-29 10:42:40 -04:00
ptrblck
1eef5c3cf1 add guards for __CUDA_ARCH__ >= 530 (#891)
* add guards for sm>=70

* drop guard to 530
2023-03-28 17:47:10 -04:00
Yujia Zhai
87070b6d51 add a CUTLASS publication (#893)
* add bytetransformer

* update arxiv link

* re-order
2023-03-28 17:06:57 -04:00
Haicheng Wu
77549ae6c8 Update PUBLICATIONS.md
msft moe paper
2023-03-25 21:17:05 -04:00
Alexander Zinoviev
42290f5d1c Fix for dangling pointers (#885) 2023-03-25 01:15:14 -04:00
Vijay Thakkar
209faf7b94 remove spurious comma (#871) 2023-03-20 17:25:27 -04:00
Jack Kosaian
6116706c96 Set batch_strides on Params::update (#883) 2023-03-20 17:07:47 -04:00
Nikita Shulga
2670b973dd Fix sign-compare warning in reorder_array (#869)
`std::vector<T>::size_type` is unsigned type, so let's iterate over unsigned type as well


Discovered, while trying to enable PyTorch building without `-Wno-sign-compare` warning suppression, see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
2023-03-20 17:07:24 -04:00
Vijay Thakkar
af332d4aa9 Add missing comma in cutlass/arch/mma_sm90.h (#862) 2023-03-14 12:04:28 -04:00
Edward Rees
86cae03cea expose StoreT parameter for potential speed (#838)
* expose StoreT parameter for potential speed

* add storeT to more elementwise

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-10 12:58:17 -05:00
Stepan Tezyunichev
29801e348a Hide streams and typinfo from nvrtc (#853)
* Hide streams and typinfo from nvrtc

* Use __CUDACC_RTC__ instead CUDA_ARCH for guard
2023-03-09 23:24:47 -05:00
Alexander Pivovarov
7e370c9637 Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-03-09 23:22:56 -05:00
ANIKET SHIVAM
c4f6b8c6bc Updates for 3.0 (#857)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
v3.0.0
2023-03-09 15:27:40 -05:00
Yinghai Lu
a68e2f95f0 Reduce versbosity in manifest.py (#845) 2023-03-07 11:53:01 -05:00
psaab
a31b43b3f3 Re-enable aarch64 support lost in 277bd6e537 (#846) 2023-03-02 11:17:21 -05:00
dan_the_3rd
f396cdd15c ex24[gemm_grouped]: Allow to change layout/dtype (#841)
* ex24[gemm_grouped]: Allow to change layout/dtype

* Address suggestion from @jackkosaian

---------

Co-authored-by: danthe3rd <danthe3rd>
2023-03-01 07:13:51 -05:00
Alexander Pivovarov
92ebbf1dc4 Fix typos (#839) 2023-02-27 11:17:58 -05:00
Haicheng Wu
65688c2a87 streamk fix (#836)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-23 16:35:08 -05:00
dan_the_3rd
f303889ed9 fMHA: Sync FW with xFormers (#828)
* fMHA: Add support for bias+dropout in FW

* Remove 'getMaximumSharedMemoryPerBlockKb'

* fix comments

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-22 23:25:31 -05:00
Shuai Shao
9cdbe33570 Add fixed_channel and few_channel mode to int8 in generator (#829) 2023-02-21 21:15:39 -05:00
Yuxin Wu
95f673ecf7 Update base_grouped.h (#832) 2023-02-21 14:48:30 -05:00
Haicheng Wu
91b8de8d32 streamk fix (#830)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-20 11:03:16 -05:00
Sujan Kumar Gonugondla
d8359c804b Changes to iterators to support s8 gemm with f16 outputs (#812)
* Changes to iterators to support s8 gemm with f16 outputs

* should work

---------

Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-16 18:37:51 -05:00
Haicheng Wu
34bed24af3 Update helper.h
copyright banner
2023-02-16 16:50:04 -05:00
ZZK
a101ac283f Fix some typos (#791)
* fix typo

* fix a deadlink to code
2023-02-16 15:56:55 -05:00
Haicheng Wu
9fb38ac048 fix alignmentC=8 for imma N=128 (#822)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-15 12:06:00 -05:00
Haicheng Wu
8f5c242426 Update dual_gemm_common.h
fix the copyright of a new file.
2023-02-13 15:35:33 -05:00
Adnan Akhundov
3c995c7606 Extend DualGemm: support batched mode + decouple B0/B1 layouts (#790)
* Fix MHA kernel

Summary:

ATT

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Extend DualGemm to support batched mode (#5)

Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.

* Decouple LayoutB0 and LayoutB1 in DualGemm

The DualGemm template assumed the same layout, LayoutB, for both right operand matrices B0 and B1. This is problematic if the layout of the two matrices is different. In particular, this may be the case when one of the matrices is row-major, while the other is a (column) vector that has to be broadcasted in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously.

In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above.

* Remove comment as no longer relevant

* Revert Fix MHA kernel

---------

Co-authored-by: mikeiovine <mikeiovine@fb.com>
2023-02-13 15:27:13 -05:00
Shuai Shao
ce8597dc14 Fix type bug in conv2d/gemm with broadcast (#796)
add ElementVector

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-09 20:53:25 -05:00
dan_the_3rd
2e10404d26 xFormer updates to fMHA FW (#773)
* xFormer updates to fMHA FW

* Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen'

* Add missing files

* Remove xFormers specific code

* Update fused_multihead_attention_fixed_seqlen.cu

* rebase and solve conflicts

* remove white space

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-08 23:00:10 -05:00
Jack Kosaian
5ff5209ed5 Add acc2smem in epilogue/threadblock/epilogue.h (#806) 2023-02-06 22:04:16 -05:00
Jack Kosaian
5921043981 Re-enable all alignments for int accumulators (#807) 2023-02-06 22:01:15 -05:00
Mark Hoemmen
add4ba622f Fix 8.4 + CUDA 11.4 build (#789)
Work around a likely GCC 8.x issue with fold expressions
and generic lambdas.

Only use the work-around when the host compiler is GCC 8.x.
This avoids any concerns about the work-around possibly
hindering inlining for a critical CuTe function (product).

Users can experiment with the work-around for other compilers
or compiler versions by defining the following macro.

CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND

Fixes https://github.com/NVIDIA/cutlass/issues/788

Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>
2023-01-27 09:18:59 -05:00
Vijay Thakkar
277bd6e537 CUTLASS 3.0.0 (#786)
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
ANIKET SHIVAM
66d9cddc83 New updates for 2.11 (#775)
* New updates.

* Minor profiler updates

Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
v2.11.0
2023-01-20 16:32:57 -05:00
psaab
d49bef88f9 Enable aarch64 support (#779) 2023-01-20 15:51:58 -05:00
Haicheng Wu
8b42e751c6 streamk paper link (#765)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 22:10:43 -05:00
Muhammad Osama
eb7f99d3dd @hwu36 Adding the individual arXiv link for Stream-K paper. (#764)
* Stream-K individual paper entry.

* arXiv links updated.
2023-01-10 20:39:06 -05:00
Haicheng Wu
764b840d6f streamk example and performance tuning (#760)
* streamk example and performance tuning

* one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 16:10:02 -05:00
Ali Hassani
a1046d49c1 Adds missing semicolon (#759) 2023-01-09 21:50:46 -05:00
Haicheng Wu
1cd994b4cf Update PUBLICATIONS.md
@neoblizz @dumerrill 

thesis covering streamk
2023-01-08 00:42:19 -05:00
Gregory Meyer (gregjm)
7bdba07310 Add definitions for tag structs. (#752)
This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).
2023-01-06 09:46:52 -05:00
Gregory Meyer (gregjm)
c54ede3a9e Add const overloads for iterator functions. (#753)
This commit adds `const`-correct overloads for `Array::{begin,end,rbegin,rend}`. These overloads are necessary for usage with [the GMock Container Matchers](http://google.github.io/googletest/reference/matchers.html#container-matchers), which cast the `Container` argument to a constant reference.
2023-01-06 09:46:34 -05:00
Haicheng Wu
ff6e733fe1 restore the old epilogue for everything except streamk (#749)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-04 11:02:55 -05:00
Haicheng Wu
5989b7e1d7 Update PUBLICATIONS.md
Add coconet paper to the publication list.  @abhijangda
2023-01-04 09:18:38 -05:00