376 Commits

Author SHA1 Message Date
Janusz Lisiecki
24c8b7d8a2 Fix cuTE compilation with clang (#939)
- clang 1.14 complains about missing function from a host call:
  cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- fixes this by defining CUTE_HOST_DEVICE for clang as well

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2023-05-09 09:51:45 -04:00
ANIKET SHIVAM
7c04f95415 Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
Gregory Meyer (gregjm)
6f8596ce3f Add missing #include directive to get access to cutlass::epilogue::thread::ScaleType. (#925)
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue::thread::ScaleType`
already available when it is `#include`d.
2023-04-28 20:02:41 -04:00
Adnan Akhundov
fe2f491dd7 Get SM count with cudaDeviceGetAttribute in KernelHardwareInfo (#927) 2023-04-28 13:23:23 -04:00
Adnan Akhundov
df02482f1d Add missing schedules argument in SM90 fp16 op generation (#920) 2023-04-26 16:44:49 -04:00
Jakub Szuppe
180c5629bf Add missing checks for NVRTC in CuTe (#921) 2023-04-25 12:52:43 -04:00
Alexander Zinoviev
e36912f961 Fix for dangling references in the MHA example (#918) 2023-04-19 21:35:46 -04:00
Jack Kosaian
9a83bd3381 CUTLASS 3.1 Python interface documentation (#917)
* Add 12.1 Dockerfile

* Add 3.1 docs
2023-04-18 15:11:35 -04:00
Adnan Akhundov
54bebe417d Fix some typos in CuTe tutorials (#912) 2023-04-17 16:00:51 -04:00
Guray Ozen
43cfbe0086 Allow L2 prefetch for clang compiler (#914) 2023-04-15 01:23:22 -04:00
Aleksandr Pivovar
4a68cf748e added support of b2b bmm (#849)
* added support of b2b bmm

* fixed arguments and params structures

* added batch_count argument

* removed SplitKSerial and added new test case with b2b bmm

* fixed support of Kbatched and added new test case with batch stride

* added batch support for bias and scale

* make test

* small changes

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-14 23:20:02 -04:00
ANIKET SHIVAM
d572cc1aab CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00
dan_the_3rd
9b8166e3f0 fMHA: Add backward pass (#844)
* fMHA: Add backward pass

* Better checks for strides/alignments

* Remove fb-internal URL

* torch.Tensor.untyped_storage requires pytorch 2.0+

* minor changes

* make test

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-06 20:44:58 -04:00
Shuai Shao
e2d439ee7e Add tile_n=32 and tile_k=32 kernels in generator.py (#858) 2023-04-06 10:00:52 -04:00
Adnan Akhundov
0435979f59 Remove const from 3.x GemmUniversalAdapter::operator() (#905) 2023-04-03 20:30:51 -04:00
Adnan Akhundov
2ba1ef10be Increase max dynamic SMEM size in GemmSoftmax (#903) 2023-04-03 10:01:12 -04:00
Adnios
0964bdb64c update gemm and conv2d cmdline --help output (#878) 2023-04-01 11:38:13 -04:00
Gregory Meyer (gregjm)
ecbd24566c Enable shared memory intrinsics and ldmatrix PTX on Clang. (#754)
* Enable shared memory intrinsics and ldmatrix PTX on Clang.

This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Preventing these
intrinsics from being used is a significant latency regression on Clang.

* refine the macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-31 21:42:24 -04:00
Manish Gupta
660a05f581 fix split_k_mode and add reduction kernel for f16 input/accum/output (#896) 2023-03-30 15:31:08 -04:00
Feng Shijie
bc36122c3f [layout] Fix AffineRank2ColumnMajor::packed() (#879)
* [layout] Fix AffineRank2ColumnMajor::packed()

* correct affine2row::packed

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-29 11:59:48 -04:00
Vijay Thakkar
15d9d31f1f CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897) 2023-03-29 10:42:40 -04:00
ptrblck
1eef5c3cf1 add guards for __CUDA_ARCH__ >= 530 (#891)
* add guards for sm>=70

* drop guard to 530
2023-03-28 17:47:10 -04:00
Yujia Zhai
87070b6d51 add a CUTLASS publication (#893)
* add bytetransformer

* update arxiv link

* re-order
2023-03-28 17:06:57 -04:00
Haicheng Wu
77549ae6c8 Update PUBLICATIONS.md
msft moe paper
2023-03-25 21:17:05 -04:00
Alexander Zinoviev
42290f5d1c Fix for dangling pointers (#885) 2023-03-25 01:15:14 -04:00
Vijay Thakkar
209faf7b94 remove spurious comma (#871) 2023-03-20 17:25:27 -04:00
Jack Kosaian
6116706c96 Set batch_strides on Params::update (#883) 2023-03-20 17:07:47 -04:00
Nikita Shulga
2670b973dd Fix sign-compare warning in reorder_array (#869)
`std::vector<T>::size_type` is an unsigned type, so iterate over an unsigned type as well.

Discovered while trying to enable PyTorch builds without the `-Wno-sign-compare` warning suppression; see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
2023-03-20 17:07:24 -04:00
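
The sign-compare fix above can be illustrated with a minimal sketch. The helper `apply_permutation` below is hypothetical and is not the CUTLASS `reorder_array`; it only shows the pattern of giving the loop index the same unsigned type as `std::vector<T>::size_type`, so the loop condition compares unsigned with unsigned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the CUTLASS reorder_array): returns a copy of
// `data` in which element i is taken from data[order[i]].
template <typename T>
std::vector<T> apply_permutation(const std::vector<T>& data,
                                 const std::vector<std::size_t>& order) {
  assert(data.size() == order.size());
  std::vector<T> out(data.size());
  // The index has the container's own size_type, so `i < data.size()`
  // is an unsigned/unsigned comparison and -Wsign-compare has nothing
  // to report. An `int i` here is what typically triggers the warning.
  for (typename std::vector<T>::size_type i = 0; i < data.size(); ++i) {
    out[i] = data[order[i]];
  }
  return out;
}
```

Calling `apply_permutation(std::vector<int>{10, 20, 30}, std::vector<std::size_t>{2, 0, 1})` yields the elements in the order 30, 10, 20.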
Vijay Thakkar
af332d4aa9 Add missing comma in cutlass/arch/mma_sm90.h (#862) 2023-03-14 12:04:28 -04:00
Edward Rees
86cae03cea expose StoreT parameter for potential speed (#838)
* expose StoreT parameter for potential speed

* add storeT to more elementwise

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-10 12:58:17 -05:00
Stepan Tezyunichev
29801e348a Hide streams and typeinfo from nvrtc (#853)
* Hide streams and typeinfo from nvrtc

* Use __CUDACC_RTC__ instead of CUDA_ARCH for guard
2023-03-09 23:24:47 -05:00
Alexander Pivovarov
7e370c9637 Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-03-09 23:22:56 -05:00
ANIKET SHIVAM
c4f6b8c6bc Updates for 3.0 (#857)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
v3.0.0
2023-03-09 15:27:40 -05:00
Yinghai Lu
a68e2f95f0 Reduce verbosity in manifest.py (#845) 2023-03-07 11:53:01 -05:00
psaab
a31b43b3f3 Re-enable aarch64 support lost in 277bd6e537 (#846) 2023-03-02 11:17:21 -05:00
dan_the_3rd
f396cdd15c ex24[gemm_grouped]: Allow to change layout/dtype (#841)
* ex24[gemm_grouped]: Allow to change layout/dtype

* Address suggestion from @jackkosaian

---------

Co-authored-by: danthe3rd <danthe3rd>
2023-03-01 07:13:51 -05:00
Alexander Pivovarov
92ebbf1dc4 Fix typos (#839) 2023-02-27 11:17:58 -05:00
Haicheng Wu
65688c2a87 streamk fix (#836)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-23 16:35:08 -05:00
dan_the_3rd
f303889ed9 fMHA: Sync FW with xFormers (#828)
* fMHA: Add support for bias+dropout in FW

* Remove 'getMaximumSharedMemoryPerBlockKb'

* fix comments

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-22 23:25:31 -05:00
Shuai Shao
9cdbe33570 Add fixed_channel and few_channel mode to int8 in generator (#829) 2023-02-21 21:15:39 -05:00
Yuxin Wu
95f673ecf7 Update base_grouped.h (#832) 2023-02-21 14:48:30 -05:00
Haicheng Wu
91b8de8d32 streamk fix (#830)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-20 11:03:16 -05:00
Sujan Kumar Gonugondla
d8359c804b Changes to iterators to support s8 gemm with f16 outputs (#812)
* Changes to iterators to support s8 gemm with f16 outputs

* should work

---------

Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-16 18:37:51 -05:00
Haicheng Wu
34bed24af3 Update helper.h
copyright banner
2023-02-16 16:50:04 -05:00
ZZK
a101ac283f Fix some typos (#791)
* fix typo

* fix a dead link to code
2023-02-16 15:56:55 -05:00
Haicheng Wu
9fb38ac048 fix alignmentC=8 for imma N=128 (#822)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-15 12:06:00 -05:00
Haicheng Wu
8f5c242426 Update dual_gemm_common.h
fix the copyright of a new file.
2023-02-13 15:35:33 -05:00
Adnan Akhundov
3c995c7606 Extend DualGemm: support batched mode + decouple B0/B1 layouts (#790)
* Fix MHA kernel

* Extend DualGemm to support batched mode (#5)

Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.

* Decouple LayoutB0 and LayoutB1 in DualGemm

The DualGemm template assumed the same layout, LayoutB, for both right operand matrices B0 and B1. This is problematic if the two matrices have different layouts. In particular, this may be the case when one of the matrices is row-major, while the other is a (column) vector that has to be broadcast in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously.

In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above.

* Remove comment as no longer relevant

* Revert Fix MHA kernel

---------

Co-authored-by: mikeiovine <mikeiovine@fb.com>
2023-02-13 15:27:13 -05:00
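
Two points from the DualGemm commit above can be sketched in plain host C++. This is an illustration under stated assumptions, not the DualGemm code: `Status`, `DualGemmMode`, `check_mode`, and `ColumnMajorView` are hypothetical stand-ins that reuse names from the commit message. The first part mirrors the rule that batched mode and serial split-K cannot be combined. The second shows why a zero stride broadcasts a column vector: in a column-major view, element (row, col) is read from `data[row + col * ld]`, so with `ld == 0` every column aliases the same elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; these are not the CUTLASS definitions.
enum class Status { kSuccess, kErrorInvalidProblem };
enum class DualGemmMode { kGemm, kBatched };

// Batched mode and serial split-K are treated as mutually exclusive.
constexpr Status check_mode(DualGemmMode mode, bool split_k_serial) {
  return (mode == DualGemmMode::kBatched && split_k_serial)
             ? Status::kErrorInvalidProblem
             : Status::kSuccess;
}

// Column-major view: element (row, col) lives at data[row + col * ld].
struct ColumnMajorView {
  const float* data;
  std::size_t rows;
  std::size_t ld;  // leading dimension (elements between adjacent columns)

  float at(std::size_t row, std::size_t col) const {
    assert(row < rows);
    // With ld == 0 the column index contributes nothing to the offset,
    // so one vector is read as if it were repeated across all columns.
    return data[row + col * ld];
  }
};
```

With a three-element vector wrapped as `ColumnMajorView{v, 3, 0}`, `at(1, 0)` and `at(1, 7)` read the same element, which is the behavior the `{B1.device_data(), 0}` form in the commit message relies on.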
Shuai Shao
ce8597dc14 Fix type bug in conv2d/gemm with broadcast (#796)
add ElementVector

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-09 20:53:25 -05:00
dan_the_3rd
2e10404d26 xFormer updates to fMHA FW (#773)
* xFormer updates to fMHA FW

* Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen'

* Add missing files

* Remove xFormers specific code

* Update fused_multihead_attention_fixed_seqlen.cu

* rebase and solve conflicts

* remove white space

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-08 23:00:10 -05:00