cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-12 01:10:08 +00:00

Author	SHA1	Message	Date
Brian K. Ryu	147f5673d0	New RMS Norm example with unit tests (#2917) * Add rmsnorm example * Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly * Separate out reduce.py * Change copyright notice years	2026-01-13 09:05:31 +08:00
Johnsonms	8c52459504	Fix incorrect tensor layout strides in Blackwell MMA tutorial comments (#2921)	2026-01-09 01:02:41 -05:00
Junkai-Wu	0d2b201e8c	v4.3.5 update. (#2934) * v4.3.5 update. * Update copyright to 2026	2026-01-08 15:02:56 -05:00
Andrew Yooeun Chun	eb61c91147	Fix CUDA version checking in examples (#2894) * examples: update CUDA version requirements in Blackwell examples * examples: fix comments to specify the correct CUDA version requirement	2026-01-07 00:20:37 -05:00
Aidan Do	670480df3a	Fix SFB Layout scale granularity representation (#2924)	2026-01-06 23:55:21 -05:00
questa-quan-wang	3f4c086d09	new example with TMA prefetch feature targeting for DRAM latency bound cases (#2881) Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>	2025-12-23 15:29:48 +08:00
Junkai-Wu	7f5fe3edf1	v4.3.4 update. (#2892)	2025-12-21 11:49:12 -05:00
dongxiao	331e2f451c	add missing condition for sync (#2889)	2025-12-19 11:00:30 +08:00
Linfeng Zheng	f6402fcd5e	add pytest support for tutorial gemm (#2826) * add pytest support for tutorial gemm * add license	2025-12-05 08:45:01 -05:00
bangyu shen	7252a2d17e	remove internal comment (#2841) Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-04 10:36:21 -05:00
Junkai-Wu	bc680c7f67	v4.3.2 update. (#2839)	2025-12-04 10:14:32 -05:00
bangyu shen	52ae719eda	[examples][CuTeDSL] init commit for distirbuted examples (#2806) * init commit for distirbuted examples * better OOB protection * and try import to nvshmem for better error message and a READMME.md to introduce nvshmem and multimem instructions * add some lamport explanation * enhance f8 output and warn that f8 output can have nan in it * tell user why we need complicate data conversions in ref check part * tell user we don't support nvshmem device function --------- Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-01 22:25:40 -05:00
drazi	ec8daf642d	Merge pull request #2809 from whatdhack/patch-1 Update notebook title from 'Tour to' to 'Tour of'	2025-11-28 18:07:34 +08:00
Fung Xie	286781a1fb	add requirements.txt	2025-11-27 17:02:27 -08:00
Fung Xie	2664cac685	enhanced the example for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	b9154d65b3	update examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	afe2f71522	reorganize examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	739fffce27	fix TVM FFI doc and update example	2025-11-27 17:02:26 -08:00
Junkai-Wu	1de3a576cc	v4.3.1 update. (#2817)	2025-11-27 09:49:30 -05:00
Shreya Gaur	2052fd3885	Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790) * Adding blockscaled ragged contiguous grouped gemm for MoEs * cleaning up the example * introduction to example improved --------- Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>	2025-11-26 20:16:49 -05:00
whatdhack	4a55379686	Update notebook title from 'Tour to' to 'Tour of' Grammar check . LLM's can quickly clean up such issues.	2025-11-24 20:11:14 -08:00
Junkai-Wu	8cd5bef43a	v4.3 tag release update. (#2789)	2025-11-20 20:49:44 -05:00
Linfeng Zheng	406e078b29	add a notebook for tour to sol gemm (#2780) * add tour to sol gemm notebook * change some typos * change some typos	2025-11-20 09:41:01 -05:00
Mindy Li	06b6bd7d7b	remove cute dsl pdl example.	2025-11-09 21:47:00 -08:00
Linfeng Zheng	2252254ce2	Add tutorial fp16_gemm_1 (#2750) * Add tutorial fp16_gemm_1 * refine * refine * refine * revert changes in fp16_gemm_0.py	2025-11-06 22:40:09 -05:00
Ali Hassani	d1ef0e87f2	DistGEMM bug fixes (#2713) * Blackwell DistGEMM bug fixes 1. If using preferred cluster, there needs to be a branch so that the universal GEMM wrapper finds the correct base params. 2. Workspace sizes can change depending on problem shape in Blackwell, and DistGEMM was previously using the per-device shape to evaluate workspace size instead of the per-gemm shape. 3. Flattened size used to initialize host tensors can overflow (in Hopper example as well) 4. Preferred and fallback cluster args need to be set explicitly, otherwise if someone modifies the example to use preferred cluster, it will just fail. * Fix example runtimes * Set default fallback cluster shapes to the static ones	2025-11-06 13:31:24 -05:00
ANIKET SHIVAM	020c700e97	support for K=0 for sm100 GG (#2746)	2025-11-04 11:25:39 -05:00
Junkai-Wu	b1d6e2c9b3	v4.3 update. (#2709) * v4.3 update. * Update the cute_dsl_api changelog's doc link * Update version to 4.3.0 * Update the example link * Update doc to encourage user to install DSL from requirements.txt --------- Co-authored-by: Larry Wu <larwu@nvidia.com>	2025-10-21 14:26:30 -04:00
Aya Z. Ibrahim	64579189ec	Feature/add bottom causal mask (#2480) * Rebase to latest * update * upd Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Update fmha_fusion.hpp * Update fmha_fusion.hpp fixed flipped logic for isQBegin * Update fmha_fusion.hpp * Avoid use of booleans The current expression is confusing * fmt * Update fmha_fusion.hpp Reproduce error/fix with: ./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend * add test, format --------- Co-authored-by: Richard Cai <ricai@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2025-09-18 17:11:23 -04:00
Junkai-Wu	74825181f2	Remove old-version dsl examples. (#2644)	2025-09-17 22:23:30 -04:00
Junkai-Wu	6a35b4d22f	v4.2 tag release. (#2638)	2025-09-15 12:21:53 -04:00
Richard Cai	56f0718a97	ex77 backwards GQA (#2556) * bwd GQA init * Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu * ref kernel type conversion fix --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2025-09-09 12:53:28 -04:00
Lifu Huang	b6ccf34aef	Fix Copy_Atom type mismatch in sgemm_sm80.cu (#2582)	2025-09-04 16:56:17 -07:00
Linfeng Zheng	9ca7e877b2	fix gqa issue for blackwell fmha.py (#2599)	2025-08-28 11:15:20 -04:00
Junkai-Wu	a49a78ffef	v4.2 release. (#2587) * Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line. * v4.2 release.	2025-08-22 18:11:24 -04:00
qqwqqw689	11cad1f67b	fix a typo. (#2561)	2025-08-19 22:23:09 -04:00
Horace He	19772cd63e	Fix typo in smem_allocator.py (#2517)	2025-08-10 22:44:22 -04:00
Tarun Paparaju	a267d47f9b	Update batched_gemm.cu (#2538)	2025-08-10 22:42:21 -04:00
botbw	da47886e34	Fix example bug (#2351)	2025-07-30 22:12:33 -04:00
xiangjiaojun	84a27b3926	fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu GridDim miscalculated (#2492) * fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu Launch dimGrid error * feat: add cta tiler * Update examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu use cluster_layout_vmnk instead of cta_tiler Co-authored-by: Junkai-Wu <junkaiw@nvidia.com> * feat: remove cta_tiler --------- Co-authored-by: qinghongzeng <qinghongzeng@deeproute.ai> Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>	2025-07-30 22:11:04 -04:00
kernyan	e093b4f691	Fix tutorial comment in sgemm_1.cu: use tCrC instead of tCsA in axpby explanation (#2448)	2025-07-30 22:09:55 -04:00
Zeyu WANG	0e026982ce	Example 77 add blackwell fmha bwd for MLA shape (#2466) * Update examples/77_blackwell_fmha/device/fmha_device_bwd.hpp Co-authored-by: Vijay Thakkar <vijaythakkar@me.com> * bug fix & use existing value rather than pass one more argument to support different dim in bwd_convert * Fix casual mask cnt when IsQBegin==false * bug fix in casual mask backward * code sync --------- Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>	2025-07-24 18:41:11 -04:00
Junkai-Wu	fd6cfe1ed0	v4.1 release update v2. (#2481)	2025-07-21 22:03:55 -04:00
zhang	9baa06dd57	Add Blackwell MLA forward (shape: d=192, dv=128) implementation in example_77 (#2472)	2025-07-18 01:27:48 -04:00
Junkai-Wu	a1aaf2300a	v4.1 release	2025-07-03 08:07:53 -04:00
Junkai-Wu	889ff20648	v4.0 update v2. (#2420) * Ex77 forward kernel fix.	2025-06-25 12:56:25 -04:00
Junkai-Wu	dc4817921e	v4.0 update. (#2398) * Ex77 fix.	2025-06-12 09:10:29 -04:00
Junkai-Wu	8bdbfca682	v4.0 update. (#2371)	2025-06-06 02:39:20 -04:00
Manish Gupta	2e2af190bd	Revert "[ex77] fix mla split; add fwd lse; add bwd varlen (#2366)" (#2370) This reverts commit `f12b1d75c9`.	2025-06-05 23:14:57 -04:00
Markus Hoehnerbach	f12b1d75c9	[ex77] fix mla split; add fwd lse; add bwd varlen (#2366)	2025-06-05 18:39:46 -04:00


179 Commits