Commit Graph

  • 982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091) main Johnsonms 2026-03-17 21:40:15 -07:00
  • da5e086dab v4.4.2 update. (#3105) v4.4.2 release/4.4 Junkai-Wu 2026-03-17 12:58:41 +08:00
  • 1b741cabaa v4.4.2 update. (#3104) Junkai-Wu 2026-03-17 12:58:19 +08:00
  • 772fbb264e [CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106) Linfeng Zheng 2026-03-17 10:11:55 +08:00
  • 087c84df83 docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047) Blake Ledden 2026-03-11 19:29:46 -07:00
  • 6a188a33cb Placeholder change. oss_ci Zekun Fan 2025-11-14 16:09:03 -08:00
  • 73c59c055c Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092) dePaul Miller 2026-03-06 17:36:29 -08:00
  • e5fcd125a5 [fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087) Johnsonms 2026-03-05 01:20:26 -08:00
  • a93d86ec83 Fix finding cuDNN (#2890) TLescoatTFX 2026-03-05 02:51:37 +01:00
  • 49e54f2b23 fix: add_help=False in temporary parser (#2721) David W.H. Swenson 2026-03-02 01:33:42 -06:00
  • b9847690c5 Merge pull request #3028 from SzymonOzog/patch-3 drazi 2026-02-28 10:11:05 +08:00
  • 4370102f9d v4.4.1 update (#3080) v4.4.1 Junkai-Wu 2026-02-28 03:01:08 +08:00
  • 3bb6e28d3c v4.4.1 update (#3079) Junkai-Wu 2026-02-28 02:59:21 +08:00
  • c651d660d2 fix typo (#3012) Tianqi Zhang (张天启) 2026-02-27 16:25:35 +08:00
  • 518327d631 Fix error in Blackwell document of referring to Mxf4 format as NVF4 (#2977) Ziang Li 2026-02-27 00:25:16 -08:00
  • de67bb7a42 Fix example in CuTe tutorials (#2752) StevenYangCC 2026-02-27 16:24:34 +08:00
  • edf2f82c00 Fix register index bug in mma.sync.aligned.m16n8k16 (#2740) Neil Kichler 2026-02-27 09:24:18 +01:00
  • 79345359a7 Fix debug typo in sgemm_2.cu and sgemm_sm70.cu (#2678) mnehete32 2026-02-27 13:53:59 +05:30
  • 8b9b3d78df fix typo in documentation (#2671) zkyue 2026-02-27 16:23:37 +08:00
  • fc5bbc2dab Fix typo in cute.nvgpu.warpgroup.mma doc (#2548) Gabriel Wu 2026-02-27 16:22:55 +08:00
  • 057635de5c Remove redundant dsl example. (#3074) Junkai-Wu 2026-02-26 21:10:59 +08:00
  • c213bfdfc1 Remove redundant dsl examples. (#3071) v4.4.0 Junkai-Wu 2026-02-26 11:42:01 +08:00
  • 954503d44c Bump version to 4.4.0 Haicheng Wu 2026-02-25 00:04:04 -05:00
  • 6c4200f1bc Bump version from 4.3.5 to 4.4.0 Haicheng Wu 2026-02-25 00:03:23 -05:00
  • de93e8a4ac Bump version from 4.3.5 to 4.4.0 Haicheng Wu 2026-02-25 00:03:04 -05:00
  • b92b9f0d37 Bump version from 4.3.5 to 4.4.0 Haicheng Wu 2026-02-25 00:02:41 -05:00
  • 2aedca6f5e Bump CUTLASS version to 4.4.0 Haicheng Wu 2026-02-25 00:01:56 -05:00
  • ae5ed7361b Bump CUTLASS version to 4.4.0 hwu36-patch-2 Haicheng Wu 2026-02-25 00:01:26 -05:00
  • 6450964b57 Update README Haicheng Wu 2026-02-24 23:55:55 -05:00
  • 284449fa5b Revise chagnelog Haicheng Wu 2026-02-24 23:54:56 -05:00
  • 0853d81d70 Revise README Haicheng Wu 2026-02-24 15:32:17 -05:00
  • 3476ddb7bd remove mixed_input_fmha_prefill (#3041) Linfeng Zheng 2026-02-18 20:59:01 +08:00
  • 291300ffff [CuTeDSL] implment a cta-level norm example (both layernorm and rmsnorm) (#3009) Yihan Chen 2026-02-14 17:54:03 +08:00
  • f9a5f76b7a Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027) aragorn-guan 2026-02-14 17:51:20 +08:00
  • ec7e6cb17b Merge pull request #2971 from rsmallblue/tvm-ffi drazi 2026-02-14 14:14:10 +08:00
  • 395ab575f6 Merge branch 'main' into tvm-ffi Yuan Xiaolan 2026-02-14 13:35:28 +08:00
  • d4bbf728ca v4.4 tag release update. (#3032) Junkai-Wu 2026-02-14 12:27:58 +08:00
  • beb80e04e1 Add option to not suffix prints with new line Szymon Ożóg 2026-02-13 15:56:50 +01:00
  • 01687cfba1 Merge pull request #3004 from tridao/add-sub-packed-f32x2 drazi 2026-02-13 20:46:26 +08:00
  • 5c42d0f28c Merge pull request #3021 from tridao/clc_no_multicast drazi 2026-02-13 20:45:52 +08:00
  • 1d36152f34 Merge pull request #3022 from tridao/nvvm_fmin drazi 2026-02-13 20:45:08 +08:00
  • 244e8d00d5 [Cute-DSL] Add cute.arch.fmin by calling nvvm Tri Dao 2026-02-11 14:23:09 -05:00
  • 5b83b34afd [Cute-DSL] Add option for issue_clc_query without multicast Tri Dao 2026-02-11 14:19:29 -05:00
  • 8dbce01473 [CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store (#2970) aragorn-guan 2026-02-11 11:54:00 +08:00
  • 71aa7a0abc Merge pull request #2919 from pbelevich/patch-1 drazi 2026-02-11 11:48:58 +08:00
  • 51935551fb [CuTeDSL] Add sub_packed_f32x2 operation Tri Dao 2026-02-04 21:18:46 +07:00
  • 6b3e607b85 v4.4 release update v2. (#2999) Junkai-Wu 2026-02-04 09:48:31 +08:00
  • de161925a5 pass in stream=-1 yuanxiaolan 2026-02-03 11:55:54 +08:00
  • de198b2419 fix tvm-ffi path in from_dlpack yuanxiaolan 2026-01-22 13:47:45 +08:00
  • 1cfbb53a23 [CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995) Hua Huang 2026-02-03 11:01:41 +08:00
  • a4eb0e05f6 fix performance inssues in cute-dsl examples for 4.4-ctk13.1 release (#2988) dongxiao 2026-01-30 13:31:04 +08:00
  • d252b01300 fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990) myu-guo 2026-01-30 13:30:49 +08:00
  • acb45938e9 Update nvvm API call from nvvm enum to str (#2985) Xiao Song 2026-01-27 17:28:29 +08:00
  • 7a14467776 update api usage (#2969) Xiao Song 2026-01-27 15:33:22 +08:00
  • 51f82812ec Merge pull request #2891 from ColinPeppler/main drazi 2026-01-26 17:38:27 -08:00
  • 9fba3195f9 v4.4 update. (#2979) Junkai-Wu 2026-01-25 00:46:17 +08:00
  • 2fafefb7b9 [Bug Fix]Set NumSplitsM to 1 when TileShapeM < 128 in sm90 fp8 blockwise scaling CollectiveMma (#2965) Qi Yuhang 2026-01-23 15:56:52 +08:00
  • 0edaa6e47d Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial (#2945) Johnsonms 2026-01-22 20:54:12 -08:00
  • 431d070fcb [docs] Add additional tip for generating less kernels in blockwise (#2940) Aidan Do 2026-01-22 20:53:51 -08:00
  • 667446a9dd [Doc]Fix Mode Name and Stride in 0t_mma_atom.md (#2910) Qi Yuhang 2026-01-23 12:53:30 +08:00
  • 3f5bafb326 [Cutlass profiler] Fix SM100 FP8 nosmem epilogue shape_div “Divisibility Condition” for non‑multiple‑of‑64 N tiles (#2946) Aidan Do 2026-01-19 23:27:34 -08:00
  • 1e6da09275 [DOCS] Update docs to precisely describe env stream scenario (#2824) Tianqi Chen 2026-01-19 20:16:37 -05:00
  • e594def95e Don't access data_ptr of fake tensor. Fix EFC w/o epilogue cutlass_api jkosaian 2026-01-14 18:00:08 -08:00
  • 8debf77437 fix: 2305 omissions (#2957) Benjamin Leff 2026-01-13 23:55:05 -06:00
  • e222b2a9b9 Update TVM FFI version jkosaian 2026-01-13 07:58:48 -08:00
  • 87cab7bae2 2026-01-12 updates jkosaian 2026-01-12 18:51:25 -08:00
  • 147f5673d0 New RMS Norm example with unit tests (#2917) Brian K. Ryu 2026-01-12 17:05:31 -08:00
  • 8c52459504 Fix incorrect tensor layout strides in Blackwell MMA tutorial comments (#2921) Johnsonms 2026-01-08 22:02:41 -08:00
  • 0deda34b9f fix typo (#2884) kf-zhang 2026-01-09 13:57:06 +08:00
  • 0d2b201e8c v4.3.5 update. (#2934) Junkai-Wu 2026-01-09 04:02:56 +08:00
  • 4faf1a1568 v4.3.5 update. (#2935) v4.3.5 release/4.3 Junkai-Wu 2026-01-09 04:02:14 +08:00
  • f86feb0aa8 Fix idx2crd docstring (#2914) Wenxuan Tan 2026-01-07 10:11:38 -08:00
  • eb61c91147 Fix CUDA version checking in examples (#2894) Andrew Yooeun Chun 2026-01-07 14:20:37 +09:00
  • 670480df3a Fix SFB Layout scale granularity representation (#2924) Aidan Do 2026-01-06 20:55:21 -08:00
  • 61b560983a remove useless line (#2926) veritas-Qiu 2026-01-07 12:54:08 +08:00
  • 7c09485e25 2026-01-06 updates jkosaian 2026-01-06 04:25:33 -08:00
  • 7127592069 Replace CUDA driver API with runtime API (#2928) dePaul Miller 2026-01-05 10:50:44 -08:00
  • 2aee73922c Minor fix for testing of blockscaled dense GEMM with TMA prefetch (#2930) questa-quan-wang 2026-01-05 16:36:03 +08:00
  • 3d9de19bb7 add constexpr specifier to make_tiled_copy (#2875) tsu-bin 2026-01-04 04:39:43 +08:00
  • b6d7703e02 Refactor binary_op functions to remove unused result parameter Pavel Belevich 2026-01-02 11:23:43 -05:00
  • f9bedd9096 Fix print statement for floor division result Pavel Belevich 2026-01-02 11:15:15 -05:00
  • 1810164f27 Update driver bug workaround description in CHANGELOG v4.3.4 Haicheng Wu 2025-12-24 00:34:25 -05:00
  • 709ccc7b92 Update README.md Haicheng Wu 2025-12-24 00:33:54 -05:00
  • 853ad93d60 Update README.md Haicheng Wu 2025-12-24 00:21:59 -05:00
  • 34a81f0497 Update driver bug workaround description in CHANGELOG Haicheng Wu 2025-12-24 00:20:21 -05:00
  • 3f4c086d09 new example with TMA prefetch feature targeting for DRAM latency bound cases (#2881) questa-quan-wang 2025-12-23 15:29:48 +08:00
  • 9a9dbab522 v4.3.4 update v2. (#2899) Junkai-Wu 2025-12-23 11:29:06 +08:00
  • b7ecaa605d v4.3.4 update v2. (#2898) Junkai-Wu 2025-12-23 11:28:26 +08:00
  • 7233a05f24 v4.3.4 update. (#2893) Junkai-Wu 2025-12-22 00:49:35 +08:00
  • 7f5fe3edf1 v4.3.4 update. (#2892) Junkai-Wu 2025-12-22 00:49:12 +08:00
  • 4b52d37ecd docs: note when DSL dumps are populated Colin Peppler 2025-12-19 17:05:12 -08:00
  • 331e2f451c add missing condition for sync (#2889) dongxiao 2025-12-19 11:00:30 +08:00
  • ebf3165efb [Bug Fix]Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler (#2865) Qi Yuhang 2025-12-18 08:51:38 +08:00
  • dfcb55de16 Fix batch adding for EFC jkosaian 2025-12-16 14:08:23 -08:00
  • ead2fbfe13 Initial commit jkosaian 2025-12-16 10:00:46 -08:00
  • d55f6beeeb Bump version from 4.3.2 to 4.3.3 v4.3.3 Haicheng Wu 2025-12-11 23:59:52 -05:00
  • d4e16f5d4e Bump version from 4.2.1 to 4.3.3 Haicheng Wu 2025-12-11 23:58:38 -05:00
  • 6d4cf6d915 Update version of cutlass_library to 4.3.3 hwu36-patch-1 Haicheng Wu 2025-12-11 23:58:12 -05:00
  • d3a5492381 v4.3.3 update. (#2868) Junkai-Wu 2025-12-11 13:26:58 +08:00
  • 5873443bb6 v4.3.3 update (#2869) Junkai-Wu 2025-12-11 13:26:17 +08:00