Commit Graph

  • e8ecfad75b add tileN = 8,16 for SM120 blockscale GEMM. (#3292) main Brayden Zhong 2026-06-26 15:16:02 -07:00
  • 12ff513cea [CuTeDSL] Add a render function hook to allow render layout natively (#3135) Kaining Zhong 2026-06-26 14:14:55 -05:00
  • d4b4b494c3 [CLI] Recover ssd and blockwise group gemm perf (#3344) myu-guo 2026-06-24 08:42:36 +08:00
  • c88b280fbf add fp4_x2 example (#3043) drazi 2026-06-23 17:56:23 +08:00
  • 8f50b052e1 Fix license. (#3328) Junkai-Wu 2026-06-23 10:07:29 +08:00
  • 2b426e9065 add one more item in changelog v4.2.2 release/4.2 Haicheng Wu 2026-06-21 09:41:21 -07:00
  • bed71224ed Fix for blockwise gg nosmem epi && no sfd with nosmem GG epilogues Haicheng Wu 2026-06-20 14:27:27 -07:00
  • f93e1cea92 remove white space Haicheng Wu 2026-06-20 06:53:20 -07:00
  • 02d166e39c port nvrtc change to version.h Haicheng Wu 2026-06-14 21:59:26 -07:00
  • 1afc6d355b port nvrtc change to version.h v4.3.6 release/4.3 Haicheng Wu 2026-06-14 21:59:26 -07:00
  • cf064d2e6b Update tensorop_gemm.py (#3322) minas-nv 2026-06-16 08:33:00 -07:00
  • becfce08cd Enable tcgen05 blockscaled ops on Thor SM110 (#3283) xiangg-nv 2026-06-16 15:01:25 +08:00
  • b18df7d206 update to 4.4.3 v4.4.3 release/4.4 Haicheng Wu 2026-06-15 20:40:53 -07:00
  • 645a98d263 port nvrtc change to version.h Haicheng Wu 2026-06-14 21:59:26 -07:00
  • 39b352fa93 v4.6 dev update. (#3315) Junkai-Wu 2026-06-16 11:23:20 +08:00
  • db1c288993 Update sm100 MMA desc offsetting (#3299) v4.5.2 release/4.5 ANIKET SHIVAM 2026-06-08 19:12:36 -07:00
  • 67cb24cfe3 port nvrtc change to version.h Haicheng Wu 2026-06-14 21:59:26 -07:00
  • 0ce648f53f [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280) Tyler 2026-06-13 16:10:47 -05:00
  • 93774d3da5 Remove CUTLASS_HOST_DEVICE from CudaHostAdapater::memsetDevice (#3286) Alex Georgiev 2026-06-11 12:33:47 -04:00
  • 1fc71b3ed1 Update sm100 MMA desc offsetting (#3299) ANIKET SHIVAM 2026-06-08 19:12:36 -07:00
  • d80a4e53b5 fix validation codes (#3303) brandonsun 2026-06-05 20:16:02 +08:00
  • 2599f2975b [CLI] quick fix for fmha compile options (#3295) Linfeng Zheng 2026-06-03 17:56:17 +08:00
  • 0e9ac0734c Fix example with upcoming release (#3293) Anakin(Yancheng) Zheng 2026-06-03 13:40:27 +08:00
  • 0bdd5cf8fb fix example issue (#3294) xiufanl 2026-06-03 11:02:00 +08:00
  • 423904d717 [CLI] Recover fmha perf (#3291) Linfeng Zheng 2026-06-03 08:55:57 +08:00
  • 25e252bdce replace deprecated apis (#3285) brandonsun 2026-06-01 08:58:58 +08:00
  • 1732ed7da3 [CuTeDSL] Make @cute.struct instances flattenable across scf.if / scf.while (#3270) George Karpenkov 2026-05-28 17:34:48 -07:00
  • 9c1d0965f8 Add Blackwell GeForce blockscaled GEMM examples (#3272) bangyu shen 2026-05-28 04:06:52 +08:00
  • 47ba7a738b v4.5.2 update. (#3265) Junkai-Wu 2026-05-27 10:33:15 +08:00
  • 5c54bee12b v4.5.2 update. (#3264) Junkai-Wu 2026-05-27 10:32:26 +08:00
  • 60b9659133 [CLI] add support for sm100 blockscaled gemm (#3274) Caleb_Du 2026-05-27 09:33:26 +08:00
  • 5f06f5fc1a fix elect_sync api (#3262) Longsheng Du 2026-05-26 08:50:00 +08:00
  • e45ccb1226 [CLI] Update FMHA & improve perf (#3251) Linfeng Zheng 2026-05-25 15:56:08 +08:00
  • 546c3efa89 Fix examples and pytest, run ruff (#3230) dePaul Miller 2026-05-20 20:05:38 -07:00
  • 982cb9e718 v4.5.1 update. (#3237) Junkai-Wu 2026-05-19 10:35:08 +08:00
  • 2e602843e7 v4.5.1 update. (#3238) v4.5.1 Junkai-Wu 2026-05-19 10:33:27 +08:00
  • e406c186f5 Fix typo in README.md for mixed precision support v4.5.0 Haicheng Wu 2026-05-12 23:34:47 -04:00
  • bafd9e53b5 Fix typo in CHANGELOG.md for mixed precision Haicheng Wu 2026-05-12 23:34:14 -04:00
  • 971d1ed8b7 fix for thor (#3224) Observer007 2026-05-13 09:06:44 +08:00
  • ef120d0d09 update to 4.5 (#3228) Haicheng Wu 2026-05-12 02:44:22 -04:00
  • 46fcd7eb18 update to 4.5 fix/4.5_update Haicheng Wu 2026-05-11 20:43:08 -07:00
  • c775e566bd fix: exclude SM70/72 from CUTLASS_NVCC_ARCHS_SUPPORTED on CUDA >= 13.0 (#3166) Vensen 2026-05-12 09:10:16 +07:00
  • 2e56847d72 Add Snake activation functor for EVT (#3184) Emre Albayrak 2026-05-12 05:09:53 +03:00
  • 1d9e1f6d7a [CuTeDSL] Fix loop carried target scope (#3200) TungtungQia 2026-05-11 16:02:26 +08:00
  • ae6bccf341 [CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206) questa-quan-wang 2026-05-07 15:03:37 +08:00
  • cb37157db5 v4.5 tag update (#3202) Junkai-Wu 2026-05-06 08:55:27 +08:00
  • f74fea9ce3 [Hopper CuTeDSL] Add FP8 GEMM with 2xAcc (#3149) Johnsonms 2026-04-25 13:10:33 -07:00
  • 7a9fe055cb fix: Add missing kElementsPerAccess division in RegularTileIterator store (#3049) Blake Ledden 2026-04-24 20:27:40 -07:00
  • 9135a9bb6d Replace std::min with cute::min in sm120 blockwise scaling device functions (#3055) Vrushtee 2026-04-24 20:43:38 +05:30
  • b46b16d003 Small Tile N BlockScaled GEMM + Grouped GEMM (#3176) dePaul Miller 2026-04-21 09:32:40 -07:00
  • aeba0d3723 correct BLayout stride in SM80 m16n8k32 int4 MMA traits (#3140) zfm 2026-04-21 17:17:03 +08:00
  • ea46e277d2 Add absf and floor to cute.math (#3156) Nandor Licker 2026-04-17 03:54:24 +03:00
  • 3f3db08a0a Add support for empty dataclass arguments (#3152) Nandor Licker 2026-04-17 03:47:47 +03:00
  • 97e682e651 Placeholder change. oss_ci Zekun Fan 2025-11-14 16:09:03 -08:00
  • 08185b9c3e Update blackwell tutorial to be compatible with 4.5-dev version (#3130) Longsheng Du 2026-04-09 14:40:33 +08:00
  • bd01dd3651 Update the release note for 4.5 dev (#3154) brandonsun 2026-04-08 10:02:46 +08:00
  • a221da7ccf v4.5 dev update. (#3153) Junkai-Wu 2026-04-08 00:16:05 +08:00
  • 418d38a5de PR update (#3103) Katja Sirazitdinova 2026-04-02 14:00:41 +04:00
  • 4ca61d0662 [CuTeDSL] Add dataclass example: passing pointers via frozen dataclass (#3070) drazi 2026-03-30 15:08:36 +08:00
  • baea077e42 Merge pull request #3126 from keithzzzzz/main drazi 2026-03-24 23:56:23 +08:00
  • ecb32fe231 [CLI] Fix tutorial issues Zheng Linfeng 2026-03-24 00:12:01 -07:00
  • 982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091) Johnsonms 2026-03-17 21:40:15 -07:00
  • da5e086dab v4.4.2 update. (#3105) v4.4.2 Junkai-Wu 2026-03-17 12:58:41 +08:00
  • 1b741cabaa v4.4.2 update. (#3104) Junkai-Wu 2026-03-17 12:58:19 +08:00
  • 772fbb264e [CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106) Linfeng Zheng 2026-03-17 10:11:55 +08:00
  • 087c84df83 docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047) Blake Ledden 2026-03-11 19:29:46 -07:00
  • 73c59c055c Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092) dePaul Miller 2026-03-06 17:36:29 -08:00
  • e5fcd125a5 [fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087) Johnsonms 2026-03-05 01:20:26 -08:00
  • a93d86ec83 Fix finding cuDNN (#2890) TLescoatTFX 2026-03-05 02:51:37 +01:00
  • 49e54f2b23 fix: add_help=False in temporary parser (#2721) David W.H. Swenson 2026-03-02 01:33:42 -06:00
  • b9847690c5 Merge pull request #3028 from SzymonOzog/patch-3 drazi 2026-02-28 10:11:05 +08:00
  • 4370102f9d v4.4.1 update (#3080) v4.4.1 Junkai-Wu 2026-02-28 03:01:08 +08:00
  • 3bb6e28d3c v4.4.1 update (#3079) Junkai-Wu 2026-02-28 02:59:21 +08:00
  • c651d660d2 fix typo (#3012) Tianqi Zhang (张天启) 2026-02-27 16:25:35 +08:00
  • 518327d631 Fix error in Blackwell document of referring to Mxf4 format as NVF4 (#2977) Ziang Li 2026-02-27 00:25:16 -08:00
  • de67bb7a42 Fix example in CuTe tutorials (#2752) StevenYangCC 2026-02-27 16:24:34 +08:00
  • edf2f82c00 Fix register index bug in mma.sync.aligned.m16n8k16 (#2740) Neil Kichler 2026-02-27 09:24:18 +01:00
  • 79345359a7 Fix debug typo in sgemm_2.cu and sgemm_sm70.cu (#2678) mnehete32 2026-02-27 13:53:59 +05:30
  • 8b9b3d78df fix typo in documentation (#2671) zkyue 2026-02-27 16:23:37 +08:00
  • fc5bbc2dab Fix typo in cute.nvgpu.warpgroup.mma doc (#2548) Gabriel Wu 2026-02-27 16:22:55 +08:00
  • 057635de5c Remove redundant dsl example. (#3074) Junkai-Wu 2026-02-26 21:10:59 +08:00
  • c213bfdfc1 Remove redundant dsl examples. (#3071) v4.4.0 Junkai-Wu 2026-02-26 11:42:01 +08:00
  • 954503d44c Bump version to 4.4.0 Haicheng Wu 2026-02-25 00:04:04 -05:00
  • 6c4200f1bc Bump version from 4.3.5 to 4.4.0 Haicheng Wu 2026-02-25 00:03:23 -05:00
  • de93e8a4ac Bump version from 4.3.5 to 4.4.0 Haicheng Wu 2026-02-25 00:03:04 -05:00
  • b92b9f0d37 Bump version from 4.3.5 to 4.4.0 Haicheng Wu 2026-02-25 00:02:41 -05:00
  • 2aedca6f5e Bump CUTLASS version to 4.4.0 Haicheng Wu 2026-02-25 00:01:56 -05:00
  • ae5ed7361b Bump CUTLASS version to 4.4.0 hwu36-patch-2 Haicheng Wu 2026-02-25 00:01:26 -05:00
  • 6450964b57 Update README Haicheng Wu 2026-02-24 23:55:55 -05:00
  • 284449fa5b Revise chagnelog Haicheng Wu 2026-02-24 23:54:56 -05:00
  • 0853d81d70 Revise README Haicheng Wu 2026-02-24 15:32:17 -05:00
  • 3476ddb7bd remove mixed_input_fmha_prefill (#3041) Linfeng Zheng 2026-02-18 20:59:01 +08:00
  • 291300ffff [CuTeDSL] implment a cta-level norm example (both layernorm and rmsnorm) (#3009) Yihan Chen 2026-02-14 17:54:03 +08:00
  • f9a5f76b7a Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027) aragorn-guan 2026-02-14 17:51:20 +08:00
  • ec7e6cb17b Merge pull request #2971 from rsmallblue/tvm-ffi drazi 2026-02-14 14:14:10 +08:00
  • 395ab575f6 Merge branch 'main' into tvm-ffi Yuan Xiaolan 2026-02-14 13:35:28 +08:00
  • d4bbf728ca v4.4 tag release update. (#3032) Junkai-Wu 2026-02-14 12:27:58 +08:00
  • beb80e04e1 Add option to not suffix prints with new line Szymon Ożóg 2026-02-13 15:56:50 +01:00
  • 01687cfba1 Merge pull request #3004 from tridao/add-sub-packed-f32x2 drazi 2026-02-13 20:46:26 +08:00
  • 5c42d0f28c Merge pull request #3021 from tridao/clc_no_multicast drazi 2026-02-13 20:45:52 +08:00