custom_flashinfer

kvcache-ai/custom_flashinfer

mirror of https://github.com/kvcache-ai/custom_flashinfer.git synced 2026-06-29 10:47:12 +00:00

Commit Graph

Branches

AOT

GQA_var_batch

MLA_cache_var_batch

batchprefill_varlen

change_dir

cuda_graph_wjh

feat-cuda_graph_var_batch-zbx

fix-precision-mla-merge-main

main

optimize-mask

rope-wjh

18fd91a74a bugfix: Fix the bug of the kernel-selection heuristic in trtllm-gen (#1307) main Perkz Zheng 2025-07-23 16:33:23 +08:00
9c609f050c Update cutlass fp4 moe kernels (#1294) Shu Wang 2025-07-23 02:37:44 -05:00
f0b235c016 Heuristics + testing unification + CUDA Graphs (#1306) azhurkevich 2025-07-23 00:23:21 -07:00
238f1d264d feat: SM level profiler (#1305) Wenxuan Tan 2025-07-22 21:39:14 -05:00
60cf6e45e2 minor: some fix and cleanup for trtllm-gen mha (#1302) eigen 2025-07-22 17:43:55 -07:00
63a3074c70 bugfix: ensure graph is captured and executed on the same stream to avoid rep… (#1303) Elfie Guo 2025-07-22 17:43:49 -07:00
04b9a2a459 feat: Remove FAST_BUILD FLAG for MOE (#1291) Shu Wang 2025-07-22 12:26:25 -05:00
74f5dcc227 refactor: refactor trtllm-gen attention kernel integration code (#1289) Zihao Ye 2025-07-22 10:25:24 -07:00
3f99f18670 perfix: use lightweight API to query device property (#1298) azhurkevich 2025-07-21 23:20:55 -07:00
6bb969b5c0 fix: minor errors in cubin loader (#1295) eigen 2025-07-21 16:11:31 -07:00
927a41e8b1 feat: add mm_fp4 use cudnn backend (#1288) Xiaodong (Vincent) Huang 2025-07-21 15:35:17 -07:00
8587c21e8f refactor: Improved metainfo for trtllm-gen fmha (#1292) Yaxing Cai 2025-07-21 12:19:38 -07:00
fe29ed63cb bugfix: guard fp8 e8m0 and e2m1 compile (#1287) Wenxuan Tan 2025-07-21 01:37:59 -05:00
de55a8f56e hotfix: patch error handling (#1293) Alex Yang 2025-07-20 00:03:40 -07:00
1d72ed4076 Convert scale_factor from scalar to Tensor in trt_allreduce_fusion (#1284) Ilya Markov 2025-07-18 20:49:31 +02:00
39d81f779b feat: Add shuffle matrix flag (#1272) Alex Yang 2025-07-18 10:16:49 -07:00
07de0ac164 Fix install folder regression, and JIT-vs-AOT differences (#1279) Jo Shields 2025-07-18 13:09:31 -04:00
e885dd0c98 bugfix: fix multiCtasKvScratchPtr misalignment issue (new one) (#1286) Po-Han Huang (NVIDIA) 2025-07-19 01:01:14 +08:00
1e9a41ad7f fix: update trtllm-gen fmha benchmark (#1280) eigen 2025-07-18 02:00:13 -07:00
513c6139f5 refactor: Unify groupwise fp8 GEMM test (#1281) Yaxing Cai 2025-07-17 22:32:17 -07:00
172628ce8b hotfix: fix deepgemm artifactory hash (#1278) Yaxing Cai 2025-07-17 16:28:07 -07:00
a21ffbe62f fix: Add missing import in comm/__init__,py (#1275) Mehdi Amini 2025-07-18 00:51:10 +02:00
4e9da5d9ee feat: add masked deepgemm support and benchmarking (#1266) Yaxing Cai 2025-07-17 04:01:24 -07:00
6f3b59ff6d feat: add trtllm-gen context attention (#1239) Lain 2025-07-17 03:31:55 -07:00
3c40456eff feat: Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output (#1242) weiliang 2025-07-17 01:50:19 +08:00
6ebbf7fc17 feat: enable trtllm-gen mla MTP (#1258) eigen 2025-07-16 03:58:21 -07:00
0ff65d8aee CI: install nvidia-nvshmem-cu12 (#1262) Emilien Macchi 2025-07-16 06:11:42 -04:00
96801b2cf3 feat: sm100 low latency nvfp4 kernels (#1214) azhurkevich 2025-07-16 03:11:07 -07:00
9c9d7fd3cb feat: add gemm fp8 using cudnn backend (#1264) Xiaodong (Vincent) Huang 2025-07-16 02:41:25 -07:00
f4cca1f64b refactor: Made AR output optional + esthetic changes (#1265) Maximilien Breughe 2025-07-16 01:59:30 -05:00
f153369384 fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) (#1254) vlev02 2025-07-15 21:24:42 +08:00
28741b776e refactor: Reduce the JIT compilation time of gen_gemm_sm100_module (#1251) Jinyang Yuan 2025-07-15 15:18:55 +08:00
3f8317c67a feat: TRT-LLM's Multi-Node NVLink AR + fused RMSNorm kernel (#1255) Maximilien Breughe 2025-07-15 02:10:52 -05:00
2d50d96653 release: bump version to v0.2.8 (#1257) Zihao Ye 2025-07-14 23:50:41 -07:00
ce68e1d0cc feat: support environment variable overrides for NVSHMEM paths and linker flags (#1253) Emilien Macchi 2025-07-14 22:37:49 -04:00
740bf33742 Defer mpi import for comm module (#1250) Zihao Ye 2025-07-14 12:22:54 -07:00
9c7774d155 fix: Remove sm100+ requirment for trtllm allreduce kernels (#1249) Zihao Ye 2025-07-14 11:05:13 -07:00
6596f954a4 feat: add prebuilt DeepGEMM kernels (#1209) Yaxing Cai 2025-07-14 02:26:18 -07:00
1d9785be40 feat: add trtllm-gen mla cubin (#1222) eigen 2025-07-13 23:54:40 -07:00
ff0270cefa feat: Support MXFP8 x MXFP4 CUTLASS grouped GEMM (#1241) Jinyang Yuan 2025-07-13 15:06:24 +08:00
a03c2909dc [comm] TRT-LLM's Multi-Node NVLink All-Reduce Kernel (#1213) Maximilien Breughe 2025-07-10 21:58:47 -05:00
d0c104d680 hotfix: Patch fp8 cubin availability (#1240) Alex Yang 2025-07-10 19:58:20 -07:00
bd74e15cce feat: trtllm-gen fp8 moe kernels (#1212) Alex Yang 2025-07-10 13:15:54 -07:00
728e8bb3ef bugfix: support uint8_t for vec_t class template (#1234) Yang Chen 2025-07-08 10:13:24 -07:00
04f97584d7 bugfix: Fix building without get_requires*() invocation (#1226) Michał Górny 2025-07-08 10:05:04 +02:00
b08224dcd9 docker: add cuda-python to CI docker image (#1233) Zihao Ye 2025-07-08 00:56:06 -07:00
9157d0514f Enable cudnn decode and add tests for the cudnn decode kernel (#1221) Anerudhan Gopal 2025-07-08 00:27:42 -07:00
fe62e5614b minor: update oneshot handling, add params notes (#1232) eigen 2025-07-07 23:54:23 -07:00
c77d3e11ad feat: Add non-causal cudnn prefill kernels (#1230) Anerudhan Gopal 2025-07-07 23:53:45 -07:00
4b0e0fbad9 fix: add trtllm-allreduce-fusion api notes and fix memory error (#1229) eigen 2025-07-07 22:45:41 -07:00
7ac6658c83 bugfix: add logits processor to pyproject.toml (#1224) Zihao Ye 2025-07-07 20:13:05 -07:00
644eda67cf Fix missing hash in the cudnn cubin path (#1227) Anerudhan Gopal 2025-07-07 16:56:35 -07:00
21afaa78b9 Handle allocation cutlass fused MoE output to caller (#1225) Shu Wang 2025-07-07 12:38:42 -05:00
aff79d1cd7 update trtllm-gen decode attention kernel launcher (#1189) Shu Wang 2025-07-07 11:54:24 -05:00
3b01face12 bugfix: fix blackwell fmha hanging issue for empty kv_len (#1198) Zihao Ye 2025-07-06 15:45:23 -07:00
8c2d5ef159 misc: minor adds in readme (#1218) eigen 2025-07-05 20:09:31 -04:00
f71ce8125d [TVM] Remove enable_pdl from TVM binding interface (#1217) Ruihang Lai 2025-07-05 15:59:03 -04:00
8dd4ed2c25 bugfix: Fix test_groupwise_scaled_gemm_fp8.py (#1211) Jinyang Yuan 2025-07-04 01:01:30 +08:00
ed95a4aec6 bugfix: Fix the issue with auxillary kernel launch and grid dim calculation (#1208) Anerudhan Gopal 2025-07-02 22:59:56 -07:00
ef197c0fee feat: enable and update all-reduce fused quantization (#1164) eigen 2025-07-02 11:35:49 -04:00
16cb9e493d [fix] fix BatchAttention CTA_TILE_KV mask issue (#1206) Yilong Zhao 2025-07-01 16:38:45 -07:00
3fb73b3cd5 chore: bump flashinfer v0.2.7.post1 (#1205) Yineng Zhang 2025-07-01 07:51:40 -07:00
421d0615fd Fix flashinfer.comm module missing (#1203) Xiaoyu Zhang 2025-07-01 16:17:21 +08:00
ece99cccef Feature/cudnn dynamic cubin (#1187) Anerudhan Gopal 2025-06-30 19:38:51 -07:00
40db3fadfd [feat] optimize persistent batch attention perf. (#1200) Yilong Zhao 2025-06-30 15:10:25 -07:00
4d3fb6d561 chore: bump v0.2.7 (#1199) Yineng Zhang 2025-06-30 12:38:15 -07:00
2abd1acca0 bugfix: fix broken docs build by adding missing dependencies (#1197) Yi Pan 2025-07-01 00:38:11 +08:00
3f1d096c42 fix: trtllm_comm module aot arch issues (#1196) eigen 2025-06-30 06:23:32 -04:00
2c894d25be feat: support green ctx creation by a list of SM counts (#1190) Yi Pan 2025-06-30 15:29:42 +08:00
e20978e053 bugfix: fix invalid blackwell fmha unittests (#1181) Zihao Ye 2025-06-27 11:18:16 -07:00
4ec2116e58 [feat] support block sparse attention w/ variable block sizes and head-wise sparse patterns (#1177) Yilong Zhao 2025-06-26 17:37:57 -07:00
aa528fe652 [CI] Update is_last_build (#1183) Yong Wu 2025-06-26 15:29:45 -07:00
1b9ba25415 bugfix: softmax NaN results caused by large -inf masks (#1178) Shanli Xing 2025-06-26 04:00:17 -04:00
f70b66dc82 add nvshmem sum_reduce for mnnvl allreduce (#1152) Amir Samani 2025-06-25 12:44:00 -07:00
71509faeb0 Expose fp4 blockscale swizzling kernel (#1176) Shu Wang 2025-06-25 00:44:19 -05:00
09a23d9d58 feat: logits processor fustion rule for temperature softmax (#1170) Shanli Xing 2025-06-24 22:15:45 -04:00
a61ef7ba44 bugfix: Fix missing symbols in trtllm_utils.so (#1168) Christian Heimes 2025-06-25 01:07:38 +02:00
28cf1aa7f1 feat: nvshmem python bindings (#1160) Zihao Ye 2025-06-24 14:19:00 -07:00
3dd4f03df8 feat: MNNVL AllToAllV communication operator support (#1134) Yaxing Cai 2025-06-24 09:17:11 -07:00
9e81467d54 feat: Fused temperature online softmax kernel (#1153) Shanli Xing 2025-06-24 00:43:27 -04:00
27060628c3 feat: experimental support of green ctx (#1163) Zihao Ye 2025-06-22 23:43:09 -07:00
ba2470c51f feat: add finalize_moe_allreduce from trtllm (#1159) eigen 2025-06-21 23:23:50 -04:00
15b3e65122 refactor: communication module (#1162) eigen 2025-06-20 00:04:24 -04:00
f230eb6636 Add fp4 quantization swizzling tests (#1157) Shu Wang 2025-06-19 21:55:27 -05:00
ac78bc3f0a feat: update non-fused moe (#1161) eigen 2025-06-19 19:20:35 -04:00
0ea82ec068 Add more logging to TRTLLM-GEN debug trace (NFC) (#1158) Mehdi Amini 2025-06-19 22:20:49 +02:00
28d0843e2d feat: add trtllm all-reduce fusion (#1131) eigen 2025-06-18 22:08:38 -04:00
0a754ce4fc feat: add trtllm moe_allreduce_fusion (#1108) eigen 2025-06-17 05:23:14 -04:00
8e204cf4a3 ci: Install mpi4py (#1149) Yong Wu 2025-06-16 20:53:21 -07:00
ff6808cc1d bugfix: fix precision errors when applying causal mask on Qwen-2.5 series models (#1148) Yilong Zhao 2025-06-16 08:29:40 -07:00
fa519c0ca3 misc: remove sync between persistent runners and use packed_causal_kv_end for SM90Plan (#1146) Wenxuan Tan 2025-06-16 01:43:37 -05:00
0fbd03bf99 feat: Add support for FLASHINFER_EXTRA_LDFLAGS environment variable (#1144) Jennifer Zhou 2025-06-16 00:02:10 +00:00
5c3e4e8574 bugfix: Fix FA2 and FA3 multi-item scoring and cuda illegal memory access error (#1140) Arup De 2025-06-12 20:51:00 -07:00
568ab6ceab [feat] add unified batch attention w/ correctness tests. (#1137) Yilong Zhao 2025-06-11 22:12:12 -07:00
35aaabb98d refactor: use functools.cache instead of global dict for caching modules (#1135) Zihao Ye 2025-06-11 19:22:50 -07:00
a2d803a4a8 fix: sync after create_workspace (#1138) eigen 2025-06-11 21:06:00 -04:00
f484fd3c7f fix: negative zero by type trait --> binary value (#1136) eigen 2025-06-11 16:12:32 -04:00
ac74c5104a 0611,merge upstream, support aot AOT qiyuxinlin 2025-06-11 12:56:22 +00:00
2f01a9a35d [Feature] Support PDL for batch Prefill and Decode (#1117) Wenxuan Tan 2025-06-10 17:02:10 -05:00
47beb43464 misc: correct runllm widget (again) (#1133) Ruihang Lai 2025-06-10 01:54:21 -04:00

