mirror of
https://github.com/kvcache-ai/custom_flashinfer.git
synced 2026-06-29 10:47:12 +00:00
This PR fixes issue #960. We identified several performance bottlenecks in our Python APIs when kernels are not captured by CUDAGraph:

1. The device guard in Python is slow (`with input.device as device:`).
2. Getting the current CUDA stream in Python is time-consuming.

These issues were introduced in the JIT refactor after v0.1.6 (which was mainly aimed at accelerating JIT compilation). In this PR, we changed back to getting the stream and applying the device guard in C++.

@MichoChan @xiaoqi35
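
For readers who want to gauge this kind of per-call Python overhead themselves, here is a minimal micro-benchmark sketch. It is not code from this PR: `per_call_overhead_us`, `fake_kernel`, `bare_call`, and `guarded_call` are hypothetical names, and `contextlib.nullcontext` is only a stand-in for the device guard so that the snippet runs without a GPU.

```python
import time
from contextlib import nullcontext


def per_call_overhead_us(fn, iters=100_000):
    """Return the mean wall-clock cost of one fn() call, in microseconds."""
    fn()  # warm-up call so one-time setup cost is not measured
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6


def fake_kernel():
    """Hypothetical stand-in for a cheap, already-compiled kernel launch."""
    return None


def bare_call():
    # Baseline: call the kernel wrapper with no Python-side guard.
    fake_kernel()


def guarded_call():
    # Assumption: nullcontext stands in for the Python device guard.
    # Entering and exiting any context manager adds per-call work.
    with nullcontext():
        fake_kernel()


if __name__ == "__main__":
    print(f"bare:    {per_call_overhead_us(bare_call):.3f} us/call")
    print(f"guarded: {per_call_overhead_us(guarded_call):.3f} us/call")
```

To measure the actual bottlenecks on a CUDA machine, one could swap the stand-in for the real device guard and add a call to `torch.cuda.current_stream()` inside `guarded_call`, assuming PyTorch with CUDA support is installed. The absolute numbers depend on the machine and the PyTorch version, so only the difference between the two measurements is meaningful.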