Zihao Ye 61e049a02e perf: Fix python API overhead when CUDAGraph is not enabled (#969)
This PR fixes issue #960. We identified several performance bottlenecks
in our Python APIs when kernels are not captured by CUDAGraph:
1. The device guard in Python is slow (`with input.device as device:`).
2. Getting the current CUDA stream in Python is time-consuming.

These issues were introduced in the JIT refactor after v0.1.6 (done mainly
to speed up JIT compilation). In this PR, we move the stream lookup and the
device guard back into C++.
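
A rough way to observe the two host-side costs listed above is a small timing
loop like the one below. This is a sketch and not part of the PR: the helper
names `per_call_us` and `measure_host_overhead` are hypothetical, and it
assumes PyTorch 2.0+ (where `torch.device` can be used as a context manager)
with a CUDA device. It returns `None` when either is missing.

```python
import time


def per_call_us(fn, iters=10_000):
    """Average wall-clock cost of one call to fn, in microseconds."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e6


def measure_host_overhead(iters=10_000):
    """Time the Python-side device guard and stream lookup.

    Hypothetical helper: returns None if PyTorch or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    x = torch.empty(1, device="cuda")

    def guard_only():
        # Same pattern as the commit message: torch.device as a context manager.
        with x.device:
            pass

    def stream_only():
        # Look up the current CUDA stream from Python.
        torch.cuda.current_stream(x.device)

    return {
        "device_guard_us": per_call_us(guard_only, iters),
        "current_stream_us": per_call_us(stream_only, iters),
    }


if __name__ == "__main__":
    print(measure_host_overhead())
```

No kernel is launched here, so the numbers reflect only the per-call Python
overhead that is paid on every API call when CUDAGraph is not replaying the
work.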

@MichoChan @xiaoqi35
2025-03-23 21:19:35 -07:00