Po Yen Chen
|
7fbc9d6c97
|
[CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833)
* Re-mapping thread block indices for causal=True kernels
* Use more intuitive remap_opt value
* Fall back to original remapping if seqlen_q >= 64K
* Use GenericAttentionMask to reduce mask computation
* Avoid unnecessary boundary check for IsMasking=false case
* Fix wrong kernel entry specifier
* Add s_nop to prevent delaying wave0-3
* Refine scheduling
* Remove unnecessary sched_group_barrier()
* Move sched_group_barrier() call to scheduler
* Replace inline asm s_setprio with intrinsics
* Rephrase comments
* Expand some o_acc rescaling insts to avoid SIMD idle
* Fix block idx special mapping logic
* Tune block index mapping for causal=False cases
* Tune block index mapping for causal=True cases
* Fix wrong vmcnt()
* Remove parameter name
* Use boolean option to turn causal mask on/off
* Update benchmark_fwd_v3.sh option usage
* Add option if compiler supports it
|
2025-09-16 11:32:38 +08:00 |
|