Mirror of https://github.com/ROCm/composable_kernel.git (synced 2026-05-14 10:09:41 +00:00)
* Re-mapping thread block indices for causal=True kernels
* Use more intuitive remap_opt value
* Fall back to the original remapping if seqlen_q >= 64K
* Use GenericAttentionMask to reduce mask computation
* Avoid unnecessary boundary check for IsMasking=false case
* Fix wrong kernel entry specifier
* Add s_nop to prevent delaying wave0-3
* Refine scheduling
* Remove unnecessary sched_group_barrier()
* Move sched_group_barrier() call to scheduler
* Replace inline asm s_setprio with intrinsics
* Rephrase comments
* Expand some o_acc rescaling instructions to avoid SIMD idle time
* Fix block idx special mapping logic
* Tune block index mapping for causal=False cases
* Tune block index mapping for causal=True cases
* Fix wrong vmcnt()
* Remove parameter name
* Use a boolean option to turn the causal mask on/off
* Update benchmark_fwd_v3.sh option usage
* Add option if the compiler supports it
[ROCm/composable_kernel commit: 7fbc9d6c97]