* Avoid passing indices (std::vector) by value to host tensor's operator()
Each access requires 2 allocations and copies of the vector.
* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification
* Compute ds_hp_host_ref in parallel
This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.
* Do not use ForEach for simple copy and conversion of large tensors
These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
* Adding seed and offset pointer support to the philox random number generator.
* Separating seed and offset pointer checks with different condition statements.
* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.
* Correcting a typo in the readme file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>