* tuning para, * testing on v100 * add fp16 * remove deprecated tensor descriptor * sync with miopen * update build script Co-authored-by: Jing Zhang <jizhan@amd.com>
* Added bwd data v3r1: breaking down compute into a series of load balanced GEMM, and launch in a single kernel * Added bwd data v4r1: like v3r1, but launch GEMMs in multiple kernels * Tweaked v1r1 and v1r2 (atomic) on AMD GPU