* Grouped gemm + Gelu instances.
* Device Instance Factory for GroupedGemm+Gelu
* Client example
* Rangify fill helper functions.
* Fix name clash.
* Profiler for grouped_gemm+gelu
* No need to use full namespace name.
* Add check for MRaw divisible by vector load.
* Ugly fix for big errors.
* Add grouped_gemm+gelu to profiler CMakelists.
* Store in argument additional info.
* Information about Mraw, Nraw, Kraw values.
* Use FastGelu instead of Gelu.
* Change client ex to use FastGelu
* Remove relaxed error precision.
* Remove duplicate output elementwise-op
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>