* Add a GPU GEMM reference kernel
* Switch to GPU reference in GEMM examples
* Remove redundant arguments
* Update all related examples
* Update more examples
* Try fewer threads per block
* Try even fewer threads per block
* Add support for all matrix layouts
* Increase block size
* Clean up
* Remove hardcoded strides
* Clean up
* Try a column-major case
* Revert to row-major
* Run both CPU and GPU verification
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Work around NaN problem by changing output to fp16
* Enable f8/bf8 gemm tests on MI200
* Work around f16 to f8 conversion
---------
Co-authored-by: Jing Zhang <jizha@amd.com>