* add an example of customized bfp16_rtn

* fixed threadwise_copy

---------

Co-authored-by: Jing Zhang <jizha@amd.com>