* Refine the device batchnorm-backward base API templates and data type assignments
* Remove duplicated kernel file
* Add batchnorm-backward instances and external API
* Add batchnorm-backward profiler and tests
* Add client example that uses the batchnorm-backward external API
* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
* Loosen the threshold for batchnorm-backward check_err()