* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances
* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces
* Remove DeviceSoftmaxImpl's depending on DeviceReduceMultiblock
* Split the cpp and hpp files for reduction instances to enable more parallel compiling
* Remove the using of macros for declaring reduction instances and instance references
* Update to add_device_reduce_instance_xxxx templated functions
* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to repace the ReduceOpId in defining add_reduce_instance_xxxx() templates
* Change return format