Adding allreduce kernel code for message sizes smaller than 32 bytes,
when the number of elements are smaller than the number of ranks.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>