* start split k
* add base device class
* add example after merge develop
* add gridwise gemm
* add split k for b matrix
* split=1
* change name for kb
* non-bias result is correct
* add bias only once
* fix register spill
* regularize code
* add fp32 example
* fix for 64-bit index
* fix CheckValidity of gridwise
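
For reference, a minimal CPU sketch of the split-k idea these commits touch: K is cut into slices, each slice accumulates a partial product into C, and the bias is added by one slice only. This is an illustration under assumptions, not the kernel code from this change; the function name `splitk_gemm_bias`, the row-major layout, and the `k_batch` parameter are all hypothetical. Offsets use 64-bit integers so large matrices do not overflow the index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for split-k GEMM with bias: C = A * B + bias.
// A is M x K, B is K x N, both row-major; bias has length N.
// K is divided into k_batch slices; every slice adds its partial sum into C.
std::vector<float> splitk_gemm_bias(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    const std::vector<float>& bias,
                                    std::int64_t M,
                                    std::int64_t N,
                                    std::int64_t K,
                                    std::int64_t k_batch)
{
    assert(k_batch >= 1);
    std::vector<float> c(static_cast<std::size_t>(M * N), 0.0f);

    // Slice length, rounded up so the slices cover all of K.
    const std::int64_t k_per_batch = (K + k_batch - 1) / k_batch;

    for(std::int64_t kb = 0; kb < k_batch; ++kb)
    {
        const std::int64_t k_begin = kb * k_per_batch;
        const std::int64_t k_end   = std::min(K, k_begin + k_per_batch);

        for(std::int64_t m = 0; m < M; ++m)
        {
            for(std::int64_t n = 0; n < N; ++n)
            {
                float acc = 0.0f;
                for(std::int64_t k = k_begin; k < k_end; ++k)
                {
                    acc += a[static_cast<std::size_t>(m * K + k)] *
                           b[static_cast<std::size_t>(k * N + n)];
                }

                // Only the first slice contributes the bias, so it is added once
                // no matter how many slices K is split into.
                if(kb == 0)
                {
                    acc += bias[static_cast<std::size_t>(n)];
                }

                c[static_cast<std::size_t>(m * N + n)] += acc;
            }
        }
    }
    return c;
}
```

With `k_batch = 1` this reduces to an ordinary GEMM plus bias, which makes it a convenient baseline for checking that larger split values give the same result.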