Details:
- Modified the aocl_get_reorder_buf_size_ and aocl_reorder_ APIs
to allow reordering from a column-major input matrix.
- Added new pack kernels that pack/reorder the B matrix from a
column-major input format.
- Updated the early-return check conditions to account for the
trans parameters.
- Updated the bench file to test/benchmark transpose support.
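
For illustration only, the indexing behind packing B from a column-major
input can be sketched as below. This is a minimal scalar sketch, not the
kernels added in this change: the function name, the float element type,
and the plain row-major output layout are all assumptions made for the
example. The actual pack kernels may use a different, blocked layout.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the library API.
 * Copies a k x n matrix B stored column-major (leading dimension
 * ldb >= k) into a dense row-major buffer of k * n elements. */
static void pack_b_from_col_major(const float *b, size_t ldb,
                                  size_t k, size_t n, float *packed)
{
    for (size_t i = 0; i < k; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Element (i, j) of a column-major matrix lives at i + j * ldb. */
            packed[i * n + j] = b[i + j * ldb];
        }
    }
}
```

With this layout assumption, the only difference from a row-major source
is the read index (i + j * ldb instead of i * ldb + j); the write side of
the packed buffer is unchanged.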
AMD-Internal: [CPUPL-2268]
Change-Id: Ida66d7e3033c52cca0229c6b78d16976fbbecc4c