Details:
- Added new params (order, trans) to the aocl_get_reorder_buf_size_ and
aocl_reorder_ APIs.
- Added new pack kernels that pack the A matrix from either a row-major
or column-major input matrix into a pack buffer in row-major format.
- Updated cntx with pack kernel function pointers for packing A matrix.
- Transpose of the A matrix is handled by packing A into row-major
format at run time.
- Updated early-return check conditions to account for trans parameters.
- Updated bench file to test/benchmark transpose support.
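
Illustration of the packing idea described above, as a minimal sketch
only. The function name pack_a_row_major, its signature, and the float
element type are hypothetical and are not the kernels added by this
change. It assumes the input is described by a row stride and a column
stride, so a column-major (or transposed row-major) A differs from a
row-major A only in which stride is 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the actual pack kernel: copies an m x k
 * matrix A into a contiguous row-major pack buffer.
 *   row-major input:    rs_a = lda, cs_a = 1
 *   column-major input: rs_a = 1,   cs_a = lda
 * A transposed row-major input is read with the column-major strides. */
static void pack_a_row_major(const float *a, size_t rs_a, size_t cs_a,
                             size_t m, size_t k, float *pack_buf)
{
    assert(a != NULL && pack_buf != NULL);

    for (size_t i = 0; i < m; ++i) {
        for (size_t p = 0; p < k; ++p) {
            /* Element (i, p) of A lands at row i, column p of the buffer. */
            pack_buf[i * k + p] = a[i * rs_a + p * cs_a];
        }
    }
}
```

With this layout, the compute kernel can always read the packed A with
unit column stride, regardless of how the caller stored the matrix.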
AMD-Internal: [SWLCSG-2268, SWLCSG-2442]
Change-Id: I43a113dc4bc11e6bb7cc4d768c239a16cb6bbea4