Description:
Implemented a C reference for the
aocl_gemm_unreorder_bf16bf16f32of32 function.
The implementation works for row-major input;
column-major support is yet to be enabled.
AMD-Internal: [ SWLCSG-3279 ]
Change-Id: Ibcce4180bb897a40252140012d8d6886c38cb77a
Description:
1. Added new output types for the f32 element-wise APIs to support
s8, u8, s32 and bf16 outputs.
2. Updated the base f32 API to support all the post-ops supported in
the gemm APIs.
AMD Internal: [SWLCSG-3384]
Change-Id: I1a7caac76876ddc5a121840b4e585ded37ca81e8
- Add missing xmm, ymm and k registers to clobber lists
in bli_dgemmsup_rv_zen4_asm_24x8m.c
- Add missing ymm1 in bli_dgemmsup_rv_zen4_asm_24x8m.c,
bli_gemmsup_rv_haswell_asm_d6x8m.c and bli_gemmsup_rd_zen_s6x64.c
- Also change formatting in bli_copyv_zen4_asm_avx512.c,
bli_dgemm_avx512_asm_8x24.c and bli_zero_zmm.c to make
automatic processing of clobber lists easier.
AMD-Internal: [CPUPL-5895]
Change-Id: If05a3f00e6c0f9033eeced5de165ba4c3128b3e5
More changes to standardize copyright formatting and correct years
for some files modified in recent commits.
AMD-Internal: [CPUPL-5895]
Change-Id: Ie95d599710c1e0605f14bbf71467ca5f5352af12
- Added 32x3n n-biased kernels to directly handle the cases where n=3
which were earlier being handled by the primary n-biased, 32x8n,
kernel.
- Modified the n-biased fringe kernels to further handle the smaller
m-fringe cases. Thus, now the kernels handle the following range of m
for any value of n:
- 16x8n : m = [16, 31]
- 8x8n : m = [8, 15]
- m_leftx8n : m = [1, 7]
- Updated the function pointer map for n-biased kernels with added
granularity to invoke the smaller fringe cases directly on the basis
of m-dimension.
- Added micro-kernel unit tests for all the dgemv_n kernels.
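The m-based selection described above can be sketched as follows. This is only an illustration: the enum labels and the function name pick_kernel are hypothetical, and the real code uses a function pointer map rather than a chain of comparisons.

```c
#include <stddef.h>

/* Hypothetical labels standing in for the n-biased kernels. */
typedef enum { K_32X8N, K_16X8N, K_8X8N, K_MLEFTX8N, K_NONE } dgemv_n_kernel_t;

/* Pick the kernel that handles a block of m rows. */
dgemv_n_kernel_t pick_kernel(ptrdiff_t m)
{
    if (m >= 32) return K_32X8N;     /* primary kernel          */
    if (m >= 16) return K_16X8N;     /* 16 <= m < 32            */
    if (m >= 8)  return K_8X8N;      /*  8 <= m < 16            */
    if (m >= 1)  return K_MLEFTX8N;  /*  1 <= m <  8            */
    return K_NONE;                   /* nothing to do for m < 1 */
}
```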
AMD-Internal: [CPUPL-6231]
Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119
Description:
1. Changed all post-ops in s8s8s32o<s32|s8|u8|f32|bf16> to operate
on float data. The s32 accumulator registers are converted to float
at the end of the k loop, and every post-op then operates on f32.
2. Added s8s8s32ou8 API, which uses the s8s8s32os32 kernels but stores
the output in u8.
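A scalar sketch of the flow for one output element, under stated assumptions: the function name, the scale-plus-bias post-op pair and the round-and-saturate store policy are illustrative choices, not the kernel's actual code.

```c
#include <stdint.h>

/* One output element: s8 x s8 products accumulated in s32, converted to
 * float once after the k loop, post-ops applied in f32, stored as u8. */
uint8_t dot_s8s8_to_u8(const int8_t *a, const int8_t *b, int k,
                       float scale, float bias)
{
    int32_t acc = 0;
    for (int p = 0; p < k; ++p)
        acc += (int32_t)a[p] * (int32_t)b[p];

    float v = (float)acc;        /* single s32 -> f32 conversion */
    v = v * scale + bias;        /* example post-ops, all in f32 */

    /* Assumed store policy: clamp to [0, 255], then round half up. */
    if (v < 0.0f)   v = 0.0f;
    if (v > 255.0f) v = 255.0f;
    return (uint8_t)(v + 0.5f);
}
```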
AMD-Internal - SWLCSG-3366
Change-Id: Iadfd9bfb98fc3bf21e675acb95553fe967b806a6
- Modified bench to support testing of different types of buffers
for bias, mat_add and mat_mul postops.
- Added support for testing integer APIs with float accumulation
type.
Change-Id: I72364e9ad25e6148042b93ec6d152ff82ea03e96
libFLAME calls DAMAX kernel directly. Now that AVX512 version
has been enabled in BLIS cntx, export this symbol.
AMD-Internal: [CPUPL-5895]
Change-Id: I4c74150578f49eb643b0f68c6cc32ee2bb23bec2
- Blocksizes for sizes where M >> K, N >> K and K < 500 were tuned by running
blis bench on only one MPI rank. Blocksizes tuned this way are not performing
well for all configurations.
- Retuned the blocksizes so that performance is good for such skinny sizes.
AMD-Internal: [CPUPL-6362]
Change-Id: I89c61889df2443ef6bf0e87bf89263768b5c00c1
Description:
1. Support has been added to scale buffer values using both scalar and
vector scale factors before matrix add or matrix mul post-ops.
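A reference-style sketch of a scaled matrix-add post-op. The function name and the convention that a vector scale factor holds one entry per column are assumptions made for illustration.

```c
#include <stddef.h>

/* C[i][j] += s * buf[i][j] for row-major m x n matrices.
 * scale_len == 1 selects a scalar factor; any other value is assumed
 * to mean one factor per column (scale_len == n). */
void matrix_add_scaled(float *c, const float *buf, const float *scale,
                       size_t scale_len, size_t m, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float s = (scale_len == 1) ? scale[0] : scale[j];
            c[i * n + j] += s * buf[i * n + j];
        }
    }
}
```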
AMD-Internal: CPUPL-6340
Change-Id: Ie023d5963689897509ef3d5784c3592791e57125
Description:
1. Changed all post-ops in u8s8s32o<s32|s8|u8|f32|bf16> to operate
on float data. The s32 accumulator registers are converted to float
at the end of the k loop, and every post-op then operates on f32.
2. Added u8s8s32ou8 API, which uses the u8s8s32os32 kernels but stores
the output in u8.
AMD-Internal - SWLCSG-3366
Change-Id: Iab1db696d3c457fb06045cbd15ea496fd4b732a5
Description:
1. Added u8s8s32of32, u8s8s32obf16, s8s8s32of32 and s8s8s32obf16 APIs,
where the inputs are uint8/int8 and the processing is done using
VNNI but the output is stored in f32 and bf16 formats. All the int8
kernels are reused and updated with the new output data types.
2. Added F32 data type support in bias.
3. Updated the bench and bench input file to support validation.
AMD-Internal: SWLCSG-3335
Change-Id: Ibe2474b4b8188763a3bdb005a0084787c42a93dd
- As part of AOCL-BLAS, there exists a set of vectorized
SUP kernels for GEMM, that are performant when invoked
in a bare-metal fashion.
- Designed a macro-based interface for handling tiny
sizes in GEMM, that would utilize these kernels. This
is currently instantiated for 'Z' datatype (double-precision
complex).
- Design breakdown :
- Tiny path requires the usage of AVX2 and/or AVX512
SUP kernels, based on the micro-architecture. The
decision logic for invoking tiny-path is specific
to the micro-architecture. These thresholds are defined
in their respective configuration directories(header files).
- List of AVX2/AVX512 SUP kernels(lookup table), and their
lookup functions are defined in the base-architecture from
which the support starts. Since we need to support backward
compatibility when defining the lookup table/functions, they
are present in the kernels folder(base-architecture).
- Defined a new type to be used to create the lookup table and its
entries. This type holds the kernel pointer, blocking dimensions
and the storage preference.
- This design would only require the appropriate thresholds and
the associated lookup table to be defined for the other datatypes
and micro-architecture support. Thus, it is extensible.
- NOTE : The SUP kernels that are listed for Tiny GEMM are m-var
kernels. Thus, the blocking in framework is done accordingly.
In case of adding the support for n-var, the variant
information could be encoded in the object definition.
- Added test-cases to validate the interface for functionality(API
level tests). Also added exception value tests, which have been
disabled due to the SUP kernel optimizations.
AMD-Internal: [CPUPL-6040][CPUPL-6018][CPUPL-5319][CPUPL-3799]
Change-Id: I84f734f8e683c90efa63f2fa79d2c03484e07956
- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
AVX2 kernels for Zen1/2/3 architectures. These kernels are written
from the ground up and are independent of fused kernels.
- The DGEMV primary kernel processes the calculation in chunks of
8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
kernels, which are invoked by the primary kernel as needed.
- Implemented the kernels by computing the dot product of matrix A
columns with vector x in chunks of 32 elements, storing the results
in accumulator registers. Fringe elements are handled in chunks
of 16, 8, etc. The data in the accumulator registers is then reduced
and added to vector y.
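The accumulate-then-reduce structure can be sketched in scalar C. CHUNK stands in for the 32-element vector chunk, the function name is hypothetical, and the alpha scaling is assumed from the usual GEMV definition.

```c
#include <stddef.h>

#define CHUNK 4   /* stand-in for the 32-element chunk of the real kernels */

/* y[j] += alpha * dot(A(:,j), x) for a column-major m x n matrix A. */
void dgemv_dot_sketch(size_t m, size_t n, double alpha,
                      const double *a, size_t lda,
                      const double *x, double *y)
{
    for (size_t j = 0; j < n; ++j) {
        const double *col = a + j * lda;
        double acc[CHUNK] = {0.0};      /* plays the accumulator registers */
        size_t i = 0;

        for (; i + CHUNK <= m; i += CHUNK)
            for (size_t l = 0; l < CHUNK; ++l)
                acc[l] += col[i + l] * x[i + l];

        double sum = 0.0;
        for (; i < m; ++i)              /* fringe elements */
            sum += col[i] * x[i];
        for (size_t l = 0; l < CHUNK; ++l)
            sum += acc[l];              /* reduce the accumulators */

        y[j] += alpha * sum;
    }
}
```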
AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61
Details:
- Disabled intrinsics code of f32obf16 pack function
for gcc < 11.2 as the instructions used in kernels
are not supported by the compiler versions.
- Added early-return check for WOQ APIs when compiling with
gcc < 11.2
- Fixed code to check whether JIT kernels are generated inside
batch_gemm API for bf16 datatype.
AMD Internal: [CPUPL-6327]
Change-Id: I0a017c67eb9d9d22a14e095e435dc397e265fb0a
Description:
Added _mm512_cvtps_epi32 for bf16 to s32 conversion in gemv APIs.
AMD-Internal: SWLCSG-3302
Change-Id: I7e3e6da8f50d1f7177629cb68ac21e3bbce40bee
Description:
1. The bias type was supported only based on output data type.
2. The option is added in the pre-ops structure to select the bias data
type(s8/s32/bf16) irrespective of the storage data type in
u8s8s32/s8s8s32 API's.
AMD-Internal: SWLCSG-3302
Change-Id: I3c465fe428672d2d58c1c60115c46d2d5b11f0f4
Refined thresholds to decide between native and sup DGEMM code-paths for both zen4 and zen5 processors.
AMD-Internal: [CPUPL-6300]
Change-Id: Ib32a256dba99a0a92b7ecaa7684443a66c459566
Some kernel file names were the same for different sub-configurations,
which could result in duplicate copies of the same object being archived
depending upon the order of (re-)compiling the source files. Rename the
files to be specific to each sub-configuration to avoid this problem.
AMD-Internal: [CPUPL-5895]
Change-Id: I182ac706e04a364f1df20fd0fb5b633eb10eeafb
Details:
- The batch matmul performs a series of matmuls, processing
more than one GEMM problem at once.
- Introduced a new parameter called batch_size for the user
to indicate number of GEMM problems in a batch/group.
- This operation supports processing GEMM problems with
different parameters including dims,post-ops,stor-schemes etc.,
- This operation is optimized for problems where all the
GEMMs in a batch are of same size and shape.
- For now, the threads are distributed among different GEMM
problems equally irrespective of their dimensions which
leads to better performance for batches with identical GEMMs
but performs sub-optimally for batches with non-identical GEMMs.
- Optimizations for batches with non-identical GEMMs is in progress.
- Added bench and input files for batch_matmul.
- Added logger functionality for batch_matmul APIs.
AMD-Internal: [SWLCSG-2944]
Change-Id: I83e26c1f30a5dd5a31139f6706ac74be0aa6bd9a
-As it stands the buffer type in matrix add|mul post-ops is expected to
be the same as that of the output C matrix type. This limitation is now
removed and user can specify the buffer type by setting the stor_type
attribute in add|mul post-op struct. As of now int8, int32, bfloat16 and
float types are supported for the buffer in s32 micro-kernels. The same
support is also added for bf16 micro-kernels, with bfloat16 and float
supported for now.
-Additionally the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add|mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.
-The bias_stor_type attribute is renamed to stor_type in bias post-ops.
AMD-Internal: [SWLCSG-3319]
Change-Id: I4046ab84481b02c55a71ebb7038e38aec840c0fa
Details:
- Fixed a few bugs in downscale post-op for f32 datatype.
- Fixed a bug in setting strides of packB buffer in
int8 APIs.
Change-Id: Idb3019cc4593eace3bd5475dd1463dea32dbe75c
- Added Downscale, tanh and sigmoid post-op support to the JIT kernels
- Mask bf16s4 kernel call while JIT kernels are enabled to avoid compile-time error.
- Added the optional support for B-prefetch in the JIT kernels
- Resolved the visibility issues in global variable jit_krnels_generated
- Modified the array generation for scale and zp values in the bench
Change-Id: I09b8afc843f51ac23645e02f210a2c13d3af804d
-Currently the values (from buffer) are added/multiplied as is to the
output registers while performing the matrix add/mul post-ops. Support
is added for scaling these values before using them in the post-ops.
Both scalar and vector scale_factors are supported.
AMD-Internal: [SWLCSG-3181]
Change-Id: Ifdb7160a1ea4f5ecccfa3ef31ecfed432898c14d
- Bug : The current {S/D}AMAXV AVX512 kernels produced incorrect
results for vectors with multiple absolute maximums.
They returned the last index among the multiple occurrences,
instead of the first one.
- Implemented a bug-fix to handle this issue on these AVX512
kernels. Also ensured that the kernels are compliant with
the standard when handling exception values.
- Further optimized the code by decoupling the logic to find
the maximum element from the search for its index. This way,
we use lower-latency instructions to compute the maximum
first.
- Updated the unit-tests, exception value tests and early return
tests for the API to ensure code-coverage.
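A scalar sketch of the decoupled approach: find the maximum absolute value first, then return the first index that holds it. The function name is hypothetical and exception values (NaN) are deliberately not handled here.

```c
#include <stddef.h>

/* Returns the 0-based index of the first element with the largest
 * absolute value, or 0 when n == 0. NaN handling is left out. */
size_t damaxv_first(size_t n, const double *x)
{
    double max = 0.0;

    /* Pass 1: compute the maximum absolute value only. */
    for (size_t i = 0; i < n; ++i) {
        double v = (x[i] < 0.0) ? -x[i] : x[i];
        if (v > max) max = v;
    }

    /* Pass 2: first index whose absolute value equals that maximum. */
    for (size_t i = 0; i < n; ++i) {
        double v = (x[i] < 0.0) ? -x[i] : x[i];
        if (v == max) return i;
    }
    return 0;
}
```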
AMD-Internal: [CPUPL-4745]
Change-Id: I2f44d33dbaf89fe19e255af1f934877816940c6f
The threshold for the tiny path was large, but the buffer size was
not enough to store the complete packed matrix, which led to
segmentation faults.
This commit fixes the buffer size to match the threshold of the tiny gemm path.
With the corrected buffer size, the matrix is packed correctly.
AMD-Internal: [CPUPL-6201]
Change-Id: I0292a07f6146e7f1ccd8c1010b4c41c218fd9b47
- Warnings in DTRSM kernel caused by uninitialized registers
and an extra loop unroll are fixed.
- Warning in DGEMM kernel caused by extra space is fixed.
Change-Id: I1d9cfaa0b2847f5fdbe8b343a462d67a3aca0819
- This patch introduces changes to support DGEMM computation when the input matrix A is transposed.
- The changes accommodate CRC (Column-Row-Column) and RRC (Row-Row-Column) storage schemes for matrices
C, A, and B. The primary goal is to pack the A matrix in a column-stored scheme, enabling the re-use
of the DGEMM SUP kernel for efficient computation.
- Performance is better when BLIS_PACK_BUFFER macro is set to 0.
By default, it is set to 1[enabled].
AMD-Internal: [CPUPL-6054]
Change-Id: I543a84b05c9e6380bc03017ab6da685e7006a64e
This patch introduces comprehensive optimizations to the DGEMM kernel, focusing on loop
efficiency and edge kernel performance. The following technical improvements have been implemented:
1. **IR Loop Optimization:**
- The IR loop has been re-implemented in hand-written assembly to eliminate the overhead associated
with `begin_asm` and `end_asm` calls, resulting in more efficient execution.
2. **JR Loop Integration:**
- The JR loop is now incorporated into the micro kernel. This integration avoids the repetitive overhead
of stack frame management for each JR iteration, thereby enhancing loop performance.
3. **Kernel Decomposition Strategy:**
- The m dimension is decomposed into specific sizes: 20, 18, 17, 16, 12, 11, 10, 9, 8, 4, 2, and 1.
- For remaining cases, masked variants of edge kernels are utilized to handle the decomposition efficiently.
4. **Interleaved Scaling by Alpha:**
- Scaling by the alpha factor is interleaved with load instructions to optimize the instruction pipeline
and reduce latency.
5. **Efficient Mask Preparation:**
- Masks are prepared within inline assembly code only at points where masked load-store operations are necessary,
minimizing unnecessary overhead.
6. **Broadcast Instruction Optimization:**
- In edge kernels where each FMA (Fused Multiply-Add) operation requires a broadcast without subsequent reuse,
the broadcast instruction is replaced with `mem_1to8`.
- This allows the compiler to optimize by assigning separate vector registers for broadcasting, thus avoiding
dependency chains and improving execution efficiency.
7. **C Matrix Update Optimization:**
- During the update of the C matrix in edge kernels, columns are pre-loaded into multiple vector registers.
This approach breaks dependency chains during FMA operations following the scaling by alpha, thereby mitigating
performance bottlenecks and enhancing throughput.
These optimizations collectively improve the performance of the DGEMM kernel, particularly in handling edge cases and
reducing overhead in critical loops. The changes are expected to yield significant performance gains in matrix multiplication
operations.
This patch also changes the tiny gemm interface: it adds a light
interface for calling kernels and removes calls to avx2 dgemm kernels,
as we use avx512 dgemm kernels for all the sizes for zen4 and zen5.
For zen4 and zen5, when the A matrix is transposed (CRC, RRC), the tiny kernel does not
support such inputs, and they are routed to the
gemm_small path.
AMD-Internal: [CPUPL-6054]
Change-Id: I57b430f9969ca39aa111b54fa169e4225b900c4a
- AVX512 specific DGEMV native kernels are added for Zen4/5
architectures to handle the NO_TRANSPOSE cases and are independent of
the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
beta scaling of y vector within the kernel itself and handle cases
where n is less than 5:
- bli_dgemv_n_zen_int_32x8n_avx512( ... )
- bli_dgemv_n_zen_int_32x4n_avx512( ... )
- bli_dgemv_n_zen_int_32x2n_avx512( ... )
- bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) is biased towards the
m-dimension and for this kernel beta scaling is handled beforehand
within the framework.
- Added unit-tests for the new kernels.
- AVX2 path for Zen/2/3 architectures still follows the old approach of
using fused kernel, namely AXPYF, to perform the GEMV operation.
AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79
Description:
1. AutoAWQ uses an int32 buffer to store 8 elements of 4 bits each in this
format [0, 2, 4, 6, 1, 3, 5, 7].
2. Support is added to convert the above format back to the original
sequential order [0, 1, 2, 3, 4, 5, 6, 7] before reordering
in the AWQ API.
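A sketch of the conversion for one packed word. It assumes that nibble i, counted from the least significant bits, holds the logical element order[i]; the function name is hypothetical.

```c
#include <stdint.h>

/* Unpack eight 4-bit values from one 32-bit word into sequential order.
 * Assumption: nibble i (from the LSB) stores logical element order[i]. */
void awq_unpack_sequential(uint32_t packed, uint8_t out[8])
{
    static const int order[8] = {0, 2, 4, 6, 1, 3, 5, 7};

    for (int i = 0; i < 8; ++i)
        out[order[i]] = (uint8_t)((packed >> (4 * i)) & 0xFu);
}
```

With this convention, a word whose nibbles (LSB first) are 0, 2, 4, 6, 1, 3, 5, 7 unpacks to 0 through 7 in order.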
AMD-Internal: SWLCSG-3169
Change-Id: I5395766060c200ab81d0b8be94356678a169ac13
Description:
1. Added group quantization and zero-point (zp) in
aocl_gemm_bf16s4f32o<bf16|f32> API.
2. Group quantization is a technique to improve accuracy
where the scale factors used to dequantize weights vary at group
level instead of per-channel or per-tensor level.
3. Added zp and scaling in woq packb kernels so that for
large M values zp and scaling are performed at pack-b
stage and bf16 kernels are called
4. Adding zp support and scaling to default path in WoQ kernels
created some performance overhead when M value is very small.
5. Added string group_size to lpgemm bench to read
group size from bench_input.txt and tested for
various combinations of matrix dimensions.
6. The scalefactors could be of type float or bf16
and the zeropoint values are expected to be
in int8 format.
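Group-level dequantization can be sketched for one column of weights as below. The function name, the layout (groups formed along k) and the use of int8_t to hold the 4-bit values are assumptions for illustration; scale factors are shown as float only.

```c
#include <stddef.h>
#include <stdint.h>

/* w[p] = (q[p] - zp[g]) * scale[g], with g = p / group_size.
 * q holds signed 4-bit values widened to int8_t. */
void dequant_group(const int8_t *q, size_t k, size_t group_size,
                   const float *scale, const int8_t *zp, float *w)
{
    for (size_t p = 0; p < k; ++p) {
        size_t g = p / group_size;   /* group that element p belongs to */
        w[p] = (float)(q[p] - zp[g]) * scale[g];
    }
}
```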
AMD-Internal: [SWLCSG-3168, SWLCSG-3172]
Change-Id: Iff07b54d76edc7408eb2ea0b29ce8b4a04a38f57
- Enabled dynamic blocksizes for DGEMM in ZEN4 and ZEN5 systems.
- MC, KC and NC are dynamically selected at runtime for DGEMM native.
- A local copy of cntx is created and blocksizes are updated in the local cntx.
- Updated threshold for picking DGEMM SUP kernel for ZEN4.
AMD-Internal: [CPUPL-5912]
Change-Id: Ic12a1a48bfa59af26cc17ccfa47a2a33fadde1f6
- Merged ZEN4 and ZEN5 DGEMM 8x24 kernel.
- Replaced 32x6 kernel with 8x24. Now same
kernel is used for ZEN4 and ZEN5.
- Blocksizes have been tuned for genoa only.
- DGEMM kernel for DTRSM native code path
is replaced with 8x24 kernel.
- Enabled alpha scaling during packing for ZEN4.
- ZEN4 8x24 kernel has been removed.
AMD-Internal: [CPUPL-5912]
Change-Id: I89a16a7e3355af037d21d453aabf53c5ecccb754
Description:
1. The bias type was supported only based on output data type.
2. The option is added in the pre-ops structure to select the bias data
type irrespective of the storage data type in bf16 and WoQ API's
AMD-Internal: SWLCSG-3171
Change-Id: Iac10b946c2d4a5c405b2dc857362be0058615abf
Description:
Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_<s8|u8>s8<s32|s16>o<s8|u8|s32> APIs
Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))
Updated bench_lpgemm to recognize sigmoid, tanh
as options for post-ops from bench_input and verified.
AMD-Internal: [SWLCSG-3178]
Change-Id: I9df3aab02222f728ff9d1f292c7bc549f30176f0
Description:
Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_f32f32f32of32 APIs
Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))
Updated bench_lpgemm to recognize sigmoid, tanh
as options for post-ops from bench_input and verified.
AMD-Internal: [SWLCSG-3178]
Change-Id: Iac0a907f6dea1d9cb82d9fd8716bfdbf1c33921d
Description:
Implemented sigmoid, tanh as fused post-ops in
aocl_gemm_bf16bf16f32o<f32|bf16> APIs
Sigmoid(x) = 1/(1+e^(-x))
Tanh(x) = (1-e^(-2x))/(1+e^(-2x))
Updated bench_lpgemm to recognize sigmoid, tanh
as options for post-ops from bench_input and verified.
AMD-Internal: [SWLCSG-3178]
Change-Id: I78a3ba4a67ab63f9d671fbe315f977b016a0d969
- Implemented a set of column preferential dot-product based
ZGEMM kernels(main and fringe) in AVX512(for SUP code-path).
These kernels perform matrix multiplication as a sequence
of inner products(i.e, dot-products).
- These standalone kernels are expected to strictly handle
the CRC storage scheme for C, A and B matrices. RRC is also
supported through operation transpose, at the framework
level.
- Added unit-tests to test all the kernels(main and fringe),
as well as the redirection between these kernels.
AMD-Internal: [CPUPL-5949]
Change-Id: I858257ac2658ed9ce4980635874baa1474b79c38
Description:
The _mm512_cvtne2ps_pbh(a, b) instruction takes its
elements from b when j<16, but the code was written
assuming the reverse order.
Fixed some indentation issues
Changed the file name and made it uniform
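A scalar model of the element order, for reference: destination elements 0..15 come from b and 16..31 from a. The helper names are hypothetical, and the bf16 conversion shown is a plain round-to-nearest-even that does not treat NaN inputs specially.

```c
#include <stdint.h>
#include <string.h>

/* f32 -> bf16 with round-to-nearest-even; NaN inputs are not handled. */
uint16_t f32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);
    return (uint16_t)(u >> 16);
}

/* Scalar model of the ordering: dst[0..15] from b, dst[16..31] from a. */
void cvtne2ps_pbh_model(const float a[16], const float b[16],
                        uint16_t dst[32])
{
    for (int j = 0; j < 32; ++j)
        dst[j] = f32_to_bf16(j < 16 ? b[j] : a[j - 16]);
}
```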
Change-Id: I7b45b4c35931d8febde7b7b5d9604ea953046f97
Description:
aocl_reorder_f32obf16 function is implemented to
reorder input weight matrix of data type float to
bfloat16.
The reordering is done to match the input requirements
of API aocl_gemm_bf16bf16f32o<f32|bf16>.
The objective of the API is to convert a model/matrix
of type f32 to bf16 so that it can be processed with the
bf16 FMA instruction _mm512_dpbf16_ps when the machine supports
it but the model is still in float.
Change-Id: Ib7c743d52d01a1ac09e84ac120577ec9e02f90f5
-Currently lpgemm sets the context (block sizes and micro-kernels) based
on the ISA of the machine it is being executed on. However this approach
does not give the flexibility to select a different context at runtime.
In order to enable runtime selection of context, the context
initialization is modified to read the AOCL_ENABLE_INSTRUCTIONS env
variable and set the context based on the same. As part of this commit,
only f32 context selection is enabled.
-Bug fixes in scale ops in f32 micro-kernels and GEMV path selection.
-Added vectorized f32 packing kernels for NR=16(AVX2) and NR=64(AVX512).
This is only for B matrix and helps remove dependency of f32 lpgemm api
on the BLIS packing framework.
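The runtime selection can be sketched as a small parser over the environment variable. The enum, the function names and the accepted strings "avx2" and "avx512" are assumptions for illustration; the set of values the library actually accepts may differ.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical ISA choices for the lpgemm context. */
typedef enum { LPGEMM_ISA_DEFAULT, LPGEMM_ISA_AVX2, LPGEMM_ISA_AVX512 } lpgemm_isa_t;

/* Map the variable's value to an ISA; unknown or missing values fall
 * back to the default (hardware-detected) choice. */
lpgemm_isa_t lpgemm_parse_isa(const char *s)
{
    if (s == NULL)                return LPGEMM_ISA_DEFAULT;
    if (strcmp(s, "avx512") == 0) return LPGEMM_ISA_AVX512;
    if (strcmp(s, "avx2") == 0)   return LPGEMM_ISA_AVX2;
    return LPGEMM_ISA_DEFAULT;
}

/* Read AOCL_ENABLE_INSTRUCTIONS once, at context initialization. */
lpgemm_isa_t lpgemm_isa_from_env(void)
{
    return lpgemm_parse_isa(getenv("AOCL_ENABLE_INSTRUCTIONS"));
}
```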
AMD Internal: [CPUPL-5959]
Change-Id: I4b459aaf33c54423952f89905ba43cf119ce20f6
Details:
- Added a new API called unreorder that converts a matrix from
reordered format to its original format (row-major or col-major).
- Currently this API only supports bf16 datatype.
- Added corresponding bench and input file to test accuracy of the
API.
- The new API is only supported for 'B' matrix.
- Modified input validation checks in reorder API to account for
row Vs col storage of matrix and transposes for bf16 datatype.
Change-Id: Ifb9c53b7e6da6f607939c164eb016e82514581b7
-Added new pack kernels that pack/reorder the B matrix (odd strides) from
column-major input format. This also supports the transB scenario if
input B matrix is row major.
Change-Id: Ia0fe7e5f19ae9eba5c418f4089c7e6df11091853
- Implemented the Scale post-op for the F32 API for all kernels
- f32_scale = (f32 * scale_factor) + offset
- Added the bench inputs
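A reference sketch of the formula above applied to a buffer of outputs. The function name is hypothetical, and scalar scale_factor and offset are assumed for simplicity.

```c
#include <stddef.h>

/* c[i] = (c[i] * scale_factor) + offset for n f32 outputs. */
void scale_postop_f32(float *c, size_t n, float scale_factor, float offset)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = (c[i] * scale_factor) + offset;
}
```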
Change-Id: Ib0f25f870eafe695d8b2a2c434c8cb3ec4f7db4c
- The data type of n and conj is dim_t, which is int32_t in the LP64 case.
- When loading 64-bit registers using "mov" instructions, e.g. mov(rax, var(n)),
"n" must be 64-bit; otherwise incorrect values get loaded.
Fix: We typecast these variables to int64_t before loading them into registers.
Thanks to mangala.v@amd.com for finding this bug.
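The failure mode can be modelled in portable C. The struct below is a hypothetical stand-in for a 32-bit variable followed by unrelated data in memory: an 8-byte load picks up the neighbouring bytes, while widening first yields the intended value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical memory layout: a 32-bit n followed by unrelated data. */
typedef struct { int32_t n; int32_t neighbour; } frame_t;

/* Models a 64-bit mov from the address of a 32-bit variable:
 * the upper or lower half comes from whatever sits next to n. */
int64_t load64_raw(const frame_t *f)
{
    int64_t v;
    memcpy(&v, f, sizeof v);
    return v;
}

/* The fix: widen to int64_t first, then load the 64-bit copy. */
int64_t load64_widened(const frame_t *f)
{
    int64_t n64 = (int64_t)f->n;
    return n64;
}
```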
Change-Id: I8542dc1ea434ca9030f3c56d9a681135055f8ba5
-Added new pack kernels that pack/reorder the B matrix from column-major
input format. This also supports the transB scenario if input B matrix
is row major.
Change-Id: I4c75b6e81016331fd7e7f95ad4212e6d38dc586f
- Implemented the AVX512 packA kernel for col major inputs in F32 API
- Removed the workarounds for n = 1, mtag_a = PACK case, where the execution was
being directed to GEMM instead of GEMV.
Change-Id: I6fb700d96069213a762e8a83a209c5388a91050f
SCALV is used directly by BLAS, CBLAS and BLIS scal{v} APIs but
also within many other APIs to handle special cases. In general
it is preferred to use SETV when alpha=0, but BLAS and CBLAS
continue to multiply all vector elements by alpha. The two approaches
behave differently when propagating NaNs or Infs.
Changes in this commit:
- Standardize early returns from SCALV reference and optimized
kernels.
- User supplied N<0 is handled at the top level API layer. Use
negative values of N in kernel calls to signify that SETV
should _not_ be used when alpha=0. This should only be
required in SCALV.
- Include serial threshold in zdscal (as in dscal) to reduce
overhead for small problem sizes.
- Code tidying to make different variants more consistent.
- More standardization of tests in SCALV gtestsuite programs.
- Remove scalv_extreme_cases.cpp as it is now redundant.
AMD-Internal: [CPUPL-4415]
Change-Id: I42e98875ceaea224cc98d0cdfe0133c9abc3edae
- Added explicit typecast to the pointers that are passed
to the _mm_prefetch( ... ) intrinsic, to avoid compiler
warnings.
AMD-Internal: [CPUPL-4415]
Change-Id: I1c1398b7b5abe81848d33cb6df107f7f077588ea