amd/blis - blis - Public git mirror

amd/blis

mirror of https://github.com/amd/blis.git synced 2026-07-01 11:47:25 +00:00

Commit Graph

Author	SHA1	Message	Date

Arnav Sharma

5a4739d288

DGEMV NO_TRANSPOSE Optimizations and Unit Tests

- Added 32x3n n-biased kernels to handle the n=3 case directly; it was
  previously handled by the primary n-biased 32x8n kernel.
- Modified the n-biased fringe kernels to also handle the smaller
  m-fringe cases. The kernels now handle the following ranges of m
  for any value of n:
  - 16x8n     : m = [16, 31]
  - 8x8n      : m = [8, 15]
  - m_leftx8n : m = [1, 7]
- Updated the function pointer map for n-biased kernels with added
  granularity to invoke the smaller fringe cases directly on the basis
  of m-dimension.
- Added micro-kernel unit tests for all the dgemv_n kernels.

AMD-Internal: [CPUPL-6231]
Change-Id: Ibe88848c2c1bbb65b3e79fbc90a2800dc15f5119

2025-02-06 18:52:32 +05:30
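
The m-based dispatch described in the commit above can be sketched roughly as follows. This is only an illustration, not the BLIS code: the kernel signature, the scalar stand-in kernels, and the names `gemv_n_ker_t`, `fringe_map`, `select_fringe`, and `gemv_n` are all hypothetical, and the m ranges are assumed to be inclusive (16 to 31, 8 to 15, 1 to 7).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature (not the BLIS one):
   y := beta*y + alpha*A*x, A column-major m x n with leading dimension lda. */
typedef void (*gemv_n_ker_t)(size_t m, size_t n, double alpha,
                             const double *a, size_t lda,
                             const double *x, double beta, double *y);

/* Scalar stand-in for the vectorized kernel bodies. */
static void gemv_n_ref(size_t m, size_t n, double alpha,
                       const double *a, size_t lda,
                       const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i + j * lda] * x[j];
        y[i] = beta * y[i] + alpha * acc;
    }
}

/* One stand-in per kernel; each asserts the m range it is meant for. */
#define DEFINE_KERNEL(name, lo, hi)                                       \
    static void name(size_t m, size_t n, double alpha, const double *a,   \
                     size_t lda, const double *x, double beta, double *y) \
    {                                                                     \
        assert(m >= (lo) && m <= (hi));                                   \
        gemv_n_ref(m, n, alpha, a, lda, x, beta, y);                      \
    }
DEFINE_KERNEL(ker_32x8n, 32, 32)
DEFINE_KERNEL(ker_16x8n, 16, 31)
DEFINE_KERNEL(ker_8x8n, 8, 15)
DEFINE_KERNEL(ker_m_leftx8n, 1, 7)

/* Function pointer map indexed by m_left / 8, for m_left in [1, 31]. */
static const gemv_n_ker_t fringe_map[4] = {
    ker_m_leftx8n, ker_8x8n, ker_16x8n, ker_16x8n
};

static gemv_n_ker_t select_fringe(size_t m_left)
{
    assert(m_left >= 1 && m_left <= 31);
    return fringe_map[m_left / 8];
}

/* Full blocks of 32 rows go to the primary kernel; the remaining
   rows go straight to the fringe kernel chosen by their count. */
static void gemv_n(size_t m, size_t n, double alpha,
                   const double *a, size_t lda,
                   const double *x, double beta, double *y)
{
    size_t i = 0;
    for (; i + 32 <= m; i += 32)
        ker_32x8n(32, n, alpha, a + i, lda, x, beta, y + i);
    if (i < m)
        select_fringe(m - i)(m - i, n, alpha, a + i, lda, x, beta, y + i);
}
```

With m = 35, for example, this sketch runs one 32-row block through the primary stand-in and sends the remaining 3 rows directly to the m_left stand-in, without passing through the 16-row or 8-row kernels first.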

Hari Govind S

349fc47ec5

DGEMV Optimizations for TRANSPOSE Cases

- Developed new AVX512 DGEMV kernels for Zen4/5 architectures and
  AVX2 kernels for Zen1/2/3 architectures. These kernels are written
  from the ground up and are independent of fused kernels.

- The DGEMV primary kernel processes the calculation in chunks of
  8 columns. Fringe columns (sizes 1 to 7) are handled by fringe
  kernels, which are invoked by the primary kernel as needed.

- Implemented the kernels by computing the dot product of matrix A
  columns with vector x in chunks of 32 elements, storing the results
  in accumulator registers. Fringe elements are handled in chunks
  of 16, 8, etc. The data in the accumulator registers is then reduced
  and added to vector y.

AMD-Internal: [CPUPL-5835]
Change-Id: I5cb9eb1330db095931586a7028fd7676fbbecc61

2025-01-24 00:38:34 -05:00
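
The chunked dot-product scheme described in the commit above can be sketched in scalar C as follows. This is a minimal sketch under stated assumptions, not the actual kernels: the names `col_dot` and `gemv_t` are hypothetical, an 8-element array stands in for one 512-bit accumulator register, A is assumed column-major, and y is assumed to have been scaled by beta already, since the commit message does not say where beta scaling happens.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 }; /* doubles per 512-bit register */

/* Dot product of one column of A with x: lane-wise partial sums over
   chunks of 32, 16, and 8 rows, then a horizontal reduction. */
static double col_dot(const double *col, const double *x, size_t m)
{
    double acc[LANES] = {0.0};
    size_t i = 0;

    for (size_t chunk = 32; chunk >= LANES; chunk /= 2)
        for (; i + chunk <= m; i += chunk)
            for (size_t k = 0; k < chunk; ++k)
                acc[k % LANES] += col[i + k] * x[i + k];

    double sum = 0.0;
    for (size_t l = 0; l < LANES; ++l) /* reduce the accumulator lanes */
        sum += acc[l];
    for (; i < m; ++i)                 /* scalar tail, fewer than 8 rows */
        sum += col[i] * x[i];
    return sum;
}

/* y += alpha * A^T * x for a column-major m x n matrix A. */
static void gemv_t(size_t m, size_t n, double alpha,
                   const double *a, size_t lda,
                   const double *x, double *y)
{
    assert(lda >= m);
    size_t j = 0;
    for (; j + 8 <= n; j += 8)         /* primary path: 8 columns at a time */
        for (size_t c = 0; c < 8; ++c)
            y[j + c] += alpha * col_dot(a + (j + c) * lda, x, m);
    for (; j < n; ++j)                 /* fringe path: 1 to 7 columns */
        y[j] += alpha * col_dot(a + j * lda, x, m);
}
```

A vectorized kernel would keep one accumulator register per column and load each chunk of x once for all 8 columns; the sketch computes the columns one at a time purely for readability.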

Arnav Sharma

25e59fcbb9

DGEMV Optimizations for NO_TRANSPOSE Cases

- Added AVX512-specific native DGEMV kernels for Zen4/5 architectures
  to handle the NO_TRANSPOSE cases. These kernels are independent of
  the AXPYF fused kernels.
- The following set of kernels biased towards the n-dimension perform
  beta scaling of y vector within the kernel itself and handle cases
  where n is less than 5:
    - bli_dgemv_n_zen_int_32x8n_avx512( ... )
    - bli_dgemv_n_zen_int_32x4n_avx512( ... )
    - bli_dgemv_n_zen_int_32x2n_avx512( ... )
    - bli_dgemv_n_zen_int_32x1n_avx512( ... )
- The bli_dgemv_n_zen_int_16mx8_avx512( ... ) kernel is biased towards
  the m-dimension; for this kernel, beta scaling is handled beforehand
  within the framework.
- Added unit tests for the new kernels.
- The AVX2 path for Zen1/2/3 architectures still follows the old
  approach of using the fused AXPYF kernel to perform the GEMV operation.

AMD-Internal: [CPUPL-5560]
Change-Id: I22bc2a865cd28b9cdcb383e17d1ff38bdd28de79

2024-12-12 10:26:50 -05:00
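
The two beta-scaling strategies mentioned in the commit above can be contrasted with a small scalar sketch. The names `scalv`, `gemv_n_prescaled`, and `gemv_n_fused_beta` are hypothetical, the loops are plain C rather than AVX512 code, and A is assumed to be column-major; treating beta == 0 as "overwrite y" follows the usual BLAS convention rather than anything stated in the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Framework-side pre-scaling: y := beta*y. */
static void scalv(size_t m, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
}

/* m-biased style: y is assumed to be scaled already, so the kernel
   only accumulates y += alpha * A * x, one column at a time. */
static void gemv_n_prescaled(size_t m, size_t n, double alpha,
                             const double *a, size_t lda,
                             const double *x, double *y)
{
    assert(lda >= m);
    for (size_t j = 0; j < n; ++j) {
        double ax = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += ax * a[i + j * lda];
    }
}

/* n-biased style: beta scaling is fused into the kernel, so y is
   read and written once per row. */
static void gemv_n_fused_beta(size_t m, size_t n, double alpha,
                              const double *a, size_t lda,
                              const double *x, double beta, double *y)
{
    assert(lda >= m);
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i + j * lda] * x[j];
        y[i] = ((beta == 0.0) ? 0.0 : beta * y[i]) + alpha * acc;
    }
}
```

Calling `scalv` followed by `gemv_n_prescaled` is meant to produce the same y as a single call to `gemv_n_fused_beta`, up to floating-point rounding order. The fused form avoids a separate pass over y, which plausibly matters most when n is small and y accounts for a large share of the memory traffic.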

3 Commits