Mirror of https://github.com/amd/blis.git (synced 2026-04-20 07:38:53 +00:00)
Minor updates to docs, Makefiles.
Details:
- Changed all occurrences of
    micro-kernel -> microkernel
    macro-kernel -> macrokernel
    micro-panel  -> micropanel
  in all markdown documents in 'docs' directory. This change is being
  made since we've reached the point in adoption and acceptance of
  BLIS's insights where words such as "microkernel" are no longer new,
  and therefore now merit being unhyphenated.
- Updated "Implementation Notes" sections of KernelsHowTo.md, which
  still contained references to nonexistent cpp macros such as
  BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?.
- Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
- Minor updates to Testsuite.md, including suggesting use of
  'make check' and 'make check-fast' when running from the local
  testsuite directory.
- Added a comment to top-level Makefile explaining the purpose behind
  the TESTSUITE_WRAPPER variable, which at first glance appears to serve
  no purpose.
@@ -72,7 +72,7 @@ void bli_cntx_init_fooarch( cntx_t* cntx )
// -------------------------------------------------------------------------
// Update the context with optimized native gemm micro-kernels and
// Update the context with optimized native gemm microkernels and
// their storage preferences.
bli_cntx_set_l3_nat_ukrs
(
@@ -143,9 +143,9 @@ _**Blocksize object array.**_ The `blkszs` array declaration is needed later in
_**Reference initialization.**_ The first function call, `bli_cntx_init_fooarch_ref()`, initializes the context `cntx` with function pointers to reference implementations of all of the kernels supported by BLIS (as well as cache and register blocksizes, and other fields). This function is automatically generated by BLIS for every sub-configuration enabled at configure-time. The function prototype is generated by a preprocessor macro in `frame/include/bli_arch_config.h`.
_**Level-3 micro-kernels.**_ The second function call is to a variable argument function, `bli_cntx_set_l3_nat_ukrs()`, which updates `cntx` with five optimized double-precision complex level-3 micro-kernels. The first argument encodes the number of individual kernels being registered into the context. Every subsequent line, except for the last line, is associated with the registration of a single kernel, and each of these lines is independent of one another and can occur in any order, provided that the kernel parameters of each line occur in the same order--kernel ID, followed by datatype, followed by function name, followed by storage preference boolean (i.e., whether the micro-kernel prefers row storage). The last argument of the function call is the address of the context being updated, `cntx`. Notice that we are registering micro-kernels written for another type of hardware, `bararch`, because in our hypothetical universe `bararch` is very similar to `fooarch` and so we recycle the code between the two configurations. After the function returns, the context contains pointers to optimized double-precision level-3 real micro-kernels. Note that the context will still contain reference micro-kernels for single-precision real and complex, and double-precision complex computation, as those kernels were not updated.
_**Level-3 microkernels.**_ The second function call is to a variable argument function, `bli_cntx_set_l3_nat_ukrs()`, which updates `cntx` with five optimized double-precision complex level-3 microkernels. The first argument encodes the number of individual kernels being registered into the context. Every subsequent line, except for the last line, is associated with the registration of a single kernel, and each of these lines is independent of one another and can occur in any order, provided that the kernel parameters of each line occur in the same order--kernel ID, followed by datatype, followed by function name, followed by storage preference boolean (i.e., whether the microkernel prefers row storage). The last argument of the function call is the address of the context being updated, `cntx`. Notice that we are registering microkernels written for another type of hardware, `bararch`, because in our hypothetical universe `bararch` is very similar to `fooarch` and so we recycle the code between the two configurations. After the function returns, the context contains pointers to optimized double-precision level-3 real microkernels. Note that the context will still contain reference microkernels for single-precision real and complex, and double-precision complex computation, as those kernels were not updated.
_Note:_ Currently, BLIS only allows the kernel developer to signal a preference (row or column) for `gemm` micro-kernels. The preference of the `gemmtrsm` and `trsm` micro-kernels can (and must) be set, but are ignored by the framework during execution.
_Note:_ Currently, BLIS only allows the kernel developer to signal a preference (row or column) for `gemm` microkernels. The preference of the `gemmtrsm` and `trsm` microkernels can (and must) be set, but are ignored by the framework during execution.
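The registration pattern described above (a count, then one tuple per kernel, then the context address as the final argument) can be sketched with `<stdarg.h>`. This is a simplified, hypothetical illustration and not the actual implementation of `bli_cntx_set_l3_nat_ukrs()`: every `example_`/`EXAMPLE_` name, the context layout, and the fixed registration limit are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kernel IDs and datatype IDs (not BLIS's real enums). */
enum { EXAMPLE_GEMM_UKR, EXAMPLE_TRSM_L_UKR, EXAMPLE_NUM_UKRS };
enum { EXAMPLE_FLOAT, EXAMPLE_DOUBLE, EXAMPLE_NUM_TYPES };

typedef void (*example_ukr_fp)( void );

/* Hypothetical context: one function pointer and one storage preference
   per (kernel ID, datatype) pair. */
typedef struct
{
    example_ukr_fp ukr[ EXAMPLE_NUM_UKRS ][ EXAMPLE_NUM_TYPES ];
    bool           row_pref[ EXAMPLE_NUM_UKRS ][ EXAMPLE_NUM_TYPES ];
} example_cntx_t;

#define EXAMPLE_MAX_REGS 16

/* Stub kernels standing in for optimized microkernels. */
void example_dgemm_stub( void ) {}
void example_dtrsm_stub( void ) {}

/* Arguments: n_ukrs, then n_ukrs tuples of (kernel ID, datatype, function
   pointer, row-preference boolean), then the address of the context. */
void example_set_ukrs( int n_ukrs, ... )
{
    int            id[ EXAMPLE_MAX_REGS ], dt[ EXAMPLE_MAX_REGS ];
    int            pref[ EXAMPLE_MAX_REGS ];
    example_ukr_fp fp[ EXAMPLE_MAX_REGS ];
    va_list        args;

    assert( n_ukrs >= 0 && n_ukrs <= EXAMPLE_MAX_REGS );

    /* The context comes last, so the tuples are buffered first. */
    va_start( args, n_ukrs );
    for ( int i = 0; i < n_ukrs; ++i )
    {
        id[ i ]   = va_arg( args, int );
        dt[ i ]   = va_arg( args, int );
        fp[ i ]   = va_arg( args, example_ukr_fp );
        pref[ i ] = va_arg( args, int ); /* bool is promoted to int */
    }
    example_cntx_t* cntx = va_arg( args, example_cntx_t* );
    va_end( args );

    /* Only the registered slots are overwritten; all others keep their
       previous (e.g. reference) kernels. */
    for ( int i = 0; i < n_ukrs; ++i )
    {
        cntx->ukr[ id[ i ] ][ dt[ i ] ]      = fp[ i ];
        cntx->row_pref[ id[ i ] ][ dt[ i ] ] = ( pref[ i ] != 0 );
    }
}
```

Because each tuple carries its own kernel ID and datatype, the tuples can appear in any order, which mirrors the property described in the text.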
_**Level-1m (packm) kernels.**_ The third function call is to another variable argument function, `bli_cntx_set_packm_kers()`. This function works very similar to `bli_cntx_set_l3_nat_ukrs()`, except that it expects a different set of kernel IDs (because now we are registering level-1m kernels) and it does not take a storage preference boolean. After this function returns, `cntx` contains function pointers to optimized double-precision real `packm` kernels. These kernels, like the level-3 kernels previously, are also borrowed from the `bararch` kernel set. Unregistered `packm` kernels will continue to point to reference code.
@@ -155,7 +155,7 @@ _**Level-1v kernels.**_ The fourth function call is to `bli_cntx_set_l1v_kers()`
For a complete list of kernel IDs, please see the definitions of `l3ukr_t`, `l1mkr_t`, `l1fkr_t`, `l1vkr_t` in [frame/include/bli_type_defs.h](https://github.com/flame/blis/blob/master/frame/include/bli_type_defs.h).
_**Setting blocksizes.**_ The next block of code initializes the `blkszs` array with register and cache blocksize values for each datatype. The values here are used by the level-3 operations that employ the level-3 micro-kernels we registered previously. We use `bli_blksz_init_easy()` when initializing only the primary value. If the auxiliary value needs to be set to a different value that the primary, `bli_blksz_init()` should be used instead, as in:
_**Setting blocksizes.**_ The next block of code initializes the `blkszs` array with register and cache blocksize values for each datatype. The values here are used by the level-3 operations that employ the level-3 microkernels we registered previously. We use `bli_blksz_init_easy()` when initializing only the primary value. If the auxiliary value needs to be set to a different value that the primary, `bli_blksz_init()` should be used instead, as in:
```c
// s d c z
bli_blksz_init_easy( &blkszs[ BLIS_MR ], 0, 8, 0, 0 );
@@ -170,7 +170,7 @@ Here, we use `bli_blksz_init()` to set different auxiliary (maximum) cache block
Note that we set level-3 blocksizes even for datatypes that retain reference code kernels; however, by passing in `0` for those blocksizes, we indicate to `bli_blksz_init()` and `bli_blksz_init_easy()` that the current value should be left untouched. In the example above, this leaves the blocksizes associated with the reference kernels (set by `bli_cntx_init_fooarch_ref()`) intact for the single real, single complex, and double complex datatypes.
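The "zero means leave untouched" convention can be illustrated with a small sketch. The type `example_blksz_t` and the function `example_blksz_init_easy()` below are hypothetical stand-ins written for this illustration; they are not the BLIS definitions of `blksz_t` or `bli_blksz_init_easy()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical blocksize object: one value per datatype, in s, d, c, z order. */
typedef struct
{
    size_t v[ 4 ];
} example_blksz_t;

/* Overwrite only the entries whose new value is nonzero. */
void example_blksz_init_easy( example_blksz_t* b,
                              size_t s, size_t d, size_t c, size_t z )
{
    const size_t vals[ 4 ] = { s, d, c, z };

    assert( b != NULL );

    for ( int i = 0; i < 4; ++i )
    {
        /* A value of 0 signals "keep whatever is already stored". */
        if ( vals[ i ] != 0 ) b->v[ i ] = vals[ i ];
    }
}
```

Starting from values left behind by a reference initialization, a call such as `example_blksz_init_easy( &b, 0, 8, 0, 0 )` would change only the double-precision real entry.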
_Digression:_ Auxiliary blocksize values for register blocksizes are interpreted as the "packing" register blocksizes. _PACKMR_ and _PACKNR_ serve as "leading dimensions" of the packed micro-panels that are passed into the micro-kernel. Oftentimes, _PACKMR = MR_ and _PACKNR = NR_, and thus the developer does not typically need to set these values manually. (See the [implementation notes for gemm](KernelsHowTo.md#Implementation_Notes_for_gemm) in the BLIS Kernel guide for more details on these topics.)
_Digression:_ Auxiliary blocksize values for register blocksizes are interpreted as the "packing" register blocksizes. _PACKMR_ and _PACKNR_ serve as "leading dimensions" of the packed micropanels that are passed into the microkernel. Oftentimes, _PACKMR = MR_ and _PACKNR = NR_, and thus the developer does not typically need to set these values manually. (See the [implementation notes for gemm](KernelsHowTo.md#Implementation_Notes_for_gemm) in the BLIS Kernel guide for more details on these topics.)
_Digression:_ Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the _maximum_ size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is _not_ merged and instead it is computed upon in separate (final) iteration.
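The merging rule described in this digression can be sketched as a small helper. This is a hypothetical illustration rather than BLIS's own code: the function names and the parameters `b_alg` (the preferred cache blocksize) and `b_max` (the maximum cache blocksize) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): choose the size of the next
   iteration, given how much of the dimension is still left. */
size_t example_next_blocksize( size_t dim_left, size_t b_alg, size_t b_max )
{
    assert( b_alg > 0 && b_alg <= b_max );

    /* If everything that remains fits within the maximum blocksize, take it
       all at once. This merges a small edge case into the final iteration. */
    if ( dim_left <= b_max ) return dim_left;

    /* Otherwise, take one block of the preferred size. */
    return b_alg;
}

/* Hypothetical helper: count the iterations needed to cover a dimension. */
size_t example_count_iterations( size_t dim, size_t b_alg, size_t b_max )
{
    size_t n_iter = 0;

    for ( size_t i = 0; i < dim;
          i += example_next_blocksize( dim - i, b_alg, b_max ) )
    {
        ++n_iter;
    }

    return n_iter;
}
```

With a dimension of 1030, a preferred blocksize of 256, and a maximum of 320, the trailing 6 elements are absorbed into a final iteration of size 262 instead of running as a separate iteration of size 6. If the maximum is set equal to the preferred blocksize, no merging occurs and the edge case is computed in its own final iteration.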
@@ -219,7 +219,7 @@ These macros are used in computing the maximum amount of temporary storage (typi
```c
#define BLIS_STACK_BUF_MAX_SIZE ( BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE * 2 )
```
These temporary buffers are used when handling edge cases (m % _MR_ != 0 || n % _NR_ != 0) within the level-3 macro-kernels, and also in the virtual micro-kernels of various implementations of induced methods for complex matrix multiplication. It is **very important** that these values be set correctly; otherwise, you may experience undefined behavior as stack data is overwritten at run-time. A kernel developer may set `BLIS_SIMD_NUM_REGISTERS` and `BLIS_SIMD_SIZE`, which will indirectly affect `BLIS_STACK_BUF_MAX_SIZE`, or he may set `BLIS_STACK_BUF_MAX_SIZE` directly. Notice that the default values are already set to work with modern x86_64 systems.
These temporary buffers are used when handling edge cases (m % _MR_ != 0 || n % _NR_ != 0) within the level-3 macrokernels, and also in the virtual microkernels of various implementations of induced methods for complex matrix multiplication. It is **very important** that these values be set correctly; otherwise, you may experience undefined behavior as stack data is overwritten at run-time. A kernel developer may set `BLIS_SIMD_NUM_REGISTERS` and `BLIS_SIMD_SIZE`, which will indirectly affect `BLIS_STACK_BUF_MAX_SIZE`, or he may set `BLIS_STACK_BUF_MAX_SIZE` directly. Notice that the default values are already set to work with modern x86_64 systems.
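As a rough sanity check, the edge-case condition and the buffer-size formula above can be combined in a short sketch. The register count and vector size below are assumed example values, the `EXAMPLE_` macros are hypothetical stand-ins for the `BLIS_` macros, and the idea that the buffer must hold one _MR_ x _NR_ tile is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed example values; real values depend on the target hardware. */
#define EXAMPLE_SIMD_NUM_REGISTERS 32
#define EXAMPLE_SIMD_SIZE          64 /* bytes per vector register */
#define EXAMPLE_STACK_BUF_MAX_SIZE \
    ( EXAMPLE_SIMD_NUM_REGISTERS * EXAMPLE_SIMD_SIZE * 2 )

/* Returns 1 if an m x n block is an edge case for an mr x nr microkernel. */
int example_is_edge_case( size_t m, size_t n, size_t mr, size_t nr )
{
    assert( mr > 0 && nr > 0 );
    return ( m % mr != 0 || n % nr != 0 );
}

/* Returns 1 if a temporary mr x nr tile of elem_size-byte elements fits
   within the (hypothetical) maximum stack buffer size. */
int example_tile_fits( size_t mr, size_t nr, size_t elem_size )
{
    return mr * nr * elem_size <= (size_t)EXAMPLE_STACK_BUF_MAX_SIZE;
}
```

Under these assumed values the buffer is 4096 bytes, so a 24x8 tile of doubles (1536 bytes) fits, while a 32x32 tile of doubles (8192 bytes) would not.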
_**Memory alignment.**_ BLIS implements memory alignment internally, rather than relying on a function such as `posix_memalign()`, and thus it can provide aligned memory even with functions that adhere to the `malloc()` and `free()` API in the standard C library.
```c
@@ -231,7 +231,7 @@ _**Memory alignment.**_ BLIS implements memory alignment internally, rather than
#define BLIS_HEAP_STRIDE_ALIGN_SIZE BLIS_SIMD_ALIGN_SIZE
#define BLIS_POOL_ADDR_ALIGN_SIZE BLIS_PAGE_SIZE
```
The value `BLIS_STACK_BUF_ALIGN_SIZE` defines the alignment of stack memory used as temporary internal buffers, such as for output matrices to the micro-kernel when computing edge cases. (See [implementation notes](KernelsHowTo#implementation-notes-for-gemm) for the `gemm` micro-kernel for details.) This value defaults to `BLIS_SIMD_ALIGN_SIZE`, which defaults to `BLIS_SIMD_SIZE`.
The value `BLIS_STACK_BUF_ALIGN_SIZE` defines the alignment of stack memory used as temporary internal buffers, such as for output matrices to the microkernel when computing edge cases. (See [implementation notes](KernelsHowTo#implementation-notes-for-gemm) for the `gemm` microkernel for details.) This value defaults to `BLIS_SIMD_ALIGN_SIZE`, which defaults to `BLIS_SIMD_SIZE`.
The value `BLIS_HEAP_ADDR_ALIGN_SIZE` defines the alignment used when allocating memory via the `malloc()` function defined by `BLIS_MALLOC_USER`. Setting this value to `BLIS_SIMD_ALIGN_SIZE` may speed up certain level-1v and -1f kernels.
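One common way to obtain aligned memory from a plain `malloc()`/`free()` pair is to over-allocate, round the address up, and stash the original pointer just before the aligned region. The sketch below shows that general technique only; it is not taken from BLIS, and `example_malloc_align()` and `example_free_align()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns memory aligned to `align` bytes, or NULL on failure. `align` must
   be a power of two and at least sizeof(void*). */
void* example_malloc_align( size_t size, size_t align )
{
    if ( align < sizeof( void* ) || ( align & ( align - 1 ) ) != 0 )
        return NULL;

    /* Guard against overflow in the padded request size. */
    if ( size > SIZE_MAX - align - sizeof( void* ) )
        return NULL;

    /* Over-allocate: room for the stashed pointer plus alignment padding. */
    unsigned char* raw = malloc( size + align + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    /* Round up to the next multiple of align past the stashed pointer. */
    uintptr_t start   = (uintptr_t)( raw + sizeof( void* ) );
    uintptr_t aligned = ( start + align - 1 ) & ~(uintptr_t)( align - 1 );
    assert( aligned % align == 0 );

    /* Remember the address malloc() returned so it can be freed later. */
    ( (void**)aligned )[ -1 ] = raw;

    return (void*)aligned;
}

/* Releases memory obtained from example_malloc_align(). */
void example_free_align( void* p )
{
    if ( p != NULL ) free( ( (void**)p )[ -1 ] );
}
```

Because only `malloc()` and `free()` are called, the same approach would work with any user-supplied pair of functions that follow that API.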
@@ -538,9 +538,9 @@ Adding support for a new set of kernels in BLIS is easy and can be done via the
AXPYV_KER_PROT( float, s, axpyv_knl_asm )
DOTXV_KER_PROT( float, s, dotxv_knl_asm )
```
The first line generates a function prototype for a double-precision real `gemm` micro-kernel named `bli_dgemm_knl_asm_24x8()`. Notice how the macro takes three arguments: the C language datatype, the single character corresponding to the datatype, and the base name of the function, which includes the operation (`gemm`), the kernel set name (`knl`), and a substring specifying its implementation (`asm_24x8`).
The first line generates a function prototype for a double-precision real `gemm` microkernel named `bli_dgemm_knl_asm_24x8()`. Notice how the macro takes three arguments: the C language datatype, the single character corresponding to the datatype, and the base name of the function, which includes the operation (`gemm`), the kernel set name (`knl`), and a substring specifying its implementation (`asm_24x8`).
The second and third lines generate prototypes for double-precision real `packm` kernels to go along with the `gemm` micro-kernel above. The fourth and fifth lines generate prototypes for double-precision complex instances of the level-1f kernels `axpyf` and `dotxf`. The last two lines generate prototypes for single-precision real instances of the level-1v kernels `axpyv` and `dotxv`.
The second and third lines generate prototypes for double-precision real `packm` kernels to go along with the `gemm` microkernel above. The fourth and fifth lines generate prototypes for double-precision complex instances of the level-1f kernels `axpyf` and `dotxf`. The last two lines generate prototypes for single-precision real instances of the level-1v kernels `axpyv` and `dotxv`.
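To make the naming scheme concrete, here is a minimal sketch of how a prototype-generating macro of this kind could be written with token pasting. The macro `EXAMPLE_AXPYV_KER_PROT`, the kernel name `axpyv_example_ref`, and the simplified argument list are all hypothetical; the real BLIS macros and kernel signatures differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prototype macro: pastes "bli_", the datatype character, and
   the base name together to form the function name. */
#define EXAMPLE_AXPYV_KER_PROT( ctype, ch, kername ) \
    void bli_##ch##kername( size_t n, const ctype* alpha, \
                            const ctype* x, ctype* y );

/* Expands to a prototype for bli_saxpyv_example_ref(). */
EXAMPLE_AXPYV_KER_PROT( float, s, axpyv_example_ref )

/* A plain C definition matching the generated prototype: y := y + alpha * x. */
void bli_saxpyv_example_ref( size_t n, const float* alpha,
                             const float* x, float* y )
{
    assert( alpha != NULL && x != NULL && y != NULL );

    for ( size_t i = 0; i < n; ++i )
        y[ i ] += ( *alpha ) * x[ i ];
}
```

The macro body ends with its own semicolon, which is why the invocation line, like those in the listing above, does not need one.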