Refer to color mm algorithm in Multithreading.md.

Field G. Van Zee
2018-12-04 13:30:25 -06:00
parent 22384fd2b7
commit 9b688a2d69

@@ -104,7 +104,7 @@ Next, which combinations of loops to parallelize depends on which caches are sha
* For compute resources that have private L2 caches but that share an L3 cache (example: cores on a socket), try parallelizing the `IC` loop. In this situation, threads will share the same packed row panel from matrix B, but pack and compute with different blocks of matrix A.
* If compute resources share an L2 cache but have private L1 caches (example: pairs of cores), try parallelizing the `JR` loop. Here, threads share the same packed block of matrix A but read different packed micro-panels of B into their private L1 caches. In some situations, parallelizing the `IR` loop may also be effective.
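The two guidelines above can be sketched as a small calculation. This is only an illustration, not part of BLIS: `suggest_ways` and `as_exports` are hypothetical helpers, and the sketch assumes a single socket whose cores all share one L3 cache and are grouped evenly into L2-sharing clusters. The names `BLIS_IC_NT` and `BLIS_JR_NT` are the environment variables BLIS reads for the `IC` and `JR` loops.

```python
def suggest_ways(cores_sharing_l3: int, cores_sharing_l2: int) -> dict:
    """Hypothetical helper: map cache sharing to ways of parallelism.

    cores_sharing_l3: number of cores that share one L3 cache (e.g. a socket).
    cores_sharing_l2: number of cores that share one L2 cache (1 if private).
    """
    if cores_sharing_l3 < 1 or cores_sharing_l2 < 1:
        raise ValueError("core counts must be positive")
    if cores_sharing_l3 % cores_sharing_l2 != 0:
        raise ValueError("L2 groups must evenly divide the L3 group")

    # Cores that share an L2 cache split the JR loop: they reuse the same
    # packed block of A and read different micro-panels of B.
    jr_ways = cores_sharing_l2
    # Groups with separate L2 caches under a shared L3 split the IC loop:
    # they share the packed row panel of B and work on different blocks of A.
    ic_ways = cores_sharing_l3 // cores_sharing_l2
    return {"BLIS_IC_NT": ic_ways, "BLIS_JR_NT": jr_ways}


def as_exports(ways: dict) -> str:
    """Format the suggestion as shell export lines."""
    return "\n".join(f"export {name}={value}" for name, value in ways.items())


if __name__ == "__main__":
    # Assumed example: 8 cores share an L3, and pairs of cores share an L2.
    print(as_exports(suggest_ways(8, 2)))
```

Treat the result as a starting point for experimentation; the best combination still depends on the problem shape and the hardware.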
-![The primary algorithm for level-3 operations in BLIS](http://www.cs.utexas.edu/users/field/mm_algorithm.png)
+![The primary algorithm for level-3 operations in BLIS](http://www.cs.utexas.edu/users/field/mm_algorithm_color.png)
## Globally at runtime