* reuse local prefetch logic from compute v4 pipeline add single-tile test explicit lambda capture reuse lds block descriptors from base policy for the transposed case match the test case kernel configuration with compute v4 * add comments