Commit Graph

6 Commits

Author SHA1 Message Date
Brock Hargreaves
26d29374e5 [rocm-libraries] ROCm/rocm-libraries#5213 (commit 9f7e62c)
[CK] Fix warp tile combination selection in absence of a GPU
 (#5213)

## Motivation

The `get_gpu_name_by_id()` function in
`gemm_streamk_validation_utils.py` relies on `rocminfo` to detect the
GPU architecture at runtime. However, __`rocminfo` fails in CI/build
environments__ where:

- No physical GPU is present
- ROCm tools are not installed
- The build is running in a container without GPU access

In any of these environments, kernel validation goes wrong and the build
generates kernels whose template instantiations do not exist:

```
[composable_kernel] FAILED: test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o
[composable_kernel] /__w/TheRock/TheRock/build/core/clr/dist/lib/llvm/bin/clang++ -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_ENABLE_TF32 -DCK_TILE_USE_WMMA=0 -DCK_TIME_KERNEL=1 -DCK_USE_FNUZ_FP8 -DCK_USE_GFX94 -DCK_USE_XDL -DDPP_KERNELS -DGEMM_SINGLE_INSTANCE_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp\" -DGEMM_TEST_PARAMS_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/test_params.hpp\" -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/profiler/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/library/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include -I/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/include -I/__w/TheRock/TheRock/build/profiler/rocprofiler-sdk/stage/include -I/__w/TheRock/TheRock/build/profiler/roctracer/stage/include -I/__w/TheRock/TheRock/build/base/half/stage/include -I/__w/TheRock/TheRock/build/third-party/sysdeps/linux/libdrm/build/stage/lib/rocm_sysdeps/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx942   -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored 
-Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -Wno-undefined-func-template -Wno-float-equal --offload-compress -include /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp -MD -MT test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -MF 
test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o.d -o test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -x hip -c /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/test_gemm_streamk_simple.cpp
[composable_kernel] In file included from <built-in>:2:
[composable_kernel] In file included from /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp:9:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm.hpp:23:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1.hpp:7:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1_default_policy.hpp:8:
[composable_kernel] /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp:185:1: error: implicit instantiation of undefined template 'ck_tile::impl::warp_gemm_dispatcher::Dispatcher<_Float16, _Float16, float, 16, 16, 8, false, false, false, ck_tile::WGAttrNumAccessEnum::Single, ck_tile::WGAttrNumAccessEnum::Single>'
```

## Technical Details

### Changes Made:

#### 1. __gemm_streamk_validation_utils.py__

- Added module-level storage: `_configured_gpu_targets`

- Added `set_gpu_targets(targets: List[str])` to configure fallback GPU
targets

- Added `get_configured_gpu_targets() -> List[str]` to retrieve
configured targets

- Enhanced `get_gpu_name_by_id()` to:

  - First try `rocminfo` (existing behavior)
  - If `rocminfo` fails, fall back to first configured GPU target
  - Extract base gfx name (e.g., "gfx90a" from "gfx90a:xnack+")
  - Log debug messages when using fallback

#### 2. __gemm_streamk_instance_builder.py__

- Added `--gpu_targets` command-line argument
- Automatically calls `set_gpu_targets()` when `--gpu_targets` is
provided
- Parses semicolon-separated GPU target list from CMake
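
The argument handling described above could look roughly like the sketch below. Only the `--gpu_targets` flag name and the semicolon-separated format come from this description; `parse_gpu_targets`, `build_parser`, and `get_targets` are hypothetical names, and the real script may structure its parser differently.

```python
import argparse
from typing import List, Optional, Sequence


def parse_gpu_targets(raw: str) -> List[str]:
    """Split a CMake-style list ("a;b;c") into stripped, non-empty entries."""
    return [item.strip() for item in raw.split(";") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag name is taken from the description.
    parser = argparse.ArgumentParser(description="instance builder (sketch)")
    parser.add_argument(
        "--gpu_targets",
        type=parse_gpu_targets,  # applied to the raw command-line string
        default=[],              # no fallback targets unless CMake passes some
        help='semicolon-separated targets, e.g. "gfx90a:xnack+;gfx942"',
    )
    return parser


def get_targets(argv: Optional[Sequence[str]] = None) -> List[str]:
    """Parse argv (or sys.argv when None) and return the target list."""
    return build_parser().parse_args(argv).gpu_targets
```

For example, `get_targets(["--gpu_targets", "gfx90a:xnack+;gfx942"])` yields the two targets as separate list entries, which a caller could then hand to `set_gpu_targets()`.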

#### 3. __test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt__

- Modified both `--list_kernels` and `--gen_single` invocations to pass
`--gpu_targets "${SUPPORTED_GPU_TARGETS}"`
- GPU targets are now automatically wired from CMake to Python scripts

### How It Works:

1. __CMake Configuration__: `SUPPORTED_GPU_TARGETS` is determined from
`GPU_TARGETS` or defaults
2. __CMake → Python__: CMake passes targets via `--gpu_targets` argument
to Python scripts
3. __Python Configuration__: Scripts call `set_gpu_targets()` to
configure the fallback
4. __Fallback Mechanism__: When `rocminfo` fails, `get_gpu_name_by_id()`
uses the first configured target
5. __Target Parsing__: Extracts clean gfx name (e.g., "gfx90a" from
"gfx90a:xnack+")
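
The fallback flow in steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the code from the PR: `resolve_gpu_name`, `probe_rocminfo`, and `base_gfx_name` are hypothetical names, the `gpu_id` parameter of the real `get_gpu_name_by_id()` is omitted, and the way `rocminfo` output is scanned for a gfx name is an assumption.

```python
import re
import shutil
import subprocess
from typing import Callable, List, Optional

# Module-level storage for fallback targets, as described in the text.
_configured_gpu_targets: List[str] = []


def set_gpu_targets(targets: List[str]) -> None:
    """Store the fallback targets, dropping empty entries."""
    global _configured_gpu_targets
    _configured_gpu_targets = [t.strip() for t in targets if t.strip()]


def base_gfx_name(target: str) -> str:
    """Drop feature suffixes: "gfx90a:xnack+" -> "gfx90a"."""
    return target.split(":", 1)[0].strip()


def probe_rocminfo() -> Optional[str]:
    """Return the first gfx name found in rocminfo output, or None on failure."""
    if shutil.which("rocminfo") is None:
        return None  # ROCm tools are not installed
    try:
        result = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, timeout=30, check=True
        )
    except (OSError, subprocess.SubprocessError):
        return None  # no GPU access, non-zero exit, or timeout
    # Assumption: the architecture appears in the output as a gfx token.
    match = re.search(r"\bgfx[0-9a-z]+\b", result.stdout)
    return match.group(0) if match else None


def resolve_gpu_name(
    probe: Callable[[], Optional[str]] = probe_rocminfo,
) -> Optional[str]:
    """Prefer the runtime probe; otherwise use the first configured target."""
    detected = probe()
    if detected:
        return detected
    if _configured_gpu_targets:
        return base_gfx_name(_configured_gpu_targets[0])
    return None
```

Passing the probe as a parameter is a choice made for this sketch so that the fallback path can be exercised on a machine without a GPU, for example with `resolve_gpu_name(probe=lambda: None)`.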

## Test Plan

Confirm that only the appropriate kernels are selected and that CI
passes.

## Test Result

1. Waiting on CI
2. Compilation succeeded locally and the kernel list does not contain
the 16x16x8 kernel for gfx942 anymore:

```
(.venv) bhargrea@ctr-cx66-mi300x-02:~/github/TheRock$ cat build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_kernel_list.txt
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x16|256x256x32_2x2x1_16x16x16|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x16|256x256x32_2x2x1_16x16x16|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x32|256x256x32_2x2x1_16x16x32|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x32|256x256x32_2x2x1_16x16x32|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x8|256x256x32_2x2x1_32x32x8|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x8|256x256x32_2x2x1_32x32x8|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x16|256x256x32_2x2x1_32x32x16|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x16|256x256x32_2x2x1_32x32x16|compv3_cshuffle_intrawave_atomic_False_False_False_False
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 23:12:59 +00:00
Brock Hargreaves
51537eb189 [rocm-libraries] ROCm/rocm-libraries#5165 (commit 8df295c)
[CK] Streamk tile engine test not setting a reasonable
 CU_COUNT default when the query fails (#5165)

## Motivation

The following error occurred when compiling on Windows, when the
`generate_configs.py` script tries to query the GPU for the number of CUs:

```
[composable_kernel configure] -- Generating Stream-K test config files for fp16
[composable_kernel configure] Traceback (most recent call last):
[composable_kernel configure]   File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 277, in <module>
[composable_kernel configure]     main()
[composable_kernel configure]     ~~~~^^
[composable_kernel configure]   File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 271, in main
[composable_kernel configure]     cu_count, configs_dir_path, tile_sizes, datatype = get_args()
[composable_kernel configure]                                                        ~~~~~~~~^^
[composable_kernel configure]   File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 267, in get_args
[composable_kernel configure]     return (int(args.cu_count), args.configs_dir_path, args.tiles, args.datatype)
[composable_kernel configure]             ~~~^^^^^^^^^^^^^^^
[composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n'
[composable_kernel configure] CMake Error at test/ck_tile/gemm_streamk_tile_engine/generate_configs.cmake:98 (message):
[composable_kernel configure]   Eror occured during execution of
[composable_kernel configure]   E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/generate_configs.py
[composable_kernel configure] Call Stack (most recent call first):
[composable_kernel configure]   test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt:301 (generate_test_configs)
[composable_kernel configure]
[composable_kernel configure]
[composable_kernel configure] -- Configuring incomplete, errors occurred!
[composable_kernel configure FAILED WITH CODE 1 in 41 seconds]
ninja: build stopped: subcommand failed.
```

## Technical Details

There was one major problem in the following code and two changes were
made:

```
        execute_process(
            COMMAND ${CPP_EXE_PATH}
            OUTPUT_STRIP_TRAILING_WHITESPACE
            ERROR_VARIABLE standard_error
            RESULT_VARIABLE queried_cu_count
        )

        if (standard_error)
            message(STATUS "Error information from attempting to query HIP device and properties:\n"
                            "${standard_error}")
        endif()
```

1. `RESULT_VARIABLE` does not capture the standard output of the
executable, but rather its exit code. The error output shows that the
script was trying to convert "Exit code 0xc0000135\n" to an integer. I
fixed this by changing `RESULT_VARIABLE` to `OUTPUT_VARIABLE`.

```
[composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n'
```

Note that this also gives us the reason that the query failed: Exit code
0xc0000135, which needs to be addressed in a separate issue: "Exit code
0xc0000135, also seen as -1073741515, is a Windows error indicating that
an application failed to start because a required Dynamic Link Library
(DLL) file or a system component like the .NET Framework is missing or
corrupted"

It's likely that the executable built by the following code can't find
the HIP DLL, or something similar:

```
        set(CPP_FILE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cu_count.cpp)
        set(CPP_EXE_PATH ${CMAKE_CURRENT_BINARY_DIR}/cu_count)

        execute_process(
            COMMAND ${CMAKE_HIP_COMPILER} -x hip ${CPP_FILE_PATH} -o ${CPP_EXE_PATH}
            RESULT_VARIABLE compile_result
        )
```

2. For clarity and consistency purposes, I changed the check afterwards
to explicitly look for a non-zero exit code. This matches previous
checks in the cmake file. I also added improved error checking when the
query for the cu count fails.
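
The distinction between an exit code and captured output, and the idea of falling back to a default when the query fails, can be illustrated in Python with `subprocess`. This is a sketch of the same idea rather than the CMake change itself: `parse_cu_count`, `query_cu_count`, and `as_signed_32` are hypothetical helpers, and the default value is whatever the caller passes in.

```python
import subprocess
from typing import List, Optional


def parse_cu_count(exit_code: int, stdout: str) -> Optional[int]:
    """Accept the output only if the process succeeded and printed a positive int."""
    if exit_code != 0:
        return None
    try:
        value = int(stdout.strip())
    except ValueError:
        return None  # e.g. "Exit code 0xc0000135\n" is not a CU count
    return value if value > 0 else None


def query_cu_count(command: List[str], default: int) -> int:
    """Run the query executable; fall back to `default` on any failure."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return default  # executable missing, not runnable, or timed out
    # returncode is the exit status; stdout is what the program printed.
    count = parse_cu_count(result.returncode, result.stdout)
    return default if count is None else count


def as_signed_32(code: int) -> int:
    """Reinterpret an unsigned 32-bit exit code as a signed value."""
    return code - (1 << 32) if code >= (1 << 31) else code
```

With these helpers, `as_signed_32(0xC0000135)` gives `-1073741515`, the signed form of the Windows exit code quoted above, and `parse_cu_count` rejects the string that previously reached `int()`.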

## Test Plan

Ensure it compiles locally and existing CI isn't impacted.

## Test Result

Waiting on CI.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 17:53:08 +00:00
Emily Martins
f1746955fd [rocm-libraries] ROCm/rocm-libraries#4984 (commit 962b047)
[CK_TILE] Reduce Register Spills in Stream-K Reductions
 (#4984)

## Motivation

In CK Tile Stream-K, kernels using one of the two non-atomic reduction
strategies (i.e., linear, tree) have a high register spill count, with
the tree reduction generally being worse. These changes act as a first
step to help decrease the register spill count.

## Technical Details
### Problem 1: Unvectorized access to partials
In both the linear and tree reductions, workgroups write partial
results to a global buffer; another workgroup will later read this data.
When the initial logic to support reading and writing to the partials
buffer was added (see
https://github.com/ROCm/composable_kernel/pull/3107), the tile
distribution encoding used to read from and write to partials matched
the register layout for the accumulator of the mfma instruction used for
the kernel. Since we do not currently use the transposed register layout
for the accumulator, we end up with an encoding that is not optimized
for writing to HBM.

For example: Consider the register layout of the
`v_mfma_f32_16x16x32_fp8_fp8` instruction.
```bash
./matrix_calculator.py --architecture gfx942 --instruction  v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix
```
<img width="1113" height="537" alt="image"
src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02"
/>

The above shows that threads are responsible for consecutive elements
down a column of the C tile. If we use this distribution to read and
write to partials with C in row major, then threads are unable to
perform vectorized reads and writes. Note: thread 0 is shown in red and
thread 1 is shown in green.

Since the C-shuffle Epilogue only supports C in row major, reading and
writing to partials is highly unoptimized.
### Problem 2: Missed opportunity for SGPR use in tree reduction loop
Since the reduction occurs between workgroups, all threads in the
workgroup follow the same execution paths in the tree reduction logic,
hence various variables should be using SGPRs, but they are not.

### Implemented Solutions
1. Add a new tile distribution encoding that is optimized for accessing
partials in HBM. This encoding does not change the data assignment to
threads, it merely changes the addresses to which they write/read in the
partials buffer. For example, continuing with the
`v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result
in threads writing in the following layout:
<img width="517" height="342" alt="image"
src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202"
/>

This layout ensures that each thread writes along a row, enabling
`buffer_{store|load}_dwordx4` instructions (i.e., vectorized accesses).
This helps reduce register usage due to requiring fewer offset
calculations.

2. To force SGPR usage in the tree reduction loop, I make use of CK
Tile's `amd_wave_read_first_lane` which is a wrapper around
`__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the
tree reduction.
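
The vectorization argument behind solution 1 can be checked with a small offset calculation. The sketch below assumes a 16x16 fp32 C tile spread over 64 lanes, with each lane holding four consecutive rows of one column, which is how the text describes the accumulator layout. The remapped layout in `row_wise_offsets` is a hypothetical example of a row-contiguous assignment and is not taken from the actual encoding.

```python
from typing import List

TILE_M = TILE_N = 16   # 16x16 C tile, stored row major
VALUES_PER_LANE = 4    # 256 elements spread over 64 lanes
NUM_LANES = 64


def accumulator_offsets(lane: int) -> List[int]:
    """Row-major offsets when writes mirror the assumed accumulator layout."""
    col = lane % TILE_N
    first_row = VALUES_PER_LANE * (lane // TILE_N)
    return [(first_row + i) * TILE_N + col for i in range(VALUES_PER_LANE)]


def row_wise_offsets(lane: int) -> List[int]:
    """Hypothetical remap: each lane owns four adjacent elements of one row."""
    start = lane * VALUES_PER_LANE
    return [start + i for i in range(VALUES_PER_LANE)]


def is_contiguous(offsets: List[int]) -> bool:
    """True when the offsets form one unit-stride run (vectorizable access)."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


if __name__ == "__main__":
    for lane in (0, 1):
        print(lane, accumulator_offsets(lane), row_wise_offsets(lane))
```

Under these assumptions, lane 0 touches offsets 0, 16, 32, and 48 with the accumulator layout, so each element needs its own access, while the row-wise layout gives it offsets 0 to 3, which is one 16-byte span for fp32 and therefore fits a single `dwordx4` access.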

_These changes do not fully eliminate register spills. Future work will
aim to further reduce spills. But these changes make good progress._

## Test Plan

Added tests for different warp tile sizes to validate that the new
encoding works with different `WarpGemm` variants.

## Test Result

All tests pass locally on all gfx9 architectures.

Some results for decreases in register spills on gfx942 (BL = baseline):

| Kernel | SGPR Spill (BL) | SGPR Spill (new) | SGPR Delta | SGPR % | VGPR Spill (BL) | VGPR Spill (new) | VGPR Delta | VGPR % |
|--------|------------------:|------------------:|-----------:|-------:|-------------------:|------------------:|-----------:|-------:|
| fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 | 223 | 0 | -223 | -100.0% | 21 | 20 | -1 | -4.8% |
| fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 | 233 | 11 | -222 | -95.3% | 443 | 23 | -420 | -94.8% |
| fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 | 221 | 3 | -218 | -98.6% | 12 | 6 | -6 | -50.0% |
| fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 | 230 | 14 | -216 | -93.9% | 396 | 12 | -384 | -97.0% |
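
The Delta and % columns in the table above follow directly from the baseline and new spill counts. A short sketch of that arithmetic is shown here, with `spill_change` as a hypothetical helper name and the spill counts copied from the table rows:

```python
from typing import Dict, Tuple


def spill_change(baseline: int, new: int) -> Tuple[int, float]:
    """Return (delta, percent change) relative to the baseline spill count."""
    delta = new - baseline
    percent = round(100.0 * delta / baseline, 1)
    return delta, percent


# (SGPR baseline, SGPR new, VGPR baseline, VGPR new), copied from the table.
ROWS: Dict[str, Tuple[int, int, int, int]] = {
    "fp16 linear": (223, 0, 21, 20),
    "fp16 tree": (233, 11, 443, 23),
    "fp8 linear": (221, 3, 12, 6),
    "fp8 tree": (230, 14, 396, 12),
}

if __name__ == "__main__":
    for name, (s_bl, s_new, v_bl, v_new) in ROWS.items():
        print(name, spill_change(s_bl, s_new), spill_change(v_bl, v_new))
```

Each percentage is the delta divided by the baseline, so a kernel that drops from 443 to 23 VGPR spills has a delta of -420 and a change of about -94.8%.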

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 17:40:34 +00:00
Emily Martins
fc3180120e [rocm-libraries] ROCm/rocm-libraries#4756 (commit 79bc2ca)
[CK_TILE] Update Stream-K Reduction Strategy Enum

## Motivation

Currently, Stream-K has 3 reduction options: 1) atomics, 2) the
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulating partial
results of other contributing workgroups in the tile, which requires a
linear number of steps. Hence, for clarity, this work updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.

## Technical Details

Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic        = 0u,
    Reduction     = 1u,
    TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic = 0u,
    Linear = 1u,
    Tree   = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.
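
For tooling outside the C++ sources, the rename can be pictured as a name mapping that keeps the numeric values unchanged. The Python mirror below is purely illustrative and is not part of this change: the `IntEnum` class, the `LEGACY_NAMES` table, and `from_legacy_name` are hypothetical.

```python
from enum import IntEnum


class StreamKReductionStrategy(IntEnum):
    """Hypothetical Python mirror of the updated C++ enum (same values)."""
    Atomic = 0
    Linear = 1
    Tree = 2


# Old member names mapped onto the new ones; values 0, 1, 2 are unchanged.
LEGACY_NAMES = {
    "Atomic": StreamKReductionStrategy.Atomic,
    "Reduction": StreamKReductionStrategy.Linear,
    "TreeReduction": StreamKReductionStrategy.Tree,
}


def from_legacy_name(name: str) -> StreamKReductionStrategy:
    """Translate a pre-rename member name; raises KeyError if unknown."""
    return LEGACY_NAMES[name]
```

Because only the names change, `from_legacy_name("Reduction")` and the old `Reduction = 1u` refer to the same underlying value as the new `Linear` member.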
## Test Plan

No new functionality was added, so no new tests were added; I just
validated existing tests and examples.

## Test Result

All tests passed locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-24 06:41:15 +00:00
Emily Martins
8cbd09c84a [CK_TILE] Stream-K Tile Engine Test Config File Generation (#3662)
* Stream-K smoke test config file generation

This change converts the Stream-K smoke tests to use tile engine. Since
the m, n, and k values depend on the CU count of a device, the
configs are generated during the Configuration Phase.

* Compute GEMM reference on GPU

* Remove redundant Stream-K tests

Removing redundant tests that are now run via tile engine.

* Fix relative and absolute tolerance calculation

This change updates the Stream-K tile engine interface to ensure that
num_wgs_per_tile is propagated and passed into the compare_results
function to calculate the rel and abs tolerance. Before, split-k was
used, which is incorrect for Stream-K since the split-k value is
always 1.

* Cleanup imports, types, and other misc items

This commit makes the following changes:
- Uses Typing module for nested type hints
- Uses quotes around cu_count_arg argument in generate_configs.cmake in
  if statements
- Adds explicit include for tuple in test_gemm_streamk_simple.cpp
- Adds a type for the tiles argument in argparser to check argument
  validity

* Use CU count as return value for better parsing

* Add reduction tests for bf16, fp8, and bf8
2026-02-03 09:12:15 -07:00
arai713
b9bb1db5d9 Addition of Stream-K tests using Tile Engine (#3514)
* Addition of Stream-K tests using Tile Engine

This change adds an implementation for generating Stream-K tests using Tile Engine.
This will generate various test executables for different combinations based on the
config files. This addition has simple tests running for bf16 and fp16, with both
atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation
of Stream-K in Tile Engine.

* integrating addition of tree reduction and editing the README

* temporarily removing parallel and tree reduction from configs while bugs regarding them are being resolved
2026-01-22 12:53:52 -08:00