* ck-builder: tensor copy function
This function copies one tensor to another, so that the memory
layout can be changed between them.
* ck-builder: fix ck::bhalf literals
These types don't work properly.
* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it
This reduces the amount of duplicated code a bit.
* ck-builder: add flat tensor iterator
This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.
* ck-builder: integrate validation with ck gpu verification
By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.
* ck-builder: add check_by_accumulations
This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.
* ck: fix test_gpu_verification GenerateRandomData for bhalf
is_integer_it<bhalf_t> yields true, but it is not actually
an integer.
* ck: make gpu_verification kernels be proper persistent kernels
Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.
* ck: clean up gpu_verification.hpp using block_reduce
This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.
* ck-builder: doc typos
* ck-builder: update testing readme with validation interface.
* ck-builder: rebase fixes + review comments
* ck-builder: fix device integer generation with float types
Passing bfloat here causes a nans due to type_convert performing
a bitcast.
* ck: another bhalf_t bug
CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.
[ROCm/composable_kernel commit: 42048bdb7d]
* ck-builder: restructure testing conv
In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.
* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp
This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.
* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3
Turns out that the supplied layout isn't actually supported...
* ck-builder: ck and reference conv integration for bwd weight
* ck-builder: ck bwd weight execution test
* ck-builder: ckt::run support for ck-tile bwd weight
* ck-builder: ck tile bwd weight execution test
* ck-builder: extra debug printing in MatchesReference
* ck-builder: make ckt::run return RunResult
This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.
* ck-builder: RunResult matcher
Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.
* ck-builder: doc fixes
* ck-builder: add missing headers
[ROCm/composable_kernel commit: cc75948d1c]
* ck-builder: remove SPATIAL_DIM parameter from ConvTensorLayouts
This information is already in the SIGNATURE, so its pointless to pass it
separately. This streamlines the interface of those functions a bit. Also
touches up the style of those files in general.
* ck-builder: implement reference conv using old ck
The old ck implementation is more featureful and better tested.
* ck-builder: replace test_reference_execution reference with old ck
This strips out the ck-tile gpu reference implementation completely.
* ck-builder: clean up test_reference_execution
- Remove unneccesary messages
- Replace EXPECT_TRUE(true) with EXPECT_NO_THROW()
[ROCm/composable_kernel commit: 1040d9b1f5]
This is mostly adjustments to enum values so that the docs align correctly with the current code.
Also updated the calendar scope of the project to extend through March 2026.
[ROCm/composable_kernel commit: df7ee270a6]
* Add placeholder test.
* Initial conv bwd weight factory.
* Conv builder test refactoring.
* Add missing pieces to bwd weight factory.
* Improve compile time erros message when no matching factory is found.
* Use amcro to ensure automatic macthing between concepts are their string representations.
* Improve compile time diagnostics.
* Small improvements.
* Improve missing member/wrong type compile-time errors.
* Improve compile time diagnostics.
* Concept bug fixes.
* Remove debug assert.
* Update algorithm signature diagnostics.
* Factory bug fixes.
* First functional version of bwd weight conv factory.
* Refactor handing of GEMM-K batch template parameter in conv bwd weight factory.
* Concept improvements.
* Improve concept diagnostics.
* Introduve a common size type for concepts.
* Update compiletime diagnostics to use the size type.
* Update conv specialization enum.
* Fix fwd conv builder tests.
* Fix smoke tests.
* Separate bwd weigth and bwd data tests into separate targets.
* Clean-up CK Tile builder tests.
* Add bwd weight XDL CShuffle V3 factory.
* Build conv bwd weigth v3 instances successfully.
* Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.
* Test fix.
* Add instance traits for bwd weight algorithms.
* Add unit tests for instance strings.
* Build new instance traits unit tests but exclude WMMA for now.
* Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.
* Conv bwd weight DL factory.
* Final implementation for bwd weight DL factory.
* Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
* Treat ref algorithm the same way as real algorithms in the dispatcher.
* Refactor large tensor support and WMMA configuration.
* Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3.
* Update Readme.
* Fix WMMA bwd weight tests.
* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3.
* Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle.
* Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
* Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weigth convs.
* Fix fwd factories after refactoring.
* clang-format
* Move compile-time diagnostics to a separate branch.
* Fix ref algorithm dispatching.
* Fix smoke tests.
* clang-format
* Fix factory for regular WMMA conv bwd weight.
* Clarify builder Readme.
* Remove obsolete test file.
* Fix test after merge.
* clang-format
* Remove the C++26 extensions.
* Unify conv elementwise ops and layout definitions for fwd and bwd directions.
* Remove old layout and elementwise ops.
* Unify handling of conv tensor types between fwd and bwd directions.
* Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank.
* Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms.
* clang-format
---------
Co-authored-by: Ville Pietilä <>
[ROCm/composable_kernel commit: 9908a87c31]
This adds some utilities to automatically generate UniqueInputs,
UniqueOutputs, alloc_inputs, alloc_outputs, and validate, based
on a Inputs::reflect() and Outputs::reflect().
[ROCm/composable_kernel commit: b352a68606]
* ck-builder: make toString to_string
We are using snake case for CK-Builder
* ck-builder: add debug.hpp with tensor descriptor printing function
This adds some initial functionality to debug.hpp, a header which will
be used to house some debug utilities.
* ck-builder: abstract nd-iteration
Abstracting this makes it easier to test, clearer, and allows us to
use it elsewhere (such as in debug.hpp soon)
* ck-builder: tensor printing
* ck-builder: rename INT32 to I32
This makes it more in line with the other data type definitions.
[ROCm/composable_kernel commit: e3884bbf05]
* ck-builder: explicitly delete forward declarations
Before, these functions were seen as a forward declaration for an existing function.
If no actual implementation overload could be found, these would be selected and
a linker error or warning would be generated. By marking these functions as explicitly
deleted, they incorrect invocations are generated as compile error instead.
* ck-builder: ckt::run plumbing for reference conv
This implements the ckt::run plumbing for the reference convolution
implementation and sets up the first complete end-to-end test.
* ck-builder: make validation system check for all-zeros
When both the actual and reference output are both all zero bits,
there is probably something wrong in the test framework.
* ck-builder: proper implementation+tests for TensorDescriptor::is_packed
* ck-builder: fix typos
[ROCm/composable_kernel commit: 1c433c64ec]
This pull request builds on #3267 by proving the "validation" infrastructure, the means to compare a set of `Outputs`.
The design of the validation infrastructure is relatively straight forward:
- Each SIGNATURE should come with a `validate()` implementation, which should be implemented in a similar way that the other functions/types from `testing.hpp` are implemented.
- `validate()` returns a `ValidationReport`, which is a structure that keeps all relevant information about comparing the tensors from two `Outputs`. Note that crucially, `validate()` should not do any reporting by itself. Rather, glue logic should be implemented by the user to turn `ValidationReport` into a relevant error message.
- You can see this clue code for CK-Builder itself in `testing_utils.hpp`, its `MatchesReference()`. This functionality is relatively barebones right now, it will be expanded upon in a different PR to keep the scope of this one down.
The comparison is done on the GPU (using an atomic for now), to keep tests relatively quick. Some notable items from this PR:
- To help compare the tensors and with writing tests, I've written a generic function `tensor_foreach` which invokes a callback on every element of a tensor.
- For that it was useful that the `TensorDescriptor` has a rank which is known at compile-time, so I've changed the implementation of `TensorDescriptor` for that. I felt like it was a better approach than keeping it dynamic, for multiple reasons:
- This is C++ and we should use static typing where possible and useful. This way, we don't have to implement runtime assertions about the tensor rank.
- We know already know the rank of tensors statically, as it can be derived from the SIGNATURE.
- It simpifies the implementation of `tensor_foreach` and other comparison code.
- There are a lot of new tests for validating the validation implementation, validating validation validation tests (Only 3 recursive levels though...). For a few of those functions, I felt like it would be useful to expose them to the user.
- Doc comments everywhere.
[ROCm/composable_kernel commit: e6e7dc2910]
Implemented device random number generators for ck tensors.
Includes tests and integration to ck builder testing interface.
[ROCm/composable_kernel commit: f86bbb1aef]
* Add README.md for testing
* Add tensor_memory_manager.
* ck-builder: tensor memory manager rebase fixes
This fixes some issues caused by the API being changed recently.
Also, this streamlines the ckt namespace to always be ck_tile::builder::test,
as this is already being used by other tests
Really, this commit should be squashed into the previous,
but I'm keeping it separate for brevity.
* ck-builder: test arguments initial prototype
* ck-builder: test system initial prototype
* ck-builder: fix non-standardized copyright comments
* ck-builder: new prototype
* ck-builder: group testing inputs/outputs into a separate structure
This is basically the return of the tensor memory manager after all,
except that the design is more closely tied to the actual operation.
Using a struct allows us to add additional input/output tensors
without breaking code (by defaulting those new parameters). Note
that the tensors are split into a separate inputs/outputs because we
usually want to allocate the output _twice_: once for the real
computation and once for the reference computation.
* ck-builder: simplify prototype naming; start docs
* ck-builder: update testing readme
* ck-builder: testing documentation
* ck-builder: HipStatusMatcher
This matcher can be used to check HIP status codes and provide
nice and readable error messages.
* ck-builder: tensor_buffer.hpp tests
* ck-builder: conv_fwd.hpp tests
* ck-builder: add example end-to-end test in conv fwd 2d fp16
* ck-builder: simplify extent usage
* ck-builder: update testing doc
* ck-builder: skip end to end test on non-gfx9
* fix check_copyright_year interpreter
/bin/bash is not guaranteed to exist on Linux. Signed,
a NixOS user
* ck-builder: fix copyrights
* ck-builder: reduce conv fwd testing size
This test allocated 24GB of memory, too much for 16GB cards.
---------
Co-authored-by: John Shumway <jshumway@amd.com>
[ROCm/composable_kernel commit: 6219b12730]