Now that the remaining reflection has been implemented, we
can add the remaining factory tests too. This is the complete set
of instances for forward grouped conv currently in CK.
Grouping transfer operations per tensor makes them easier to
constrain and to operate on. As an
example, we can now deduplicate the logic for translating
the transfer operations from the ck-builder interface to the old
ck interface for the A and B tensors.
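
As a rough illustration of the deduplication described above, the sketch below groups the transfer settings per tensor and runs both A and B through one translation function. All type, field, and function names (`TensorTransfer`, `TransferConfig`, `LegacyTransfer`, `to_legacy`) are hypothetical stand-ins, not the actual ck-builder or CK types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical builder-side description of how one tensor is transferred.
struct TensorTransfer
{
    std::array<std::size_t, 3> thread_cluster; // threads per dimension
    std::size_t src_vector_dim;                // dimension read with vector loads
    std::size_t scalar_per_vector;             // elements per vector access
};

// Hypothetical grouping: one TensorTransfer per tensor instead of flat a_*/b_* fields.
struct TransferConfig
{
    TensorTransfer a;
    TensorTransfer b;
};

// Hypothetical legacy-side parameters for a single tensor.
struct LegacyTransfer
{
    std::size_t cluster_size;
    std::size_t src_vector_dim;
    std::size_t src_scalar_per_vector;
};

// One translation routine serves both A and B because they share a type.
constexpr LegacyTransfer to_legacy(const TensorTransfer& t)
{
    return LegacyTransfer{t.thread_cluster[0] * t.thread_cluster[1] * t.thread_cluster[2],
                          t.src_vector_dim,
                          t.scalar_per_vector};
}

constexpr TransferConfig example_config{{{4, 64, 1}, 2, 8}, {{4, 64, 1}, 2, 4}};
static_assert(to_legacy(example_config.a).cluster_size == 256, "4 * 64 * 1 threads");
static_assert(to_legacy(example_config.b).src_scalar_per_vector == 4, "B vector width");
```

Because A and B share one type, any constraint or conversion written against `TensorTransfer` applies to both tensors without a second copy of the logic.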
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Removes the v2 suffix from the new CK Tile tile partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
1. Enable grouped_gemm_quant and gemm_streamk on gfx12
- test_ck_tile_streamk_smoke is kept on gfx9, since it looks like someone is still working on it.
2. Update warp tile size in grouped_gemm_quant and gemm_streamk unit test
3. Reduce gemm tile size to pass the build on gfx12 in test_gemm_streamk_reboot_types.hpp
Fixed issues encountered during merge of #3192
* fixed accidental drop of get_elementwise_operation during merge and added call to get_elementwise_op to 4 other builders
* run clang-format
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
* ck-builder: some miscellaneous fixes
* ck-builder: fix InstanceSet.FromFactory test
The exact syntax that the instance string functionality
returns has changed. This commit updates the test to expect
the right string.
* allow using alternative compiler in all CI stages
* get rid of some redundancies in jenkinsfile
* clean up jenkinsfile a bit more
* further clean up jenkinsfile
* do not force user jenkins in ci dockers
* Wrap ck host utilities in CK namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our two code bases (CK and CKTile) so that they can both coexist. Moreover, establishing this compatibility is essential if we are going to allow the builder to instantiate kernels from either template library.
* Add using declarations to test code.
After moving some of the utils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compilation units for functions that were previously in the global namespace.
* Add using declarations to client examples.
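
The namespace move and the using declarations can be sketched roughly as follows. This is a simplified stand-in, not the actual header: `fake_error_t` replaces the real HIP status type so the snippet compiles without HIP, and the `ck_tile` helper with the same name is a hypothetical example of a clash that the namespace guard avoids.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for a HIP status code (assumption: the real helper takes hipError_t).
enum class fake_error_t { success, failure };

namespace ck {
// Previously a global-namespace function; now guarded by the ck namespace.
inline void hip_check_error(fake_error_t e)
{
    if(e != fake_error_t::success)
        throw std::runtime_error("HIP runtime error");
}
} // namespace ck

namespace ck_tile {
// Hypothetical same-named helper that can now coexist without clashing.
inline bool hip_check_error(fake_error_t e) { return e == fake_error_t::success; }
} // namespace ck_tile

// In a test or example compilation unit, a using declaration restores the
// unqualified spelling that the code used before the move.
using ck::hip_check_error;

inline bool example_call_succeeds()
{
    hip_check_error(fake_error_t::success); // resolves to ck::hip_check_error
    return ck_tile::hip_check_error(fake_error_t::success);
}
```

A using declaration per compilation unit keeps the change local, whereas a using directive in a shared header would reintroduce the global-namespace pollution that the move was meant to remove.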
* Use vectorized stores for dropout randvals
Without kPadSeqLenK, the kernel uses 2 buffer_store_dwordx2 instead of
16 buffer_store_byte. This requires fewer registers and reduces spilling.
* Calculate dropout randvals for storing and applying only once
Even though it may add a small overhead when storing is not required,
it uses significantly fewer registers and hence avoids spilling.
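
The store-width change can be pictured with a host-side analogy. The sketch below is not the kernel code: it only shows sixteen one-byte values written either byte by byte or as two 8-byte words, mirroring the trade-off between 16 buffer_store_byte and 2 buffer_store_dwordx2. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Baseline: sixteen separate one-byte stores.
inline void store_bytewise(const std::uint8_t (&vals)[16], std::uint8_t* dst)
{
    for(int i = 0; i < 16; ++i)
        dst[i] = vals[i];
}

// Vectorized: pack the bytes into two 64-bit words, then issue two 8-byte stores.
inline void store_vectorized(const std::uint8_t (&vals)[16], std::uint8_t* dst)
{
    std::uint64_t words[2];
    std::memcpy(words, vals, sizeof(words));           // pack 16 bytes into 2 words
    std::memcpy(dst, &words[0], sizeof(words[0]));     // first 8-byte store
    std::memcpy(dst + 8, &words[1], sizeof(words[1])); // second 8-byte store
}
```

Both functions write the same sixteen bytes. Only the number and width of the store operations differ.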
* Reapply "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)
This reverts commit e4298e55c7.
* WIP
* take Y2 as the AK1/BK1 value, that is the 'vector size' after shuffle
* use get_n_lds_banks()
* clang-format
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Removed direction and elementwise operation from the values required by the convolution signature concept. Added constexpr helpers to set their default values. Added compile-time tests.
* Add missing copyright statements
* Use ck_tile::host_tensor_descriptor instead of a custom lambda
* Refactor use of check_data_type in test classes
* Use TEST_SUITE_NAME with TYPED_TEST_SUITE
* Remove an unused namespace
* Make dim3 const
* Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile
* Add missing precision type combinations to CompV4 from CompV3
* Move the INT8 tests around for consistency with KernelTypesCompV3Wmma
* Add missing precision type combinations to CompV3Wmma from CompV3
* Remove the basic and universal tests and their dependencies
* On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B
* Use ADataType and BDataType instead of ComputeDataType for WarpGemm
* Explicitly set some return types to void
* Use more general typenames in InterleavedPKTypeLoader
* Add load_interleaved_pk_type.hpp to common.hpp
* Use std::is_same_v in load_int4_tile
* Add handling of LoadTranspose to load_int4_tile
* Factor out common code in several places using load_int4_tile
* Add support for pk_int4_t using load_int4_tile
* Fix formatting
Proposed changes
Improve the forward convolution builder implementation and address leftover feedback from PR #3138. Main changes:
Refactored tests so that they better reflect the builder pattern. The templates and types for the convolution algorithm concepts are created via a factory that facilitates programmatic creation of the device op instances.
Moved tests into anonymous namespace.
The convolution factory had a lot of if-else constructs where CK Builder types were converted into CK library types. I initially had trouble using static_assert in the default branch of a switch, as the static_assert was evaluated at compile time even for valid types. However, if we change the static_assert to throw "<error message>", it results in a compile-time error only if the default branch is actually hit. This assumes that the function is consteval. Hence, all conversions in the convolution factory now use switch, which is more intuitive.
Removed the explicit device op definition from the convolution signature and the corresponding predicate file. The device ops are defined by the corresponding concepts. This allowed removing a lot of boilerplate code from the convolution factory.
Added inheritance and convolution algorithm specialization to handle device ops that are specializations of more generic ones. The large tensor support is more naturally expressed by this pattern.
Added support for the FP8 data type.
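
The switch-with-throw conversion described above can be sketched as follows. The enums and the `convert` function are hypothetical, not the actual factory code. The description assumes a consteval function. The sketch uses constexpr instead, so that the same function compiles under older standards and can also be exercised at run time.

```cpp
#include <cassert>

// Hypothetical builder-side and library-side enums, for illustration only.
enum class BuilderDataType { FP16, BF16, FP8 };
enum class LibraryDataType { Half, BHalf, F8 };

constexpr LibraryDataType convert(BuilderDataType t)
{
    switch(t)
    {
    case BuilderDataType::FP16: return LibraryDataType::Half;
    case BuilderDataType::BF16: return LibraryDataType::BHalf;
    case BuilderDataType::FP8: return LibraryDataType::F8;
    default:
        // A static_assert(false, ...) here would fire regardless of the input.
        // A throw only breaks constant evaluation when this branch is reached.
        throw "unsupported data type";
    }
}

// Valid inputs evaluate at compile time without touching the default branch.
static_assert(convert(BuilderDataType::FP8) == LibraryDataType::F8, "FP8 maps to F8");

// Uncommenting the next line fails to compile, because the throw is reached
// during constant evaluation:
// constexpr auto bad = convert(static_cast<BuilderDataType>(42));
```

With consteval, every call is constant-evaluated, so an unsupported value can never slip through to run time. With constexpr, that guarantee holds only where the result is used in a constant expression.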
* WIP: Builder for expected test results.
* Improve ckb fwd conv instance tests.
* clang-format
* Change if-else statements into switch in conv factory.
* Fix clang-formatting.
* Removed unnecessary includes.
* Added missing copyright.
* Remove explicit device op flag from convolution signature.
* Add missing concept.
* Fix build.
* clang-format
* Add test for building conv fwd FP8 instances.
* Add missing header to instance traits.
* Clean-up recently added instances.
* Introduce inheritance and specialization.
* Use builder to build conv algorithm templates and types.
* clang-format
* Fix conv description tests.
---------
Co-authored-by: John Shumway <john.shumwayjr@gmail.com>