* WIP POC of dispatcher
* Dispatcher python workflow setup.
* Dispatcher cleanup and updates.
Further dispatcher cleanup and updates.
Build fixes
Improvements and python to CK example
Improvements to readme
* Fixes to python paths
* Cleaning up code
* Improving dispatcher support for different arch
Fixing typos
* Fix formatting errors
* Cleaning up examples
* Improving codegeneration
* Improving and fixing C++ examples
* Adding conv functionality (fwd,bwd,bwdw) and examples.
* Fixes based on feedback.
* Further fixes based on feedback.
* Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug.
* Another round of improvements based on feedback.
* Trimming out unnecessary code.
* Fixing the multi-D implementation.
* Using gpu verification for gemms and fixing convolutions tflops calculation.
* Fix counter usage issue and arch filtering per ops.
* Adding changelog and other fixes.
* Improve examples and resolve critical bugs.
* Reduce build time for python examples.
* Fixing minor bug.
* Fix compilation error.
* Improve installation instructions for dispatcher.
* Add docker based installation instructions for dispatcher.
* Fixing arch-based filtering to match tile engine.
* Remove dead code and fix arch filtering.
* Minor bugfix.
* Updates after rebase.
* Trimming code.
* Fix copyright headers.
* Consolidate examples, cut down code.
* Minor fixes.
* Improving python examples.
* Update readmes.
* Remove conv functionality.
* Cleanup following conv removable.
* Parallelization in dataset generation
* Parallelizable tests for fwd, bwd data, bwd weight with datasets
* .gitignore generated datasets
* Test parallelization script with round-robin GPU scheduling
* Parallelization updates to test generation and running
* Dataset paths relative to executable
* Update output from test generation
* Default to one GPU in test generation
* Add small dataset tests to Jenkins
* Update copyright lines
* Update test_data/generate_test_dataset.sh
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Move trap disable
* Common get path function
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
We don't want to add cmake formatting until we are in the super repo, but its handy if developers want to experiment with formatting. For now we should ignore .cmake-format.yaml.
Developers who use cline on the code base need to ignore .cline* directories like .cline_storage and .clinerules. Using a wildcard to ignore any other cline-related directories.
Add experimental builder infrastructure for composable_kernel
- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.
This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.
* Shard several of the most costly targets.
Introduces a filter_tuple_by_modulo to break up tuples.
Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.
time ninja -j 64 device_grouped_conv3d_fwd_instance
* fix clang format
* Fix build errors in instantiation code.
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:
ninja -j 128 test_grouped_convnd_fwd
This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.
I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
* Fix link errors from mismatched declarations.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Shard the longest 2D convolution builds
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
* Use PROJECT_SOURCE_DIR for submodule compatibility
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Remove accidental copy of a file
* Remove accidental copies of template files.
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Shard several of the most costly targets.
Introduces a filter_tuple_by_modulo to break up tuples.
Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.
time ninja -j 64 device_grouped_conv3d_fwd_instance
* fix clang format
* Fix build errors in instantiation code.
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:
ninja -j 128 test_grouped_convnd_fwd
This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.
I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
* Fix link errors from mismatched declarations.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Shard the longest 2D convolution builds
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
* Use PROJECT_SOURCE_DIR for submodule compatibility
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Docs: Add precision support reference page
* edit of the precision type content
* added more description on scalars
---------
Co-authored-by: spolifroni-amd <sandra.polifroni@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
* enable gfx940
* switch between intrinsic mfma routines on mi100/200 and mi300
* fix mfma_int8 on MI300
* disable 2 int8 examples on MI300
* Update cmake-ck-dev.sh
* restore gitignore file
* modify Jenkinsfile to the internal repo
* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* initial enablement of gfx950
* fix clang format
* disable examples 31 and 41 int8 on gfx950
* add code
* fix build wip
* fix xx
* now can build
* naming
* minor fix
* wip fix
* fix macro for exp2; fix warpgemm a/b in transposedC
* unify as tuple_array
* Update the required Python version to 3.9
* Update executable name in test scripts
* re-structure tuple/array to avoid spill
* Merge function templates
* Fix format
* Add constraint to array<> ctor
* Re-use function
* Some minor changes
* remove wrong code in store_raw()
* fix compile issue in transpose
* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'
* let more integral_constant->constant, and formating
* make sure thread_buffer can be tuple/array
* temp fix buffer_store spill
* not using custom data type by default, now we can have ISA-level same code as opt_padding
* fix compile error, fp8 not ready now
* fix fp8 duplicated move/shift/and/or problem
* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
* fix scratch in fp8 kernel
* update some readme
* fix merge from upstream
* sync with upstream
* sync upstream again
* sync 22
* remove unused
* fix clang-format
* update README of ck_tile example
* fix several issue
* let python version to be 3.8 as minimal
* remove ck_tile example from default cmake target like all/install/check
* remove mistake
* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
* fix some bug in group-mode masking and codegen. update README
* F8 quantization for FMHA forward (#1224)
* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
* Add element function to fmha api
* Adjust P elementwise function
* Fix bug of elementwise op, our elementwise op is not inout
* Add some elementwise op, prepare to quantization
* Let generate.py can generate different elementwise function
* To prevent compiler issue, remove the elementwise function we have not used.
* Remove f8 pipeline, we should share the same pipeline even in f8
* Remove remove_cvref_t
* Avoid warning
* Fix wrong fp8 QK/KV block gemm setting
* Check fp8 rounding error in check_err()
* Set fp8 rounding error for check_err()
* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
* 1. codgen the f8 api and kernel
2. f8 host code
* prevent warning in filter mode
* Remove not-in-use elementwise function kargs
* Remove more not-in-use elementwise function kargs
* Small refinements in C++ source files
* Use conditional_t<> to simplify code
* Support heterogeneous argument for binary function types
* Re-use already-existing scales<> functor template
* Fix wrong value produced by saturating
* Generalize the composes<> template
* Unify saturates<> implementation
* Fix type errors in composes<>
* Extend less_equal<>
* Reuse the existing template less_equal<> in check_err()
* Add equal<float> & equal<double>
* Rename check_err() parameter
* Rename check_err() parameter
* Add FIXME comment for adding new macro in future
* Remove unnecessary cast to void
* Eliminate duplicated code
* Avoid dividing api pool into more than 2 groups
* Use more clear variable names
* Use affirmative condition in if stmt
* Remove blank lines
* Donot perfect forwarding in composes<>
* To fix compile error, revert generate.py back to 4439cc107d
* Fix bug of p element function
* Add compute element op to host softmax
* Remove element function in api interface
* Extract user parameter
* Rename pscale and oscale variable
* rename f8 to fp8
* rename more f8 to fp8
* Add pipeline::operator() without element_functor
* 1. Remove deprecated pipeline enum
2. Refine host code parameter
* Use quantization range as input
* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3.Add init description
* Refine description
* prevent early return
* unify _squant kernel name in cpp, update README
* Adjust the default range.
* Refine error message and bias range
* Add fp8 benchmark and smoke test
* fix fp8 swizzle_factor=4 case
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* enable compilation of INSTANCES_ONLY for Windows
* suppress ROCMChecks warnings on GoogleTests
* suppress -Wfloat-equal warning on GoogleTests
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* update documentation dependencies
add version number to docs
rename doc config directories
enable more doc formats on rtd
add license section in docs
* New docs directory with minimal config
* Based on docs directory of rocBLAS
* Config for running Doxygen then Sphinx to generate HTML
* Add minimal content - intro to doc
* Add some boilerplate sections to doc
* content still needs to be done,
* e.g., need to generate API documentation using Doxygen
* need to write contributor guide
* Start Softmax section of Support Primitives doc
* Written as a test bed for typesetting math content
* Need to decide how much detail to go into
* add doc directories to git ignore file.
* Minor edits - new line at EOF, change year in copyright notices
* Port Markdown files to ReStructuredText
* Copy Markdown files from pre-existing doc directory to docs directory
* Convert to reStructured Text (rst) - section headings, links, tables
have a different syntax in rst
* New rst files added to index - can generate HTML with same style as
HTML generated from rst files in previous commits
* Intention is to make all the content in doc redundant and use rst
throughout rather than mix of md and rst
* Extend Softmax section of Primitives Guide
* rename l to z
* add material on applying softmax row-wise to matrix
* define macro for diag operator (represents diagonal matrix)
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Switch to standard ROCm packaging
* Revert .gitignore changes
* install new rocm-cmake version
* update readme
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* add gitignore
* host tensor: allow generating sequentially increasing value in a given dimension
* gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor
- remove dangling header include
- modify example gemm_xdl accordingly
- infer KPack value from M/NPerXdl
- device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
(API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)
* add LDS data dump utility
* profiler: reflect API change for distinct K0/K1 for A/B matrices
* profiler: add conflict-free LDS write FP16 kernel instances
* fix accidental perf regression
* address feedback; cosmetic changes
* clang-format for new files
* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>