* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
* revrt back to atomic add using casting
1. When conv spec is 1x1 stride1 pad0, nchw is equal with matrix A + column major, we only need minor change in conv transformer to support it.
2. when out is NKHW, it is equal with matrix C with column major. we need swap A & B to get best performance.
3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw.
* Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty.
* Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good.
* Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet.
* Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3. In principle the whole thing works now, just need to add other data types and perhaps do some cleanup.
* Add other layouts for batched gemm wmma chufflev3 f16 f16 f16. Now matching XDL (for f16).
* Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts.
* Fixup comments and TODOs
* Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small.
* Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests.
* Take batching into account when using rotating memory or clearing the C tensor.
* Implement small refactors / comments etc. from review.
* Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists.
* Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems.
* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups.
* Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function.
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* Shard several of the most costly targets.
Introduces a filter_tuple_by_modulo to break up tuples.
Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.
time ninja -j 64 device_grouped_conv3d_fwd_instance
* fix clang format
* Fix build errors in instantiation code.
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:
ninja -j 128 test_grouped_convnd_fwd
This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.
I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
* Fix link errors from mismatched declarations.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Shard the longest 2D convolution builds
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
* Use PROJECT_SOURCE_DIR for submodule compatibility
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Remove accidental copy of a file
* Remove accidental copies of template files.
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Shard several of the most costly targets.
Introduces a filter_tuple_by_modulo to break up tuples.
Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.
time ninja -j 64 device_grouped_conv3d_fwd_instance
* fix clang format
* Fix build errors in instantiation code.
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:
ninja -j 128 test_grouped_convnd_fwd
This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.
I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
* Fix link errors from mismatched declarations.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.
* Shard the longest 2D convolution builds
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
* Use PROJECT_SOURCE_DIR for submodule compatibility
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
* compile profiler only for gemm-mulmul-weight-preshuffle
* m/n xdl; m/n xdl per wave; cshuffle block transfer cluster length m per block
* process all p1 instances
* process all p2 instances
* process all p3 instances
* convert p4 instance
* modify compute p1 instances
* modify compute p2 instances
* relax p4 instance c block transfer cluster len
* fix c block transfer cluster lengths comment
* add mfma (without 16x16) instances to the profiler
* roll back profiling cmakelists change
* clang-format
* re-add (now unused) 32x32 xdl-tile instances
* clang-format
* add more instances
* fit c block transfer lengths into block
* copy and write over the instance definitions from bf16 to fp16
* add instances to profiler
* unify instance tuple alias
* Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Updates to support mixed precision
* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip
* Added support for F8xF16xF16 to gemm_wmma_universal
* Added support for F16xF8xF16 to gemm_wmma_universal
* Added support for BF16xI4xBF16 to gemm_wmma_universal
* Added support for F16xI4xF16 to gemm_wmma_universal
* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType
* Added missing test class for FP16_KM_NK
* Pre-commit hooks fixes
* Added padding instances for f16xf16xf16
* Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Ammending changes for adding support for padding instances for f16xf16xf16
* Fixes for padding instances for f16xf16xf16
* Added padding instances for bf16xbf16, f8xf8
* Added packed instances for bf16xi4xbf16
* Added padding instances for f8xf16xf16
* Added padding instances for f16xf8xf16, f16xi4xf16
* Fixed typos for bf16xbf16xbf16 padding instances
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build erros
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Fix int8 DTYPES check for gemm_bilinear
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
* tweaks to the miopen specific build. add way to skip clang-tidy checks and a way to skip some custom build targets MIOpen also has.
* move the tidy if statment
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Unify test interface for different layouts.
* WIP: Introducing FP4/FP6/FP8 abstractions
* WIP: Introducing packed storage abstraction
* WIP: Introducing packed storage abstraction
* WIP: Improved support for FP6 data type
* Refactor packed storage for f6_t
* WIP: FP6 MFMA test
* Test if we correctly represent all FP6/FP4 numbers
* Additional output for failed FP4 test.
* More failing conversion tests
* Even more failing conversion tests
* Working FP6 MFMA tests
* Expand MX MFMA testing to BF8/6
* Update and verify MX MFMA test for packed types
* Fix fp4 and fp6 conversions on host
* Working MX MFMA tests for FP8/6/4
* Cleanup
* Add missing type
* Cleanup
* Final cleanup
* Restrict FP6/4 values output to CK_LOGGING=1
* Use CHAR_BIT instead of number 8
* Fix typo
* Remove FP6 and FP4 from the list of native types
---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
* generate and store library package for all targets
* use ninja to build packages for all targets
* make sure to use ftime-trace when using ninja
* make sure build trace only runs on gfx9
* archive lib package and stash only library package
* make the work compiled
* Solved the example code, but still have the profiler error
* Finished the feature
* Clang format and update the CHANGELOG
* solve the preshuffle v1 & v2 problem
* Comment Addressed
* Comment Addressed
* Add conversion tests
* Fix ctor
* Fix nan logic
* Fix conversion logic
* Permute packed f4_t values
* Fix conversion to float, repack vector elements
* Fix device tests
* Permute elements in a vector
* Add a repro test
* Add a conversion for a repro test
* Update test vectors
* Update conversion
* Fix the test
* Update test vector generator
* Fix vector sr conversion
* Permute conversion args
* Update conversion
* Test
* Fix packing
* Simplify conversion function
* Pack conversion in a loop
* Pack conversion in a loop
* Pack another conversion in a loop
* Pack one more conversion in a loop
* Pack the last conversion in a loop
* Clean up
* Add ops
* Add tests
* Add missing utils
* Update reference mx gemm
* Add f4x2 init mode
* Update host tensor utils
* Update chunk size for f4x2
* Add non scaled ops
* Add a type utility
* Update non scaled reference kernel
* Add non scaled tests
* Debug mfma arguments
* Add more debug info
* Update chunk size
* Update data layout
* Add more debugging
* Fix B stride
* Fix reference gemm
* Fix build
* One more reference fix
* Add more debug info
* Disable some tests
* Enable tests
* Add fp4 dimensions
* Update reference kernels
* Temp edits
* Remove leftovers
* Fix conflicts
* Clean up
* More clean up
* Revert "More clean up"
This reverts commit d8d35a0846.
* Add layouts to tests
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* Add gemm_mx_fp8_bf8 example with row-major B
* Add more overloads of MX MFMA instructions
* Add MK_KN (RRR) tests
* Add KM_NK (CCR) tests
* Add more problem sizes to Large tests
* Add test_gemm_mx to the list of regression tests
* Prepare files for DeviceGemm_Wmma_CShuffleV3
* Implement main part of CShuffleV3 with block pipeline v3 for WMMA
* Remove unused functions and template params for A/B descriptors
* Support both gfx11 and gfx12
* Enable SplitK for gfx12 and disable for gfx11
* Added RowColRow layout for DeviceGemmV2 fp16
* Added more instances for Row, Col, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout
* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout
* Fix formatting
* Add documentation
Based on e5ad48a784
* Enable gemm_universal profiling for gfx11/12
* Add WMMA intrinsics for F8/BF8
* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances
* Add BF16 instances and tests
* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>