* Shard several of the most costly targets.
Introduces a filter_tuple_by_modulo helper to break up instance tuples.
Drops the target's build time from 21 minutes to under 14 minutes with 64
build processes, or to 11 minutes with 128 build processes, as measured with:
time ninja -j 64 device_grouped_conv3d_fwd_instance
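A minimal sketch of the idea behind such a filter, in C++17 (illustrative
only; the PR's filter_tuple_by_modulo operates on CK's instance tuples and
may differ in detail):

    #include <cstddef>
    #include <tuple>
    #include <utility>

    // Keep every element whose index I satisfies I % Modulo == Remainder, so
    // one large instance tuple can be split across Modulo translation units.
    template <std::size_t Modulo, std::size_t Remainder, typename Tuple, std::size_t... Is>
    auto filter_tuple_by_modulo_impl(const Tuple& t, std::index_sequence<Is...>)
    {
        return std::tuple_cat([&] {
            if constexpr(Is % Modulo == Remainder)
                return std::make_tuple(std::get<Is>(t));
            else
                return std::tuple<>{};
        }()...);
    }

    template <std::size_t Modulo, std::size_t Remainder, typename Tuple>
    auto filter_tuple_by_modulo(const Tuple& t)
    {
        return filter_tuple_by_modulo_impl<Modulo, Remainder>(
            t, std::make_index_sequence<std::tuple_size_v<Tuple>>{});
    }

Each shard's translation unit then instantiates only its own slice, e.g.
filter_tuple_by_modulo<4, 1>(all_instances) keeps indices 1, 5, 9, and so on,
which is what lets the shards compile in parallel.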
* fix clang format
* Fix build errors in instantiation code.
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:
ninja -j 128 test_grouped_convnd_fwd
This allowed me to test the build locally. I found three mistakes,
mostly related to early experiments I had tried on the code. These were
hard to find earlier because this PR is really too large.
I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
* Fix link errors from mismatched declarations.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
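To illustrate the fragility with hypothetical names (this is not CK's actual
declaration, just the shape of the pattern):

    // instances.cpp (library side): definition plus explicit instantiation.
    #include <vector>
    template <typename T>
    void add_device_op_instances(std::vector<T>& v) { /* fill v */ }
    template void add_device_op_instances<float>(std::vector<float>&);

    // consumer.cpp (caller side): the declaration is duplicated by hand
    // instead of coming from a shared header:
    //
    //     template <typename T>
    //     void add_device_op_instances(std::vector<T>& v);
    //     extern template void add_device_op_instances<float>(std::vector<float>&);
    //
    // If the duplicated declaration drifts from the definition (say, double
    // instead of float), the consumer still compiles, and the mismatch only
    // surfaces as an undefined reference at link time.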
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files that
instantiate the kernels and to generate the calling function.
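A hypothetical sketch of what one generated shard source could look like
(all names here are illustrative, not the PR's actual template files):

    // device_op_instance_shard_2_of_8.cpp -- generated by CMake, do not edit
    #include "device_op_instances.hpp" // assumed header with the full tuple

    namespace ck::tensor_operation::device::instance {

    // Shard 2 of 8: instantiate only tuple indices i with i % 8 == 2.
    void add_device_op_instances_shard2(InstanceVector& instances)
    {
        add_instances_from_tuple(instances,
                                 filter_tuple_by_modulo<8, 2>(AllInstances{}));
    }

    } // namespace ck::tensor_operation::device::instance

The generated calling function then simply invokes every
add_device_op_instances_shardN in turn.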
* Shard the longest 2D convolution builds
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
* Use PROJECT_SOURCE_DIR for submodule compatibility
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low-importance build messages as log_level=DEBUG
* compile profiler only for gemm-mulmul-weight-preshuffle
* m/n xdl; m/n xdl per wave; cshuffle block transfer cluster length m per block
* process all p1 instances
* process all p2 instances
* process all p3 instances
* convert p4 instance
* modify compute p1 instances
* modify compute p2 instances
* relax p4 instance c block transfer cluster len
* fix c block transfer cluster lengths comment
* add mfma (without 16x16) instances to the profiler
* roll back profiling cmakelists change
* clang-format
* re-add (now unused) 32x32 xdl-tile instances
* clang-format
* add more instances
* fit c block transfer lengths into block
* copy and write over the instance definitions from bf16 to fp16
* add instances to profiler
* unify instance tuple alias
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when configured with GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Updates to support mixed precision
* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip
* Added support for F8xF16xF16 to gemm_wmma_universal
* Added support for F16xF8xF16 to gemm_wmma_universal
* Added support for BF16xI4xBF16 to gemm_wmma_universal
* Added support for F16xI4xF16 to gemm_wmma_universal
* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType
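The shape of the corrected check, sketched with assumed surroundings
(ck::half_t/ck::bhalf_t are CK's types; the enclosing class and the full
condition list are omitted):

    #include <type_traits>
    #include "ck/utility/data_type.hpp" // ck::half_t, ck::bhalf_t

    // The WMMA instruction consumes the compute types, so support must be
    // decided on ComputeTypeA/B rather than the storage types ADataType/BDataType.
    template <typename ComputeTypeA, typename ComputeTypeB>
    constexpr bool wmma_compute_types_supported()
    {
        return (std::is_same_v<ComputeTypeA, ck::half_t> &&
                std::is_same_v<ComputeTypeB, ck::half_t>) ||
               (std::is_same_v<ComputeTypeA, ck::bhalf_t> &&
                std::is_same_v<ComputeTypeB, ck::bhalf_t>);
    }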
* Added missing test class for FP16_KM_NK
* Pre-commit hooks fixes
* Added padding instances for f16xf16xf16
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when configured with GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Amending changes for adding support for padding instances for f16xf16xf16
* Fixes for padding instances for f16xf16xf16
* Added padding instances for bf16xbf16, f8xf8
* Added packed instances for bf16xi4xbf16
* Added padding instances for f8xf16xf16
* Added padding instances for f16xf8xf16, f16xi4xf16
* Fixed typos for bf16xbf16xbf16 padding instances
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding is not supported when BDataType is pk_i4_t. Added the correct check and removed the padding instances.
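A sketch of the corrected guard (assumed shape; BDataType and GemmSpec are
the device op's template parameters, and the real IsSupportedArgument has
many more conditions):

    // pk_i4_t packs two 4-bit values per byte, so the padded copy paths
    // cannot handle it: accept only the unpadded specialization.
    // (needs <type_traits>; ck::pk_i4_t comes from CK's data type headers)
    static bool IsSupportedArgument(const Argument& arg)
    {
        if constexpr(std::is_same_v<BDataType, ck::pk_i4_t>)
        {
            if(GemmSpec != GemmSpecialization::Default)
                return false;
        }
        // ... remaining checks on arg ...
        return true;
    }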
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Addressed review comments and added the executable to CMakeLists
* Fixing clang format
* Fixing build errors
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors were larger than needed (even though the compiler doesn't allocate registers for unused values).
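Hypothetical illustration of the stride fix (CK's make_naive_tensor_descriptor
takes lengths and strides; the sizes here are made up):

    // Loose stride: the descriptor's element space is 2x larger than the
    // data the thread copy actually touches.
    constexpr auto a_thread_desc_loose = make_naive_tensor_descriptor(
        make_tuple(Number<MRepeat>{}, Number<KPack>{}),  // lengths
        make_tuple(Number<2 * KPack>{}, Number<1>{}));   // row stride overshoots

    // Tight stride: element space == MRepeat * KPack, matching the copy.
    constexpr auto a_thread_desc_tight = make_naive_tensor_descriptor(
        make_tuple(Number<MRepeat>{}, Number<KPack>{}),
        make_tuple(Number<KPack>{}, Number<1>{}));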
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
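Structurally, the two variants can be sketched as specializations on CK's
scheduler enum (the wiring is assumed; the real pipelines carry many more
template parameters):

    enum class BlockGemmPipelineScheduler { Intrawave, Interwave }; // as in CK

    template <BlockGemmPipelineScheduler Sched>
    struct BlockwiseGemmWmma_pipeline_v1;

    template <>
    struct BlockwiseGemmWmma_pipeline_v1<BlockGemmPipelineScheduler::Intrawave>
    {
        // hot loop overlaps WMMA issue with A/B loads within each wave
    };

    template <>
    struct BlockwiseGemmWmma_pipeline_v1<BlockGemmPipelineScheduler::Interwave>
    {
        // staggers waves so memory and math phases of different waves interleave
    };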
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes was larger than b_k_n_permute's, so b_device_buf.ToDevice read out of bounds.
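A sketch of the fix (names follow the commit message; the exact sizing code
is assumed): derive both sizes from the host tensor so the upload never
reads past it:

    // Size the device buffer and the upload from the host tensor's element
    // space, instead of a larger (padded/permuted) allocation size.
    const std::size_t b_bytes =
        sizeof(BDataType) * b_k_n_permute.mDesc.GetElementSpaceSize();
    DeviceMem b_device_buf(b_bytes);
    b_device_buf.ToDevice(b_k_n_permute.mData.data());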
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
The permutation required by the conversion from pk_i4_t to half_t does not
depend on PermuteB; the two can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
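A sketch of the guard this implies (placement assumed; CK_USE_WMMA_FP8 is
the flag added at the end of this series):

    #ifdef CK_USE_WMMA_FP8
    // f8 instances are only registered where fast f8->f32 conversion exists;
    // the exact registration function names vary per layout.
    add_device_gemm_wmma_universal_f8_f8_bf16_instances(instances);
    #endif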
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to the common .inc file for better testing of WMMA instances
* Fix int8 DTYPES check for gemm_bilinear
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
* Tweaks to the MIOpen-specific build: add a way to skip clang-tidy checks and a way to skip some custom build targets that MIOpen also has.
* move the clang-tidy if statement
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* generate and store library package for all targets
* use ninja to build packages for all targets
* make sure to use -ftime-trace when using ninja
* make sure build trace only runs on gfx9
* archive lib package and stash only library package
* make the code compile
* Fixed the example code, but the profiler error remains
* Finished the feature
* Clang format and update the CHANGELOG
* solve the preshuffle v1 & v2 problem
* Comment Addressed
* Comment Addressed
* Add gemm_mx_fp8_bf8 example with row-major B
* Add more overloads of MX MFMA instructions
* Add MK_KN (RRR) tests
* Add KM_NK (CCR) tests
* Add more problem sizes to Large tests
* Add test_gemm_mx to the list of regression tests
* Prepare files for DeviceGemm_Wmma_CShuffleV3
* Implement main part of CShuffleV3 with block pipeline v3 for WMMA
* Remove unused functions and template params for A/B descriptors
* Support both gfx11 and gfx12
* Enable SplitK for gfx12 and disable for gfx11
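The gating can be sketched as a support-check clause (logic assumed; KBatch
is the split-K factor in CK's universal GEMM argument, and CK exposes target
queries along the lines of is_gfx12_supported()):

    // Reject split-K on gfx11; only gfx12 has the capabilities this path needs.
    if(arg.KBatch > 1 && !ck::is_gfx12_supported())
        return false;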
* Added RowColRow layout for DeviceGemmV2 fp16
* Added more instances for Row, Col, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout
* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout
* Fix formatting
* Add documentation
Based on e5ad48a784
* Enable gemm_universal profiling for gfx11/12
* Add WMMA intrinsics for F8/BF8
* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances
* Add BF16 instances and tests
* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Add new instances for weight_preshuffle for f8->bf16
* Add new instances for weight_preshuffle for f8->f16
* clang formatted
---------
Co-authored-by: Khushbu Agarwal <khuagar@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Added two kernels for the M=32 problem
* Comment out the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B's.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipeline v1 and v2; v1 works for small K.
* add instance for gemm_ab_scale
* fix cmakelist of ckProfiler
* optimize blockscale gemm. todo: reduce vgpr usage
* fix a correctness bug
* sanity checked
* revert ckprofiler cmake changes
* clang format
* revert unnecessary changes.
* remove commented-out code.
* split weight preshuffle library targets
* bring back enable-post-misched=0
* fix build issues for gemm_multiply_multiply_fp8 instances
* fix clang format
* add verbose build flag when building for all targets
* reduce path names for new instances
* fix paths in cmake
* refactor gemm_multiply_multiply library target
* fix a bug in example
* fix example 65 cmake
* reduce the number of threads to 50 when building libs for all targets
* use ninja to build for all targets
* reduce the number of threads when building for all targets
* reduce the number of threads to 32 when building libs for all targets
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Add runtime check in example_gemm_xdl_streamk for gfx950
* Add runtime check in grouped conv fwd examples for gfx950
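A sketch of what such a runtime gate typically looks like in the examples
(assumptions: the check string-matches the arch name via HIP, as CK examples
commonly do, and the direction of the check -- skipping on gfx950 -- is
inferred, not confirmed by the commit):

    // needs <hip/hip_runtime.h>, <iostream>, <string>
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    if(std::string(prop.gcnArchName).find("gfx950") != std::string::npos)
    {
        std::cout << "This example does not support gfx950, skipping." << std::endl;
        return 0;
    }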
* Disable CK_USE_AMD_MFMA_GFX950
* Add new instances for gfx950
* Fix test_gemm_universal on gfx950