[CK] Address a bunch of errors associated with targeting
gfx1200 on Windows (#5045)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Still addressing errors that are blocking the merge of TheRock PR:
https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382
## Technical Details
1. There are multiple fmha python scripts that are writing native paths
which are confusing cmake. I addressed one of these in an earlier PR
https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing
more that are exposed with gfx1200 target:
```
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure] Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure] Invalid character escape '\b'.
```
2. In the following compiler error we see gemm_prec_str<ADataType,
BDataType> being passed as a function to concat(...), instead of being
evaluated with the parenthesis operator(), i.e.,
gemm_prec_str<ADataType, BDataType>(). There are multiples instances of
this, I wonder what non-msvc compilers do here:
```
[composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast]
[composable_kernel] 119 | ((oss << sep << rest), ...);
[composable_kernel] | ^~~~
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here
[composable_kernel] 248 | return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName());
[composable_kernel] | ^
```
There are plenty of other places where we use gemm_prec_str with the
operator(), so I'm pretty sure these were just typos...but I'd like some
eyes on it.
3. There are 2 tests that fail to build on Windows, which I've excluded
from the build but will open bug tickets for:
1. gemm_weight_preshuffle
2. grouped_gemm_preshuffle
Here's a sample of the compiler error for these tests:
```
[composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations]
[composable_kernel] 110 | const char* vp = std::getenv(name);
[composable_kernel] | ^
[composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here
[composable_kernel] 1183 | _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s)
[composable_kernel] | ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE'
[composable_kernel] 368 | #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \
[composable_kernel] | ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT'
[composable_kernel] 358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text))
[composable_kernel] | ^
[composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation)
[composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918)
[composable_kernel] Target: x86_64-pc-windows-msvc
[composable_kernel] Thread model: posix
[composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin
[composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s).
[composable_kernel] ninja: build stopped: subcommand failed.
[composable_kernel FAILED WITH CODE 1 in 238 seconds]
ninja: build stopped: subcommand failed.
```
## Test Plan
Wait for internal CI and make sure build compiles locally.
## Test Result
Waiting on CI
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch
* fix cpp17 compile error in the ck-tile examples
---------
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
* GH-2368 Adding a basic glossary
GH-2368 Minor edits
GH-2368 Adding missing READMEs and standardization.
resolving readme updates
GH-2368 Minor improvements to documentation.
Improving some readmes.
Further improvement for readmes.
Cleaned up the documentation in 'client_example' (#2468)
Update for PR
Update ACRONYMS.md to remove trivial terms
Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.
revise 37_transpose readme
revise 36_copy readme
Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.
Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.
Remove references to the Tile Engine in README files across multiple examples
* GH-2368 Adding a basic glossary
GH-2368 Minor edits
GH-2368 Adding missing READMEs and standardization.
resolving readme updates
GH-2368 Minor improvements to documentation.
Improving some readmes.
Further improvement for readmes.
Cleaned up the documentation in 'client_example' (#2468)
Update for PR
Update ACRONYMS.md to remove trivial terms
Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.
revise 37_transpose readme
revise 36_copy readme
Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.
Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.
Remove references to the Tile Engine in README files across multiple examples
Refine README files by removing outdated references to the Tile Engine
* Updates based on PR feedback 1
* Updates based on PR feedback 2
* Updates based on PR feedback 3
* Updates based on PR feedback 4
* Updates based on PR feedback 5
* Updates based on PR feedback 6
* Updates based on PR feedback 7
* Updates based on PR feedback 8
* Content Modification of CK Tile Example
* Modify the ck_tile gemm config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm
* Update rmsnorm host reference
* Update tree reduction of rmsnorm for reference host
* Fix cross warp for m > 1 cases
* Add RMSNorm model selectable option for host reference
* Fix save_unquant cases
* Update reference rmsnorm forward function to use enum for model sensitivity
* Update reference rmsnorm calculation for model sensitivity
* Fix m warp for layernorm
* Adjust parameter of reference for twoPass
* Fix clang format
* Run clang-format-overwrite.sh to fix formating issue
* fix clang format
---------
Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
BlockWarps, WarpTile in Generic2dBlockShape are wave size dependent, it causes mangled name mismatch between host and device side.
Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as its a third-party library
* Saparating function and defination.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
* add ck tile examples to package
* Update jenkinsfile
* fix for jenkinsfile
* fix for building ck tile code on non gfx9
* compile ck tile examples only for gfx94
* include ck tile examples in all target
* fix for basic gemm UseStructuredSparsity
* Update CMakeLists.txt
* Update gemm_pipeline_problem.hpp
* add targets to rocm install
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug ifx
* Add shortcut to RMSNorm
* Modify test for adding shortcut for RMSNorm
* Add fused parameter into tests
* 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp
* 1. Supports various stride and percisions.
* Add support of Epilogue
* Add fuse and epilogue support to rmsnorm ref
* Modify rmsnorm example
* Refactor tests/examples
* Bug fix for newly added tests/examples
* Bug fix for new tests 2
* Modify smoke test scripts
remove dbg code
* Supports non-smooth dyanmic quant
* Update Rmsnorm2dFwd::GetName()
* rename xscale and prec_sx to smoothscale and prec_sm
Bug fix after rename
Remove files
* change example_rmsnorm2d_fwd.cpp
* update performance calculator
* Fix issue in two-pass when fuse add is enabled
* Remove comment of beta
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* 1. enable bias feature that add bias before adding residual; 2. change block size from 128->64 when m<64 in fp16
* delete comment
* 1.remove fmha change 2.change buffer name from bias to xbias
* Now bias can be used independently from fadd
* change kbias to kxbias
---------
Co-authored-by: feli <felix.li@amd.com>
* add no welford
* enable output raw
* raw of int8
* fix build
* fix smoke test err
* [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate
* [cktile]layernorm, fix err commit files and remove uselss
* fix quant 8192 err & change norm_reduce class and file name
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
* [CK_TILE] add more stride for layernorm to support un-continuous Tensor
* align CK coding style
* extend strides to layernrom expample
* clang-format...
* optimze small N case using vec io and using rcp div
* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass
* [Ck_tile] fix blockSize compute in Generic2dBlockShape
* [Ck_tile]fix kfastfdiv template style
* [Ck_tile] layernorm, fix stype in review
---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
Before, generate.py appended the list at the end of the output file.
When running the cmake configuration steps multiple times on the
examples, the blob list (such as fwd_blob_list.txt) would grow at every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.
Now, generate.py overrides the content of the output file.
There is no need for the workaround in the CMakeLists.txt;
and the issue is solved for the example projects too.
* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
* port layernorm
* change warp_welford.hpp
* Update warpshuffle
* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()
* refine welford max count calculation
* unify layernorm api
* Rename file
* Remove save mean and inv std
* Revert "refine welford max count calculation"
This reverts commit 022365802b.
* Fix order of parameter
* refine welford max count calculation again
* Remove fp32 instances
* Fix bug of padding
* refactor api
* Support bf16
* Extract common function
* Refine arg of operator()
* Add kMThreadPerBlock to template parameter
* clang format
* Refine variable name
* Refine file name
* remove redundant line
* refactor layernorm2d pipeline and add block-per-block utility
* fix name
* rename more
* add more block-per-tile instance
* remove duplicated define
* update instance for 2048, 1024 case
* support up to 2048 now
* opt loading
* add n1536
* Add two pass pipeline
* format
* Fix incorrect type
* parallel compilation
* Use smaller N
* fix 2p pass
* Support Repeat_M in distribution
* Refine nameing
* Add reduce example
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
* Fix text alignment of ArgParser::print()
* Update example README files
* Clarify make-ck-dev.sh <arch> usage
* Only keep some of the argument from '-?' output
* Undo command line output changes in README
* Only keep existing argument on doc and update description
* Fix text alignment
* Make cmake-ck-*.sh compatible with 'sh' command
* Fix compile error
* Add one pass pipeline
* Extract creating tile_window to operator()
* clang format
* reduce duplicated code
* do not hardcode
* Support padding in layernorm
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Add layernorm2d forward
* Refind file path
* clang format
* Exclude ck_tile op from all
* use add_executable instead
* refactor layernorm2d_fwd example
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>