Test high-dimensional tensors to verify no integer overflow in
element space size calculations. Includes:
- 8D packed test with prime dimensions (180180 elements)
- 8D permuted layout test with non-monotonic strides
(memory order differs from logical order)
- Vary sequence values across all tests to avoid repetition
- Use prime numbers for distinct, coprime test data
- Add TwoSequences test for unpack_and_merge_sequences
- Remove tests of implementation details (detail::compute_element_space_size)
- Use public API (make_naive_tensor_descriptor) for all tests
- Avoid square/cube shapes that could hide row/column major bugs
- Use prime numbers for padding tests to catch index calculation errors
- Add two padding test cases: arbitrary offsets and stride slice
Add documentation for:
- sequence_map_inverse: O(N) to O(1) via pack expansion (95% time reduction)
- calculate_element_space_size: fold expression (73% time reduction)
Update case studies section with these optimizations.
The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385ms).
Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.
Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).
Replace with find_in_tuple_of_sequences helper that uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.
Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
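A minimal sketch of the lookup idea, for illustration only. `Sequence`, `Tuple`, `Location`, and `position_in` are simplified stand-ins, not the CK types, and this version uses a constexpr array plus a plain loop rather than the if-constexpr recursion the commit describes. The point it shows is that a search over a tuple of sequences can be done without a lambda per call site.

```cpp
namespace sketch {

// Simplified stand-ins for ck::Sequence and ck::Tuple (assumptions, not the real classes).
template <int... Is>
struct Sequence {};

template <typename... Ts>
struct Tuple {};

// Hypothetical result type: which sequence holds the target, and at which position.
struct Location
{
    int seq_index;
    int pos_in_seq;
};

// Position of Target inside one sequence, or -1 if absent.
template <int Target, int... Is>
constexpr int position_in(Sequence<Is...>)
{
    constexpr int values[] = {Is..., 0}; // trailing 0 keeps the array non-empty
    for(int i = 0; i < static_cast<int>(sizeof...(Is)); ++i)
    {
        if(values[i] == Target)
            return i;
    }
    return -1;
}

// First sequence in the tuple that contains Target, or {-1, -1} if none does.
template <int Target, typename... Seqs>
constexpr Location find_in_tuple_of_sequences(Tuple<Seqs...>)
{
    constexpr int positions[] = {position_in<Target>(Seqs{})..., -1};
    for(int s = 0; s < static_cast<int>(sizeof...(Seqs)); ++s)
    {
        if(positions[s] >= 0)
            return Location{s, positions[s]};
    }
    return Location{-1, -1};
}

} // namespace sketch
```

Only one function template is instantiated per (target, tuple) pair here, with no closure types involved.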
Use operator| with fold expression (Seqs{} | ...) to merge sequences
in O(1) template depth instead of O(log N) binary tree recursion.
- Reduces sequence_merge instantiations from 449 to 167 (63% reduction)
- Total template instantiations: 47,186 → 46,974 (-212)
- ADL finds operator| since Sequence is in ck namespace
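The fold-based merge can be sketched as follows. This is an illustrative reconstruction, not the CK source: `Sequence` is a bare stand-in, and the fold is written with an empty-sequence initial value so that an empty pack also compiles, which may differ from the actual `(Seqs{} | ...)` form.

```cpp
#include <type_traits>

namespace sketch {

// Bare stand-in for ck::Sequence (assumption: the real class has many more members).
template <int... Is>
struct Sequence {};

// Concatenate two sequences. Because Sequence lives in this namespace,
// argument-dependent lookup finds this operator at the point of use.
template <int... Xs, int... Ys>
constexpr auto operator|(Sequence<Xs...>, Sequence<Ys...>) -> Sequence<Xs..., Ys...>
{
    return {};
}

// Merge any number of sequences with a single fold expression
// instead of a recursive helper template.
template <typename... Seqs>
struct sequence_merge
{
    using type = decltype((Sequence<>{} | ... | Seqs{}));
};

static_assert(std::is_same_v<sequence_merge<Sequence<1, 2>, Sequence<3>>::type,
                             Sequence<1, 2, 3>>);

} // namespace sketch
```

The fold still performs one `operator|` call per sequence, but those are overload resolutions on a flat expression rather than nested instantiations of `sequence_merge` itself.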
Use pack expansion with fold expression to compute element space size
instead of recursive template or recursive lambda.
Results:
- calculate_element_space_size: 24 instances, 35ms → 10 instances, 9ms
- Max template depth: 24 → 23
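For illustration, a fold-expression version of the computation might look like the sketch below. It assumes the usual formula for a strided layout, 1 + sum((length_i - 1) * stride_i), and uses `std::array` and `std::int64_t` in place of the CK tuple and index types; the function names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

namespace sketch {

// One fold expression replaces the recursive accumulation:
// 1 + sum over all dimensions of (length - 1) * stride.
template <std::size_t N, std::size_t... Is>
constexpr std::int64_t element_space_size_impl(const std::array<std::int64_t, N>& lengths,
                                               const std::array<std::int64_t, N>& strides,
                                               std::index_sequence<Is...>)
{
    return (std::int64_t{1} + ... + ((lengths[Is] - 1) * strides[Is]));
}

template <std::size_t N>
constexpr std::int64_t element_space_size(const std::array<std::int64_t, N>& lengths,
                                          const std::array<std::int64_t, N>& strides)
{
    return element_space_size_impl(lengths, strides, std::make_index_sequence<N>{});
}

// Packed 2x3x5 layout: strides {15, 5, 1} give 30 elements.
static_assert(element_space_size<3>({2, 3, 5}, {15, 5, 1}) == 30);
// Same shape with permuted (non-monotonic) strides still spans 30 elements.
static_assert(element_space_size<3>({2, 3, 5}, {1, 10, 2}) == 30);

} // namespace sketch
```

Accumulating in a 64-bit type matters for large tensors: a packed 65536 x 65536 layout already needs 4294967296 elements, which does not fit in a 32-bit integer.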
Add make_uniform_tuple<N>(value) helper to replace common pattern:
generate_tuple([&](auto) { return value; }, Number<N>{})
This avoids unique lambda instantiations when creating tuples with
repeated values. Applied to device_grouped_conv_fwd_multiple_abd.
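A helper of this shape could be written roughly as below. This sketch uses `std::tuple` and `std::index_sequence` rather than the CK `Tuple` and `Number` types, so it illustrates the technique only and is not the actual implementation.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

namespace sketch {

// Expand the index pack only to repeat `value`; the indices themselves are discarded.
template <typename T, std::size_t... Is>
constexpr auto make_uniform_tuple_impl(const T& value, std::index_sequence<Is...>)
{
    return std::make_tuple((static_cast<void>(Is), value)...);
}

// Build a tuple holding N copies of `value` without a lambda at the call site.
template <std::size_t N, typename T>
constexpr auto make_uniform_tuple(const T& value)
{
    return make_uniform_tuple_impl(value, std::make_index_sequence<N>{});
}

static_assert(std::is_same_v<decltype(make_uniform_tuple<3>(7)), std::tuple<int, int, int>>);

} // namespace sketch
```

Every call with the same `N` and value type reuses the same two function instantiations, whereas each lambda passed to `generate_tuple` is a new closure type.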
Lambda expressions in transform_tensor_descriptor created unique template
instantiations for each capture combination. This change replaces lambdas
with named functor structs to reduce instantiation count:
- Add merge_sequences_functor and unpack_and_merge_sequences helper
- Add convert_visible_to_hidden_id and convert_visible_ids_to_hidden_ids
- Add generate_arithmetic_sequence_from_scan
Build analysis shows instantiation count dropped from 388 to 32 (92% reduction).
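The underlying effect can be demonstrated in isolation. In the sketch below, `Transform`, `add_one`, and the `*_site_*` functions are hypothetical names chosen for illustration. Two textually identical lambdas produce two distinct closure types, and therefore two instantiations of any template parameterised on them, while a named functor yields a single shared type.

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical stand-in for a library template parameterised on a callable.
template <typename F>
struct Transform
{
    F f;
    constexpr int apply(int x) const { return f(x); }
};

// Named functor: one type, no matter how many call sites use it.
struct add_one
{
    constexpr int operator()(int x) const { return x + 1; }
};

constexpr auto functor_site_1() { return Transform<add_one>{add_one{}}; }
constexpr auto functor_site_2() { return Transform<add_one>{add_one{}}; }

// Each lambda expression introduces its own closure type, even with identical bodies.
inline auto lambda_site_1()
{
    auto f = [](int x) { return x + 1; };
    return Transform<decltype(f)>{f};
}

inline auto lambda_site_2()
{
    auto f = [](int x) { return x + 1; };
    return Transform<decltype(f)>{f};
}

// Same instantiation for the functor, different instantiations for the lambdas.
static_assert(std::is_same_v<decltype(functor_site_1()), decltype(functor_site_2())>);
static_assert(!std::is_same_v<decltype(lambda_site_1()), decltype(lambda_site_2())>);

} // namespace sketch
```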
This adds an optimized helper for the common generate_tuple pattern:
generate_tuple([](auto i) { return Sequence<i.value>{}; }, N)
The new generate_identity_sequences<N>() function creates
Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>> without
requiring lambda instantiation at each call site.
Updated 21 call sites across threadwise_tensor_slice_transfer,
wrapper utilities, and layout files to use the new helper.
Build time improvement: ~1.1% wall-clock (18.3s -> 18.1s)
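As a rough sketch of how such a helper can avoid per-call-site lambdas, the version below expands an integer pack directly into the result type. `Sequence` and `Tuple` are empty stand-ins for the CK classes, and `std::make_integer_sequence` is assumed as the index source.

```cpp
#include <type_traits>
#include <utility>

namespace sketch {

// Empty stand-ins for ck::Sequence and ck::Tuple (assumptions for illustration).
template <int... Is>
struct Sequence {};

template <typename... Ts>
struct Tuple {};

// Expand 0..N-1 straight into Tuple<Sequence<0>, ..., Sequence<N-1>>.
template <int... Is>
constexpr auto generate_identity_sequences_impl(std::integer_sequence<int, Is...>)
    -> Tuple<Sequence<Is>...>
{
    return {};
}

template <int N>
constexpr auto generate_identity_sequences()
{
    return generate_identity_sequences_impl(std::make_integer_sequence<int, N>{});
}

static_assert(std::is_same_v<decltype(generate_identity_sequences<3>()),
                             Tuple<Sequence<0>, Sequence<1>, Sequence<2>>>);

} // namespace sketch
```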
Replace linear recursive instantiation with direct pack expansion
for 1-4 sequences, and binary tree reduction for larger cases.
Before: O(N) depth for merging N sequences
After: O(log N) depth with O(1) for up to 4 sequences
This further reduces maximum nesting depth from 26 to 22 levels
when combined with the previous sequence_gen optimization.
Co-Authored-By: Claude <noreply@anthropic.com>
Replace recursive template instantiation with compiler intrinsic
__make_integer_seq and pack expansion for O(1) instantiation depth.
Before: Maximum nesting depth of 90 levels with recursive divide-and-conquer
After: Maximum nesting depth of 26 levels using flat pack expansion
Performance improvements measured on example_grouped_conv_fwd_xdl_fp16:
- Template instantiation wall-clock time: 36.8s -> 18.7s (49% faster)
- Template instantiation cumulative time: 56.6s -> 25.8s (54% faster)
- Maximum nesting depth: 90 -> 26 (71% reduction)
The key changes:
- sequence_gen: Uses __make_integer_seq to generate indices 0..N-1,
then applies functor F via pack expansion in a single step
- uniform_sequence_gen: Uses __make_integer_seq with pack expansion
to generate N copies of a constant value
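The two generators can be sketched portably with `std::make_integer_sequence`, which standard libraries typically implement on top of a compiler builtin such as `__make_integer_seq` where one is available. The code below is an illustration under that assumption, with a stand-in `Sequence` and a hypothetical `square` functor; it is not the CK implementation.

```cpp
#include <type_traits>
#include <utility>

namespace sketch {

// Stand-in for ck::Sequence.
template <int... Is>
struct Sequence {};

// Apply functor F to every index 0..N-1 in a single pack expansion.
template <typename F, int... Is>
constexpr auto sequence_gen_impl(std::integer_sequence<int, Is...>)
{
    return Sequence<F{}(Is)...>{};
}

template <int N, typename F>
struct sequence_gen
{
    using type = decltype(sequence_gen_impl<F>(std::make_integer_sequence<int, N>{}));
};

// Repeat Value once per index; the index itself is discarded.
template <int Value, int... Is>
constexpr auto uniform_sequence_gen_impl(std::integer_sequence<int, Is...>)
{
    return Sequence<(static_cast<void>(Is), Value)...>{};
}

template <int N, int Value>
struct uniform_sequence_gen
{
    using type = decltype(uniform_sequence_gen_impl<Value>(std::make_integer_sequence<int, N>{}));
};

// Hypothetical functor used only to exercise sequence_gen.
struct square
{
    constexpr int operator()(int i) const { return i * i; }
};

static_assert(std::is_same_v<sequence_gen<4, square>::type, Sequence<0, 1, 4, 9>>);
static_assert(std::is_same_v<uniform_sequence_gen<3, 7>::type, Sequence<7, 7, 7>>);

} // namespace sketch
```

Neither generator recurses, so the instantiation depth does not grow with N.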
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix large case init bounds
* Revert "Fix large case init bounds"
This reverts commit 1abca05c6f.
* Restore CPU initialization for do_verification != 2
1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder
2. Added a detailed comment explaining why this is necessary:
- SLES15 is a legacy platform with limited C++20 ecosystem support
- While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 do not reliably support all C++20 features required by the experimental CK Builder
* Adding CK Tile documentation
* Updates based on feedback
* Fix tile window API description
* Fix remaining images
* add documentation about flush_cache and rotating_buffer functionality in ck_tile
* Supplement the documentation
* light edit of the ck tile conceptual doc
---------
Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Factor helpers out of conv_traits.hpp
* Create a non-templated conv_traits struct
* Migrate to new instance-specific instance_to_conv_traits functions
* Clean up reflection concepts
* Clean up ConvTraits helpers
* Update testing for convolution traits
This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.
* Address reviewer comments and resolve merge conflict
This is mostly adjustments to enum values so that the docs align correctly with the current code.
Also updated the calendar scope of the project to extend through March 2026.
The test_ck_tile_streamk_reduction test suite seems to have transient
failures; hence, we are disabling these tests for now. We will re-enable
them once the bug is resolved.
Adding owners permissions for two leading developers on the CK Builder subproject to help with reviews on that project, especially in the EU time zones.
Remove aska-0096, who has left AMD
* formatted
* formatting
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* enable prefill shapes
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Add support for rowcol and tensor GEMM operations
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update README
* adding preshuffle quant as new parameter and its associated new files
* remove debugging statements
* adding test
* enable preshuffle quant with permuteN
* updating readme and corresponding gemmconfigs
* updating cmake file
* fixing CI failures for grouped quant gemm
* debugging permuteN
* debugging
* debugging PermuteN
* initial commit
* resolving merge conflicts
* adding test cases
* initial commit with prints
* debugging
* fine-grained working
* debugging medium grained
* fixing the tile window
* formatting
* enabling prefill shapes
* working prefill shapes
* formatted
* clean up
* code cleanup
* bug fix after merging with develop
* G128 working for both prefill and decode shapes for preshufflequant
* clean up after merging with develop
* fixing group 64 for decode shapes
* non preshufflequant working for group size 128
* enable preshuffleb and preshufflequant with various group sizes
* reduce build time by splitting example into different datatype files
* Adding tests for preshuffleQuant
* address review comment
* fix for gfx1201
* compile time fix for gfx1201
* clang formatted
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
* CMakeLists.txt hack for Windows.
* Add Windows build instructions.
* Fix type issue with variadic min function.
* Use std::common_type to fix the variadic min/max functions.
* Enable CPU guard compilation on Windows.
* Suppress warnings related to std::getenv on Windows platform.
* Git ignore the output directory on Windows platform.
* Powershell script for running tests and generating reports.
* Improve test logging.
* Disable non-conv tests.
* Fix Debug build on Windows.
* More debug build changes.
* Update Windows build instructions.
* Enable all tests.
* Test fixes.
* Suppress not found linker options warning.
* Update unsigned long literals and format specifiers to work correctly in Windows
* Fix conv 3D bwd weight bilinear tests on Windows.
* Revert changes on .gitignore.
* Clean-up CMake project file for Windows builds.
* clang-format
* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.
* Fix building examples on Windows.
* Update Readme.
* Remove the suppression of the deprecated warnings.
* Remove Windows specific min/max implementations from CK Tile math core.
* Remove unnecessary no-op on Windows.
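The `std::common_type` fix for the variadic min function mentioned above can be illustrated with the sketch below. The exact CK signature is not shown in this log, so the names and structure here are assumptions. The idea is that arguments of mixed integer types (for example `int` and `long long`, whose relative widths differ between platforms) are converted to their common type before comparison, so the function no longer requires every argument to deduce to the same type.

```cpp
#include <type_traits>

namespace sketch {

// Base case: a single value is its own minimum.
template <typename T>
constexpr T min(T x)
{
    return x;
}

// Mixed-type case: convert everything to the common type, then compare.
template <typename X, typename Y, typename... Rest>
constexpr auto min(X x, Y y, Rest... rest) -> std::common_type_t<X, Y, Rest...>
{
    using R   = std::common_type_t<X, Y, Rest...>;
    const R a = static_cast<R>(x);
    const R b = static_cast<R>(y);
    return sketch::min(a < b ? a : b, static_cast<R>(rest)...);
}

static_assert(std::is_same_v<decltype(min(7, 9L)), long>);
static_assert(min(3, 2LL, 5u) == 2);

} // namespace sketch
```

Note that converting a negative signed value to an unsigned common type changes its value, so callers mixing signedness still need care.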
---------
Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>
* Refactor GPU verification kernel to gather error stats on GPU
* Check if result is all zero
* non-negative error count doesn't need custom Atomics
* Remove unnecessary AtomicMaxFloat function
* Simpler warp reduction, remove passed flag
* Move verification header to include
* Fix header path in test
* Fix block reduction loop
- Add support for direct store in epilogue instead of cshuffle
- Add padding support for wave transfer without transpose
- Add wave transfer with interleaved layout to support direct store
- Enable new functionalities on GEMMs
- Add optional new functionality support for grouped convolution fwd
- Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)
* memory op changes
* Fixing TILE_ENGINE_BASIC in Tile Engine
* Removing gfx90a from Tile Engine Run
* [CK TILE ENGINE] increasing ci configs for BASIC case
* Setting RUN_TILE_ENGINE_BASIC_TESTS to ON by default
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>