composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 11:47:48 +00:00

Author	SHA1	Message	Date
Max Podkorytov	97873bc0d5	Expand tensor descriptor test coverage - Add tests for make_naive_tensor_descriptor_packed (1D, 2D, 3D) - Add tests for make_naive_tensor_descriptor_aligned (2D, 3D) - Add 1D tensor tests with explicit strides - Ensure all shapes use distinct, coprime dimensions	2026-01-19 23:41:39 -06:00
Max Podkorytov	b60d14ba89	Address review feedback on tensor descriptor helper tests - Remove tests of implementation details (detail::compute_element_space_size) - Use public API (make_naive_tensor_descriptor) for all tests - Avoid square/cube shapes that could hide row/column major bugs - Use prime numbers for padding tests to catch index calculation errors - Add two padding test cases: arbitrary offsets and stride slice	2026-01-19 23:31:13 -06:00
Max Podkorytov	0c8188374a	Add unit tests for template optimization helpers Add Google Tests for optimized template utilities: - sequence_gen: Tests with custom functors (4 tests) - generate_identity_sequences: Tuple of identity sequences (4 tests) - find_in_tuple_of_sequences: O(1) sequence search (6 tests) - sequence_find_value: Value lookup in sequences (5 tests) - container_concat: Tuple/array concatenation (5 tests) - make_uniform_tuple: Repeated value tuples (4 tests) - compute_element_space_size: Fold expression (8 tests) - unpack_and_merge_sequences: Sequence merging (2 tests) Total: 43 new tests across 4 test files.	2026-01-19 18:35:57 -06:00
Max Podkorytov	05d9befe90	Document sequence_map_inverse and element_space_size optimizations Add documentation for: - sequence_map_inverse: O(N) to O(1) via pack expansion (95% time reduction) - calculate_element_space_size: fold expression (73% time reduction) Update case studies section with these optimizations.	2026-01-19 15:45:52 -06:00
Max Podkorytov	52fa8f6c2c	Add build time optimization documentation	2026-01-19 15:41:10 -06:00
Max Podkorytov	a565d87e08	Apply same optimization pattern to TensorAdaptor TensorAdaptor has identical InitializeElementSize and GetTransformAndItsUpperDimension patterns as TensorDescriptor. Apply the same optimization: - Replace nested static_for lambdas with find_in_tuple_of_sequences - Replace generate_tuple lambda with pack expansion Results: generate_tuple lambdas 100 -> 96 (4 events, 17ms eliminated)	2026-01-16 23:38:12 -06:00
Max Podkorytov	bbf5c5e926	Replace generate_tuple lambda with pack expansion in InitializeElementSize The InitializeElementSize function used generate_tuple with a lambda to compute visible dimension lengths. Each TensorDescriptor type created a unique lambda type, causing 78 instantiations (385ms). Replace with direct pack expansion using helper functions, eliminating the lambda instantiation overhead entirely. Results on example_grouped_conv_fwd_xdl_fp16: - generate_tuple lambdas: 178 -> 100 (44% reduction) - Template instantiation time: 19.5s -> 19.0s	2026-01-16 21:46:50 -06:00
Max Podkorytov	1d7c221c95	Replace nested static_for lambdas with compile-time search helper The GetTransformAndItsUpperDimension function used nested static_for loops with lambdas to search for a hidden dimension in UpperDimensionIdss. This caused 918 applier::operator() instantiations (81% of all applier instantiations). Replace with find_in_tuple_of_sequences helper that uses constexpr array lookup and if-constexpr recursion, eliminating the lambda instantiation overhead. Results on example_grouped_conv_fwd_xdl_fp16: - applier instantiations: 1132 -> 127 (89% reduction) - TensorDescriptor instantiations: 2503 -> 664 (73% reduction) - Template instantiation time: 23.4s -> 19.4s (17% reduction)	2026-01-16 21:46:50 -06:00
Max Podkorytov	9942fd6ab9	Replace sequence_merge O(log N) recursion with O(1) fold expression Use operator\| with fold expression (Seqs{} \| ...) to merge sequences in O(1) template depth instead of O(log N) binary tree recursion. - Reduces sequence_merge instantiations from 449 to 167 (63% reduction) - Total template instantiations: 47,186 → 46,974 (-212) - ADL finds operator\| since Sequence is in ck namespace	2026-01-16 21:46:27 -06:00
Max Podkorytov	e74b611c14	Replace O(N) recursive element space size with O(1) fold expression Use pack expansion with fold expression to compute element space size instead of recursive template or recursive lambda. Results: - calculate_element_space_size: 24 instances, 35ms → 10 instances, 9ms - Max template depth: 24 → 23	2026-01-16 21:46:27 -06:00
Max Podkorytov	a8c9be9378	Rewrite sequence_map_inverse using O(1) depth pack expansion Replace O(N) recursive template sequence_map_inverse_impl with constexpr function and pack expansion for O(1) template depth. Results: - sequence_map_inverse: 45 instances, 187ms → 7 instances, 10ms (95% reduction)	2026-01-16 21:46:27 -06:00
Max Podkorytov	02e42dcaa1	Replace lambdas with named functors in container_concat Lambdas create unique types per call site, causing duplicate template instantiations. Named functors are shared across call sites. Results: - container_concat: 186 → 93 instantiations (50% reduction) - Wall-clock: 518ms → 309ms (40% reduction)	2026-01-16 21:46:07 -06:00
Max Podkorytov	0a1e1cc66f	Add container_product helper for O(1) depth fold expression Replace O(N) recursive container_reduce with O(1) fold expression for computing products of container elements. This reduces template instantiation depth from 26 to 23 levels. - Add container_product() using unpack + fold expression - Migrate 10 call sites from container_reduce(x, multiplies{}, 1)	2026-01-16 21:46:07 -06:00
Max Podkorytov	22a409be00	Add make_uniform_tuple helper for repeated value patterns Add make_uniform_tuple<N>(value) helper to replace common pattern: generate_tuple([&](auto) { return value; }, Number<N>{}) This avoids unique lambda instantiations when creating tuples with repeated values. Applied to device_grouped_conv_fwd_multiple_abd.	2026-01-16 21:46:07 -06:00
Max Podkorytov	00849ac2e2	Replace lambdas with named functors in transform_tensor_descriptor Lambda expressions in transform_tensor_descriptor created unique template instantiations for each capture combination. This change replaces lambdas with named functor structs to reduce instantiation count: - Add merge_sequences_functor and unpack_and_merge_sequences helper - Add convert_visible_to_hidden_id and convert_visible_ids_to_hidden_ids - Add generate_arithmetic_sequence_from_scan Build analysis shows instantiation count dropped from 388 to 32 (92% reduction).	2026-01-16 21:45:36 -06:00
Max Podkorytov	d7e7fbdcff	Add generate_identity_sequences helper for common pattern This adds an optimized helper for the common generate_tuple pattern: generate_tuple([](auto i) { return Sequence<i.value>{}; }, N) The new generate_identity_sequences<N>() function creates Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>> without requiring lambda instantiation at each call site. Updated 21 call sites across threadwise_tensor_slice_transfer, wrapper utilities, and layout files to use the new helper. Build time improvement: ~1.1% wall-clock (18.3s -> 18.1s)	2026-01-16 21:45:31 -06:00
Max Podkorytov	3d46680be0	Optimize sequence_merge using direct concatenation for small cases Replace linear recursive instantiation with direct pack expansion for 1-4 sequences, and binary tree reduction for larger cases. Before: O(N) depth for merging N sequences After: O(log N) depth with O(1) for up to 4 sequences This further reduces maximum nesting depth from 26 to 22 levels when combined with the previous sequence_gen optimization. Co-Authored-By: Claude <noreply@anthropic.com>	2026-01-16 21:45:26 -06:00
Max Podkorytov	94b9e4b635	Optimize sequence_gen and uniform_sequence_gen using __make_integer_seq Replace recursive template instantiation with compiler intrinsic __make_integer_seq and pack expansion for O(1) instantiation depth. Before: Maximum nesting depth of 90 levels with recursive divide-and-conquer After: Maximum nesting depth of 26 levels using flat pack expansion Performance improvements measured on example_grouped_conv_fwd_xdl_fp16: - Template instantiation wall-clock time: 36.8s -> 18.7s (49% faster) - Template instantiation cumulative time: 56.6s -> 25.8s (54% faster) - Maximum nesting depth: 90 -> 26 (71% reduction) The key changes: - sequence_gen: Uses __make_integer_seq to generate indices 0..N-1, then applies functor F via pack expansion in a single step - uniform_sequence_gen: Uses __make_integer_seq with pack expansion to generate N copies of a constant value Co-Authored-By: Claude <noreply@anthropic.com>	2026-01-16 21:45:26 -06:00
Cong Ma	f9104ef9b3	[CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584 )	2026-01-16 15:27:39 -08:00
Johannes Graner	3f735c127b	[CK Profiler] Restore CPU tensor initialization when verification is not done on GPU (#3594 ) * Fix large case init bounds * Revert "Fix large case init bounds" This reverts commit `1abca05c6f`. * Restore CPU initialization for do_verification != 2	2026-01-16 10:56:53 -08:00
logicat	fec81109f1	Remove unnecessary hip_fp16 include from stream_config (#3549 )	2026-01-16 10:40:05 -08:00
John Shumway	2d233c838a	Disable CK Builder for SLES15 in Jenkins CI (#3581 ) 1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder 2. Added a detailed comment explaining why this is necessary: - SLES15 is a legacy platform with limited C++20 ecosystem support - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 do not reliably support all C++20 features required by the experimental CK Builder	2026-01-16 10:36:23 -08:00
spolifroni-amd	427d4fb9e9	CK Tile: fix some issues (#3557 ) * Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2026-01-16 10:34:44 -08:00
Thrupti Raj Lakshmana Gowda	de8ee379ad	Fixing GEMM Multi D on Tile Engine (#3583 )	2026-01-16 10:17:21 -08:00
John Shumway	644cdbe3c9	Merge pull request #3573 from ROCm/jshumway/builder-readme	2026-01-15 17:55:04 -08:00
Max Podkorytov	086a1f8861	Add LLM-agnostic Docker and build analysis tools (#3576 ) This commit introduces utility tools for building, testing, and analyzing Composable Kernel. The tools are designed to be LLM-agnostic and can be used with any AI assistant or directly from the command line. Tools Added: ============ 1. ck-docker - Docker container management - Start/stop ROCm-enabled containers - Build targets with CMake + Ninja - Run tests with gtest filters - Auto-detect GPU targets (gfx950, gfx942, etc.) - Per-user, per-branch container naming to avoid conflicts 2. ck-build-analysis - Build time profiling - Uses Clang's -ftime-trace for compilation analysis - Aggregates statistics across multiple trace files - Identifies template instantiation bottlenecks - Generates detailed Markdown reports with: * Compilation phase breakdown * Top expensive instantiations * Template family analysis * Data-driven optimization recommendations - Configurable granularity (1µs to 500µs) - PEP 723 compliant Python script with auto-dependency management via uv Key Features: ============= - LLM-agnostic design (works with any AI assistant) - Zero-configuration setup with automatic dependency installation - Comprehensive documentation in script/tools/README.md - Security hardening (input validation, no command injection) - Multi-file trace aggregation for accurate build analysis - Jinja2-based report generation for customizable output Implementation: =============== - script/tools/ck-docker - Main Docker orchestration script - script/tools/ck-build-analysis - Build analysis orchestration - script/tools/common.sh - Shared utilities (container mgmt, GPU detection) - script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer - script/tools/templates/ - Jinja2 templates for report generation - script/tools/README.md - Comprehensive documentation Directory Structure: ==================== script/tools/ ├── README.md # Main overview ├── README_ck-docker.md # ck-docker documentation ├── README_ck-build-analysis.md # ck-build-analysis documentation ├── ck-docker # Docker orchestration script ├── ck-build-analysis # Build analysis orchestration ├── common.sh # Shared utilities ├── analyze_build_trace.py # Python analyzer (PEP 723) └── templates/ └── build_analysis_report.md.jinja # Report template The tools follow Unix philosophy: do one thing well, compose easily, and work from both CLI and programmatic contexts.	2026-01-15 08:30:23 -08:00
dependabot[bot]	f57395689b	Bump rocm-docs-core[api_reference] from 1.31.1 to 1.31.2 in /docs/sphinx (#3577 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.1 to 1.31.2. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.1...v1.31.2) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.31.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-15 07:49:06 -08:00
Michal Kulikowski	e1f2a44096	[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation - Fixing args num in ck examples 68/69 Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2026-01-15 16:43:02 +01:00
Yung-sheng Tu	6df2d70143	Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429 ) * add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples	2026-01-15 07:19:31 -08:00
Estevan Vedovelli	e30207985a	Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-01-15 09:35:24 -05:00
Jeff Huang	993d3e2f0e	[FMHA] Enable page size 16 for batch prefill kernel (#3568 ) * [FMHA] Enable page size 16 for batch prefill kernel * Refactor batch prefill KV offset logic to simplify template arguments - Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args. - Add static assert to forbid `page_size=1` with vectorized layout.	2026-01-15 22:11:44 +08:00
John Shumway	5122637215	[CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547 ) * Factor helpers out of conv_traits.hpp * Create a non-templated conv_traits struct * Migrate to new instance-specific instance_to_conv_traits functions * Clean up reflection concepts * Clean up ConvTraits helpers * Update testing for convolution traits This is a lot of cleanup on tests to have verbose coverage of feature extraction, explicit tests for each supported device kernel, and simple, readable test code. * Address reviewer comments and resolve merge conflict	2026-01-15 10:03:21 +01:00
John Shumway	df7ee270a6	Update README.md files to match recent code changes This is mostly adjustments to enum values so that the docs align correctly with the current code. Also updated the calendar scope of the project to extend through March 2026.	2026-01-15 02:15:29 -05:00
Illia Silin	8705fdcb0c	add aiter test_batch_prefill and simplify jenkins file a bit (#3570 )	2026-01-14 14:07:47 -08:00
Emily Martins	7f912909ca	Disable CK Tile Stream-K reduction tests (#3559 ) The test_ck_tile_streamk_reduction test suite seems to have transient failures; hence, we are disabling these tests for now. We will re-enable them once the bug is resolved.	2026-01-14 14:02:21 -07:00
John Shumway	f08fb3f748	[CK_BUILDER] Update owners file for more reviews for CK Builder (#3572 ) Adding owners permissions for two leading developers on the CK Builder subproject to help with reviews on that project, especially in the EU time zones. Remove aska-0096, who has left AMD	2026-01-14 12:43:55 -08:00
Bartłomiej Kocot	a346cfa960	Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566 )	2026-01-14 12:37:12 -08:00
Bartłomiej Kocot	a07c8e38bd	Fix grouped conv bwd data wmma check (#3562 )	2026-01-14 11:04:37 -08:00
Khushbu Agarwal	118afa455c	[CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and corresponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * initial commit with prints * debugging * fine-grained working * debugging medium grained * fixing the tile window * formatting * enabling prefill shapes * working prefill shapes * formatted * clean up * code cleanup * bug fix after merging with develop * G128 working for both prefill and decode shapes for preshufflequant * clean up after merging with develop * fixing group 64 for decode shapes * non preshufflequant working for group size 128 * enable preshuffleb and preshufflequant with various group sizes * reduce build time by splitting example into diff datatype files * Adding tests for preshuffleQuant * address review comment * fix for gfx1201 * compile time fix for gfx1201 * clang formatted --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>	2026-01-14 10:00:19 -08:00
Ville Pietilä	1fc5a3f3ac	Build CK on Windows (#3458 ) * CMakeLists.txt hack for Windows. * Add Windows build instructions. * Fix type issue with variadic min function. * Use std::common_type to fix the variadic min/max functions. * Enable CPU guard compilation on Windows. * Suppress warnings related to std::getenv on Windows platform. * Git ignore the output directory on Windows platform. * Powershell script for running tests and generating reports. * Improve test logging. * Disable non-conv tests. * Fix Debug build on Windows. * More debug build changes. * Update Windows build instructions. * Enable all tests. * Test fixes. * Suppress not found linker options warning. * Update unsigned long literals and format specifiers to work correctly in Windows * Fix conv 3D bwd weight bilinear tests on Windows. * Revert changes on .gitignore. * Clean-up CMake project file for Windows builds. * clang-format * Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms. * Fix building examples on Windows. * Update Readme. * Remove the suppression of the deprecated warnings. * Remove Windows specific min/max implementations from CK Tile math core. * Remove unnecessary no-op on Windows. --------- Co-authored-by: User <user@example.com> Co-authored-by: Ville Pietilä <none> Co-authored-by: John Afaganis <john.afaganis@amd.com> Co-authored-by: Ville Pietilä <>	2026-01-14 07:31:45 -08:00
Johannes Graner	f173642087	[CK] Refactor GPU verification kernel to gather error stats on GPU (#3551 ) * Refactor GPU verification kernel to gather error stats on GPU * Check if result is all zero * non-negative error count doesn't need custom Atomics * Remove unnecessary AtomicMaxFloat function * Simpler warp reduction, remove passed flag * Move verification header to include * Fix header path in test * Fix block reduction loop	2026-01-14 16:04:50 +01:00
Johannes Graner	3ccb15ea02	[CK Profiler] Initialize tensors on GPU in CK profiler (#3550 ) * Initialize tensors on GPU in CK profiler * Kick CI	2026-01-14 16:04:14 +01:00
Linjun-AMD	717ed0b59f	[CK_TILE][FMHA] Enable gpt-oss sink (#3490 ) * Enable gptoss sink Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add gptoss sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update CHANGELOG.md Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix test args error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update test_fmha_fwd.cpp * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Revert "update sink test" This reverts commit `970b4f1686`. * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update valid sink_v in splitkv pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix clangformat error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix aiter scale error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_pipeline_qr_ks_vs.hpp * div scale_s for sink_value Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * update sink_value with bias Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Fix typo in dropout parameter in fmha_batch_prefill_kernel * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * optimized some code Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix splitkv error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update sink reference Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * Update smoke_test_fwd_sink.sh --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-01-14 21:32:06 +08:00
Enrico Degregori	693ff3bbb3	Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465 ) - Add support for direct store in epilogue instead of cshuffle - Add padding support for wave transfer without transpose - Add wave transfer with interleaved layout to support direct store - Enable new functionalities on GEMMs - Add optional new functionality support for grouped convolution fwd - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)	2026-01-14 11:02:19 +01:00
Thrupti Raj Lakshmana Gowda	51027474af	[CK TILE ENGINE] CI fix for Basic Tile Engine (#3554 ) * memory op changes * memory op changes * Fixing TILE_ENGINE_BASIC in Tile Engine * Removing gfx90a from Tile Engine Run * [CK TILE ENGINE] increasing ci configs for BASIC case * Setting RUN_TILE_ENGINE_BASIC_TESTS to ON by default --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-01-13 16:20:30 -08:00
Thomas Ning	00c46785a8	Shuffle fix for gfx950 (#3491 ) * solve compiler issue * solve the gfx950 mfma shuffle regression * refactor jenkinsfile to handle arch name better * [CK TILE] set divisor to count of thread along k dimension * fix the compiler error * solve degradation * Finish the multiplies fix * fix the scales * solve compilation error * solve the composes * solve the error of tile sweeper * fix the test and example * fix for gfx950 --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Cong Ma <congma13@amd.com>	2026-01-13 09:21:29 -08:00
Ville Pietilä	9908a87c31	[CK_BUILDER] Add bwd weight factories (#3509 ) * Add placeholder test. * Initial conv bwd weight factory. * Conv builder test refactoring. * Add missing pieces to bwd weight factory. * Improve compile time errors message when no matching factory is found. * Use macro to ensure automatic matching between concepts and their string representations. * Improve compile time diagnostics. * Small improvements. * Improve missing member/wrong type compile-time errors. * Improve compile time diagnostics. * Concept bug fixes. * Remove debug assert. * Update algorithm signature diagnostics. * Factory bug fixes. * First functional version of bwd weight conv factory. * Refactor handling of GEMM-K batch template parameter in conv bwd weight factory. * Concept improvements. * Improve concept diagnostics. * Introduce a common size type for concepts. * Update compiletime diagnostics to use the size type. * Update conv specialization enum. * Fix fwd conv builder tests. * Fix smoke tests. * Separate bwd weight and bwd data tests into separate targets. * Clean-up CK Tile builder tests. * Add bwd weight XDL CShuffle V3 factory. * Build conv bwd weight v3 instances successfully. * Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Test fix. * Add instance traits for bwd weight algorithms. * Add unit tests for instance strings. * Build new instance traits unit tests but exclude WMMA for now. * Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Conv bwd weight DL factory. * Final implementation for bwd weight DL factory. * Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle * Treat ref algorithm the same way as real algorithms in the dispatcher. * Refactor large tensor support and WMMA configuration. * Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3. * Update Readme. * Fix WMMA bwd weight tests. * Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3. * Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle. * Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 * Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weight convs. * Fix fwd factories after refactoring. * clang-format * Move compile-time diagnostics to a separate branch. * Fix ref algorithm dispatching. * Fix smoke tests. * clang-format * Fix factory for regular WMMA conv bwd weight. * Clarify builder Readme. * Remove obsolete test file. * Fix test after merge. * clang-format * Remove the C++26 extensions. * Unify conv elementwise ops and layout definitions for fwd and bwd directions. * Remove old layout and elementwise ops. * Unify handling of conv tensor types between fwd and bwd directions. * Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank. * Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms. * clang-format --------- Co-authored-by: Ville Pietilä <>	2026-01-13 18:12:38 +02:00
Po Yen Chen	710fa1fd3d	fix incorrect List import in reduce_parameter.py (#3555 )	2026-01-13 20:03:05 +05:30
Erwin Terpstra	eb041079a3	Implement grouped gemm tile loop for RDNA4 (#3304 ) * feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3	2026-01-13 07:14:23 +01:00
Jeff Huang	141f77aa12	[CK Tile] Fix FMHA LSE calculation and potential division by zero (#3326 ) This commit addresses numerical stability issues in the BlockFmhaPipelineQRKSVS pipeline when bias has -inf masking values: 1. Explicitly handle the case where the accumulated exponential sum (l) is zero. In this case, the LSE is now correctly set to negative infinity, preventing log(0) errors. 2. Extend the zero-check protection in the normalization step to cover the ELEMENTWISE_BIAS case, preventing potential division by zero.	2026-01-13 13:52:26 +08:00
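
Note on commit e74b611c14 in the list above: it describes replacing a recursive element-space-size computation with a pack expansion and fold expression. The following is a minimal C++17 sketch of that general technique, not the code from the commit. It assumes the common strided-tensor definition (1 plus the sum of (length - 1) * stride over all dimensions); the names `element_space_size` and `element_space_size_impl` and the use of `std::array` are hypothetical and chosen only for illustration.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical helper (not CK's API): expands the index pack Is... once and
// sums all terms with a single binary fold, so no recursive template is needed.
template <std::size_t N, std::size_t... Is>
constexpr std::size_t element_space_size_impl(const std::array<std::size_t, N>& lengths,
                                              const std::array<std::size_t, N>& strides,
                                              std::index_sequence<Is...>)
{
    // Assumed definition: 1 + sum_i (lengths[i] - 1) * strides[i].
    // With an empty pack the fold yields the initial value 1.
    return (std::size_t{1} + ... + ((lengths[Is] - 1) * strides[Is]));
}

// Hypothetical entry point: generates indices 0..N-1 and forwards them.
template <std::size_t N>
constexpr std::size_t element_space_size(const std::array<std::size_t, N>& lengths,
                                         const std::array<std::size_t, N>& strides)
{
    return element_space_size_impl(lengths, strides, std::make_index_sequence<N>{});
}

// Packed 3x5 row-major tensor: 1 + 2*5 + 4*1 = 15 elements.
static_assert(element_space_size(std::array<std::size_t, 2>{3, 5},
                                 std::array<std::size_t, 2>{5, 1}) == 15);

// Same shape with a padded row stride of 7: 1 + 2*7 + 4*1 = 19 elements.
static_assert(element_space_size(std::array<std::size_t, 2>{3, 5},
                                 std::array<std::size_t, 2>{7, 1}) == 19);
```

The non-square 3x5 shape and the prime stride of 7 follow the testing advice in commit b60d14ba89 (avoid square shapes, use primes for padding), since a swapped length or stride then changes the result.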


2932 Commits
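
Note on commits 02e42dcaa1 and 00849ac2e2 in the list above: both state that lambdas create a unique type per call site, which duplicates template instantiations, while named functors are shared. The sketch below illustrates that language rule in plain C++17. It is an illustration only; `apply_twice`, `add_one`, `lambda_a`, and `lambda_b` are hypothetical names and do not appear in Composable Kernel.

```cpp
#include <type_traits>

// Hypothetical generic helper: instantiated once per distinct functor type F.
template <typename F>
constexpr int apply_twice(F f, int x)
{
    return f(f(x));
}

// Named functor: every use site refers to the same type, so
// apply_twice<add_one> is instantiated only once.
struct add_one
{
    constexpr int operator()(int x) const { return x + 1; }
};

// Two lambdas with identical bodies still have two distinct closure types,
// so passing both to apply_twice produces two separate instantiations.
inline constexpr auto lambda_a = [](int x) { return x + 1; };
inline constexpr auto lambda_b = [](int x) { return x + 1; };

static_assert(!std::is_same_v<decltype(lambda_a), decltype(lambda_b)>,
              "each lambda expression has its own closure type");

// All three callables compute the same result; only the type identity differs.
static_assert(apply_twice(add_one{}, 1) == 3);
static_assert(apply_twice(lambda_a, 1) == 3);
static_assert(apply_twice(lambda_b, 1) == 3);
```

In a header-heavy code base, the same lambda body repeated across many call sites multiplies the instantiations of every template it is passed to, which is consistent with the instantiation-count reductions those commit messages report.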