composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-03-29 11:37:38 +00:00

Author	SHA1	Message	Date
Ville Pietilä	fc22320d78	[CK_TILE] Split-K autodeduction (#3351 ) * First version of split-K autodeduction. * Fix circular dependency and kernel construction. * Fix tolerance calculation for bwd weight example. * Simplify kernel construction. * Fix kernel launching bug for split-K autodeduce. * Add split-K autodeduction support for the two stage example. * Fix a corner case. * Fix clang-format. * Fix clang-format for inc files. * Add missing header. * Prevent too large split-K values. * Fix formatting. * Add unit tests for IsSupportedArgument in grouped bwd conv. * clang-format. * Fix merge conflicts. * Address feedback from code review. * clang-format * Fix new tests after merge. --------- Co-authored-by: Ville Pietilä <>	2025-12-10 09:30:30 +02:00
Max Podkorytov	d184eed823	[CK-Tile] Refactor base pipeline usage (#3251 ) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder	2025-12-04 11:45:49 -08:00
JH-Leon-KIM-AMD	4baa4c9fae	[CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216 ) * LWPCK-4043: Add GPU reference implementations for CK Tile convolution This commit implements GPU-based reference kernels for CK Tile convolution operations to enable faster verification of optimized kernels, especially for large tensors (>2GB). Changes: - Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution - Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data - Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight - Integrate GPU references with test infrastructure (replace -v=2 error) - Support for 1D, 2D, and 3D convolutions - Generic data type support (FP16, BF16, FP32) - Grid-stride loop pattern for scalability The GPU references use a simple, readable implementation that prioritizes correctness over performance. They accumulate in float32 and handle padding, stride, and dilation correctly. * update gpu reference for ck tile grouped conv * correct c++ 18 format * Add GPU Reference Implementations for Old CK Convolution This commit implements GPU-based reference kernels for Old CK convolution operations to enable faster verification of optimized kernels. Changes: - Fixed old CK forward GPU reference (naive_conv_fwd.hpp) * Fixed BF16 NaN issue (use type_convert instead of static_cast) * Fixed FP8/BF8 arithmetic (accumulate in float) * Fixed uninitialized variables * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed) - Created backward data GPU reference (naive_conv_bwd_data.hpp) * Implements input gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Created backward weight GPU reference (naive_conv_bwd_weight.hpp) * Implements weight gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Integrated with old CK examples * Forward: 10 XDL examples now support do_verification=2 * Backward data: Integrated with example/17_convnd_bwd_data/ * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only) * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU) Testing: - 50 comprehensive tests created - 42/42 tests passing (100% success rate) - CPU and GPU verification produce identical results - Verified across multiple dimensions, sizes, and data types Limitations: - GPU references support standard convolution only (G=1) - Fused operations (DL variants) not supported - Some tests blocked by optimized kernel size constraints Result: Old CK GPU references can replace CPU references for verification with 50-100x performance improvement for large tensors. * Apply clang-format to old CK GPU reference files * Fix C++17 compatibility: use brace initialization for aggregate types * add get_rtol, get_atl and consistency cout message * Use triple bracket syntax for kernel launch per review feedback Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski. This is more idiomatic HIP/CUDA style and equally correct. All tests still passing after this change. * Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode - Replace manual error checking with HIP_CHECK_ERROR macro - Add v=3 verification mode (GPU ref vs CPU ref direct comparison) - Consistent output format across all examples - All tests passing (7/7 v=3 tests pass for FP16) * Use ConvDims structure to simplify GPU reference kernels Replace 24 individual parameters with ConvDims structure per review feedback. - Add conv_common.hpp with ConvDims and helper function - Update kernel signatures: 24 params → 1 structure - Remove duplicate extraction code from host files * Use get_block_id() and get_thread_id() helpers in CK Tile Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions. Updated 3 CK Tile GPU reference kernels per review feedback. * Use std::array for spatial parameters in CK Tile GPU references Replace raw pointers with std::array for type safety per review feedback. - Add conv_common.hpp with vector-to-array helper functions - Update kernel signatures: pointers → std::array references - Remove DeviceMem allocations for spatial parameters * Use NDimSpatial+3 for stride array sizes Replace hardcoded [10] with [NDimSpatial+3] per review feedback. Array sizes now correctly reflect actual dimensions needed. * Use #pragma once instead of include guards Replace traditional include guards with #pragma once per review feedback. Updated 3 Old CK GPU reference headers. * Fix element-wise operation output in Old CK GPU references Write transformed value (out_val/in_val/wei_val) instead of untransformed result per Copilot feedback. This ensures element-wise operations are correctly applied to output. * Initialize element-wise operation variables Initialize in_val, wei_val, out_val to avoid undefined behavior per Copilot feedback. Updated backward data and backward weight kernels. * Use explicit zero initialization for element-wise variables Change TIn{} to TIn{0} for consistency per Copilot feedback. All 3 kernels now use consistent zero initialization. * Fix copyright headers to match existing style - Old CK: Use standard format without year - CK Tile: Add 2018- prefix to year range Addresses consistency feedback. * Rename GPU reference files: add _gpu suffix * Refactor index calculations: use std::array and extract to helper functions * Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-03 21:14:21 +02:00
Ville Pietilä	66832861ad	[CK_TILE] Merge multiple fwd convolution groups into a single GEMM batch. (#3136 ) * Merge fwd conv groups in CK Tile. * Fix building CK fwd convs. * Add number of merged groups to conv fwd kernel name. * Get number of merged groups from conv config. * Rename GemmConfig to ConvConfig. * Clean-up TODOs. * Check that number of conv groups must be divisible by the number of merged groups. * Improve error handling in the conv fwd example. * Fix clang-format. * Fix group offsets. * Fix merge problem. * Address feedback from code review. * Fix clang-formatting.	2025-12-02 15:23:32 +02:00
Aviral Goel	004784ef98	chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 ) * chore(copyright) update library wide CMakeLists.txt files copyright header template * Fix build --------- Co-authored-by: Sami Remes <samremes@amd.com>	2025-11-28 13:49:54 -08:00
arai713	24d88d2472	[CK_TILE] Move DataTypeTraits into a Common File (#3146 ) This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile. Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2025-11-27 09:09:54 -08:00
Max Podkorytov	79aae7c7f7	[CK Tile] enable building examples by default (#3259 ) * remove EXCLUDE_FROM_ALL from ck-tile examples -> +15 min build time w/ 64 threads for a single arch * fix cpp17 compile error in the ck-tile examples --------- Co-authored-by: khuagarw <khuagarw@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-11-26 16:24:44 -08:00
Bartłomiej Kocot	00dfa2f2ce	[CK TILE] Grouped Conv Explicit Gemm (#3289 ) * [CK TILE] Grouped Conv Explicit Gemm * fixes * apply builder fixes	2025-11-25 23:28:35 +01:00
Bartłomiej Kocot	9ac2666d5b	[CK_BUILDER] Add grouped conv bwd ck tile traits (#3281 ) * [CK_BUILDER] Add grouped conv bwd ck tile traits * copilot fixes	2025-11-25 14:57:43 +01:00
Aviral Goel	d85f065b15	chore(copyright): update copyright header for example directory (#3273 ) * chore(copyright): update copyright header for codegen directory * chore(copyright): update copyright header for example directory	2025-11-24 18:02:41 -08:00
Johannes Graner	096f0a3b23	[CK Tile] Fix example for conv fwd + bias + clamp (#3235 ) * Fix clamp not being applied correctly * Apply group offsets to D tensors --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-11-24 07:36:26 +01:00
Bartłomiej Kocot	2234ff830b	[CK TILE] Convolution remove magic values (#3160 ) * [CK TILE] Refactor Conv configs and Conv Elementwise * fix * [CK TILE] Convolution remove magix values * fix partitioner	2025-11-06 11:26:30 +01:00
Bartłomiej Kocot	8681ced962	[CK TILE] Refactor Conv configs and Conv Elementwise (#3151 ) * [CK TILE] Refactor Conv configs and Conv Elementwise * fix	2025-11-04 15:04:53 +01:00
Bartłomiej Kocot	99f38e4d9b	[CK TILE] Refactor grouped conv fwd large tensor (#3144 )	2025-11-04 00:34:48 +01:00
Emily Martins	2ec57a8e70	Replace CK_TILE_PIPELINE macros with a common enum This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3, CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum called GemmPipeline to reduce code duplication.	2025-11-03 09:35:05 -07:00
JH-Leon-KIM-AMD	1fbb47ad30	[CK TILE] Grouped conv fwd split image (#2970 ) * Refactor split-image implementation: simplify code and remove redundant variables * Add padding debug output to split-image implementation - Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp - Verified padding works correctly with all tests passing * Fix sign comparison warning after rebase with origin/develop - Cast blockIdX from unsigned to signed index_t for comparisons - Integrated with new GetOutputTileIndex logic from upstream - Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane * Fix Split-N with groups bug and clean up unused parameters - Fixed batch stride calculation to include G dimension for grouped convolutions - When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups - Removed unused multi-split parameters (we only support 2-way split) - All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N * Implement recursive queue-based split-image detection and calculation - Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp - Implement recursive binary splitting algorithm (10GB→5GB+5GB→...) - Correctly handle odd dimensions (61→30+31) - Calculate proper offsets for each split piece - Update invoker to use split-image helper Note: Split detection and calculation work correctly but kernel launching for individual pieces requires kernel modification to handle different spatial dimensions (unlike Split-N which uses blockIdx.z). * WIP: Split-Image investigation - found architecture mismatch - Split-N modifies N_ directly in transformer constructor - Split-Image needs different approach due to varying dimensions - Added split calculation logic for 1D and 2D convolutions - Still facing memory issues when creating piece transformers Key finding: Split-N uses blockIdx.z for parallel execution, while Split-Image needs sequential execution of non-uniform pieces. * Add 1D split-image implementation for grouped convolution (N=1 working) Implements split-image for 1D convolution to handle large tensors that exceed memory thresholds. This is a critical milestone with N=1 fully working and tested. Key Changes: - Invoker: Add split-image logic that splits W dimension in half - Transformer: Add SplitConvProblem helper for recursive splitting - Calculate offsets for LEFT and RIGHT pieces - Launch two kernels sequentially (LEFT then RIGHT) Implementation Details: - Binary split: divides W dimension by 2 - LEFT piece: W=0 to W/2, keeps left padding, removes right padding - RIGHT piece: W/2 to W, removes left padding, keeps right padding - Offset calculation accounts for stride, dilation, and padding - Physical memory offset (no padding in memory) Test Results (N=1): ✅ 94/94 tests passing - Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups) - Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries) - Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops) Known Limitations: - Only works with N=1 (single batch) - N>1 fails when split-image triggers (offset calculation issue with Split-N) - Root cause: Split-N modifies N in transformer, but offset calculated in invoker - Solution planned: Move offset calculation to transformer (next phase) Files Modified: - grouped_convolution_forward_invoker.hpp: Add split-image logic - transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper This commit represents a stable, tested 1D split-image implementation for N=1 cases. It's an important milestone before extending to N>1 and multi-dimensional splits. * Add basic split-image implementation for 1D/2D/3D grouped convolution This is a working baseline implementation that splits large spatial dimensions to handle memory constraints. Implementation: - 1D: W-split for NWGC layout (36/36 tests passing) - 2D: H-split for NHWGC layout (20/20 tests passing) - 3D: D-split for NDHWGC layout (verified working) Features: - Binary split of outermost spatial dimension - Sequential LEFT/RIGHT kernel launches - Proper padding adjustment at split boundaries - Offset calculation for pointer arithmetic - Debug output for verification Threshold: 100KB (configurable in transformer) Known limitations: - No safety checks for edge cases (to be added) - Offset calculated before Split-N (incompatible with N>1, to be fixed) - No recursive splitting for very large tensors Next steps: - Add safety checks (is_possible_to_split_) - Move offset calculation to transformer (after Split-N) - Test with N>1 + split-image combination Refactor split-image to unified structure for 1D/2D/3D Unified the three separate dimension-specific blocks into a single common implementation with dimension-specific stride calculations. Benefits: - Reduced code from 636 → 348 lines (45% reduction) - Eliminated code duplication - Easier to maintain and extend - Single source of truth for split logic Implementation: - Common: Binary split, offset calc, padding adjustment, kernel launch - Dimension-specific: Stride calculation only - 1D: stride = G * C - 2D: stride = W_in * G * C - 3D: stride = H_in * W_in * G * C Test results (all passing): - 1D: 36/36 tests ✅ - 2D: 20/20 tests ✅ - 3D: 28/28 tests ✅ - Total: 84/84 (100%) All test scenarios verified: - Varying channels, padding, stride, dilation - Filter sizes (1x1 pointwise to 7x7) - Multiple groups (G=1,2,4) - Odd dimensions - Complex combinations * Add safety checks for split-image in all dimensions Added is_possible_to_split safety checks to prevent crashes when splitting is not feasible. Safety checks verify: 1. Output dimension > 1 (can't split single element) 2. RIGHT piece starts after left padding 3. LEFT piece ends within input bounds If checks fail, falls back to normal kernel launch. Verified for all dimensions: - 1D (W-split): Wo=1 case triggers fallback - 2D (H-split): Ho=1 case triggers fallback - 3D (D-split): Do=1 case triggers fallback Original 84 tests still pass - they use normal configurations that naturally satisfy safety conditions. Safety checks protect against pathological edge cases with: - Very small spatial dimensions - Extreme stride/dilation combinations - Invalid padding configurations * Fix Split-N + Split-Image compatibility issue Fixed critical bug where Split-N and Split-Image working together caused ~50% incorrect results due to wrong batch stride calculation. Problem: - Batch stride was calculated using MODIFIED spatial dimensions (e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000) - Spatial offset was applied globally in invoker, not per-batch in kernel - Each batch (blockIdx.z) got wrong memory offset Solution: 1. Store spatial offset in kargs (don't apply to pointer in invoker) 2. Copy correct batch_stride from temp_kargs to left/right kargs 3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset Changes: - grouped_convolution_forward_kernel.hpp: * Added spatial_offset_in/out fields to KernelArgs * Apply batch + spatial offset in operator() - grouped_convolution_forward_invoker.hpp: * Keep base pointer, store spatial offset in kargs * Copy batch_stride from temp_kargs (has original dimensions) - transform_conv_fwd_to_gemm.hpp: * Add debug output for split-image calculation Results: - N=1 tests: 84/84 passing (100%) - N>1 tests: Now all passing (previously ~50% errors) - Tested: 1D, 2D, 3D with N=1,2,4,8,16,20 * Implement unified threshold for Split-N and Split-Image This commit consolidates threshold management for both Split-N and Split-Image operations into a single source of truth, eliminating code duplication and fixing offset calculation issues. Key Changes: ============ 1. Transformer (transform_conv_fwd_to_gemm.hpp): - Moved TwoGB constant to public section for unified access - CalculateSplitImage() now takes no parameters - Uses internal threshold: TwoGB / sizeof(CDataType) - Calculates offsets using N_ (after Split-N) for correctness 2. Kernel (grouped_convolution_forward_kernel.hpp): - GetSplitImageInfo() simplified to take no parameters - Forwards to transformer's CalculateSplitImage() - Clean interface with unified threshold internally 3. Invoker (grouped_convolution_forward_invoker.hpp): - Removed redundant threshold calculation - Simplified to call kargs.GetSplitImageInfo() with no params - Clean early-return pattern (no unnecessary else blocks) - Removed duplicate/dead code paths Benefits: ========= - Single source of truth: TwoGB defined once in transformer - No parameter passing for threshold between components - Correct offset calculation using N_ (post-Split-N) - Cleaner code with no duplication - All tests passing: 1D/2D/3D with various N values Testing: ======== - Split-Image only (N=1, large spatial): PASS - Split-N only (N>1, small spatial): PASS - Both splits active (N>1, large spatial): PASS - No splits (N=1, small spatial): PASS - CPU verification correct for all scenarios * Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded) The old recursive queue-based implementation has been replaced by the new CalculateSplitImage() method which is simpler and correctly handles Split-N + Split-Image interaction. Changes: - Wrapped lines 381-1078 in #if 0...#endif - Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded() - Preserved for reference but disabled from compilation - No functional changes - all tests still pass The new implementation (CalculateSplitImage at line ~2163) provides: - Correct offset calculation using N_ (after Split-N) - Simpler binary split logic - Better integration with unified threshold approach * Implement recursive split-image with depth limit (MAX_DEPTH=10) Changes: - Add depth tracking to SplitPiece struct - Implement two stopping conditions: 1. Piece size below threshold (optimal case) 2. Depth >= MAX_DEPTH (prevents infinite recursion) - Remove MAX_PIECES limit in favor of depth-based control - Support up to 2^10 = 1024 pieces with depth 10 This allows handling extreme tensor sizes while ensuring termination. Pieces larger than threshold will still launch correctly if depth limit reached. Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification. * Summary of recursive split-image implementation: - Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces) - Two stopping conditions: size below threshold OR max depth reached - Cumulative offset tracking through all recursion levels - LEFT piece inherits parent offset, RIGHT accumulates (parent + local) - Per-batch spatial offset application in kernel operator() - Batch stride uses original dimensions (before split) - Works with Split-N: split-N first, then recursive split-image - Handles odd dimensions, padding, stride, dilation correctly - All 1D/2D/3D tests pass with CPU verification * Add comment explaining MAX_DEPTH capacity for 2GB threshold * Refactor: move recursive split-image logic to transformer - Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp - Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic) - Encapsulate SplitPiece struct and BFS splitting algorithm in transformer - Remove unused includes (queue, vector) from invoker - Add documentation comment for AreDescriptorsSmallerThan2GB() - Improve code organization and reusability - No performance overhead (static template function, compiler inlines) - All tests passing with 2GB production threshold * Apply clang-format-18 formatting - Format invoker and transformer files with clang-format-18 - Fix brace placement and alignment - No functional changes * Fix clang-format-18 issues in forward kernel - Remove extra blank lines - Fix line wrapping for template calls - Consolidate GetSplitImageInfo() to single line * Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Split-Image implementation with temporary fixed divider - Implemented spatial dimension splitting (Split-Image) for large tensors - Added piece-based coordinate transformation for 1D/2D/3D convolutions - Integrated Split-N (batch splitting) with automatic threshold detection - Fixed M dimension calculation to include batch: M = N × spatial_size - Added spatial offset support in kernel arguments - Verified 20/20 test cases passing for Split-Image alone - Known issue: Split-N + Split-Image combination needs coordinate fix Implementation Details: - Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values - Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts - Piece descriptors track spatial boundaries and block ranges - No performance overhead for N=1 cases * Fix 1D split-image padding issue with per-piece dimensions - Store actual size per piece to handle non-uniform splits - Remove dead code from transform utils * Fix 2D/3D split-image with independent split factors per dimension Problem: Single split factor caused non-uniform pieces when dimensions didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed. Solution: Independent split factor selection for W, H, D dimensions. Each dimension gets optimal factor based on its own size. Test Results: - 1D: 42/42 pass (100%) - 2D: 25/25 pass (100%) - Total: 67/67 combinations verified * Remove unused split-image struct fields Cleanup of split-image implementation: - Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct - These fields were declared but never used in the kernel - Per-piece dimensions are already stored in pieces[] array - Reduces struct size and improves code clarity Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass * Refactor split-image invoker code for improved readability - Extract piece calculation logic into calculate_piece lambda helper - Extract kernel args population into populate_split_image_kargs lambda - Use aggregate initialization for cleaner struct population - Reduce nesting depth and improve maintainability - Fix outdated comment about split-image implementation status * Refactor split-image code and remove debug prints - Extract GPU kernel helper lambdas for better readability - Remove all split-image debug print statements - Set memory threshold to 2GB for production - All tests pass with CPU verification * Add split-image safety constraints and refactor to utils - Add MAX_TOTAL_PIECES=64 limit to prevent segfault - Move calculate_spatial_piece to library utils - Add layout validation (NWGC, NHWGC, NDHWGC only) - Fix hierarchical splitting to respect piece limits - Add proper documentation and formatting * Change split-image from runtime to compile-time branching Response to @bartekxk review comment: Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)' Changes: - Add EnableSplitImage template parameter to kernel - Change runtime if to compile-time if constexpr - Update invoker to instantiate kernel variants with true/false Benefits: - Eliminates runtime branching in GPU kernel - Dead code elimination (each variant is smaller) - Better compiler optimization Files modified: 2 Lines changed: 20 total (6 in kernel, 14 in invoker) Tests: 27/27 passed (100%) Performance: No regression * Add split-image example as separate binary - Create grouped_convolution_forward_split_image example - Add grouped_convolution_forward_split_image_invoker.hpp - Update CMakeLists.txt to build split_image binary * Replace linear search with binary search in find_piece_id - Change O(n) to O(log n) for finding piece ownership - Matches reference implementation in large_tensor_cshuffle * Simplify split-image code and fix integer overflow - Extract lambda functions to static helper methods - Pre-calculate constants in invoker - Fix integer overflow in tensor size calculation for large tensors * Trigger CI rerun - fix merge conflicts * Fix merge conflict markers * Fix clang-format: remove space before {} * Fix clang-format: comment wrapping and Swish constructor * Rename split_image to large_tensor for clarity - Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp - Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp - Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor - Updated comments to refer to 'large tensor' instead of 'split-image' * Update comments and include in large_tensor example - Updated header comments to use 'large tensor' terminology - Fixed include path to use large_tensor_invoker.hpp * Remove test code, restore 2GB threshold * Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix build errors after develop merge and complete rename to large_tensor This commit addresses compilation errors from the develop merge and completes the rename from split_image to large_tensor. Changes: 1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp 2. Fix template parameter order in large_tensor_invoker.hpp - TransformConvFwdToGemm signature changed in develop - NumGroupsToMerge and SplitN parameters swapped positions 3. Fix missing template parameter in GroupedConvFwdHostArgs 4. Fix EpiloguePipeline scope in kernel (merge conflict) 5. Update binary name references in test scripts * Restore 2GB threshold for split-image Changed threshold from 100MB (testing) back to 2GB for production use. * Fix const-correctness in ds_ptr cast * Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply clang-format-18 * update c++ 18 format * Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-11-01 14:18:16 +02:00
Bartłomiej Kocot	c2d7931446	[CK TILE] Clear output buffers for grouped conv bwd (#3127 )	2025-10-31 14:11:54 +01:00
Ville Pietilä	22d9f99942	Fixed building CK Tile grouped conv fwd bias clamp example. (#3124 )	2025-10-30 18:17:48 +02:00
Ville Pietilä	121bf0e1f3	[CK_Tile] Merge multiple convolution groups into a single GEMM batch (#2986 ) * Fix compilation of the grouped conv examples. * Fix grouped conv bwd weight example output in CK Tile. * Add number of groups to merge to ck tile grouped gemm example. * Initial set of tests for TransformConvBwdWeightToGemm. * Added unit tests for TransformConvBwdWeightToGemm conv groups are merged. * WIP: Tensor transformations. * Add unit tests for coordinate transforms. * Fully working conv group merging for TransformConvBwdWeightToGemm. * WIP: Merged conv groups offset calculation. * Adde unit tests for tensor view. * WIP: Merged conv groups epilogue. * Enable running multiple conv groups per batch. * Add tests for tile_distribution_encoding. * Change example to match optimally depthwise convolution with merged groups. * Add more tests for tensor view. * Integration test for reading diagonal blocks from grouped distributed tensor. * Improved integration test. * Improve test for accessing diagonal blocks. * Added integration test for cshuffle epilogue LDS tile distribution. * Add more logging. * Increase the max number of reported errors. * WIP: merged conv groups GEMM epilogue changes. * LDS to global memory copy. * Fix tile window size for c block. * Integration test for CShuffle epilogue. * Improved CShuffle test. * WIP: Separate epilogue for merged conv groups. * Tile example parameters changes to match depthwise conv. * Offset fixes. * Epilogue fixes. * Working baseline for depthwise covolution with merged conv groups. * Fix build. * Initial unit tests for tensor descriptor. * Add one more unit test for tensor view. * WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding. * Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding. * Add more comments, disable debug code. * Remove debug and other dead code. * Code clean-up for bwd tensor transformations. * Enable running multiple GEMM batches of merged conv groups. * Add compile check for assumed row-mjor layout. * Fix strides in 1D conv to gemm transformation. * WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases. * Fix case k > 1 and c=1. * Remove debug code. * Make MPerGroup and NPerGroup template parameters. * Add additional check for non-supported c > 1 case. * WIP: Put back the generic tensor descriptors for convolutions. * Fix tensor descriptors. * Remove the obsolete template parameters. * Add more instances. * Fix bugs in merged conv groups tensor descriptors. * Fix tensor descriptors for merged conv groups when K > 1. * Remove debug output. * Remove dead code. * Fix merge conflicts. * Code clean-up. * Remove unused code. * Run clang-formatting. * Remove debug prints and obsolete tests. * Check that number of convolution groups is multiple of merged groups. * Fix build after removing obsolete functionality. * Remove obsolete enumeration. * Fix new unit projects. * Remove unnecessary includes. * Fix passing the number of merged groups. * Remove unrelated tests. * Fix IsSupportedArgument for bwd weight conv kernel. * Fix clang formatting. * Fix the bwd weight conv to gemm mapping for num merged groups > 1. * GEMM config for conv group merging. * Fix clang-formatting. * Remove obsolete comment. * Fix typos in comment strings. * Increase the max number of reported errors when testing against reference implementation. * Rename gemm_config to conv_config. * Rename GemmConfig to ConvConfig and move NumGroupsToMerge into ConvConfig. * Change num_groups_to_merge to a boolean flag in the ck tile grouped conv example. * Run clang-format. * Add number of merged groups into kernel name string. * Remove group merging flag from CK Tile grouped conv example.	2025-10-29 16:49:28 +02:00
Johannes Graner	5c1974065e	[CK_TILE] Add conv fwd + bias + clamp example (#3012 ) * Implement argument passing to element-wise functions for fwd convolution * Add files for fwd + bias + clamp example * Implement Bias * Implement Clamp * Elementwise function composition * Composition unit test * Implement fwd + bias + clamp example * Simplify argument passing and composition * elfunc -> bias_and_clamp * Rename function to specify example * Move element-wise function instantiation to kernel * Make bias a runtime tensor * No ugly namespace aliasing * Initialize element-wise function on host * Remove function initialization helper, simplify Compose initialization * Remove unintended LSP compatibility patch * Clean up includes and unused code * Switch names in cshuffle epilogue * Move CDElementwise to conv traits * Re-add required include * Initialize bias in same way as other tensors * Better type specification for ds pointer * Disable 1D convolution * Add warning for non-group-constant bias	2025-10-27 18:43:09 +01:00
Illia Silin	3348f01e6f	re-enable clang-format by default (#3030 ) * re-enable clang-format by default * fix clang format	2025-10-15 07:43:11 -07:00
jakpiase	6deaaa92cc	[CK_TILE] Switch into universal gemms for conv bwds (#2981 ) * switch into universal gemms for conv bwds * some fixes and support universal gemm in conv fwd * add reviewer comments	2025-10-14 16:09:16 +02:00
Johannes Graner	15fff74503	[CK Tile] Implement Invoker pattern for remaining grouped convolution examples (#2894 ) * Invoker for grouped_conv_fwd * Invoker for grouped_conv_bwd_data * Fix incorrect out layout identifier	2025-09-24 10:22:38 +02:00
jakpiase	624c46866e	[CK_TILE] Add conv bwd weight two stage support (#2855 ) * resolved conflicts * add conv bwd weight twostage * fix one file * fixes after review * fixes * fixes * Fix --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-09-22 15:31:25 +02:00
linqunAMD	60d3e8f504	[CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 (#2808 ) * [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 * fix gemm_splitk_two_stage * revert .pre-commit-config.yaml	2025-09-11 07:27:33 -07:00
Ville Pietilä	83f607e2a6	[CK Tile] Fix building grouped conv examples in CK Tile (#2777 ) * Fix compilation of the grouped conv examples. * Fix grouped conv bwd weight example output in CK Tile.	2025-09-05 09:14:21 +03:00
rahjain-amd	4d041837ad	Add json dump support to output details from CK/CKTile Examples. (#2551 ) * Adding RapidJson Library * Adding Json Dumps in all CK_Tile Examples Not verified yet * Adding json to cktile Batched Transpose * adding json dumps to layernorm2d_fwd * Adding json dump to flatmm_basic * Adding RapidJson Library * Adding Json Dumps in all CK_Tile Examples Not verified yet * Adding json to cktile Batched Transpose * adding json dumps to layernorm2d_fwd * Adding json dump to flatmm_basic * Adding json in 03_gemm * Add json dump to 16_batched_gemm * Add json dump to gemm_multi_d_fp16 * Add json dump to grouped_gemm * fix fmha_bwd/fwd * Fix clang-format errors exclude include/rapidjson in jenkins as its a third-party library * Saparating function and defination. * Update Documentation of 03_gemm * Refactoring as per code review * Disable fp8 instances on unsupported targets (#2592) * Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt * Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt * Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt * Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt --------- Co-authored-by: AviralGoelAMD <aviralgoel@amd.com> * fix clang format * remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt * Fixing Readme and unifying jsondumps * adding moe_smoothquant * adding fused_moe * Fixing Readme for batched_gemm * Fixing Readme for grouped_gemm * adding flatmm * adding gemm_multi_d_fp16 * adding elementwise * adding File name when json is dumped * Fixing Reduce after merge * adding batched_transpose * Adding Warptile in Gemm * Fixing Clang Format --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: AviralGoelAMD <aviralgoel@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-09-02 23:31:29 -07:00
linqunAMD	4a49dac7c6	[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728 ) * fix copy basic build error * fix other ck tile test build error	2025-08-28 20:30:30 +08:00
Cong Ma	cd53e2e57e	[CK TILE GEMM] Fix a merge conflict (#2753 ) * Fixed a merge conflict in `245467f3` * Foramt the code	2025-08-27 11:08:09 -07:00
Bartłomiej Kocot	4212bbc170	[CK Tile] Grouped convolution backward data (#2652 ) * base working version for single groupped conv bwd data * Fix 2d descriptor * fix groups * Add 3d support * fixes * fixes * fixes --------- Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>	2025-08-20 05:29:57 -07:00
linqunAMD	9fcc1ee9fd	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-18 10:08:31 -07:00
Khushbu Agarwal	88d72178d6	[CK_Tile] Updating gpu timer when doing flush cache (#2593 ) * Missed updating function names in example * updating timer * code cleanup * addressing review comments * updating tile_engine code * addressing review comments	2025-07-31 16:43:33 -07:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
jakpiase	6681593864	[CK_TILE] Grouped Convolution Backward Weight Kernel (#2357 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * resolved conflicts * rebase fixes * fixes * tmp * add working support for splitk * minor fix * fixes * fixes * minor fix * small fix * Split K and preprocessing fixes --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-07-24 10:41:35 +02:00
Bartłomiej Kocot	cebdee4d9e	[CK TILE] Grouped Convolution Forward Kernel (#2188 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes	2025-06-20 15:44:36 -07:00

35 Commits