composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	35fc7c9e4f	Add new section to changelog (#3295 ) * Add new section to changelog * Update CHANGELOG.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-12-05 07:14:52 -08:00
jakpiase	f7650ee82b	fix enforcing fixedvectorsizes for ck tile conv (#3344 )	2025-12-05 09:30:22 +01:00
John Shumway	13f6d63565	Clean up conv_traits.hpp (#3354 ) When I asked for a description of operators that didn't have ConvTraits, I was getting very long confusing errors about ConvTraits not being defined. Now we get specific errors explaining which concepts are violated, making it easier to know which code to generalize or update. * Add concepts to conv_traits.hpp to get better error message. * Put the correct requires clauses in the right places to get descriptive error messages. * General cleanup of functions in conv_traits.hpp to make functions easier to read.	2025-12-04 19:12:36 -08:00
Po Yen Chen	05292b3604	[CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153 ) * Let fmha_fwd_v3() compatible with fmha_fwd() * Decouple get_fwd_blobs() and FmhaFwdKernel * Decouple compatibility checks from get_fwd_blobs() * Extract product feature checks out from get_fwd_blobs() * Remove duplicated code in factories and redundant checks * Remove FmhaFwdKernel<>::GetName() * Let FmhaFwdApiPool support pipelines with different mask_impl * Add tile setting for fmha fwd v3 pipeline * Add fwd v3 instances to tile_example_fmha_fwd manually * Remove unused function import * Undo irrelevant changes * Remove fwd v3 instances from tile_example_fmha_fwd * Finish fmha fwd v3 kernel instance codegen * Fix formatting * Remove unused F_idx attribute * Add is_generic_attention_mask<> traits * Add constraints to the fmha fwd v3 pipeline * Unify traits & problem used for fmha fwd v3 * Unify kernel launch code for fmha fwd v2 & v3 * Unify kernel template selection logic * Use same kernel codegen template for both v2 & v3 * Rename api() property as render() method * Allow specifying filter for fmha fwd api pool * Allow specifying function name when rendering api pool items * Separate fmha fwd v3 kernel dispatching logic from v2 * Remove lambda assignment * Add simple v2/v3 dispatch logic * Stop generating empty if-clauses Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them. 
* Use "".join() to concatenate fmha fwd api string content * Add more feature checks for fmha fwd v3 pipeline * Check features before dispatch to fmha_fwd_v3() * Add more feature checks for fmha_fwd_v3() * Add missing filter call * Use Tuple to reserve the dtype orders * Fix wrong pipeline matching logic * Add fmha fwd v3 group mode instances * Add functor_transform<> * Add type constraints to make_tile_window() * Remove fmha fwd v3 example * Fix wrong product(aiter mha_fwd()) config * Fix wrong fmha fwd v2/v3 selection logic * Fix formatting * Add comment to warning v3 kernel users * Fix wrong codegen logics * Remove unnecessary param * Fix format --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-05 10:31:12 +08:00
Illia Silin	d1193e8637	fix hipblaslt build for different archs (#3358 )	2025-12-04 18:29:14 -08:00
Max Podkorytov	d184eed823	[CK-Tile] Refactor base pipeline usage (#3251 ) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder	2025-12-04 11:45:49 -08:00
spolifroni-amd	d9d4c9c3df	[composable_kernel] initial draft of the ck tile conceptual doc (#3242 ) * Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc * Fixes for ruff check. * Fixes for ruff check 2. * Fixes for ruff check 3. --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>	2025-12-04 11:09:21 -08:00
Illia Silin	cd21e20ae7	build latest hipblaslt in ck_pytorch docker (#3347 )	2025-12-04 06:58:42 -08:00
Ville Pietilä	9cb1f421bc	[CK_BUILDER] Refactor convolution signature to provide data type/layout/elementwise op per tensor (#3331 ) * Separate layouts into separate entities for input, weight, and output tensors. * Add test for handling bias tensor layouts. * Use instance string in builder tests. * Add handling of output bias data types and layouts. * Generalize handling of the elementwise ops. * Test fix. * Create builder for layouts. * Layout builder improvements. * Improve layout builder. * Simplify bias layout handling. * Code clean-up. * Move layout utils into separate file. * Remove hard-coded layout combinations. * Small code clean-up. * Move data type utils into a separate file. * Add data types, layouts, and elementwise ops per conv tensor. * Builder bug fixes after refactoring. * Working baseline. * Make signature definition look nice in the test code. * Move TensorConfig into test implementations. * Fix all fwd conv builder tests. * Fix conv traits and descriptors tests. * More factory assets under a separate directory. * Fix building conv traits. * Fix clang-format. * Add Readme doc to describe the design. * Add link to main Readme. Fix links in the builder design doc. * Clean-up data type/layout/elementwise op conversions. * Switch from dimension and tensor type specific layouts to a flat list of tensor layouts. * Fix clang-formatting. * Fix clang-format for test code. * Simplify fwd conv signature definitions in the test code. * Remove accidental edits. * Fix comment string. * Fix instance factory after rebase. * Fix tests after rebase. * Unify layout handling. * Add more conv layout unit tests. * Clang-format. * Fix merge conflicts. * Improve elementwise op handling. --------- Co-authored-by: Ville Pietilä <>	2025-12-04 12:58:31 +02:00
arai713	583fafc803	[CK_TILE] Fix for Moving DataTypeTraits into a Common File (#3335 ) This PR fixes a mismatch caused when PR #3146 was merged out of sync with develop, which made its intended changes ineffective. This PR reapplies those changes to move DataTypeTraits into a common file to mitigate code duplication. Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-03 22:46:22 -08:00
kensclin	ffc3120f63	Ck tile/gemm blockscale opt (#3227 ) * GEMM block scale optimization kernel * GEMM block scale optimization kernel * Fix: Apply clang-format for style consistency * Fix: Apply clang-format for style consistency --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-03 22:07:23 -08:00
rocking	eb7f617713	fp8 fmha async pipeline (#3339 ) * replace qr with async pipeline * Add fp8fp32 to DTYPE_BITS * Add kAlignmentRandVal to avoid compile fail * format --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-04 12:18:25 +08:00
JH-Leon-KIM-AMD	4baa4c9fae	[CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216 ) * LWPCK-4043: Add GPU reference implementations for CK Tile convolution This commit implements GPU-based reference kernels for CK Tile convolution operations to enable faster verification of optimized kernels, especially for large tensors (>2GB). Changes: - Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution - Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data - Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight - Integrate GPU references with test infrastructure (replace -v=2 error) - Support for 1D, 2D, and 3D convolutions - Generic data type support (FP16, BF16, FP32) - Grid-stride loop pattern for scalability The GPU references use a simple, readable implementation that prioritizes correctness over performance. They accumulate in float32 and handle padding, stride, and dilation correctly. * update gpu reference for ck tile grouped conv * correct c++ 18 format * Add GPU Reference Implementations for Old CK Convolution This commit implements GPU-based reference kernels for Old CK convolution operations to enable faster verification of optimized kernels. 
Changes: - Fixed old CK forward GPU reference (naive_conv_fwd.hpp) * Fixed BF16 NaN issue (use type_convert instead of static_cast) * Fixed FP8/BF8 arithmetic (accumulate in float) * Fixed uninitialized variables * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed) - Created backward data GPU reference (naive_conv_bwd_data.hpp) * Implements input gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Created backward weight GPU reference (naive_conv_bwd_weight.hpp) * Implements weight gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Integrated with old CK examples * Forward: 10 XDL examples now support do_verification=2 * Backward data: Integrated with example/17_convnd_bwd_data/ * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only) * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU) Testing: - 50 comprehensive tests created - 42/42 tests passing (100% success rate) - CPU and GPU verification produce identical results - Verified across multiple dimensions, sizes, and data types Limitations: - GPU references support standard convolution only (G=1) - Fused operations (DL variants) not supported - Some tests blocked by optimized kernel size constraints Result: Old CK GPU references can replace CPU references for verification with 50-100x performance improvement for large tensors. * Apply clang-format to old CK GPU reference files * Fix C++17 compatibility: use brace initialization for aggregate types * add get_rtol, get_atl and consistency cout message * Use triple bracket syntax for kernel launch per review feedback Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski. This is more idiomatic HIP/CUDA style and equally correct. All tests still passing after this change. 
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode - Replace manual error checking with HIP_CHECK_ERROR macro - Add v=3 verification mode (GPU ref vs CPU ref direct comparison) - Consistent output format across all examples - All tests passing (7/7 v=3 tests pass for FP16) * Use ConvDims structure to simplify GPU reference kernels Replace 24 individual parameters with ConvDims structure per review feedback. - Add conv_common.hpp with ConvDims and helper function - Update kernel signatures: 24 params → 1 structure - Remove duplicate extraction code from host files * Use get_block_id() and get_thread_id() helpers in CK Tile Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions. Updated 3 CK Tile GPU reference kernels per review feedback. * Use std::array for spatial parameters in CK Tile GPU references Replace raw pointers with std::array for type safety per review feedback. - Add conv_common.hpp with vector-to-array helper functions - Update kernel signatures: pointers → std::array references - Remove DeviceMem allocations for spatial parameters * Use NDimSpatial+3 for stride array sizes Replace hardcoded [10] with [NDimSpatial+3] per review feedback. Array sizes now correctly reflect actual dimensions needed. * Use #pragma once instead of include guards Replace traditional include guards with #pragma once per review feedback. Updated 3 Old CK GPU reference headers. * Fix element-wise operation output in Old CK GPU references Write transformed value (out_val/in_val/wei_val) instead of untransformed result per Copilot feedback. This ensures element-wise operations are correctly applied to output. * Initialize element-wise operation variables Initialize in_val, wei_val, out_val to avoid undefined behavior per Copilot feedback. Updated backward data and backward weight kernels. * Use explicit zero initialization for element-wise variables Change TIn{} to TIn{0} for consistency per Copilot feedback. 
All 3 kernels now use consistent zero initialization. * Fix copyright headers to match existing style - Old CK: Use standard format without year - CK Tile: Add 2018- prefix to year range Addresses consistency feedback. * Rename GPU reference files: add _gpu suffix * Refactor index calculations: use std::array and extract to helper functions * Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-03 21:14:21 +02:00
Enrico Degregori	161835533b	Wmma support for gemm_multiply_multiply_wp (#3278 ) * Initial implementation with splitK support * Add gfx11 support * Fix compilation error * Add instances * Add irregular instances * Fix GetBuffer arguments * Minor changes * Address review comments * Fix compilation errors * Fix copyright header	2025-12-03 07:38:23 -08:00
John Shumway	f29b67cf9b	[CK_BUILDER] Add Description::instance_string() method and update tests (#3340 ) * Create Description::instance_string() function To expose more reflection capabilities in MIOpen, we add the instance_string functionality to the ckr::Description class. This PR introduces a base class, adds the instance_string method, and implements the method by injecting the Traits::instance_string method through the ConvDescription constructor. This will enable us to replace the specialized get_instance_string() method on device operations with a describe() method in a subsequent PR. * Test describe().instance_string() Update the instance string tests to also call `ckr::describe<Instance>().instance_string()`. This documents that the XDL kernels are supported with describe(), but WMMA and DL kernels are not yet supported. Also update namespace and add a HasConvTraits concept.	2025-12-03 06:36:09 -08:00
jakpiase	e6a583416b	[CK TILE] Add index optimizations for conv bwd weight (#3321 )	2025-12-03 10:53:46 +01:00
Aviral Goel	6cb0bc2d11	feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193 ) * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * feat(gemm_quant): add RRR and CRR layout support for aquant gemm * test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm * fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests * fix: test cases compilation failure due to PR# 2095 * fix: make condition to filter out tests for gfx950 more explicit * need to support the gfx950 * fix: add layout support for gfx950 * Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277) * WIP: add support for pk_int4_t for aquant mode layouts RR and CR * test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant * fix: compile time error for gfx950 * fix: minor bug where is_a_load_tr_v() was missing * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318) * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant * test: add unit tests for new layout support CCRC for aquant block scale gemm * docs: update changelog with new layout support info * Update CHANGELOG.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * refactor: break test instances into multiple cpp files to reduce build time (#3319) * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant * test: add unit tests for new layout support CCRC for aquant block scale gemm * refactor: break test instances into multiple cpp files to reduce build time * chore: rename file for better code readability * fix: merge conflict resolution * fix: remove memory pipeline because new layout is not compatible * build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr() * refactor: address review comments * solve the conflict --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-12-02 14:59:07 -08:00
Illia Silin	2c284a1780	Disable gemm_blockscale_f8 on gfx90a by default. (#3338 ) * disable gemm_blockscale_f8 instances on gfx90a by default * fix cmake logic, disable some cmake output * fix cmake logic	2025-12-02 11:33:33 -08:00
John Shumway	280bc42191	[CK_BUILDER] Refactor builder factory code. (#3276 ) Refactor the builder factory code into multiple files and subdirectories and a ck_tile::builder::factory namespace. The factory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations. Major changes in this PR: Dispatch logic is explicit in the function make_conv_instance instead of implicit in template specialization selection. Helper code is moved to a subdirectory builder/factory/helpers. Helpers now have unit tests. Factories are moved to their own files. Code moved to namespaces ck_tile::builder::factory and ck_tile::builder::factory::internal. This does not yet fix the problem of bad error messages, but the make_conv_instance function makes the poor error messages clear. The choice of algorithm must be much more robust (perhaps with explicit enumeration in the algorithm descriptor), so that the dispatch doesn't fail. Quality changes: Making dispatch explicit rather than implicit will improve robustness, readability, maintainability, testability, and extensibility. Separating code into separate files and subdirectories helps readability and extensibility. Adding unit tests for helpers documents behavior and will enable more complex logic and functionality. Separating files (especially unit tests) helps clarify includes and dependencies and makes code easier to refactor.	2025-12-02 07:40:14 -08:00
Thomas Ning	8459d389ad	disable the gfx90a (#3336 )	2025-12-02 07:27:37 -08:00
Ville Pietilä	66832861ad	[CK_TILE] Merge multiple fwd convolution groups into a single GEMM batch. (#3136 ) * Merge fwd conv groups in CK Tile. * Fix building CK fwd convs. * Add number of merged groups to conv fwd kernel name. * Get number of merged groups from conv config. * Rename GemmConfig to ConvConfig. * Clean-up TODOs. * Check that number of conv groups must be divisible by the number of merged groups. * Improve error handling in the conv fwd example. * Fix clang-format. * Fix group offsets. * Fix merge problem. * Address feedback from code review. * Fix clang-formatting.	2025-12-02 15:23:32 +02:00
msaffari-amd	2d3020e5b0	[CK Tile] batched contraction kernel generalizing (#3126 ) * Add help for example * Refactor the batched contraction compute reference to handle stride-aware calculation, plus some code cleanup * Add stride-aware reference for batched contraction with independent D tensor layouts * Add -num_d argument for runtime D tensor count selection in batched contraction * Add stride vector arguments in example code for testing non-contiguous batched contraction inputs * Add descriptor-based architecture for batched contraction multi-dimensional stride support * Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0 * Add complete multi-dimensional stride support via descriptors * Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm * Clean up batched contraction: remove old UniversalGemmKernel path * Clean up batched contraction: remove legacy paths and finalize docs * Optimize batched contraction example: pass dimension sizes not vectors * correct the reference calculation, unsigned int to int * Fix batched_contraction C++17 build errors for gfx90a CI	2025-12-02 13:30:27 +01:00
DarylHawkinsAMD	d3f37ebf6c	[CK_BUILDER] Install CK builder headers, added missing include (#3334 )	2025-12-02 04:28:46 -08:00
jakpiase	5d67d82a0b	[CK_TILE] Fix for comp pipeline v4 (#3307 ) * Fix for gemm_pipeline_ag_bg_cr_comp_v4 * Update hotloop condition Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * fix formatting --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-12-02 11:38:06 +01:00
jakpiase	59265d5eb2	[CK_TILE] Add indexing optimizations for conv bwd data (#3309 ) * add indexing optimizations for conv bwd data * fix formatting	2025-12-02 11:37:26 +01:00
Yi DING	f211156ce6	[CK_Tile] Flatmm MX Cleanup & Explicit Offset Calculation (#3286 )	2025-12-02 14:21:12 +08:00
Erwin Terpstra	46f1d740f0	Add grouped gemm instances for RDNA4 (#3237 ) * wip: grouped_gemm implementation based on wmma kernel + example for fp16 * chore: clean up grouped_gem_wmma_splitk_fp16 example * chore: add cmake options to fully disable XDL or WMMA kernels * feat: add tests for grouped gemm wmma instances for f16 and bf16 (all layouts) * chore: add grouped gemm wmma bf16 example * refactor: reuse more code between instance factory functions * chore: turn test failure if not all batch sizes are supported into a warning * chore: made failing of test on unsupported instances conditional to not break old tests * chore: add log message to failure case where AK1/BK1/KBatch is too high for K value * fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run() * fix: stray comma after parameter list * fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being run * chore: update copyright in header comments * nit: minor feedback * refactor: unified XDL / wmma tests * fix: properly disable FP8 instances when ONLY targeting gfx11 * refactor: add v3 suffix to grouped_gemm device struct name * fix: small typos in example code * fix: fully exclude xdl/wmma instances when using the corresponding cmake flags * chore: remove unused destructor and added pipeline support checks to remove unnecessary paths * fix: make sure to not add instance library to group if library was skipped * fix: make sure xdl grouped gemm doesn't fail the new test * fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case * fix: examples not working since dependent types and functions were moved to ck namespace in develop * fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances * chore: replace/add copyright headers with new format	2025-12-01 15:32:10 -08:00
Cong Ma	23fb253c4e	Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325 ) * [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples * [CK TILE GEMM Quant] update tile distribution of aquant * [CK TILE GEMM Quant] update aquant register offset calculation * [CK TILE GEMM Quant] Reimplement aquant register offset calculation * [CK TILE GEMM Quant] Add more unit tests of Aquant - Test M128xN128xK128 * [CK TILE GEMM Quant] Add more comments to Gemm Aquant	2025-12-01 15:04:37 -08:00
John Shumway	7873f8fa13	[CK_BUILDER] Update the testing documentation (#3312 ) * [CKBuilder] Update the testing documentation Now that we have clear sets for smoke tests and regression tests, we rearrange the CMakeLists.txt file to be organized and have description and instructional comments. Move all the test targets that compile quickly into the smoke test suite. Update the builder README.md to reflect this new test organization and functionality. * Update experimental/builder/README.md Clarify integration tests description from review comment. Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Correct README.md The regression tests here still run very fast like smoke tests, but can take minutes or even tens of minutes to compile. We test most of the builder functionality without compiling heavily-templated kernel code, but these regression tests do an expensive full build of the CK kernels. --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-12-01 13:05:32 -08:00
John Shumway	d17994f3df	[CK_BUILDER] Fix cosmetic problem with conv_description (#3333 ) The ConvDescription::detailed command wasn't using TreeFormatter::writeLast correctly, which led to extra lines being drawn in the tree view. It's a simple fix, just a cosmetic improvement to our reflection output (ASCII art).	2025-12-01 12:45:04 -08:00
John Shumway	abd6a4b3fc	Cleanup convolution description (#3329 ) Remove obsolete feature for extracting a description from a builder, since this should apply directly to the instance type. Also add some documentation, including a README.md for reflection.	2025-12-01 10:03:58 +01:00
Yi DING	9ed9539ddf	[CK_TILE] Disable cast_tile_pk_fp16bf16_fp32 as It Causes Extra spills on Recent Compilers (#3327 )	2025-12-01 14:48:22 +08:00
Gino Lu	ba6af9fe7c	[CK_TILE] Add unit test for fp4 warp gemm (#2817 ) This update includes a unit test for warp GEMM	2025-12-01 13:56:48 +08:00
Aviral Goel	004784ef98	chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 ) * chore(copyright) update library wide CMakeLists.txt files copyright header template * Fix build --------- Co-authored-by: Sami Remes <samremes@amd.com>	2025-11-28 13:49:54 -08:00
Sami Remes	f981554c39	[CK_TILE] Fix Quant GEMM build (#3320 ) * Fix build * Fix ck_tile example 38 & 40 --------- Co-authored-by: Yi DING <yi.ding@amd.com>	2025-11-28 20:33:53 +08:00
msaffari-amd	f875ab0bbc	Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic-add (#3236 ) * Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic * correct clang-format * removed unused rtol_atol variable from example code * clang format correction * remove unused variable max_accumulated_value from example	2025-11-28 09:43:01 +01:00
Cong Ma	30727c48fc	Tile engine for streamk (#3157 ) * [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM. - This commit lays the groundwork for integrating the tile engine into streamk GEMM. It focuses on creating benchmark executables for streamk GEMM. - Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once the streamk implementation reaches stability. * [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM * [CK TILE STREAMK] Refactor: Extract common utility functions. * [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation * Add pre-commit * [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK * [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK * [CK TILE STREAMK] Update Jenkinsfile * [CK TILE Engine] Update StreamK tile engine help message Remove default value messages as they are automatically printed * [CK TILE Engine] Update StreamK tile engine - Remove namespace reboot * [CK TILE Engine] Update StreamK tile engine - Fix merge error	2025-11-27 15:49:57 -07:00
arai713	24d88d2472	[CK_TILE] Move DataTypeTraits into a Common File (#3146 ) This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile. Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2025-11-27 09:09:54 -08:00
Matthias Gehre	678298d4c7	Add support for gfx1153 (#3306 )	2025-11-27 08:48:00 +01:00
Thomas Ning	a38aeceb21	Fix and improve the gemm quant pipeline infrastructure (#3245 )	2025-11-26 18:04:27 -08:00
Max Podkorytov	79aae7c7f7	[CK Tile] enable building examples by default (#3259 ) * remove EXCLUDE_FROM_ALL from ck-tile examples -> +15 min build time w/ 64 threads for a single arch * fix cpp17 compile error in the ck-tile examples --------- Co-authored-by: khuagarw <khuagarw@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-11-26 16:24:44 -08:00
andrew clark	40d7217ac7	Automated Perfetto UI Notifications (#3255 ) * Testing visualization generation * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Adding dummy test data * Update Jenkinsfile * Update Jenkinsfile * Adding notifications * Testing * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Image compression * Update Jenkinsfile * Moving capture logic to main Jenkins file * Testing generation * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Fixing curl request * Update Jenkinsfile * Clean up * Fix * Fixing notification * Testing message creation * Adjusting message payload * Testing notification generation * Updating main jenkinsfile * Fixing cleanup call * Removing test pipeline code * Comment clean up * Testing pipeline * Update Jenkinsfile * Update Jenkinsfile * Update Jenkinsfile * Moving archive Moving trace archive to safe location before source checkout * Removing test pipeline * Testing pipeline with unique file names * Update Jenkinsfile * Removing test files Updated main pipeline	2025-11-26 16:27:27 -07:00
Aviral Goel	de6466481f	chore(copyright): update copyright header for include directory (#3293 )	2025-11-26 11:00:05 -07:00
John Shumway	10a782d846	Fix template parameter macros (#3305 ) Some of the device implementation templates have macros like GridwiseGemmMultiABDTemplateParameters that can cause build errors if multiple files are included together. This error comes up with our builder code. To clean up the macros and make them safer, we follow these rules: * Use more specific names to avoid duplication. * Undefine the macro after it is used to avoid leaking out of the file scope. * Use a prefix CK_ on the macro to avoid conflicting with other libraries. * Use all caps with underscores for preprocessor macro names.	2025-11-26 09:48:17 -08:00
Aviral Goel	35a4b26af0	fix: add dynamic selection of pipelines for aquant mode (#3282 ) - Add conditional selection to use v3 pipeline when PreshuffleQuant is true - Add static assertion in memory pipeline to prevent PreshuffleQuant usage - Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases - Update BaseGemmPipeline selection to handle all quant modes properly	2025-11-26 10:58:09 +04:00
Yi DING	8fa90025d0	[CK_TILE] Refine warp_gemm_attribute_mfma (#3272 )	2025-11-26 10:57:15 +08:00
Yi DING	c7dce2ac29	[CK_TILE] Fix Compilation of Flatmm Examples (#3285 )	2025-11-26 10:11:43 +08:00
Illia Silin	a54f7b1138	Enable ck_builder in CI. (#3296 ) * build and run ck_builder tests * add test_ckb_all to targets * fix syntax * fix test path * Update CMake targets for builder testing in CI (#3290) Our existing CMake only had build targets. Update CMakeLists.txt to have CTEST targets: * smoke-builder * regression-builder * check-builder Co-authored-by: John Shumway <jshumway@amd.com> * use check-builder target * get rid of test_ckb_all target * call ninja check-builder separately --------- Co-authored-by: John Shumway <jshumway@amd.com>	2025-11-25 17:45:59 -08:00
Aviral Goel	cd47293869	chore(copyright): update copyright header for experimental & example directory (#3292 )	2025-11-26 03:09:39 +04:00
Bartłomiej Kocot	00dfa2f2ce	[CK TILE] Grouped Conv Explicit Gemm (#3289 ) * [CK TILE] Grouped Conv Explicit Gemm * fixes * apply builder fixes	2025-11-25 23:28:35 +01:00

2734 Commits