Commit Graph

2734 Commits

Author SHA1 Message Date
Bartłomiej Kocot
35fc7c9e4f Add new section to changelog (#3295)
* Add new section to changelog

* Update CHANGELOG.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-12-05 07:14:52 -08:00
jakpiase
f7650ee82b fix enforcing fixedvectorsizes for ck tile conv (#3344) 2025-12-05 09:30:22 +01:00
John Shumway
13f6d63565 Clean up conv_traits.hpp (#3354)
When I asked for a description of operators that didn't have ConvTraits, I was getting very long confusing errors about ConvTraits not being defined. Now we get specific errors explaining which concepts are violated, making it easier to know which code to generalize or update.

* Add concepts to conv_traits.hpp to get better error message.
* Put the correct requires clauses in the right places to get descriptive error messages.
* General cleanup of functions in conv_traits.hpp to make functions easier to read.
2025-12-04 19:12:36 -08:00
Po Yen Chen
05292b3604 [CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153)
* Let fmha_fwd_v3() compatible with fmha_fwd()

* Decouple get_fwd_blobs() and FmhaFwdKernel

* Decouple compatibility checks from get_fwd_blobs()

* Extract product feature checks out from get_fwd_blobs()

* Remove duplicated code in factories and redundant checks

* Remove FmhaFwdKernel<>::GetName()

* Let FmhaFwdApiPool support pipelines with different mask_impl

* Add tile setting for fmha fwd v3 pipeline

* Add fwd v3 instances to tile_example_fmha_fwd manually

* Remove unused function import

* Undo irrelevant changes

* Remove fwd v3 instances from tile_example_fmha_fwd

* Finish fmha fwd v3 kernel instance codegen

* Fix formatting

* Remove unused F_idx attribute

* Add is_generic_attention_mask<> traits

* Add constraints to the fmha fwd v3 pipeline

* Unify traits & problem used for fmha fwd v3

* Unify kernel launch code for fmha fwd v2 & v3

* Unify kernel template selection logic

* Use same kernel codegen template for both v2 & v3

* Rename api() property as render() method

* Allow specifying filter for fmha fwd api pool

* Allow specifying function name when rendering api pool items

* Separate fmha fwd v3 kernel dispatching logic from v2

* Remove lambda assignment

* Add simple v2/v3 dispatch logic

* Stop generating empty if-clauses

Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.

* Use "".join() to concatenate fmha fwd api string content

* Add more feature checks for fmha fwd v3 pipeline

* Check features before dispatch to fmha_fwd_v3()

* Add more feature checks for fmha_fwd_v3()

* Add missing filter call

* Use Tuple to reserve the dtype orders

* Fix wrong pipeline matching logic

* Add fmha fwd v3 group mode instances

* Add functor_transform<>

* Add type constraints to make_tile_window()

* Remove fmha fwd v3 example

* Fix wrong product(aiter mha_fwd()) config

* Fix wrong fmha fwd v2/v3 selection logic

* Fix formatting

* Add comment to warning v3 kernel users

* Fix wrong codegen logics

* Remove unnecessary param

* Fix format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-05 10:31:12 +08:00
Illia Silin
d1193e8637 fix hipblaslt build for different archs (#3358) 2025-12-04 18:29:14 -08:00
Max Podkorytov
d184eed823 [CK-Tile] Refactor base pipeline usage (#3251)
* initial poc

* factor out common parts in operator()

* cv4

* rest of the universal gemm pipelines

* fix test

* remove boilerplate from tile engine

* fix example

* fix example

* format

* fix tests build for gemm

* remove base pipeline codegen from gemm instance builder

* unify v3 logic with the rest of universal gemm pipelines

* fix build for multi abd test

* fix test gemm multi d

* fix build for weight preshuffle

* fix grouped gemm test

* fix grouped gemm multi d test

* fix grouped gemm preshuffle

* fix grouped gemm example except for quant

* fix gemm preshuffle

* fix splitk 2 stage example

* fix batched gemm example

* fix multid example

* fix multiabd example

* fix batched gemm test

* fixup

* fix examples build

* fix grouped gemm test build

* fix smoke builder
2025-12-04 11:45:49 -08:00
spolifroni-amd
d9d4c9c3df [composable_kernel] initial draft of the ck tile conceptual doc (#3242)
* Adding CK Tile documentation

* Updates based on feedback

* Fix tile window API description

* Fix remaining images

* add documentation about flush_cache and rotating_buffer functionality in ck_tile

* Supplement the documentation

* light edit of the ck tile conceptual doc

* Fixes for ruff check.

* Fixes for ruff check 2.

* Fixes for ruff check 3.

---------

Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
2025-12-04 11:09:21 -08:00
Illia Silin
cd21e20ae7 build latest hipblaslt in ck_pytorch docker (#3347) 2025-12-04 06:58:42 -08:00
Ville Pietilä
9cb1f421bc [CK_BUILDER] Refactor convolution signature to provide data type/layout/elementwise op per tensor (#3331)
* Separate layouts into separate entities for input, weight, and output tensors.

* Add test for handling bias tensor layouts.

* Use instance string in builder tests.

* Add handling of output bias data types and layouts.

* Generalize handling of the elementwise ops.

* Test fix.

* Create builder for layouts.

* Layout builder improvements.

* Improve layout builder.

* Simplify bias layout handling.

* Code clean-up.

* Move layout utils into separate file.

* Remove hard-coded layout combinations.

* Small code clean-up.

* Move data type utils into a separate file.

* Add data types, layouts, and elementwise ops per conv tensor.

* Builder bug fixes after refactoring.

* Working baseline.

* Make signature definition look nice in the test code.

* Move TensorConfig into test implementations.

* Fix all fwd conv builder tests.

* Fix conv traits and descriptors tests.

* More factory assets under a separate directory.

* Fix building conv traits.

* Fix clang-format.

* Add Readme doc to describe the design.

* Add link to main Readme. Fix links in the builder design doc.

* Clean-up data type/layout/elementwise op conversions.

* Switch from dimension and tensor type specific layouts to a flat list of tensor layouts.

* Fix clang-formatting.

* Fix clang-format for test code.

* Simplify fwd conv signature definitions in the test code.

* Remove accidental edits.

* Fix comment string.

* Fix instance factory after rebase.

* Fix tests after rebase.

* Unify layout handling.

* Add more conv layout unit tests.

* Clang-format.

* Fix merge conflicts.

* Improve elementwise op handling.

---------

Co-authored-by: Ville Pietilä <>
2025-12-04 12:58:31 +02:00
arai713
583fafc803 [CK_TILE] Fix for Moving DataTypeTraits into a Common File (#3335)
This PR fixes a mismatch caused when PR #3146 was merged out of sync with develop, which made its intended changes ineffective. This PR reapplies those changes to move DataTypeTraits into a common file to mitigate code duplication.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-03 22:46:22 -08:00
kensclin
ffc3120f63 Ck tile/gemm blockscale opt (#3227)
* GEMM block scale optimization kernel

* GEMM block scale optimization kernel

* Fix: Apply clang-format for style consistency

* Fix: Apply clang-format for style consistency

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-03 22:07:23 -08:00
rocking
eb7f617713 fp8 fmha async pipeline (#3339)
* replace qr with async pipeline

* Add fp8fp32 to DTYPE_BITS

* Add kAlignmentRandVal to avoid compile fail

* format

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-04 12:18:25 +08:00
JH-Leon-KIM-AMD
4baa4c9fae [CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216)
* LWPCK-4043: Add GPU reference implementations for CK Tile convolution

This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).

Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability

The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.

* update gpu reference for ck tile grouped conv

* correct c++ 18 format

* Add GPU Reference Implementations for Old CK Convolution

This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.

Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
  * Fixed BF16 NaN issue (use type_convert instead of static_cast)
  * Fixed FP8/BF8 arithmetic (accumulate in float)
  * Fixed uninitialized variables
  * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)

- Created backward data GPU reference (naive_conv_bwd_data.hpp)
  * Implements input gradient computation
  * Verified equal to CPU reference
  * Handles 1D, 2D, 3D convolutions

- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
  * Implements weight gradient computation
  * Verified equal to CPU reference
  * Handles 1D, 2D, 3D convolutions

- Integrated with old CK examples
  * Forward: 10 XDL examples now support do_verification=2
  * Backward data: Integrated with example/17_convnd_bwd_data/
  * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
  * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)

Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types

Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints

Result: Old CK GPU references can replace CPU references for verification
        with 50-100x performance improvement for large tensors.

* Apply clang-format to old CK GPU reference files

* Fix C++17 compatibility: use brace initialization for aggregate types

* add get_rtol, get_atl and consistency cout message

* Use triple bracket syntax for kernel launch per review feedback

Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.

All tests still passing after this change.

* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode

- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)

* Use ConvDims structure to simplify GPU reference kernels

Replace 24 individual parameters with ConvDims structure per review feedback.

- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files

* Use get_block_id() and get_thread_id() helpers in CK Tile

Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.

Updated 3 CK Tile GPU reference kernels per review feedback.

* Use std::array for spatial parameters in CK Tile GPU references

Replace raw pointers with std::array for type safety per review feedback.

- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters

* Use NDimSpatial+3 for stride array sizes

Replace hardcoded [10] with [NDimSpatial+3] per review feedback.

Array sizes now correctly reflect actual dimensions needed.

* Use #pragma once instead of include guards

Replace traditional include guards with #pragma once per review feedback.

Updated 3 Old CK GPU reference headers.

* Fix element-wise operation output in Old CK GPU references

Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.

This ensures element-wise operations are correctly applied to output.

* Initialize element-wise operation variables

Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.

Updated backward data and backward weight kernels.

* Use explicit zero initialization for element-wise variables

Change TIn{} to TIn{0} for consistency per Copilot feedback.

All 3 kernels now use consistent zero initialization.

* Fix copyright headers to match existing style

- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range

Addresses consistency feedback.

* Rename GPU reference files: add _gpu suffix

* Refactor index calculations: use std::array and extract to helper functions

* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-03 21:14:21 +02:00
Enrico Degregori
161835533b Wmma support for gemm_multiply_multiply_wp (#3278)
* Initial implementation with splitK support

* Add gfx11 support

* Fix compilation error

* Add instances

* Add irregular instances

* Fix GetBuffer arguments

* Minor changes

* Address review comments

* Fix compilation errors

* Fix copyright header
2025-12-03 07:38:23 -08:00
John Shumway
f29b67cf9b [CK_BUILDER] Add Description::instance_string() method and update tests (#3340)
* Create Description::instance_string() function

To expose more reflection capabilities in MIOpen, we add the instance_string functionality to the ckr::Description class. This PR introduces a base class, adds the instance_string method, and implements the method by injecting the Traits::instance_string method through the ConvDescription constructor.

This will enable us to replace the specialized get_instance_string() method on device operations with a describe() method in a subsequent PR.

* Test describe().instance_string()

Update the instance string tests to also call `ckr::describe<Instance>().instance_string()`. This documents that the xld kernels are supported with describe(), but WMMA and DL kernels are not yet supported. Also update namespace and add a HasConvTraits concept.
2025-12-03 06:36:09 -08:00
jakpiase
e6a583416b [CK TILE] Add index optimizations for conv bwd weight (#3321) 2025-12-03 10:53:46 +01:00
Aviral Goel
6cb0bc2d11 feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193)
* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* feat(gemm_quant): add RRR and CRR layout support for aquant gemm

* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm

* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests

* fix: test cases compilation failure due to PR# 2095

* fix: make condition to filter out tests for gfx950 more explicit

* need to support the gfx950

* fix: add layout suppot for gfx950

* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)

* WIP: add support for pk_int4_t for aquant mode layouts RR and CR

* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant

* fix: compile time error for gfx950

* fix: minor bug where is_a_load_tr_v() was mising

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* docs: update changelog with new layout support info

* Update CHANGELOG.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* refactor: break test instances into multiple cpp files to reduce build time (#3319)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* refactor: break test instances into multiple cpp files to reduce build time

* chore: rename file for better code readability

* fix: merge conflict resolution

* fix: remove memory pipeline because new layout is not compatible

* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()

* refactor: address review comments

* solve the conflict

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-02 14:59:07 -08:00
Illia Silin
2c284a1780 Disable gemm_blockscale_f8 on gfx90a by default. (#3338)
* disable gemm_blockscale_f8 instances on gfx90a by default

* fix cmake logic, diasble some cmake output

* fix cmake logic
2025-12-02 11:33:33 -08:00
John Shumway
280bc42191 [CK_BUILDER] Refactor builder factory code. (#3276)
Refactor the builder factory code into multiple files and subdirectories and a ck_tile::builder::factory namespace.

The factory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations.

Major changes in this PR:

Dispatch logic is explicit in the function make_conv_instance instead of implicit in template specialization selection.
Helper code is moved to a subdirectory builder/factory/helpers.
Helpers now have unit tests.
Factories are moved to their own files.
Code moved to namespaces ck_tile::builder::factory and ck_tile::builder::factory::internal.
This does not yet fix the problem of bad error messages, but the make_conv_instance function makes the poor error messages clear. The choice of algorithm must be much more robust (perhaps with explicit enumeration in the algorithm descriptor), so that the dispatch doesn't fail.

Quality changes:

Making dispatch explicit rather than implicit will improve robustness, readability, maintainability, testability, and extensibility.
Separating code into separate files and subdirectories helps readability and extensibility.
Adding unit tests for helpers documents behavior and will enable more complex logic and functionality.
Separating files (especially unit tests) helps clarify includes and dependencies and makes code easier to refactor.
2025-12-02 07:40:14 -08:00
Thomas Ning
8459d389ad disable the gfx90a (#3336) 2025-12-02 07:27:37 -08:00
Ville Pietilä
66832861ad [CK_TILE] Merge multiple fwd convolution groups into a single GEMM batch. (#3136)
* Merge fwd conv groups in CK Tile.

* Fix building CK fwd convs.

* Add number of merged groups to conv fwd kernel name.

* Get number of merged groups from conv config.

* Rename GemmConfig to ConvConfig.

* Clean-up TODOs.

* Check that number of conv groups must be divisible by the number of merged groups.

* Improve error handling in the conv fwd example.

* Fix clang-format.

* Fix group offsets.

* Fix merge problem.

* Address feedback from code review.

* Fix clang-formatting.
2025-12-02 15:23:32 +02:00
msaffari-amd
2d3020e5b0 [CK Tile] batched contraction kernel generalizing (#3126)
* Add help for example

* Refactore the compute reference batched contraction to manage stride-aware calculation and some code cleanings

* Add stride-aware reference for batched contraction with independent D tensor layouts

* Add -num_d argument for runtime D tensor count selection in batched contraction

* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs

* Add descriptor-based architecture for batched contraction multi-dimensional stride support

* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0

* Add complete multi-dimensional stride support via descriptors

* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm

* Clean up batched contraction: remove old UniversalGemmKernel path

* Clean up batched contraction: remove legacy paths and finalize docs

* Optimize batched contraction example: pass dimension sizes not vectors

* correct the reference calculation, unsigned int to int

* Fix batched_contraction C++17 build errors for gfx90a CI
2025-12-02 13:30:27 +01:00
DarylHawkinsAMD
d3f37ebf6c [CK_BUILDER] Install CK builder headers, added missing include (#3334) 2025-12-02 04:28:46 -08:00
jakpiase
5d67d82a0b [CK_TILE] Fix for comp pipeline v4 (#3307)
* Fix for gemm_pipeline_ag_bg_cr_comp_v4

* Update hotloop condition

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* fix formating

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-12-02 11:38:06 +01:00
jakpiase
59265d5eb2 [CK_TILE] Add indexing optimizations for conv bwd data (#3309)
* add indexing optimizations for conv bwd data

* fix formating
2025-12-02 11:37:26 +01:00
Yi DING
f211156ce6 [CK_Tile] Flatmm MX Cleanup & Explicite Offset Calculation (#3286) 2025-12-02 14:21:12 +08:00
Erwin Terpstra
46f1d740f0 Add grouped gemm instances for RDNA4 (#3237)
* wip: grouped_gemm implementation based on wmma kernel + example for fp16

* chore: clean up grouped_gem_wmma_splitk_fp16 example

* chore: add cmake options to fully disable XDL or WMMA kernels

* feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts)

* chore: add grouped gemm wmma bf16 example

* refactor: reuse more code between instance factory functions

* chore: turn test failure if not all batch sizes are supported into a warning

* chore: made failing of test on unsupported instances conditional to not break old tests

* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value

* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()

* fix: stray comma after parameter list

* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran

* chore: update copyright in header comments

* nit: minor feebdack

* refactor: unified XDL / wma tests

* fix: properly disable FP8 instances when ONLY targeting gfx11

* refactor: add v3 suffix to grouped_gemm device struct name

* fix: small typos in example code

* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags

* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths

* fix: make sure to not add instance library to group if library was skipped

* fix: make sure xdl grouped gemm doesnt fail the new test

* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case

* fix: examples not working since dependent types and functions were moved to ck namespace in develop

* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances

* chore: replace/add copyright headers with new format
2025-12-01 15:32:10 -08:00
Cong Ma
23fb253c4e Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325)
* [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples

* [CK TILE GEMM Quant] update tile distribution of aquant

* [CK TILE GEMM Quant] update aquant register offset calculation

* [CK TILE GEMM Quant] Reimplement aquant register offset calculation

* [CK TILE GEMM Quant] Add more unit tests of Aquant

- Test M128xN128xK128

* [CK TILE GEMM Quant] Add more comments to Gemm Aquant
2025-12-01 15:04:37 -08:00
John Shumway
7873f8fa13 [CK_BUILDER] Update the testing documentation (#3312)
* [CKBuilder] Update the testing documentation

Now that we have clear sets for smoke tests and regression test, we rearange the CMakeLists.txt file to be organized and have description and instructional comments.

Move all the test targets that compile quickly into the smoke test suite.

Update the builder README.md to reflect this new test organization and functionality.

* Update experimental/builder/README.md

Clarify integration tests description from review comment.

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Correct README.md

The regression tests here still run very fast like smoke tests, but can take minutes or even tens of minutes to compile.  We test most of the builder functionality without compiling heavily-templated kernel code, but these regression tests do an expensive full build of the CK kernels.

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-12-01 13:05:32 -08:00
John Shumway
d17994f3df [CK_BUILDER] Fix cosmetic problem with conv_description (#3333)
The ConvDescription::detailed command wasn't using TreeFormatter::writeLast correctly, which led to extra lines being drawn in the tree view. It's a simple fix, just a cosmetic improvment out reflection output (ASCII art).
2025-12-01 12:45:04 -08:00
John Shumway
abd6a4b3fc Cleanup convolution description (#3329)
Remove obsolete feature for extracting a description from a builder, since this should apply directly to the instance type. Also add some documentation, including a README.md for reflection.
2025-12-01 10:03:58 +01:00
Yi DING
9ed9539ddf [CK_TILE] Disable cast_tile_pk_fp16bf16_fp32 as It Causes Extra spills on Recent Compilers (#3327) 2025-12-01 14:48:22 +08:00
Gino Lu
ba6af9fe7c [CK_TILE] Add unit test for fp4 warp gemm (#2817)
This update includes a unit test for warp GEMM
2025-12-01 13:56:48 +08:00
Aviral Goel
004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
Sami Remes
f981554c39 [CK_TILE] Fix Quant GEMM build (#3320)
* Fix build

* Fix ck_tile example 38 & 40

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-11-28 20:33:53 +08:00
msaffari-amd
f875ab0bbc Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic-add (#3236)
* Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic

* correct clang-format

* removed unused rtol_atol variable from example code

* clang format correction

* remove unused varable max_accumulated_value from example
2025-11-28 09:43:01 +01:00
Cong Ma
30727c48fc Tile engine for streamk (#3157)
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.

- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
  It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
  the streamk implementation reaches stability.

* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM

* [CK TILE STREAMK] Refactor: Extract common utility functions.

* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation

* Add pre-commit

* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK

* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK

* [CK TILE STREAMK] Update Jenkinsfile

* [CK TILE Engine] Update StreamK tile engine help message

Remove default value messages as they are automatically printed

* [CK TILE Engine] Update StreamK tile engine

- Remove namespace reboot

* [CK TILE Engine] Update StreamK tile engine

- Fix merge error
2025-11-27 15:49:57 -07:00
arai713
24d88d2472 [CK_TILE] Move DataTypeTraits into a Common File (#3146)
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-11-27 09:09:54 -08:00
Matthias Gehre
678298d4c7 Add support for gfx1153 (#3306) 2025-11-27 08:48:00 +01:00
Thomas Ning
a38aeceb21 Fix and improve the gemm quant pipeline infrastructure (#3245) 2025-11-26 18:04:27 -08:00
Max Podkorytov
79aae7c7f7 [CK Tile] enable building examples by default (#3259)
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch

* fix cpp17 compile error in the ck-tile examples

---------

Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-11-26 16:24:44 -08:00
andrew clark
40d7217ac7 Automated Perfetto UI Notifications (#3255)
* Testing visualization generation

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Adding dummy test data

* Update Jenkinsfile

* Update Jenkinsfile

* Adding notifications

* Testing

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Image compression

* Update Jenkinsfile

* Moving capture logic to main Jenkins file

* Testing generation

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Fixing curl request

* Update Jenkinsfile

* Clean up

* Fix

* Fixing notification

* Testing message creation

* Adjusting message payload

* Testing notification generation

* Updating main jenkinsfile

* Fixing cleanup call

* Removing test pipeline code

* Comment clean up

* Testing pipeline

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Moving archive

Moving trace archive to safe location before source checkout

* Removing test pipeline

* Testing pipeline with unique file names

* Update Jenkinsfile

* Removing test files

Updated main pipeline
2025-11-26 16:27:27 -07:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
John Shumway
10a782d846 Fix template parameter macros (#3305)
Some of the device implementation templates have macros like GridwiseGemmMultiABDTemplateParameters that can cause build errors if multiple files are included together. This error comes up with our builder code.

To clean up the macros and make them safer, we follow these follow rules:
* Use more specific names to avoid duplication.
* Undefine the macro after it is used to avoid leaking out of the file scope.
* Use a prefix CK_ on the macro to avoid conflicting with other libraries.
* Use all caps with underscores for preprocessor macro names.
2025-11-26 09:48:17 -08:00
Aviral Goel
35a4b26af0 fix: add dynamic selection of pipelines for aquant mode (#3282)
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly
2025-11-26 10:58:09 +04:00
Yi DING
8fa90025d0 [CK_TILE] Refine warp_gemm_attribute_mfma (#3272) 2025-11-26 10:57:15 +08:00
Yi DING
c7dce2ac29 [CK_TILE] Fix Compilation of Flatmm Examples (#3285) 2025-11-26 10:11:43 +08:00
Illia Silin
a54f7b1138 Enable ck_builder in CI. (#3296)
* build and run ck_builder tests

* add test_ckb_all to targets

* fix syntax

* fix test path

* Update CMake targets for builder testing in CI (#3290)

Our existing CMake only had build targets. Update CMakeLists.txt to have CTEST targets:
* smoke-builder
* regression-builder
* check-builder

Co-authored-by: John Shumway <jshumway@amd.com>

* use check-builder target

* get rid of test_ckb_all target

* call ninja check-builder separately

---------

Co-authored-by: John Shumway <jshumway@amd.com>
2025-11-25 17:45:59 -08:00
Aviral Goel
cd47293869 chore(copyright): update copyright header for experimental & example directory (#3292) 2025-11-26 03:09:39 +04:00
Bartłomiej Kocot
00dfa2f2ce [CK TILE] Grouped Conv Explicit Gemm (#3289)
* [CK TILE] Grouped Conv Explicit Gemm

* fixes

* apply builder fixes
2025-11-25 23:28:35 +01:00