Commit Graph

573 Commits

Author SHA1 Message Date
Qianfeng Zhang
62cf370749 Fix sched_barrier mask value 2026-02-24 07:37:13 +00:00
Qianfeng Zhang
eba924de35 Remove un-used index constant definition 2026-02-14 08:17:39 +00:00
Qianfeng Zhang
f5b4d5dc26 Update in whole_k_prefetch_trload pipeline to prefetch two k_tile for next iteration in the non-whole-k-perfetch path 2025-12-23 10:23:26 +00:00
Qianfeng Zhang
489e2554ea Update to GetNumPrefetchV() for kM0=64 path 2025-12-23 08:20:08 +00:00
Qianfeng Zhang
6c91b0c407 Move the loading of k_tile for next iteration into the Gemm1 loop (non whole_k_prefetch path in trload pipeline) 2025-12-23 07:07:15 +00:00
Qianfeng Zhang
e7e6ebc91c Update to GetNumPrefetchV() 2025-12-22 15:35:38 +00:00
Qianfeng Zhang
b77fdbf304 Move the loading of k_file for next iteration into the Gemm1 loop (non whole_k_prefetch path) 2025-12-22 15:34:10 +00:00
Qianfeng Zhang
57cf989f63 Update to only pre-load one v_tile during Gemm0 loop 2025-12-22 08:47:09 +00:00
Qianfeng Zhang
db5c12db89 Update to the non-whole-k-prefetch path in the whoke_k_prefetch pipeline 2025-12-21 15:13:14 +00:00
Qianfeng Zhang
1ef76a62ea Fix the static_assert expression in the pipeline 2025-12-21 12:14:39 +00:00
Qianfeng Zhang
3f6d26e9a7 Load Q directly from global memory to registers for BlockGemm 2025-12-20 13:25:36 +00:00
Qianfeng Zhang
57abd10b95 Using is_using_trload_v to check the kUseTrLoad from pipeline 2025-12-20 10:23:30 +00:00
Qianfeng Zhang
eb598a9d1e Add qr_ks_vs_whole_k_prefetch_trload pipeline 2025-12-19 15:54:21 +00:00
Qianfeng Zhang
384f4708a1 Add support of loading QK tiles of hdim96 without padding to hdim128 2025-12-17 16:39:15 +00:00
Qianfeng Zhang
d281c519f3 Adjust in GetNumPrefetchV() 2025-12-15 15:02:15 +00:00
Qianfeng Zhang
370d386427 Remove replicated codes in the pipeline 2025-12-15 10:38:15 +00:00
Qianfeng Zhang
409ec3b56e Fix move_tile_window(k_dram_window, ..) step in the pipeline 2025-12-15 09:54:49 +00:00
Qianfeng Zhang
c3d3487ca4 Load Q through Lds 2025-12-14 15:46:37 +00:00
Qianfeng Zhang
12c88731c6 Separate kN0Sub from kK0 to be used for flexible tile tuning for whole_k_prefetch pipeline 2025-12-11 08:41:16 +00:00
Qianfeng Zhang
2ea8d8313c Using explicit vgpr-saved partition_index with store_tile(lds_window, ...) 2025-12-08 04:54:22 +00:00
Qianfeng Zhang
044f554bf7 Refine the interleaving in the loop of Gemm0 2025-12-07 14:13:16 +00:00
Qianfeng Zhang
5722f8afbc [Performance] Change __builtin_amdgcn_sched_barrier() in block_gemm 2025-12-07 10:11:43 +00:00
Qianfeng Zhang
8b85919288 Simplify the block_gemm codes 2025-12-06 16:03:43 +00:00
Qianfeng Zhang
25521a7e06 Switch the codes based on the iteration index (first/intermediate/last) 2025-12-05 15:58:33 +00:00
Qianfeng Zhang
c32949b285 Change in GetKVBlockGemm to let gemm1 to use WarpTile-16x16x16/32x32x8 on mi350 2025-12-05 02:04:27 +00:00
Qianfeng Zhang
98f9b4a47b Add prefetching whole next iteration K path in the pipeline 2025-12-04 15:38:16 +00:00
Qianfeng Zhang
5fada1ce99 Initial re-implementation of pipeline qr_ks_vs_whole_k_prefetch in looping Gemm0 along n0 dimension 2025-12-04 09:09:47 +00:00
Cong Ma
30727c48fc Tile engine for streamk (#3157)
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.

- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
  It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
  the streamk implementation reaches stability.

* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM

* [CK TILE STREAMK] Refactor: Extract common utility functions.

* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation

* Add pre-commit

* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK

* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK

* [CK TILE STREAMK] Update Jenkinsfile

* [CK TILE Engine] Update StreamK tile engine help message

Remove default value messages as they are automatically printed

* [CK TILE Engine] Update StreamK tile engine

- Remove namespace reboot

* [CK TILE Engine] Update StreamK tile engine

- Fix merge error
2025-11-27 15:49:57 -07:00
arai713
24d88d2472 [CK_TILE] Move DataTypeTraits into a Common File (#3146)
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-11-27 09:09:54 -08:00
Matthias Gehre
678298d4c7 Add support for gfx1153 (#3306) 2025-11-27 08:48:00 +01:00
Thomas Ning
a38aeceb21 Fix and improve the gemm quant pipeline infrastructure (#3245) 2025-11-26 18:04:27 -08:00
Max Podkorytov
79aae7c7f7 [CK Tile] enable building examples by default (#3259)
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch

* fix cpp17 compile error in the ck-tile examples

---------

Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-11-26 16:24:44 -08:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
Aviral Goel
35a4b26af0 fix: add dynamic selection of pipelines for aquant mode (#3282)
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly
2025-11-26 10:58:09 +04:00
Yi DING
8fa90025d0 [CK_TILE] Refine warp_gemm_attribute_mfma (#3272) 2025-11-26 10:57:15 +08:00
Yi DING
c7dce2ac29 [CK_TILE] Fix Compilation of Flatmm Examples (#3285) 2025-11-26 10:11:43 +08:00
Bartłomiej Kocot
00dfa2f2ce [CK TILE] Grouped Conv Explicit Gemm (#3289)
* [CK TILE] Grouped Conv Explicit Gemm

* fixes

* apply builder fixes
2025-11-25 23:28:35 +01:00
Bartłomiej Kocot
9ac2666d5b [CK_BUILDER] Add grouped conv bwd ck tile traits (#3281)
* [CK_BUILDER] Add grouped conv bwd ck tile traits

* copilot fixes
2025-11-25 14:57:43 +01:00
rocking
229d43ea0c Fix batch prefill compile fail in aiter (#3279)
* Fix batch prefill aiter compile fail

* Fix compile error
2025-11-25 09:46:32 +08:00
Thomas Ning
de6a9590ab Reorganize of KPack in GEMM (#3247)
* add the reorganize of KPack

* fix the compilation error

* fix the compilation error
2025-11-24 12:38:59 -08:00
Christopher Millette
b9c6cb1452 First look at mfma / wmma unification (#2704)
* First look at mfma / wmma unification

* Refactor

* Re-org file structure

* Restructure transform selection and WaveWiseMma class

* Update license files. Add missing gfx1151 support. Change wave size for HOST to 1. Update datatypes naming consistency

* Fixes default MmaSelector implentation

* Adds unit tests for amdgcn_mma and arch

* Consolidate common arch id checks to constexpr functions. Strongly type ids as amdgcn_target_arch_id object.

* Refactor is_any_value_of

* Fixes mma_selector logic

* Fix typo

* Add mma selector test for tile decomposition

* Fix compilation of mma.hpp

* Revert back to c++17 compatibility

* Fix compiler error by returning index_t from get_warp_size()

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes compiler error for missing is_wave32() function

* Fixes compiler error for host wave_size() should be 64

* Fixes compiler errors where __cpp_concepts is not defined

* Fixes compiler errors where __cpp_concepts is not defined

* Fix test failure for host is wave64 by default

---------

Co-authored-by: Chris Millette <you@example.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-24 09:39:59 -08:00
Khushbu Agarwal
8111572785 [CK_Tile] Support for preshuffle weight(B) quant tensor for block scale gemm (#3165)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* addressing review comments

* fixing CI issue

* addressing reveiw comments

* formatting

* formatting

* fixing aquant operator overlaoding

* formatting

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-11-24 07:48:42 -08:00
Qianfeng
81042ea574 Fix a bug for qr_ks_vs_async_trload pipeline (#3271) 2025-11-24 21:31:48 +08:00
rocking
5948dbffe4 Support fp8 dynamic quantization for fmha (#3206)
* Support qscale for dynamic quant, remove static quant

* Support hdim=256

* Remove bias test case for fp8

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-11-24 16:28:25 +08:00
Johannes Graner
096f0a3b23 [CK Tile] Fix example for conv fwd + bias + clamp (#3235)
* Fix clamp not being applied correctly

* Apply group offsets to D tensors

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-11-24 07:36:26 +01:00
Emily Martins
02ab76c2cb Fix CK Tile DP + 2 Tile Stream-K Validation Errors (#3269)
When there are multiple workgroups contributing to a tile, when using
atomics, there may be round off error in cases where the accumulator
type is not the same as the C type. To compute an error tolerance for
test validation, the Stream-K Tile Partitioner has a function called
estimate_num_wgs_per_tile to estimate the number of workgroups per tile.
That said, this function only provides an estimate. In some cases for
DP+2TSK, the function returns 1 rather than the more accurate value of
2.

Thus, this change updates the estimate_num_wgs_per_tile function to
explicitely return the value of 2 in cases for DP+2TSK to ensure that we
have a better error tolerance to avoid test failures due to round-off
error.
2025-11-21 20:29:47 -07:00
Yi DING
8b284a63a4 [CK_TILE] Refine FP32 => FP16/BF16 Conversion (#3215)
* [CK_TILE] Refine FP32 => FP16/BF16 Conversion

* Thank you Copilot

* Rename fix

* Fix example

* Fix accu checking

* Fix

* Fix
2025-11-20 10:50:26 -08:00
Gavin Zhao
07314ac543 Add support for RDNA1 GPUs (#3220)
* Allow compilation for RDNA1 (__gfx101__)

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* More RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* Even more RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* cmake: skip build quantization for unsupported arches

* add gfx10-1-generic support as well

* add gfx1013 and complete gfx10-1-generic

* fix clang format

* enable DL kernels on gfx101x

---------

Signed-off-by: Gavin Zhao <git@gzgz.dev>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-11-20 10:45:57 -08:00
Emily Martins
2e4b8a8fc4 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed Revert "Add attn sink (#2892)" (#3250)
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00