Commit Graph

2454 Commits

Author SHA1 Message Date
Qianfeng Zhang
80e08b6efe Use supplement_array_by_last_element() in example to simplify the code 2025-11-02 03:17:16 +00:00
Qianfeng Zhang
10133e5d51 Update to README.md 2025-11-02 03:16:48 +00:00
Qianfeng Zhang
8408ec0a02 Add scripts for testing the use of separate sequence lengths for k/v 2025-11-02 03:16:22 +00:00
Qianfeng Zhang
17e404be3b Support separate sequence lengths for q and kv 2025-11-02 03:14:53 +00:00
Qianfeng Zhang
eaf9650fed Use separate pipelines for the softmax and no-softmax cases 2025-10-30 10:01:52 +00:00
Qianfeng Zhang
207e6f10b8 Implementation of hstu attention pipeline using trload for v on mi350 2025-10-29 15:45:14 +00:00
Qianfeng Zhang
a464269bb6 Fix in the comments 2025-10-27 10:47:40 +00:00
Qianfeng Zhang
4eeb5cc917 Update to gemm_0's CBlockDistribution encoding so that it is compatible with gemm_1's ABlockDistribution encoding 2025-10-27 10:47:23 +00:00
Qianfeng Zhang
98a241a2eb Using separate tile settings for no-softmax and with-softmax hstu attention situations 2025-10-24 01:47:55 +00:00
Qianfeng Zhang
7c4012266a Update benchmark scripts to account for the use of softmax 2025-10-23 10:09:37 +00:00
Qianfeng Zhang
d1505786f8 Add support of softmax in hstu attention 2025-10-20 14:26:55 +00:00
Qianfeng Zhang
a874839dc2 Add template parameter to gemm_0 MakeCBlockTile() for the need of defining PcompBlockTileType 2025-10-20 14:26:29 +00:00
Qianfeng Zhang
1a8f2f21fb Move scaling by attn_scale to inside the main-loop 2025-10-20 14:22:18 +00:00
Qianfeng Zhang
bbda3f6f1c Let IsTokenPairInsideMask() return bool type 2025-10-20 14:21:26 +00:00
Qianfeng Zhang
fdb89d3e2f Add instances to consider for adding softmax support 2025-10-20 14:20:54 +00:00
Qianfeng Zhang
2072e53d1e Remove K0 from tile setting since it is not used 2025-10-14 07:15:26 +00:00
Qianfeng Zhang
22a7b31865 Change to pipeline so that it is easier to add support of using softmax 2025-10-12 06:09:55 +00:00
Qianfeng Zhang
d308b09fae Remove using IGLP method for instruction scheduling for kUseLocal true path 2025-10-12 06:09:25 +00:00
Qianfeng Zhang
6b40ce4074 Fix in GetQKBlockGemm() 2025-09-27 14:59:32 +00:00
Qianfeng Zhang
27b96b15c4 Simplify the warp_gemm definitions in GetQKBlockGemm and GetKVBlockGemm 2025-09-25 15:38:55 +00:00
Qianfeng Zhang
bd32cc0de0 Remove useless constant statement in the kernel 2025-09-19 07:24:29 +00:00
Qianfeng Zhang
db62a9f47e Remove unnecessary HSTU_CHECK() calls 2025-09-13 16:39:14 +00:00
Qianfeng Zhang
2427426640 Add HSTU_CHECK() and use it in example codes 2025-09-13 16:38:33 +00:00
Qianfeng Zhang
a5b7360862 Small update in reference hstu attention 2025-09-13 06:53:54 +00:00
Qianfeng Zhang
798fc3cc0b Detach HstuBlockMask from pipeline definition and construct the HstuBlockMask type in the kernel according to window_size 2025-09-12 09:11:47 +00:00
Qianfeng Zhang
7d10353fda Unify the license statements on all the source files 2025-09-11 10:27:00 +00:00
Qianfeng Zhang
1c030e8c3c Remove using MakeKargsImpl() to simplify the hstu kernel 2025-09-10 15:28:12 +00:00
Qianfeng Zhang
72eb4e95d8 Clarify the use of kSubQKHeaddim and kQKHeaddim so that less regular hdim (e.g. 96, 160) can be efficiently supported 2025-09-09 12:55:01 +00:00
Qianfeng Zhang
f8dea2bc86 Use set_slice_tile() to replace direct thread_buffer assignment 2025-09-09 12:54:32 +00:00
Qianfeng Zhang
a99d85517c [ck_tile] Fix in set_slice_tile() 2025-09-09 12:54:06 +00:00
Qianfeng Zhang
4bf65d9fe5 Merge branch 'develop' into hstu_attention_mi350_fwd_bwd and change in using ck_tile::make_kernel 2025-09-01 07:35:35 +00:00
Aviral Goel
fcff0043ae chore(gemm): clang format to pass CI (#2758) 2025-08-29 00:38:46 -07:00
Vijay Krish
4208e28988 ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that load the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.

Scale tensor data, BQ, is spliced across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.

* Solve merge conflict

* [CK TILE] Update CHANGELOG.md

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2025-08-28 23:43:02 -07:00
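As a rough illustration of the "dequantization post MFMA" idea described in 4208e28988 above, the following is a minimal host-side C++ sketch, not the ck_tile kernel. It accumulates an unscaled partial sum over each K-group and only then multiplies by that group's scale. The function name `gemm_groupwise_bquant`, the row-major layouts, and the one-scale-per-group-per-column shape are all assumptions made for illustration; quantized B values are assumed to be already decoded to float.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (assumed layouts, not the ck_tile API):
//   a     : M x K, row-major
//   bq    : K x N, row-major, quantized values already decoded to float
//   scale : (K / group_size) x N, row-major, one scale per K-group per column
std::vector<float> gemm_groupwise_bquant(const std::vector<float>& a,
                                         const std::vector<float>& bq,
                                         const std::vector<float>& scale,
                                         std::size_t M,
                                         std::size_t N,
                                         std::size_t K,
                                         std::size_t group_size)
{
    assert(group_size > 0 && K % group_size == 0);
    const std::size_t num_groups = K / group_size;

    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            for(std::size_t g = 0; g < num_groups; ++g)
            {
                // Unscaled partial product over one K-group
                // (stands in for the matrix-multiply step).
                float partial = 0.0f;
                for(std::size_t k = g * group_size; k < (g + 1) * group_size; ++k)
                {
                    partial += a[m * K + k] * bq[k * N + n];
                }
                // Dequantize after the multiply: scale the partial result.
                c[m * N + n] += partial * scale[g * N + n];
            }
        }
    }
    return c;
}
```

Scaling the partial result once per group, rather than every B element before the multiply, is what allows the multiply itself to run on the low-precision values.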
Cong Ma
428090f749 Support transposed C tile in Aquant (#2679)
The performance of Aquant has increased after enabling transposed C.

Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
2025-08-28 13:28:09 -07:00
Mateusz Ozga
0758883fa4 [CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752)
* Init commit

* Quick fix, CI fails

* Remove CDElementWise

* Add CDEELementWise

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-28 12:45:50 -07:00
asleepzzz
038ea82315 Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
This reverts commit ead4447b20.
2025-08-28 22:50:42 +08:00
linqunAMD
4a49dac7c6 [Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728)
* fix copy basic build error

* fix other ck tile test build error
2025-08-28 20:30:30 +08:00
Yi DING
ead4447b20 [CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
* 16x192

* Use buffer_load_lds for lse/d

* Dispatch & cleanup

* Avoid zeroing dq & fix

* fix
2025-08-28 18:54:18 +08:00
msaffari-amd
b951416cdb Ck tile gemm low prec data types int4 int8 unit tests (#2718)
* add gemm unit tests for int4, int8 datatypes

* minor changes based on reviews

---------

Co-authored-by: msaffari-amd <msaffari@banff-cyxtera-s78-2.ctr.dcgpu>
2025-08-28 10:47:16 +02:00
Linjun-AMD
bf7b458e6e use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711)
* add k_lds padding and iglp to improve dim256 fmha fwd

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update  block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp

* clang format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* use same naming style

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-28 11:39:39 +08:00
Aviral Goel
f5f795c4d6 feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691)
* feat(check_err): add a variable to adjust number of incorrect values to print

* feat(host_tensor): add printing capability for fp8 bf8 int8 int4

* fix(gemm_utils): update acceptable data type

* fix(host_tensor): print both 4 bit ints in pk_int4_t

* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp

* feat(host_tensor): add print first n elements functions
2025-08-27 18:17:24 -07:00
John Shumway
9751583f95 Replace auto with function template for c++17. (#2754)
In #2443, a helper function was added to test new print functionality, but it used auto for the function parameter types.

We need to support c++17 for downstream libraries, so we cannot use auto there. This PR replaces it with the equivalent function template implementation.
2025-08-27 17:10:39 -07:00
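The C++17 constraint mentioned in 9751583f95 above can be sketched as follows. Using `auto` as a function parameter type (an abbreviated function template) is a C++20 feature, so a C++17-compatible version spells out the template parameters explicitly. The helper `describe` is a hypothetical stand-in, not the function touched by the PR.

```cpp
#include <sstream>
#include <string>

// C++20 only (abbreviated function template), not valid in standard C++17:
//   std::string describe(const auto& name, const auto& value);

// C++17-compatible equivalent: one template parameter per former `auto`.
template <typename Name, typename Value>
std::string describe(const Name& name, const Value& value)
{
    std::ostringstream os;
    os << name << " = " << value;
    return os.str();
}
```

Call sites do not change, since template argument deduction works the same way in both forms, e.g. `describe("n", 3)`.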
Illia Silin
0ac908fb57 Add a daily CI cron job to build pytorch. (#2755)
* add a stage to build pytorch

* add docker file for pytorch stage

* call build scripts from the default path

* add a daily cron build for pytorch stage
2025-08-27 16:57:43 -07:00
Cong Ma
cd53e2e57e [CK TILE GEMM] Fix a merge conflict (#2753)
* Fixed a merge conflict in 245467f3
* Format the code
2025-08-27 11:08:09 -07:00
Bartłomiej Kocot
cfe5e448db Fix splitk autodeduce for grouped conv bwd weight (#2742) 2025-08-27 12:35:42 +02:00
Cong Ma
245467f359 [CK TILE] Fix bugs in AQuant preshuffle (#2700)
* [CK TILE] Fix bugs in AQuant preshuffle

- Make Aquant work with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant work with warp 16x16x32 and 32x32x16.

* [CK TILE] Rename Preshuffle to PreshuffleQuant

The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.

* [CK TILE Block Scale] Use GemmConfig to save tile properties

- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
  coded tile properties in `gemm_calc_aquant()`

* [CK TILE Block Scale] Rename GemmConfig used in block scale

    - Remove unused GemmConfig
    - Rename GemmConfig used in block scale

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-08-27 00:05:54 -07:00
linqunAMD
95e4a4efcb Fix merge mfma_wmma (part 1) regression (#2749)
root cause: a typo in GetGfx11InputBlkIdx, a const was added by mistake.
2025-08-26 22:49:34 -07:00
JH-Leon-KIM-AMD
19d5327c45 Test comprehensive dataset (#2685)
* Add CSV-driven convolution test pipeline

- Add test_grouped_convnd_fwd_dataset_xdl.cpp with CSV reader functionality
- Add complete dataset generation toolchain in test_data/
- Add Jenkins integration with RUN_CONV_COMPREHENSIVE_DATASET parameter
- Ready for comprehensive convolution testing with scalable datasets

* Update convolution test dataset generation pipeline

* add 2d, 3d dataset csv files

* Remove CSV test dataset files from repository

* Update generate_test_dataset.sh

* Fix channel division for MIOpen to CK conversion

* Remove unnecessary test files

* Fix clang-format-18 formatting issues

* TEST: Enable comprehensive dataset tests by default

* Fix test_data path in Jenkins - build runs from build directory

* Add Python dependencies and debug output for CSV generation

* Remove Python package installation - not needed

* Add better debugging for generate_test_dataset.sh execution

* Fix Jenkinsfile syntax error - escape dollar signs

* Add PyTorch to Docker image for convolution test dataset generation

- Install PyTorch CPU version for lightweight model execution
- Fixes Jenkins CI failures where CSV files were empty due to missing PyTorch
- Model generation scripts require PyTorch to extract convolution parameters

* Add debugging to understand Jenkins directory structure and CSV file status

- Print current working directory
- List CSV files in test_data directory
- Show line counts of CSV files
- Will help diagnose why tests fail in Jenkins

* Fix clang-format-18 formatting issues

- Applied clang-format-18 to test file
- Fixed brace placement and whitespace issues

* Add detailed debugging for CSV dataset investigation

- Check generated_datasets directory contents
- List all CSV files with line counts
- Show first 5 lines of main CSV file
- Applied clang-format-18 formatting
- This will help identify why CSV files are empty in Jenkins

* keep testing add pytorch installation in shell script

* Use virtual environment for PyTorch installation

- Jenkins user doesn't have permission to write to /.local
- Create virtual environment in current directory (./pytorch_venv)
- Install PyTorch in virtual environment to avoid permission issues
- Use PYTHON_CMD variable to run all Python scripts with correct interpreter
- Virtual environment will be reused if it already exists

* Remove debug code and reduce verbose logging in Jenkins

- Remove bash -x and debug commands from Jenkinsfile execute_args
- Remove all debug system() calls and getcwd from C++ test file
- Remove unistd.h include that was only needed for getcwd
- Remove debug print in CSV parser
- Add set +x to generate_test_dataset.sh to disable command echo
- Redirect Python script stdout to /dev/null for cleaner output

This makes Jenkins logs much cleaner while still showing progress messages.

* install gpu torch

* Clean up and optimize comprehensive dataset test pipeline

- Reorder Jenkinsfile execution: build -> generate data -> run test
- Remove commented-out debug code from generate_test_dataset.sh
- Ensure all files end with proper newline character (POSIX compliance)
- Keep useful status messages while removing development debug prints
- Set MAX_ITERATIONS=0 for unlimited test generation in production

* Add configuration modes to reduce test execution time

- Add --mode option (half/full) to generate_model_configs.py
  - half mode (default): ~278 configs (224 2D + 54 3D) -> ~1,058 total tests
  - full mode: ~807 configs (672 2D + 135 3D) -> ~3,093 total tests
- Update generate_test_dataset.sh to use CONFIG_MODE environment variable
- Keeps all model types but reduces parameter combinations intelligently
- Fixes Jenkins timeout issue (was running 3,669 tests taking 17+ hours)
- Default half mode should complete in ~4-5 hours instead of 17+ hours

* Add small mode for quick testing of comprehensive dataset

* jenkins pipeline test done

* jenkins test done

* Trigger CI build

* remove test comment and update data generation option as half

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-08-26 22:18:05 +02:00
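The CSV-driven test approach described in 19d5327c45 above could look roughly like the sketch below. This is not the reader from test_grouped_convnd_fwd_dataset_xdl.cpp: the function name `parse_csv_rows`, the `#` comment convention, the single header row, and the all-integer columns are assumptions chosen for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical CSV reader: skips blank lines and '#' comments, treats the
// first remaining line as a header, and parses every other line as a row of
// comma-separated integers. std::stol throws on non-numeric cells.
std::vector<std::vector<long>> parse_csv_rows(std::istream& in)
{
    std::vector<std::vector<long>> rows;
    std::string line;
    bool header_skipped = false;

    while(std::getline(in, line))
    {
        if(line.empty() || line[0] == '#')
        {
            continue; // blank line or comment
        }
        if(!header_skipped)
        {
            header_skipped = true; // first data-like line holds column names
            continue;
        }

        std::vector<long> fields;
        std::istringstream cells(line);
        std::string cell;
        while(std::getline(cells, cell, ','))
        {
            fields.push_back(std::stol(cell));
        }
        rows.push_back(fields);
    }
    return rows;
}
```

Each parsed row would then be mapped to one convolution problem and run as a separate test case, so the number of tests scales with the dataset rather than with hand-written cases.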
John Afaganis
508e7912f9 Revert "[CK-TILE] Default epilogue, adding support for D (#2629)" (#2746)
This reverts commit d43228fbca.
2025-08-26 09:48:49 -07:00
SamiAario-AMD
5e85c38d7d Lwpck 3548 gemm test cleanups (#2717)
* Remove some unnecessary calls to create_args in basic and universal GEMM tests

* Remove unnecessary include statements in universal GEMM tests

* Improve compilation time of basic GEMM tests by only compiling the precision variants that we need

* Universal GEMM PrecType should be the same as CDataType

* Improve compilation time of universal GEMM tests by only compiling the precision variants that we need

* Revert to constexpr when defining some constants
2025-08-26 13:25:48 +03:00