composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
Brock Hargreaves	9c99a397ed	[CK] Use as_posix() instead of str() for paths in fmha_fwd_appendkv.py (#4812 ) ## Motivation This is causing a failing PR for Windows: https://github.com/ROCm/TheRock/pull/3382 ``` [composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_
w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` ## Technical Details The file: [fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78) writes a bunch of paths to a text file which is later parsed by cmake. When passing a pathlib.Path to str(), str() converts to a native path, in this case / to \\ on Windows which confuses cmake. In this case we need to write paths with forward slashes and then pass those onward to cmake. ## Test Plan 1. Ensure this doesn't impact existing CI. 2. Ensure compilation of Windows pass locally. ## Test Result 1. Passes existing CI 2. This fixes the compilation error locally. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 09:12:46 -07:00
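The difference between `str()` and `as_posix()` described in this commit can be reproduced on any platform with `pathlib.PureWindowsPath`. The following is a minimal sketch, not the code from `fmha_fwd_appendkv.py`; the helper name `to_cmake_path` is hypothetical.

```python
from pathlib import PureWindowsPath


def to_cmake_path(path):
    """Return the path with forward slashes, which CMake can parse.

    Hypothetical helper for illustration; not part of the CK scripts.
    """
    return path.as_posix()


# PureWindowsPath applies Windows path rules even on Linux or macOS.
win_path = PureWindowsPath(
    r"B:\build\ml-libs\composable_kernel\build\example\ck_tile"
    r"\50_sparse_attn\fmha_jenga_fwd_api.cpp"
)

native = str(win_path)              # backslash separators: "\b" reads as an escape in CMake
portable = to_cmake_path(win_path)  # forward-slash separators

print(native)
print(portable)
```

Writing `portable` rather than `native` into the generated file list keeps sequences such as `\b` out of the text that CMake later parses.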
Brock Hargreaves	1927528b44	[CK] Fix windows build issues (#4819 ) ## Motivation Full build on Windows is currently broken due to compiler errors, this PR should help fix that. This is also holding up the following PR in the TheRock: https://github.com/ROCm/TheRock/pull/3382 ## Technical Details 1. I don't see a good reason to be nesting a windows include inside the ck_tile namespace. It was causing compiler errors too: Windows.h comes with min and max, which was conflicting with ck_tile::min and ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this inclusion in the future. 2. The TRUE/FALSE macros are already used by Windows.h, which causes an error. So I've opted for True/False. You can see this pattern in other rocm-libraries. 3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN context, from \<cmath\> on Windows. We'll be able to use std::numbers::v_pi\<float\> when we have C++20 support. 4. There was a missing \<chrono\> include. ## Test Plan Test locally and make sure this doesn't impact existing CI. ## Test Result Compiles locally and passes existing ci. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 09:12:19 -07:00
Zoltán Lakatos	87d418a07a	[CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425 ) ## Motivation Add support for grouped gemm multi ABD fixed NK. ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both the new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. Since this xdl feature previously had no tests, our assumption is that either there is an implementation bug or these instances were not set up correctly. Has the potential for a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-25 05:16:07 +00:00
Bartłomiej Kocot	799b096f50	[CK][CK TILE] Improve oob check (#4791 ) ## Motivation Improve OOB checks. Remove the permutes that were generated by the thread buffer zero clear; the assembly now contains only condmask instead of permute + condmask. Change the number of KPack for generated instances. ## Technical Details Remove permute instructions from assembly ## Test Plan test_grouped_convnd_fwd_tile ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com>	2026-02-24 22:40:48 +01:00
Max Podkorytov	eedb35f1a8	[CK] Add project root marker for monorepo compatibility (#4430 ) ## Summary - Add `.ck-project-root` marker file at the composablekernel project root - Update `find_project_root()` in `script/tools/common.sh` to look for this marker instead of `.git` - Fixes project root detection when CK is part of the rocm-libraries monorepo ## Background Since the project was moved into the monorepo, the `.git` directory is at the monorepo root rather than the CK project root. This caused `find_project_root()` to return the wrong path, breaking tools in `script/tools/`. ## Test plan - [x] Verify `find_project_root` returns correct path from any CK subdirectory - [x] Verify `ck-build --help` works - [x] Verify `ck-configure --help` works Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-24 12:59:29 -08:00
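The marker-based lookup in this commit can be sketched as an upward directory walk. The actual `find_project_root()` lives in the shell script `script/tools/common.sh`; the Python version below is only an illustration of the idea, and its signature is an assumption.

```python
from pathlib import Path
from typing import Optional

MARKER = ".ck-project-root"  # marker file name taken from the commit message


def find_project_root(start: Path, marker: str = MARKER) -> Optional[Path]:
    """Walk from `start` towards the filesystem root and return the first
    directory that contains `marker`, or None if no ancestor has it.

    Illustrative sketch only; the real tool is a shell function.
    """
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_file():
            return candidate
    return None
```

Because the search keys on the marker file rather than `.git`, it stops at the CK project directory even when `.git` sits higher up at the monorepo root.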
assistant-librarian[bot]	0889fb836f	[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to allow caller to pass desciptors directly (#4295 ) ## Proposed changes Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create tensor views from strides and sizes. This refactors the descriptor creation out and add overloaded definitions, allowing descriptors to be created separately by the caller instead of passing explicit strides, with no functional changes. This will enable further refactoring of `RunGemm` to do likewise, enabling derived kernels like BatchedContractionKernel to avoid creating separate versions (PR [#3457](https://github.com/ROCm/composable_kernel/pull/3457)). ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion Since the logic within the MakeXBlockviews chains together operations on tuples, and thus the descriptors are also passed as such, adding a template parameter for the type of the input tuple was the simplest option to enable the overload without too much verbiage. However, for `MakeCBlockView` this adds a complications as the templated definitions are prone to overlap. 
This for now is avoided by just moving the arguments around for the descriptor version, which avoids the collision. It's not a great solution, so feel free to suggest a better one. --- 🔁 Imported from [ROCm/composable_kernel#3467](https://github.com/ROCm/composable_kernel/pull/3467) 🧑‍💻 Originally authored by @amd-meskelin --------- Co-authored-by: Matti Eskelinen <matti.eskelinen@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-24 12:43:38 -08:00
assistant-librarian[bot]	11e6485457	[CK_TILE] Extend support of mix precision microscaling BQuant (#4267 ) ## Proposed changes Supported types combinations using BQuant=e8m0: - A=bf16 - B=bf16,bf8,fp4 Summary: - remove usage of `pk_fp4_raw_t`: consistent with other implementations and avoid taking into account of the packed size explicitly. In general, the raw type should not be used because CK Tile internally takes care of the PackedSize, so using the raw type adds unnecessary complexity to the implementation - handle microscaling by checking for `e8m0` type for BQuant (previous implementation was inconsistent) - add support for scaling instructions in `DequantPack8` - mx pipeline: - extend existing pipeline to support different B types - add support to scale and cast before writing to LDS or after reading from LDS (this can be defined in the `Problem` by the user) - block gemm: - mx pipeline is now using block gemm BQuant - block gemm BQuant can now load from LDS and apply scale and then call block gemm universal operator. This adds new functionalities and remove code duplication - warp gemm: - add case to support 128bit ds_read/write for both A and B when A=16bit and B=8bit - add examples and tests: note that some tests for bf16/fp4 already existed but were removed during previous tests refactoring. I added them again and other relevant tests for new types combinations ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. 
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689) 🧑‍💻 Originally authored by @EnricoDeg --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-24 09:55:50 -08:00
Emily Martins	f973b15516	[CK_TILE] Update Stream-K Reduction Strategy Enum (#4756 ) ## Motivation Currently, Stream-K has 3 reduction options: 1) atomics, 2) the reduction described in the Stream-K paper, and 3) a tree reduction. The reduction strategy described in the original Stream-K paper has the starting workgroup of each tile sequentially accumulating partial results of other contributing workgroups in the tile, which requires a linear number of steps. Hence, for clarity, this work updates the naming of the `StreamKReductionStrategy` enum members to better describe the existing reduction strategy options. ## Technical Details Prior to this change, the enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Reduction = 1u, TreeReduction = 2u }; ``` But, the distinction between `Reduction` and `TreeReduction` is not very clear and has some redundancy. Hence, the updated enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Linear = 1u, Tree = 2u }; ``` All references to `StreamKReductionStrategy` were updated to reflect this change. ## Test Plan No new functionality was added, so no new tests were added; I just validated existing tests and examples. ## Test Result All tests passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-24 06:40:08 +00:00
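To illustrate why `Linear` and `Tree` are more descriptive names, the sketch below mirrors the renamed enum in Python and counts the sequential combine steps each strategy needs for one tile. The step-count helpers are hypothetical and assume the textbook behaviour of each scheme (one partial result folded in per step for linear, pairwise halving for tree); they are not taken from the CK sources.

```python
from enum import IntEnum


class StreamKReductionStrategy(IntEnum):
    """Python mirror of the renamed C++ enum, for illustration only."""
    Atomic = 0
    Linear = 1
    Tree = 2


def linear_steps(num_workgroups: int) -> int:
    """The starting workgroup folds in the other partials one at a time."""
    return max(num_workgroups - 1, 0)


def tree_steps(num_workgroups: int) -> int:
    """Pairwise combining halves the partial count each round: ceil(log2(n))."""
    return max(num_workgroups - 1, 0).bit_length()


for n in (1, 2, 5, 8):
    print(n, linear_steps(n), tree_steps(n))
```

With 8 contributing workgroups this gives 7 sequential steps for the linear scheme and 3 rounds for the tree scheme.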
Cong Ma	4ba02a8b68	[CK TILE] Refactor sequence_reverse_inclusive_scan (#4355 ) ## Proposed changes Refactor ck tile `sequence_reverse_inclusive_scan` from recursive to for-loop. Tracking issue: #4229 This pull request introduces a new lightweight array type, `static_array`, and refactors the sequence utilities to use it for improved constexpr support and simplicity. The changes also include updates to the build system to add container-related tests. Core Library Improvements: * Added a new header `static_array.hpp` that defines the `static_array` type, a constexpr-friendly array with basic accessors and no custom constructors. * Updated includes in `core.hpp` and `sequence.hpp` to import `static_array`. [[1]](diffhunk://#diff-14b406eccf59794051a16c0c9c1a7e11234324bfdd107a5bbe0f173cd25bcddcR44) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76R7) Refactoring to Use `static_array`: * Refactored sequence utilities in `sequence.hpp` to use `static_array` instead of the previously forward-declared `array` type, including in histogram and array generation logic. [[1]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1108-R1133) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1130-R1146) * Rewrote the implementation of `sequence_reverse_inclusive_scan` to use `static_array` for intermediate storage, improving constexpr evaluation and clarity. Build System and Testing: * Added a new test subdirectory for container tests and a GoogleTest executable for `unit_sequence.cpp` to the CMake build configuration. [[1]](diffhunk://#diff-5d35ff7555d3f0b438d45cde06b661eb1332cdbec66287ac7ec3c478d688aae5R5) [[2]](diffhunk://#diff-1f54f0d2b431b7fc74f7b4ffb66e80c381c904c3383b1d27987467e3482d6d7aR1-R7) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-23 14:12:03 -07:00
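The recursive-to-loop refactor can be pictured with a plain reverse inclusive scan. The sketch below is in Python rather than constexpr C++, and it assumes the usual semantics (the last output is `reduce(last, init)` and each earlier output combines its element with the output to its right); the real `sequence_reverse_inclusive_scan` operates on compile-time sequences and stores intermediates in a `static_array`.

```python
import operator


def reverse_inclusive_scan(values, reduce, init):
    """Scan from right to left with a single loop.

    out[i] = reduce(values[i], out[i + 1]), with out[len] treated as `init`.
    Assumed semantics, for illustration only.
    """
    out = [init] * len(values)  # fixed-size buffer, playing the role of static_array
    acc = init
    for i in range(len(values) - 1, -1, -1):
        acc = reduce(values[i], acc)
        out[i] = acc
    return out


print(reverse_inclusive_scan([2, 3, 4], operator.mul, 1))  # [24, 12, 4]
print(reverse_inclusive_scan([2, 3, 4], operator.add, 0))  # [9, 7, 4]
```

The loop form keeps a single accumulator and one output buffer, whereas a recursive formulation instantiates one level per element.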
Aviral Goel	76b5b63049	[CK_TILE] Fix FP8 MXGEMM numerical error in async load path (#4704 ) ## Summary Fixes FP8 MXGEMM producing half the expected result (e.g., 128 instead of 256 with all 1s input). Bug introduced in: `6c58796a52f160db52bb148f2fd3039245a39525` - "[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)" ## Root Cause In the `static_move_ys=true` code path in `tile_window.hpp`, the IMM optimization computes `lds_ys_offset` using a default-constructed tensor descriptor: ```cpp make_tensor_coordinate(decltype(tensor_descriptor){}, idx_ys_offset) ``` This default-constructed descriptor has different strides than the actual DRAM tensor descriptor used for dram_ys_offset. When these offsets are mixed in the address calculation: ```cpp imm_valid = lds_ys_offset % IMM_RANGE; // From wrong descriptor wave_offset = dram_ys_offset - imm_valid; // From correct descriptor ``` The final address wave_offset + imm_valid ≠ dram_ys_offset, causing incorrect memory accesses. Fix ```cpp Set imm_valid = 0 to bypass the IMM optimization and ensure the full offset is passed through wave_offset: constexpr auto imm_valid = 0; // Avoids inconsistency between lds_ys_offset and dram_ys_offset ``` This disables the 12-bit immediate field optimization in the buffer_load_lds instruction but guarantees correctness. A proper fix would require making the DRAM tensor descriptor constexpr, which is not feasible since tensor strides depend on runtime parameters (LDA, LDB). --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2026-02-23 12:29:47 -08:00
chris-tsiaousis-hpc	a26226b49f	[CK] Updated pre-commit entry points (#4649 ) ## Motivation Pre-commit fails after the transition to the monorepo. This fixes it. ## Technical Details - ## Test Plan Try to commit on CK with pre-commit enabled. ## Test Result Pre-commit should pass. (Scripts are correctly found) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-02-23 00:38:36 +00:00
Anton Gorenko	cd23de8b38	[CK_TILE][FMHA] Support gfx11 (#4584 ) ## Motivation Add support for gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now built and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage, resulting in incorrect dispatching logic; sometimes tests report that there are no instances available for the current parameters. * Small fix to remove expcnt(0) from the s_waitcnt instruction on gfx11 when it is not requested (i.e. every time); this likely has no effect on performance but makes the disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-20 17:15:10 -08:00
Illia Silin	4d7f4783ba	Revert "[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3" (#4762 ) Reverts ROCm/rocm-libraries#4638 unfortunately, this PR interfered with the PR#4299 and caused build errors for gfx11: In file included from /rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7: In file included from /rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11: /rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21: error: no matching function for call to 'CheckValidity' 553 \| if(!GridwiseGemm::CheckValidity( \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~	2026-02-20 22:40:28 +00:00
Geo Min	1f1d0a1ee2	[ci] Adding composablekernel to TheRock CI (#4705 ) Workflow files under `projects/composablekernel/.github/workflows` do not get picked up by GitHub workflows. This change allows composable kernel changes to be built and tested properly. CI tests will prove functionality.	2026-02-20 11:18:24 -08:00
linqunAMD	7a98b5d002	[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3 (#4638 ) 1. Add GetEstimateVgprCount to estimate the VGPR usage in GridwiseGemm_wmma_cshuffle_v3 2. Add IsValidCompilationParameter to disable kernels which use too many VGPRs. - Currently, the threshold is AvailableVgprCount * 1.25 3. Modify examples to avoid tests being disabled on gfx11. This is ported from internal repo PR[#192](https://github.com/ROCm/composable_kernel/issues/192) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-20 07:56:29 -08:00
Aviral Goel	fc7e265563	fix: correct ULP calculation in get_absolute_threshold for BF16 tolerance (#4556 ) ## Motivation BF16 grouped GEMM tests were failing on gfx1201 with errors like: ``` Error: Incorrect results! out[5457621] != ref[5457621]: -66 != -65.5 max err: 0.5, number of errors: 1 ``` The calculated absolute tolerance (atol ~0.26) was too small to account for legitimate hardware vs software BF16 conversion differences (0.5 ULP). ## Changes 1. Discrete exponent calculation: Changed from continuous `log2()` to `floor(log2())` to match actual IEEE 754 floating-point exponent levels 2. Full ULP for output_error: Changed from 0.5 to 1.0 ULP to account for hardware `__bf16` vs software `float_to_bf16()` conversion differences ## Calculation Example For the failing case with value ~66: Before (incorrect): ``` expo = log2(66) = 6.044... atol = 2^(6.044 - 7) * 0.5 = 2^(-0.956) * 0.5 ≈ 0.26 Error 0.5 > 0.26 → Test fails ❌ ``` After (correct): ``` discrete_expo = floor(log2(66)) = 6 atol = 2^(6 - 7) * 1.0 = 2^(-1) * 1.0 = 0.5 Error 0.5 ≤ 0.5 → Test passes ✓ ``` The ULP for values in [64, 128) is 2^(-1) = 0.5, and the error of 0.5 is exactly 1 ULP, which is the maximum expected difference between hardware and software BF16 conversions at tie cases. ## Rationale Hardware and software BF16 conversions can differ by up to 1 ULP at tie cases due to different rounding strategies (hardware vs IEEE 754 round-to-nearest-even). The discrete exponent ensures ULP is calculated correctly for all values within an exponent range. Modified file: `projects/composablekernel/include/ck_tile/host/check_err.hpp`	2026-02-20 13:45:06 +04:00
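The worked calculation in this commit message can be checked numerically. The sketch below reproduces only the two formulas quoted in the message, with 7 taken as the BF16 mantissa width implied by the `- 7` term; the function names are hypothetical and the real `get_absolute_threshold` in `check_err.hpp` may include terms not shown here.

```python
import math

BF16_MANTISSA_BITS = 7  # the "- 7" in the commit message's formulas


def atol_before(value: float) -> float:
    """Old behaviour: continuous exponent and 0.5 ULP of output error."""
    expo = math.log2(abs(value))
    return 2.0 ** (expo - BF16_MANTISSA_BITS) * 0.5


def atol_after(value: float) -> float:
    """New behaviour: discrete exponent and a full 1.0 ULP of output error."""
    discrete_expo = math.floor(math.log2(abs(value)))
    return 2.0 ** (discrete_expo - BF16_MANTISSA_BITS) * 1.0


observed_error = abs(-66.0 - (-65.5))  # 0.5, from the failing test output
print(atol_before(66.0), observed_error > atol_before(66.0))
print(atol_after(66.0), observed_error <= atol_after(66.0))
```

For a value near 66 the old tolerance is about 0.26, so the 0.5 error fails, while the new tolerance is exactly 0.5, so the same error passes.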
assistant-librarian[bot]	9c8d3a39ac	173 implement device grouped gemm fixed nk for rdna4 (#4299 ) ## Proposed changes This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk instance library using for WMMA. The implementation is based on the existing DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level structure, but replaces the XDL kernel with a WMMA-based one. It uses the GridwiseGemm_wmma_cshuffle_v3 kernel. At this stage, the focus is functional correctness and compatibility, not performance tuning. ## Technical Details - Device struct for grouped gemm fixed NK - Example code for the WMMA version - Unit tests for both new wmma implementation and the reference XDL code (previously missing) - Generic ck profiler interface with the purpose of calling unit tests. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. 
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3668](https://github.com/ROCm/composable_kernel/pull/3668) 🧑‍💻 Originally authored by @bidlekm --------- Co-authored-by: Marton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: bidlekm <bidlekmarton@gmail.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-19 09:13:05 +01:00
Thrupti Raj Lakshmana Gowda	33f83b99ed	Update CMakeLists.txt (#4655 ) ## Motivation Tile Engine is an internal benchmarking tool and need not be built every time, since doing so increases build time. This PR excludes the Stream-K operator in Tile Engine from the build. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-18 23:29:54 -07:00
Ville Pietilä	49f62cb57d	Revert "[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size." (#4652 ) PR ROCm/rocm-libraries#4275 contains CK fwd conv instances optimized for `gfx950` and they do not compile for other architectures such as `gfx940`. To ensure that the optimized instances are compiled only for `gfx950`, compile-time guard `#if defined(CK_USE_GFX950)` was used. This approach works correctly when we compile for a single architecture, but when we compile simultaneously for multiple architectures, flag `CK_USE_GFX950` is set for non-gfx950 archs as well. As a result, the multi-arch compilation fails. The problem doesn't appear in the ROCm libraries CI/CD pipeline since only one architecture is compiled at a time. Hence, the CI/CD passed for the original PR. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-18 17:02:13 -07:00
assistant-librarian[bot]	9d437c0630	moe flatmm xcd remap (#4297 ) co-authors: @Chi-Chu319 @juuso-oskari Added XCD remapping for flatmm moe batch \| Mixtral (tflops, wip_355) \| Mixtral-7B (tflops, our branch) \| perf boost -- \| -- \| -- \| -- 64 \| 865.424 \| 995.455 \| 15.0% 256 \| 886.336 \| 1020.96 \| 15.2% 1024 \| 890.808 \| 1022.53 \| 14.8% --- 🔁 Imported from [ROCm/composable_kernel#3161](https://github.com/ROCm/composable_kernel/pull/3161) 🧑‍💻 Originally authored by @Chi-Chu319 --------- Co-authored-by: Tianxing Wu <chi0chu319@gmail.com> Co-authored-by: Tianxing Wu <tianxing.wu@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-18 11:32:15 -08:00
Thomas Ning	e5fb690945	Fix the Composable Kernel CI and versions incompatibility (#4640 ) ## Motivation This PR has 4 patches: 1. Fix the CI error of grouped gemm. 2. Fix the incompatibility of old linux version. 3. Fix the potential errors of flatmm. 4. Address the previous comments of abquant eight warps pipeline solution. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-18 06:59:37 -08:00
John Shumway	24eaadc4d2	[CK_BUILDER] Fix two staging-compiler errors in CK builder code (#4598 ) This PR fixes two compiler warnings that are reported as errors with the latest compiler: 1. In tensor descriptor, the `operator[]` accessor needs a `[[clang::lifetimebound]]` attribute. 2. In the unit tests for device buffer, there is a test that explicitly checks for an error on a pointer that went out of scope, so it needs to disable `-Wlifetime-safety-permissive` in the test code. I ran the CK `smoke-builder` tests with the staging compiler to verify. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-18 01:26:32 +00:00
assistant-librarian[bot]	f0e3f93f1d	[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size. (#4275 ) ## Proposed changes Added new FP16/BF16 instances that are optimized for group size = 1. The new instance use the compute optimized block GEMM pipeline. \| CK prof command \| Baseline (TFLOPs) \| New V3 instances (TFLOPs) \| \|:-----\|:------:\|------:\| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1 \| 858.818 \| 962.293 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 256 256 3 3 100 100 1 1 1 1 1 1 1 1 \| 979.987 \| 1121.11 \| \| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 50 50 1 1 1 1 1 1 1 1 \| 945.951 \| 1091.66 \| --- 🔁 Imported from [ROCm/composable_kernel#3670](https://github.com/ROCm/composable_kernel/pull/3670) 🧑‍💻 Originally authored by @vpietila-amd --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-02-17 17:58:11 -07:00
assistant-librarian[bot]	a2d139ee59	Add multi-file trace parsing and analysis pipeline (#4259 ) Extends build time analysis from ROCm/composable_kernel#3644 to handle multiple trace files across build directories (see #4229): - pipeline.py: Generic pipeline framework with fluent interface for composable data processing. Provides parallel processing, progress tracking, and error handling independent of trace-specific code. Processes thousands of trace files at default resolution in minutes, aggregating results into in-memory DataFrames for analysis. - parse_build.py: Parse all trace files in a build directory - build_analysis_example.ipynb: Demonstrates pipeline aggregation across all build files The pipeline design improves capability (composable operations), performance (parallel processing), and user-friendliness (fluent API) of the analysis modules. It enables analyzing compilation patterns across the entire codebase with all trace data available in pandas DataFrames for interactive exploration. --- 🔁 Imported from [ROCm/composable_kernel#3704](https://github.com/ROCm/composable_kernel/pull/3704) 🧑‍💻 Originally authored by @shumway Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-17 13:13:19 -08:00
assistant-librarian[bot]	ce718ec2e1	feat: add new optimized tutorial kernels (#4272 ) - Add 01_naive_gemm baseline implementation - Add 02_padding_k_first with PADDING_K_FIRST + MFMA_32x32x16 - Add 03_mfma_16x16x16 with PADDING_K_FIRST + MFMA_16x16x16 - Share common reference_gemm.hpp in parent gemm/ directory ## Proposed changes Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3676](https://github.com/ROCm/composable_kernel/pull/3676) 🧑‍💻 Originally authored by @AviralGoelAMD Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-17 12:41:06 -08:00
John Shumway	cef4904306	[CK_BUILDER] Move some smoke tests that require GPU (#4593 ) Since the CK builder is focused on compile time logic, let's keep the `smoke-builder` target CPU-only so that it can be run without a GPU. Alternatively, we could define a `smoke-cpu-builder` or some special subtarget, but it's probably simpler to just stick to CPU for this. (My thinking is that in general GPU testing will be heavier than the smoke tests. Further, the GPU testing code will likely move outside of the builder once builder code is moved out of experimental.) This PR clarifies that CPU-only intention for `smoke-builder` and moves some GPU testing code to `smoke-regression`.	2026-02-17 09:32:15 -08:00
Jan Patrick Lehr	c44ccc1d2c	[CK] Work around staging compiler lifetime warning (#4419 ) ## Motivation The staging compiler enables lifetime-safety warnings and we already worked around a few of them. This works around a few more instances that came up recently on gfx950 builds. The initial PR that resolved most issues: https://github.com/ROCm/composable_kernel/pull/3640 ## Technical Details This follows the pattern to locally ignore the newly added lifetime-safety warnings that were moved from experimental to production in upstream LLVM. As a result, CK turned them on and treats them as errors, which prevents the staging compiler from building CK. ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-12 22:11:53 +00:00
assistant-librarian[bot]	015bf06008	add memsetasync for ck moe splitk (#4282 ) ## Proposed changes add memsetasync for ck moe splitk to fix ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3630](https://github.com/ROCm/composable_kernel/pull/3630) 🧑‍💻 Originally authored by @lalala-sh --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-12 09:44:51 -08:00
Illia Silin	71d80a6209	[CK] add check for THEROCK_SANITIZER in cmake (#4514 ) ## Motivation Check whether the THEROCK_SANITIZER flag is set to ASAN or HOST_ASAN. ## Technical Details In case the THEROCK_SANITIZER flag is set to ASAN or HOST_ASAN and no GPU_TARGETS is selected, the list of the default targets will be set to "gfx908:xnack+;gfx90a:xnack+;gfx942:xnack+;gfx950:xnack+". ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-12 09:18:04 -08:00
Illia Silin	559a35eee7	[CK] Fix the launch_tests script. (#4525 ) ## Motivation Fix the script that filters the tests. ## Technical Details There were several places where the paths had to be updated for the launch_tests script to work correctly. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>	2026-02-11 20:42:43 -08:00
Christopher Millette	a95cf55aa7	[CK] Optimize multi-dimensional static for loop decomposition (#4447 ) ## Motivation Recursive template implementations might initially seem attractive to minimize necessary coding. Unfortunately, this style often affects readability and requires significant resources from the compiler to generate instantiation chains. In "high-traffic" code (e.g., used in many places + compilation units), this generally does not scale well and can bloat the overall compile times to unnecessary lengths. The aim of this PR is to take some of the most high-traffic utility code and try our best to eliminate recursive templates in favor of fold expansions and constexpr function helpers. In local tests with clang build analyzer, device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_16x16_instance.cpp showed high hit-rates on slow template instantiations in static_for, dimensional static_for (static_ford), which are subsequently affected by implementation of the Sequence class and associated transforms. Example: ** Templates that took longest to instantiate: 70111 ms: ck::detail::applier<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1... (372 times, avg 188 ms) // 70 seconds!** The above is part of the implementation of static_for which uses Sequence classes. 
## Technical Details ### Summary of Optimization Techniques \| Technique \| Used In \| Benefit \| \|-----------\|---------\|---------\| \| __Constexpr for-loop computation__ \| sequence_reverse_inclusive_scan, sequence_map_inverse \| Moves O(N) work from template instantiation to constexpr evaluation \| \| __Pack expansion with indexing__ \| sequence_reverse, Sequence::Modify \| Single template instantiation instead of recursive \| \| __Flat iteration + decomposition__ \| ford, static_ford \| O(1) template depth instead of O(N^D) \| \| __Pre-computed strides__ \| index_decomposer \| Enables O(1) linear-to-multi-index conversion \| ### Impact on Compile Time These optimizations reduce template instantiation depth from O(N) or O(N^D) to O(1), which: 1. Reduces compiler memory usage 2. Reduces compile time exponentially for deep instantiation chains 3. Enables larger iteration spaces without hitting template depth limits ## Test Plan * Existing tests for Sequence are re-used to affirm correctness * Unit tests for ford and static_ford are added (dimensional looping) * 8 new regression tests specifically verify the fixes for the PR feedback: - `NonTrivialOrder3D_201` - Tests Orders<2,0,1> for static_ford - `NonTrivialOrder3D_201_Runtime` - Tests Orders<2,0,1> for ford - `ConsistencyWithNonTrivialOrder_201` - Verifies static_ford and ford consistency - `NonTrivialOrder3D_120` - Tests Orders<1,2,0> for static_ford - `NonTrivialOrder3D_120_Runtime` - Tests Orders<1,2,0> for ford - `NonTrivialOrder4D` - Tests 4D with Orders<3,1,0,2> for static_ford - `NonTrivialOrder4D_Runtime` - Tests 4D with Orders<3,1,0,2> for ford - `AsymmetricDimensionsWithOrder` - Tests asymmetric dimensions with non-trivial ordering ## Test Result ### Compile Time Comparison: `8b72bc8` (base) → `477e0686` (optimized) #### Commits in Range (8 commits) 1. `fd4ca17f48` - Optimize sequence_reverse_inclusive_scan and sequence_reverse 2. `7a7e3fdeef` - Optimize sequence_map_inverse 3. 
`92855c9913` - Optimize ford and static_ford calls to eliminate nested template recursion 4. `88a564032b` - Add unit tests for ford and static_ford 5. `1a0fb22217` - Fix clang-format 6. `8a0d26bddf` - Increase template recursion depth to 1024 7. `dc53bb6e20` - Address copilot feedback and add regression tests 8. `477e06861d` - Increase bracket depth to 1024 #### Build Timing Results \| File \| Base (8b72bc8759d9) \| HEAD (a0438bd398) \| Improvement \| \|------\|------\|------\|-------------\| \| grouped_conv2d_fwd (f16) -j1 \| 313.31s \| 272.93s \| __12.9% faster__ \| \| grouped_conv1d_fwd (bf16) -j1 \| 79.33s \| 68.61s \| __13.5% faster__ \| \| grouped_conv1d_bwd_weight (f16) -j1\| 15.77s \| 14.31s \| __9.2% faster__ \| \| device_grouped_conv2d_fwd_instance -j64 \| s \| s \| __% faster__ \| #### Key Optimizations 1. __sequence_reverse_inclusive_scan/sequence_reverse__: O(N) → O(1) template depth 2. __sequence_map_inverse__: O(N) → O(1) template depth 3. __ford/static_ford__: O(N^D) → O(1) template depth using flat iteration with index decomposition 4. __Copilot feedback fixes__: Corrected New2Old mapping for non-trivial orderings ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-02-11 22:12:31 +00:00
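The "flat iteration + decomposition" and "pre-computed strides" rows in the commit above can be illustrated with a small sketch. This is not CK's `static_ford` or `index_decomposer`; the names `make_strides`, `decompose`, and `flat_for` are hypothetical, the index is passed as a plain `std::array` rather than a compile-time sequence, and C++17 is assumed. It only shows how one pack expansion over a linear index range can replace D levels of nested template recursion.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: row-major strides for the given dimension lengths,
// computed in an ordinary constexpr loop instead of recursive templates.
template <std::size_t D>
constexpr std::array<std::size_t, D> make_strides(const std::array<std::size_t, D>& lengths)
{
    std::array<std::size_t, D> strides{};
    std::size_t running = 1;
    for(std::size_t i = D; i-- > 0;)
    {
        strides[i] = running;
        running *= lengths[i];
    }
    return strides;
}

// Hypothetical helper: convert a linear index into a multi-index using the strides.
template <std::size_t D>
constexpr std::array<std::size_t, D> decompose(std::size_t linear,
                                               const std::array<std::size_t, D>& lengths)
{
    const auto strides = make_strides(lengths);
    std::array<std::size_t, D> idx{};
    for(std::size_t i = 0; i < D; ++i)
    {
        idx[i] = linear / strides[i];
        linear %= strides[i];
    }
    return idx;
}

// One fold expression visits every point of the iteration space in row-major order.
template <std::size_t... Ls, typename F, std::size_t... Is>
constexpr void flat_for_impl(std::index_sequence<Ls...>, F&& f, std::index_sequence<Is...>)
{
    constexpr std::array<std::size_t, sizeof...(Ls)> lengths{Ls...};
    (f(decompose(Is, lengths)), ...);
}

// Hypothetical entry point: flat_for<2, 3>(f) calls f for all 6 multi-indices.
template <std::size_t... Ls, typename F>
constexpr void flat_for(F&& f)
{
    flat_for_impl(std::index_sequence<Ls...>{}, f, std::make_index_sequence<(Ls * ... * 1)>{});
}

// Linear index 5 in a 2x3 space decomposes to (1, 2).
constexpr std::array<std::size_t, 2> kLens{2, 3};
static_assert(decompose(5, kLens)[0] == 1 && decompose(5, kLens)[1] == 2);
```

The template depth here does not grow with the number of dimensions: the compiler instantiates one `index_sequence` and one fold, and the per-point arithmetic runs in constexpr evaluation.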
Bartłomiej Kocot	02afe54a30	Revert "[CK Conv] Add bwd weight instance for large-k shape" (#4506 ) Reverts ROCm/rocm-libraries#4266 due to CI failures. Should be investigated by @johannes-graner	2026-02-11 21:36:53 +00:00
Christopher Millette	77169ae227	[CK] Optimize vector type build times (#4471 ) Supersedes https://github.com/ROCm/rocm-libraries/pull/4281 due to CI issues on import ## Proposed changes Build times can be affected by many different things and are largely attributable to the way we write and use the code. Two critical areas of the builds are frontend parsing and backend codegen and compilation. ### Frontend Parsing The length of the code, the include header tree and macro expansions all affect the front-end parsing time. This PR seeks to reduce the parsing time of the dtype_vector.hpp vector_type class by reducing redundant code by generalization. * Partial specializations of vector_type for native and non-native datatypes have been generalized to one single class, consolidating all of the data initialization and AsType casting requirements into one place. * The class nnvb_data_t_selector (i.e., Non-native vector base dataT selector) has been removed and replaced with scalar_type instantiations as they have the same purpose. Scalar type class' purpose is already to map generalized datatypes to native types compatible with ext_vector_t. ### Backend Codegen Template instantiation behavior can also affect build times. Recursive instantiations are very slow versus concrete instantiations. The compiler must make multiple passes to expand template instantiations so we need to be careful about how they are used. * Previous vector_type classes declared a union storage class, which aliases StaticallyIndexedArray<T,N>. ``` template <typename T> struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>> { using d1_t = T; typedef T d2_t __attribute__((ext_vector_type(2))); typedef T d4_t __attribute__((ext_vector_type(4))); using type = d4_t; union { d4_t d4_; StaticallyIndexedArray<d1_t, 4> d1x4_; StaticallyIndexedArray<d2_t, 2> d2x2_; StaticallyIndexedArray<d4_t, 1> d4x1_; } data_; ... 
}; ``` * Upon further inspection, StaticallyIndexedArray is built on top of a recursive Tuple concatenation. ``` template <typename T, index_t N> struct StaticallyIndexedArrayImpl { using type = typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type, typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type; }; ``` This union storage has been removed from the vector_type storage class. * Further references to StaticallyIndexedArray have been replaced with StaticallyIndexedArray_v2, which is a concrete implementation using C-style arrays. ``` template <typename T, index_t N> struct StaticallyIndexedArray_v2 { ... T data_[N]; }; ``` ### Fixes * Using bool datatype with vector_type was previously error prone. Bool, as a native datatype, would be stored into bool ext_vector_type(N) for storage, which is a packed datatype. For example, sizeof(bool ext_vector_type(4)) == 1, which does not equal sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4>) == 4. The union of these datatypes has incorrect data slicing, meaning that the bit locations of the packed bool do not match with the StaticallyIndexedArray member. As such, vector_type will use C-Style array storage for bool type instead of ext_vector_type. ``` template <typename T, index_t Rank> using NativeVectorT = T __attribute__((ext_vector_type(Rank))); sizeof(NativeVectorT<bool, 4>) == 1 (1 byte per 4 bool - packed) element0 = bit 0 of byte 0 element1 = bit 1 of byte 0 element2 = bit 2 of byte 0 element3 = bit 3 of byte 0 sizeof(StaticallyIndexedArray[NativeVectorT<bool, 1>, 4]) == 4 (1 byte per bool) element0 = bit 0 of byte 0 element1 = bit 0 of byte 1 element2 = bit 0 of byte 2 element3 = bit 0 of byte 3 union{ NativeVectorT<bool, 4> d1_t; ... StaticallyIndexedArray[NativeVectorT<bool,1>, 4] d4x1; }; // union size == 4 which means invalid slicing! 
``` * Math utilities such as next_power_of_two addressed for invalid cases of X < 2 * Remove redundant implementation of next_pow2 ### Additions * integer_log2_floor to math.hpp * is_power_of_two_integer to math.hpp ### Build Time Analysis Machine: banff-cyxtera-s78-2 Target: gfx942 \| Build Target \| Threads \| Frontend Parse Time (s) \| Backend Codegen Time (s) \| TotalTime (s) \| commitId \| \|---------------\|---------\|-------------------------\|--------------------------\|---------------\| ---------------\| \| device_grouped_conv3d_fwd_bias_bnorm_clamp_instance \| 1 \| 1452 \| 331 \| 1783 \| `2e08a7e` (develop) \| \| device_grouped_conv3d_fwd_bias_bnorm_clamp_instance \| 1 \| 1403 \| 332 \| 1735 (-2.7%) \| fad4235\| ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-11 11:59:43 -07:00
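The math utilities named in the commit above (`integer_log2_floor`, `is_power_of_two_integer`, and the `next_power_of_two` handling of X < 2) could look roughly like the following. This is a sketch under stated assumptions, not the code in CK's math.hpp: the `_sketch` names and signatures are hypothetical, and returning 1 for inputs below 2 is an assumed convention ("smallest power of two not less than x"), which the commit message does not spell out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name: true when x has exactly one bit set.
constexpr bool is_power_of_two_sketch(std::uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

// Hypothetical name: floor(log2(x)) for x >= 1.
// Assumption: x == 0 is outside the domain; this sketch returns 0 for it.
constexpr std::uint32_t integer_log2_floor_sketch(std::uint32_t x)
{
    std::uint32_t result = 0;
    while(x > 1)
    {
        x >>= 1;
        ++result;
    }
    return result;
}

// Hypothetical name: smallest power of two that is >= x.
// Assumptions: inputs below 2 map to 1, and x <= 2^31 so the shift cannot overflow.
constexpr std::uint32_t next_power_of_two_sketch(std::uint32_t x)
{
    if(x < 2)
    {
        return 1;
    }
    return is_power_of_two_sketch(x)
               ? x
               : (std::uint32_t{1} << (integer_log2_floor_sketch(x) + 1));
}

static_assert(is_power_of_two_sketch(64) && !is_power_of_two_sketch(0));
static_assert(integer_log2_floor_sketch(1) == 0 && integer_log2_floor_sketch(1023) == 9);
static_assert(next_power_of_two_sketch(0) == 1 && next_power_of_two_sketch(1) == 1);
static_assert(next_power_of_two_sketch(5) == 8 && next_power_of_two_sketch(8) == 8);
```

Keeping all three helpers as plain constexpr loops matches the commit's broader direction of moving work out of recursive template instantiation.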
Bartłomiej Kocot	790a786035	[CK][CK TILE] Add has hot loop check for pipeline v1 (#4407 ) ## Motivation Add has hot loop check for pipeline v1 (v1 basic and v1 basic async). Enable more tests which have been fixed by this change. ## Technical Details Hot loop has been executed without num loop check. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-651 AICK-663	2026-02-11 13:41:59 +00:00
assistant-librarian[bot]	4e7ecd7f40	[Conv] Add NumGroupsToMerge to BwdWeight type string (#4271 ) ## Proposed changes Add parameter to bwd weight V3 type string showing the number of groups to merge. This is required for MIOpen to be properly tuned since it uses type strings for performance database entries. In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3679](https://github.com/ROCm/composable_kernel/pull/3679) 🧑‍💻 Originally authored by @johannes-graner --------- Co-authored-by: Graner, Johannes <Johannes.Graner@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2026-02-11 10:07:53 +01:00
Cong Ma	cdba8b787c	[CK TILE] fix numerical errors of preshuffle_b (#4354 ) This pull request introduces several improvements and fixes related to quantized grouped GEMM (General Matrix Multiply) pipelines and their supporting utilities. # The numerical issue ## Steps to reproduce ```bash Run ./bin/tile_example_gemm_weight_preshuffle -prec=fp8 ./bin/tile_example_gemm_weight_preshuffle -prec=int4 ``` # Solution The main changes address type correctness, improve data layout and shuffling logic, and expand test coverage to better validate different GEMM configurations. Key changes include: ### Data layout and shuffling logic * Refactored the logic in `shuffle_b_permuteN` to use `constexpr` variables for `KLane` and `ItemsPerAccess`, simplifying tile view construction and correcting the permutation order for improved efficiency and correctness (`tensor_shuffle_utils.hpp`). * Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline policies to account for internal data type conversion (e.g., from `pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`, `gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`). [[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275) [[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105) ### Test infrastructure enhancements * Unit tests did not catch this issue since there were no tests for fp8. Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`) to support additional GEMM tile shapes and updated tests to run with these configurations for broader coverage (`test_gemm_pipeline_util.hpp`). 
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103) [[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-10 23:04:44 -08:00
assistant-librarian[bot]	844b598330	Add padding to cshuffle epilogue to avoid bank conflict (#4274 ) ## Proposed changes Added the padding to CShuffle Epilogue to avoid the bank conflicts of 64. Synced up and learned from the internal repo. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3671](https://github.com/ROCm/composable_kernel/pull/3671) 🧑‍💻 Originally authored by @ThomasNing --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-10 22:52:00 -07:00
Bartłomiej Kocot	7ba919330c	[CK][CK Tile] Temporary disable grouped conv fwd tile comp async instances (#4457 ) ## Motivation [CK][CK Tile] Temporarily disable grouped conv fwd tile comp async instances due to the failures ## Technical Details Disable configs so that these instances are not compiled ## Test Plan test_grouped_convnd_fwd_Tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-11 01:52:14 +00:00
Joseph Macaranas	88e581438e	[Azure External CI] Disable Azure CI on rocm-libraries (#4460 ) - Deleting all pipeline trigger files tied to Azure External CI from top-level and project-level. --------- Co-authored-by: amd-hsivasun <hsivasun@amd.com>	2026-02-10 18:02:22 -05:00
assistant-librarian[bot]	d8a911202a	Add a README.md file to ck/library/util (#4277 ) I'm collecting information about our current testing (#3664). As part of this work I added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances. This readme contains internal code comments for CK developers and does not need ROCm documentation review. --- 🔁 Imported from [ROCm/composable_kernel#3665](https://github.com/ROCm/composable_kernel/pull/3665) 🧑‍💻 Originally authored by @shumway Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-10 18:37:40 +00:00
assistant-librarian[bot]	4a42d6e2ba	Adapt parser to monorepo (#4269 ) ## Proposed changes Addressing issues found trying to run the dependency parser on MIOpen: - Ninja is recording the full path, e.g.: [json] ``` "file_to_executables": { "/home/rspauldi/repos/rocm-libraries/projects/miopen/include/miopen/miopen.h": [ ``` - Running git in monorepo reports the full _relative_ path, e.g.: ``` "projects/miopen/include/miopen/miopen.h" ``` Of course, `git diff` also returns all files modified in every other project's commits. These are filtered out as early as possible. This solution searches for `rocm-libraries` in the `parsing` step, and if found extracts the project name and stores it in `enhanced_dependency_mapping.json`. Leading folders are truncated from each file path, up to and including the project name. This allows `_is_project_file` to remain unchanged. The `selection` step then retrieves the project name from the json if it is defined, and truncates the project folder from the `git diff` output so the filenames exactly match the json entries. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. 
- [X] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion Successfully runs on rocm-libraries MIOpen PRs and produces a list of tests. I haven't verified the results yet. This version is not applicable to CI since it operates on a per-executable level and MIOpen CI uses the single gtest binary. I'll be working towards that in future PRs over the next few weeks. ``` /home/rspauldi/repos/rocm-libraries/projects/miopen# git checkout miopen/sgundabo_enable_ck_bwd_wrw_navi <run CMake with TEST_DISCRETE=ON> # ninja tests # root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py parse build/build.ninja Parsing ninja dependencies from: build/build.ninja Parsing ninja build file... Found 312 executables Found 820 object-to-source mappings Found 820 object files Extracting detailed dependencies for all object files... Processed 100/820 object files... Processed 200/820 object files... Processed 300/820 object files... Processed 400/820 object files... Processed 500/820 object files... Processed 600/820 object files... Processed 700/820 object files... Processed 800/820 object files... Completed dependency extraction for 820 object files Building file-to-executable mapping... 
Found rocm-libraries project: 'miopen' Built mapping for 608 files Files used by multiple executables: 216 Sample files with multiple dependencies: build/include/miopen/config.h: 306 executables build/include/miopen/export.h: 306 executables build/include/miopen/export_internals.h: 304 executables driver/InputFlags.hpp: 2 executables driver/driver.hpp: 2 executables === Enhanced Dependency Mapping Summary === Total executables: 312 Total files mapped: 608 Total object files processed: 820 File types: .cpp files: 310 .hpp files: 292 .h files: 6 Files used by multiple executables: 216 Top files with most dependencies: build/include/miopen/config.h: 306 executables build/include/miopen/export.h: 306 executables include/miopen/miopen.h: 304 executables src/include/miopen/config.hpp: 304 executables build/include/miopen/export_internals.h: 304 executables src/include/miopen/rank.hpp: 303 executables src/include/miopen/errors.hpp: 302 executables src/include/miopen/object.hpp: 302 executables src/include/miopen/returns.hpp: 302 executables src/include/miopen/sysinfo_utils.hpp: 302 executables Exporting mapping to build/enhanced_file_executable_mapping.csv Exporting complete mapping to build/enhanced_dependency_mapping.json Results exported to: CSV: build/enhanced_file_executable_mapping.csv JSON: build/enhanced_dependency_mapping.json root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py select build/enhanced_dependency_mapping.json 1b13d8b72d54e34bdc7ae70dd2b6e809dca8b10e 09e5965d55ebbfacfd1ed18e5092580c2ffae748 Identified 30 files modified in project 'miopen' Exported 304 tests to run to tests_to_run.json ``` I don't know if clang-format applies to scripts. If so, could someone show me how to run it in CK? --- 🔁 Imported from [ROCm/composable_kernel#3686](https://github.com/ROCm/composable_kernel/pull/3686) 🧑‍💻 Originally authored by @randyspauldingamd Co-authored-by: Randy J. 
Spaulding <rspauldi@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-10 18:37:40 +00:00
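The path handling described in the commit above (truncate leading folders up to and including the project name on the ninja side, then strip the project folder from `git diff` output so both sides match) can be sketched as follows. The actual tool is a Python script; this C++ sketch only illustrates the string logic. The function names, the `ProjectPath` struct, and the assumed `rocm-libraries/projects/<name>/` layout are all hypothetical and inferred from the example paths in the message.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: project name plus the path relative to that project.
struct ProjectPath
{
    std::string project;
    std::string relative;
};

// Parsing step (sketch): look for the monorepo marker in an absolute path recorded
// by ninja and cut everything up to and including the project folder.
// Assumption: projects live under "rocm-libraries/projects/<name>/".
std::optional<ProjectPath> strip_monorepo_prefix(std::string_view path)
{
    constexpr std::string_view marker = "rocm-libraries/projects/";
    const auto pos = path.find(marker);
    if(pos == std::string_view::npos)
    {
        return std::nullopt; // not a monorepo checkout
    }
    const auto name_begin = pos + marker.size();
    const auto name_end   = path.find('/', name_begin);
    if(name_end == std::string_view::npos)
    {
        return std::nullopt; // no file component after the project name
    }
    return ProjectPath{std::string(path.substr(name_begin, name_end - name_begin)),
                       std::string(path.substr(name_end + 1))};
}

// Selection step (sketch): git reports repo-relative paths such as
// "projects/miopen/include/miopen/miopen.h". Files from other projects are dropped.
std::optional<std::string> strip_project_prefix(std::string_view git_path,
                                                std::string_view project)
{
    const std::string prefix = "projects/" + std::string(project) + "/";
    if(git_path.substr(0, prefix.size()) != prefix)
    {
        return std::nullopt; // modified file belongs to a different project
    }
    return std::string(git_path.substr(prefix.size()));
}
```

With both sides reduced to project-relative paths, a lookup keyed on `include/miopen/miopen.h` matches regardless of where the repository was cloned.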
assistant-librarian[bot]	73c1257b0e	[CK Conv] Add bwd weight instance for large-k shape (#4266 ) ## Proposed changes This instance improves the shape used in `./bin/ckProfiler grouped_conv_bwd_weight 1 2 0 2 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1 all` from 10.3 ms to 6.6 ms. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3692](https://github.com/ROCm/composable_kernel/pull/3692) 🧑‍💻 Originally authored by @johannes-graner Co-authored-by: Graner, Johannes <johannes.graner@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2026-02-10 16:56:59 +00:00
assistant-librarian[bot]	2073f0a73b	[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant pipeline (#4268 ) ## Proposed changes Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [X] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [X] I have added inline documentation which enables the maintainers with understanding the motivation - [X] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [X] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3687](https://github.com/ROCm/composable_kernel/pull/3687) 🧑‍💻 Originally authored by @ErwinTerpstra --------- Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-02-10 06:57:55 -07:00
Yi DING	824af07002	[CK_TILE] Blockscale Gemm Fix Multi-Arch Compilation (#4451 ) ## Motivation This PR updates CK_TILE blockscale GEMM-quant kernels and launch helpers to compile across multiple GPU architectures by introducing compile-time availability gating and a new attribute tag mechanism for kernel symbol/attribute specialization. ## Technical Details - Add an architecture-guarded `kIsAvailable` flag to the gfx950 pipeline and propagate availability handling into `QuantGemmKernel`. - Extend `make_kernel`/`kentry` to accept an `Attr` tag enabling per-kernel compile-time attributes (e.g., `no-packed-fp32-ops`) and unique symbols. - Update the blockscale GEMM quant example to pass kernel attributes and adjust gfx950 gating. ## Test Plan - CI - Local test: `cmake .. --preset dev -DGPU_TARGETS='gfx942;gfx950' -GNinja && ninja tile_example_gemm_quant` - Local test with ROCm/aiter#1954 ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-10 12:41:09 +00:00
dependabot[bot]	50387dd79c	Bump fonttools from 4.57.0 to 4.61.0 in /projects/composablekernel/docs/sphinx (#3090 ) Bumps [fonttools](https://github.com/fonttools/fonttools) from 4.57.0 to 4.61.0. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/fonttools/fonttools/releases">fonttools's releases</a>.</em></p> <blockquote> <h2>4.61.0</h2> <ul> <li>[varLib.main]: <strong>SECURITY</strong> Only use basename(vf.filename) to prevent path traversal attacks when running <code>fonttools varLib</code> command-line script, or code which invokes <code>fonttools.varLib.main()</code>. Fixes CVE-2025-66034, see: <a href="https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv">https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv</a>.</li> <li>[feaLib] Sort BaseLangSysRecords by tag (<a href="https://redirect.github.com/fonttools/fonttools/issues/3986">#3986</a>).</li> <li>Drop support for EOL Python 3.9 (<a href="https://redirect.github.com/fonttools/fonttools/issues/3982">#3982</a>).</li> <li>[instancer] Support --remove-overlaps for fonts with CFF2 table (<a href="https://redirect.github.com/fonttools/fonttools/issues/3975">#3975</a>).</li> <li>[CFF2ToCFF] Add --remove-overlaps option (<a href="https://redirect.github.com/fonttools/fonttools/issues/3976">#3976</a>).</li> <li>[feaLib] Raise an error for rsub with NULL target (<a href="https://redirect.github.com/fonttools/fonttools/issues/3979">#3979</a>).</li> <li>[bezierTools] Fix logic bug in curveCurveIntersections (<a href="https://redirect.github.com/fonttools/fonttools/issues/3963">#3963</a>).</li> <li>[feaLib] Error when condition sets have the same name (<a href="https://redirect.github.com/fonttools/fonttools/issues/3958">#3958</a>).</li> <li>[cu2qu.ufo] skip processing empty glyphs to support sparse kerning masters (<a href="https://redirect.github.com/fonttools/fonttools/issues/3956">#3956</a>).</li> <li>[unicodedata] Update to 
Unicode 17. Require <code>unicodedata2 >= 17.0.0</code> when installed with 'unicode' extra.</li> </ul> <h2>4.60.1</h2> <ul> <li>[ufoLib] Reverted accidental method name change in <code>UFOReader.getKerningGroupConversionRenameMaps</code> that broke compatibility with downstream projects like defcon (<a href="https://redirect.github.com/fonttools/fonttools/issues/3948">#3948</a>, <a href="https://redirect.github.com/fonttools/fonttools/issues/3947">#3947</a>, <a href="https://redirect.github.com/robotools/defcon/issues/478">robotools/defcon#478</a>).</li> <li>[ufoLib] Added test coverage for <code>getKerningGroupConversionRenameMaps</code> method (<a href="https://redirect.github.com/fonttools/fonttools/issues/3950">#3950</a>).</li> <li>[subset] Don't try to subset BASE table; pass it through by default instead (<a href="https://redirect.github.com/fonttools/fonttools/issues/3949">#3949</a>).</li> <li>[subset] Remove empty BaseRecord entries in MarkBasePos lookups (<a href="https://redirect.github.com/fonttools/fonttools/issues/3897">#3897</a>, <a href="https://redirect.github.com/fonttools/fonttools/issues/3892">#3892</a>).</li> <li>[subset] Add pruning for MarkLigPos and MarkMarkPos lookups (<a href="https://redirect.github.com/fonttools/fonttools/issues/3946">#3946</a>).</li> <li>[subset] Remove duplicate features when subsetting (<a href="https://redirect.github.com/fonttools/fonttools/issues/3945">#3945</a>).</li> <li>[Docs] Added documentation for the visitor module (<a href="https://redirect.github.com/fonttools/fonttools/issues/3944">#3944</a>).</li> </ul> <h2>4.60.0</h2> <ul> <li> <p>[pointPen] Allow <code>reverseFlipped</code> parameter of <code>DecomposingPointPen</code> to take a <code>ReverseFlipped</code> enum value to control whether/how to reverse contour direction of flipped components, in addition to the existing True/False. 
This allows to set <code>ReverseFlipped.ON_CURVE_FIRST</code> to ensure that the decomposed outline starts with an on-curve point before being reversed, for better consistency with other segment-oriented contour transformations. The change is backward compatible, and the default behavior hasn't changed (<a href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</p> </li> <li> <p>[filterPen] Added <code>ContourFilterPointPen</code>, base pen for buffered contour operations, and <code>OnCurveStartPointPen</code> filter to ensure contours start with an on-curve point (<a href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</p> </li> <li> <p>[cu2qu] Fixed difference in cython vs pure-python complex division by real number (<a href="https://redirect.github.com/fonttools/fonttools/issues/3930">#3930</a>).</p> </li> <li> <p>[varLib.avar] Refactored and added some new sub-modules and scripts (<a href="https://redirect.github.com/fonttools/fonttools/issues/3926">#3926</a>).</p> <ul> <li><code>varLib.avar.build</code> module to build avar (and a missing fvar) binaries into a possibly empty TTFont,</li> <li><code>varLib.avar.unbuild</code> module to print a .designspace snippet that would generate the same avar binary,</li> <li><code>varLib.avar.map</code> module to take TTFont and do the mapping, in user/normalized space,</li> <li><code>varLib.avar.plan</code> module moved from <code>varLib.avarPlanner</code>.</li> </ul> <p>The bare <code>fonttools varLib.avar</code> script is deprecated, in favour of <code>fonttools varLib.avar.build</code> (or <code>unbuild</code>).</p> </li> <li> <p>[interpolatable] Clarify <code>linear_sum_assignment</code> backend options and minimal dependency usage (<a href="https://redirect.github.com/fonttools/fonttools/issues/3927">#3927</a>).</p> </li> <li> <p>[post] Speed up <code>build_psNameMapping</code> (<a href="https://redirect.github.com/fonttools/fonttools/issues/3923">#3923</a>).</p> </li> 
<li> <p>[ufoLib] Added typing annotations to fontTools.ufoLib (<a href="https://redirect.github.com/fonttools/fonttools/issues/3875">#3875</a>).</p> </li> </ul> <h2>4.59.2</h2> <ul> <li>[varLib] Clear <code>USE_MY_METRICS</code> component flags when inconsistent across masters (<a href="https://redirect.github.com/fonttools/fonttools/issues/3912">#3912</a>).</li> <li>[varLib.instancer] Avoid negative advance width/height values when instatiating HVAR/VVAR, (unlikely in well-behaved fonts) (<a href="https://redirect.github.com/fonttools/fonttools/issues/3918">#3918</a>).</li> <li>[subset] Fix shaping behaviour when pruning empty mark sets (<a href="https://redirect.github.com/fonttools/fonttools/issues/3915">#3915</a>, <a href="https://redirect.github.com/harfbuzz/harfbuzz/issues/5499">harfbuzz/harfbuzz#5499</a>).</li> <li>[cu2qu] Fixed <code>dot()</code> product of perpendicular vectors not always returning exactly 0.0 in all Python implementations (<a href="https://redirect.github.com/fonttools/fonttools/issues/3911">#3911</a>)</li> <li>[varLib.instancer] Implemented fully-instantiating <code>avar2</code> fonts (<a href="https://redirect.github.com/fonttools/fonttools/issues/3909">#3909</a>).</li> <li>[feaLib] Allow float values in <code>VariableScalar</code>'s axis locations (<a href="https://redirect.github.com/fonttools/fonttools/issues/3906">#3906</a>, <a href="https://redirect.github.com/fonttools/fonttools/issues/3907">#3907</a>).</li> <li>[cu2qu] Handle special case in <code>calc_intersect</code> for degenerate cubic curves where 3 to 4 control points are equal (<a href="https://redirect.github.com/fonttools/fonttools/issues/3904">#3904</a>).</li> </ul> <h2>4.59.1</h2> <ul> <li>[featureVars] Update OS/2.usMaxContext if possible after addFeatureVariationsRaw (<a href="https://redirect.github.com/fonttools/fonttools/issues/3894">#3894</a>).</li> <li>[vhmtx] raise TTLibError('not enough data...') when hmtx/vmtx are truncated (<a 
href="https://redirect.github.com/fonttools/fonttools/issues/3843">#3843</a>, <a href="https://redirect.github.com/fonttools/fonttools/issues/3901">#3901</a>).</li> <li>[feaLib] Combine duplicate features that have the same set of lookups regardless of the order in which those lookups are added to the feature (<a href="https://redirect.github.com/fonttools/fonttools/issues/3895">#3895</a>).</li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/fonttools/fonttools/blob/main/NEWS.rst">fonttools's changelog</a>.</em></p> <blockquote> <h2>4.61.0 (released 2025-11-28)</h2> <ul> <li>[varLib.main]: <strong>SECURITY</strong> Only use basename(vf.filename) to prevent path traversal attacks when running <code>fonttools varLib</code> command, or code which invokes <code>fonttools.varLib.main()</code>. Fixes CVE-2025-66034, see: <a href="https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv">https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv</a>.</li> <li>[feaLib] Sort BaseLangSysRecords by tag (<a href="https://redirect.github.com/fonttools/fonttools/issues/3986">#3986</a>).</li> <li>Drop support for EOL Python 3.9 (<a href="https://redirect.github.com/fonttools/fonttools/issues/3982">#3982</a>).</li> <li>[instancer] Support --remove-overlaps for fonts with CFF2 table (<a href="https://redirect.github.com/fonttools/fonttools/issues/3975">#3975</a>).</li> <li>[CFF2ToCFF] Add --remove-overlaps option (<a href="https://redirect.github.com/fonttools/fonttools/issues/3976">#3976</a>).</li> <li>[feaLib] Raise an error for rsub with NULL target (<a href="https://redirect.github.com/fonttools/fonttools/issues/3979">#3979</a>).</li> <li>[bezierTools] Fix logic bug in curveCurveIntersections (<a href="https://redirect.github.com/fonttools/fonttools/issues/3963">#3963</a>).</li> <li>[feaLib] Error when condition sets 
have the same name (<a href="https://redirect.github.com/fonttools/fonttools/issues/3958">#3958</a>).</li> <li>[cu2qu.ufo] skip processing empty glyphs to support sparse kerning masters (<a href="https://redirect.github.com/fonttools/fonttools/issues/3956">#3956</a>).</li> <li>[unicodedata] Update to Unicode 17. Require <code>unicodedata2 >= 17.0.0</code> when installed with 'unicode' extra.</li> </ul> <h2>4.60.1 (released 2025-09-29)</h2> <ul> <li>[ufoLib] Reverted accidental method name change in <code>UFOReader.getKerningGroupConversionRenameMaps</code> that broke compatibility with downstream projects like defcon (<a href="https://redirect.github.com/fonttools/fonttools/issues/3948">#3948</a>, <a href="https://redirect.github.com/fonttools/fonttools/issues/3947">#3947</a>, <a href="https://redirect.github.com/robotools/defcon/issues/478">robotools/defcon#478</a>).</li> <li>[ufoLib] Added test coverage for <code>getKerningGroupConversionRenameMaps</code> method (<a href="https://redirect.github.com/fonttools/fonttools/issues/3950">#3950</a>).</li> <li>[subset] Don't try to subset BASE table; pass it through by default instead (<a href="https://redirect.github.com/fonttools/fonttools/issues/3949">#3949</a>).</li> <li>[subset] Remove empty BaseRecord entries in MarkBasePos lookups (<a href="https://redirect.github.com/fonttools/fonttools/issues/3897">#3897</a>, <a href="https://redirect.github.com/fonttools/fonttools/issues/3892">#3892</a>).</li> <li>[subset] Add pruning for MarkLigPos and MarkMarkPos lookups (<a href="https://redirect.github.com/fonttools/fonttools/issues/3946">#3946</a>).</li> <li>[subset] Remove duplicate features when subsetting (<a href="https://redirect.github.com/fonttools/fonttools/issues/3945">#3945</a>).</li> <li>[Docs] Added documentation for the visitor module (<a href="https://redirect.github.com/fonttools/fonttools/issues/3944">#3944</a>).</li> </ul> <h2>4.60.0 (released 2025-09-17)</h2> <ul> <li>[pointPen] Allow 
<code>reverseFlipped</code> parameter of <code>DecomposingPointPen</code> to take a <code>ReverseFlipped</code> enum value to control whether/how to reverse contour direction of flipped components, in addition to the existing True/False. This allows to set <code>ReverseFlipped.ON_CURVE_FIRST</code> to ensure that the decomposed outline starts with an on-curve point before being reversed, for better consistency with other segment-oriented contour transformations. The change is backward compatible, and the default behavior hasn't changed (<a href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</li> <li>[filterPen] Added <code>ContourFilterPointPen</code>, base pen for buffered contour operations, and <code>OnCurveStartPointPen</code> filter to ensure contours start with an on-curve point (<a href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</li> <li>[cu2qu] Fixed difference in cython vs pure-python complex division by real number (<a href="https://redirect.github.com/fonttools/fonttools/issues/3930">#3930</a>).</li> <li>[varLib.avar] Refactored and added some new sub-modules and scripts (<a href="https://redirect.github.com/fonttools/fonttools/issues/3926">#3926</a>). <ul> <li><code>varLib.avar.build</code> module to build avar (and a missing fvar) binaries into a possibly empty TTFont,</li> <li><code>varLib.avar.unbuild</code> module to print a .designspace snippet that would generate the same avar binary,</li> <li><code>varLib.avar.map</code> module to take TTFont and do the mapping, in user/normalized space,</li> <li><code>varLib.avar.plan</code> module moved from <code>varLib.avarPlanner</code>. 
The bare <code>fonttools varLib.avar</code> script is deprecated, in favour of <code>fonttools varLib.avar.build</code> (or <code>unbuild</code>).</li> </ul> </li> <li>[interpolatable] Clarify <code>linear_sum_assignment</code> backend options and minimal dependency usage (<a href="https://redirect.github.com/fonttools/fonttools/issues/3927">#3927</a>).</li> <li>[post] Speed up <code>build_psNameMapping</code> (<a href="https://redirect.github.com/fonttools/fonttools/issues/3923">#3923</a>).</li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="`e691e3bef9`"><code>e691e3b</code></a> Release 4.61.0</li> <li><a href="`c2d540f4ad`"><code>c2d540f</code></a> Update NEWS.rst</li> <li><a href="`3859753a05`"><code>3859753</code></a> Update NEWS.rst</li> <li><a href="`26eb070a55`"><code>26eb070</code></a> black</li> <li><a href="`5ff73af326`"><code>5ff73af</code></a> Merge commit from fork</li> <li><a href="`a696d5ba93`"><code>a696d5b</code></a> varLib: only use the basename(vf.filename)</li> <li><a href="`b00bc459ef`"><code>b00bc45</code></a> varLib_test: test path traversal in variable-font filename</li> <li><a href="`066512e4f3`"><code>066512e</code></a> Merge pull request <a href="https://redirect.github.com/fonttools/fonttools/issues/3986">#3986</a> from cmyr/base-minmax-sorting</li> <li><a href="`ce78973e97`"><code>ce78973</code></a> [feaLib] Sort BasLangSysRecords by tag</li> <li><a href="`5bb37dc201`"><code>5bb37dc</code></a> Merge pull request <a href="https://redirect.github.com/fonttools/fonttools/issues/3983">#3983</a> from fonttools/dependabot/pip/brotli-1.2.0</li> <li>Additional commits viewable in <a href="https://github.com/fonttools/fonttools/compare/4.57.0...4.61.0">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=fonttools&package-manager=pip&previous-version=4.57.0&new-version=4.61.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) > Note > Automatic rebases have been disabled on this pull request as it has been open for over 30 days. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-10 07:07:06 +00:00
assistant-librarian[bot]	87fd831d43	[CK Tools] Auto-enable unbuffered output for Python commands (#4265) `ck-docker exec` and `ck-exec` now automatically detect Python commands and set `PYTHONUNBUFFERED=1` to enable live output streaming. This eliminates the need to set the environment variable manually when running Python scripts that print progress updates. The detection matches `python`, `python3`, or any `.py` file argument, which makes it possible to watch live terminal output while a Python script runs inside the container. --- 🔁 Imported from [ROCm/composable_kernel#3694](https://github.com/ROCm/composable_kernel/pull/3694) 🧑‍💻 Originally authored by @AviralGoelAMD Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Claude (claude-opus-4.5) <noreply@anthropic.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-02-10 02:59:58 +00:00
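The detection rule described in this commit can be sketched as follows. This is an illustrative Python rendering under stated assumptions, not the code from `ck-docker` or `ck-exec`: the function names `looks_like_python` and `exec_env` are hypothetical, and checking only the first token for `python`/`python3` is an assumption about how the match is applied.

```python
import os
from typing import Dict, Mapping, Sequence


def looks_like_python(argv: Sequence[str]) -> bool:
    """Hypothetical check: the command is python/python3 or passes a .py file."""
    if not argv:
        return False
    # Assumption: the interpreter name is matched on the first token only.
    if os.path.basename(argv[0]) in ("python", "python3"):
        return True
    return any(arg.endswith(".py") for arg in argv)


def exec_env(argv: Sequence[str], base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with PYTHONUNBUFFERED=1 for Python commands."""
    env = dict(base_env)
    if looks_like_python(argv):
        env["PYTHONUNBUFFERED"] = "1"
    return env


if __name__ == "__main__":
    print(exec_env(["python3", "tune.py"], {"HOME": "/root"}))
    print(exec_env(["ninja", "-j8"], {"HOME": "/root"}))
```

Setting the variable only when a Python command is detected leaves the environment of other commands untouched.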
dependabot[bot]	9ecc13eecd	Bump pillow from 11.2.1 to 11.3.0 in /projects/composablekernel/docs/sphinx (#475 ) Bumps [pillow](https://github.com/python-pillow/Pillow) from 11.2.1 to 11.3.0. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/python-pillow/Pillow/releases">pillow's releases</a>.</em></p> <blockquote> <h2>11.3.0</h2> <p><a href="https://pillow.readthedocs.io/en/stable/releasenotes/11.3.0.html">https://pillow.readthedocs.io/en/stable/releasenotes/11.3.0.html</a></p> <h2>Deprecations</h2> <ul> <li>Deprecate fromarray mode argument <a href="https://redirect.github.com/python-pillow/Pillow/issues/9018">#9018</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Deprecate saving I mode images as PNG <a href="https://redirect.github.com/python-pillow/Pillow/issues/9023">#9023</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> </ul> <h2>Documentation</h2> <ul> <li>Added release notes for <a href="https://redirect.github.com/python-pillow/Pillow/issues/9041">#9041</a> <a href="https://redirect.github.com/python-pillow/Pillow/issues/9042">#9042</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Add release notes for <a href="https://redirect.github.com/python-pillow/Pillow/issues/8912">#8912</a> and <a href="https://redirect.github.com/python-pillow/Pillow/issues/8969">#8969</a> <a href="https://redirect.github.com/python-pillow/Pillow/issues/9019">#9019</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>ImageFont does not handle multiline text <a href="https://redirect.github.com/python-pillow/Pillow/issues/9000">#9000</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Updated Ubuntu CI targets <a href="https://redirect.github.com/python-pillow/Pillow/issues/8988">#8988</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Update MinGW package 
names <a href="https://redirect.github.com/python-pillow/Pillow/issues/8987">#8987</a> [<a href="https://github.com/H4M5TER"><code>@H4M5TER</code></a>]</li> <li>Updated docstring <a href="https://redirect.github.com/python-pillow/Pillow/issues/8943">#8943</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Mention that tobytes() with the raw encoder uses Pack.c <a href="https://redirect.github.com/python-pillow/Pillow/issues/8878">#8878</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Refactor docs <code>Makefile</code> <a href="https://redirect.github.com/python-pillow/Pillow/issues/8933">#8933</a> [<a href="https://github.com/hugovk"><code>@hugovk</code></a>]</li> <li>Add template for quarterly release issue <a href="https://redirect.github.com/python-pillow/Pillow/issues/8932">#8932</a> [<a href="https://github.com/aclark4life"><code>@aclark4life</code></a>]</li> <li>Add list of third party plugins <a href="https://redirect.github.com/python-pillow/Pillow/issues/8910">#8910</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Update redirected URL <a href="https://redirect.github.com/python-pillow/Pillow/issues/8919">#8919</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Docs: use sentence case for headers <a href="https://redirect.github.com/python-pillow/Pillow/issues/8914">#8914</a> [<a href="https://github.com/hugovk"><code>@hugovk</code></a>]</li> <li>Docs: remove unused Makefile targets <a href="https://redirect.github.com/python-pillow/Pillow/issues/8917">#8917</a> [<a href="https://github.com/hugovk"><code>@hugovk</code></a>]</li> <li>Remove indentation from lists <a href="https://redirect.github.com/python-pillow/Pillow/issues/8915">#8915</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Python 3.13 is tested on Arch <a href="https://redirect.github.com/python-pillow/Pillow/issues/8894">#8894</a> 
[<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Move XV Thumbnails to read only section <a href="https://redirect.github.com/python-pillow/Pillow/issues/8893">#8893</a> [<a href="https://github.com/aclark4life"><code>@aclark4life</code></a>]</li> <li>Updated macOS tested Pillow versions <a href="https://redirect.github.com/python-pillow/Pillow/issues/8890">#8890</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> </ul> <h2>Dependencies</h2> <ul> <li>Add AVIF to wheels using only aomenc and dav1d AVIF codecs for reduced size <a href="https://redirect.github.com/python-pillow/Pillow/issues/8858">#8858</a> [<a href="https://github.com/fdintino"><code>@fdintino</code></a>]</li> <li>Use same AVIF URL when fetching dependency <a href="https://redirect.github.com/python-pillow/Pillow/issues/8871">#8871</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Update dependency mypy to v1.16.1 <a href="https://redirect.github.com/python-pillow/Pillow/issues/9026">#9026</a> [@<a href="https://github.com/apps/renovate">renovate[bot]</a>]</li> <li>Update libpng to 1.6.49 <a href="https://redirect.github.com/python-pillow/Pillow/issues/9014">#9014</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Update dependency cibuildwheel to v3 <a href="https://redirect.github.com/python-pillow/Pillow/issues/9010">#9010</a> [@<a href="https://github.com/apps/renovate">renovate[bot]</a>]</li> <li>Updated libjpeg-turbo to 3.1.1 <a href="https://redirect.github.com/python-pillow/Pillow/issues/9009">#9009</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Update dependency mypy to v1.16.0 <a href="https://redirect.github.com/python-pillow/Pillow/issues/8991">#8991</a> [@<a href="https://github.com/apps/renovate">renovate[bot]</a>]</li> <li>Updated libpng to 1.6.48 <a href="https://redirect.github.com/python-pillow/Pillow/issues/8940">#8940</a> [<a 
href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Updated Ghostscript to 10.5.1 <a href="https://redirect.github.com/python-pillow/Pillow/issues/8939">#8939</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Updated harfbuzz to 11.2.1 <a href="https://redirect.github.com/python-pillow/Pillow/issues/8937">#8937</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Updated libavif to 1.3.0 <a href="https://redirect.github.com/python-pillow/Pillow/issues/8949">#8949</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Update dependency cibuildwheel to v2.23.3 <a href="https://redirect.github.com/python-pillow/Pillow/issues/8931">#8931</a> [@<a href="https://github.com/apps/renovate">renovate[bot]</a>]</li> <li>Updated harfbuzz to 11.1.0 <a href="https://redirect.github.com/python-pillow/Pillow/issues/8904">#8904</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> </ul> <h2>Testing</h2> <ul> <li>Add <code>match</code> parameter to <code>pytest.warns()</code> <a href="https://redirect.github.com/python-pillow/Pillow/issues/9038">#9038</a> [<a href="https://github.com/hugovk"><code>@hugovk</code></a>]</li> <li>Increase pytest verbosity <a href="https://redirect.github.com/python-pillow/Pillow/issues/9040">#9040</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Improve SgiImagePlugin test coverage <a href="https://redirect.github.com/python-pillow/Pillow/issues/8896">#8896</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> <li>Update ruff pre-commit ID <a href="https://redirect.github.com/python-pillow/Pillow/issues/8994">#8994</a> [<a href="https://github.com/radarhere"><code>@radarhere</code></a>]</li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="`89f1f4626a`"><code>89f1f46</code></a> 11.3.0 version bump</li> <li><a href="`f2de251c76`"><code>f2de251</code></a> Updated check script paths (<a href="https://redirect.github.com/python-pillow/Pillow/issues/9052">#9052</a>)</li> <li><a href="`84855d11c8`"><code>84855d1</code></a> Raise FileNotFoundError when opening an empty path (<a href="https://redirect.github.com/python-pillow/Pillow/issues/9048">#9048</a>)</li> <li><a href="`204d11d4da`"><code>204d11d</code></a> Raise FileNotFoundError when opening an empty path</li> <li><a href="`2b39f7581e`"><code>2b39f75</code></a> Handle IPTC TIFF tags with incorrect type (<a href="https://redirect.github.com/python-pillow/Pillow/issues/8925">#8925</a>)</li> <li><a href="`e7a53ba19b`"><code>e7a53ba</code></a> Do not update palette for L mode GIF frame (<a href="https://redirect.github.com/python-pillow/Pillow/issues/8924">#8924</a>)</li> <li><a href="`c22230b761`"><code>c22230b</code></a> Use save parameters as encoderinfo defaults (<a href="https://redirect.github.com/python-pillow/Pillow/issues/9001">#9001</a>)</li> <li><a href="`da10ed1cf3`"><code>da10ed1</code></a> Add support for iOS (<a href="https://redirect.github.com/python-pillow/Pillow/issues/9030">#9030</a>)</li> <li><a href="`be2b4e7864`"><code>be2b4e7</code></a> Fix qtables and quality scaling (<a href="https://redirect.github.com/python-pillow/Pillow/issues/8879">#8879</a>)</li> <li><a href="`d4162f8505`"><code>d4162f8</code></a> Updated return type</li> <li>Additional commits viewable in <a href="https://github.com/python-pillow/Pillow/compare/11.2.1...11.3.0">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pillow&package-manager=pip&previous-version=11.2.1&new-version=11.3.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-09 19:49:39 -07:00
Bartłomiej Kocot	0b9fa702ac	[CK] CK Tile grouped convolution direct load (#4406 ) ## Motivation CK Tile grouped convolution forward direct load support. ## Technical Details Basic pipeline for direct load and new instances for forward for v1 and v4 pipelines. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-130	2026-02-09 22:08:57 +01:00
assistant-librarian[bot]	bb8c746cbc	Enable group mode (varlen) kernel generation for PyTorch integration (#4292 ) ## Proposed changes This PR enables group mode (variable-length attention) kernel generation for PyTorch's CK SDPA backend. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion The change is minimal (single line deletion) but enables a significant feature: variable-length attention support for ROCm users via PyTorch's torch.nn.attention.varlen API. --- 🔁 Imported from [ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553) 🧑‍💻 Originally authored by @chinmaydk99 Co-authored-by: Chinmay_Kuchinad <ChinmayDattanand.Kuchinad@amd.com>	2026-02-09 20:58:57 +00:00
Bartłomiej Kocot	72016e355e	[CK] Fix grouped conv fwd transform for merged groups (#4399 ) ## Motivation [CK] Fix grouped conv fwd transform for merged groups for 1d and 3d. ## Technical Details After optimizations for 2d there is a lack of implementation for 1d and 3d ## Test Plan test_grouped_convnd_fwd ## Test Result pending CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-09 09:36:52 -06:00

3909 Commits