[CK Profiler] Instance selection for grouped conv profilers
(#4800)
## Motivation
This PR adds instance selection support for ckProfiler grouped
convolution operations (forward, backward data, backward weight),
allowing users to run specific kernel instances rather than sweeping all
available instances.
When profiling or debugging convolution kernels, users often need to
test specific kernel configurations without running the full instance
sweep. This is particularly useful for:
- Debugging a specific failing instance
- Profiling a known-best configuration
- Quick validation during development
## Technical Details
**Features added**:
- `--instance <id>` flag to run only the N-th valid instance (0-indexed)
- `--list-instances` flag to list all valid instances without running
any kernels
- Named arguments can appear anywhere on the command line
- Best instance index is now printed with results for reference
- Python script support via `-ii` / `--instance_index` arguments
**Design decisions**:
- Named arguments (`--instance`, `--list-instances`) instead of
positional to avoid conflicts with existing parameters
- Instance index refers to the N-th valid instance (0-indexed), not the
global instance index
- Auto-disable verification when `--list-instances` is used for fast
enumeration
- Shared utilities in `profiler_arg_utils.hpp` to deduplicate parsing
logic
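The "N-th valid instance" rule above can be sketched in Python as follows. This is only an illustration of the indexing semantics, not the profiler's C++ code: `select_instances`, `is_supported`, and the warning text are hypothetical names chosen for this example.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def select_instances(
    instances: Sequence[str],
    is_supported: Callable[[str], bool],
    instance_index: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Return (valid_index, instance) pairs that should be run."""
    # Only instances that support the problem receive an index, so the
    # numbering is dense, starts at 0, and differs from the global index.
    valid = list(enumerate(inst for inst in instances if is_supported(inst)))
    if instance_index is None:
        return valid  # no --instance flag: sweep every valid instance
    if not 0 <= instance_index < len(valid):
        # Hypothetical warning text; an out-of-range index runs nothing.
        print(f"warning: instance index {instance_index} is out of range; "
              f"{len(valid)} valid instance(s) available")
        return []
    return [valid[instance_index]]
```

Because unsupported instances never consume an index, the indices printed by a listing pass and the index accepted by a selection pass refer to the same numbering for a given problem.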
## Test Plan
Manual testing with various scenarios:
List all valid instances:
```bash
./bin/ckProfiler grouped_conv_fwd <usual args> --list-instances
```
Run only instance 5:
```bash
./bin/ckProfiler grouped_conv_fwd <usual args> --instance 5
```
Test cases:
- Single instance selection
- List instances mode
- Out-of-bounds instance index (verified warning messages)
- No instance flag (runs all instances - default behavior)
- All three operations (fwd, bwd_data, bwd_weight)
## Test Result
All test scenarios passed:
- Instance selection correctly filters kernel executions
- List mode enumerates valid instances without running kernels
- Invalid indices produce appropriate warnings without crashing
- Default behavior (all instances) unchanged when flags not provided
- Consistent behavior across all three grouped convolution operations
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_BUILDER] ck builder conv transfer fix
## Motivation
This PR fixes how CK Builder validates the transfer vector size and adds
proper validation for the LDS transfer vector size.
## Changes:
* [__source vector dim__] -- Before this PR, the data transfer validation
logic did not allow the source vectorized dimension to be set to 1. However,
some CK instances do this when group merging is used. This applies only to
the `DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle` kernel.
* [__valid vector size__] -- Before this PR, the validation logic
considered only the maximum vector size of a single instruction. However, our
buffer loading logic supports loading more values by issuing multiple
buffer instructions, and some convolution instances rely on this. The
validation logic now reflects this behavior.
* [__valid LDS vector size__] -- Before this PR, the LDS vector size
was validated in the same way as VMEM. This PR adds proper LDS
vector size validation based on the available LDS instruction sizes.
## Test Plan
Run CK BUILDER conv fwd factories tests
## Test Result
All CK BUILDER conv fwd factories work (except the DL and CK Tile ones,
which have not been added yet).
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Add project root marker for monorepo compatibility
(#4430)
## Summary
- Add `.ck-project-root` marker file at the composablekernel project
root
- Update `find_project_root()` in `script/tools/common.sh` to look for
this marker instead of `.git`
- Fixes project root detection when CK is part of the rocm-libraries
monorepo
## Background
Since the project was moved into the monorepo, the `.git` directory is
at the monorepo root rather
than the CK project root. This caused `find_project_root()` to return
the wrong path, breaking tools
in `script/tools/`.
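The marker-based lookup can be sketched as below. The real `find_project_root()` is a shell function in `script/tools/common.sh`; this Python version only illustrates the idea of walking up the directory tree until the marker file is found, and its signature is an assumption made for the example.

```python
from pathlib import Path
from typing import Optional

MARKER = ".ck-project-root"  # marker file name taken from the PR summary


def find_project_root(start: Path, marker: str = MARKER) -> Optional[Path]:
    """Return the nearest ancestor of `start` (or `start` itself) holding `marker`."""
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_file():
            return candidate
    return None  # reached the filesystem root without finding the marker
```

Searching for a dedicated marker rather than `.git` means the lookup stops at the CK project directory even when the repository root sits several levels higher.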
## Test plan
- [x] Verify `find_project_root` returns correct path from any CK
subdirectory
- [x] Verify `ck-build --help` works
- [x] Verify `ck-configure --help` works
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Add multi-file trace parsing and analysis pipeline
Extends build time analysis from ROCm/composable_kernel#3644 to handle
multiple trace files across build directories (see #4229):
- pipeline.py: Generic pipeline framework with fluent interface for
composable data processing. Provides parallel processing, progress
tracking, and error handling independent of trace-specific code.
Processes thousands of trace files at default resolution in minutes,
aggregating results into in-memory DataFrames for analysis.
- parse_build.py: Parse all trace files in a build directory
- build_analysis_example.ipynb: Demonstrates pipeline aggregation across
all build files
The pipeline design improves capability (composable operations),
performance (parallel processing), and user-friendliness (fluent API) of
the analysis modules. It enables analyzing compilation patterns across
the entire codebase with all trace data available in pandas DataFrames
for interactive exploration.
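As a rough illustration of what a fluent, parallel pipeline with error handling can look like, consider the sketch below. The `Pipeline` class, its method names, and the use of a thread pool are assumptions made for this example; they are not the interface of `pipeline.py`, and the sketch collects plain lists rather than DataFrames.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Tuple


class Pipeline:
    """Hypothetical fluent pipeline; illustrative only, not pipeline.py's API."""

    def __init__(self, items: Iterable[Any],
                 errors: Iterable[Tuple[Any, Exception]] = ()):
        self._items = list(items)
        self._errors = list(errors)

    def map(self, fn: Callable[[Any], Any], workers: int = 4) -> "Pipeline":
        """Apply fn to every item in parallel, recording failures instead of raising."""

        def safe(item: Any) -> Tuple[bool, Any]:
            try:
                return True, fn(item)
            except Exception as exc:  # keep going; report the failure later
                return False, (item, exc)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(safe, self._items))  # preserves input order
        good = [value for ok, value in results if ok]
        bad = [value for ok, value in results if not ok]
        return Pipeline(good, self._errors + bad)

    def filter(self, predicate: Callable[[Any], bool]) -> "Pipeline":
        """Keep only the items for which predicate returns True."""
        return Pipeline((i for i in self._items if predicate(i)), self._errors)

    def collect(self) -> List[Any]:
        return list(self._items)

    @property
    def errors(self) -> List[Tuple[Any, Exception]]:
        return list(self._errors)
```

Each step returns a new `Pipeline`, so calls chain, for example `Pipeline(paths).map(parse).filter(keep).collect()`, and a single unreadable trace file ends up in `errors` instead of aborting the whole run.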
[CK] Fix the launch_tests script.
## Motivation
Fix the script that filters the tests.
## Technical Details
There were several places where the paths had to be updated for the
launch_tests script to work correctly.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Adapt parser to monorepo
## Proposed changes
Addressing issues found trying to run the dependency parser on MIOpen:
- Ninja is recording the full path, e.g.:
```json
"file_to_executables": {
"/home/rspauldi/repos/rocm-libraries/projects/miopen/include/miopen/miopen.h": [
```
- Running git in monorepo reports the full _relative_ path, e.g.:
```
"projects/miopen/include/miopen/miopen.h"
```
Of course, `git diff` also returns all files modified in every other
project's commits. These are filtered out as early as possible.
This solution searches for `rocm-libraries` in the `parsing` step, and
if found extracts the project name and stores it in
`enhanced_dependency_mapping.json`. Leading folders are truncated from
each file path, up to and including the project name. This allows
`_is_project_file` to remain unchanged.
The `selection` step then retrieves the project name from the json if it
is defined, and truncates the project folder from the `git diff` output
so the filenames exactly match the json entries.
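The two truncation steps described above can be sketched as follows. The function names are hypothetical, and the sketch assumes the `rocm-libraries/projects/<project>/...` layout seen in the example paths; it does not reproduce the parser's actual code.

```python
from typing import List, Optional, Tuple

MONOREPO_NAME = "rocm-libraries"


def split_monorepo_path(path: str) -> Tuple[Optional[str], str]:
    """Return (project, project-relative path) for a path recorded by ninja.

    Paths outside the assumed <...>/rocm-libraries/projects/<project>/ layout
    are returned unchanged with project set to None.
    """
    parts = path.split("/")
    if MONOREPO_NAME in parts:
        i = parts.index(MONOREPO_NAME)
        # Expect "projects/<project>/<file...>" right after the monorepo folder.
        if len(parts) > i + 3 and parts[i + 1] == "projects":
            return parts[i + 2], "/".join(parts[i + 3:])
    return None, path


def filter_git_diff(changed: List[str], project: Optional[str]) -> List[str]:
    """Drop files from other projects and strip the project folder prefix."""
    if project is None:
        return list(changed)  # not a monorepo build: use git's paths as-is
    prefix = f"projects/{project}/"
    return [p[len(prefix):] for p in changed if p.startswith(prefix)]
```

With both sides reduced to project-relative paths, the `git diff` output can be compared against the keys of the JSON mapping by plain string equality.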
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
Successfully runs on rocm-libraries MIOpen PRs and produces a list of
tests. I haven't verified the results yet.
This version is not applicable to CI since it operates on a
per-executable level and MIOpen CI uses the single gtest binary. I'll be
working towards that in future PRs over the next few weeks.
```
/home/rspauldi/repos/rocm-libraries/projects/miopen# git checkout miopen/sgundabo_enable_ck_bwd_wrw_navi
<run CMake with TEST_DISCRETE=ON>
# ninja tests
# root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py parse build/build.ninja
Parsing ninja dependencies from: build/build.ninja
Parsing ninja build file...
Found 312 executables
Found 820 object-to-source mappings
Found 820 object files
Extracting detailed dependencies for all object files...
Processed 100/820 object files...
Processed 200/820 object files...
Processed 300/820 object files...
Processed 400/820 object files...
Processed 500/820 object files...
Processed 600/820 object files...
Processed 700/820 object files...
Processed 800/820 object files...
Completed dependency extraction for 820 object files
Building file-to-executable mapping...
Found rocm-libraries project: 'miopen'
Built mapping for 608 files
Files used by multiple executables: 216
Sample files with multiple dependencies:
build/include/miopen/config.h: 306 executables
build/include/miopen/export.h: 306 executables
build/include/miopen/export_internals.h: 304 executables
driver/InputFlags.hpp: 2 executables
driver/driver.hpp: 2 executables
=== Enhanced Dependency Mapping Summary ===
Total executables: 312
Total files mapped: 608
Total object files processed: 820
File types:
.cpp files: 310
.hpp files: 292
.h files: 6
Files used by multiple executables: 216
Top files with most dependencies:
build/include/miopen/config.h: 306 executables
build/include/miopen/export.h: 306 executables
include/miopen/miopen.h: 304 executables
src/include/miopen/config.hpp: 304 executables
build/include/miopen/export_internals.h: 304 executables
src/include/miopen/rank.hpp: 303 executables
src/include/miopen/errors.hpp: 302 executables
src/include/miopen/object.hpp: 302 executables
src/include/miopen/returns.hpp: 302 executables
src/include/miopen/sysinfo_utils.hpp: 302 executables
Exporting mapping to build/enhanced_file_executable_mapping.csv
Exporting complete mapping to build/enhanced_dependency_mapping.json
Results exported to:
CSV: build/enhanced_file_executable_mapping.csv
JSON: build/enhanced_dependency_mapping.json
root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py select build/enhanced_dependency_mapping.json 1b13d8b72d54e34bdc7ae70dd2b6e809dca8b10e 09e5965d55ebbfacfd1ed18e5092580c2ffae748
Identified 30 files modified in project 'miopen'
Exported 304 tests to run to tests_to_run.json
```
I don't know if clang-format applies to scripts. If so, could someone
show me how to run it in CK?
[CK Tools] Auto-enable unbuffered output for Python commands
(#4265)
ck-docker exec and ck-exec now automatically detect Python commands and
set PYTHONUNBUFFERED=1 to enable live output streaming. This eliminates
the need to manually set the environment variable when running Python
scripts that print progress updates.
The detection matches python, python3, or any .py file argument.
This helps in watching live terminal output when a python script is
running inside the container.
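The detection rule can be sketched in Python as below. The actual tools are shell scripts, so this is only an illustration of the matching logic; the function names are hypothetical, and treating `python`/`python3` as a match only when it is the command name is an assumption.

```python
import os
from typing import Dict, List, Optional


def is_python_command(argv: List[str]) -> bool:
    """True if the command is python/python3 or any argument is a .py file."""
    if not argv:
        return False
    if os.path.basename(argv[0]) in ("python", "python3"):
        return True
    return any(arg.endswith(".py") for arg in argv)


def env_for_command(argv: List[str],
                    base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and enable unbuffered output for Python commands."""
    env = dict(os.environ if base_env is None else base_env)
    if is_python_command(argv):
        # setdefault keeps a value the user has already chosen.
        env.setdefault("PYTHONUNBUFFERED", "1")
    return env
```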
[CK] a bunch of CI fixes.
## Motivation
Fixing some of the CK CI issues
## Technical Details
- Fix paths to Dockerfiles and scripts.
- Move codegen tests to a separate stage (they collide with the main build
because cmake must be called from the same folder with different options).
- Fix a couple of clang compilation issues with the staging compiler.
[CK] MICI: Fix git diff in selective_test_filter.py
## Motivation
- git diff needs access to reference repo
## Technical Details
- mount reference repo path into docker for selective_test_filter.py to
access
## Test Plan
- tested in MICI
## Test Result
- launch_tests.sh ran successfully
* Decouple configure/build/test tools from Docker
Create a two-layer tool architecture:
- Core tools (ck-configure, ck-build, ck-test): Environment-agnostic,
work on any system with ROCm - no Docker dependency
- Container tools (ck-docker): Manage Docker containers and delegate
to core tools via docker exec
Changes:
- Add ck-configure: New CMake configuration tool with preset support,
native GPU detection, and flexible options
- Refactor ck-build: Remove Docker dependency, add --configure and
--list options, call ninja directly
- Refactor ck-test: Remove Docker dependency, add CTest integration
with --smoke/--regression/--all options
- Enhance common.sh: Add native GPU detection, build directory utils,
and output helpers
- Update ck-docker: Add configure/build/test/exec commands that
delegate to core tools inside container
This enables:
- Native development on ROCm hosts without Docker
- Simpler CI/CD integration
- Consistent behavior inside and outside containers
Co-Authored-By: Claude <noreply@anthropic.com>
* Add ck-rocprof: GPU profiling tool for rocprof-compute
Adds a command-line profiling tool to simplify GPU performance
analysis workflow using AMD rocprof-compute.
Features:
- Easy setup with automatic Python venv configuration
- Simple CLI: setup, run, analyze, compare, list
- Automatic GPU architecture detection
- Focus on LDS metrics (Block 12) for bank conflict analysis
- Comprehensive documentation with examples and troubleshooting
Usage:
ck-rocprof setup # One-time environment setup
ck-rocprof run <name> <executable> # Profile executable
ck-rocprof analyze <name> [block] # Analyze metrics
ck-rocprof compare <name1> <name2> # Compare two runs
ck-rocprof list # List available runs
* Make ck-rocprof documentation concise and improve Docker integration
- Streamlined documentation from 416 to 157 lines (62% reduction)
- Focused on essential commands, metrics, and workflows
- Enhanced script to run all operations inside Docker containers
- Fixed workload directory path and improved container management
- Added automatic rocprofiler-compute installation and dependency handling
* Add --no-roof flag to ck-rocprof profile command
Skip roofline analysis by default to speed up profiling. Roofline
analysis can add significant time to profiling runs but is not
needed for most LDS bank conflict analysis workflows.
* Make ck-rocprof work independently of Docker
Add native execution mode that runs rocprof-compute directly on the host
system when available, falling back to Docker mode when not.
Key changes:
- Auto-detect native mode when rocprof-compute is in PATH or common locations
- Add execution mode wrappers (exec_cmd, file_exists, dir_exists, etc.)
- Native mode stores venv at .ck-rocprof-venv in project root
- Native mode stores workloads at build/workloads/
- Support user-installed rocprofiler-compute (e.g., ~/.local/rocprofiler-compute)
- Add CK_FORCE_DOCKER env var to force Docker mode
- Update help message to show current execution mode
- Maintain full backward compatibility with existing Docker workflow
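A minimal Python sketch of the mode selection described above, under stated assumptions: `find_rocprof_compute` and `select_mode` are hypothetical names, the candidate list is supplied by the caller, and any non-empty `CK_FORCE_DOCKER` value is treated as "force Docker". The real tool is a shell script and may differ in all of these details.

```python
import os
import shutil
from typing import Iterable, Mapping, Optional


def find_rocprof_compute(candidates: Iterable[str] = (),
                         search_path: Optional[str] = None) -> Optional[str]:
    """Look for rocprof-compute on PATH first, then in explicit candidate files."""
    found = shutil.which("rocprof-compute", path=search_path)
    if found:
        return found
    for candidate in candidates:
        path = os.path.expanduser(candidate)  # allow "~/.local/..." entries
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def select_mode(env: Mapping[str, str], candidates: Iterable[str] = ()) -> str:
    """Return "native" when the tool is available locally, otherwise "docker"."""
    if env.get("CK_FORCE_DOCKER"):  # assumption: any non-empty value forces Docker
        return "docker"
    tool = find_rocprof_compute(candidates, env.get("PATH"))
    return "native" if tool else "docker"
```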
Tested successfully with rocprofiler-compute 3.4.0 installed from source
on MI300X GPU in native mode.
Co-Authored-By: Claude <noreply@anthropic.com>
* Add clean/status commands and improve ck-rocprof robustness
- Add 'clean' command to remove profiling runs (supports --all)
- Add 'status' command to show configuration and environment info
- Add workload name validation to prevent path traversal attacks
- Fix uv installation to use pip instead of curl for reliability
- Add cross-platform stat support for macOS compatibility
- Consolidate ROCPROF_CANDIDATES to avoid code duplication
- Expand help documentation with all profiling block descriptions
- Fix Docker wrapper script escaping issues
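Workload name validation against path traversal can be sketched as follows. The allowed character set is an assumption chosen for this example, and `validate_workload_name` is a hypothetical name; the script's actual rule may be stricter or looser.

```python
import re

# Assumed rule: start with a letter or digit, then letters, digits, '.', '_' or '-'.
_VALID_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def validate_workload_name(name: str) -> str:
    """Return name unchanged if it is safe to use as a directory name."""
    # fullmatch rejects '/', whitespace, and empty names; the explicit ".."
    # check additionally rejects names such as "a..b".
    if not _VALID_NAME.fullmatch(name) or ".." in name:
        raise ValueError(f"invalid workload name: {name!r}")
    return name
```

Rejecting separators and `..` up front means a name such as `../../etc` can never escape the workloads directory when it is later joined onto a path.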
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix analyze command to use correct workload path
rocprof-compute stores results directly in the workload directory
(pmc_perf.csv) rather than in a GPU architecture subdirectory.
Updated find_workload_path to detect this correctly.
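The lookup change can be illustrated with the sketch below. It is a Python approximation of the shell function, and keeping a fallback to a one-level subdirectory search is an assumption, not something the commit message states.

```python
from pathlib import Path
from typing import Optional

RESULT_FILE = "pmc_perf.csv"  # file name taken from the commit message


def find_workload_path(workload_dir: Path) -> Optional[Path]:
    """Return the directory that holds pmc_perf.csv for a workload, if any."""
    if (workload_dir / RESULT_FILE).is_file():
        return workload_dir  # results stored directly in the workload directory
    if workload_dir.is_dir():
        # Assumed fallback: results nested one level down, e.g. per GPU arch.
        for child in sorted(workload_dir.iterdir()):
            if child.is_dir() and (child / RESULT_FILE).is_file():
                return child
    return None
```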
Co-Authored-By: Claude <noreply@anthropic.com>
* Address PR review security and robustness issues
Security fixes:
- Escape executable path in cmd_run to prevent shell injection
- Add workload name validation to cmd_analyze and cmd_compare
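The quoting idea behind the shell-injection fix can be sketched in Python with `shlex.quote`. The helper name is hypothetical and the sketch shows only the escaping step, not the command that `cmd_run` actually builds.

```python
import shlex
from typing import List


def build_shell_command(executable: str, args: List[str]) -> str:
    """Join an executable and its arguments into one safely quoted shell string."""
    # shlex.quote wraps any part containing shell metacharacters in single
    # quotes, so "; rm -rf x" stays a literal part of the argument.
    return " ".join(shlex.quote(part) for part in [executable, *args])
```

Splitting the result with `shlex.split` returns the original parts, which is a convenient way to check that nothing was reinterpreted by the shell grammar.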
Robustness improvements:
- Add error checking for uv package manager installation
- Use consistent project root detection (find_project_root || get_project_root)
- Use /opt/rocm instead of hardcoded /opt/rocm-7.0.1 in Docker mode
- Derive ROCM_REQUIREMENTS path from ROCPROF_BIN for flexibility
- Use gfx950 as fallback GPU consistent with common.sh
Documentation updates:
- Fix env var name GPU_TARGET -> CK_GPU_TARGET
- Update storage layout to reflect current structure (workloads/<name>/)
- Document clean and status commands
- Clarify native vs Docker default paths
Co-Authored-By: Claude <noreply@anthropic.com>
* Simplify ck-rocprof to native-only mode
Remove Docker mode from ck-rocprof. Docker users should run the tool
via `ck-docker exec ck-rocprof ...` instead.
This simplification:
- Removes ~210 lines of Docker-specific code
- Eliminates mode detection complexity
- Makes the script easier to maintain
- Provides clearer error messages when rocprof-compute is not found
The setup command now lists all searched locations when rocprof-compute
is not found, helping users understand how to install it.
Co-Authored-By: Claude <noreply@anthropic.com>
* Add rocprofiler-compute source installation fallback
When rocprof-compute is not found in system locations, automatically
install rocprofiler-compute 3.4.0 from source as a fallback. This
eliminates the hard dependency on system ROCm packages.
Implementation details:
- Clone rocprofiler-compute from GitHub to ~/.local/
- Install dependencies via requirements.txt (not editable install)
- Create wrapper that sets PYTHONPATH to source directory
- Execute source script directly rather than importing as module
This approach matches the project's development workflow and works
around the incomplete pyproject.toml that prevents editable installs.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575).
The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration.
Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows.
This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection.
* CMakeLists.txt hack for Windows.
* Add Windows build instructions.
* Fix type issue with variadic min function.
* Use std::common_type to fix the variadic min/max functions.
* Enable CPU guard compilation on Windows.
* Suppress warnings related to std::getenv on Windows platform.
* Git ignore the output directory on Windows platform.
* Powershell script for running tests and generating reports.
* Improve test logging.
* Disable non-conv tests.
* Fix Debug build on Windows.
* More debug build changes.
* Update Windows build instructions.
* Enable all tests.
* Test fixes.
* Suppress not found linker options warning.
* Update unsigned long literals and format specifiers to work correctly in Windows
* Fix conv 3D bwd weight bilinear tests on Windows.
* Revert changes on .gitignore.
* Clean-up CMake project file for Windows builds.
* clang-format
* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.
* Fix building examples on Windows.
* Update Readme.
* Remove the suppression of the deprecated warnings.
* Remove Windows specific min/max implementations from CK Tile math core.
* Remove unnecessary no-op on Windows.
---------
Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>
* Removing hard-coded trace filename
* Including stage name in notification
* Simplifying capture setup and tagging file names with arch
* Removed test property from notification message
* Fixing regex to get arch name
* Fixing error in notification and modified regex
* generate and visualize build traces for all archs
* generate build traces in all cases
* fix jenkins logic
* fix typo
* use more threads for parsing dependency map
* add script to parse ninja traces and issue warnings
* fix python script syntax and header
* fix python syntax one more time
* fix python syntax
* Add README.md for testing
* Add tensor_memory_manager.
* ck-builder: tensor memory manager rebase fixes
This fixes some issues caused by the API being changed recently.
Also, this streamlines the ckt namespace to always be ck_tile::builder::test,
as this is already being used by other tests
Really, this commit should be squashed into the previous,
but I'm keeping it separate for brevity.
* ck-builder: test arguments initial prototype
* ck-builder: test system initial prototype
* ck-builder: fix non-standardized copyright comments
* ck-builder: new prototype
* ck-builder: group testing inputs/outputs into a separate structure
This is basically the return of the tensor memory manager after all,
except that the design is more closely tied to the actual operation.
Using a struct allows us to add additional input/output tensors
without breaking code (by defaulting those new parameters). Note
that the tensors are split into a separate inputs/outputs because we
usually want to allocate the output _twice_: once for the real
computation and once for the reference computation.
* ck-builder: simplify prototype naming; start docs
* ck-builder: update testing readme
* ck-builder: testing documentation
* ck-builder: HipStatusMatcher
This matcher can be used to check HIP status codes and provide
nice and readable error messages.
* ck-builder: tensor_buffer.hpp tests
* ck-builder: conv_fwd.hpp tests
* ck-builder: add example end-to-end test in conv fwd 2d fp16
* ck-builder: simplify extent usage
* ck-builder: update testing doc
* ck-builder: skip end to end test on non-gfx9
* fix check_copyright_year interpreter
/bin/bash is not guaranteed to exist on Linux. Signed,
a NixOS user
* ck-builder: fix copyrights
* ck-builder: reduce conv fwd testing size
This test allocated 24GB of memory, too much for 16GB cards.
---------
Co-authored-by: John Shumway <jshumway@amd.com>
This pull request adds some initial "factory tests": these check that the instances used in MIOpen are actually present in CK. The main reasons for this are documentation and sanity checking. It's likely that these tests will become outdated quickly, so we'll have to maintain them, but fortunately this is quite straightforward and shouldn't take much time once they are in place.
* Run ctest with --output-on-failure
* Fix synchronization issues in bwd pipelines
The bwd kernel reuses the same area of LDS for ds (SGrad), bias and
dbias (BiasGrad). This means that there must be block_sync_lds between
loading one tensor and storing another to the same area.
Heavy instructions like MFMA/WMMA and global loads are executed between
reuses of the same memory so in MOST cases loading is finished by all
warps before storing is started. However, sometimes warps progress at
different speeds.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:
bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure
* initial work on adding support of gfx12 in tile_engine for GEMM benchmarking
* add stage("Run TILE_ENGINE_GEMM Tests on gfx1201") to Jenkins config
* make tile_[m/n/k] validation arch dependent
* docs(README): update readme with new build instructions
* feat(grouped_gemm): add support back for non persistent kernel
* refactor(grouped_gemm): simplify tensor creation
* refactor(grouped_gemm): Persistence is now a GemmConfig value for easier management
* chore(grouped_gemm): add print statements to ease debugging
* WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration
* fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity
* WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines
* chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle
* style: clang format work so far
* BUG!(grouped_gemm_kernel.hpp): found a potential bug causing numerical errors in the preshuffle pipeline
* fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors
* refactor(gemm_preshuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm
* chore(grouped_gemm): add/remove debug comments and debug print statements
* feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes
* chore(gemm_profile): add new argument combinations
* fix: branch cleanup, formatting, refactoring
* chore(changelog): update changelog to reflect new feature
* address review comments & nit
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as it's a third-party library
* Separating function and definition.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Making edits to identify individual compilation issues.
* Minor fix for blob txt files not being created.
* Fixing compilation issues.
* Fixing ordering bug.
* Adding python profiling functionality.
* Setting individual build as default.
* Setting gpu target filtering for tile engine to gfx90a, gfx942 and gfx950.
* update the default running parameters and settings
* Fixing bug with benchmarking, shifting file generation to build instead of config.
* Updating fixes.
* Fixing json output and parsing.
* Disable ccache for tile engine gemm ops because we don't need it.
* Removing duplicate type definition.
* Improving json printing.
* Add the flexibility of different layout and more warp tile support
* Fix extra flag in name of individual kernels.
* Fixing bug with booleans.
* Solve the first patch of the post merge conflict
* Compilation fixes, and cosmetic improvements.
* Yet again compilation fixes after latest changes from develop.
* Fixing python benchmarking script.
---------
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
Co-authored-by: Vidyasagar Ananthan <vanantha@amd.com>
* update the performance test logic
* fix unstash perf logs logic
* untangle unstashing fmha logs for different archs
* run process stage after running fmha tests
* fix the processing of perf logs
* fix arguments for run_performance scripts
* script for generating list of not referenced files in tests, list is in json format
* script comment added
* added empty line at the end of the script
* format changes
* something khushbu can help with
* v1 v2 works with flatmm develop
* v0 v1 v2 numerical error gone
* Fixing numerical error, and interchange preshuffle configs to match with flatmm
* Refactor GEMM pipeline configurations and integrate preshuffle support
- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments
* clang format
* fix vectorloadsize bug
* add the Preshuffle3
* update kwarp calculation in gemm utils
* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch
* fix: add CK_GFX950_SUPPORT macro for gfx950 detection
* default disable rotating buffer
* docs(CHANGELOG): update changelog for rocm 7.0
* Revert "docs(CHANGELOG): update changelog for rocm 7.0"
This reverts commit 2bc16fff84.
* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.
* revert example/ck_tile/flatmm to its original state
* remove comment added by second author
* switch to xor ALDSDescriptor
* modify the MakeALdsDescriptor()
* temporary profiling script
* getting rid of line marker compiler error
* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy
* add a minor fix for the config
* typo fix
* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2
* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp
* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp
* reenable the GemmSpatiallyLocalTilePartitioner
* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline
* remove hardcoded true for preshuffle bool template argument
* rename script
* remove gemm_profilie.sh script
* merge conflict resolve
* clang formatted
* typo fix
* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp
* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy
* Fix missing newline at end of file in run_gemm_example.inc
* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1
* addressing review comments
* removing debug code
* addressing review comments
* Revert "addressing review comments"
This reverts commit 29c45192ba.
* updating tile_engine code
* addressing review comments
---------
Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
All our platforms support C++20 now, so update to the C++20 standard
for language features such as concepts, designated initializers,
range-based for loops with initializers, and consteval. This PR only
switches the compiler flags to C++20; it makes no other changes.
* Updating runtime log message for CK TILE ENGINE
* CKTile layout from config
* CKTile custom config for CI
* Documentation for Layout Changes
* CKTile Layout changes to Jenkins
* Fixing Clang Format
* Changes to Jenkins file to fix error
* fix(cmake-ck-dev): no longer sets invalid values as gpu arch
* style(py files): ruff formatting
* fix(cmake-ck-release): no longer sets invalid values as gpu arch
* chore(cmake-tile_engine): add reminder to uncomment user config json
* Changes to Jenkins file to address more cases
* Changes to Jenkins to fix Error
* Changes to Jenkins file for fixing an error
* Update Jenkinsfile (#2517)
* Update Jenkinsfile
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>