# Data Generation Guide

This document explains how to build benchmark binaries from the CK Tile engine, generate benchmark datasets, and manage them for the ML kernel performance prediction system.

## Overview

The ML heuristic needs benchmark data: measured TFLOPS, latency, and bandwidth for every (problem shape, kernel config) pair. The tile engine builds one executable per kernel configuration. Each executable benchmarks a single kernel on a given problem size and outputs JSON with performance metrics.

```
CK source --> CMake configure --> ninja build --> benchmark binaries (4608 per op/dtype/layout)

benchmark binaries --> run on GPU --> streaming log --> parquet dataset
                       (per shape)    (JSON blocks)     (canonical schema)
```

## Prerequisites

- **ROCm**: HIP >= 6.0.3 (for gfx950: HIP >= 6.0.4)
- **Build tools**: CMake >= 3.21, Ninja, HIP-aware clang compiler
- **Python**: 3.10+ with `pandas`, `pyarrow`
- **GPU**: ROCm-capable AMD GPU (MI250X, MI300X, MI355X, etc.)

---

## Part 1: Building Benchmark Binaries from the Tile Engine

If you already have pre-built binaries (e.g., in `/workspace/ck_tile/bin/`), skip to Part 2. This section explains how to build them from source.

### Step 1: CMake Configure

From the CK repository root:

```bash
cmake -S /workspace/rocm-libraries/projects/composablekernel \
  -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx950" \
  -DGEMM_UNIVERSAL_DATATYPE="fp8" \
  -DGEMM_UNIVERSAL_LAYOUT="rcr" \
  -G Ninja
```

**Key CMake variables:**

| Variable | Default | Description |
|---|---|---|
| `GPU_TARGETS` | (required) | Target GPU architectures. Supported: `gfx90a`, `gfx942`, `gfx950`, `gfx1201`. Semicolon-separated for multiple. |
| `GEMM_UNIVERSAL_DATATYPE` | `"fp8;fp16"` | Data types to build. Options: `fp8`, `fp16`, `bf16`, `bf8`. Semicolon-separated. |
| `GEMM_UNIVERSAL_LAYOUT` | `"rcr;rrr;crr;ccr"` | Layouts to build. Semicolon-separated. |
| `GEMM_UNIVERSAL_CONFIG_FILE` | `"default_config.json"` | Kernel config file (in the `configs/` directory). Controls which tile sizes, warp configs, pipelines, etc. are enumerated. |
| `ENABLE_CCACHE_GEMM_UNIVERSAL` | `OFF` | Enable ccache for faster rebuilds. |

**Example: build only fp8 RCR for gfx950 (fastest, ~4608 kernels):**

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx950" \
  -DGEMM_UNIVERSAL_DATATYPE="fp8" \
  -DGEMM_UNIVERSAL_LAYOUT="rcr" \
  -G Ninja
```

**Example: build all dtypes and layouts (slow, ~4608 * 4 * 4 = ~73K kernels):**

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx950" \
  -DGEMM_UNIVERSAL_DATATYPE="fp8;fp16;bf16;bf8" \
  -DGEMM_UNIVERSAL_LAYOUT="rcr;rrr;crr;ccr" \
  -G Ninja
```

### What happens during configure

1. CMake calls `gemm_universal_instance_builder.py --list_kernels` to enumerate all valid kernel configurations from the config JSON.
2. It writes `gemm_universal_kernel_list.txt` (one kernel per line) and `gemm_universal_kernel_count.txt` to the build directory (a quick way to inspect these is shown below).
3. For each kernel, it creates a ninja build target.
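Before kicking off a long build, it can be worth sanity-checking this enumeration. A minimal check, assuming the two files land directly under the build directory (adjust the paths if your build tree places them elsewhere):

```bash
# Count and preview the kernels CMake enumerated during configure
wc -l build/gemm_universal_kernel_list.txt
cat build/gemm_universal_kernel_count.txt
head -3 build/gemm_universal_kernel_list.txt
```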
### Step 2: Build

```bash
# Build all benchmarks for the configured dtypes/layouts
ninja -C build benchmark_gemm_universal_all

# Or build a specific dtype/layout combo
ninja -C build benchmark_gemm_universal_fp8_rcr

# Or build by pipeline type
ninja -C build benchmark_gemm_universal_compv4_pipeline
ninja -C build benchmark_gemm_universal_mem_pipeline

# Or build a single specific kernel
ninja -C build benchmark_gemm_universal_fp8_rcr_compv3_cshuffle_intrawave_False_False_False_False_128x128x128_1x4x1_16x16x128
```

**Build time estimates:**

- ~4608 kernels (one dtype, one layout): 1-4 hours depending on CPU cores
- Use `-j <N>` to control parallelism: `ninja -C build -j 32 benchmark_gemm_universal_fp8_rcr`

### Step 3: Verify binaries

Binaries are placed in `build/bin/`:

```bash
ls build/bin/benchmark_gemm_universal_fp8_rcr_* | wc -l
# Expected: 4608 (for default config)

# Test one binary
./build/bin/benchmark_gemm_universal_fp8_rcr_compv3_cshuffle_intrawave_False_False_False_False_128x128x128_1x4x1_16x16x128 \
  -m=1024 -n=1024 -k=1024 -warmup=3 -repeat=10 -verify=0
```

### Kernel config files

The config files live in:

```
tile_engine/ops/gemm/gemm_universal/configs/
  default_config.json        # Default: full enumeration
  default_ci_config.json     # CI: reduced set for fast testing
  user_provided_config.json  # Custom: your own subset
```

To use a custom config:

```bash
cmake ... -DGEMM_UNIVERSAL_CONFIG_FILE="user_provided_config.json"
```

The config controls which tile sizes (e.g., 128x128x64, 256x256x32), warp configurations (e.g., 2x2x1, 1x4x1), pipelines (compv3, compv4, mem), schedulers, and other parameters are included in the kernel enumeration.

### Building StreamK / other ops

The same pattern applies to other tile engine ops:

```bash
# StreamK
ninja -C build benchmark_gemm_streamk_fp8_rcr

# Grouped convolution
ninja -C build benchmark_grouped_conv_fwd_fp16_nhwgc
```

Each op has its own instance builder and config directory.

---

## Part 2: Running Benchmarks and Generating Data

## Quick Start

### 1. Run benchmarks for a set of shapes

Each binary accepts `-m=`, `-n=`, `-k=`, `-warmup=`, `-repeat=`, `-verify=` flags and outputs JSON to stdout:

```bash
/workspace/ck_tile/bin/benchmark_gemm_universal_fp8_rcr_compv3_cshuffle_intrawave_False_False_False_False_128x128x128_1x4x1_16x16x128 \
  -m=1024 -n=1024 -k=1024 -warmup=3 -repeat=10 -verify=0
```

Output:

```json
{
  "name": "gemm_universal_fp8_rcr_compv3_cshuffle_intrawave_...",
  "problem": {
    "split_k": 1, "m": 1024, "n": 1024, "k": 1024,
    "dtype_a": "fp8", "dtype_b": "fp8", ...
  },
  "perf_result": {
    "latency(ms)": 0.04,
    "tflops(TFlops)": 204.60,
    "bandwidth(GB/s)": 624.39
  }
}
```

### 2. Batch generation using provided scripts

**Wide coverage (diverse shapes across all regimes):**

```bash
python3 generate_wide_coverage.py \
  --bin_dir /workspace/ck_tile/bin \
  --out_dir data/wide_coverage \
  --batch_size 25 \
  --warmup 3 --repeat 10
```

**Edge-case dimensions (N=1, K=1, small N/K):**

```bash
python3 generate_edge_dims.py
```

Both scripts write streaming log files that `data_pipeline.py` can parse.

### 3. Parse logs into parquet

```bash
python3 data_pipeline.py \
  -o data/my_dataset.parquet \
  --arch gfx950 \
  --capture_hw
```

The `--capture_hw` flag runs `rocminfo` once and injects the GPU hardware profile (CU count, clock speed, cache sizes, etc.) into every row.
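If you just want to peek at a log outside the pipeline, the parsing step amounts to pulling the JSON blocks out of the stream. A minimal sketch, assuming the log is a sequence of JSON objects like the output shown above (the real `data_pipeline.py` additionally derives the kernel config columns and, with `--capture_hw`, the hardware profile):

```python
import json
import pandas as pd

def parse_log(path: str) -> pd.DataFrame:
    """Extract JSON benchmark blocks from a streaming log into a DataFrame."""
    decoder = json.JSONDecoder()
    with open(path) as f:
        text = f.read()
    rows, pos = [], 0
    while True:
        start = text.find("{", pos)          # skip progress lines / whitespace
        if start == -1:
            break
        try:
            obj, pos = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1                   # not a JSON block, keep scanning
            continue
        prob, perf = obj.get("problem", {}), obj.get("perf_result", {})
        rows.append({
            "kernel_name": obj.get("name"),
            "m": prob.get("m"), "n": prob.get("n"), "k": prob.get("k"),
            "split_k": prob.get("split_k", 1),
            "latency_ms": perf.get("latency(ms)"),
            "measured_tflops": perf.get("tflops(TFlops)"),
            "bandwidth_gb_s": perf.get("bandwidth(GB/s)"),
        })
    df = pd.DataFrame(rows)
    if not df.empty:
        tflops = pd.to_numeric(df["measured_tflops"], errors="coerce")
        latency = pd.to_numeric(df["latency_ms"], errors="coerce")
        df["is_valid"] = (tflops > 0) & (latency > 0)
    return df

# Example (hypothetical log path following the layout in "File Organization"):
# parse_log("data/wide_coverage/wide_coverage_batch_001.log").to_parquet("peek.parquet")
```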
## Canonical Data Schema

Every parquet file follows this schema:

| Column | Type | Description |
|---|---|---|
| `op_type` | str | `gemm_universal`, `gemm_streamk`, etc. |
| `dtype` | str | `fp8`, `fp16`, `bf16`, `bf8` |
| `layout` | str | `rcr`, `rrr`, `crr`, `ccr` |
| `arch` | str | `gfx942`, `gfx950`, etc. |
| `kernel_name` | str | Full kernel identifier |
| `m`, `n`, `k` | int | Problem dimensions |
| `split_k` | int | Split-K factor (1 = standard) |
| `measured_tflops` | float | Ground-truth TFLOPS |
| `latency_ms` | float | Measured latency |
| `bandwidth_gb_s` | float | Measured bandwidth |
| `is_valid` | bool | True if tflops > 0 and latency > 0 |
| `tile_m`, `tile_n`, `tile_k` | int | Tile dimensions |
| `warp_m`, `warp_n`, `warp_k` | int | Warp config |
| `warp_tile_m/n/k` | int | Warp tile dimensions |
| `pipeline` | str | `compv3`, `compv4`, `mem`, etc. |
| `scheduler` | str | `intrawave`, `interwave` |
| `epilogue` | str | `cshuffle`, `default` |
| `pad_m`, `pad_n`, `pad_k` | bool | Padding flags |
| `persistent` | bool | Persistent kernel flag |
| `run_id` | str | Unique collection run identifier |

## Shape Selection Guidelines

Good training data requires diverse shapes. Cover all of these regimes:

### By M dimension (batch size / output rows)

- **M=1**: single-token inference (hardest case for tiling)
- **Tiny M (2-16)**: small batch inference
- **Small M (32-128)**: medium batch
- **Medium M (256-2048)**: large batch / training
- **Large M (4096-20480)**: very large batch

### By N and K dimension

- **N=1**: vector-matrix multiply (degenerate)
- **K=1**: rank-1 update / outer product (degenerate)
- **Small N or K (2-16)**: stress tile efficiency
- **Deep K (K > 4096)**: compute-bound regime
- **Shallow K (K < 256)**: memory-bound regime

### By shape family

- **Square**: M ~ N ~ K (powers of 2)
- **Tall**: M >> N (tall output matrix)
- **Wide**: N >> M (wide output matrix)
- **Deep-K**: K >> M and K >> N

### Special cases

- **Prime dimensions**: 17, 31, 127, 251, 509, 1021, 2039, 4093 (worst case for tile alignment; tests padding logic)
- **Non-power-of-2**: 48, 96, 192, 384, 576, 768, 1536, 3072, 4608 (common in LLM architectures)
- **LLM inference shapes**: DeepSeek, LLaMA-7B, LLaMA-70B MLP/attention dims

### Minimum recommended coverage

For a production-quality model, aim for:

- At least 200 unique (M, N, K) shapes
- At least 10 shapes per shape family
- All kernel configs (4608 for fp8 RCR) run against every shape
- Multiple layouts if training a cross-layout model

## Benchmark Quality Guidelines

### Warmup and repeat

- Minimum `warmup=3`, `repeat=10` for fast iteration
- Production quality: `warmup=5`, `repeat=20` for stable measurements
- The `perf_result` values are averaged over `repeat` iterations

### Noise handling

- Use **median** latency when aggregating multiple runs of the same benchmark (see the sketch after this list)
- Flag measurements where the coefficient of variation exceeds 10%
- Avoid benchmarking under thermal throttling (check GPU temperature)
- Lock GPU clocks if possible for reproducibility
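A minimal sketch of that aggregation, assuming repeated runs of the same benchmark share the (`kernel_name`, `m`, `n`, `k`, `split_k`) key from the canonical schema (the 10% CV threshold is the guideline above, not a hard rule):

```python
import pandas as pd

def aggregate_runs(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeated measurements of one (kernel, shape) pair into a single
    row, keeping the median latency and flagging noisy groups (CV > 10%)."""
    keys = ["kernel_name", "m", "n", "k", "split_k"]
    g = df.groupby(keys)["latency_ms"]
    out = g.median().reset_index()
    cv = (g.std() / g.mean()).rename("latency_cv").reset_index()
    out = out.merge(cv, on=keys)
    out["noisy"] = out["latency_cv"] > 0.10   # NaN (single run) stays un-flagged
    return out
```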
### Environment metadata

Store with every dataset:

- GPU model and architecture (from `rocminfo`)
- ROCm driver version
- Clock mode (default / locked)
- Git hash of the CK tile engine build (if available)
- Timestamp

## Adding Data for a New Op

To generate benchmark data for a new operation (e.g., `gemm_streamk`):

1. **Build the binaries** using the tile engine:

   ```bash
   ninja -C build benchmark_gemm_streamk_fp8_rcr
   ```

2. **Write a generation script** (or modify `generate_wide_coverage.py`):
   - Change the executable glob pattern to match the new op
   - Add any op-specific CLI flags the binaries need

3. **Run and parse**:

   ```bash
   python3 data_pipeline.py my_streamk_run.log \
     -o data/gemm_streamk_fp8_gfx950.parquet --arch gfx950
   ```

4. **Train**:

   ```bash
   python3 train.py --op gemm_streamk --dtype fp8 --arch gfx950 \
     --data_dir data/ --out_dir models/gemm_streamk_fp8_gfx950
   ```

## Adding Data for a New Layout

Same binaries, same shapes -- just change the layout filter:

```bash
# Build rrr binaries
ninja -C build benchmark_gemm_universal_fp8_rrr

# Generate and parse
# ... (same flow, different bin_dir or executable glob)

# Train a cross-layout model by putting all layouts in the same data_dir
python3 train.py --data_dir data/ --out_dir models/gemm_universal_fp8_gfx950_all_layouts
```

The feature engine includes `layout` as a categorical feature, so one model can handle all layouts.

## Incremental Data Collection

When you have a trained model and want to add more data:

1. Generate new data (new shapes, new layouts, etc.)
2. Parse into parquet alongside existing data
3. Warm-start from the previous model:

   ```bash
   python3 train.py --data_dir data/ --out_dir models/v2 \
     --warm_start models/v1 \
     --warm_start_n_estimators 200
   ```

This adds 200 new trees on top of the existing model. The feature schema must match exactly (enforced automatically).

## File Organization

Recommended directory structure:

```
heuristics/
  data/
    gemm_universal_fp8_rcr_gfx950.parquet   # original 108 shapes
    wide_coverage/                          # batch log files
      wide_coverage_batch_001.log
      wide_coverage_batch_002.log
      ...
    edge_dims/                              # N=1, K=1 edge cases
      edge_dims_batch_001.log
      ...
  models/
    gemm_universal_fp8_gfx950/              # trained model artifacts
      model_tflops.lgbm
      model_latency.lgbm
      model_bandwidth.lgbm
      feature_spec.json
      train_manifest.json
      cv_metrics_tflops.json
      eval_report.json
      ...
```

## Troubleshooting

### Benchmark binary exits with non-zero code

Some kernel configs are invalid for certain problem sizes (e.g., tile_m=256 with M=16). The data pipeline marks these as `is_valid=False` and they are filtered out during training. This is expected.

### Edge dims produce very few results

N=1 and K=1 shapes are degenerate -- most kernel configurations have minimum dimension requirements and will fail or produce zero TFLOPS. The small number of valid results is still useful (it tells the model which configs work for these shapes).

### Benchmarks are slow

Each shape requires running all 4608 kernel executables sequentially. At ~0.01s per kernel, that is ~46 seconds per shape. For 700 shapes, expect ~9 hours. Tips:

- Run on a dedicated GPU (no other workloads)
- Use `--batch_size 25` to get incremental output
- Parse and train on partial data while generation continues

### Data from different GPUs / driver versions

Store `run_id` and hardware metadata with each dataset. Training on mixed data is allowed but not recommended for production models. Filter to a single `run_id` or `arch` for clean experiments.
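For a quick clean-up along those lines, something like the following works with the canonical schema (the parquet path follows the recommended layout above; pick whichever `run_id` you actually want to keep):

```python
import pandas as pd

# Load a (possibly mixed) dataset and keep valid rows from a single arch and run.
df = pd.read_parquet("data/gemm_universal_fp8_rcr_gfx950.parquet")
df = df[df["is_valid"] & (df["arch"] == "gfx950")]
print(df.groupby("run_id").size())             # see how much each run contributed
df = df[df["run_id"] == df["run_id"].iloc[0]]  # keep a single collection run
```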