mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-14 02:02:46 +00:00
[CK][CK TILE] Dispatcher kernel selection heuristic for grouped conv (#6327)

## Motivation

The ML heuristic in the dispatcher does not yet support grouped-conv operators. This PR adds support for fwd, bwd-data, and bwd-weight grouped-conv kernels. A tile_engine utility has also been added to compile and run any selected kernel configuration through the dispatcher infrastructure.

## Technical Details

1. A tile_engine utility is added to benchmark each shape with all possible kernel + tile_size combinations: https://github.com/ROCm/rocm-libraries/blob/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/tile_engine/ops/grouped_conv/grouped_conv_full_benchmark.py
2. New LGBM regressor models for grouped conv are added to the models directory, with three separate models for fwd, bwd-data, and bwd-weight: https://github.com/ROCm/rocm-libraries/tree/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/dispatcher/heuristics/models
3. Implemented lazy GPU initialization (dispatcher/python):
   - **Issue**: ProcessPoolExecutor fork() + GPU context caused memory access faults
   - **Solution**: Mirror the FMHA pattern: defer GPU initialization until the first run()
   - **Changes**:
     - setup_multiple_grouped_conv_dispatchers() returns List[Path], not loaded libs
     - GpuGroupedConvRunner.__init__() no longer calls ctypes.CDLL
     - Added _ensure_initialized() method for lazy GPU loading
     - GPU context created only on first run() call
   - **Benefit**: Parallel compilation now works without GPU conflicts
4. Addressed a few miscellaneous issues:
   - Fixed a BF16->FP16 naming bug in the dispatcher wrapper
   - Added new tile sizes and the comp_v5 pipeline to the arch spec to expand kernel selection
   - Added automatic padding support for unsupported shapes in the dispatcher runner
   - Created a single source of truth between tile_engine and dispatcher for architecture and tile_size details
   - Built validation scripts to compare oracle_best vs ml_heuristic

## Test Plan

1. Validated fwd, bwd-data, and bwd-weight kernels with both known and unseen data sets of up to 300 problems.
2. Ensured that test cases are added in both dispatcher and tile_engine to validate the heuristic.

## Test Results

Results on unseen shapes, validated on gfx950.

#### Forward Pass Model
- **Training Data**: 48,845 measurements across 1,372 unique problem shapes
- **Validation Set**: 300 unseen problems from the model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.05%**
  - Median Efficiency: **96.8%**
  - P10 Efficiency: **79.9%**

#### Backward Data Gradient (bwd_data) Model
- **Training Data**: 18,773 measurements across 891 unique problem shapes
- **Validation Set**: 300 unseen problems from the model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.8%**
  - Median Efficiency: **96.5%**
  - P10 Efficiency: **82.9%**

#### Backward Weight Gradient (bwd_weight) Model
- **Training Data**: 34,900 measurements across 1,508 unique problem shapes
- **Validation Set**: 300 unseen problems from the model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **96.1%**
  - Median Efficiency: **99.2%**
  - P10 Efficiency: **89.4%**

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
295 lines
9.1 KiB
Markdown
# Grouped Convolution ML Heuristics & Benchmarking

Training data collection and validation utilities for ML-based kernel selection in grouped convolution operations.

## Overview

This directory supports the **ML heuristic system** for grouped convolution kernel selection. The system achieves **99.67% mean efficiency** on unseen production workloads by predicting optimal kernels without an exhaustive GPU search.

**Key Results:**

- Forward pass: 99.67% mean efficiency (validated on 10 unseen MIOpen shapes)
- 70% perfect oracle matches (selected the exact best kernel)
- <1ms selection latency (30,000-60,000× faster than exhaustive search)

See [dispatcher/heuristics/GROUPED_CONV_ML_SUMMARY.md](../../dispatcher/heuristics/GROUPED_CONV_ML_SUMMARY.md) for full technical details.

---
## Files

### Benchmarking & Data Collection

- **`grouped_conv_full_benchmark.py`** - Systematic sweep for training data (kernels × problems)
- **`run_one_grouped_conv_kernel.py`** - Subprocess worker for isolated GPU execution
- **`test_batch_benchmark.py`** - Quick integration test (2 kernels × small problems)
- **`grouped_conv_instance_builder.py`** - Kernel configuration generator from JSON

### ML Validation

- **`validate_ml_vs_oracle.py`** - Compare ML predictions against exhaustive GPU search
- **`compare_ml_vs_oracle.py`** - Analysis of ML vs oracle performance

### Configuration

- **`configs/*.json`** - Kernel trait configurations (forward, bwd_data, bwd_weight)
- **`problems/*.py`** - Problem datasets (training, validation, MIOpen production shapes)

---
## ML Heuristic Workflow

### 1. Training Data Collection

Already completed. Training datasets:

- **Forward**: 48,845 samples (1,372 unique shapes) - Tier-1 extended
- **Bwd Data**: 14,562 samples (701 unique shapes)
- **Bwd Weight**: 18,150 samples (921 unique shapes)

If you need to collect new data:

```bash
# Full benchmark sweep (all kernels × all problems)
python grouped_conv_full_benchmark.py \
    --variant forward \
    --category full \
    --workers 256 \
    --output training_data_forward_bf16.csv
```
### 2. Training Models

Models are located in `dispatcher/heuristics/models/`:

- `grouped_conv_forward_bf16_gfx950/` - **Production-ready** (99.67% efficiency)
- `grouped_conv_bwd_data_bf16_gfx950/` - Trained, needs hardware validation
- `grouped_conv_bwd_weight_bf16_gfx950/` - Trained, needs hardware validation

To train new models, see [dispatcher/heuristics/README.md](../../dispatcher/heuristics/README.md).
### 3. Validation

Validate ML model performance on unseen shapes:

```bash
cd ../../dispatcher/heuristics/validation/grouped_conv

# Quick sanity check on training shapes (hardware)
python validate_training_shapes.py --direction forward

# Backward models validation (no GPU)
python validate_backward_models.py
```

See [dispatcher/heuristics/validation/README.md](../../dispatcher/heuristics/validation/README.md) for details.
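The mean/median/P10 efficiency figures reported by validation can be reproduced from per-problem results. A minimal sketch (the function name and the nearest-rank P10 convention are illustrative assumptions, not the scripts' actual API): per-problem efficiency is the ML-selected kernel's TFLOPS divided by the oracle-best TFLOPS.

```python
import math
from statistics import mean, median

def efficiency_stats(ml_tflops, oracle_tflops):
    """Per-problem efficiency: ML-selected kernel TFLOPS / oracle-best TFLOPS.

    Illustrative helper, not part of the validation scripts.
    """
    eff = [m / o for m, o in zip(ml_tflops, oracle_tflops)]
    ranked = sorted(eff)
    # Nearest-rank 10th percentile: 90% of problems meet or exceed this value.
    p10 = ranked[math.ceil(0.10 * len(ranked)) - 1]
    return {"mean": mean(eff), "median": median(eff), "p10": p10}
```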
---

## Problem Datasets

Located in `problems/`:

### Training Sets

- **`forward_training.py`** - 2,630 shapes (300 MIOpen + 2,330 synthetic)
- **`forward_training_miopen.py`** - 300 MIOpen production shapes
- **`bwd_data_synthetic_extended.py`** - Backward data training set
- **`bwd_weight_synthetic_extended.py`** - Backward weight training set

### Validation Sets (Unseen)

- **`bwd_data_test_validation.py`** - 10 unseen backward data shapes
- **`bwd_weight_test_validation.py`** - 10 unseen backward weight shapes

### Dataset Generator

- **`create_miopen_training_set.py`** - Extracts shapes from MIOpen ALL_CONFIGS_FULL.txt

---
## Benchmarking Usage

### Quick Test (2 Kernels × Few Problems)

```bash
# Test benchmark pipeline
python test_batch_benchmark.py
```

### Full Sweep (All Kernels × All Problems)

```bash
# Forward: 20 kernels × 200 problems = 4,000 measurements
python grouped_conv_full_benchmark.py \
    --variant forward \
    --category full \
    --workers 256 \
    --output sweep_forward.csv

# Backward data
python grouped_conv_full_benchmark.py \
    --variant bwd_data \
    --category full \
    --workers 256

# Backward weight
python grouped_conv_full_benchmark.py \
    --variant bwd_weight \
    --category full \
    --workers 256
```

**Output**: CSV with columns:

```
kernel,problem_idx,N,C,K,G,Hi,Wi,Y,X,stride_h,stride_w,pad_h,pad_w,latency_ms,tflops,non_zero
```

**Note**: The benchmark always starts fresh and overwrites the output CSV file. If you need to preserve previous results, rename or move the CSV file before running a new benchmark.
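Given the CSV layout above, extracting the oracle-best kernel per problem is a simple group-by-max over `problem_idx`. A stdlib-only sketch (the helper name is an assumption; the column names come from the output format above):

```python
import csv

def oracle_best(csv_path):
    """Map problem_idx -> (kernel, tflops) of the fastest measured kernel."""
    best = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            idx = int(row["problem_idx"])
            tflops = float(row["tflops"])
            # Keep the highest-throughput kernel seen so far for this problem.
            if idx not in best or tflops > best[idx][1]:
                best[idx] = (row["kernel"], tflops)
    return best
```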
---

## Instance Builder

Generate kernel configurations from JSON trait files:

```bash
# List all kernels matching the config
python grouped_conv_instance_builder.py configs/forward_bf16.json --arch gfx950 --list

# Count kernels
python grouped_conv_instance_builder.py configs/forward_bf16.json --count-only

# Apply a filter
python grouped_conv_instance_builder.py configs/forward_bf16.json \
    --filter "c.tile_n >= 128 and c.pipeline == 'compv5'" --list

# Export to JSON
python grouped_conv_instance_builder.py configs/forward_bf16.json \
    --export-json kernels.json
```
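The `--filter` expression above suggests each candidate configuration is exposed as an object bound to the name `c` and the expression is evaluated per kernel. A minimal sketch of that pattern, assuming attribute-style trait access (the `apply_filter` helper and the use of `SimpleNamespace` are illustrative, not the builder's actual implementation):

```python
from types import SimpleNamespace

def apply_filter(configs, expr):
    """Keep configs for which the filter expression, with the config
    bound to the name 'c', evaluates to true."""
    code = compile(expr, "<filter>", "eval")
    # Empty __builtins__ keeps the expression restricted to trait access.
    return [c for c in configs if eval(code, {"__builtins__": {}}, {"c": c})]

configs = [
    SimpleNamespace(tile_n=128, pipeline="compv5"),
    SimpleNamespace(tile_n=64, pipeline="compv3"),
]
kept = apply_filter(configs, "c.tile_n >= 128 and c.pipeline == 'compv5'")
```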
### Config Files

- **`forward_bf16.json`** - Forward BF16 (compv3/v4/v5, 30 kernels)
- **`bwd_data.json`** - Backward data (compv3/mem, 20 kernels)
- **`bwd_weight.json`** - Backward weight (compv3/mem, 20 kernels)

**Trait filtering** (see configs for examples):

```json
{
  "variant": "forward",
  "trait_config": {
    "data_type": {"values": ["bf16"]},
    "pipeline": {"values": ["compv3", "compv4", "compv5"]},
    "ndim_spatial": {"values": [2]}
  }
}
```
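A `trait_config` block lists the allowed values per trait, so a plausible way the builder enumerates concrete kernel configurations is a Cartesian product over those value lists (a sketch under that assumption; `expand_traits` is not the builder's actual function):

```python
from itertools import product

def expand_traits(trait_config):
    """Expand {trait: {"values": [...]}} into concrete trait dicts."""
    names = list(trait_config)
    value_lists = [trait_config[n]["values"] for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

trait_config = {
    "data_type": {"values": ["bf16"]},
    "pipeline": {"values": ["compv3", "compv4", "compv5"]},
    "ndim_spatial": {"values": [2]},
}
combos = expand_traits(trait_config)  # 1 × 3 × 1 = 3 combinations
```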
---

## Architecture

Based on the FMHA tile engine design with subprocess isolation:

```
grouped_conv_full_benchmark.py (orchestrator)
├─> grouped_conv_instance_builder.py (generate kernel configs)
├─> Build phase: JIT compile all kernels (serial, avoids fork/GPU issues)
└─> Benchmark phase: subprocess workers (serial GPU access)
    └─> run_one_grouped_conv_kernel.py (subprocess)
        └─> GpuGroupedConvRunner (fresh GPU context per problem)
```

**Key design decisions:**

1. **Subprocess isolation** - Fresh GPU context prevents memory leaks
2. **Batch size 20** - 20 kernels benchmarked per subprocess
3. **Path-only build** - Main process never initializes the GPU
4. **Serial GPU access** - Accurate timing, no contention
5. **Serial codegen/compile** - Avoids ProcessPoolExecutor + GPU fork() issues
**Note**: The `--workers` flag is accepted for API compatibility but is currently ignored; codegen and compilation run serially to avoid GPU context issues with process forking.
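The path-only build and deferred GPU setup above can be sketched as a lazy-initialization wrapper: the runner stores only the library path at construction and loads it on the first `run()` call. This is an illustrative stand-in, not `GpuGroupedConvRunner` itself; the injected `loader` takes the place of the real `ctypes.CDLL` call.

```python
from pathlib import Path

class LazyRunner:
    """Defer library/GPU setup until the first run() call, so the parent
    process can fork workers without ever holding a GPU context."""

    def __init__(self, lib_path, loader):
        self.lib_path = Path(lib_path)  # path only; nothing loaded yet
        self._loader = loader           # stand-in for ctypes.CDLL
        self._lib = None

    def _ensure_initialized(self):
        # Load exactly once, on first use.
        if self._lib is None:
            self._lib = self._loader(self.lib_path)

    def run(self, *args):
        self._ensure_initialized()
        return self._lib, args
```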
**Success rate**: 99.5% (3,760/3,780 measurements succeeded)
---

## Example Workflow: New Data Collection

```bash
# 1. Generate problem set
cd problems/
python create_miopen_training_set.py \
    --input /path/to/ALL_CONFIGS_FULL.txt \
    --output forward_training_new.py \
    --count 500

# 2. Collect training data
cd ..
python grouped_conv_full_benchmark.py \
    --variant forward \
    --category full \
    --workers 256 \
    --output new_training_data.csv

# 3. Convert to parquet
cd ../../dispatcher/heuristics
python convert_csv_to_parquet.py \
    --input ../../tile_engine/ops/grouped_conv/new_training_data.csv \
    --output data/grouped_conv_forward_bf16_gfx950/new_data.parquet

# 4. Train model
python train.py \
    --data_dir data/ \
    --out_dir models/grouped_conv_forward_bf16_gfx950_v2 \
    --op grouped_conv \
    --variant forward

# 5. Validate (sanity check on training shapes)
cd validation/grouped_conv
python validate_training_shapes.py --direction forward
```

---
## Performance Results

### Forward Pass (Production-Ready)

- **Mean efficiency**: 99.67% on 10 unseen MIOpen shapes
- **Perfect matches**: 70% (7/10 selected the exact oracle best)
- **Min efficiency**: 98.4% (even on an edge case: 1×491 spatial)
- **Selection time**: <1ms (vs 30-60s exhaustive search)

### Backward Passes (Prediction-Validated)

- **Bwd Data**: 14,562 samples, prediction quality tested
- **Bwd Weight**: 18,150 samples, prediction quality tested
- **Status**: Models trained, hardware validation pending

See [dispatcher/heuristics/GROUPED_CONV_ML_SUMMARY.md](../../dispatcher/heuristics/GROUPED_CONV_ML_SUMMARY.md) for full metrics.

---
## Hardware Tested

- **GPU**: AMD MI300 (gfx950)
- **Datatypes**: BF16 (primary), FP16, FP32
- **Pipelines**: CompV3, CompV4, CompV5 (forward), CompV3/Mem (backward)
- **Schedulers**: Intrawave, Interwave
- **Tile sizes**: 16×64×64, 32×64×64, 64×64×64, 128×128×64, etc.

---

## Related Documentation

- **ML System Overview**: [dispatcher/heuristics/GROUPED_CONV_ML_SUMMARY.md](../../dispatcher/heuristics/GROUPED_CONV_ML_SUMMARY.md)
- **Training Pipeline**: [dispatcher/heuristics/README.md](../../dispatcher/heuristics/README.md)
- **Validation Framework**: [dispatcher/heuristics/validation/README.md](../../dispatcher/heuristics/validation/README.md)
- **Python Examples**: [dispatcher/examples/grouped_conv/python/README_ML_HEURISTIC.md](../../dispatcher/examples/grouped_conv/python/README_ML_HEURISTIC.md)

---
## Next Steps

**For Forward Pass**: Production-ready; integrate into the runtime dispatcher.

**For Backward Passes**: Run the prediction-quality check:

```bash
cd ../../dispatcher/heuristics/validation/grouped_conv
python validate_backward_models.py
```

Target: >85% mean efficiency on unseen shapes before production deployment.