Add LLM-agnostic Docker and build analysis tools (#3576)

This commit introduces utility tools for building, testing, and analyzing
Composable Kernel. The tools are designed to be LLM-agnostic and can be
used with any AI assistant or directly from the command line.

Tools Added:
============

1. ck-docker - Docker container management
   - Start/stop ROCm-enabled containers
   - Build targets with CMake + Ninja
   - Run tests with gtest filters
   - Auto-detect GPU targets (gfx950, gfx942, etc.)
   - Per-user, per-branch container naming to avoid conflicts

2. ck-build-analysis - Build time profiling
   - Uses Clang's -ftime-trace for compilation analysis
   - Aggregates statistics across multiple trace files
   - Identifies template instantiation bottlenecks
   - Generates detailed Markdown reports with:
     * Compilation phase breakdown
     * Top expensive instantiations
     * Template family analysis
     * Data-driven optimization recommendations
   - Configurable granularity (0µs to 500µs)
   - PEP 723 compliant Python script with auto-dependency management via uv

Key Features:
=============

- LLM-agnostic design (works with any AI assistant)
- Zero-configuration setup with automatic dependency installation
- Comprehensive documentation in script/tools/README*.md
- Security hardening (input validation, no command injection)
- Multi-file trace aggregation for accurate build analysis
- Jinja2-based report generation for customizable output

Implementation:
===============

- script/tools/ck-docker - Main Docker orchestration script
- script/tools/ck-build-analysis - Build analysis orchestration
- script/tools/common.sh - Shared utilities (container mgmt, GPU detection)
- script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer
- script/tools/templates/ - Jinja2 templates for report generation
- script/tools/README*.md - Comprehensive documentation

Directory Structure:
====================

script/tools/
├── README.md                          # Main overview
├── README_ck-docker.md                # ck-docker documentation
├── README_ck-build-analysis.md        # ck-build-analysis documentation
├── ck-docker                          # Docker orchestration script
├── ck-build-analysis                  # Build analysis orchestration
├── common.sh                          # Shared utilities
├── analyze_build_trace.py             # Python analyzer (PEP 723)
└── templates/
    └── build_analysis_report.md.jinja # Report template

The tools follow Unix philosophy: do one thing well, compose easily,
and work from both CLI and programmatic contexts.
Max Podkorytov
2026-01-15 08:30:23 -08:00
committed by GitHub
parent f57395689b
commit 086a1f8861
8 changed files with 1426 additions and 0 deletions

script/tools/README.md Normal file

@@ -0,0 +1,78 @@
# Composable Kernel Tools
This directory contains utility tools for building, testing, and analyzing Composable Kernel.
These tools are designed to be LLM-agnostic and can be used with any AI assistant or directly from the command line.
## Available Tools
### ck-docker
Build and test composable_kernel in Docker with ROCm support.
See [README_ck-docker.md](README_ck-docker.md) for details.
**Quick start:**
```bash
# Add to PATH
export PATH="$PATH:$PWD/script/tools"
# Start container and build
ck-docker start
ck-docker build test_amdgcn_mma
ck-docker test test_amdgcn_mma
```
### ck-build-analysis
Analyze Composable Kernel build times using Clang's -ftime-trace profiler.
See [README_ck-build-analysis.md](README_ck-build-analysis.md) for details.
**Quick start:**
```bash
# Add to PATH
export PATH="$PATH:$PWD/script/tools"
# Analyze build time
ck-build-analysis example_convnd_fwd_xdl_fp8
```
## LLM Assistant Integration
These tools can be used as-is with any LLM assistant by providing the tool documentation to the assistant. The assistant can then invoke these tools on your behalf.
For example, you can ask:
- "Start the docker container"
- "Build and test test_amdgcn_mma"
- "Analyze build time for example_convnd_fwd_xdl_fp8"
The assistant will translate your natural language request into the appropriate tool invocation.
## Dependencies
- **ck-docker**: Requires Docker and ROCm-capable GPU (for running tests)
- **ck-build-analysis**: Requires Docker, automatically installs Python dependencies (jinja2) via `uv`
## Directory Structure
```
script/tools/
├── README.md # This file
├── README_ck-docker.md # Documentation for ck-docker
├── README_ck-build-analysis.md # Documentation for ck-build-analysis
├── ck-docker # Docker container management tool
├── ck-build-analysis # Build time analysis tool
├── common.sh # Shared utilities for bash scripts
├── analyze_build_trace.py # Python script for trace analysis (PEP 723 compliant)
└── templates/
└── build_analysis_report.md.jinja # Jinja2 template for analysis reports
```
## Contributing
When adding new tools to this directory:
1. Keep them LLM-agnostic (avoid hardcoding references to specific AI assistants)
2. Provide clear command-line usage documentation
3. Include examples for both CLI and LLM assistant usage
4. Follow the existing naming convention and structure


@@ -0,0 +1,168 @@
# ck-build-analysis
Analyze Composable Kernel build times using Clang's -ftime-trace profiler.
## Terminal Usage
Direct command-line usage:
```bash
# From composable_kernel directory
script/tools/ck-build-analysis example_convnd_fwd_xdl_fp8
script/tools/ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=1
script/tools/ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=1 --output=my_report.md
# Or add to PATH
export PATH="$PATH:$PWD/script/tools"
ck-build-analysis example_convnd_fwd_xdl_fp8
```
## LLM Assistant Integration
If using an LLM assistant, you can ask in natural language:
- "Analyze build time for example_convnd_fwd_xdl_fp8"
- "Profile the compilation of test_amdgcn_mma with 1us granularity"
- "Generate a build time report for example_gemm_xdl"
## Commands
```
ck-build-analysis <target> [options]
Options:
--granularity=N Time trace granularity in microseconds (default: 1)
--output=FILE Output report filename (default: build_time_analysis_report.md)
--name=NAME Docker container name (default: from CK_CONTAINER_NAME or auto-generated)
--no-reconfigure Skip CMake reconfiguration if build exists
--help Show this help message
```
## What It Does
1. **Configures CMake** with `-ftime-trace` and custom granularity
2. **Builds the target** using Ninja in Docker
3. **Analyzes the trace** JSON file for template instantiation patterns
4. **Generates a report** with:
- Compilation phase breakdown
- Top expensive individual instantiations
- Template families ranked by total time and count
- Key insights and optimization recommendations
- Complete statistics
## Configuration
- **Container**: Uses ck-docker container (auto-starts if needed)
- **Granularity**: Default 1us (100% template coverage, best balance)
- **Output**: Markdown report in project root
## Environment
```bash
export CK_CONTAINER_NAME=my_build # Override container name
export CK_BUILD_ANALYSIS_GRANULARITY=1 # Default granularity in microseconds
```
## Examples
```bash
# Complete template analysis with default granularity (1us - recommended)
ck-build-analysis example_convnd_fwd_xdl_fp8
# Quick daily check (10us granularity, captures most expensive templates)
ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=10
# Maximum detail (0us granularity, includes LLVM internals)
ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=0
# High-level overview (500us granularity, major bottlenecks only)
ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=500
# Custom output filename
ck-build-analysis example_convnd_fwd_xdl_fp8 --output=fp8_conv_analysis.md
# Analyze test target
ck-build-analysis test_amdgcn_mma
# Use existing build (skip reconfigure)
ck-build-analysis example_convnd_fwd_xdl_fp8 --no-reconfigure
```
## Output
The report includes:
- **Executive Summary**: Total time, events, instantiations, unique templates
- **Compilation Phases**: InstantiateFunction, Frontend, Backend, Optimizer, etc.
- **Top 30 Individual Instantiations**: Most expensive single templates
- **Template Families**: Grouped by total time and instantiation count
- **Key Insights**: What's slow and why
- **Optimization Recommendations**: Short, medium, and long-term strategies
- **Detailed Statistics**: Averages, medians, distributions
## Granularity Trade-offs
| Granularity | Template Coverage | Use Case |
|-------------|-------------------|----------|
| **0us** | All templates + sub-us compiler internals | LLVM internals debugging, very large files, higher overhead |
| **1us (default)** | **All templates** | **Default: Complete template analysis with low overhead** |
| **10us** | Most expensive templates | Daily quick checks, smaller files, minimal overhead |
| **50-100us** | Top bottlenecks | Balanced detail/size, suitable for CI/CD |
| **500us** | High-level phases only | Not recommended for template analysis |
**Recommended default**: 1us captures all template instantiations with minimal overhead
## Notes
- **0us and 1us capture all templates** - 0us adds sub-microsecond compiler internals
- **1us is the sweet spot**: complete template coverage, filters noise, low overhead
- **10us is practical** for daily use: captures most expensive templates, smaller files
- **500us loses most template instantiation data** - only use for high-level phase breakdown
- Finer granularity = more events = larger files + higher build time overhead
- For template-heavy C++ codebases like CK: **use 1us for analysis, 10us for daily checks**
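As an illustration of the trade-off above, here is a minimal sketch (using made-up event durations, not real trace data) of how the granularity threshold filters `-ftime-trace` events — Clang omits events shorter than the configured granularity, which is why coarse settings lose the small template instantiations:

```python
# Hypothetical -ftime-trace events (durations in integer microseconds).
events = [
    {"name": "InstantiateFunction", "dur": 3},
    {"name": "InstantiateClass", "dur": 0},  # sub-microsecond
    {"name": "InstantiateFunction", "dur": 750},
]

def visible_events(events, granularity_us):
    """Events that survive a given -ftime-trace-granularity threshold."""
    return [e for e in events if e["dur"] >= granularity_us]

print(len(visible_events(events, 1)))    # 2: the sub-microsecond event is dropped
print(len(visible_events(events, 500)))  # 1: only the 750us instantiation remains
```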
## Implementation Details
### PEP 723 Compliance with Automatic Dependency Management
The analysis script (`analyze_build_trace.py`) is PEP 723 compliant with inline dependency metadata:
```python
# /// script
# requires-python = ">=3.8"
# dependencies = [
# "jinja2>=3.0.0",
# ]
# ///
```
**The tool automatically installs and uses `uv`**, which provides:
- ✅ Zero-configuration dependency management
- ✅ Automatic installation of jinja2 from PEP 723 metadata
- ✅ Isolated dependency environment (no system pollution)
- ✅ Fast caching for subsequent runs
**No manual setup required!** The first time you run the tool, it will:
1. Detect whether `uv` is installed in the container
2. If not, install `pipx` from Ubuntu's package repositories and then run `pipx install uv`
3. Use `uv run` to execute the analysis with auto-managed dependencies
On subsequent runs, `uv` will already be available and dependencies will be cached.
`pipx` itself comes from Ubuntu's package manager for security and reliability; `uv` is then installed into an isolated `pipx` environment.
### Components
- **ck-build-analysis** - Main bash script that orchestrates Docker, CMake, and analysis
- **analyze_build_trace.py** - PEP 723 compliant Python script for trace analysis
- **templates/build_analysis_report.md.jinja** - Jinja2 template for report generation
### Standalone Usage
The Python script can also be run independently:
```bash
# With uv (recommended - auto-installs dependencies from PEP 723 metadata)
uv run script/tools/analyze_build_trace.py trace.json report.md target 100 22 templates/
# With pipx (alternative - also auto-installs dependencies)
pipx run script/tools/analyze_build_trace.py trace.json report.md target 100 22 templates/
```
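The heart of the analysis is grouping instantiation events into template families by the name before the first `<` or `(`. A minimal sketch of that grouping, using hypothetical `detail` strings rather than real trace output:

```python
import re
from collections import defaultdict

# Hypothetical "detail" strings as found in InstantiateFunction/Class events.
details = [
    "ck::SomeKernel<float, 4>",
    "ck::SomeKernel<_Float16, 8>",
    "std::tuple<int, int>",
]

family_counts = defaultdict(int)
for detail in details:
    match = re.match(r"^([^<(]+)", detail)  # name before '<' or '('
    if match:
        # Strip the ck:: prefix so families group by bare name.
        family = re.sub(r"^ck::", "", match.group(1).strip())
        family_counts[family] += 1

print(dict(family_counts))  # {'SomeKernel': 2, 'std::tuple': 1}
```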


@@ -0,0 +1,80 @@
# ck-docker
Build and test composable_kernel in Docker with ROCm support.
## Terminal Usage
Direct command-line usage:
```bash
# From composable_kernel directory
script/tools/ck-docker start
script/tools/ck-docker build test_amdgcn_mma
script/tools/ck-docker test test_amdgcn_mma --gtest_filter=*Fp16*
script/tools/ck-docker status
script/tools/ck-docker shell
# Or add to PATH
export PATH="$PATH:$PWD/script/tools"
ck-docker start
```
## LLM Assistant Integration
If using an LLM assistant, you can ask in natural language:
- "Start the docker container"
- "Build test_amdgcn_mma"
- "Run test_amdgcn_mma with filter *Fp16*"
- "Check container status"
- "Open a shell in the container"
## Commands
```
ck-docker start [name] Start Docker container
ck-docker build [target] [--reconfigure] Build target (optionally reconfigure CMake)
ck-docker test <name> [options] Run test
ck-docker shell [name] Interactive shell
ck-docker status [name] Check status
ck-docker stop [name] Stop container
```
## Configuration
- **Image**: rocm/composable_kernel:ck_ub24.04_rocm7.0.1
- **GPU**: Auto-detected via rocminfo (fallback: gfx950)
- **Compiler**: /opt/rocm/llvm/bin/clang++
- **Build**: Ninja + CMake (Release)
- **Mount**: Current directory → /workspace
- **Container Name**: Auto-generated as `ck_<username>_<branch>` to avoid clashes
## Environment
```bash
export CK_CONTAINER_NAME=my_build # Override default container name
export CK_DOCKER_IMAGE=rocm/composable_kernel:ck_ub24.04_rocm7.0.1 # Override Docker image
export GPU_TARGET=gfx942 # Override GPU target detection
```
## Examples
```bash
# Start container
ck-docker start
# Build and run test
ck-docker build test_amdgcn_mma
ck-docker test test_amdgcn_mma
# Force clean CMake reconfiguration and build
ck-docker build --reconfigure test_amdgcn_mma
# Custom container
ck-docker start my_build
ck-docker build test_amdgcn_mma --name my_build
ck-docker test test_amdgcn_mma --name my_build
# Debug
ck-docker shell
ck-docker status
```
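The per-user, per-branch naming described above (`ck_<username>_<branch>`) can be modeled as follows — an illustrative Python sketch of the scheme, not the actual `common.sh` implementation:

```python
import re

def container_name(user: str, branch: str) -> str:
    """Model of the ck_<username>_<branch> naming scheme.

    Docker container names only allow [a-zA-Z0-9_.-], so anything else
    in the branch name (e.g. '/') is replaced with an underscore.
    """
    safe_branch = re.sub(r"[^a-zA-Z0-9_.-]", "_", branch)
    return f"ck_{user}_{safe_branch}"

print(container_name("alice", "feature/new-gemm"))  # ck_alice_feature_new-gemm
```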


@@ -0,0 +1,347 @@
#!/usr/bin/env python3
# Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
# SPDX-License-Identifier: MIT
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "jinja2>=3.0.0",
# ]
# ///
"""
Build Time Analysis Tool for Composable Kernel

Analyzes Clang -ftime-trace output to identify template instantiation
bottlenecks and generate comprehensive build time reports.
"""
import json
import os
import re
import sys
from collections import defaultdict
from datetime import datetime

try:
    from jinja2 import Environment, FileSystemLoader
except ImportError:
    print("Error: jinja2 is required but not installed.", file=sys.stderr)
    print("Install with: apt-get install python3-jinja2", file=sys.stderr)
    print("Or with pip: pip install jinja2", file=sys.stderr)
    sys.exit(1)


def parse_arguments():
    """Parse command-line arguments."""
    if len(sys.argv) < 7:
        print(
            "Usage: analyze_build_trace.py <trace_files_or_dir> <output_file> <target> <granularity> <build_time> <template_dir>"
        )
        print(
            "  trace_files_or_dir: Comma-separated list of trace files OR directory containing .json files"
        )
        sys.exit(1)
    return {
        "trace_input": sys.argv[1],
        "output_file": sys.argv[2],
        "target": sys.argv[3],
        "granularity": sys.argv[4],
        "build_time": sys.argv[5],
        "template_dir": sys.argv[6],
    }


def find_trace_files(trace_input):
    """Find all trace files from input (file list, single file, or directory)."""
    trace_files = []
    # Check if it's a directory
    if os.path.isdir(trace_input):
        print(f"Scanning directory: {trace_input}")
        for root, dirs, files in os.walk(trace_input):
            for file in files:
                # Only look under CMakeFiles object directories, where clang
                # writes per-TU traces; matching .cpp.json/.hip.json also
                # excludes compile_commands.json
                if file.endswith((".cpp.json", ".hip.json")) and "CMakeFiles" in root:
                    trace_files.append(os.path.join(root, file))
        trace_files.sort()
    # Check if it's a comma-separated list
    elif "," in trace_input:
        trace_files = [f.strip() for f in trace_input.split(",")]
    # Single file
    else:
        trace_files = [trace_input]
    # Filter out non-existent files
    valid_files = [f for f in trace_files if os.path.isfile(f)]
    if not valid_files:
        print(f"Error: No valid trace files found in: {trace_input}", file=sys.stderr)
        sys.exit(1)
    print(f"Found {len(valid_files)} trace file(s)")
    return valid_files


def load_trace_data(trace_files):
    """Load and parse multiple trace JSON files."""
    all_data = []
    for trace_file in trace_files:
        print(f"  Loading: {trace_file}")
        try:
            with open(trace_file, "r") as f:
                data = json.load(f)
            # Get file basename for tracking
            file_name = os.path.basename(trace_file)
            all_data.append({"file": file_name, "path": trace_file, "data": data})
        except Exception as e:
            print(f"  Warning: Failed to load {trace_file}: {e}", file=sys.stderr)
    return all_data


def process_events(all_trace_data):
    """Process trace events from multiple files and extract statistics."""
    print("Processing events from all files...")
    template_stats = defaultdict(lambda: {"count": 0, "total_dur": 0})
    phase_stats = defaultdict(int)
    top_individual = []
    file_stats = []
    total_events = 0
    for trace_info in all_trace_data:
        file_name = trace_info["file"]
        data = trace_info["data"]
        events = data.get("traceEvents", [])
        file_template_time = 0
        file_event_count = len(events)
        total_events += file_event_count
        print(f"  Processing {file_name}: {file_event_count:,} events")
        for event in events:
            name = event.get("name", "")
            dur = int(event.get("dur", 0))  # Keep as integer microseconds
            if name and dur > 0:
                phase_stats[name] += dur
                if name in ["InstantiateFunction", "InstantiateClass"]:
                    detail = event.get("args", {}).get("detail", "")
                    top_individual.append(
                        {"detail": detail, "dur": dur, "type": name, "file": file_name}
                    )
                    file_template_time += dur
                    # Extract template name (everything before '<' or '(')
                    match = re.match(r"^([^<(]+)", detail)
                    if match:
                        template_name = match.group(1).strip()
                        # Normalize template names: strip the ck:: prefix,
                        # keep std:: as-is
                        template_name = re.sub(r"^ck::", "", template_name)
                        template_stats[template_name]["count"] += 1
                        template_stats[template_name]["total_dur"] += dur
        file_stats.append(
            {
                "name": file_name,
                "events": file_event_count,
                "template_time": file_template_time,
            }
        )
    return template_stats, phase_stats, top_individual, file_stats, total_events


def prepare_template_data(template_stats, phase_stats, top_individual, file_stats):
    """Prepare and calculate derived statistics for template rendering."""
    print("Sorting data...")
    # Sort data
    sorted_phases = sorted(phase_stats.items(), key=lambda x: x[1], reverse=True)
    top_individual.sort(key=lambda x: x["dur"], reverse=True)
    file_stats.sort(key=lambda x: x["template_time"], reverse=True)
    # Calculate totals
    total_template_time = sum(s["total_dur"] for s in template_stats.values())
    total_trace_time = sum(phase_stats.values())
    total_inst = sum(s["count"] for s in template_stats.values())
    # Prepare templates by time with calculated fields
    templates_by_time = []
    for name, stats in sorted(
        template_stats.items(), key=lambda x: x[1]["total_dur"], reverse=True
    ):
        templates_by_time.append(
            (
                name,
                {
                    "count": stats["count"],
                    "total_dur": stats["total_dur"],
                    "avg": stats["total_dur"] // stats["count"]
                    if stats["count"] > 0
                    else 0,
                    "pct": 100 * stats["total_dur"] / total_template_time
                    if total_template_time > 0
                    else 0,
                },
            )
        )
    # Prepare templates by count
    templates_by_count = []
    for name, stats in sorted(
        template_stats.items(), key=lambda x: x[1]["count"], reverse=True
    ):
        templates_by_count.append(
            (
                name,
                {
                    "count": stats["count"],
                    "total_dur": stats["total_dur"],
                    "avg": stats["total_dur"] // stats["count"]
                    if stats["count"] > 0
                    else 0,
                },
            )
        )
    # Add friendly type names to individual instantiations
    for inst in top_individual:
        inst["inst_type"] = "Func" if inst["type"] == "InstantiateFunction" else "Class"
    # Calculate additional metrics
    median_count = 0
    if len(template_stats) > 0:
        median_count = sorted([s["count"] for s in template_stats.values()])[
            len(template_stats) // 2
        ]
    top10_pct = 0
    if len(templates_by_time) >= 10:
        top10_pct = (
            100
            * sum(s[1]["total_dur"] for s in templates_by_time[:10])
            / total_template_time
        )
    return {
        "sorted_phases": sorted_phases,
        "top_individual": top_individual,
        "templates_by_time": templates_by_time,
        "templates_by_count": templates_by_count,
        "total_template_time": total_template_time,
        "total_trace_time": total_trace_time,
        "total_inst": total_inst,
        "median_count": median_count,
        "top10_pct": top10_pct,
        "unique_families": len(template_stats),
        "file_stats": file_stats,
    }


def setup_jinja_environment(template_dir):
    """Set up Jinja2 environment with custom filters."""
    env = Environment(loader=FileSystemLoader(template_dir))

    def format_number(value):
        """Format number with thousand separators."""
        return f"{value:,}"

    def truncate(value, length):
        """Truncate string to length with ellipsis."""
        if len(value) > length:
            return value[: length - 3] + "..."
        return value

    def pad(value, length):
        """Pad string to specified length."""
        return f"{value:<{length}}"

    def us_to_ms(value):
        """Convert microseconds to milliseconds."""
        return value / 1000.0

    def us_to_s(value):
        """Convert microseconds to seconds."""
        return value / 1000000.0

    env.filters["format_number"] = format_number
    env.filters["truncate"] = truncate
    env.filters["pad"] = pad
    env.filters["us_to_ms"] = us_to_ms
    env.filters["us_to_s"] = us_to_s
    return env


def generate_report(env, data, args, total_events, num_files):
    """Generate the final report using Jinja2 template."""
    print("Rendering report with Jinja2...")
    template = env.get_template("build_analysis_report.md.jinja")
    report_content = template.render(
        timestamp=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        target=args["target"],
        granularity=args["granularity"],
        build_time=args["build_time"],
        total_events=total_events,
        num_files=num_files,
        total_instantiations=data["total_inst"],
        unique_families=data["unique_families"],
        total_trace_time=data["total_trace_time"],
        total_template_time=data["total_template_time"],
        phases=data["sorted_phases"],
        top_individual=data["top_individual"],
        templates_by_time=data["templates_by_time"],
        templates_by_count=data["templates_by_count"],
        median_count=data["median_count"],
        top10_pct=data["top10_pct"],
        file_stats=data["file_stats"],
    )
    return report_content


def main():
    """Main entry point for the analysis tool."""
    args = parse_arguments()
    # Find and load trace files
    trace_files = find_trace_files(args["trace_input"])
    all_trace_data = load_trace_data(trace_files)
    # Process events from all files
    template_stats, phase_stats, top_individual, file_stats, total_events = (
        process_events(all_trace_data)
    )
    # Prepare template data
    data = prepare_template_data(
        template_stats, phase_stats, top_individual, file_stats
    )
    # Setup Jinja2 environment
    env = setup_jinja_environment(args["template_dir"])
    # Generate report
    report_content = generate_report(env, data, args, total_events, len(all_trace_data))
    # Write output
    with open(args["output_file"], "w") as f:
        f.write(report_content)
    print(f"Report generated: {args['output_file']}")
    print(f"Report size: {len(report_content):,} bytes")
    print(f"Analyzed {len(all_trace_data)} file(s) with {total_events:,} total events")


if __name__ == "__main__":
    main()

script/tools/ck-build-analysis Executable file

@@ -0,0 +1,237 @@
#!/bin/bash
# Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
# SPDX-License-Identifier: MIT
# CK Build Analysis Tool - Analyze build times using -ftime-trace

set -e
set -o pipefail

# Find script directory and load common utilities
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/common.sh"

# Initialize configuration
PROJECT_ROOT=$(get_project_root "${SCRIPT_DIR}")
CONTAINER_NAME=$(get_container_name "${PROJECT_ROOT}")

# Default settings
GRANULARITY="${CK_BUILD_ANALYSIS_GRANULARITY:-1}"
OUTPUT_FILE="build_time_analysis_report.md"
RECONFIGURE=true

# Help message
show_help() {
    cat << EOF
CK Build Analysis - Analyze build times using Clang -ftime-trace

Usage: ck-build-analysis <target> [options]

Arguments:
  target                Build target to analyze (e.g., example_convnd_fwd_xdl_fp8)

Options:
  --granularity=N       Time trace granularity in microseconds (default: 1)
  --output=FILE         Output report filename (default: build_time_analysis_report.md)
  --name=NAME           Docker container name (default: ${CONTAINER_NAME})
  --no-reconfigure      Skip CMake reconfiguration if build exists
  --help                Show this help message

Examples:
  ck-build-analysis example_convnd_fwd_xdl_fp8
  ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=10
  ck-build-analysis test_amdgcn_mma --granularity=1 --output=mma_test_analysis.md

Granularity Guide:
  0            - Everything: All compiler events including sub-microsecond operations
                 Use for LLVM internals debugging. Large files, higher overhead.
  1 (default)  - Complete template coverage: Captures all template instantiations
                 Best balance - filters sub-microsecond noise, low overhead
  10           - Daily use: Captures most expensive templates, smaller files
                 Good for quick checks and routine analysis
  50-100       - Intermediate: Balanced between detail and file size
                 Suitable for CI/CD tracking
  500          - High-level only: Major compilation phases, minimal detail
                 Not recommended for template analysis (loses most instantiations)

Recommendation: Use 1us (default) for template analysis, 10us for quick checks.
EOF
}

# Parse arguments
TARGET=""
while [[ $# -gt 0 ]]; do
    case $1 in
        --granularity=*)
            GRANULARITY="${1#*=}"
            shift
            ;;
        --output=*)
            OUTPUT_FILE="${1#*=}"
            shift
            ;;
        --name=*)
            CONTAINER_NAME="${1#*=}"
            shift
            ;;
        --no-reconfigure)
            RECONFIGURE=false
            shift
            ;;
        --help|-h)
            show_help
            exit 0
            ;;
        -*)
            echo "Unknown option: $1"
            show_help
            exit 1
            ;;
        *)
            if [ -z "$TARGET" ]; then
                TARGET="$1"
            else
                echo "Error: Multiple targets specified"
                show_help
                exit 1
            fi
            shift
            ;;
    esac
done

if [ -z "$TARGET" ]; then
    echo "Error: No target specified"
    echo ""
    show_help
    exit 1
fi

# Validate OUTPUT_FILE to prevent path traversal
if [[ "$OUTPUT_FILE" =~ / ]] || [[ "$OUTPUT_FILE" =~ \.\. ]]; then
    echo "Error: OUTPUT_FILE must be a simple filename (no path separators or .. allowed)"
    echo "Invalid: $OUTPUT_FILE"
    exit 1
fi

echo "═══════════════════════════════════════════════════════════════"
echo "  CK Build Time Analysis"
echo "═══════════════════════════════════════════════════════════════"
echo "Target:      $TARGET"
echo "Granularity: ${GRANULARITY}us"
echo "Container:   $CONTAINER_NAME"
echo "Output:      $OUTPUT_FILE"
echo "═══════════════════════════════════════════════════════════════"
echo ""

# Ensure container is running
ensure_container_running "${CONTAINER_NAME}" "${SCRIPT_DIR}"

# Configure CMake with -ftime-trace if needed
if [ "$RECONFIGURE" = true ] || ! docker exec "${CONTAINER_NAME}" test -f /workspace/build/build.ninja 2>/dev/null; then
    echo ""
    echo "Configuring CMake with -ftime-trace (granularity=${GRANULARITY}us)..."
    GPU_TARGET=$(detect_gpu_target "${CONTAINER_NAME}")
    docker exec -e GPU_TARGET="${GPU_TARGET}" -e GRANULARITY="${GRANULARITY}" "${CONTAINER_NAME}" bash -c '
        cd /workspace || exit 1
        rm -rf /workspace/build
        mkdir /workspace/build
        cd /workspace/build || exit 1
        cmake .. -GNinja \
            -DGPU_TARGETS="${GPU_TARGET}" \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ \
            -DCMAKE_CXX_FLAGS="-ftime-trace -ftime-trace-granularity=${GRANULARITY}" \
            -DCMAKE_HIP_FLAGS="-ftime-trace -ftime-trace-granularity=${GRANULARITY}" \
            -DBUILD_TESTING=ON 2>&1 | tail -20
    '
    echo "CMake configuration complete"
fi

# Build the target
echo ""
echo "Building target: $TARGET"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
BUILD_START=$(date +%s)
docker exec -e TARGET="${TARGET}" "${CONTAINER_NAME}" bash -c 'cd /workspace/build && time ninja "${TARGET}" 2>&1'
BUILD_END=$(date +%s)
BUILD_TIME=$((BUILD_END - BUILD_START))
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "Build completed in ${BUILD_TIME} seconds"

# Find all trace JSON files for the target
echo ""
echo "Locating trace files..."
# Count trace files
TRACE_COUNT=$(docker exec -e TARGET="${TARGET}" "${CONTAINER_NAME}" bash -c '
    find /workspace/build -type f \( -name "*.cpp.json" -o -name "*.hip.json" \) 2>/dev/null | \
        grep -vF "compile_commands.json" | wc -l
')
if [ "$TRACE_COUNT" -eq 0 ]; then
    echo "Error: Could not find any trace files in /workspace/build"
    echo "Expected .cpp.json or .hip.json files from -ftime-trace compilation"
    exit 1
fi
echo "Found ${TRACE_COUNT} trace file(s) in build directory"
# We'll pass the build directory to the Python script
BUILD_DIR="/workspace/build"

# Generate analysis report
echo ""
echo "Generating analysis report..."
# Copy analysis script and templates to container
docker cp "${SCRIPT_DIR}/analyze_build_trace.py" "${CONTAINER_NAME}:/tmp/analyze_build_trace.py"
docker cp "${SCRIPT_DIR}/templates" "${CONTAINER_NAME}:/tmp/ck_build_analysis_templates"

# Check if uv is available, install if needed, and use for PEP 723 dependency management
if ! docker exec "${CONTAINER_NAME}" bash -c "command -v uv >/dev/null 2>&1 || test -x \$HOME/.local/bin/uv"; then
    echo "uv not found, installing via pipx..."
    docker exec "${CONTAINER_NAME}" bash -c "
        # Install pipx if not available
        if ! command -v pipx >/dev/null 2>&1; then
            apt-get update -qq && apt-get install -y -qq pipx >/dev/null 2>&1
        fi
        # Install uv via pipx
        pipx install uv >/dev/null 2>&1
    "
    echo "uv installed successfully"
fi
echo "Using uv run for automatic dependency management..."
# Ensure uv is in PATH (handles ~/.local/bin installation)
# Pass build directory instead of single file
docker exec -e BUILD_DIR="${BUILD_DIR}" -e OUTPUT_FILE="${OUTPUT_FILE}" -e TARGET="${TARGET}" -e GRANULARITY="${GRANULARITY}" -e BUILD_TIME="${BUILD_TIME}" "${CONTAINER_NAME}" bash -c 'export PATH="$HOME/.local/bin:$PATH" && uv run --no-project /tmp/analyze_build_trace.py "${BUILD_DIR}" "/workspace/${OUTPUT_FILE}" "${TARGET}" "${GRANULARITY}" "${BUILD_TIME}" /tmp/ck_build_analysis_templates'

# Copy report back to host
docker cp "${CONTAINER_NAME}:/workspace/${OUTPUT_FILE}" "${PROJECT_ROOT}/${OUTPUT_FILE}"
# Cleanup
docker exec "${CONTAINER_NAME}" rm -f /tmp/analyze_build_trace.py
docker exec "${CONTAINER_NAME}" rm -rf /tmp/ck_build_analysis_templates

echo ""
echo "═══════════════════════════════════════════════════════════════"
echo "  Analysis Complete!"
echo "═══════════════════════════════════════════════════════════════"
echo "Report: ${PROJECT_ROOT}/${OUTPUT_FILE}"
echo ""
echo "Summary:"
docker exec "${CONTAINER_NAME}" bash -c "head -20 /workspace/${OUTPUT_FILE} | tail -10"
echo ""
echo "View the full report:"
echo "  cat ${OUTPUT_FILE}"
echo "  or open it in your editor"
echo "═══════════════════════════════════════════════════════════════"

script/tools/ck-docker Executable file

@@ -0,0 +1,294 @@
#!/bin/bash
# Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
# SPDX-License-Identifier: MIT
# CK Docker Tool - Build and test composable_kernel in Docker with ROCm support
set -e
set -o pipefail
# Find script directory and load common utilities
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/common.sh"
# Initialize configuration
PROJECT_ROOT=$(get_project_root "${SCRIPT_DIR}")
CONTAINER_NAME=$(get_container_name "${PROJECT_ROOT}")
# Help message
show_help() {
cat << EOF
CK Docker Tool - Build and test composable_kernel in Docker
Usage: ck-docker <command> [options]
Commands:
start [name] Start Docker container
build [target] [--reconfigure] Build target (optionally reconfigure CMake)
test <test> [options] Run test
shell [name] Open shell in container
status [name] Check container status
stop [name] Stop and remove container
Examples:
ck-docker start
ck-docker build test_amdgcn_mma
ck-docker build --reconfigure test_amdgcn_mma
ck-docker test test_amdgcn_mma --gtest_filter=*Fp16*
ck-docker shell
Environment:
CK_CONTAINER_NAME - Override default container name (default: ck_<username>_<branch>)
CK_DOCKER_IMAGE - Override Docker image (default: rocm/composable_kernel:ck_ub24.04_rocm7.0.1)
GPU_TARGET - Override GPU target detection (e.g., gfx950, gfx942)
EOF
}
# Start container
cmd_start() {
    local name="${1:-${CONTAINER_NAME}}"
    local docker_image
    docker_image=$(get_docker_image)

    # Check if container exists and is running
    if container_exists "${name}"; then
        if container_is_running "${name}"; then
            echo "Container '${name}' is already running"
            return 0
        else
            echo "Starting existing container '${name}'..."
            docker start "${name}"
            echo "Container started"
            return 0
        fi
    fi

    echo "Creating new Docker container '${name}'..."
    docker run -d \
        --name "${name}" \
        --device=/dev/kfd --device=/dev/dri \
        --security-opt seccomp=unconfined \
        --group-add video \
        -v "${PROJECT_ROOT}":/workspace \
        -w /workspace \
        "${docker_image}" \
        tail -f /dev/null
    echo "Container '${name}' started successfully"
    docker exec "${name}" bash -c "echo 'Working directory:' && pwd"
}
# Build target
cmd_build() {
    local target=""
    local name="${CONTAINER_NAME}"
    local reconfigure=false

    while [[ $# -gt 0 ]]; do
        case $1 in
            --name)
                name="$2"
                shift 2
                ;;
            --reconfigure)
                reconfigure=true
                shift
                ;;
            *)
                target="$1"
                shift
                ;;
        esac
    done

    # Check if container is running
    if ! container_is_running "${name}"; then
        echo "Container '${name}' not running. Starting..."
        cmd_start "${name}"
    fi

    # Reconfigure CMake if requested or if build.ninja doesn't exist
    if [ "$reconfigure" = true ] || ! docker exec "${name}" test -f /workspace/build/build.ninja 2>/dev/null; then
        echo "Detecting GPU target..."
        local gpu_target
        gpu_target=$(detect_gpu_target "${name}")
        if [ "$reconfigure" = true ]; then
            echo "Reconfiguring CMake from scratch for GPU target: ${gpu_target}"
        else
            echo "Configuring build with CMake for GPU target: ${gpu_target}"
        fi
        docker exec "${name}" bash -c "
            cd /workspace || exit 1
            rm -rf /workspace/build
            mkdir /workspace/build
            cd /workspace/build || exit 1
            cmake .. -GNinja \
                -DGPU_TARGETS=${gpu_target} \
                -DCMAKE_BUILD_TYPE=Release \
                -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ \
                -DBUILD_TESTING=ON 2>&1 | tail -30
        "
    fi

    if [ -z "$target" ]; then
        echo "Building all configured targets..."
    else
        echo "Building target: ${target}"
    fi
    docker exec "${name}" bash -c "
        cd /workspace/build || exit 1
        ninja ${target} 2>&1
    "
    echo "Build complete"
}
# Run test
cmd_test() {
    local test_name=""
    local name="${CONTAINER_NAME}"
    local -a test_options=()

    while [[ $# -gt 0 ]]; do
        case $1 in
            --name)
                name="$2"
                shift 2
                ;;
            --gtest_*|--help)
                test_options+=("$1")
                shift
                ;;
            *)
                if [ -z "$test_name" ]; then
                    test_name="$1"
                else
                    test_options+=("$1")
                fi
                shift
                ;;
        esac
    done

    if [ -z "$test_name" ]; then
        echo "Error: test_name required"
        echo "Usage: ck-docker test <test_name> [--name container_name] [gtest_options]"
        return 1
    fi

    # Check if container is running
    if ! container_is_running "${name}"; then
        echo "Error: Container '${name}' not running"
        echo "Start it with: ck-docker start ${name}"
        return 1
    fi

    if ! docker exec "${name}" test -f "/workspace/build/bin/${test_name}" 2>/dev/null; then
        echo "Test executable not found. Building ${test_name}..."
        cmd_build "${test_name}" --name "${name}"
    fi

    echo "Running: ${test_name} ${test_options[*]}"
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    # Build the command with proper quoting
    local cmd="cd /workspace/build && ./bin/${test_name}"
    for opt in "${test_options[@]}"; do
        cmd="${cmd} $(printf '%q' "$opt")"
    done
    docker exec "${name}" bash -c "${cmd}"
}
# Shell
cmd_shell() {
    local name="${1:-${CONTAINER_NAME}}"
    # Check if container is running
    if ! container_is_running "${name}"; then
        echo "Container '${name}' not running. Starting..."
        cmd_start "${name}"
    fi
    echo "Opening shell in '${name}' (type 'exit' to leave)..."
    docker exec -it "${name}" bash
}
# Status
cmd_status() {
    local name="${1:-}"
    local docker_image
    docker_image=$(get_docker_image)

    if [ -z "$name" ]; then
        echo "Composable Kernel Docker Containers:"
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        docker ps -a --filter "ancestor=${docker_image}" \
            --format "table {{.Names}}\t{{.Status}}\t{{.CreatedAt}}" || echo "No containers found"
    else
        # Check container status
        if container_is_running "${name}"; then
            echo "Container '${name}' is RUNNING"
            docker ps --filter "name=^${name}$" --format "table {{.Names}}\t{{.Status}}\t{{.Image}}"
            echo ""
            echo "GPU Information:"
            docker exec "${name}" bash -c "rocm-smi --showproductname 2>/dev/null | head -10 || echo 'No GPU detected'"
        elif container_exists "${name}"; then
            echo "Container '${name}' exists but is STOPPED"
            echo "Start with: ck-docker start ${name}"
        else
            echo "Container '${name}' does NOT exist"
            echo "Create with: ck-docker start ${name}"
        fi
    fi
}
# Stop
cmd_stop() {
    local name="${1:-${CONTAINER_NAME}}"
    # Check if container exists
    if container_exists "${name}"; then
        echo "Stopping and removing container '${name}'..."
        docker stop "${name}" 2>/dev/null || true
        docker rm "${name}" 2>/dev/null || true
        echo "Container stopped and removed"
    else
        echo "Container '${name}' does not exist"
    fi
}
# Main command dispatcher
case "${1:-}" in
    start)
        shift
        cmd_start "$@"
        ;;
    build)
        shift
        cmd_build "$@"
        ;;
    test)
        shift
        cmd_test "$@"
        ;;
    shell)
        shift
        cmd_shell "$@"
        ;;
    status)
        shift
        cmd_status "$@"
        ;;
    stop)
        shift
        cmd_stop "$@"
        ;;
    help|--help|-h)
        show_help
        ;;
    *)
        echo "Unknown command: ${1:-}"
        echo ""
        show_help
        exit 1
        ;;
esac
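The quoting step in cmd_test is worth seeing in isolation: `printf '%q'` escapes each gtest option so that wildcards survive exactly one round of shell evaluation, the round performed by `bash -c` inside the container. A minimal standalone sketch:

```shell
#!/bin/bash
# Round-trip a gtest option through printf '%q', as cmd_test does
# before handing the assembled command line to `bash -c`.
opt='--gtest_filter=*Fp16*'
quoted=$(printf '%q' "$opt")

# One level of shell evaluation (what bash -c performs) restores the
# original argument, wildcards intact and unexpanded.
eval "printf '%s\n' ${quoted}"
```

Without the `%q` escaping, the `*` would be glob-expanded against the contents of /workspace/build before the test binary ever saw it.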

script/tools/common.sh Normal file

@@ -0,0 +1,97 @@
#!/bin/bash
# Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
# SPDX-License-Identifier: MIT
# Common utilities for CK Docker tools
# Shared configuration and helper functions
# Find project root (where .git directory is)
get_project_root() {
    local script_dir="$1"
    cd "${script_dir}/../.." && pwd
}
# Detect git branch and sanitize for Docker naming
get_sanitized_branch() {
    local project_root="$1"
    local branch
    branch=$(cd "${project_root}" && git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '_' | tr -cd 'a-zA-Z0-9_-' || echo "")
    branch=${branch:-unknown}
    # Handle detached HEAD state
    if [ "${branch}" = "HEAD" ]; then
        branch="detached"
    fi
    echo "${branch}"
}
# Get username with fallback
get_username() {
    echo "${USER:-$(whoami 2>/dev/null || echo "user")}"
}
# Generate default container name: ck_<username>_<branch>
get_default_container_name() {
    local project_root="$1"
    local user_name
    local git_branch
    user_name=$(get_username)
    git_branch=$(get_sanitized_branch "${project_root}")
    echo "ck_${user_name}_${git_branch}"
}
# Get container name (respects CK_CONTAINER_NAME env var)
get_container_name() {
    local project_root="$1"
    local default_name
    default_name=$(get_default_container_name "${project_root}")
    echo "${CK_CONTAINER_NAME:-${default_name}}"
}
# Get Docker image (respects CK_DOCKER_IMAGE env var)
get_docker_image() {
    echo "${CK_DOCKER_IMAGE:-rocm/composable_kernel:ck_ub24.04_rocm7.0.1}"
}
# Check if container exists (exact match)
container_exists() {
    local name="$1"
    docker ps -a --filter "name=^${name}$" --format '{{.Names}}' | grep -q "^${name}$"
}
# Check if container is running (exact match)
container_is_running() {
    local name="$1"
    docker ps --filter "name=^${name}$" --format '{{.Names}}' | grep -q "^${name}$"
}
# Detect GPU target in container
detect_gpu_target() {
    local container="$1"
    # Allow override via GPU_TARGET environment variable
    if [ -n "${GPU_TARGET:-}" ]; then
        echo "${GPU_TARGET}"
        return 0
    fi
    docker exec "${container}" bash -c "
        rocminfo 2>/dev/null | grep -oP 'gfx[0-9a-z]+' | head -1 || echo 'gfx950'
    " | tr -d '\r\n'
}
# Ensure container is running, start if needed
ensure_container_running() {
    local container="$1"
    local script_dir="$2"
    if ! container_is_running "${container}"; then
        echo "Container '${container}' not running. Starting with ck-docker..."
        "${script_dir}/ck-docker" start "${container}"
    fi
}
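The branch sanitization in get_sanitized_branch can be exercised without git or Docker; a minimal sketch of the same tr pipeline (the `sanitize` helper name is illustrative, not part of common.sh):

```shell
#!/bin/bash
# Same pipeline get_sanitized_branch applies to `git rev-parse` output:
# slashes become underscores, then every character outside
# [a-zA-Z0-9_-] is dropped, yielding a valid Docker container name.
sanitize() {
    printf '%s' "$1" | tr '/' '_' | tr -cd 'a-zA-Z0-9_-'
}

sanitize 'feature/my.branch#1'   # -> feature_mybranch1
echo
```

The second `tr -cd` pass is what keeps names like `users/jane/fix-123` safe: the slashes are rewritten first, and the hyphen is in the allowed set, so it comes out as `users_jane_fix-123`.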


@@ -0,0 +1,125 @@
# Composable Kernel Build Time Analysis Report
**Generated:** {{ timestamp }}
**Target:** {{ target }}
**Granularity:** {{ granularity }}µs
**Files Analyzed:** {{ num_files }}
## Executive Summary
- **Wall Clock Time:** {{ build_time }} seconds
- **Trace Time:** {{ total_trace_time|us_to_s|round(1) }} seconds
- **Template Instantiation Time:** {{ total_template_time|us_to_s|round(1) }} seconds ({{ (100 * total_template_time / total_trace_time)|round(1) }}% of trace)
- **Total Events Captured:** {{ total_events|format_number }} (across {{ num_files }} file{{ 's' if num_files != 1 else '' }})
- **Total Template Instantiations:** {{ total_instantiations|format_number }}
- **Unique Template Families:** {{ unique_families }}
{% if num_files > 1 -%}
## Per-File Analysis
| File | Events | Template Time (ms) | % of Total |
|------|--------|-------------------|------------|
{% for file in file_stats[:20] -%}
| {{ file.name|truncate(50)|pad(50) }} | {{ "%7d"|format(file.events) }} | {{ "%17.2f"|format(file.template_time|us_to_ms) }} | {{ "%9.1f"|format(100 * file.template_time / total_template_time if total_template_time > 0 else 0) }}% |
{% endfor %}
{% endif -%}
## Compilation Phase Breakdown
| Phase | Time (ms) | Time (s) | % of Total |
|-------|-----------|----------|------------|
{% for phase, dur in phases[:20] -%}
| {{ phase|pad(40) }} | {{ "%9.2f"|format(dur|us_to_ms) }} | {{ "%8.2f"|format(dur|us_to_s) }} | {{ "%9.1f"|format(100 * dur / total_trace_time) }}% |
{% endfor %}
## Top 30 Most Expensive Individual Instantiations
{% if num_files > 1 -%}
| Rank | Template | Type | Time (ms) | File |
|------|----------|------|-----------|------|
{% for inst in top_individual[:30] -%}
| {{ "%4d"|format(loop.index) }} | {{ inst.detail|truncate(50) }} | {{ inst.inst_type|pad(5) }} | {{ "%9.2f"|format(inst.dur|us_to_ms) }} | {{ inst.file|truncate(20) }} |
{% endfor -%}
{% else -%}
| Rank | Template | Type | Time (ms) |
|------|----------|------|-----------|
{% for inst in top_individual[:30] -%}
| {{ "%4d"|format(loop.index) }} | {{ inst.detail|truncate(70) }} | {{ inst.inst_type|pad(5) }} | {{ "%9.2f"|format(inst.dur|us_to_ms) }} |
{% endfor -%}
{% endif %}
## Template Families by Total Time (Top 50)
| Rank | Template Family | Count | Total (ms) | Avg (ms) | % of Total |
|------|-----------------|-------|------------|----------|------------|
{% for name, stats in templates_by_time[:50] -%}
| {{ "%4d"|format(loop.index) }} | {{ name|truncate(43)|pad(43) }} | {{ "%5d"|format(stats.count) }} | {{ "%10.2f"|format(stats.total_dur|us_to_ms) }} | {{ "%8.2f"|format(stats.avg|us_to_ms) }} | {{ "%9.1f"|format(stats.pct) }}% |
{% endfor %}
## Template Families by Instantiation Count (Top 50)
| Rank | Template Family | Count | Total (ms) | Avg (ms) |
|------|-----------------|-------|------------|----------|
{% for name, stats in templates_by_count[:50] -%}
| {{ "%4d"|format(loop.index) }} | {{ name|truncate(43)|pad(43) }} | {{ "%5d"|format(stats.count) }} | {{ "%10.2f"|format(stats.total_dur|us_to_ms) }} | {{ "%8.2f"|format(stats.avg|us_to_ms) }} |
{% endfor %}
## Key Insights
### 1. Template Instantiation Impact
- Template instantiation accounts for {{ (100 * total_template_time / total_trace_time)|round(1) }}% of total trace time
{% if unique_families >= 10 -%}
- Top 10 template families account for {{ top10_pct|round(1) }}% of instantiation time
{% endif %}
### 2. Most Expensive Templates
{% if templates_by_time|length > 0 -%}
- **{{ templates_by_time[0][0] }}**: {{ templates_by_time[0][1].count|format_number }} instantiations, {{ (templates_by_time[0][1].total_dur|us_to_s)|round(2) }}s total
{% endif -%}
{% if templates_by_time|length > 1 -%}
- **{{ templates_by_time[1][0] }}**: {{ templates_by_time[1][1].count|format_number }} instantiations, {{ (templates_by_time[1][1].avg|us_to_ms)|round(2) }}ms average
{% endif %}
## Optimization Recommendations
### High-Impact Targets (by total time)
{% for name, stats in templates_by_time[:5] -%}
**{{ loop.index }}. {{ name }}** - {{ (stats.total_dur|us_to_s)|round(1) }}s total ({{ stats.pct|round(1) }}%)
- {{ stats.count|format_number }} instantiations, {{ (stats.avg|us_to_ms)|round(2) }}ms average
{% if stats.count > 100 -%}
- Strategy: Extern templates - High instantiation count suggests repeated compilation
{% elif stats.avg|us_to_ms > 50 -%}
- Strategy: Template specialization - High individual cost suggests complexity
{% else -%}
- Strategy: Explicit instantiation - Pre-instantiate common configurations
{% endif %}
{% endfor %}
### Frequently Instantiated (optimization candidates)
{% for name, stats in templates_by_count[:5] if stats.count > 100 -%}
**{{ name }}** - {{ stats.count|format_number }} times ({{ (stats.total_dur|us_to_s)|round(2) }}s total)
- Consider: Precompiled headers or extern templates to avoid recompilation
{% endfor %}
### Most Expensive Individual Instantiations
{% for inst in top_individual[:3] -%}
**{{ loop.index }}. {{ inst.detail|truncate(60) }}** - {{ (inst.dur|us_to_ms)|round(1) }}ms
- Strategy: Profile and simplify this specific instantiation
{% endfor %}
## Detailed Statistics
- **Total Unique Templates:** {{ unique_families }}
- **Total Instantiations:** {{ total_instantiations|format_number }}
{% if total_instantiations > 0 -%}
- **Average Instantiation Time:** {{ ((total_template_time // total_instantiations)|us_to_ms)|round(3) }}ms
{% endif -%}
{% if unique_families > 0 -%}
- **Median Template Family Count:** {{ median_count }}
{% endif %}
---
*Report generated using Clang -ftime-trace with {{ granularity }}µs granularity*
*Analysis tool: ck-build-analysis*
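The template above uses several non-built-in filters (`us_to_s`, `us_to_ms`, `format_number`, `pad`) that the analyzer must register on its Jinja2 Environment. Plausible definitions, consistent with how the template applies them, are sketched below; these are assumptions, not the actual analyze_build_trace.py code:

```python
# Hypothetical filter implementations matching the template's usage;
# the real analyze_build_trace.py may define them differently.

def us_to_s(us: float) -> float:
    """Microseconds to seconds."""
    return us / 1_000_000

def us_to_ms(us: float) -> float:
    """Microseconds to milliseconds."""
    return us / 1_000

def format_number(n: int) -> str:
    """Insert thousands separators: 1234567 -> '1,234,567'."""
    return f"{n:,}"

def pad(s, width: int) -> str:
    """Left-justify to a fixed width so Markdown table columns align."""
    return str(s).ljust(width)

# Registration on a jinja2.Environment would look like:
#   env.filters.update(us_to_s=us_to_s, us_to_ms=us_to_ms,
#                      format_number=format_number, pad=pad)
```

`truncate` is a Jinja2 built-in, so only the four filters above need registering before the template will render.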