nvbench/python/test/test_nvbench_compare.py
Oleksandr Pavlyk 1d13b49996 Add scoped filtering and device pairing to nvbench_compare
Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.

Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.

Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.

Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.

Original commit messages folded into this change:

Tweaks for nvbench_compare

1. When JSON files contain multiple entries with the same name and axis values,
   make sure that the script compares corresponding entries.

   The previous logic extracted the first matching entry from the ref data and
   compared the measurements of every such state in cmp against that first
   entry. The change introduces a counter that tracks which occurrence is being
   processed for a particular set of axis values and retrieves the
   corresponding occurrence from ref.
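
As a rough illustration of the occurrence counter described above, the sketch
below pairs the nth compare state of a given name with the nth reference state.
The helper name pair_by_occurrence and the plain-dict states are hypothetical,
and the error text is borrowed from the tests; the real logic in
nvbench_compare.py may differ.

```python
from collections import Counter, defaultdict


def pair_by_occurrence(ref_states, cmp_states):
    """Pair the nth compare state of a name with the nth reference state.

    Hypothetical helper: a sketch of the idea, not the script's actual code.
    """
    ref_counts = Counter(state["name"] for state in ref_states)
    cmp_counts = Counter(state["name"] for state in cmp_states)
    if ref_counts != cmp_counts:
        # Report the mismatch instead of silently dropping extra states.
        raise ValueError("mismatched state occurrences")

    # Group reference states by name, keeping their original order.
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    seen = Counter()  # how many compare states of each name were processed
    pairs = []
    for cmp_state in cmp_states:
        name = cmp_state["name"]
        pairs.append((ref_by_name[name][seen[name]], cmp_state))
        seen[name] += 1
    return pairs
```

With this pairing, a second "state1" entry in cmp is compared against the
second "state1" entry in ref rather than against the first one.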

Scope occurrence matching by device.

Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, so reused IDs in a different order no longer pair against the
wrong reference device.

Require devices in ref and cmp to have the same cardinality

Handle the mismatch when the number of duplicates in ref data is not the same
as in cmp data

Use the pytest monkeypatch fixture to pretend that third-party package
dependencies are available during the test run for nvbench_compare, without
introducing a test-time dependency

Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.

Fix to filter-before-matching.

 - compare_benches() now pairs devices by selected position instead of taking a
   device id.
 - For each device pair, compare_benches() now builds:
     - ref_device_states: matching reference device and axis filters
     - cmp_device_states: matching compare device and axis filters
 - State occurrence counts and duplicate occurrence matching now operate only
   on those filtered per-device lists.
 - Removed the later matches_axis_filters() skip inside the compare-state loop
   because filtering now happens before matching.
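
A minimal sketch of the filter-before-matching flow listed above, assuming
states are plain dicts with "device" and "axis_values" keys as in the tests.
The function names carry a _sketch suffix or are otherwise hypothetical, and
axis_filters is assumed to be a simple name-to-value mapping, which is not
necessarily how the script represents its filter plan.

```python
def matches_axis_filters_sketch(state, axis_filters):
    """True when every requested axis name/value pair is present on the state."""
    values = {axis["name"]: str(axis["value"]) for axis in state["axis_values"]}
    return all(
        values.get(name) == str(wanted) for name, wanted in axis_filters.items()
    )


def filtered_device_states(states, device_id, axis_filters):
    """Keep only states for one device that also pass the axis filters."""
    return [
        state
        for state in states
        if state["device"] == device_id
        and matches_axis_filters_sketch(state, axis_filters)
    ]


def paired_device_states(ref_states, cmp_states, ref_ids, cmp_ids, axis_filters):
    """Pair selected devices by position and filter each side before matching."""
    if len(ref_ids) != len(cmp_ids):
        raise ValueError("selected device counts differ")
    return [
        (
            filtered_device_states(ref_states, ref_id, axis_filters),
            filtered_device_states(cmp_states, cmp_id, axis_filters),
        )
        for ref_id, cmp_id in zip(ref_ids, cmp_ids)
    ]
```

Because occurrence matching only ever sees the filtered per-device lists, a
state removed by --axis can no longer shift which reference occurrence a
compare state is matched against.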

Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.

Introduce device filtering in nvbench_compare

 - --reference-devices all|ID|ID,ID,...
 - --compare-devices all|ID|ID,ID,...
 - Integer lists preserve order and duplicates.
 - Requested IDs are validated against the file-level device list.
 - Filtered reference/compare device counts must match before comparison.
 - compare_benches() pairs selected reference and compare devices by position.
 - Each benchmark validates that requested device IDs are present in its own
   devices list.
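
The accepted filter syntax can be sketched as follows. This is an
approximation reconstructed from the behavior the tests exercise;
parse_device_filter_sketch is a hypothetical name, the exact error wording is
an assumption, and the real parse_device_filter may differ in details such as
case handling.

```python
def parse_device_filter_sketch(value, flag):
    """Parse 'all', a single device id, or a comma-separated list of ids.

    Returns None for 'all', otherwise a list of ints that preserves both
    order and duplicates. Hypothetical sketch, not the script's own parser.
    """
    text = value.strip()
    if text == "all":
        return None

    ids = []
    for token in text.split(","):
        token = token.strip()
        # Empty tokens, signs, and non-numeric text are all rejected here.
        if not token.isdecimal():
            raise ValueError(
                f"{flag} must be 'all', a device id, or a comma-separated "
                f"list of device ids; got {value!r}"
            )
        ids.append(int(token))
    return ids
```

Returning None for "all" keeps the unfiltered case distinguishable from an
explicit selection, which matters when deciding whether a device-section
mismatch is fatal.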

Implemented benchmark-scoped --axis handling.

  - --axis and --benchmark now share an ordered argparse action, so their
    relative CLI order is preserved.
  - -a before any -b becomes a global axis filter.
  - -a after -b <name> applies to that most recent benchmark only.
  - Repeated -b entries are treated as separate filter scopes and combined as
    alternatives for that benchmark.
  - Device filtering remains global and is applied independently.
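
One way to preserve the relative order of --axis and --benchmark is a shared
custom argparse action that appends (kind, value) tuples to a single list. The
sketch below illustrates that idea only; OrderedFilterAction,
filter_actions, and split_scopes are hypothetical names, and the script's own
action and filter-plan builder may be organized differently.

```python
import argparse


class OrderedFilterAction(argparse.Action):
    """Append (dest, value) to one shared list so CLI order is preserved."""

    def __call__(self, parser, namespace, values, option_string=None):
        actions = getattr(namespace, "filter_actions", None)
        if actions is None:
            actions = []
            namespace.filter_actions = actions
        actions.append((self.dest, values))


def split_scopes(actions):
    """Split ordered actions into global axis filters and per-benchmark scopes."""
    global_axes = []
    scopes = []  # list of (benchmark name, axis filters for that scope)
    for kind, value in actions:
        if kind == "benchmark":
            scopes.append((value, []))
        elif scopes:
            scopes[-1][1].append(value)  # -a after -b: most recent benchmark
        else:
            global_axes.append(value)  # -a before any -b: global filter
    return global_axes, scopes


parser = argparse.ArgumentParser()
parser.add_argument("-a", "--axis", dest="axis", action=OrderedFilterAction)
parser.add_argument("-b", "--benchmark", dest="benchmark", action=OrderedFilterAction)

args = parser.parse_args(["-a", "T=1", "-b", "bench1", "-a", "A=2", "-b", "bench2"])
ordered = getattr(args, "filter_actions", [])
global_axes, scopes = split_scopes(ordered)
```

Here "T=1" ends up as a global filter, "A=2" is scoped to bench1, and bench2
gets its own scope with no axis filters.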

Allow non-matching devices for explicit device selection

Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.
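
The policy above can be sketched roughly as below. The predicate mirrors what
the tests assert for require_matching_device_sections; check_device_sections
and both messages are hypothetical, and the script prints its warning rather
than necessarily using the warnings module.

```python
import warnings


def require_matching_device_sections_sketch(reference_filter, compare_filter):
    """True only for the unfiltered all-vs-all case (both filters are None)."""
    return reference_filter is None and compare_filter is None


def check_device_sections(ref_devices, cmp_devices, reference_filter, compare_filter):
    """Return True when device metadata matches; otherwise raise or warn."""
    if ref_devices == cmp_devices:
        return True
    if require_matching_device_sections_sketch(reference_filter, compare_filter):
        # Unfiltered comparison: a device-section mismatch stays fatal.
        raise ValueError("device sections do not match")
    # Explicit selection: report the mismatch but let the comparison proceed.
    warnings.warn("selected device sections differ; comparing anyway")
    return False
```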

Fix for resolve_benchmark_device_ids, add comments

The return value of resolve_benchmark_device_ids now always owns its list.

Use monkeypatch class in set_test_devices helper

Stricter device id validation

Test for device id validation
2026-06-02 11:48:01 -05:00


# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

import importlib.util
import sys
import types
from pathlib import Path

import pytest


@pytest.fixture
def nvbench_compare(monkeypatch):
    class DummyLine:
        def get_color(self):
            return "black"

    pyplot = types.ModuleType("matplotlib.pyplot")
    pyplot.figure = lambda *args, **kwargs: None
    pyplot.xscale = lambda *args, **kwargs: None
    pyplot.yscale = lambda *args, **kwargs: None
    pyplot.xlabel = lambda *args, **kwargs: None
    pyplot.ylabel = lambda *args, **kwargs: None
    pyplot.title = lambda *args, **kwargs: None
    pyplot.plot = lambda *args, **kwargs: [DummyLine()]
    pyplot.fill_between = lambda *args, **kwargs: None
    pyplot.legend = lambda *args, **kwargs: None
    pyplot.show = lambda *args, **kwargs: None
    pyplot.close = lambda *args, **kwargs: None

    matplotlib = types.ModuleType("matplotlib")
    matplotlib.pyplot = pyplot

    monkeypatch.setitem(sys.modules, "matplotlib", matplotlib)
    monkeypatch.setitem(sys.modules, "matplotlib.pyplot", pyplot)
    monkeypatch.setitem(
        sys.modules,
        "seaborn",
        types.SimpleNamespace(set_theme=lambda *args, **kwargs: None),
    )
    monkeypatch.setitem(
        sys.modules, "jsondiff", types.SimpleNamespace(diff=lambda *args, **kwargs: {})
    )
    monkeypatch.setitem(
        sys.modules,
        "tabulate",
        types.SimpleNamespace(
            __version__="0.8.10", tabulate=lambda *args, **kwargs: ""
        ),
    )
    monkeypatch.setitem(
        sys.modules,
        "colorama",
        types.SimpleNamespace(
            Fore=types.SimpleNamespace(
                BLUE="",
                GREEN="",
                RED="",
                RESET="",
                YELLOW="",
            )
        ),
    )

    module_path = Path(__file__).resolve().parents[1] / "scripts" / "nvbench_compare.py"
    spec = importlib.util.spec_from_file_location("nvbench_compare", module_path)
    assert spec is not None
    assert spec.loader is not None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def make_state(
    nvbench_compare, name, *, mean="1.0", noise="0.01", axis_value=None, device=0
):
    return {
        "name": name,
        "device": device,
        "axis_values": []
        if axis_value is None
        else [{"name": "A", "type": "int64", "value": axis_value}],
        "summaries": [
            {
                "tag": nvbench_compare.GPU_TIME_MEAN_TAG,
                "data": [{"name": "value", "type": "float64", "value": mean}],
            },
            {
                "tag": nvbench_compare.GPU_TIME_STDEV_RELATIVE_TAG,
                "data": [{"name": "value", "type": "float64", "value": noise}],
            },
        ],
    }


def make_summary(nvbench_compare, tag, value):
    return {
        "tag": getattr(nvbench_compare, tag),
        "data": [{"name": "value", "type": "float64", "value": value}],
    }


def make_benchmark(states, *, name="bench"):
    devices = []
    for state in states:
        if state["device"] not in devices:
            devices.append(state["device"])

    return {
        "name": name,
        "devices": devices,
        "axes": [{"name": "A", "type": "int64", "flags": ""}]
        if any(state["axis_values"] for state in states)
        else [],
        "states": states,
    }


def set_test_devices(monkeypatch, nvbench_compare, ref_devices=None, cmp_devices=None):
    devices = [{"id": 0, "name": "Test GPU"}]
    monkeypatch.setattr(
        nvbench_compare,
        "all_ref_devices",
        devices if ref_devices is None else ref_devices,
    )
    monkeypatch.setattr(
        nvbench_compare,
        "all_cmp_devices",
        devices if cmp_devices is None else cmp_devices,
    )
    monkeypatch.setattr(nvbench_compare, "config_count", 0)
    monkeypatch.setattr(nvbench_compare, "pass_count", 0)
    monkeypatch.setattr(nvbench_compare, "improvement_count", 0)
    monkeypatch.setattr(nvbench_compare, "regression_count", 0)
    monkeypatch.setattr(nvbench_compare, "unknown_count", 0)


def make_filter_plan(nvbench_compare, filter_actions=None):
    return nvbench_compare.build_benchmark_filter_plan(filter_actions or [])


def test_compare_benches_accepts_matching_duplicate_state_counts(
    monkeypatch, nvbench_compare
):
    set_test_devices(monkeypatch, nvbench_compare)
    ref_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state1"),
                make_state(nvbench_compare, "state1"),
                make_state(nvbench_compare, "state2"),
            ]
        )
    ]
    cmp_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state1", mean="1.005"),
                make_state(nvbench_compare, "state1", mean="1.005"),
                make_state(nvbench_compare, "state2", mean="1.005"),
            ]
        )
    ]

    nvbench_compare.compare_benches(
        ref_benches,
        cmp_benches,
        threshold=0.0,
        plot_along=None,
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(nvbench_compare),
        no_color=True,
    )

    assert nvbench_compare.config_count == 3
    assert nvbench_compare.pass_count == 3
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 0
    assert nvbench_compare.unknown_count == 0


def test_compare_benches_rejects_swapped_duplicate_state_counts(
    monkeypatch, nvbench_compare
):
    set_test_devices(monkeypatch, nvbench_compare)
    ref_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state1"),
                make_state(nvbench_compare, "state1"),
                make_state(nvbench_compare, "state1"),
                make_state(nvbench_compare, "state2"),
                make_state(nvbench_compare, "state2"),
            ]
        )
    ]
    cmp_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state1"),
                make_state(nvbench_compare, "state1"),
                make_state(nvbench_compare, "state2"),
                make_state(nvbench_compare, "state2"),
                make_state(nvbench_compare, "state2"),
            ]
        )
    ]

    with pytest.raises(ValueError, match="mismatched state occurrences"):
        nvbench_compare.compare_benches(
            ref_benches,
            cmp_benches,
            threshold=0.0,
            plot_along=None,
            plot=False,
            dark=False,
            filter_plan=make_filter_plan(nvbench_compare),
            no_color=True,
        )


def test_compare_benches_matches_duplicate_states_after_axis_filter(
    monkeypatch, nvbench_compare
):
    set_test_devices(monkeypatch, nvbench_compare)
    ref_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state", mean="1.0", axis_value=1),
                make_state(nvbench_compare, "state", mean="2.0", axis_value=2),
            ]
        )
    ]
    cmp_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state", mean="2.0", axis_value=2),
                make_state(nvbench_compare, "state", mean="1.0", axis_value=1),
            ]
        )
    ]

    nvbench_compare.compare_benches(
        ref_benches,
        cmp_benches,
        threshold=0.0,
        plot_along=None,
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(nvbench_compare, [("axis", "A=2")]),
        no_color=True,
    )

    assert nvbench_compare.config_count == 1
    assert nvbench_compare.pass_count == 1
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 0
    assert nvbench_compare.unknown_count == 0


def test_compare_benches_skips_non_finite_centers(monkeypatch, nvbench_compare):
    set_test_devices(monkeypatch, nvbench_compare)
    ref_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "finite", mean="1.0"),
                make_state(nvbench_compare, "nan", mean="nan"),
                make_state(nvbench_compare, "inf", mean="inf"),
            ]
        )
    ]
    cmp_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "finite", mean="1.0"),
                make_state(nvbench_compare, "nan", mean="1.0"),
                make_state(nvbench_compare, "inf", mean="1.0"),
            ]
        )
    ]

    nvbench_compare.compare_benches(
        ref_benches,
        cmp_benches,
        threshold=0.0,
        plot_along=None,
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(nvbench_compare),
        no_color=True,
    )

    assert nvbench_compare.config_count == 1
    assert nvbench_compare.pass_count == 1
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 0
    assert nvbench_compare.unknown_count == 0


def test_compare_benches_prefers_median_and_iqr_when_available(
    monkeypatch, nvbench_compare
):
    set_test_devices(monkeypatch, nvbench_compare)
    ref_state = make_state(nvbench_compare, "state", mean="1.0", noise="0.01")
    ref_state["summaries"].extend(
        [
            make_summary(nvbench_compare, "GPU_TIME_MEDIAN_TAG", "1.0"),
            make_summary(nvbench_compare, "GPU_TIME_IR_RELATIVE_TAG", "0.01"),
        ]
    )
    cmp_state = make_state(nvbench_compare, "state", mean="1.0", noise="0.01")
    cmp_state["summaries"].extend(
        [
            make_summary(nvbench_compare, "GPU_TIME_MEDIAN_TAG", "1.2"),
            make_summary(nvbench_compare, "GPU_TIME_IR_RELATIVE_TAG", "0.01"),
        ]
    )

    nvbench_compare.compare_benches(
        [make_benchmark([ref_state])],
        [make_benchmark([cmp_state])],
        threshold=0.0,
        plot_along=None,
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(nvbench_compare),
        no_color=True,
    )

    assert nvbench_compare.config_count == 1
    assert nvbench_compare.pass_count == 0
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 1
    assert nvbench_compare.unknown_count == 0


def test_compare_benches_marks_unavailable_noise_unknown(monkeypatch, nvbench_compare):
    set_test_devices(monkeypatch, nvbench_compare)
    missing_noise_ref = make_state(nvbench_compare, "missing_noise")
    missing_noise_ref["summaries"] = [
        make_summary(nvbench_compare, "GPU_TIME_MEAN_TAG", "1.0")
    ]
    missing_noise_cmp = make_state(nvbench_compare, "missing_noise")
    missing_noise_cmp["summaries"] = [
        make_summary(nvbench_compare, "GPU_TIME_MEAN_TAG", "1.001")
    ]
    null_noise_ref = make_state(nvbench_compare, "null_noise")
    null_noise_ref["summaries"] = [
        make_summary(nvbench_compare, "GPU_TIME_MEAN_TAG", "1.0"),
        make_summary(nvbench_compare, "GPU_TIME_STDEV_RELATIVE_TAG", None),
    ]
    null_noise_cmp = make_state(nvbench_compare, "null_noise")
    null_noise_cmp["summaries"] = [
        make_summary(nvbench_compare, "GPU_TIME_MEAN_TAG", "1.001"),
        make_summary(nvbench_compare, "GPU_TIME_STDEV_RELATIVE_TAG", None),
    ]

    nvbench_compare.compare_benches(
        [make_benchmark([missing_noise_ref, null_noise_ref])],
        [make_benchmark([missing_noise_cmp, null_noise_cmp])],
        threshold=0.0,
        plot_along=None,
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(nvbench_compare),
        no_color=True,
    )

    assert nvbench_compare.config_count == 2
    assert nvbench_compare.pass_count == 0
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 0
    assert nvbench_compare.unknown_count == 2


def test_plot_along_skips_states_without_selected_axis(monkeypatch, nvbench_compare):
    set_test_devices(monkeypatch, nvbench_compare)
    ref_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "with_axis", axis_value=1),
                make_state(nvbench_compare, "without_axis"),
            ]
        )
    ]
    cmp_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "with_axis", axis_value=1),
                make_state(nvbench_compare, "without_axis"),
            ]
        )
    ]

    nvbench_compare.compare_benches(
        ref_benches,
        cmp_benches,
        threshold=0.0,
        plot_along="A",
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(nvbench_compare),
        no_color=True,
    )

    assert nvbench_compare.config_count == 2
    assert nvbench_compare.pass_count == 2
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 0
    assert nvbench_compare.unknown_count == 0


def test_device_filter_parser_accepts_all_and_duplicate_ids(nvbench_compare):
    assert nvbench_compare.parse_device_filter(" all ", "--reference-devices") is None
    assert nvbench_compare.parse_device_filter("0", "--reference-devices") == [0]
    assert nvbench_compare.parse_device_filter("0, 2,0", "--reference-devices") == [
        0,
        2,
        0,
    ]


@pytest.mark.parametrize(
    "device_arg",
    [
        "",
        " ",
        "gpu",
        "-1",
        "0,gpu",
        "0,-1",
        "0,",
        ",0",
    ],
)
def test_device_filter_parser_rejects_invalid_values(nvbench_compare, device_arg):
    with pytest.raises(ValueError, match="must be 'all'"):
        nvbench_compare.parse_device_filter(device_arg, "--reference-devices")


def test_explicit_device_filters_downgrade_device_mismatch_to_warning(nvbench_compare):
    assert nvbench_compare.require_matching_device_sections(None, None)
    assert not nvbench_compare.require_matching_device_sections([0], None)
    assert not nvbench_compare.require_matching_device_sections(None, [1])
    assert not nvbench_compare.require_matching_device_sections([0], [1])


def test_compare_benches_pairs_filtered_devices_by_position(
    monkeypatch, nvbench_compare
):
    set_test_devices(
        monkeypatch,
        nvbench_compare,
        ref_devices=[
            {"id": 0, "name": "Reference GPU 0"},
            {"id": 1, "name": "Reference GPU 1"},
        ],
        cmp_devices=[
            {"id": 0, "name": "Compare GPU 0"},
            {"id": 1, "name": "Compare GPU 1"},
        ],
    )
    ref_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "Device=0", mean="1.0", device=0),
                make_state(nvbench_compare, "Device=1", mean="9.0", device=1),
            ]
        )
    ]
    cmp_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "Device=0", mean="9.0", device=0),
                make_state(nvbench_compare, "Device=1", mean="1.0", device=1),
            ]
        )
    ]

    nvbench_compare.compare_benches(
        ref_benches,
        cmp_benches,
        threshold=0.0,
        plot_along=None,
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(nvbench_compare),
        no_color=True,
        reference_device_filter=[0],
        compare_device_filter=[1],
    )

    assert nvbench_compare.config_count == 1
    assert nvbench_compare.pass_count == 1
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 0
    assert nvbench_compare.unknown_count == 0


def test_axis_filter_applies_to_most_recent_benchmark(monkeypatch, nvbench_compare):
    set_test_devices(monkeypatch, nvbench_compare)
    ref_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state", mean="1.0", axis_value=1),
                make_state(nvbench_compare, "state", mean="2.0", axis_value=2),
            ],
            name="bench1",
        ),
        make_benchmark(
            [
                make_state(nvbench_compare, "state", mean="3.0", axis_value=1),
                make_state(nvbench_compare, "state", mean="4.0", axis_value=2),
            ],
            name="bench2",
        ),
    ]
    cmp_benches = [
        make_benchmark(
            [
                make_state(nvbench_compare, "state", mean="1.0", axis_value=1),
                make_state(nvbench_compare, "state", mean="2.0", axis_value=2),
            ],
            name="bench1",
        ),
        make_benchmark(
            [
                make_state(nvbench_compare, "state", mean="3.0", axis_value=1),
                make_state(nvbench_compare, "state", mean="4.0", axis_value=2),
            ],
            name="bench2",
        ),
    ]

    nvbench_compare.compare_benches(
        ref_benches,
        cmp_benches,
        threshold=0.0,
        plot_along=None,
        plot=False,
        dark=False,
        filter_plan=make_filter_plan(
            nvbench_compare,
            [("benchmark", "bench1"), ("axis", "A=2"), ("benchmark", "bench2")],
        ),
        no_color=True,
    )

    assert nvbench_compare.config_count == 3
    assert nvbench_compare.pass_count == 3
    assert nvbench_compare.improvement_count == 0
    assert nvbench_compare.regression_count == 0
    assert nvbench_compare.unknown_count == 0


def test_main_returns_success_exit_code_when_regressions_are_detected(
    monkeypatch, capsys, nvbench_compare
):
    devices = [{"id": 0, "name": "Test GPU"}]
    ref_root = {
        "devices": devices,
        "benchmarks": [
            make_benchmark([make_state(nvbench_compare, "state", mean="1.0")])
        ],
    }
    cmp_root = {
        "devices": devices,
        "benchmarks": [
            make_benchmark([make_state(nvbench_compare, "state", mean="1.2")])
        ],
    }

    def read_file(path):
        return ref_root if path == "ref.json" else cmp_root

    monkeypatch.setattr(nvbench_compare.reader, "read_file", read_file)
    monkeypatch.setattr(sys, "argv", ["nvbench_compare", "ref.json", "cmp.json"])

    assert nvbench_compare.main() == 0
    assert nvbench_compare.regression_count == 1
    assert (
        "Regression (abs(%Diff) > max_noise, %Diff > 0): 1" in capsys.readouterr().out
    )