119 Commits

Author SHA1 Message Date
Allison Vacanti
8d6d934dfe Add default axis names.
Also cleaned up the annoying quirk where `set_type_axes_names` *had*
to be called on all benchmarks with type axes.

Default names are {"T", "U", "V", "W"} for up to four type axes. For
five or more, {"T0", "T1", ...} is used instead.
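
The naming rule above can be sketched as follows. This is a minimal illustration, not the NVBench implementation; `default_type_axis_names` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of NVBench): returns default names for
// `num_axes` type axes, following the rule described in the commit message.
std::vector<std::string> default_type_axis_names(std::size_t num_axes)
{
  static const char *const short_names[] = {"T", "U", "V", "W"};

  std::vector<std::string> names;
  names.reserve(num_axes);
  for (std::size_t i = 0; i < num_axes; ++i)
  {
    if (num_axes <= 4)
    { // Up to four axes: T, U, V, W
      names.emplace_back(short_names[i]);
    }
    else
    { // Five or more axes: T0, T1, ...
      names.push_back("T" + std::to_string(i));
    }
  }
  assert(names.size() == num_axes);
  return names;
}
```

Note that the numbered scheme applies to every axis once there are five or more, so a five-axis benchmark starts at "T0" rather than "T".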
2021-02-19 12:37:05 -05:00
Allison Vacanti
324b0d107e Add "global args" to option parser.
If a benchmark modifier is passed before `--benchmark`, the modifier
will apply to all benchmarks.
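
A rough sketch of the idea, assuming arguments are handled as opaque tokens: anything seen before the first `--benchmark` is treated as global and copied into every benchmark that follows. The `benchmark_args` struct and `split_benchmark_args` function are hypothetical names, and this is not the actual option parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for one benchmark and the modifier tokens that
// apply to it.
struct benchmark_args
{
  std::string name;
  std::vector<std::string> modifiers;
};

// Splits a flat argument list into per-benchmark modifier lists.
// Tokens before the first `--benchmark` are global: each benchmark starts
// with a copy of them. Later tokens attach to the most recent benchmark.
std::vector<benchmark_args>
split_benchmark_args(const std::vector<std::string> &args)
{
  std::vector<std::string> global;
  std::vector<benchmark_args> result;

  for (std::size_t i = 0; i < args.size(); ++i)
  {
    if (args[i] == "--benchmark")
    {
      if (i + 1 >= args.size())
      {
        break; // Missing benchmark name; a real parser would report an error.
      }
      result.push_back({args[++i], global});
    }
    else if (result.empty())
    {
      global.push_back(args[i]);
    }
    else
    {
      result.back().modifiers.push_back(args[i]);
    }
  }
  return result;
}
```

The sketch only covers benchmarks named explicitly on the command line; how global modifiers reach benchmarks that are never named is left out.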
2021-02-19 10:37:00 -05:00
Allison Vacanti
a747982415 Add nvbench::main CMake target.
Linking to this instead of `nvbench::nvbench` will automatically include
the `NVBENCH_MAIN` macro.
2021-02-19 09:34:02 -05:00
Allison Vacanti
2cc9bf41e3 Add demangle<T>() convenience overload. 2021-02-18 23:41:49 -05:00
Allison Vacanti
543488ef75 Make kernel wrapper into an lvalue. 2021-02-18 23:41:23 -05:00
Allison Vacanti
b5443e98c8 Use std::size_t for element counts, buffer size metadata. 2021-02-18 18:23:14 -05:00
Allison Vacanti
7dd46b0021 Update old benchmarks to use nvbench, remove old scaffolding.
Remove the original attempt to adapt gbench to do CUDA stuff.

Update all benchmarks to use some conventions:

- Element count -> "Elements" [16:32]
- Throughput calcs
- Add input buffer column: "Size"
2021-02-18 18:22:50 -05:00
Allison Vacanti
7657036f9c Add helper methods to configure throughput.
Instead of:

```
state.set_element_count(size);
state.set_global_memory_bytes_accessed(
  size * (sizeof(InT) + sizeof(OutT)));
```

do:

```
state.add_element_count(size, "Elements");
state.add_global_memory_read<InT>(size, "InputSize");
state.add_global_memory_write<OutT>(size, "OutputSize");
```

The string arguments are optional. If provided, a new column will
be added to the output with the indicated name and number
of bytes (or elements for `add_element_count`).
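
To illustrate the bookkeeping these helpers imply, here is a standalone sketch. The method names mirror the commit message, but `throughput_sketch` and its members are hypothetical and do not reflect how `nvbench::state` stores this data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the throughput-related parts of a benchmark
// state. An empty column name means "do not add an output column".
struct throughput_sketch
{
  std::size_t element_count = 0;
  std::size_t bytes_read    = 0;
  std::size_t bytes_written = 0;
  std::vector<std::pair<std::string, std::size_t>> columns;

  void add_element_count(std::size_t count, const std::string &column = "")
  {
    element_count += count;
    add_column(column, count); // column reports elements
  }

  template <typename T>
  void add_global_memory_read(std::size_t count, const std::string &column = "")
  {
    const std::size_t bytes = count * sizeof(T);
    bytes_read += bytes;
    add_column(column, bytes); // column reports bytes
  }

  template <typename T>
  void add_global_memory_write(std::size_t count, const std::string &column = "")
  {
    const std::size_t bytes = count * sizeof(T);
    bytes_written += bytes;
    add_column(column, bytes); // column reports bytes
  }

private:
  void add_column(const std::string &name, std::size_t value)
  {
    if (!name.empty())
    {
      columns.emplace_back(name, value);
    }
  }
};
```

Taking the element type as a template parameter removes the manual `sizeof` arithmetic from each benchmark.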
2021-02-18 15:47:59 -05:00
Allison Vacanti
dcd5d1ffa6 Update markdown output format. 2021-02-18 14:44:17 -05:00
Allison Vacanti
ef3e1594eb Implement manual timers.
See the new thrust/sort/basic.cu benchmark for usage.

Other notable changes:

- Updated summary column names:
  - Cold GPU -> GPU Time
  - Cold CPU -> CPU Time
  - Hot GPU  -> Batch GPU
- Removed CPU timings from measure_hot
  - They'd been hidden for a while, and aren't really useful.
- Moved the throughput calcs to measure_cold
  - `timer` will disable `hot` timings, still want throughput
  - `cold` timings make more sense for throughput, global BW numbers
    are meaningless if the data is sitting in L2.
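
For reference, the throughput figures reduce to simple ratios over the measured time. The helpers below are a hypothetical sketch (the names are not NVBench API) and assume the time is given in seconds.

```cpp
#include <cassert>
#include <cstddef>

// Elements processed per second.
double item_rate(std::size_t elements, double seconds)
{
  assert(seconds > 0.0);
  return static_cast<double>(elements) / seconds;
}

// Global memory bandwidth in GiB/s, from total bytes read plus written.
double global_bandwidth_gib(std::size_t bytes, double seconds)
{
  assert(seconds > 0.0);
  return static_cast<double>(bytes) / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Feeding these a cold time means each sample pays for real global memory traffic, which is the point made in the last bullet above.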
2021-02-17 18:48:26 -05:00
Allison Vacanti
385d4f77ba Teach markdown_format about sample_sizes. 2021-02-17 18:34:35 -05:00
Allison Vacanti
8a1f017a4e Inline some methods used in benchmark loops. 2021-02-17 18:34:09 -05:00
Allison Vacanti
f61be70a93 Add initial implementation of exec_tag dispatching.
nvbench::exec_tags are used to request measurement types and share
information about the kernel. They are used to ensure that templated
measurement code is not instantiated unless actually used.

Replaces the nvbench::exec(state, launcher, tags) pattern with:

state.exec(tags, launcher);
state.exec(launcher); // defaults to hot/cold cuda measurements
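
The general tag-dispatch technique can be sketched in plain C++ as below. This is a simplified illustration only: `cold_tag`, `hot_tag`, and `exec_state_sketch` are hypothetical names, the launcher takes no arguments, and the real exec_tags carry more information than empty structs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tag types selecting a measurement kind.
struct cold_tag {};
struct hot_tag {};

struct exec_state_sketch
{
  std::vector<std::string> log; // records which measurements ran

  // Each overload is a member template, so its body is only instantiated
  // for launchers that are actually passed with that tag.
  template <typename Launcher>
  void exec(cold_tag, Launcher &&launcher)
  {
    log.push_back("cold");
    launcher();
  }

  template <typename Launcher>
  void exec(hot_tag, Launcher &&launcher)
  {
    log.push_back("hot");
    launcher();
  }

  // No tag given: run both measurements.
  template <typename Launcher>
  void exec(Launcher &&launcher)
  {
    exec(cold_tag{}, launcher);
    exec(hot_tag{}, launcher);
  }
};
```

A benchmark that only calls `exec(cold_tag{}, ...)` never instantiates the hot-measurement template for its launcher.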
2021-02-16 23:47:36 -05:00
Allison Vacanti
37e753f7b6 Update benchmark macros:
s/NVBENCH_CREATE/NVBENCH_BENCH/g
s/NVBENCH_BENCH_TEMPLATE/NVBENCH_BENCH_TYPES/g

This will fit better once the exec_tags versions are added:

NVBENCH_BENCH
NVBENCH_BENCH_TYPES
NVBENCH_BENCH_FLAGS
NVBENCH_BENCH_TYPES_FLAGS
2021-02-16 16:08:38 -05:00
Allison Vacanti
d12326083d Clean up l2flush initialization. 2021-02-16 12:01:50 -05:00
Allison Vacanti
f46dda0e81 Use noexcept CUDA_CALL check in destructor. 2021-02-16 12:00:04 -05:00
Allison Vacanti
55aa78ce17 Make the use of the blocking_kernel optional.
The blocking kernel breaks Thrust algorithms, which sync internally. I'll
need to add an exec_tag to toggle this.
2021-02-15 21:55:26 -05:00
Allison Vacanti
bb871094c3 Fixes for multidevice/gcc.
- Allow devices to be cleared during benchmark definition.
- Fix various demangling bugs.
2021-02-15 21:26:21 -05:00
Allison Vacanti
8897490a6d Add cxxabi demangling for gcc/clang. 2021-02-15 21:00:09 -05:00
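
A minimal sketch of cxxabi-based demangling, including a `demangle<T>()` style convenience overload. It assumes a GCC or Clang toolchain that provides `<cxxabi.h>`; the function bodies are illustrative and may differ from the NVBench versions.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>
#include <typeinfo>

// Demangles an Itanium-ABI mangled name. Falls back to the input string
// if demangling fails.
std::string demangle(const std::string &mangled)
{
  int status = 0;
  // __cxa_demangle returns a malloc'd buffer that the caller must free.
  std::unique_ptr<char, void (*)(void *)> buffer(
    abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
    std::free);
  return (status == 0 && buffer) ? std::string(buffer.get()) : mangled;
}

// Convenience overload: demangle the name of a type directly.
template <typename T>
std::string demangle()
{
  return demangle(typeid(T).name());
}
```

The exact spelling of demangled library types, such as standard containers, varies between standard library implementations.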
Allison Vacanti
6c67578dcd Implement skip_time and improve logging. 2021-02-15 17:39:46 -05:00
Allison Vacanti
ead8392bce Use NVBENCH_THROW in option_parser.cu. 2021-02-15 17:19:07 -05:00
Allison Vacanti
6cf29b5083 Various small updates and refactorings.
- Collapse nested namespace specifiers.
- Clean up markdown format tables.
2021-02-15 17:18:03 -05:00
Allison Vacanti
d323f569b8 Add termination criteria API.
- min_samples
- min_time
- max_noise
- skip_time (not yet implemented)
- timeout

Refactored s/(trials)|(iters)/samples/s.
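
One plausible way such criteria could combine into a stop decision is sketched below. The struct, the default values, and the precedence of the checks are all assumptions made for illustration; the commit does not specify them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical criteria bundle. Default values are placeholders.
struct termination_criteria
{
  std::size_t min_samples = 10; // collect at least this many samples
  double min_time         = 0.5;  // seconds of accumulated sample time
  double max_noise        = 0.005; // relative standard deviation target
  double timeout          = 15.0;  // seconds; hard upper bound
};

// Returns true when measurement should stop.
// Assumed precedence: the timeout always wins; otherwise all of
// min_samples, min_time, and max_noise must be satisfied.
bool should_stop(const termination_criteria &criteria,
                 std::size_t samples,
                 double total_time,
                 double noise)
{
  assert(total_time >= 0.0);
  if (total_time >= criteria.timeout)
  {
    return true;
  }
  if (samples < criteria.min_samples || total_time < criteria.min_time)
  {
    return false;
  }
  return noise <= criteria.max_noise;
}
```

Treating the timeout as a hard stop keeps a noisy benchmark from running indefinitely.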
2021-02-15 12:04:15 -05:00
Allison Vacanti
e5914ff620 Clean up blocking_kernel.
- Rename release() -> unblock() to avoid confusion with release fences.
- Remove some unused headers.
2021-02-14 16:07:22 -05:00
Allison Vacanti
1cea5e1965 Add and use blocking_kernel. 2021-02-13 11:21:30 -05:00
Allison Vacanti
2125ada770 Call cudaDeviceReset from NVBENCH_MAIN. 2021-02-13 10:01:09 -05:00
Allison Vacanti
3b57127571 Add NVBENCH_CUDA_CALL_NOEXCEPT.
It'll just exit instead of throw. Used in destructors and other
noexcept contexts.
2021-02-13 10:00:15 -05:00
Allison Vacanti
878f1ca4f6 Add --device option. 2021-02-12 21:51:17 -05:00
Allison Vacanti
92cc3b1189 Execute benchmarks on all devices. 2021-02-12 20:53:10 -05:00
Allison Vacanti
5348f65e12 Clean up nvbench::state.
- Replace cref member with ref_wrapper to make movable.
- Use friendship instead of inheritance for testing.
- Add missing [[nodiscard]] annotations.
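
The first bullet relies on a general C++ property: a class with a reference member loses its implicit assignment operators, while `std::reference_wrapper` keeps them. A small sketch with hypothetical types:

```cpp
#include <functional>
#include <type_traits>
#include <utility>

struct benchmark_sketch
{
  int id;
};

// A reference member deletes the implicit copy/move assignment operators.
struct holder_with_cref
{
  const benchmark_sketch &bench;
};

// std::reference_wrapper is assignable, so the holder is too.
struct holder_with_ref_wrapper
{
  std::reference_wrapper<const benchmark_sketch> bench;
};

static_assert(!std::is_move_assignable<holder_with_cref>::value,
              "reference members block assignment");
static_assert(std::is_move_assignable<holder_with_ref_wrapper>::value,
              "reference_wrapper members allow assignment");
```

Assigning one `holder_with_ref_wrapper` to another rebinds the wrapper to a different object; the referenced object itself is not modified.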
2021-02-11 21:27:01 -05:00
Allison Vacanti
9f9c6e5278 Refactor state_generator.
The old implementation was scattered and ad hoc. This one is slightly
less so.

More importantly, refactoring to this design will make it easier to
add device traversal.
2021-02-11 21:24:58 -05:00
Allison Vacanti
4820c557a6 Update docstring. 2021-02-11 21:14:22 -05:00
Allison Vacanti
56d182ad41 Fix float64_axis test.
Changed to use `{:0.5g}` formatting for input strings until I figure
out something better.
2021-02-11 21:13:48 -05:00
Allison Vacanti
3bc8291b28 Allow benchmarks to be specified by index with --benchmark. 2021-02-10 21:40:39 -05:00
Allison Vacanti
2561816f15 Print indices for benchmarks in --list output. 2021-02-10 21:39:23 -05:00
Allison Vacanti
e9ae291736 Add benchmark_manager::get_benchmark(idx). 2021-02-10 21:38:56 -05:00
Allison Vacanti
0477514bb6 Axis spec revamp.
- Add support for single values ("Axis=Value").
- Make other value specs shell friendly:
  - Range: "Axis:(2:10:2)"  -> "Axis=[2:10:2]"
  - List:  "Axis:{2,3,4,5}" -> "Axis=[2,3,4,5]"
  - ":" -> "=" feels more natural
  - "{}()" characters have special meaning in bash.
  - "[]" characters don't require escapes.
  - Using the same braces for both ranges/list is easier to remember,
    only the delimiter changes.
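
A sketch of parsing the new syntax, restricted to integer values for brevity. The `axis_spec` struct and `parse_axis_spec` function are hypothetical; treating the range end as inclusive and defaulting an omitted stride to 1 are assumptions, not documented behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct axis_spec
{
  std::string name;
  std::vector<std::int64_t> values;
};

// Splits `text` on `delim` and converts each field to an integer.
static std::vector<std::int64_t> split_ints(const std::string &text, char delim)
{
  std::vector<std::int64_t> out;
  std::istringstream in(text);
  std::string field;
  while (std::getline(in, field, delim))
  {
    out.push_back(std::stoll(field)); // throws on non-numeric input
  }
  return out;
}

// Accepts "Axis=Value", "Axis=[start:end:stride]", and "Axis=[a,b,c]".
axis_spec parse_axis_spec(const std::string &spec)
{
  const auto eq = spec.find('=');
  if (eq == std::string::npos || eq + 1 >= spec.size())
  {
    throw std::invalid_argument("expected Axis=...");
  }
  axis_spec result{spec.substr(0, eq), {}};
  const std::string rhs = spec.substr(eq + 1);

  if (rhs.front() != '[')
  { // Single value
    result.values.push_back(std::stoll(rhs));
    return result;
  }
  if (rhs.size() < 2 || rhs.back() != ']')
  {
    throw std::invalid_argument("unterminated '['");
  }
  const std::string body = rhs.substr(1, rhs.size() - 2);

  if (body.find(':') != std::string::npos)
  { // Range: same brackets as a list, only the delimiter changes
    const auto fields = split_ints(body, ':');
    if (fields.size() < 2 || fields.size() > 3)
    {
      throw std::invalid_argument("expected [start:end] or [start:end:stride]");
    }
    const std::int64_t stride = fields.size() == 3 ? fields[2] : 1;
    if (stride <= 0)
    {
      throw std::invalid_argument("stride must be positive");
    }
    for (std::int64_t v = fields[0]; v <= fields[1]; v += stride)
    {
      result.values.push_back(v);
    }
  }
  else
  { // List
    result.values = split_ints(body, ',');
  }
  assert(!result.name.empty() || eq == 0);
  return result;
}
```

Because the specs contain no braces or parentheses, they can be passed on a bash command line without quoting in the common case.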
2021-02-10 09:55:50 -05:00
Allison Vacanti
ed658f0cec Make summary move-only.
This prevents subtle bugs like

auto s = state.add_summary("foo");

instead of

auto& s = state.add_summary("foo");
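
The mechanism can be sketched with hypothetical stand-in types: deleting the copy operations turns the accidental `auto s = ...` copy into a compile error, while `auto& s = ...` keeps working.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical move-only summary type.
struct summary_sketch
{
  explicit summary_sketch(std::string n)
      : name(std::move(n))
  {}

  summary_sketch(const summary_sketch &)            = delete;
  summary_sketch &operator=(const summary_sketch &) = delete;
  summary_sketch(summary_sketch &&)                 = default;
  summary_sketch &operator=(summary_sketch &&)      = default;

  std::string name;
  std::string value;
};

// Hypothetical owner that hands out references to stored summaries.
struct summary_owner_sketch
{
  std::vector<summary_sketch> summaries;

  summary_sketch &add_summary(std::string name)
  {
    summaries.emplace_back(std::move(name));
    return summaries.back();
  }
};

// `auto s = owner.add_summary("foo");` would need this to be true.
static_assert(!std::is_copy_constructible<summary_sketch>::value,
              "summary_sketch must not be copyable");
```

With a copyable type, edits made through the accidental copy would be silently lost because the stored summary is never updated.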
2021-02-10 09:19:18 -05:00
Allison Vacanti
bf60ff3f0f Add byte formatting to markdown_format. 2021-02-10 09:17:49 -05:00
Allison Vacanti
696578422f Compact the summary table columns.
Before:

```
| In  | Out | Init | Size |  (Size)   | Cold Trials |  Cold GPU  | GPU Noise |  Cold CPU  | CPU Noise | Hot Trials |  Hot GPU   | Item Rate | GlobalMemUse | PeakGMem |
|-----|-----|------|------|-----------|-------------|------------|-----------|------------|-----------|------------|------------|-----------|--------------|----------|
| I32 | F32 |  I32 | 2^20 |   1048576 |        7863 |   85.60 us |     3.35% |  127.19 us |    10.00% |       7864 |   84.98 us | 12.34 GHz |  91.94 GiB/s |   77.10% |
| I32 | F32 |  I32 | 2^24 |  16777216 |         380 | 1240.13 us |     0.36% | 1316.48 us |     2.90% |        424 | 1236.32 us | 13.57 GHz | 101.11 GiB/s |   84.79% |
| I32 | F32 |  I32 | 2^28 | 268435456 |          51 |   19.67 ms |     0.05% |   19.76 ms |     0.13% |         51 |   19.67 ms | 13.65 GHz | 101.70 GiB/s |   85.29% |
```

After:

```
| In  | Out | Init | Size |  (Size)   |  Cold GPU  | Noise |  Cold CPU  | Noise  | Trials |  Hot GPU   | Trials | Item Rate | GlobalMemUse | PeakGMem |
|-----|-----|------|------|-----------|------------|-------|------------|--------|--------|------------|--------|-----------|--------------|----------|
| I32 | F32 |  I32 | 2^20 |   1048576 |   96.25 us | 2.41% |  141.23 us | 13.84% |   7081 |   96.16 us |   7082 | 10.90 GHz |  81.24 GiB/s |   68.13% |
| I32 | F32 |  I32 | 2^24 |  16777216 | 1406.90 us | 0.27% | 1482.86 us |  2.21% |    338 | 1402.06 us |    374 | 11.97 GHz |  89.15 GiB/s |   74.77% |
| I32 | F32 |  I32 | 2^28 | 268435456 |   22.29 ms | 0.12% |   22.38 ms |  0.22% |     45 |   22.28 ms |     45 | 12.05 GHz |  89.78 GiB/s |   75.29% |
```
2021-02-09 10:04:07 -05:00
Allison Vacanti
cd38a4e9ca Allow duplicate column headers in markdown tables. 2021-02-09 10:03:39 -05:00
Allison Vacanti
d0ad118136 Add implementation of transform_reduce.
No released version of GCC supports this yet.
2021-02-08 14:20:53 -05:00
Allison Vacanti
bf94881477 Add warnings when max_time is exceeded without meeting other criteria. 2021-02-06 10:48:26 -05:00
Allison Vacanti
a0c4480a1e Catch and report exceptions in NVBENCH_MAIN. 2021-02-06 09:34:54 -05:00
Allison Vacanti
40aa60b709 Report total time in log. 2021-02-06 09:32:28 -05:00
Allison Vacanti
478d657124 Clean up the cold convergence implementation. 2021-02-06 09:32:03 -05:00
Allison Vacanti
8c9cc84025 Only consider target time if noise convergence is not available. 2021-02-05 18:44:31 -05:00
Allison Vacanti
f7b985cd6e Rework benchmark termination criteria to use rel stdev convergence. 2021-02-05 18:08:39 -05:00
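
Relative standard deviation is the standard deviation divided by the mean. A minimal sketch follows; `relative_stdev` is a hypothetical helper, and the use of the sample (n - 1) variance is an assumption rather than something the commit states.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Returns stdev / mean for a set of timing samples.
// Returns infinity when the value is undefined (fewer than two samples or
// a zero mean), so that a "noise <= threshold" check fails.
double relative_stdev(const std::vector<double> &samples)
{
  const std::size_t n = samples.size();
  if (n < 2)
  {
    return std::numeric_limits<double>::infinity();
  }
  const double mean =
    std::accumulate(samples.begin(), samples.end(), 0.0) / static_cast<double>(n);
  if (mean == 0.0)
  {
    return std::numeric_limits<double>::infinity();
  }
  double sum_sq = 0.0;
  for (const double x : samples)
  {
    sum_sq += (x - mean) * (x - mean);
  }
  const double variance = sum_sq / static_cast<double>(n - 1);
  assert(variance >= 0.0);
  return std::sqrt(variance) / mean;
}
```

Being unitless, the ratio lets one noise threshold apply to both microsecond and millisecond kernels.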
Allison Vacanti
d8b3c8967c Fix "peak sm clock" descriptor to "default sm clock". 2021-02-05 18:03:29 -05:00
Allison Vacanti
88119aede1 Implement --list/-l, more markdown cleanup. 2021-02-05 16:35:37 -05:00