joleuger/vuinputd

mirror of https://github.com/joleuger/vuinputd.git synced 2026-06-29 10:47:17 +00:00

Johannes Leupolz 8d4a0c9413 Started to implement poll

2026-04-02 22:18:41 +00:00

Design Document

1. Introduction

This project provides a safe, general-purpose way to run Sunshine, Steam Input, and other applications that use /dev/uinput inside containers — including systemd-nspawn, Docker, LXC, Podman, and similar runtimes.

Applications like Sunshine require creating virtual input devices (/dev/uinput) for keyboards, mice, and controllers.
Naively bind-mounting /dev/uinput from the host into a container breaks isolation: a container could create devices visible to other containers or even the host, leading to unwanted input injection and security risks.

vuinputd introduces a mediated /dev/uinput proxy that preserves isolation without kernel changes.

2. Architecture

Normally, applications open /dev/uinput directly to create virtual event devices such as /dev/input/event9:

sequenceDiagram
uinput apps->>uinput (kernel): open /dev/uinput and setup
create participant eventx
uinput (kernel)->>eventx: create /dev/input/eventx
uinput (kernel)->>libinput/game: announce new device via udev
libinput/game->>eventx: open /dev/input/eventx

vuinputd provides a virtual /dev/vuinput implemented via CUSE (Character Device in Userspace). This device can be bind-mounted into a container as /dev/uinput, so applications operate normally:


sequenceDiagram
box transparent Host
participant uinput (kernel)
participant udevd
participant vuinputd
participant vuinput (host)
end

box transparent Container
participant uinput (container)
participant uinput apps
participant eventX
participant libinput/game
end

vuinputd->>vuinput (host): create /dev/vuinput with cuse
uinput apps->>uinput (container): open /dev/uinput and setup
uinput (container)-->vuinput (host): is equal (bind mount)
vuinput (host)->>vuinputd: forward data
vuinputd->>uinput (kernel): forward data
uinput (kernel)->>eventX: create /dev/input/eventX
uinput (kernel)->>udevd: announce new device via netlink
udevd->>vuinputd: announce new device via netlink
vuinputd->>libinput/game: announce new device via netlink
libinput/game->>eventX: open /dev/input/eventX

vuinputd forwards udev events into the container via netlink. Without this, the game in the container would not see them whenever its network namespace differs from that of udevd.

3. Design Decisions

This section describes why the architecture works, which invariants it relies on, and why the implementation is correct given the constraints of /dev/uinput, udev, containers, CUSE, and Rust’s async model.

The key idea: vuinputd linearizes the requests to create or destroy devices from a container, and replays the linearized device generation actions in the container in a strict order per container.

3.1 Source of Truth: The CUSE Device (`/dev/vuinput`)

Decision

The only authoritative source of truth about devices is the set of active clients connected to the CUSE implementation (/dev/vuinput).

Rationale

Clients intentionally open /dev/uinput to create virtual devices. Inside the container, that path is a bind mount of the host's /dev/vuinput, so every such open reaches vuinputd.

Therefore:

Every relevant device originates from an explicit open() request
Clients must communicate the device’s lifetime to vuinputd (via file handle lifetime)
The daemon can unambiguously map:
- which container the client belongs to (from CUSE file handle metadata)
- which input devices belong to which client
- when a client terminates (file descriptor closes)
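
The mapping described above can be pictured as a small registry keyed by the CUSE file handle. The following is a minimal sketch, not the daemon's actual data structure: `ClientRegistry`, `ContainerId`, and the method names are hypothetical, and the container is identified by a plain string for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical container identifier; the real daemon derives it from CUSE metadata.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ContainerId(pub String);

#[derive(Debug, Default)]
pub struct ClientRegistry {
    // file handle -> (owning container, device nodes created through this handle)
    clients: HashMap<u64, (ContainerId, Vec<String>)>,
}

impl ClientRegistry {
    /// Called on open(): the file handle becomes the unit of ownership.
    pub fn on_open(&mut self, fh: u64, container: ContainerId) {
        self.clients.insert(fh, (container, Vec::new()));
    }

    /// Called once the host device exists; unknown handles are rejected.
    pub fn on_device_created(&mut self, fh: u64, devnode: &str) -> bool {
        match self.clients.get_mut(&fh) {
            Some((_, devices)) => {
                devices.push(devnode.to_string());
                true
            }
            None => false,
        }
    }

    /// Called on release(): returns everything that now has to be cleaned up.
    pub fn on_close(&mut self, fh: u64) -> Option<(ContainerId, Vec<String>)> {
        self.clients.remove(&fh)
    }
}
```

Because the entry is removed on close, a second close of the same handle yields `None`, so cleanup cannot be triggered twice from this source.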

3.2 Event Dispatcher and Job Engine

The system uses a single-threaded job engine that processes events sequentially. This engine runs:

udev events
internal jobs (hotplug propagation, cleanup, reconciliation)

CUSE events (open, ioctl, write, close) happen in a separate thread and may trigger jobs in the job engine.

Decision

Use a “logically single-threaded” dispatcher with async/await for I/O.

Why this is correct

Even though Rust’s async runtime may move tasks between OS threads, the dispatcher ensures:

only one mutation happens at a time
jobs should be relatively short or have async waiting points (awaits)
async I/O does not break ordering, because state mutations always occur inside the central job loop

Invariants ensured

No two jobs mutate global state concurrently
All jobs are serialized
Await points do not allow interleaving jobs affecting the same container
Awaiting in one container allows progress in another container

This is the core of correctness.

Implementation Details on the job engine

This subsection documents the concrete behaviour, invariants and policies implemented by the Dispatcher (the job engine). It is intentionally precise and maps to the code: src/jobs/* and src/jobs/job.rs.

What the Dispatcher is

A logically single-threaded job executor that serializes all state-mutating work.
Implementation lives in src/jobs/job.rs (type Dispatcher) and src/jobs/*.

Core job targets / types

JobTarget::Host — host-global jobs (maintenance, one-off jobs).
JobTarget::BackgroundLoop — long-lived background tasks (udev monitor loop).
JobTarget::Container(RequestingProcess) — per-container jobs, strictly ordered per container.

Common concrete jobs (examples)

InjectInContainerJob — create device node in container, write udev runtime data, send udev add.
RemoveFromContainerJob — remove device node, delete runtime data, send udev remove.
MonitorBackgroundLoop — reads host udev events and populates EVENT_STORE.

Ordering guarantees

Jobs are FIFO per JobTarget.
BackgroundLoop jobs are spawned independent of per-target queues.
The dispatcher may spawn new per-target loops lazily on first job.
No two jobs for the same JobTarget run concurrently.

Why this matters

Strict serialization per target prevents device lifecycle races (create/remove) for a given container.
Global or host-level jobs are separate so they don't block per-container sequencing.
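
To illustrate the per-target FIFO guarantee, here is a minimal sketch that swaps in OS threads and `std::sync::mpsc` for the async tasks and `async_channel` queues named in the code pointers. The simplified `JobTarget` (with a numeric container id) and the `dispatch`/`shutdown` methods are assumptions made for this example, not the real API in src/jobs/job.rs.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Simplified stand-in for the JobTarget enum described above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum JobTarget {
    Host,
    Container(u32), // hypothetical: container identified by a numeric id
}

type Job = Box<dyn FnOnce() + Send + 'static>;

#[derive(Default)]
pub struct Dispatcher {
    queues: HashMap<JobTarget, Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl Dispatcher {
    /// Enqueue a job; the per-target loop is spawned lazily on first use.
    pub fn dispatch(&mut self, target: JobTarget, job: Job) {
        let workers = &mut self.workers;
        let tx = self.queues.entry(target).or_insert_with(|| {
            let (tx, rx) = channel::<Job>();
            // One loop per target: jobs for that target run strictly FIFO,
            // and never concurrently with each other.
            workers.push(thread::spawn(move || {
                for job in rx {
                    job();
                }
            }));
            tx
        });
        tx.send(job).expect("target loop terminated unexpectedly");
    }

    /// Close all queues and wait until every queued job has run.
    pub fn shutdown(self) {
        drop(self.queues);
        for worker in self.workers {
            worker.join().expect("target loop panicked");
        }
    }
}
```

Jobs for different targets may interleave freely in this sketch; only jobs sharing a target are ordered, which matches the guarantee listed above.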

Code pointers

Dispatcher implementation: src/jobs/job.rs (Dispatcher, get_or_spawn_target_loop, job_target_loop)
Per-target job queues: async_channel::unbounded()
Background loop registration: JobTarget::BackgroundLoop special-case spawning
Event storage: src/monitor_udev.rs (EVENT_STORE) — jobs read and consume entries from it
Example jobs: src/container/inject_in_container_job.rs, src/container/remove_from_container_job.rs, src/monitor_udev.rs (background)

3.3 Combined Queue: Creation, Updates, Cleanup

Decision

A single queue per container holds all jobs affecting this container.

Why a single queue is correct

Because events must follow strict causality:

A CUSE open → must be followed by corresponding uinput device creation
A cleanup (fd closed) → must be applied before the device is reused
forwarding events → must happen after device creation is fully committed

Executing the jobs directly would create opportunities for:

unintended interleavings or reorderings
partially applied lifecycle transitions
delivering udev signals before device registration
cleanup before creation
or worse — cleanup of a new device based on stale IDs

A single queue avoids all of this.

Why ordering issues cannot happen

Because:

the dispatcher processes events strictly FIFO
each mutation step is atomic
every step sees consistent internal state
cleanup is just another state transition, not a special phase

3.4 Device State Model and Convergence

Every device is represented by a state machine:

Nonexistent → Creating → Live → PendingCleanup → Removed

The key correctness property:

Decision

State transitions are monotonic and idempotent.

Meaning

If the dispatcher sees any event (udev add/remove, client close, reconciliation trigger), it will:

re-evaluate the device's intended state
re-evaluate the observed state
compute the delta
run the next monotonic transition

This ensures:

No transition can "undo" a future one
Unexpected udev events (late, missing, reordered) cannot create inconsistencies
A device will eventually reach correct state regardless of event order

Why this guarantees correctness

As long as:

intended state is derived solely from CUSE clients
observed state is derived from udev
transitions are deterministic and monotonic

the system always converges to the correct overall state. This is CRDT-like behaviour (convergent replicated state machine), but simpler.
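
A minimal sketch of such a monotonic transition function follows. The state names are taken from the state machine above; the `Observation` struct, its two boolean inputs, and the exact transition table are simplifying assumptions for illustration, not the daemon's real reconciliation logic.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum DeviceState {
    Nonexistent,
    Creating,
    Live,
    PendingCleanup,
    Removed,
}

/// Inputs to one reconciliation step (illustrative names).
#[derive(Debug, Clone, Copy)]
pub struct Observation {
    /// Intended state: does a CUSE client still hold the handle?
    pub client_open: bool,
    /// Observed state: does udev currently report the device?
    pub udev_present: bool,
}

pub fn next_state(current: DeviceState, obs: Observation) -> DeviceState {
    use DeviceState::*;
    let proposed = match (current, obs.client_open, obs.udev_present) {
        (Nonexistent, true, _) => Creating,
        (Creating, true, true) => Live,
        // Client gone: begin cleanup no matter what udev currently reports.
        (Creating, false, _) | (Live, false, _) => PendingCleanup,
        // udev no longer reports the device, so the removal is complete.
        (PendingCleanup, _, false) => Removed,
        // Everything else, including a stray udev "remove" while the client
        // is still connected, leaves the state unchanged.
        (state, _, _) => state,
    };
    // Monotonicity guard: the derived ordering forbids moving backwards.
    proposed.max(current)
}
```

Replaying the same observation repeatedly reaches a fixed point after at most a few steps, and a late udev "remove" for a device whose client is still connected changes nothing.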

3.5 Integration with Container Runtimes

Decision

Every host-created /dev/input/eventX is:

tagged with ID_SEAT=seat_vuinput
stripped of ID_INPUT_KEYBOARD=1, ID_INPUT_MOUSE=1
placed into the correct container’s device namespace via bind-mount, cgroup association, or namespace logic

Rationale

The host must:

see device nodes (kernel requires this)
but must not consume them
while containers must believe they were created locally

Why tagging is correct

On the host:

libinput ignores devices with no ID_INPUT_*
tag-based routing ensures correct multi-container isolation
per-container udev forwarding ensures applications like SDL/libinput behave normally

Inside containers:

the device node exists
udev hotplug events are synthesized and delivered
container-side libraries operate normally

3.6 Cleanups and Race-Free Destruction

Decision

Cleanup is triggered by:

CUSE fd closure
udev “remove” events
container teardown
daemon shutdown
late reconciliation jobs

Cleanup does not need to be a dedicated final phase — it is part of normal operation.

Why no race conditions

Because:

cleanup is just another serialized job in the dispatcher
cleanup transitions are monotonic (“Live → PendingCleanup → Removed”)
device destruction happens only after:
- CUSE client is gone
- forwarding is complete
- all references are released
udev “remove” events are treated as hints, not authoritative commands

Result

No “use after free”, no double cleanup, no ghost devices.

3.7 Why This Implementation Is Correct

The design is correct because it satisfies:

Correctness Criteria

Isolation Containers cannot interfere with each other or with host input.
Safety Device lifecycle is serialized, deterministic, and race-free.
Eventual Convergence Regardless of event order, the system converges to the correct set of live devices.
Compatibility Applications requiring /dev/uinput behave identically to native host execution.

Proof Sketch

Because:

the dispatcher is single-threaded at the logical mutation level
intended state is derived from a single authoritative source (CUSE)
observed state from udev never overrides intended state
transitions are monotonic and idempotent
cleanup is serialized and non-destructive to future transitions

→ The system behaves like a deterministic state machine and cannot diverge.

3.8 Implementation notes on the CUSE front-end (open/write/ioctl/release)

Summary The CUSE front-end implements /dev/vuinput and maps guest operations to host actions. It is not a passive pipe; it holds per-handle state and must obey the dispatcher rules: short-running operations in CUSE callbacks (data-plane), heavy or state-mutating operations scheduled as dispatcher jobs (control-plane). The module therefore has two responsibilities:

Per-handle data plane: parse/wrap ioctls, translate writes (input_event → uinput), handle legacy vs compat event formats, produce SYN events.
Control-plane dispatching: schedule InjectInContainerJob, RemoveFromContainerJob (and other lifecycle jobs) and wait on their completion.

Blocking / IO in callbacks

CUSE callbacks must not perform long-blocking work on the FUSE thread. Long operations (mknod, writing /run/udev/data, sending netlink, waiting for container namespace exec) must be executed in jobs dispatched to the Dispatcher. If a callback must wait for job completion, it must use a small wait primitive (condvar) to block only the caller thread for as little as necessary and avoid locking dispatcher mutexes while waiting.

Why: reduce risk of deadlock and avoid starving other FUSE callbacks.

IOCTL handling policy

Variable-length ioctls are handled by using fuse_reply_ioctl_retry to request correct buffer sizing. The callback must validate sizes and respond with fuse_reply_ioctl_retry only when necessary, otherwise reply directly. Special-case ioctls (UI_DEV_CREATE, UI_DEV_DESTROY, UI_DEV_SETUP, UI_GET_SYSNAME, UI_GET_VERSION) must be handled on the data plane and schedule control jobs for side effects.

Compat/Alignment & memory-safety

All pointer-to-userdata handling must be done with read_unaligned or copying into properly aligned stack locals before creating slices or reinterpreting them. Do not retain pointers into ephemeral stack memory across writes; create owned buffers for any data that must survive until the next syscall.
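
As an illustration of this rule, the sketch below copies one event out of a possibly misaligned client buffer with `ptr::read_unaligned`. The `InputEvent` struct mirrors the 64-bit `struct input_event` layout (an assumption that holds on x86_64 and aarch64), and `read_event` is a hypothetical helper name.

```rust
use std::mem::size_of;
use std::ptr;

/// Mirror of the 64-bit `struct input_event` layout (assumed: x86_64 / aarch64).
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InputEvent {
    pub tv_sec: i64,
    pub tv_usec: i64,
    pub type_: u16,
    pub code: u16,
    pub value: i32,
}

/// Copy one event out of a client buffer that may be misaligned.
pub fn read_event(buf: &[u8]) -> Option<InputEvent> {
    if buf.len() < size_of::<InputEvent>() {
        return None; // validate the size before touching the memory
    }
    // SAFETY: the length check guarantees enough readable bytes, every bit
    // pattern is a valid InputEvent, and read_unaligned has no alignment
    // requirement. The result is an owned, properly aligned copy.
    Some(unsafe { ptr::read_unaligned(buf.as_ptr() as *const InputEvent) })
}
```

The returned value is owned, so nothing points back into the request buffer once the callback has replied.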

Error reporting & log dedup

CUSE front-end deduplicates repeating errors (write failures) to reduce log spam but must still emit at least one full diagnostic with device identifiers (filehandle, devnode, major:minor, container). Deduplication should not hide critical first-occurrence context.

Why: helps debugging without overwhelming logs.

Response semantics

Use the correct FUSE reply: fuse_reply_open, fuse_reply_write, fuse_reply_ioctl for success; fuse_reply_err for error codes; fuse_reply_none for release where appropriate. Do not reply with error code 0 using fuse_reply_err — prefer fuse_reply_none or the matching success reply.

Why: avoid confusing FUSE/kernel semantics and accidental error returns.

Resource lifecycle & refcounts

Per-handle state is stored as Arc<Mutex<VuInputState>>. CUSE callbacks must hold only short-lived locks. When a handle is closed, the release callback must remove the state via a dispatcher job if needed and then release its Arc; any long-running cleanup must be scheduled.

Why: prevents deadlocks and reference cycles.

Blocking while awaiting job completion

If a callback synchronously waits for job completion (the current implementation uses a Condvar awaiter), it must not hold any global locks (Dispatcher lock, global state lock) while waiting. The wait must be limited and should log timeouts when exceeded.

Why: prevents deadlocks (dispatcher needs that same mutex to execute jobs).
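
A minimal sketch of such a Condvar-based awaiter, assuming a one-shot completion flag: `Completion`, `notify`, and `wait` are hypothetical names, and the real awaiter may carry a result value as well.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// One-shot completion signal shared between a CUSE callback and a job.
#[derive(Clone, Default)]
pub struct Completion {
    inner: Arc<(Mutex<bool>, Condvar)>,
}

impl Completion {
    /// Called by the job once its work is finished.
    pub fn notify(&self) {
        let (done, cv) = &*self.inner;
        *done.lock().unwrap() = true;
        cv.notify_all();
    }

    /// Called by the callback. Returns false on timeout so the caller can
    /// log it. Only this private mutex is held while waiting, never a
    /// dispatcher or global state lock.
    pub fn wait(&self, timeout: Duration) -> bool {
        let (done, cv) = &*self.inner;
        let guard = done.lock().unwrap();
        let (guard, _timed_out) = cv
            .wait_timeout_while(guard, timeout, |finished| !*finished)
            .unwrap();
        *guard
    }
}
```

The callback would clone the `Completion`, move one copy into the dispatched job, release all other locks, and then call `wait` with a bounded timeout.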

Compatibility & architecture notes

When mapping 32-bit compat input_event formats into 64-bit representation, copy data into properly aligned locals and then write; do not create slices pointing at temporaries. Provide clear tests for compat conversion for each architecture supported.

Why: correctness across bitness.
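
The conversion can be sketched without any pointer casts by decoding each field with `from_ne_bytes`. This assumes the 32-bit compat layout of `input_event` (two signed 32-bit time fields followed by type, code, and value, 16 bytes in total); `InputEvent64` and `from_compat` are names invented for the example.

```rust
/// Size of the 32-bit compat `input_event`:
/// i32 sec, i32 usec, u16 type, u16 code, i32 value.
pub const COMPAT_EVENT_SIZE: usize = 16;

/// Native 64-bit event as an owned, properly aligned value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InputEvent64 {
    pub tv_sec: i64,
    pub tv_usec: i64,
    pub type_: u16,
    pub code: u16,
    pub value: i32,
}

/// Widen a compat event. Every field is copied into an aligned local first,
/// so no slice or reference ever points at a temporary.
pub fn from_compat(raw: &[u8; COMPAT_EVENT_SIZE]) -> InputEvent64 {
    let sec = i32::from_ne_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let usec = i32::from_ne_bytes([raw[4], raw[5], raw[6], raw[7]]);
    let type_ = u16::from_ne_bytes([raw[8], raw[9]]);
    let code = u16::from_ne_bytes([raw[10], raw[11]]);
    let value = i32::from_ne_bytes([raw[12], raw[13], raw[14], raw[15]]);
    InputEvent64 {
        tv_sec: i64::from(sec), // sign-extending widening
        tv_usec: i64::from(usec),
        type_,
        code,
        value,
    }
}
```

A function of this shape is easy to cover with table-driven tests per architecture, since it takes and returns plain values.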

Single-threaded CUSE in foreground mode

The expected event volume is too low to benefit from multiple threads, so CUSE runs single-threaded in foreground mode. Much of the code is already prepared for multithreading, should real demand arise.

Poll / event readiness handling

For operations that wait on host device readiness (e.g., force feedback, rumble, vibration, or reading back event state), the CUSE callback must never block.
A poll/wakeup watcher in a background thread monitors underlying /dev/uinput FDs and updates per-handle readiness (PollState) in VuInputState (see section 3.13).
FUSE poll callbacks may save the provided poll handle and immediately return; the background watcher later invokes fuse_notify_poll() to wake the kernel when data arrives.

Why: this separates the fast data-plane (CUSE callbacks) from the asynchronous event-plane (poll watcher) and prevents hanging the filesystem.

3.9 Overriding the type, vendor id, and product id

During device creation, the bus type, vendor id, and product id are overridden with:

type: BUS_USB 0x3
vendor id: 0x1209,
product id: 0x5020.

BUS_VIRTUAL 0x6 is not used, because I couldn't find a place where I could register a vendor and product id for it. The combination now in use is unique, as the product id is registered under pid.codes, so it can safely be used in a system-wide hwdb file for udev.
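
The override can be sketched as a pure function over a mirror of the kernel's `struct input_id`. The three constants come from the values above; `InputId`, `override_identity`, and `modalias_prefix` are hypothetical names, and keeping the client's `version` field is an assumption of this sketch.

```rust
pub const BUS_USB: u16 = 0x03;
pub const VUINPUT_VENDOR: u16 = 0x1209;
pub const VUINPUT_PRODUCT: u16 = 0x5020;

/// Mirror of the kernel's `struct input_id`.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InputId {
    pub bustype: u16,
    pub vendor: u16,
    pub product: u16,
    pub version: u16,
}

/// Replace whatever identity the client asked for with the fixed one.
/// Assumption: the client's `version` is passed through unchanged.
pub fn override_identity(requested: InputId) -> InputId {
    InputId {
        bustype: BUS_USB,
        vendor: VUINPUT_VENDOR,
        product: VUINPUT_PRODUCT,
        version: requested.version,
    }
}

/// Leading part of the kernel MODALIAS string for this identity,
/// which is what a hwdb or udev rule would typically match on.
pub fn modalias_prefix(id: &InputId) -> String {
    format!("input:b{:04X}v{:04X}p{:04X}", id.bustype, id.vendor, id.product)
}
```

With the fixed identity, the prefix is `input:b0003v1209p5020`, regardless of what the containerized application requested.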

3.10 Namespace Switching After Exec

Decision

vuinputd and its helper actions are executed in the host mount namespace, and only after process startup do they switch into the target container’s namespaces using setns() (e.g. CLONE_NEWNS, CLONE_NEWNET).

Dynamic libraries (e.g. libc, libfuse3, libudev) are therefore resolved and mapped before entering the container’s mount namespace.
After the namespace switch, the process guarantees that it only performs filesystem operations intended for the container environment.

Rationale

This design intentionally separates code loading from runtime filesystem semantics:

ELF loading and dynamic linking are one-time operations performed at execve()
already-mapped libraries are unaffected by later namespace changes
mount namespaces only affect future path resolution, not existing mappings

By switching namespaces after startup, vuinputd avoids assumptions about:

the presence of shared libraries inside the container
libc / dynamic loader compatibility across distributions
static linking availability for complex dependencies like libfuse

At the same time, runtime behavior (device access, /dev, /sys, /proc) correctly reflects the container’s view once setns() has completed.

Why post-exec `setns()` is correct

Widely used by container runtimes and helpers (e.g. runc, crun, systemd-nspawn, nsenter)
Ensures maximum compatibility with heterogeneous container filesystems
Avoids brittle static builds and duplicated dependency trees
Preserves security boundaries: namespace changes are explicit and minimal

Constraints and Guarantees

To keep this model correct, vuinputd enforces:

namespace switching occurs before spawning threads
no unintended filesystem access occurs before setns()
all container-visible paths are accessed only after entering the target namespace
required kernel interfaces (/dev/fuse, /sys, /proc) are provided by the container

Under these constraints, post-exec namespace switching provides a robust and predictable execution model.
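
For illustration, a minimal sketch of entering one namespace of a target process through the `setns()` call of the C library. It assumes Linux and a toolchain that accepts `unsafe extern` blocks; `ns_path` and `enter_namespace` are hypothetical helper names, and the constants are the values from `<sched.h>`.

```rust
use std::fs::File;
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;

/// Namespace type flags as defined in <sched.h>.
pub const CLONE_NEWNS: c_int = 0x0002_0000;
pub const CLONE_NEWNET: c_int = 0x4000_0000;

unsafe extern "C" {
    // Provided by the C library on Linux.
    fn setns(fd: c_int, nstype: c_int) -> c_int;
}

/// Path of a namespace handle, e.g. ns_path(1234, "mnt").
pub fn ns_path(pid: u32, ns: &str) -> String {
    format!("/proc/{pid}/ns/{ns}")
}

/// Join the given namespace of `pid`. Shared libraries are already mapped
/// at this point, so only later path lookups see the container's view.
pub fn enter_namespace(pid: u32, ns: &str, nstype: c_int) -> io::Result<()> {
    let file = File::open(ns_path(pid, ns))?;
    // SAFETY: the descriptor is valid for the duration of the call and
    // setns() does not retain it.
    if unsafe { setns(file.as_raw_fd(), nstype) } == -1 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```

A helper process would call `enter_namespace(pid, "net", CLONE_NEWNET)` and `enter_namespace(pid, "mnt", CLONE_NEWNS)` early in `main`, before any thread is spawned, in line with the constraints listed above.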

Alternatives Considered

Fully static binaries
Rejected due to complexity, limited library support, and reduced portability.
Executing entirely inside the container filesystem
Rejected due to dependency availability, loader ABI mismatch, and tighter coupling between host and container environments.
Executing the logic directly without an exec

This was the approach used in vuinputd releases 0.1 and 0.2:
the daemon would fork() and immediately execute the action logic in the child without performing an execve().

While this avoids process re-initialization overhead, it is fundamentally unsafe in a multi-threaded program.

In particular:
- fork() only duplicates the calling thread
- other threads may hold internal libc locks at the time of the fork
- common subsystems (notably malloc) are not async-signal-safe after fork()
- any allocation or lock acquisition in the child can deadlock permanently
This is not a theoretical concern: if another thread holds the malloc arena lock at the time of fork(), the child process may block forever on its first allocation, including implicit allocations inside libc or Rust runtime code. See also https://github.com/rust-lang/rust/blob/c1e865c/src/libstd/sys/unix/process.rs#L202 and https://systemd.io/ARCHITECTURE/.

The chosen approach offers the best balance between correctness, portability, and operational simplicity.

3.11 Fallback Graphical Session (`fallbackdm`)

Problem Statement

On systems without an active graphical session (X11 or Wayland), the kernel VT subsystem remains in text mode (KD_TEXT), and the VT keyboard handler is active. As a result:

getty receives keyboard input on the active VT
VT key handling (e.g. Ctrl+Alt+Fn) is enabled
input devices may interact with the VT layer in unintended ways

When a graphical session is active, these issues do not occur: the compositor, via systemd-logind, owns a VT, switches it to KD_GRAPHICS and K_OFF, and the VT keyboard handler is suppressed.

The missing piece is a well-defined fallback for the “no graphical session” case.

Effect of K_OFF in Linux VT subsystem

The ioctl KDSKBMODE on /dev/ttyX leads to a call of vt_do_kdskbmode
which sets kb->kbdmode = VC_OFF

This suppresses the following chains in keyboard.c. Let's take Console_2 via ALT+F2 as an example:

kbd_event (Entry Point)
kbd_keycode (Translation)
- Job: looks up the Keysym in the keymap based on the current modifier state (ALT+F2 is 0xf501 in defkeymap.c_shipped)
- type = KTYP(keysym) takes the upper 8 bits of the 16-bit keysym (which is 0xf5).
- type -= 0xf0. (which is 0x05). Note that the kernel uses the offset 0xf0 to differentiate between characters and special handlers in the keymap. When type -= 0xf0 is called, it "normalizes" the keysym into an index for the k_handler array.
- index = KVAL(keysym) takes the lower 8 bits (which is 0x01)
- return if ((raw_mode || kbd->kbdmode == VC_OFF) && type != KT_SPEC && type != KT_SHIFT), so suppression happens here.
- Note that K_HANDLERS[type] == K_HANDLERS[0x05] == k_cons.
- if no suppression: call (*k_handler[type])(vc, KVAL(keysym), !down), which is k_cons(vc,0x01)

Let's take Decr_Console via ALT+Left as a second example:

kbd_event (Entry Point)
kbd_keycode (Translation)
- Job: looks up the Keysym in the keymap based on the current modifier state (ALT+Left is 0xf210 in defkeymap.c_shipped)
- type = KTYP(keysym) takes the upper 8 bits of the 16-bit keysym (which is 0xf2).
- type -= 0xf0. (which is 0x02)
- index = KVAL(keysym) takes the lower 8 bits (which is 0x10)
- return if ((raw_mode || kbd->kbdmode == VC_OFF) && type != KT_SPEC && type != KT_SHIFT), so suppression does not happen here.
- Note that K_HANDLERS[type] == K_HANDLERS[0x02] == k_spec.
- call (*k_handler[type])(vc, KVAL(keysym), !down), which is k_spec(vc,0x10)
k_spec (Handling)
- Condition if ((... || kbd->kbdmode == VC_OFF) && value != KVAL(K_SAK)) evaluates to true, so k_spec returns early and suppression happens here
  - if no suppression: call fn_handler[value](vc), which is fn_dec_console(...)

In the KT_SHIFT case of "return if ((raw_mode || kbd->kbdmode == VC_OFF) && type != KT_SPEC && type != KT_SHIFT)", nothing interesting happens for our purposes: it might enable and disable caps lock. This means uinput can still toggle caps lock, which is a bit odd, but not a problem in our use cases.

SysRq has its own handler and is not affected by K_OFF. https://github.com/torvalds/linux/blob/master/drivers/tty/sysrq.c#L1048

Raw mode is not relevant.
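
The two walkthroughs can be condensed into a small model of the K_OFF decision. This is a sketch of the logic described above, not kernel code: it only covers keysyms whose type byte is at least 0xf0, the KT_SPEC and KT_SHIFT values and K_SAK = K(KT_SPEC, 15) are taken from the kernel headers, and the function and enum names are made up for the example.

```rust
pub const KT_SPEC: u8 = 2;
pub const KT_SHIFT: u8 = 7;
/// KVAL(K_SAK); K_SAK is defined as K(KT_SPEC, 15) in the kernel headers.
pub const KVAL_K_SAK: u8 = 15;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Dropped by the early return in kbd_keycode.
    SuppressedInKbdKeycode,
    /// Reached k_spec, which returned early.
    SuppressedInKSpec,
    /// The handler runs even though the keyboard mode is K_OFF.
    HandlerRuns,
}

/// KTYP()/KVAL() plus the `type -= 0xf0` normalization.
/// Returns None for keysyms with a type byte below 0xf0 (not modeled here).
pub fn decode(keysym: u16) -> Option<(u8, u8)> {
    let ktyp = (keysym >> 8) as u8; // upper 8 bits
    let kval = (keysym & 0xff) as u8; // lower 8 bits
    ktyp.checked_sub(0xf0).map(|ty| (ty, kval))
}

/// What happens to a decoded keysym while kbdmode == VC_OFF.
pub fn outcome_with_k_off(ty: u8, value: u8) -> Outcome {
    if ty != KT_SPEC && ty != KT_SHIFT {
        return Outcome::SuppressedInKbdKeycode;
    }
    if ty == KT_SPEC && value != KVAL_K_SAK {
        return Outcome::SuppressedInKSpec;
    }
    Outcome::HandlerRuns
}
```

Decoding 0xf501 gives type 0x05 and value 0x01, which is dropped in kbd_keycode; 0xf210 gives type 0x02 and value 0x10, which is dropped in k_spec. Only KT_SHIFT keysyms and the SAK keysym reach their handler.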

The Risk of Non-Standard or User-Loaded Keymaps

While vuinputd relies on the default kernel keymap logic for its internal filtering, it is important to note that the host's active keymap can be modified at runtime (e.g., via loadkeys or systemd-vconsole-setup). Because the kernel's K_OFF logic (triggered by KDSKBMODE) explicitly whitelists the K_SAK (Secure Attention Key) keysym, any user-defined key combination mapped to SAK will bypass the kernel's own input suppression.

To further mitigate these risks, vuinputd could be extended to parse the host's active keymap during startup, allowing the sanitizer to dynamically identify and filter any physical keycode mapped to a sensitive keysym like K_SAK. This is currently not planned. Alternatively, on systems dedicated to containerization where full host TTY access is not required, administrators can load a "hardened" keymap stripped of all Console_N, Boot, and SAK assignments. By combining a minimized host keymap with vuinputd's CUSE-level filtering, the system achieves a robust "Defense in Depth" that protects against both accidental triggers and intentional container escapes.

Decision

fallbackdm is implemented as a logind-managed fallback graphical session. It is available at https://github.com/joleuger/fallbackdm.

It runs only when no other graphical session is active on the seat and exists solely to:

open a regular logind session **of class greeter**
occupy the assigned VT
let logind switch the VT to KD_GRAPHICS and K_OFF and mute the keyboard handler

fallbackdm itself does not manipulate VTs, perform KDSETMODE ioctls, or access /dev/tty directly. All VT and seat handling is delegated to systemd-logind via a standard PAM session.

Behavior and Lifecycle

fallbackdm starts as a normal session on the seat
while active:
- the VT is in KD_GRAPHICS
- the kernel VT keyboard handler is muted (equivalent to K_OFF or KDSKBMUTE)
- getty input is suppressed
when a real graphical session (greeter or compositor) starts:
- logind deactivates fallbackdm
- VT ownership is transferred automatically
when the graphical session ends:
- fallbackdm may be restarted to reclaim the fallback role

fallbackdm is non-interactive by design but may display minimal status information in the future. It could also be extended to listen for console switches from the evdev devices that are actually connected to the seat and to signal vuinputd to mute virtual devices accordingly.

Rationale

This design:

reuses existing, well-tested logind behavior
avoids duplicating VT and seat logic
guarantees compatibility with:
- Wayland compositors
- graphical login managers
- multi-seat setups
keeps the implementation minimal and robust

Conceptually, fallbackdm acts as a headless placeholder graphical session that ensures consistent system behavior even when no real graphical environment is running.

Alternatives Considered

Direct VT management (KDSETMODE, VT ioctls) Rejected due to complexity, fragility, and duplication of logind functionality.
Filtering input at the evdev / uinput layer Rejected as insufficient: it does not address VT keyboard handling or getty behavior.
Disabling gettys or VT switching globally Rejected to preserve emergency local access and standard Linux behavior. The logind-managed approach allows physical VT switching to remain functional for debugging.
Depending on a full display manager Rejected, as this may require dummy display devices in headless configurations and adds unnecessary complexity.

3.12 Device Policies & Input Sanitization

Exposing raw access to /dev/uinput inside a container introduces significant security risks. A malicious process could theoretically emulate a keyboard to execute "BadUSB"-style attacks, trigger kernel-level commands (Magic SysRq), or switch Virtual Terminals (VT) to escape the graphical session.

To mitigate this, vuinputd implements an Active Filtering Layer (CUSE middleware) that enforces strict device policies before requests reach the host kernel. This is controlled via the --device-policy flag.

The filtering operates on two levels (Defense in Depth):

Capability Filtering (ioctl): During device creation, vuinputd inspects UI_SET_KEYBIT, UI_SET_RELBIT, etc. If a container requests capabilities forbidden by the active policy (e.g., a gamepad trying to claim it has a SysRq key), the request is silently ignored or rejected. The resulting device on the host simply lacks those hardware capabilities. This has not been implemented yet.
Event Filtering (write): At runtime, vuinputd inspects the stream of input events. It maintains internal state (tracking modifiers like Alt or Ctrl) to detect and drop dangerous sequences (e.g., Alt + F1-F12 for VT switching) that the capability filter alone cannot block.

Supported Policies:

strict-gamepad (Whitelist): Designed for console-like isolation. It strictly permits only Gamepad/Joystick events (EV_KEY buttons, EV_ABS axes). It proactively blocks EV_REL (mouse movement) and ABS_MT (multitouch), effectively "neutering" complex controllers (like DualSense or Wiimotes) so they cannot be used to hijack the host mouse cursor.
sanitized (Blacklist): Designed for desktop gaming. It allows standard Keyboard and Mouse input but strictly filters dangerous keys (KEY_SYSRQ, KEY_POWER) and host-management shortcuts (VT switching, CAD), providing a safe "sandboxed keyboard."
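
To make the stateful event filtering concrete, here is a minimal sketch in the spirit of the sanitized policy. The key codes are the standard values from the kernel's input-event-codes.h; `SanitizedFilter` and `allow` are hypothetical names, and the sketch covers only Alt + function keys, KEY_SYSRQ, and KEY_POWER (Ctrl+Alt+Del handling is left out).

```rust
pub const EV_KEY: u16 = 0x01;
pub const KEY_LEFTALT: u16 = 56;
pub const KEY_RIGHTALT: u16 = 100;
pub const KEY_SYSRQ: u16 = 99;
pub const KEY_POWER: u16 = 116;

fn is_function_key(code: u16) -> bool {
    // KEY_F1..=KEY_F10 are 59..=68; KEY_F11 and KEY_F12 are 87 and 88.
    (59..=68).contains(&code) || code == 87 || code == 88
}

/// Tracks modifier state across events of one device.
#[derive(Debug, Default)]
pub struct SanitizedFilter {
    left_alt: bool,
    right_alt: bool,
}

impl SanitizedFilter {
    /// Returns true if the event may be forwarded to the host uinput device.
    pub fn allow(&mut self, type_: u16, code: u16, value: i32) -> bool {
        if type_ != EV_KEY {
            return true;
        }
        let pressed = value != 0; // 1 = press, 2 = autorepeat, 0 = release
        match code {
            KEY_LEFTALT => self.left_alt = pressed,
            KEY_RIGHTALT => self.right_alt = pressed,
            _ => {}
        }
        if code == KEY_SYSRQ || code == KEY_POWER {
            return false; // never forwarded, regardless of modifiers
        }
        // Drop presses and autorepeats of function keys while Alt is held.
        !(pressed && (self.left_alt || self.right_alt) && is_function_key(code))
    }
}
```

Autorepeat events are treated like presses on purpose, so holding a function key and pressing Alt afterwards does not slip through.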

3.13 Polling & Readiness Watcher

Purpose

Provide non-blocking detection of device readiness on /dev/vuinput for operations like force feedback / rumble / vibration.
Ensure CUSE callbacks (poll, read) never block and the FUSE filesystem remains responsive.

Core design

Each VuInputState includes a PollState struct:

#[derive(Debug, Default)]
pub struct PollState {
    /// A FUSE poll request is currently waiting to be woken.
    pub waiting: bool,

    /// Sticky readiness latch: true once evdev became readable, false after read/drain.
    pub readable: bool,
}

A single background thread watches all active uinput device file descriptors using epoll (or poll).
When data arrives:
- The watcher locks the corresponding VuInputState via VUINPUT_STATE.
- Marks poll.readable = true.
- If poll.waiting = true, the watcher clears the flag and optionally calls fuse_notify_poll() to wake any waiting FUSE poll requests.

Adding / removing devices

When creating a new uinput device, the thread performs epoll_ctl(ADD) for the file descriptor.
On device close, it performs epoll_ctl(DEL) to remove the descriptor from monitoring.
No separate registry is maintained; the global VUINPUT_STATE HashMap is the single source of truth.

Poll callback behavior

FUSE poll callbacks do not block: they may store the poll handle and immediately return.
The background watcher ensures that any pending poll handles are notified asynchronously when data is ready.

Read handling

Reads from /dev/vuinput are non-blocking:
- poll() detects readiness.
- read() uses O_NONBLOCK and drains all available events.
- EAGAIN indicates the buffer is empty, at which point poll.readable is reset to false.

Threading / shutdown

The background watcher uses a 500ms epoll_wait timeout to allow clean shutdown.
It is safe to have a single watcher thread for all devices; epoll scales efficiently with multiple descriptors.

Benefits

Fully non-blocking CUSE front-end.
Lightweight: epoll manages file descriptors; no extra registry is needed.
Correctly wakes FUSE poll requests for force feedback / rumble operations.
Consistent with existing dispatcher rules: data-plane operations remain fast; control-plane updates scheduled as jobs if needed.

Poll/read race rule: PollState is the single source of truth for readiness and pending poll waiters. Both poll() and read() must update it only while holding the per-handle VuInputState mutex.

poll() must first check readable; only if false may it enqueue a poll handle.
the watcher sets readable = true and atomically drains pending waiters for notification.
read() may clear readable only after draining the proxied evdev fd until EAGAIN (or equivalent proof of emptiness).

This prevents lost wakeups and stale readiness in the proxy.
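
The three rules can be sketched as three small functions over a mutex-protected `PollState`. The struct repeats the definition from above; `on_poll`, `on_host_readable`, `on_read_drained`, and `PollReply` are hypothetical names, and the actual FUSE replies and `fuse_notify_poll()` call are left to the caller.

```rust
use std::sync::Mutex;

#[derive(Debug, Default)]
pub struct PollState {
    /// A FUSE poll request is currently waiting to be woken.
    pub waiting: bool,
    /// Sticky readiness latch.
    pub readable: bool,
}

/// What the poll callback should report back.
#[derive(Debug, PartialEq, Eq)]
pub enum PollReply {
    Readable,
    NotReady,
}

/// poll(): check the latch first; register a waiter only if nothing is pending.
pub fn on_poll(state: &Mutex<PollState>) -> PollReply {
    let mut s = state.lock().unwrap();
    if s.readable {
        PollReply::Readable
    } else {
        s.waiting = true;
        PollReply::NotReady
    }
}

/// Watcher: latch readiness and take the waiter in the same critical section.
/// Returns true if the caller must now issue the wakeup.
pub fn on_host_readable(state: &Mutex<PollState>) -> bool {
    let mut s = state.lock().unwrap();
    s.readable = true;
    std::mem::take(&mut s.waiting)
}

/// read(): clear the latch only after the proxied fd reported EAGAIN.
pub fn on_read_drained(state: &Mutex<PollState>) {
    state.lock().unwrap().readable = false;
}
```

Because the watcher sets `readable` and clears `waiting` under one lock, a poll that arrives just before or just after the data either registers in time to be woken or sees the latch directly.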

4. Security Considerations

vuinputd must currently run with root privileges to:

Access /dev/uinput and create CUSE devices.
Send and receive udev/netlink messages.
Manage per-container device nodes under /dev/input.

While this design is necessary for mediation, it introduces potential attack surfaces:

⚠️ Risks

Privilege escalation: a compromised container could exploit bugs in the proxy.
Input injection: if isolation fails, input devices may leak between containers.
Unsafe FUSE/unsafe code: any memory or pointer error could lead to denial-of-service or privilege abuse.

🛡️ Mitigations (planned / recommended)

Drop capabilities after startup (e.g. keep only CAP_SYS_ADMIN where needed).
Run under a dedicated system user (vuinputd) with limited filesystem access.
Enforce container identity using cgroup, namespace, or pidfd checks.
Use seccomp or systemd sandboxing (ProtectSystem, ProtectKernelTunables, RestrictNamespaces, etc.).
Eventually migrate to Rust-native FUSE/Netlink bindings to remove unsafe dependencies.

5. Background: How are input devices created by the kernel using uinput

To replicate the behavior correctly, we need to know the order in which the Linux infrastructure creates device nodes and sends netlink messages when uinput is used directly. This is what this section is about. The externally visible state is determined by actions originating from the following locations:

userspace:
- application uses ioctl UI_DEV_CREATE on an open file handle of /dev/uinput

the Linux kernel (device creation):

PART 1) Registering the uinput device. Entry point is uinput_ioctl_handler() in uinput.c
- 2.1) uinput_create_device() in uinput.c
- 2.1.1) input_register_device() in input.c
- 2.1.1.1) device_add() in core.c
- 2.1.1.1.1) device_create_file() in core.c: creates the sysfs attribute file for the device.
- 2.1.1.1.2) create various symlinks and data for the sysfs entry
- 2.1.1.1.3) create node via devtmpfs is skipped, because there is no node for a uinput device
- 2.1.1.1.4) kobject_uevent() in kobject_uevent.c notifies user space via netlink with DEVPATH set to the sysfs entry under /sys, e.g., /devices/virtual/input/input155
- 2.1.1.2) input_attach_handler() in input.c

PART 2) Registering the evdev device. Note that evdev_handler is registered as an input_handler during the initialization of evdev in evdev.c
- 2.1.1.2.1) evdev_connect() in evdev.c
- 2.1.1.2.1.1) cdev_device_add() in char_dev.c
- 2.1.1.2.1.1.1) device_add() in core.c
- 2.1.1.2.1.1.1.1) device_create_file() in core.c: creates the sysfs attribute file for the device.
- 2.1.1.2.1.1.1.2) create various symlinks and data for the sysfs entry
- 2.1.1.2.1.1.1.3) create node via devtmpfs
- 2.1.1.2.1.1.1.3.1) devtmpfs_create_node() in devtmpfs.c creates /dev/input/eventY. The name is determined by input_devnode() in input.c, which was set via dev_set_name in evdev.c
- 2.1.1.2.1.1.1.4) kobject_uevent() in kobject_uevent.c notifies user space via netlink with DEVPATH set to the sysfs entry under /sys, e.g., /devices/virtual/input/input155/event12

udev in userspace (TODO):
- udev_event_execute_rules() in udev-event.c

The uinput device created by PART 1 is exposed only through sysfs: it is represented by a struct device but has no corresponding char device node. The evdev device created by PART 2 is exposed through sysfs and also has a major and a minor number, and therefore a device node. Thus, the order is:

open uinput
create /sys/devices/virtual/input/inputX
netlink KERNEL DEVPATH=/devices/virtual/input/inputX
create /sys/devices/virtual/input/inputX/eventY
create /dev/input/eventY
netlink KERNEL DEVPATH=/devices/virtual/input/inputX/eventY

Example of kernel messages that were sent via netlink

KERNEL[1674737.476205] add      /devices/virtual/input/input155 (input)
ACTION=add
DEVPATH=/devices/virtual/input/input155
SUBSYSTEM=input
PRODUCT=3/beef/dead/0
NAME="Example device"
PROP=0
EV=7
KEY=10000 0 0 0 0
REL=3
MODALIAS=input:b0003vBEEFpDEADe0000-e0,1,2,k110,r0,1,amlsfw
SEQNUM=18608

KERNEL[1674737.476328] add      /devices/virtual/input/input155/event12 (input)
ACTION=add
DEVPATH=/devices/virtual/input/input155/event12
SUBSYSTEM=input
DEVNAME=/dev/input/event12
SEQNUM=18610
MAJOR=13
MINOR=76
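The listing above is the pretty-printed form of the event. On the wire, a kernel uevent is a single datagram of NUL-separated strings: a header of the form `action@devpath` followed by `KEY=VALUE` pairs. The following is a minimal sketch of a parser for that payload. The function name and the sample bytes are illustrative, and the sample assumes that the raw message carries `DEVNAME` relative to `/dev` (the `/dev/` prefix in the listing being added by the tool that printed it).

```python
def parse_kernel_uevent(payload: bytes):
    """Split a raw kernel uevent payload into (action, devpath, properties)."""
    # Fields are separated by NUL bytes; drop the empty trailing field.
    fields = [f.decode("utf-8", "replace") for f in payload.split(b"\0") if f]
    header, props = fields[0], fields[1:]
    # The first field has the form "<action>@<devpath>".
    action, _, devpath = header.partition("@")
    env = dict(p.split("=", 1) for p in props if "=" in p)
    return action, devpath, env


# Hand-written sample modelled on the event12 message shown above.
sample = (
    b"add@/devices/virtual/input/input155/event12\0"
    b"ACTION=add\0"
    b"DEVPATH=/devices/virtual/input/input155/event12\0"
    b"SUBSYSTEM=input\0"
    b"DEVNAME=input/event12\0"  # assumption: relative to /dev in the raw message
    b"SEQNUM=18610\0"
    b"MAJOR=13\0"
    b"MINOR=76\0"
)

action, devpath, env = parse_kernel_uevent(sample)
print(action, devpath, env["MAJOR"], env["MINOR"])
```

A consumer that wants to replicate the kernel ordering can compare `SEQNUM` values: the `inputX` event always carries a lower sequence number than the `eventY` event of the same device.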

Example of the corresponding udev messages (after adding the hwdb and rules entries)

UDEV  [1674737.478882] add      /devices/virtual/input/input155 (input)
ACTION=add
DEVPATH=/devices/virtual/input/input155
SUBSYSTEM=input
PRODUCT=3/beef/dead/0
NAME="Example device"
PROP=0
EV=7
KEY=10000 0 0 0 0
REL=3
MODALIAS=input:b0003vBEEFpDEADe0000-e0,1,2,k110,r0,1,amlsfw
SEQNUM=18608
USEC_INITIALIZED=1674737476194
ID_VUINPUT=1
ID_INPUT=1
.INPUT_CLASS=mouse
ID_SERIAL=noserial
ID_VUINPUT_MOUSE=1
ID_SEAT=seat_vuinput
TAGS=:seat:
CURRENT_TAGS=:seat:

UDEV  [1674737.498627] add      /devices/virtual/input/input155/event12 (input)
ACTION=add
DEVPATH=/devices/virtual/input/input155/event12
SUBSYSTEM=input
DEVNAME=/dev/input/event12
SEQNUM=18610
USEC_INITIALIZED=1674737477373
ID_VUINPUT=1
.HAVE_HWDB_PROPERTIES=1
ID_INPUT=1
.INPUT_CLASS=mouse
ID_SERIAL=noserial
ID_SEAT=seat_vuinput
ID_VUINPUT_MOUSE=1
MAJOR=13
MINOR=76
TAGS=:seat_vuinput:
CURRENT_TAGS=:seat_vuinput:
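Only the `eventY` message carries `DEVNAME`, `MAJOR`, and `MINOR`; the `inputX` message describes a sysfs-only device. As a rough sketch of how a consumer could tell the two apart and derive the device number for the node, the helper below works on an already parsed property dictionary. The helper name is hypothetical, and what is done with the result (for example recreating the node somewhere else) is an assumption about usage, not something this section specifies.

```python
import os


def device_node_from_uevent(env):
    """Return (path, dev_t) if the event announces a device node, else None."""
    if not all(key in env for key in ("DEVNAME", "MAJOR", "MINOR")):
        return None  # sysfs-only device such as .../inputX
    devname = env["DEVNAME"]
    # Accept both "/dev/input/event12" and the relative "input/event12".
    path = devname if devname.startswith("/") else "/dev/" + devname
    return path, os.makedev(int(env["MAJOR"]), int(env["MINOR"]))


input155 = {"DEVPATH": "/devices/virtual/input/input155", "SUBSYSTEM": "input"}
event12 = {
    "DEVPATH": "/devices/virtual/input/input155/event12",
    "SUBSYSTEM": "input",
    "DEVNAME": "/dev/input/event12",
    "MAJOR": "13",
    "MINOR": "76",
}

print(device_node_from_uevent(input155))
path, dev = device_node_from_uevent(event12)
print(path, os.major(dev), os.minor(dev))
```

In the example, minor 76 matches the legacy evdev numbering, where minors start at 64 (64 + 12 = 76). Relying on the `MAJOR` and `MINOR` values from the event is still the safer choice, because the kernel allocates minors differently once that legacy range is exhausted.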

From my investigation, the ioctl thread immediately sees /dev/input/eventX, because:

UI_DEV_CREATE runs device_add()
device_add() calls devtmpfs_create_node()
devtmpfs_create_node() hands the request to the kdevtmpfs thread and blocks in wait_for_completion() until the node has been created in the VFS.
The kernel returns from those calls only after the node is fully present in the dcache and directory.

5.2 CUSE

5.3 uinput users

5.3.1 inputtino

https://github.com/games-on-whales/inputtino

Library used by Wolf and Sunshine.

5.3.2 Steam

https://gitlab.steamos.cloud/steamrt/steam-runtime-tools/-/blob/main/steam-runtime-tools/input-device.c

https://gitlab.steamos.cloud/steamrt/steam-runtime-tools/-/blob/main/docs/container-runtime.md

https://gitlab.steamos.cloud/steamrt/steam-runtime-tools/-/blob/main/docs/ld-library-path-runtime.md

https://github.com/ValveSoftware/steam-for-linux/issues/10175

https://github.com/ValveSoftware/steam-for-linux/issues/8042

5.3.3 Selkies Project

Absolutely untested. https://github.com/selkies-project/selkies/pull/173

5.4 Applications that use the created devices

wine control

https://github.com/flatpak/xdg-desktop-portal/issues/536

6. HIDAPI

https://github.com/libusb/hidapi

https://abeltra.me/blog/inputtino-uhid-1/

7. Alternative Approaches

7.1 trace accesses of /dev/uinput with eBPF

Idea (short): attach an eBPF program to the syscall tracepoint for ioctl (tracepoint/syscalls/sys_enter_ioctl), filter by container cgroup, and send small events (pid, tgid, fd, cmd, timestamp, short payload sample) to userspace using the BPF ring buffer. A privileged host agent consumes the ringbuf events, duplicates the target FD via pidfd_getfd() and proceeds with UI_GET_SYSNAME / sysfs resolution to retrieve the sys-path and the dev-path. Having the dev-path and the pid of the container, the solution could proceed as in the current solution.

1) Trace hook: `tracepoint/syscalls/sys_enter_ioctl`

Use the syscall tracepoint syscalls:sys_enter_ioctl. Tracepoints are stable, exported kernel probe points and the syscall tracepoint provides the syscall arguments (fd, cmd, arg) in a stable layout. This avoids fragile kprobe offsets on architecture-specific syscall wrappers. See the kernel tracepoint docs.

2) BPF map: ring buffer (kernel → userspace)

Use the BPF ring buffer (BPF_MAP_TYPE_RINGBUF) to cheaply publish fixed-size events to userspace. The ring buffer provides bpf_ringbuf_reserve() / bpf_ringbuf_submit() semantics from the kernel side and is the recommended modern replacement for the perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) for high-rate kernel→user events. See the kernel documentation for the ring buffer API.

3) Useful eBPF helpers

Inside the trace program you will typically use:

bpf_get_current_pid_tgid() to record tgid/pid,
bpf_get_current_cgroup_id() to filter to the container cgroup you care about,
bpf_probe_read_user() to safely copy up to N bytes from the user pointer (arg) into the event buffer (bpf_copy_from_user() is only available to sleepable programs).
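On the userspace side, the agent has to decode what these helpers recorded. The sketch below shows the two decoding steps that are easy to get wrong: splitting the 64-bit value of `bpf_get_current_pid_tgid()` (tgid in the upper 32 bits, thread id in the lower 32 bits) and splitting an ioctl command number into direction, type, number, and size. The event layout (`EVENT_FORMAT`) is purely hypothetical, and the bit layout of the command assumes the generic ioctl encoding used on x86 and ARM; some architectures use different field widths.

```python
import struct

# Hypothetical ring buffer record: u64 timestamp, u64 pid_tgid, u64 cgroup_id,
# u32 fd, u32 cmd (little-endian, 32 bytes, no padding).
EVENT_FORMAT = "<QQQII"


def split_pid_tgid(value):
    """Return (tgid, pid) from the value of bpf_get_current_pid_tgid()."""
    return value >> 32, value & 0xFFFFFFFF


def decode_ioctl(cmd):
    """Return (direction, type_char, nr, size) using the generic encoding."""
    nr = cmd & 0xFF
    type_char = chr((cmd >> 8) & 0xFF)
    size = (cmd >> 16) & 0x3FFF
    direction = (cmd >> 30) & 0x3
    return direction, type_char, nr, size


def is_uinput_candidate(cmd):
    """True if the command uses the uinput ioctl base 'U'.

    Other drivers may use the same type letter, so the agent still has to
    verify that the fd really refers to /dev/uinput.
    """
    return decode_ioctl(cmd)[1] == "U"


# Simulated record for UI_DEV_CREATE (_IO('U', 1) == 0x5501) on fd 7.
raw = struct.pack(EVENT_FORMAT, 1_000_000, (1234 << 32) | 1240, 42, 7, 0x5501)
ts, pid_tgid, cgroup_id, fd, cmd = struct.unpack(EVENT_FORMAT, raw)
print(split_pid_tgid(pid_tgid), fd, decode_ioctl(cmd), is_uinput_candidate(cmd))
```

Filtering on the type byte inside the eBPF program itself keeps the event rate low; the same shifts and masks apply there.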

4) Use of `pidfd_getfd`

The pidfd_getfd() syscall (introduced in Linux 5.6, see pidfd_getfd(2)) allows one process to duplicate a file descriptor from another process into its own FD table. It takes a pidfd (obtained via pidfd_open() or from CLONE_PIDFD), the target FD number in the remote process, and a flags argument that must currently be 0. The resulting descriptor refers to the same open file description, sharing offset, status flags, and driver state, exactly as if the target process had called dup(). Permission checks apply: the caller must pass a ptrace access check (PTRACE_MODE_ATTACH_REALCREDS) against the target, which in practice means matching credentials or holding CAP_SYS_PTRACE. This makes pidfd_getfd() the canonical and race-free way to inspect or reuse another process's device handles (for example, to run UI_GET_SYSNAME on a client app's fd on /dev/uinput) without invasive ptrace tricks.

7.2 LD_PRELOAD

See src/fake-uinput/README.md on wolf

https://github.com/games-on-whales/wolf/issues/81

https://github.com/games-on-whales/wolf/pull/88

5b3282ceef (diff-2446d8f27f6ac4efff38510458548cea92179eddf38c187f5ad90d6bdd4b3d69)

7.3 Custom kernel module

https://github.com/dkms-project/dkms

https://github.com/torvalds/linux/blob/master/drivers/input/misc/uinput.c

https://lore.kernel.org/linux-bluetooth/20191201145357.ybq5gfty4ulnfasq@pali/t/#u
