Waypipe is a proxy for Wayland applications, which makes it possible to run an application on a different computer but interact with it locally, as if it were actually running on the local computer. (Wayland is the slowly-improving window system protocol for Linux and the successor to X11; most applications now support it. The protocol sends plain data over a Unix socket, along with file descriptors to share less serializable things like window surface image data.)
I wrote it during the summer of 2019, and implemented it in C because libwayland used C, because most libraries provide a C interface, because other programming languages often aren’t available or are hard to install as a user on old, shared systems, and because no complicated data structures or libraries were used for which C++ would be necessary. The core operations (basic protocol parsing and shared memory buffer replication) did not take long to implement, and were done in a week. Most of Waypipe’s code is spent making this practical: making the buffer replication for displayed windows run fast and only when necessary; handling other Wayland “protocols” (read: Wayland object types and associated methods); supporting replication of DMABUFs (GPU-side memory buffers used to transfer image data between applications, typically used by OpenGL and Vulkan in place of CPU-side shared memory file descriptors); and optionally video-encoding DMABUFs.
Making Waypipe reliable, secure, and efficient has been challenging. Waypipe receives and sends messages from Wayland applications and compositors, which it should not trust to use the various Wayland protocols properly. In addition to the (currently rather theoretical) risk of malicious applications, regular mistakes and complicated stacks of libraries can use the Wayland protocols in unexpected ways. There are several libraries implementing the base wire protocol, a number of compositors and toolkits that use it, libraries that extend or try to “share” a single Wayland connection with an existing program, and clients that people have written which directly use a wayland library instead of going through a toolkit, similarly to how many people directly used Xlib.
My approach was to try to write reliable code that handles all errors, in some form or another. (Ideally, by cleanly shutting down the connection and sending an error message to the application; this is what libwayland-server also does.) Of course, to make reliable code, I needed to test it. My main strategies were: trying many Wayland clients and subcomponents of Waypipe (worked, but tests take a while to write and still miss things), injecting errors (to check how broken memory allocation failure paths were), using AddressSanitizer and static analysis tools to detect issues, and fuzzing (to see what crashes when a fuzzer controls the Wayland message inputs and the internal protocol used to connect the local and remote Waypipe instances; like testing, this requires some framework code to let the fuzzer provide and manipulate file descriptors, which still doesn’t cover all cases).
Altogether, these testing approaches appear to have worked, but they
require a measure of active maintenance over time as the code is
updated. New Wayland protocols and protocol revisions continue to be
made, and Waypipe has needed and will often need to adapt to them; the
wl_drm protocol once used to share DMABUFs has now been entirely
replaced by zwp_linux_dmabuf_v1, and new protocols
for explicit synchronization, presentation timing, screen capturing, and
color management are now done or being designed. There have also been
new feature requests and ideas for performance improvements.
Implementing all of these required or will require new code, which is
not as well tested as the older code and would require a lot of work to
bring to the same standard.
Rewriting Waypipe in Rust was expected to have multiple benefits.
First, to reduce the cost of making changes and adding new features
at the same level of security; Rust provides a framework with
which to encapsulate memory-unsafe code, and a safe and comprehensive
standard library, which together should significantly reduce the number
of places where memory-unsafe bugs could appear in Waypipe. Second, I
wanted to change Waypipe’s DMABUF handling backend library from
libgbm to Vulkan to improve performance,
handle explicit synchronization, and more efficiently do RGB to YCbCr
conversion for the optional video encoding feature; in total I expected
that this would require changing or adding about half of Waypipe’s lines
of non-test code. Third: for me to better learn Rust; and fourth:
because I had been hearing about other C or C++ to Rust rewrite
projects, and was curious whether a rewrite would be worth it. The best
way to determine that was to try it.
The rewrite went roughly as expected.
Instead of doing an incremental port of Waypipe, converting its
various logical parts piece by piece, I redeveloped the Rust version in
parallel, roughly following the same development path as the original
Waypipe. (Except this time I knew the end goal.) That is, I started with
a simplified form of the command line interface, and then developed a
basic main proxy loop, Wayland protocol parsing logic, and shared memory
buffer replication. The initial step was easier because I already had
written a different (local) Wayland proxy program in Rust
(windowtolayer). Once that was ready, I iteratively added
back the various features of Waypipe, starting with damage tracking,
compression support, and multithreaded buffer diff calculation and
application; often testing the code by connecting it to the original
Waypipe implementation.
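To make "buffer diff calculation and application" concrete, here is a minimal sketch of the general idea. It is not Waypipe's actual algorithm: the function names and the fixed block size are hypothetical, and the real implementation also takes damage information into account. The sketch compares a mirror copy against the current buffer block by block, merges adjacent changed blocks, and then copies only the changed ranges.

```rust
/// Compare `old` and `new` in blocks of `block` bytes and return the
/// half-open byte ranges (start, end) that differ. Adjacent changed
/// blocks are merged into a single range.
fn changed_ranges(old: &[u8], new: &[u8], block: usize) -> Vec<(usize, usize)> {
    assert_eq!(old.len(), new.len());
    assert!(block > 0);
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    let mut start = 0;
    while start < new.len() {
        let end = (start + block).min(new.len());
        if old[start..end] != new[start..end] {
            match ranges.last_mut() {
                // Extend the previous range if it ends where this block begins.
                Some(last) if last.1 == start => last.1 = end,
                _ => ranges.push((start, end)),
            }
        }
        start = end;
    }
    ranges
}

/// Apply a diff: copy only the changed ranges from `new` into `mirror`.
fn apply_ranges(mirror: &mut [u8], new: &[u8], ranges: &[(usize, usize)]) {
    for &(start, end) in ranges {
        mirror[start..end].copy_from_slice(&new[start..end]);
    }
}
```

In a real proxy the ranges and their contents would be serialized and sent to the other side, which applies them to its own mirror; here both steps run in one process for brevity.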
Much of my time in the middle of the port was spent implementing
DMABUF support, this time using Vulkan instead of libgbm. I started with
a simple, single-threaded implementation and once that worked,
progressively introduced multi-threading, buffer update calculations,
zwp_linux_dmabuf_v1 protocol handling, and stride
adjustments to match the weird way the original C implementation
adjusted nominal buffer strides when using libgbm. To implement
Waypipe’s optional video encoding feature, I started with the possibly
tricky case of hardware video encoding and decoding. As Vulkan hardware
video extensions had been released in the last few years, I just used
ffmpeg’s encoder/decoder based on them, which was recently added but
worked with few issues. Software video encoding and decoding were easy
to add afterwards.
The second 90% of the work has been spent on all the miscellaneous
tasks: bringing the Rust rewrite up to feature parity with the original
version, getting it to integrate with Waypipe’s existing build system
(using meson), and resolving the issues found after I
deemed the Rust port good enough and brought it into the main git
repository.
The rewritten code is slightly larger: tokei reports that
the C implementation had about 12000 lines of code, without comments and
tests, and 19000 with comments and tests, while the Rust implementation
has about 16000 lines of code without comments and tests, and 23000 when
comments and tests are included. (I am ignoring about 5000 lines of
auto-generated Wayland protocol handling data and code which are tracked
in git for Rust, but auto-generated in C.) The largest
chunk of the difference comes from the DMABUF copying and video encoding
implementation using Vulkan and libavcodec, which together use about
4000 more lines than the C implementation (which had about 400 lines for
libgbm, and 1200 for libswscale, libavcodec, and vaapi interaction); most
of these lines would still have been needed, had the library change been
done in C.
Test code was generally more efficient to write for the Rust
implementation because higher-level constructs are available; for
example, making it possible to compare two vectors of bytes with ==,
or using closures to efficiently reuse the same generated parse_<msg>
and write_<msg> functions in the main protocol
replication test framework as were used in the main proxy logic. The C
protocol replication tests skipped many checks because they would be
awkward and repetitive to write or would need more code generation.
Note: These are benefits that would also have been available had I used
C++ or some other language for Waypipe instead; I also had the advantage
with the rewrite of focusing on “end-to-end” tests running Waypipe’s proxy
logic (as exposed through two Unix sockets) against various Wayland
protocol transcripts. I expect this approach will require less
maintenance with time than the more integrated tests used for the C
implementation.
Lifetimes and exclusive references were annoying to work with in
some early code, but Waypipe generally either does nothing complicated,
or has multiple independent references to objects and needs to use
Rc<_> or Arc<_>. They have prevented a few incorrect designs. One
suboptimal thing remains: the main proxy loop (loop_inner() in
src/mainloop.rs) uses nix::poll::poll(), which takes nix::poll::PollFd
values that contain references to OwnedFd objects. These are owned by
various structures for Wayland protocol file descriptor replication,
called ShadowFds for historical reasons. The ShadowFd objects are stored
under Rc<RefCell<_>>, and making their OwnedFds available to poll()
currently requires acquiring and storing a Ref<_> for each one in a
separate vector. It also requires extracting the return events from each
PollFd into a separate vector, because the PollFds need to be dropped
immediately to drop the Ref<_>s, so that the following code can access
the ShadowFds. There are ways to avoid these extra vectors, and building
them wasn’t expensive to begin with; but ultimately the problem is that
I’ve spent too much time thinking about how to refactor this thing.
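The shape of this borrow problem can be shown without nix at all. The following is only a sketch under stated assumptions: ShadowFd here is a hypothetical two-field struct, a plain i32 stands in for OwnedFd, and fake_poll() is an invented stand-in for nix::poll::poll() that reports even-numbered descriptors as ready. What it preserves is the structure described above: one vector of Ref<_> guards, one vector of borrowed descriptors, and a separate vector of results that outlives both.

```rust
use std::cell::{Ref, RefCell};
use std::rc::Rc;

// Hypothetical stand-in for Waypipe's ShadowFd; `fd` stands in for an OwnedFd.
struct ShadowFd {
    fd: i32,
    ready_count: u32,
}

// Invented stand-in for poll(): borrows the descriptors and reports which
// are "ready" (here, simply the even-numbered ones).
fn fake_poll(fds: &[&i32]) -> Vec<bool> {
    fds.iter().map(|fd| **fd % 2 == 0).collect()
}

fn poll_once(shadows: &[Rc<RefCell<ShadowFd>>]) {
    let ready: Vec<bool> = {
        // First extra vector: the Ref guards that keep each RefCell borrowed.
        let guards: Vec<Ref<'_, ShadowFd>> = shadows.iter().map(|s| s.borrow()).collect();
        // The borrowed descriptors, which point into the guards.
        let fds: Vec<&i32> = guards.iter().map(|g| &g.fd).collect();
        // Second extra vector: results copied out so they outlive the borrows.
        let result = fake_poll(&fds);
        result
        // `fds` and then `guards` are dropped here, releasing the RefCells.
    };
    for (shadow, is_ready) in shadows.iter().zip(ready) {
        if is_ready {
            // This would panic if any Ref guard from above were still alive.
            shadow.borrow_mut().ready_count += 1;
        }
    }
}
```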
When rewriting code, I sometimes noticed details that I’d missed
in the original; like incomplete ssh argument parsing, or a rare edge
case when clients construct a wp_presentation_feedback
object immediately after binding wp_presentation. I
probably would have missed these if I had translated the code instead of
writing it from scratch, and comparing it with the C implementation
later. Other minor improvements (like not precisely replicating DMABUF
modifiers) were discovered through the use of Vulkan instead of
libgbm.
Much of the work in this rewrite was rather tedious, with little fundamentally new code. (I have used Vulkan before.) The only interesting bugs I have had to track down so far were memory safety+threading issues in libraries I was using, and an unfortunate typo when almost-but-not-quite copy-pasting code. The technically interesting parts (making more efficient buffer difference calculations) have been postponed until after remaining regressions have been discovered, the next release is done, and I start changing Waypipe’s internal protocol.
Rust’s error and string types are much better than C’s; Result,
Option, and first-class tuples make detecting and unpacking errors
require much less work than in C; one no longer needs to check which
magic values identify failure, whether errno is set or where else error
messages are stored, and which arguments are returned by pointer in
what circumstances. As the
Wayland wire protocol is binary I did not need to use much C string
handling in the original implementation.
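As an illustration of this style, here is a small sketch of parsing a Wayland message header with Result and a tuple return. It is not taken from Waypipe: parse_header and ParseError are hypothetical names. It relies only on the wire format, in which a message starts with a 32-bit object ID followed by a 32-bit word holding the message length in bytes in its upper 16 bits and the opcode in its lower 16 bits, both in host byte order.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    /// Fewer bytes are available than the header or message needs.
    Truncated { needed: usize, available: usize },
    /// The declared length is smaller than the 8-byte header itself.
    BadLength(u16),
}

/// Parse the 8-byte header at the start of `data`.
/// Returns (object id, opcode, total message length in bytes).
fn parse_header(data: &[u8]) -> Result<(u32, u16, u16), ParseError> {
    let header = data.get(..8).ok_or(ParseError::Truncated {
        needed: 8,
        available: data.len(),
    })?;
    let object_id = u32::from_ne_bytes([header[0], header[1], header[2], header[3]]);
    let word = u32::from_ne_bytes([header[4], header[5], header[6], header[7]]);
    let length = (word >> 16) as u16;
    let opcode = (word & 0xffff) as u16;
    if length < 8 {
        return Err(ParseError::BadLength(length));
    }
    if data.len() < length as usize {
        return Err(ParseError::Truncated {
            needed: length as usize,
            available: data.len(),
        });
    }
    Ok((object_id, opcode, length))
}
```

The caller gets either all three values or a description of what went wrong; there is no sentinel return value to check and no out-parameter whose validity depends on the return code.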
Waypipe varies in how well checked its unsafe code is; I’ve tried
to document core operations on file descriptors and memory maps in
detail; on the other hand, most of the DMABUF and video code is unsafe
and FFI-heavy, and may leak memory when failures occur. (Fortunately,
most failures are fatal, so the leaks here are not critical.) I’ve been
using direct library bindings via bindgen or unsafe crates
like ash for external libraries because the current safe
bindings generally are missing required features, require statically
linking in libraries, or bring in too many other dependencies.
One of the original implementation’s design mistakes, perhaps,
was trying to cleanly handle memory allocation failures (and report an
error to the application) instead of just exiting when malloc() returns
NULL; this made the code more complicated and added many failure paths
that are hard to test. While it may be possible to write a Wayland
client that can make Waypipe’s calls to malloc() return NULL, normal
clients will not do this. Because Waypipe uses one process per Wayland
connection, it is safe for it to abort when malloc() fails. On the other
hand, Rust has good enough memory and error handling to reliably and
safely do a clean shutdown when malloc() fails, but standard library
changes to enable this are still unstable or in progress.
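One piece that is already stable is Vec::try_reserve_exact, which reports allocation failure as an error instead of aborting. The sketch below uses a hypothetical helper name, alloc_buffer, and covers only a single explicit allocation; most other standard library APIs that allocate still abort on failure, which is why this does not amount to the clean shutdown described above.

```rust
use std::collections::TryReserveError;

/// Allocate a zeroed buffer of `len` bytes, returning an error rather than
/// aborting the process if the allocation cannot be satisfied.
fn alloc_buffer(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf: Vec<u8> = Vec::new();
    buf.try_reserve_exact(len)?;
    // Capacity is now at least `len`, so this resize does not reallocate.
    buf.resize(len, 0);
    Ok(buf)
}
```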
Error handling in longer stretches of unsafe code (ensuring
everything is freed on failure) can be more awkward than C, because the
standard goto cleanup; trick is not available. Wrapping
things with a type that destroys them on Drop generally
works instead. (Properly unwinding on panic for FFI wrappers generally
is not needed, because C libraries generally do not panic and the FFI
wrappers are usually straightforward leaf functions.)
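A minimal sketch of the Drop-wrapper pattern follows. The names are hypothetical, and a shared counter of live resources stands in for real handles that a C library would create and destroy; the point is only that an early return releases whatever has been acquired so far, which is the job goto cleanup; does in C.

```rust
use std::cell::Cell;

/// Hypothetical guard for a resource that must be released exactly once.
/// The counter of live resources stands in for a real FFI create/destroy pair.
struct Guard<'a> {
    live: &'a Cell<u32>,
}

impl<'a> Guard<'a> {
    fn acquire(live: &'a Cell<u32>) -> Guard<'a> {
        live.set(live.get() + 1); // stands in for e.g. a C `create` call
        Guard { live }
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1); // stands in for the C `destroy` call
    }
}

/// Acquire two resources in sequence; on failure the first one is released
/// automatically. Returns the number of resources held at the end of setup.
fn two_step_setup(live: &Cell<u32>, fail_midway: bool) -> Result<u32, String> {
    let _first = Guard::acquire(live);
    if fail_midway {
        // `_first` is dropped here; no explicit cleanup label is needed.
        return Err("second step failed".to_string());
    }
    let _second = Guard::acquire(live);
    Ok(live.get())
}
```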
enums are useful to make the possible states of a
structure clear, but doing this often requires that I define more
structs for each possible state, and name them all. Picking good names
for them all is a major unsolved problem in theory, so in practice I
just pick bad names.
Waypipe’s build system is now somewhat of a mess: meson runs cargo
through an intermediate script to control the location of the output
executable, and I still haven’t fully connected meson’s various build
types to cargo’s. (I am continuing to use meson because it is used by
Waypipe’s original C implementation, which I’ve moved into a subfolder
of the repository, and because Waypipe has a man page that needs to be
installed to the right place.) This continues to evolve.
Rust uses much more build space (about 250MB) than C (14 MB) when building with debuginfo; this is mostly caused by a few big dependencies and 4MB compiled build scripts.
bindgen is nice to have, but translates C’s char into i8 or u8
depending on the platform, instead of translating it to
std::ffi::c_char. As a result, I used *const i8 a few times in my own
code, until discovering the build failures on platforms where
c_char = u8. After that, I switched to c_char and started checking the C
headers whenever I wanted to know whether a function’s argument actually
was char *, int8_t *, or uint8_t *.
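A small sketch of the portable spelling, using only the standard library (first_byte is a hypothetical helper, not something from Waypipe). Because the pointer type is written as *const c_char, the same source compiles whether the platform's char is signed or unsigned; writing *const i8 instead would fail to compile on platforms where c_char is u8.

```rust
use std::ffi::{c_char, CStr};

/// Return the first byte of a C string, or None if the string is empty.
fn first_byte(s: &CStr) -> Option<u8> {
    // `CStr::as_ptr` returns `*const c_char`, whatever c_char is on this platform.
    let ptr: *const c_char = s.as_ptr();
    // SAFETY: a CStr always points to at least its NUL terminator.
    let c: c_char = unsafe { *ptr };
    if c == 0 {
        None
    } else {
        Some(c as u8)
    }
}
```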
cargo test is OK, but could be better. There is no convenient way to
set per-test timeouts. (Some of Waypipe’s tests should never take more
than a millisecond of CPU time; others take a fraction of a second if
things go well.) Maybe I should switch to nextest, although I’d prefer
configuring test properties in the code instead of in a separate config
file. Even with nextest, though, there is the limitation that tests
appear to be pass/fail and do not have a way to communicate that they
are inconclusive. As Waypipe needs to maintain copies of DMABUFs, some
of Waypipe’s tests are performed for each render device available on
the system. When no render device is available, these tests would
ideally be considered SKIPPED. I have also observed the video encoding
implementation producing odd results (a constant color on a non-constant
image); ideally tests observing this could report UNCLEAR, as this is
not clearly Waypipe’s fault. I’m certainly not the first person to want
either of these behaviors.
Rust’s integer support is much better than C’s, where implicit
conversions are common and can hide mistakes, and where it is hard to
enable warnings for them precisely because the conversions are so
common. Rust also provides useful operations like ilog2, isqrt,
leading_zeros, next_power_of_two, and saturating_add, which to do well
in C require intrinsics, carefully written bit manipulation, or writing
a function yourself (which the compiler hopefully identifies and
replaces with the ideal implementation).
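A few of these operations in use, as a short sketch. The helper names bit_width and checked_len are hypothetical; the methods themselves are from the standard library. Narrowing conversions have to be spelled out, here with u16::try_from, so a value that does not fit is reported rather than silently truncated.

```rust
use std::convert::TryFrom;

/// Number of bits needed to represent `x` (0 for x == 0).
fn bit_width(x: u32) -> u32 {
    u32::BITS - x.leading_zeros()
}

/// Narrow a length to 16 bits, returning None if it does not fit.
fn checked_len(len: u32) -> Option<u16> {
    u16::try_from(len).ok()
}

/// Round a buffer size up to a power of two, then add a header size,
/// clamping at u32::MAX instead of wrapping around.
fn padded_size(size: u32, header: u32) -> u32 {
    size.next_power_of_two().saturating_add(header)
}
```

For example, 1000u32.ilog2() is 9 and bit_width(1000) is 10, while checked_len(70_000) is None because 70000 does not fit in a u16.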
Because it was easy to do with bindgen and libloading, the rewrite now
dynamically loads libavcodec and libavutil at runtime, when necessary.
This reduces the time to start the waypipe executable (as measured by
timing waypipe --help) from 45 to 5 milliseconds.
I did not use any async/await code
under the assumption that it would be too complicated and not worth the
benefit. Many of Waypipe’s off-main thread tasks are compute heavy, and
these tasks often wait for a specific region of a shared resource
(mirror of a buffer) to become available or for the GPU to finish an
operation.
There currently does not appear to be a stabilized and
universally efficient way for Waypipe to safely interact with shared
memory regions, other than by using architecture-specific assembly.
Under the C++-like memory model that Rust uses, arbitrary shared memory
found through mmap should be considered volatile, since any arbitrary
process or DMA device could
modify or react to the memory in “ways unknown to the implementation”.
However, Waypipe in particular can assume it is connected to a
well-behaving Wayland process, and that there are no side effects to
memory access and that ordering of its writes does not matter, as long
as they all happen before the application reads the contents of
Waypipe’s next sendmsg(). Similarly, Waypipe only needs to
see memory writes that happen before its last recvmsg()
returns, and only needs to be “safe” when reading from the shared memory
region: the compiler should never assume that two repeated or
overlapping read operations will return the same result.
Using &[u8] would not provide this guarantee, so Waypipe currently
treats memory buffers shared with other processes as essentially
&[AtomicU8], using Relaxed memory access ordering. This is probably fine
in practice on current architectures, as the relaxed atomic operations
would be implemented either with plain loads and stores, or with
something stronger. There is still the theoretical problem that, as far
as I am aware, Atomic types are only guaranteed to work when the memory
is updated “within the memory model”. (For example, one could imagine an
architecture where the compiler’s preferred atomic operations will crash
the program if they overlap with DMA operations, but it has volatile
operations which are OK.) As an alternative, Waypipe might be able to
use std::ptr::read_volatile and std::ptr::write_volatile on entire
64-byte cache lines and thereby give the compiler more freedom to
optimize than if Waypipe were to do volatile operations on a single u8
or u64 at a time.
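The &[AtomicU8] approach can be sketched as follows. This is an illustration rather than Waypipe's code: the function names are hypothetical, and the sketch takes an ordinary slice of atomics instead of constructing one from an mmap pointer, which is the unsafe step a real implementation would need.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Copy the current contents of a shared region into `out`. Each byte is
/// read with a relaxed atomic load, so the compiler may not assume that a
/// later read of the same byte returns the same value.
fn copy_from_shared(shared: &[AtomicU8], out: &mut Vec<u8>) {
    out.clear();
    out.extend(shared.iter().map(|b| b.load(Ordering::Relaxed)));
}

/// Write `data` into the start of a shared region with relaxed atomic stores.
/// Only a shared (&) reference is needed, since other processes may also
/// hold the memory.
fn copy_to_shared(shared: &[AtomicU8], data: &[u8]) {
    for (dst, src) in shared.iter().zip(data) {
        dst.store(*src, Ordering::Relaxed);
    }
}
```

Whether a byte-at-a-time loop like this is turned into an efficient wide copy is up to the compiler, which is part of why the paragraph above asks for a "universally efficient" mechanism.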
A cross-GPU-platform library for general data compression and decompression on GPU with Vulkan; ideally for lz4 or zstd, but some other CPU-friendly format would be OK.
For bindgen to accept a list of required functions and return a
failing exit code if it does not generate bindings for all of them.
bindgen currently can only filter which
of the functions (or variables, constants, etc.) it makes bindings
of the functions (or variables, constants, etc.) it makes bindings
for.
A variation on the format!() macro that produces an iterator instead
of a String; this would make it possible to (without restructuring the
code very much) eliminate many intermediate allocations from dynamically
chosen trees of format! operations, like the following:
format!("{} is {}",
    if a { format!("{:x}", b) } else { "C".to_string() },
    if z { format!("{:x}", y) } else { "Z".to_string() })
I would not be surprised if this already exists.
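One workaround that exists today, though it is not the iterator-producing macro wished for above, is to wrap a closure in a type that implements Display, so the inner pieces are written directly into the outer formatter. The sketch below uses hypothetical names (DisplayFn, display_fn, describe) and only the standard library; the outer format! still allocates its result, but the inner branches no longer build temporary Strings.

```rust
use std::fmt;

/// Adapter that lets a closure act as a Display implementation.
struct DisplayFn<F>(F);

impl<F> fmt::Display for DisplayFn<F>
where
    F: Fn(&mut fmt::Formatter<'_>) -> fmt::Result,
{
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        (self.0)(f)
    }
}

/// Constructor whose bound lets the compiler infer the closure's signature.
fn display_fn<F>(f: F) -> DisplayFn<F>
where
    F: Fn(&mut fmt::Formatter<'_>) -> fmt::Result,
{
    DisplayFn(f)
}

/// Same output as the nested format! example, with no intermediate Strings.
fn describe(a: bool, b: u32, z: bool, y: u32) -> String {
    let left = display_fn(|f| if a { write!(f, "{:x}", b) } else { f.write_str("C") });
    let right = display_fn(|f| if z { write!(f, "{:x}", y) } else { f.write_str("Z") });
    format!("{} is {}", left, right)
}
```

With these definitions, describe(true, 255, false, 0) produces "ff is Z".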
Having learned more Rust recently, it is my irresponsibility to suggest things wiser programmers probably can explain are bad ideas.
I sometimes use key k1 to look up an &mut value x from a BTreeMap,
read data from x to determine a key k2 distinct from k1, which I use to
look up &mut value y, and then modify both x and y in some fashion.
Doing this requires dropping x and then looking it up again in the map.
Sometimes there is a third key whose value I’d like to modify, but the
total number is always small. The extra lookups could be avoided with
RefCell, but that has significant space overhead and is awkward to use
when programming. I think this problem could be solved with a sort of
split_at_mut()-analogue; a method on BTreeMap that looks something like
get_mut_and_remainder(&mut self, key: &K) -> Option<(&mut V, RemainingMap<K,V,1>)>
where RemainingMap<K,V,N> is a type referring to the BTreeMap which
keeps a list of N references &K and allows mutable lookups (but not
insertions or deletions) of keys with a
get_mut_and_remainder<N>(&mut self, key: &K) -> Option<(&mut V, RemainingMap<K,V,N + 1>)>
signature, failing when key matches any of the N references stored so
far. This would be an adaptive version of the currently-unstable
HashMap::get_many_mut. One can emulate something like this idea for
slices using split_at_mut(), but I don’t see how to soundly and
efficiently build it on top of BTreeMap’s current API. Maybe there is a
crate that already does this.
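For the slice case mentioned above, the emulation with split_at_mut() can be sketched like this. The helper name get_two_mut is hypothetical, and it handles only two indices rather than the adaptive N-key interface proposed for BTreeMap.

```rust
/// Return mutable references to two distinct elements of a slice, or None
/// if the indices are equal or out of bounds.
fn get_two_mut<T>(s: &mut [T], i: usize, j: usize) -> Option<(&mut T, &mut T)> {
    if i == j || i >= s.len() || j >= s.len() {
        return None;
    }
    if i < j {
        // Split so that `i` falls in the lower half and `j` starts the upper.
        let (lo, hi) = s.split_at_mut(j);
        Some((&mut lo[i], &mut hi[0]))
    } else {
        // Split so that `j` falls in the lower half and `i` starts the upper.
        let (lo, hi) = s.split_at_mut(i);
        Some((&mut hi[0], &mut lo[j]))
    }
}
```

The two halves returned by split_at_mut() are known to be disjoint, which is what lets the borrow checker accept two live mutable references; a BTreeMap has no comparable public operation that hands out disjoint mutable parts on demand.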
As far as I understand it, Rust has a notion of “uninitialized
memory”, where, quoting MaybeUninit’s documentation, it is
“undefined behavior to have uninitialized data in a variable even if
that variable has an integer type”. I don’t think this is necessary, and
believe that Rust’s existing rules and mechanisms for making
unconditional promises to the compiler are sufficient to enable all
practical optimizations.
Currently, the memory provided to Rust by alloc may be uninitialized,
and the memory region needs to be manipulated by pointer or through
MaybeUninit instead of by &mut [u8] slice, because
std::slice::from_raw_parts_mut requires that the data region it operates
on be properly initialized for the slice type (in this case, u8). As a
result, one often requires two variants of any FFI function that fills a
region of memory: one straightforwardly usable one which takes an
&mut [u8], and one which uses raw pointers (which is unsafe to use) or
MaybeUninit<u8> (safer but complicated). In practice, these two variants
would produce the same code, but if a crate provides the &mut [u8]
version one cannot obtain the raw pointer or MaybeUninit version from
it. (Without laundering the pointer through FFI.) I ran into this issue
when trying to use nix::sys::uio::readv on a fresh allocation, and when
making wrapper functions for lz4 and zstd compression
and decompression.
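The duplication looks roughly like the following sketch, where a trivial fill operation stands in for a real FFI call such as readv or a decompression routine. All three function names are hypothetical. The two fill functions do the same work, but only the MaybeUninit variant can be used on the spare capacity of a fresh allocation without first zeroing it.

```rust
use std::mem::MaybeUninit;

/// Variant 1: easy to call, but requires already-initialized memory.
fn fill_init(dst: &mut [u8], value: u8) {
    for b in dst.iter_mut() {
        *b = value;
    }
}

/// Variant 2: accepts uninitialized memory and returns it as initialized.
fn fill_uninit(dst: &mut [MaybeUninit<u8>], value: u8) -> &mut [u8] {
    for b in dst.iter_mut() {
        b.write(value);
    }
    // SAFETY: every element of `dst` was written in the loop above.
    unsafe { std::slice::from_raw_parts_mut(dst.as_mut_ptr() as *mut u8, dst.len()) }
}

/// Fill a fresh allocation without zeroing it first; only variant 2 applies.
fn fresh_buffer(len: usize, value: u8) -> Vec<u8> {
    let mut v: Vec<u8> = Vec::with_capacity(len);
    fill_uninit(&mut v.spare_capacity_mut()[..len], value);
    // SAFETY: the first `len` bytes of the spare capacity are now initialized.
    unsafe { v.set_len(len) };
    v
}
```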
Making alloc provide an initialized [u8] (albeit with arbitrary
contents) would avoid the above code duplication and reduce the number
of uses of unsafe required when making data structures or using
external libraries. But I do not think it would inhibit necessary
compiler optimizations, because Rust has good mechanisms for introducing
undefined behavior (read: unconditional promises to the compiler). If
one wants to optimize bounds checks around a partially initialized
region of memory, then std::hint::assert_unchecked can be used to
instruct the compiler which addresses are actually being read from, or
one can access memory through an intermediate slice (with associated
undefined behavior if an unchecked access is out of bounds for that
slice). Similarly, when allocating memory for a non-plain type T, one
does not need an “uninitialized memory” concept to make accessing T
undefined behavior; the compiler should already assume that blindly
transmuting raw memory ([u8; _]) into T is invalid, because it is not
guaranteed that the memory has valid contents for T. Finally, the use of
uninitialized variables (e.g. let x: u8; x += 1;) is already a language
error in Rust.
Also: I read a document explaining adding undef to LLVM; it gives mostly
C-specific or internal justifications: like discarding implicit function
return values, optimizing global variable initialization, or improving
compilation when a variable in an outer scope is not used when a given
condition holds: none of these should affect the Rust abstract
model.
Also: I read a relevant
post from 2019 mentioning an old set data structure which can work
when its memory region is arbitrarily initialized; the sparse set reads
from “uninitialized” but exclusively owned memory. I should note two
other examples which do not need initialization: first, there are catalytic algorithms
which use an (arbitrarily initialized) region of memory in their
calculation and later return it with the content reset to its initial
values; these only require exclusive access to the memory region.
Second, my favorite binary tree inversion algorithm, which uses only
O(log(tree depth)) words of space (in exchange for awesome
and superlative runtime): it uses the algorithm of Savitch’s theorem to
identify which addresses in memory correspond to tree nodes, and then
swaps the children of each tree node. This algorithm will read from
memory that the algorithm does not own (and which may constantly be
changing); but only requires exclusive (&mut) access to
the set of tree nodes; if one just wanted to count tree nodes, read-only
non-exclusive (&) access would suffice.
Was the rewrite worth it? I suspect yes: improving the code does seem to be somewhat easier to do in Rust than with the original, where I could never be certain that I was not missing some edge case, and moving DMABUF handling to use Vulkan has significantly improved performance. I will know for certain in a few years when I see what types of bugs I run into. Rewriting the code did take time; I did not precisely measure it but would estimate a month of work so far (spread out over a longer period, since Waypipe was not my sole focus); this is similar to the time needed to develop the program to begin with. Could I have achieved the same effect with a month of work in C? Probably, but I would not have as much confidence that the project quality would remain stable in the future, when I will probably make many changes and spend less time testing them. (For example: I held off on parallelizing buffer diff message application with the C version, because I expected it to be a difficult task to do right.)
Overall, I think Waypipe was appropriate for a Rust rewrite: Waypipe is network facing code, needs to be efficient, does some parsing, and uses multiple threads; and was originally written in C. Interacting with existing libraries’ C APIs was, as expected, more tedious to do than in C, but I think the improvements to Waypipe’s core logic are worth it.
In general, I would pick Rust for new projects that do a lot of parsing or communication with other (untrusted or badly written) processes, are CPU limited and need to be fast or power-efficient, require fast startup, and do not deeply use large and irreplaceable libraries from some other language. I would want to switch from C or C++ to Rust if the project is something that I use and make changes to often enough for the cost of making the change to be worth it; but this is rare. Switching from existing memory safe languages is probably only worth it when performance is at stake, and it is not practical to convert just the hot code.
I would not currently use Rust for glue scripts, basic file conversion, data analysis, game scripting, or exploratory programming; languages with a garbage collector and a more compact syntax (like Python, Scheme, Haskell, Clojure) tend to be better there.
Often the choice of language is controlled by which libraries are available: I’ve used C++ for many things because it was the easiest interface for a major library (Qt, OpenCASCADE, CGAL, or SDL/OpenGL). C is OK for small programs where most of the content is interaction with C APIs, but the language itself is the limiting factor beyond a certain scale, when proper number handling, string operations, or nontrivial data structures are required.
Finally, a reminder: Waypipe has been available for five years, and
using it exposes one’s local Wayland compositor to an application
running on a different computer. Even though Waypipe makes some sanity
checks on the messages it receives, it cannot guard against bugs in a
Wayland compositor. As before, do not assume that Waypipe itself
possibly being more secure makes it safe to waypipe ssh
into a compromised computer and run GUI programs; Wayland compositors
are in general not well tested against adversarial clients.