Network transparency with Wayland


2019-07-11: I've managed to get hardware video encoding and decoding using VAAPI working with waypipe, although of course the hardware codecs are less flexible and introduce additional restrictions on the image formats and dimensions. For example, buffers currently need to have an XRGB8888 pixel format (or a standard permutation thereof), as the Intel/AMD VAAPI implementations otherwise do not appear to support hardware conversions between the RGB color space and the YUV color space used by video formats. It's also best if the buffers have 64-byte aligned strides, and 16-pixel aligned widths and heights. The result can run significantly faster than encoding with libx264, although to maintain the same level of visual quality the bitrate must be increased.
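
The alignment constraints amount to a bit of rounding arithmetic. A minimal sketch of the kind of padding involved (the helper names here are mine, not waypipe's):

```c
#include <stdint.h>

/* Round x up to the next multiple of align; align must be a power of two. */
static inline uint32_t align_up(uint32_t x, uint32_t align) {
    return (x + align - 1) & ~(align - 1);
}

/* Hypothetical helper: pick hardware-friendly dimensions for a
 * 4-byte-per-pixel XRGB8888 buffer, padding width and height to multiples
 * of 16 pixels and the stride to a multiple of 64 bytes. */
static void pad_for_vaapi(uint32_t width, uint32_t height,
                          uint32_t *padded_w, uint32_t *padded_h,
                          uint32_t *stride) {
    *padded_w = align_up(width, 16);
    *padded_h = align_up(height, 16);
    *stride = align_up(*padded_w * 4, 64);
}
```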

For games, using video compression with waypipe is probably worth the tradeoffs now. In some instances, it can even be faster. A 1024 by 768 SuperTuxKart window during a race, running with linear-format DMABUFs, losslessly replicated without compression via ssh on localhost, requires about 130MB/s of bandwidth and runs at about 40 FPS. (Using LZ4 or Zstd for compression would reduce bandwidth, but on localhost or a very fast network would take more time than would be saved by the bandwidth reduction.) Using the new --hwvideo flag to enable VAAPI encoding and decoding with a generation 8 Intel iGPU, the framerate goes up to 80 FPS, and bandwidth drops to 4 MB/second. The resulting images do have visible artifacts, as the following frame shows:

Of course, video encoding has its other downsides; at roughly the same framerate, a different trial with SuperTuxKart at 2048x1024 pixels used about 5W/25% more power when run with hardware encoding than when run using the standard buffer replication mechanism. (As the power usage tests were run on a laptop on battery, the confounders range from local thermal regulation to system power management; but the general trend that video encoding uses more power than streaming operations on memory probably still holds. It's possible that for newer hardware GPU encoding is more power efficient.) It's also important to note that, depending on the computers used, the software video pathway can be both faster and use less bandwidth than hardware video, especially for older hardware.


2019-07-05: For the last week, I've mostly been writing tests and fixing bugs. One or two bugs were new, caused by the switch to using the protocol unmarshaling code generator that I wrote a week ago, but most are older bugs that occur when waypipe is exposed to malformed protocol messages. For example, there used to be a use-after-free when a Wayland request to a given object was crafted to create a new object that would overwrite the existing object, but not replace all the references to the old object. Several such bugs were found by fuzzing, using AFL.

AFL is a file-format fuzzer, so it's not immediately obvious how to make it fuzz a pair of programs which use Unix socket connections as input. Furthermore, the Wayland protocol requires file descriptor transfers via sendmsg, and almost no fuzzers are set up to do that. The solution I chose for waypipe is to write an alternative frontend, which runs two linked copies of waypipe as distinct threads, and has a third thread which reads from a given input file and sends messages to the other two threads. The input files are structured with a simple header based format which the third thread can use to decide to which copy of waypipe it should send the next block of data. The headers also include a field indicating how large of a shared memory buffer, if any, to sendmsg to the next selected waypipe thread. To provide the initial test cases for AFL, I wrote a script which proxies a Wayland application and dumps its (formatted) protocol messages to a file.
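
The per-block header could look roughly like the following sketch; the field names and layout here are illustrative guesses, not the exact format used by waypipe's fuzz frontend:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header preceding each block of data in the fuzz corpus file.
 * The real frontend's field layout may differ. */
struct fuzz_block_header {
    uint32_t target;   /* which waypipe thread receives this block */
    uint32_t data_len; /* bytes of message data following the header */
    uint32_t shm_size; /* size of shared-memory file to attach via sendmsg;
                        * 0 means no file descriptor is transferred */
};

/* Parse one header from a byte buffer; returns 0 on success, -1 if the
 * buffer is too short. */
static int parse_block_header(const uint8_t *buf, size_t len,
                              struct fuzz_block_header *hdr) {
    if (len < sizeof(*hdr))
        return -1;
    memcpy(hdr, buf, sizeof(*hdr));
    return 0;
}
```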

While throughput for this fuzzing method was initially rather high, at a few thousand invocations per second with 1KB buffers, the executions weren't very repeatable. The initial fuzzing interface did not wait for a given waypipe thread to respond to its last input, instead writing data as quickly as possible to each thread. Depending on processing delay for each thread, and system background variability, the pair of waypipe main loop threads would each encounter a different interleaving of messages from the input file. The current frontend now waits for a block of messages to pass entirely through the system before sending the next message; it's significantly slower, by about a factor of ten, than the old version, but is much more repeatable. As most of the slowdown is due to time spent waiting, one can recover some of the initial throughput by running multiple fuzzer instances per core.


2019-06-29: I've taken a short break from primarily working on waypipe development, instead focusing on more general experimentation. One of the results is a small test client for copy-paste operations, which has uncovered a rather interesting issue that applies to sway and a few other compositors. The core Wayland protocol uses a server-global serial counter to assign a number to each input event. When a client makes a copy selection request, it provides the serial number of the event which caused the request. When an application is delayed a few hundred milliseconds, the provided serial number can be used to reject its copy selection request if another application has made a request in response to a more recent input event. It's also possible that a badly written application (like my test client) sends a request to set the copy selection with a serial number that is far too large, making selection requests from other applications fail. Currently, wlroots and Weston do not check that the serial numbers in requests were ever sent to the client. Unfortunately, tracking the set of all, or even just the most recent, serial numbers given to a client would require a modification of libwayland-server, as the wl_display::sync request lets clients know what the current serial number is. In practice, just checking that the serial numbers in selection requests aren't any larger than the current server serial suffices to avoid any problems.
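
In practice the fix amounts to a one-line bounds check. A minimal sketch of such a wraparound-aware comparison (the function name is mine, not code from wlroots or Weston):

```c
#include <stdbool.h>
#include <stdint.h>

/* Accept a client-provided serial only if it does not exceed the latest
 * serial the server has issued. Serials are 32-bit and may wrap around,
 * so the comparison is done on the signed difference. */
static bool serial_is_plausible(uint32_t provided, uint32_t current_server_serial) {
    return (int32_t)(current_server_serial - provided) >= 0;
}
```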

I also have written a small prototype for a code generator for Wayland protocol method calls. It precomputes data needed to quickly verify that messages are well formed, and generates small wrapper functions that can call an event handler directly from the wire format. For example,

void do_wl_display_evt_error(struct context *ctx, struct wp_object *object_id, uint32_t code, const char *message);
void call_wl_display_evt_error(struct context *ctx, const uint32_t *payload, const int *fds, struct message_tracker *mt) {
    int i = 0;
    struct wp_object *arg0 = get_object(mt, payload[i], NULL);
    i++;
    uint32_t arg1 = payload[i];
    i++;
    /* strings are encoded as a byte length (including the null terminator),
     * followed by the string data, padded to a 32-bit boundary */
    const char *arg2 = (const char *)&payload[i + 1];
    i += 1 + (int)((payload[i] + 3) / 4);
    do_wl_display_evt_error(ctx, arg0, arg1, arg2);
    (void)fds;
}

It's not stable enough to integrate into waypipe at the moment, but if protocol parsing ever becomes a bottleneck, this dispatch method should be significantly faster than the existing solution of using libffi to make function calls.


2019-06-20: The rate of additions to the code has started to slow, because waypipe already has most of the features it would ever need. The most significant change for the preceding week was making most build dependencies optional: only a core of libffi, libwayland, and wayland-protocols are required. (librt and pthreads are already dependencies of libwayland.) There has been a slight speedup to the damage merging algorithm, but on further reflection the whole "extended interval" construction may limit the performance of the buffer diff construction and application procedures. Instead, the horizontally banded damage tracking data structure underlying pixman may make it easier to ensure that buffers are scanned monotonically -- or it may prove to be unavoidably slow to construct, in the worst case. The next most notable change was the introduction of a headless test, checking that applications run indirectly through waypipe under a "headless" weston instance do not crash. All the remaining changes are essentially bug fixes and small expansions of the previous multithreading and video encoding work.

Performance testing, on the other hand, has had a few interesting results. Using perf to trace both the USDT probes mentioned earlier, and scheduler context switches, makes it easy to find out why a given data transfer takes the time that it does. I've written a small script that generates timeline plots from the perf script output. The following image shows a (zoomed out) 120 second long plot, revealing a number of different interaction patterns in a sample program:

Zooming in a bit shows a brief 0.3 second chart of waypipe operations. The 5 rows, from bottom to top, are a waypipe server instance, followed by three associated worker threads, and with a waypipe client instance (connected via ssh) at the top. Overlapping time intervals are nested, so that they remain visible. Gray intervals indicate the range of times during which the program is scheduled to run; orange, the time range needed for the waypipe client to read out the full transfer contents; green, the time range needed for the waypipe server to send the data transfer; and red, the time spent by worker threads compressing the buffer change records needed by the following data transfer operation.

Most of the time is spent reading and writing the data transfers to the ssh connection. While these are ongoing, no other work is performed. A possible optimization would be to switch to a "streaming" data transfer model, in which the write operation is run in parallel with buffer diff compression, sending data as soon as it is compressed. The other side of the connection would perform streaming decompression, probably still with a single core. While this application only updates a single image buffer at a time, for applications which maintain multiple windows, transfer latency can be reduced by writing each window's buffer change transcript as soon as it is available, instead of bundling everything into a single large transfer.

The red intervals corresponding to the compress_buffer set of USDT probes often differ significantly in duration; dynamically adjusting the workload between threads to be more even may offer significant latency reduction.

Finally, looking at microsecond-scale details, we can see that during a transfer between matching waypipe instances, the receiving program makes a large number of context switches to periodically check for (and one time, handle) new events from either pipes or a connected Wayland compositor or application, and then resume the nonblocking read operation. Given the number of context switches, it may be more efficient -- even on single-core computers -- to create separate threads that only blocking read/write to the ssh connection, and notify the main thread on completion; but on the other hand, this may interfere with other optimizations, and would introduce a minimum of two additional context switches, and possibly cross-cache data migration as well. Sadly aio is oriented around file operations.


2019-06-13: The last week has been rather busy, introducing minimal support for:

More recently, waypipe has gained a more detailed damage tracking mechanism for shared-memory images (and potentially dmabufs with a linear layout).

On the left, the above picture shows a collection of damage rectangles superimposed on a grid which contains the pixels (or bytes, or uint32_t's) in a shared memory buffer, ordered from left to right and then top to bottom. Because the diff construction routine used to determine which bytes have actually changed, and then copy out the changed bytes, is optimized to scan specific intervals, waypipe converts the set of damage rectangles to a set of disjoint intervals in the linear memory space. Because converting a 400 by 600 pixel rectangle into 600 distinct intervals would be a waste of memory, the conversion routine stores such rectangular units as "extended intervals", collections of regular intervals of the same width, with starting points separated by multiples of the image stride. Finally, to limit the total number of rectangles produced, and to keep the overhead of tracking changed-byte intervals from exceeding the size of the actual buffer contents, intervals are merged together if the minimum gap between the two -- in the linear memory space -- is less than or equal to a prespecified constant margin. In some cases, this heuristic can require merging rectangles which are on opposite sides of the image, but nevertheless close in memory. waypipe currently implements this transformation with a naive O(n^2) approach; much lower asymptotic runtimes are possible in theory, and may be mentioned here in the future.
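
The extended-interval representation and the gap-based merge heuristic can be sketched roughly as follows; the structure layout and function names here are illustrative, not waypipe's actual code, and the merge covers only the plain single-interval case:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical "extended interval": `rep` regular intervals of identical
 * width, with start points spaced `stride` bytes apart. A plain interval
 * is the special case rep == 1. */
struct ext_interval {
    int32_t start;  /* offset of the first interval, in bytes */
    int32_t width;  /* length of each interval, in bytes */
    int32_t rep;    /* number of repetitions (e.g. damage-rect height) */
    int32_t stride; /* byte distance between successive starts */
};

/* Minimum gap between two intervals in linear memory; -1 if they overlap. */
static int32_t interval_gap(int32_t a_start, int32_t a_end,
                            int32_t b_start, int32_t b_end) {
    if (a_end < b_start)
        return b_start - a_end;
    if (b_end < a_start)
        return a_start - b_end;
    return -1;
}

/* Merge heuristic: fuse two single intervals (rep == 1) when the gap
 * between them is at most `margin` bytes. Writes the union to *out and
 * returns true on success. */
static bool try_merge(const struct ext_interval *a, const struct ext_interval *b,
                      int32_t margin, struct ext_interval *out) {
    if (a->rep != 1 || b->rep != 1)
        return false; /* the general case needs stride-aware logic */
    if (interval_gap(a->start, a->start + a->width,
                     b->start, b->start + b->width) > margin)
        return false;
    int32_t lo = a->start < b->start ? a->start : b->start;
    int32_t hi_a = a->start + a->width, hi_b = b->start + b->width;
    out->start = lo;
    out->width = (hi_a > hi_b ? hi_a : hi_b) - lo;
    out->rep = 1;
    out->stride = 0;
    return true;
}
```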


2019-05-29: waypipe is mostly usable now - it works with reasonably low overhead for mostly-static GUI applications like kwrite and libreoffice, although it will probably crash and leak memory every now and then.

I've implemented a Wayland protocol parser for waypipe, and now use it to track the ownership and lifetime of wl_shm_pool buffers. An especially annoying detail of the Wayland protocol is that one cannot determine which passed file descriptor corresponds to which message without parsing the message. (This also makes proper handling of inert objects more complicated.) At the very least, the protocol makes it possible to identify message boundaries in the byte stream by including a 16-bit length field. Since protocol messages are effectively limited to <4096 bytes by libwayland fiat, and there are ≪1024 requests or events per interface in practice, a simple way to fix this issue would be to partition the second word of a Wayland protocol header into a 12 bit byte-stream length field, 10 bit message id field, and 10 bit file-descriptor-stream length field. (12/16/4 also works, and can be made backwards-compatible.)
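
For reference, the first two words of every Wayland wire message are the target object id, followed by a word whose upper 16 bits give the message size in bytes and whose lower 16 bits give the opcode. A quick sketch of splitting that second word, along with the hypothetical 12/10/10 repartition proposed above:

```c
#include <stdint.h>

/* Current wire format: second header word = (size_in_bytes << 16) | opcode. */
static void parse_header_word(uint32_t word, uint16_t *size, uint16_t *opcode) {
    *size = (uint16_t)(word >> 16);
    *opcode = (uint16_t)(word & 0xffff);
}

/* The proposed (hypothetical) 12/10/10 split: 12 bits of byte-stream
 * length, 10 bits of message id, 10 bits of fd-stream length. */
static void parse_proposed_word(uint32_t word, uint32_t *byte_len,
                                uint32_t *msg_id, uint32_t *fd_len) {
    *byte_len = word >> 20;
    *msg_id = (word >> 10) & 0x3ff;
    *fd_len = word & 0x3ff;
}
```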

Another very useful change has been to make waypipe's connections to the compositor or application and to the matching waypipe instance nonblocking. As long as waypipe itself is not stuck with CPU-intensive computations, messages from the wayland client to the wayland server do not interfere/synchronize with messages moving in the other direction. This fixes a key repeat issue with waypipe from earlier, in which key repeat events were delayed by large screen updates. As networks can still be unreliable, it may still be useful to have waypipe modify messages to disable key repeat.

A useful trick when testing waypipe is to artificially adjust network parameters. This can be done with Linux traffic control tools, such as NetEm. For example, sudo tc qdisc add dev lo root netem rate 1000kbit will throttle bandwidth through the loopback interface, and sudo tc qdisc add dev lo root netem delay 100ms will add 100ms of latency.

When running games over waypipe, FPS often drops by a factor of 2 or more, as frames are delayed by the round trip time for the frame display callback. A workaround is to run the game nested inside weston; the compositor updates the game at e.g. 60 fps, and sends over a subset of the rendered frames. Depending on how the game loop is written, this trades a very slow game for one in which frames are dropped.


2019-05-23: It turns out that most toolkits and applications already work when restricted to protocols whose underlying file descriptors waypipe can successfully translate. I've modified waypipe to filter out the compositor messages advertising the availability of zwp_linux_dmabuf_v1 and wl_drm; see the protocol-aware branch.

I've tested the following programs (with ssh compression enabled, and the appropriate environment variables set to use wayland instead of X11):

Note that SuperTuxKart uses OpenGL 3.3, albeit via LLVMpipe, on the remote system. Due to current bandwidth and computational limitations, for a 1024x768 window, it only runs at about 2 FPS.

It is always useful to consider what the best possible performance would be. My test system, over WiFi, manages a 1MB/s transfer rate. A 2560x1440 screen, with 24 bits per pixel, and 60Hz refresh rate, displays 663MB/s of data. However, if the underlying data is for a 3d game, or some program with a scrolling viewport, then video bandwidth estimates should apply; one might only need 3 MB/s instead, which most modern networks, computers and hardware video decoders can handle. (For comparison, with a text-heavy website, losslessly compressing each frame as a PNG produces a transfer rate of about 25MB/s.) Of course, even if enough bandwidth is available, CPU usage and processing delay are still an issue. Furthermore, by default, the wl_surface interface uses a frame request to set up callbacks so that typical applications only provide a new frame sometime after the old frame has arrived at the compositor. This helps avoid computing unused frames, but introduces a round-trip latency between successive frames that is increased by any delays in waypipe itself.
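
The raw-bandwidth figure above is simple arithmetic; a quick check:

```c
#include <stdint.h>

/* Uncompressed bandwidth, in bytes per second, for a stream of raw frames:
 * width * height * bytes_per_pixel * refresh_rate. */
static uint64_t raw_bandwidth(uint64_t width, uint64_t height,
                              uint64_t bytes_per_pixel, uint64_t hz) {
    return width * height * bytes_per_pixel * hz;
}
```

2560 x 1440 pixels at 3 bytes per pixel and 60Hz gives 663,552,000 bytes per second, matching the ~663MB/s estimate.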


2019-05-20: The minor issues mentioned last time have been fixed. The crash with weston-terminal was caused by a pty file descriptor from forkpty, which was both readable and writable, unlike pipe ends on Linux, which are only one of the two. A simple flag-setting signal handler for SIGINT now controls main loop termination for waypipe client and server processes, and permits cleanup of the Unix domain socket address files. A no-op handler for SIGCHLD is now set, and the flag SA_RESTART ensures that only the poll system calls are interrupted when child processes end. The ssh reverse tunnel closing issue was caused by the main client process not closing the socket connection file descriptors.

At the moment, the performance of waypipe can be significantly improved by adding the -C option to enable compression to ssh. For example, try waypipe ssh -C user@host weston-stacking, and fullscreen the window. On one of my systems, over a slow connection, enabling compression reduces the time to draw the updated frame from about 10 seconds to 1.


2019-05-18: Still working on protocol-agnostic improvements, I have added translation of pipe operations to the prototype. These are required for wl_data_offer and primary_selection, which are used by copy/paste operations. An additional complication is that both protocols rely on the write end of the shared pipe being closed, to signify the completion of a data transfer. Fortunately, poll provides a relatively clean way to detect pipe closures, with the POLLHUP event flag. (At the moment, weston-terminal crashes on paste thanks to a non-S_ISFIFO fd; termite appears to work.)
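
A standalone demonstration of the POLLHUP mechanism (illustrative, not waypipe code): when every write end of a pipe has been closed, polling the read end reports POLLHUP.

```c
#include <poll.h>
#include <unistd.h>

/* Returns 1 if the read end `fd` reports that all write ends of the pipe
 * have been closed, 0 if not, -1 on error. Uses a zero timeout so the
 * check never blocks. */
static int pipe_writer_closed(int fd) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, 0);
    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLHUP)) ? 1 : 0;
}
```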

There's not much more that can be done with a proxy that doesn't parse/modify protocol messages, so the first main work period (2019-05-27+) will probably be dedicated to that. Also, it looks like direct forwarding of e.g. wl_drm will not be possible, because the protocol sends a file descriptor for a DRM device, and it does not seem possible to respond from userspace to the ioctl calls used to manipulate the device. It may also be possible (but very complicated) to translate protocol between wl_shm locally and e.g. linux_dmabuf+linux_explicit_synchronization remotely, since, to the extent that the compositor needs, the products of both are abstracted through wl_buffer.

I have also updated the command line interface for waypipe, because manually setting socket paths is inconvenient. If waypipe is correctly installed (visible in $PATH) on both a local and a remote system, then prefixing e.g. ssh -C me@there with waypipe to produce waypipe ssh -C me@there will automatically launch a waypipe client, add reverse tunneling flags, force pseudo-terminal allocation (-t), remotely run a waypipe server which runs either the default $SHELL or whatever program was requested, and close the client when ssh completes.

There are a few minor annoyances with the interface at the moment: ssh doesn't always exit cleanly, because it appears the reverse tunnel fails to close cleanly; waypipe litters socket bind paths in /tmp; and it usually takes a second to close a waypipe ssh session, because I've yet to set up signal handling and currently poll waitpid. (Unfortunately, signalfd is Linux-only, and pdfork is FreeBSD-only.)


2019-05-14: It lives, it moves!

I wrote a short proof of concept implementation of waypipe, which only supports shared memory buffers, and has about as unoptimized a change-detection and transfer mechanism as I can manage. Change-detection is performed by maintaining a twin copy; for transfer, waypipe sends over the entire buffer if any part changed. The program works, but lags quite horribly when I run weston-terminal over a WiFi network. For the code, see the protocol-agnostic branch, but do beware that it is research-quality (written with no particular care for clarity/maintenance).
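
The twin-copy strategy can be sketched in a few lines; the function name is mine, not code from the branch:

```c
#include <stdbool.h>
#include <string.h>

/* Naive change detection: compare the live buffer against a shadow copy.
 * If anything differs, refresh the shadow and report that the entire
 * buffer should be retransmitted. */
static bool buffer_changed(const void *live, void *shadow, size_t size) {
    if (memcmp(live, shadow, size) == 0)
        return false;
    memcpy(shadow, live, size);
    return true;
}
```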

Using wf-recorder, I recorded a short video of the proof of concept in action. Startup takes a while, and there is visible latency tracking mouse hover over even a small popup. When typing, with standard key repeat settings on the compositor side, one must be very careful to time keypresses so as to queue up the meaningful keystrokes when the underlying channel is hanging, lest non-idempotent actions be repeated.

In other news, by running buggy test clients, I've managed to crash sway often enough that I'll try to resolve the underlying issue sometime this week. On that note, I had been wondering why inert objects produce no errors, until reading this document.


2019-05-07: Task: implement a tool (henceforth called waypipe) which can be used to relay both messages and data between any Wayland client and compositor over a single transport channel. This should enable workflows similar to those using ssh -X.

The main difficulty in producing such a tool is that Wayland protocol messages primarily include control information, and the large data transfers of graphical applications are implemented through shared memory. waypipe must then identify and serialize changes to the shared memory buffers into messages to be transferred over a socket.

The above project will be funded via Google Summer of Code 2019.

