GSoC 2021: Selection Editing and Window Selection

This summer I’m implementing a new screenshot UI for GNOME Shell. In this post I’ll show my progress over the past two weeks.

The new screenshot UI in the area selection mode

I spent most of the time adding the four corner handles that allow you to adjust the selection. GNOME Shell’s drag-and-drop classes were mostly sufficient, save for a few minor things. In particular, I ended up extending the _Draggable class with a drag-motion signal emitted every time the dragged actor’s position changes. I used this signal to update the selection rectangle coordinates so the selection responds to dragging in real-time without any lag, just as one would expect. Some careful handling was also required to allow dragging a handle past the selection edges, so, for example, it’s possible to grab the top-left handle and move it to the right and to the bottom, making it a bottom-right handle.

Editing the selection by dragging the corner handles

I’ve also implemented a nicer animation for opening the screenshot UI. Now the screen freezes instantly when you press the Print Screen button and the screenshot UI fades in, without the awkward screenshot blend. Here’s a side-by-side comparison with the previous behavior:

Comparison of the old and new opening animation, slowed down 2×

Additionally, I fixed X11 support for the new screenshot capturing. Whereas on Wayland the contents of the screen are readily available because GNOME Shell is responsible for all screen compositing, on X11 that’s not always the case: full-screen windows get unredirected, which means they bypass the compositing and go straight through the X server to the monitor. To capture a screenshot, then, GNOME Shell first needs to disable unredirection for one frame and paint the stage.

Capturing on X11 now works just as well as on Wayland, including the ability to capture transient windows such as tooltips, a long-requested feature. However, certain right-click menus on X11 grab the input and prevent the screenshot UI hotkey (and other hotkeys, such as Super to enter the Overview) from working. This is a long-standing limitation of the X11 session, and unfortunately it means these menus cannot be captured on X11. On Wayland this is not a problem: GNOME Shell handles all input itself, so windows cannot block its hotkeys.

Finally, over the past few days I’ve been working on window selection. Similarly to full-screen screenshots, every window’s contents are captured immediately as you open the screenshot UI, allowing you to pick the right window at your own pace. To capture the window contents I use Robert Mader’s implementation, which I invoke for all windows from the current workspace when the screenshot UI is opening. I arrange these window snapshots in a grid similar to the Overview and let the user pick the right window.

Window selection in action

As usual, the design is nowhere near finished or designer-approved. Consider it an instance of my “programmer art”. 😁

My goal was to re-use as much of the Overview window layout code as possible. I ended up making my own copy of the WorkspaceLayout class (I was able to strip it down considerably, because the original class has to deal with windows disappearing, re-appearing and changing size, whereas the screenshot UI window snapshots never change) and directly re-using the rest of the machinery. I also made my own widget compatible with WindowPreview, which exports the few functions used by the layout code, once again considerably simplified thanks to not having to deal with the ever-changing real windows.

The next step is to put more work into the window selection to make sure it handles all the different setups and edge cases right: the current implementation is essentially the first working draft that only supports the primary monitor. Then I’ll need to add the ability to pick the monitor in the screen selection mode and make sure it works fine with different setups too. I also want to figure out capturing screenshots with a visible cursor, which is currently notably missing from the screenshot UI. After that I’ll tackle the screen recording half.

Also, unrelated to the screenshot UI, I’m happy to announce that my merge request for reducing input latency in Mutter has finally been merged and should be included in Mutter 41.alpha.

That’s it for this post, see you in the next update!

GSoC 2021: GNOME Shell Screenshot UI

Hello! I’m Ivan Molodetskikh, a computer science student from Moscow, Russia.

I’ve been involved in GNOME starting from my GSoC 2018 project to port librsvg filters to Rust. Throughout the last year in GNOME I’ve been doing some work to reduce input latency in Mutter, GNOME’s compositor (by implementing the presentation-time Wayland protocol and adding dynamic render time computation). I’ve also created two small apps, Video Trimmer and Identity.

As part of this year’s Google Summer of Code, I’m implementing a new screenshot UI in GNOME Shell.

Screenshot UI panel mock-up by the design team

The UI will make taking screenshots and recording screencasts more intuitive and discoverable. On a key press, GNOME Shell will capture a full screenshot, and you will be able to select the exact area you want. The screenshot is captured immediately, so it’s much easier to catch the right moment or capture open context menus.

Screencasts will get an upgrade too: you will be able to record areas of the screen or individual windows, just like you already can with screenshots.

Over the first few weeks I figured out how to add new UI elements to GNOME Shell: how to construct UI with GJS, how to style elements with CSS, the difference between Clutter actors and layouts and St (GNOME Shell’s toolkit) widgets, and how to do transitions and handle input. I’ve been basing my work on the UI mock-up from the design team. Here’s a short demo of what I’ve implemented so far:

Demo of the parts of the mock-up that I’ve implemented thus far

Keep in mind this is very much a work in progress: I used stock icons instead of the ones from the mock-up, I haven’t got any designer feedback yet, screen recording is not implemented, and so on.

Using Robert Mader’s texture actor implementation, I added a Mutter function to snapshot the screen contents into a GPU texture that can be shown on a GNOME Shell widget. This way I can instantly display the screenshot preview in the UI without doing a slow PNG encoding round-trip. The UI then allows you to select an area or a screen and capture it into an image by pressing the capture button. Currently, the image is copied to the clipboard; I paste the screenshot into Obfuscate to display it.

When you switch to the screencast mode, you simply see your normal desktop instead of the screen snapshot, because screen recording starts only when you press the capture button and so records the live screen rather than an old snapshot.

The next step is to implement window selection, which will arrange windows similarly to the Overview. Afterwards I’ll work on the screen recording part. I have also contacted the design team to get feedback and make sure the UI is the best it can be.

I’d like to thank my mentor, Jonas Dreßler (aka verdre), for keeping up with my questions. I’m excited to bring an awesome screenshot UI to GNOME, see you all in the next blog posts!

GSoC 2018: Overview

Introduction

Throughout the summer I was working on librsvg, a GNOME library for rendering SVG files to Cairo surfaces. This post is an overview of the work I did with relevant links.

My Results

My project was to port the SVG filter infrastructure of librsvg from C to Rust, adding all missing filter tests from the SVG test suite along the way. I was also expected to implement abstractions to make the filter implementation more convenient, including Rust iterators over the surface pixels.

Here’s a list of all merge requests accepted into librsvg as part of my GSoC project:

Here’s a convenient link to see all of these merge requests in GitLab: https://gitlab.gnome.org/GNOME/librsvg/merge_requests?scope=all&utf8=%E2%9C%93&state=all&author_username=YaLTeR&label_name[]=GSoC%202018

All of this code was accepted into the mainline and will appear in the next stable release of librsvg.

I also wrote the following blog posts detailing some interesting things I worked on as part of the GSoC project:

Further Work

There are a couple of fixes which still need to be done for filters to be feature-complete:

  • Fixing filters operating on off-screen nodes. Currently all intermediate surfaces are limited to the original SVG view area, so anything off-screen is inaccessible to filters even when it should be accessible. This is blocked on some considerable refactoring in the main librsvg node drawing code which is currently underway.
  • Implementing the filterRes property. This property allows setting the pixel resolution for filter operations and is one of the ways of achieving more resolution-independent rendering results. While it can be implemented with the current code as is, it will be much more convenient to account for it while refactoring the code to fix the previous issue.
  • Implementing the enable-background property. The BackgroundImage filter input should adhere to this property when picking which nodes to include in the background image, whereas it currently doesn’t.

GSoC 2018: Parallelizing Filters with Rayon

Introduction

I’m working on SVG filter effects in librsvg, a GNOME library for rendering SVG files to Cairo surfaces. After finishing porting all filters from C to Rust and adding tests, I started investigating the filter performance. With the codebase converted to Rust, I am able to confidently apply important optimizations such as parallelization. In this post I’ll show how I parallelized two computation-intensive filter primitives.

Rayon

Rayon is a Rust crate for introducing parallelism into existing code. It utilizes Rust’s type system to guarantee memory safety and data-race-free execution. Rayon mimics the standard Rust iterators, so it’s possible to convert existing iterator-based code with next to no changes. Compare the following single-threaded and parallel code:

// Single-threaded.
fn sum_of_squares(input: &[i32]) -> i32 {
    input.iter()
         .map(|i| i * i)
         .sum()
}

// Parallelized with rayon.
use rayon::prelude::*;
fn sum_of_squares(input: &[i32]) -> i32 {
    input.par_iter()
         .map(|i| i * i)
         .sum()
}

Merely using .par_iter() instead of .iter() parallelizes the computation.

Parallelizing Lighting Filters

Continuing with analyzing and improving the performance of the infamous mobile phone test case, the two biggest time sinks turned out to be the lighting and the Gaussian blur filter primitives. This is easy to see on the callgrind graph from KCachegrind:

KCachegrind graph for filter rendering

Since I was working on optimizing the lighting filters, I decided to try out parallelization there first.

The lighting filter primitives in SVG (feDiffuseLighting and feSpecularLighting) can be used to cast light onto the canvas using a render of an existing SVG node as a bump map. The computation is quite involved, but it boils down to performing a number of arithmetic operations for each pixel of the input surface independently of the others—a perfect target for parallelization.

This is what the code initially looked like:

let mut output_surface = ImageSurface::create(
    cairo::Format::ARgb32,
    input_surface.width(),
    input_surface.height(),
)?;

let output_stride = output_surface.get_stride() as usize;
let mut output_data = output_surface.get_data().unwrap();

let mut compute_output_pixel = |x, y, normal: Normal| {
    let output_pixel = /* expensive computations */;

    output_data.set_pixel(output_stride, output_pixel, x, y);
};

// Compute the edge pixels
// <...>

// Compute the interior pixels
for y in bounds.y0 as u32 + 1..bounds.y1 as u32 - 1 {
    for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
        compute_output_pixel(
            x,
            y,
            interior_normal(&input_surface, bounds, x, y),
        );
    }
}

The edge pixel computation is separated out for optimization reasons and isn’t important here. We want to focus on the main loop over the interior pixels, which takes up most of the time.

What we’d like to do is take the outer loop over the image rows and run it in parallel on a thread pool. Since each row (well, each pixel in this case) is computed independently of the others, we should be able to do it without much hassle. However, we cannot do it right away: the compute_output_pixel closure mutably borrows output_data, so sharing it across multiple threads would give them concurrent mutable access to all image pixels. That could result in a data race, so the code won’t pass the Rust compiler’s borrow checker.

Instead, we can split the output_data slice into row-sized non-overlapping chunks and feed each thread only those chunks that it needs to process. This way no thread can access the data of another thread.

Let’s change the closure to accept the target slice (as opposed to borrowing it from the enclosing scope). Since the slices will start from the beginning of each row rather than the beginning of the whole image, we’ll also add an additional base_y argument to correct the offsets.

let compute_output_pixel =
    |mut output_slice: &mut [u8],
     base_y,
     x,
     y,
     normal: Normal| {
        let output_pixel = /* expensive computations */;

        output_slice.set_pixel(
            output_stride,
            output_pixel,
            x,
            y - base_y,
        );
    };

// Compute the interior pixels
for y in bounds.y0 as u32 + 1..bounds.y1 as u32 - 1 {
    for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
        compute_output_pixel(
            output_data,
            0,
            x,
            y,
            interior_normal(&input_surface, bounds, x, y),
        );
    }
}

Now we can convert the outer loop to operate through iterators using the slice’s chunks_mut() method, which does exactly what we want: it returns the slice in evenly sized, non-overlapping mutable chunks.

let compute_output_pixel =
    |mut output_slice: &mut [u8],
     base_y,
     x,
     y,
     normal: Normal| {
        let output_pixel = /* expensive computations */;

        output_slice.set_pixel(
            output_stride,
            output_pixel,
            x,
            y - base_y,
        );
    };

// Compute the interior pixels
let first_row = bounds.y0 as u32 + 1;
let one_past_last_row = bounds.y1 as u32 - 1;
let first_pixel = (first_row as usize) * output_stride;
let one_past_last_pixel =
    (one_past_last_row as usize) * output_stride;

output_data[first_pixel..one_past_last_pixel]
    .chunks_mut(output_stride)
    .zip(first_row..one_past_last_row)
    .for_each(|(slice, y)| {
        for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
            compute_output_pixel(
                slice,
                y,
                x,
                y,
                interior_normal(
                    &input_surface,
                    bounds,
                    x,
                    y,
                ),
            );
        }
    });

And finally, parallelize by simply changing chunks_mut() to par_chunks_mut():

use rayon::prelude::*;

output_data[first_pixel..one_past_last_pixel]
    .par_chunks_mut(output_stride)
    .zip(first_row..one_past_last_row)
    .for_each(|(slice, y)| {
        for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
            compute_output_pixel(
                slice,
                y,
                x,
                y,
                interior_normal(
                    &input_surface,
                    bounds,
                    x,
                    y,
                ),
            );
        }
    });

Let’s see if the parallelization worked! Here I’m using time to measure how long it takes to render the mobile phone SVG.

Before parallelization:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
6.95user 0.66system 0:07.62elapsed 99%CPU (0avgtext+0avgdata 270904maxresident)k
0inputs+2432outputs (0major+714373minor)pagefaults 0swaps

After parallelization:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
7.47user 0.63system 0:06.04elapsed 134%CPU (0avgtext+0avgdata 271328maxresident)k
0inputs+2432outputs (0major+714460minor)pagefaults 0swaps

Note that even though the user time went up, the elapsed time went down by 1.5 seconds, and the CPU utilization increased past 100%. Success!

Parallelizing Gaussian Blur

Next, I set out to parallelize Gaussian blur, the biggest time sink for the phone and arguably one of the most used SVG filters overall.

The SVG specification hints that for all reasonable values of the standard deviation parameter, the blur can be implemented as three box blurs (taking the average value of the pixels) instead of the much more costly Gaussian kernel convolution. Pretty much every SVG rendering agent implements it this way since it’s much faster, and librsvg is no exception. This is why you’ll see functions called box_blur and not gaussian_blur.

Both box blur and Gaussian blur are separable convolutions, which means it’s possible to implement them as two passes: one blurring each row of the image and another blurring each column, with each row or column processed independently of the others. For box blur specifically, this allows for a much more optimized convolution implementation.
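
To illustrate why the box blur is so much cheaper, here’s a minimal single-row sketch of my own (not librsvg’s actual code): thanks to a running sum, each output pixel costs the same amount of work regardless of the box size. A real implementation centers the window and handles the edges properly; this sketch trails the window to the right and treats pixels past the edge as transparent.

// One horizontal box blur pass over a single row of
// grayscale pixels. Applying the same pass over the columns
// completes the separable blur.
fn box_blur_row(input: &[u8], output: &mut [u8], size: usize) {
    assert_eq!(input.len(), output.len());
    assert!(size > 0 && size <= input.len());

    // Running sum of the current window.
    let mut sum: u32 =
        input[..size].iter().map(|&v| u32::from(v)).sum();

    for i in 0..input.len() {
        output[i] = (sum / size as u32) as u8;

        // Slide the window one pixel to the right: drop
        // input[i], add the pixel entering on the right
        // (if any).
        sum -= u32::from(input[i]);
        if let Some(&next) = input.get(i + size) {
            sum += u32::from(next);
        }
    }
}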

In librsvg, the box blur function contains an outer loop over the rows or the columns of the input image, depending on the vertical argument, and an inner loop over the columns or the rows, respectively. It uses i and j for the outer and inner loop indices and has some helper functions to convert those to the actual coordinates, depending on the direction.

// Helper functions for getting and setting the pixels.
let pixel = |i, j| {
    let (x, y) = if vertical { (i, j) } else { (j, i) };

    input_surface.get_pixel_or_transparent(bounds, x, y)
};

let mut set_pixel = |i, j, pixel| {
    let (x, y) = if vertical { (i, j) } else { (j, i) };

    output_data.set_pixel(output_stride, pixel, x, y);
};

// Main loop
for i in other_axis_min..other_axis_max {
    // Processing the first pixel
    // <...>

    // Inner loop
    for j in main_axis_min + 1..main_axis_max {
        // <...>
    }
}

Trying to convert this code to use chunks_mut() just like the lighting filters, we stumble on an issue: if the outer loop is iterating over columns, rather than rows, the output slices for all individual columns overlap (because the pixels are stored in row-major order). We need some abstraction, like a matrix slice, that can be split into non-overlapping mutable subslices by rows or by columns.

The first thing that comes to mind is to try using the Matrix type from the nalgebra crate, which does have that functionality. However, it turns out that nalgebra doesn’t currently support rayon or even have by-row or by-column iterators. I tried implementing my own iterators, but that required some very non-obvious unsafe code with odd trait bound restrictions which I really wasn’t sure were correct. Thus, I scrapped that code and made my own wrapper for ImageSurface which contains only the things needed for this particular use case.

To reiterate, we need a wrapper that:

  • provides write access to the image pixels,
  • can be split by row or by column into non-overlapping chunks,
  • is Send, i.e. can be safely sent between threads (for parallelizing).

Here’s what I came up with:

struct UnsafeSendPixelData<'a> {
    width: u32,
    height: u32,
    stride: isize,
    ptr: NonNull<u8>,
    _marker: PhantomData<&'a mut ()>,
}

unsafe impl<'a> Send for UnsafeSendPixelData<'a> {}

impl<'a> UnsafeSendPixelData<'a> {
    /// Creates a new `UnsafeSendPixelData`.
    ///
    /// # Safety
    /// You must call `cairo_surface_mark_dirty()` on the
    /// surface once all instances of `UnsafeSendPixelData`
    /// are dropped to make sure the pixel changes are
    /// committed to Cairo.
    #[inline]
    unsafe fn new(
        surface: &mut cairo::ImageSurface,
    ) -> Self {
        assert_eq!(
            surface.get_format(),
            cairo::Format::ARgb32
        );
        let ptr = surface.get_data().unwrap().as_mut_ptr();

        Self {
            width: surface.get_width() as u32,
            height: surface.get_height() as u32,
            stride: surface.get_stride() as isize,
            ptr: NonNull::new(ptr).unwrap(),
            _marker: PhantomData,
        }
    }

    /// Sets a pixel value at the given coordinates.
    #[inline]
    fn set_pixel(&mut self, pixel: Pixel, x: u32, y: u32) {
        assert!(x < self.width);
        assert!(y < self.height);

        let value = pixel.to_u32();

        unsafe {
            let ptr = self.ptr.as_ptr().offset(
                y as isize * self.stride + x as isize * 4,
            ) as *mut u32;
            *ptr = value;
        }
    }

    /// Splits this `UnsafeSendPixelData` into two at the
    /// given row.
    ///
    /// The first one contains rows `0..index` (index not
    /// included) and the second one contains rows
    /// `index..height`.
    #[inline]
    fn split_at_row(self, index: u32) -> (Self, Self) {
        assert!(index <= self.height);

        (
            UnsafeSendPixelData {
                width: self.width,
                height: index,
                stride: self.stride,
                ptr: self.ptr,
                _marker: PhantomData,
            },
            UnsafeSendPixelData {
                width: self.width,
                height: self.height - index,
                stride: self.stride,
                ptr: NonNull::new(unsafe {
                    self.ptr
                        .as_ptr()
                        .offset(index as isize * self.stride)
                }).unwrap(),
                _marker: PhantomData,
            },
        )
    }

    /// Splits this `UnsafeSendPixelData` into two at the
    /// given column.
    ///
    /// The first one contains columns `0..index` (index not
    /// included) and the second one contains columns
    /// `index..width`.
    #[inline]
    fn split_at_column(self, index: u32) -> (Self, Self) {
        assert!(index <= self.width);

        (
            UnsafeSendPixelData {
                width: index,
                height: self.height,
                stride: self.stride,
                ptr: self.ptr,
                _marker: PhantomData,
            },
            UnsafeSendPixelData {
                width: self.width - index,
                height: self.height,
                stride: self.stride,
                ptr: NonNull::new(unsafe {
                    self.ptr
                        .as_ptr()
                        .offset(index as isize * 4)
                }).unwrap(),
                _marker: PhantomData,
            },
        )
    }
}

The wrapper contains a pointer to the data rather than a mutable slice of the data, so the intermediate pixels (which cannot be accessed through set_pixel()) are not mutably aliased between different instances of UnsafeSendPixelData.

Now it would be possible to implement an iterator over the rows or the columns for this wrapper; however, I went with a different, simpler approach: using rayon’s scope functionality, which allows spawning worker threads directly into rayon’s thread pool.
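
In a nutshell, rayon::scope() hands its closure a Scope that can spawn tasks borrowing data from the enclosing stack frame, and the call returns only once every spawned task has finished. A minimal standalone sketch (not librsvg code):

fn main() {
    let mut a = vec![0u32; 4];
    let mut b = vec![0u32; 4];

    // The scope waits for both tasks before returning, which
    // is why they are allowed to mutably borrow the local
    // vectors.
    rayon::scope(|s| {
        s.spawn(|_| a.iter_mut().for_each(|x| *x += 1));
        s.spawn(|_| b.iter_mut().for_each(|x| *x += 2));
    });

    assert_eq!(a, vec![1u32; 4]);
    assert_eq!(b, vec![2u32; 4]);
}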

First, let’s change the existing code to operate on individual rows or columns, just like we did with the lighting filters:

// The following loop assumes the first row or column of
// `output_data` is the first row or column inside `bounds`.
let mut output_data = if vertical {
    output_data.split_at_column(bounds.x0 as u32).1
} else {
    output_data.split_at_row(bounds.y0 as u32).1
};

for i in other_axis_min..other_axis_max {
    // Split off one row or column and launch its processing
    // on another thread. Thanks to the initial split before
    // the loop, there's no special case for the very first
    // split.
    let (mut current, remaining) = if vertical {
        output_data.split_at_column(1)
    } else {
        output_data.split_at_row(1)
    };

    output_data = remaining;

    // Helper function for setting the pixels.
    let mut set_pixel = |j, pixel| {
        // We're processing rows or columns one-by-one, so
        // the other coordinate is always 0.
        let (x, y) = if vertical { (0, j) } else { (j, 0) };
        current.set_pixel(pixel, x, y);
    };

    // Processing the first pixel
    // <...>

    // Inner loop
    for j in main_axis_min + 1..main_axis_max {
        // <...>
    }
}

Here I could avoid the current and base_i arguments to the set_pixel closure because the closure can be declared within the loop, whereas in the lighting filter code the compute_output_pixel closure had to be reused and was therefore declared outside of the main loop.

Now it’s a simple change to split the work across rayon’s threads:

// The following loop assumes the first row or column of
// `output_data` is the first row or column inside `bounds`.
let mut output_data = if vertical {
    output_data.split_at_column(bounds.x0 as u32).1
} else {
    output_data.split_at_row(bounds.y0 as u32).1
};

// Establish a scope for the threads.
rayon::scope(|s| {
    for i in other_axis_min..other_axis_max {
        // Split off one row or column and launch its
        // processing on another thread. Thanks to the
        // initial split before the loop, there's no special
        // case for the very first split.
        let (mut current, remaining) = if vertical {
            output_data.split_at_column(1)
        } else {
            output_data.split_at_row(1)
        };

        output_data = remaining;

        // Spawn the thread for this row or column.
        s.spawn(move |_| {
            // Helper function for setting the pixels.
            let mut set_pixel = |j, pixel| {
                // We're processing rows or columns
                // one-by-one, so the other coordinate is
                // always 0.
                let (x, y) =
                    if vertical { (0, j) } else { (j, 0) };
                current.set_pixel(pixel, x, y);
            };

            // Processing the first pixel
            // <...>

            // Inner loop
            for j in main_axis_min + 1..main_axis_max {
                // <...>
            }
        });
    }
});

Let’s measure the performance. At the end of the previous section we had:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
7.47user 0.63system 0:06.04elapsed 134%CPU (0avgtext+0avgdata 271328maxresident)k
0inputs+2432outputs (0major+714460minor)pagefaults 0swaps

And now after the parallelization:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
10.32user 1.10system 0:04.57elapsed 250%CPU (0avgtext+0avgdata 272588maxresident)k
0inputs+2432outputs (0major+1009498minor)pagefaults 0swaps

We cut another 1.5 seconds and further increased the CPU utilization!

Conclusion

Rayon is an excellent crate which provides a multitude of ways to safely parallelize Rust code. It builds on idiomatic concepts such as iterators, but also contains a number of other convenient ways to parallelize code when standard iterators aren’t sufficient or wouldn’t be very convenient. It uses Rust’s type system to statically guarantee the absence of data races and manages the low-level details on its own, allowing the programmer to focus on the actual computation.

The parallelized filters are now included in the latest development release of librsvg, to test whether using multiple threads leads to any issues downstream.

GSoC 2018: Safe Shared Access to Cairo Image Surfaces

Introduction

I’m working on librsvg, a GNOME SVG rendering library, to port the SVG filter effects and related infrastructure from C to Rust. Librsvg uses Cairo, a 2D graphics library, for most of its drawing operations. Cairo can draw to a number of different surfaces like XCB and Xlib windows and pixmaps, PDF documents and PostScript files.

Image Surfaces

Filter effects operate on rasterized bitmaps, so most of them need direct access to pixel data. There’s a special Cairo surface type for that: an image surface, backed by a memory buffer. Filters receive their input images in image surfaces, perform the necessary pixel operations and return new image surfaces with the results.

In the Rust bindings, image surfaces are represented by the ImageSurface struct. It has a get_data() method which returns an ImageSurfaceData, which in turn acts as a slice into the underlying memory.
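
Here’s a minimal sketch of that API in use (based on the cairo-rs bindings as they were at the time):

use cairo::{Format, ImageSurface};

fn fill_first_byte() -> Result<(), cairo::Status> {
    let mut surface =
        ImageSurface::create(Format::ARgb32, 16, 16)?;

    {
        // get_data() mutably borrows `surface` for as long
        // as the returned ImageSurfaceData is alive.
        let mut data = surface.get_data().unwrap();
        // ImageSurfaceData dereferences to a mutable byte
        // slice over the raw pixel memory.
        data[0] = 0xFF;
    }

    // The borrow has ended, so the surface can be drawn
    // with again.
    Ok(())
}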

Since ImageSurfaceData gives mutable access to the surface memory, it must ensure there are no other views into the same memory to comply with Rust’s ownership rules. This is achieved by checking that the ImageSurface reference count is equal to 1 in get_data(), which means that no other references to the ImageSurface exist. Furthermore, the ImageSurface is borrowed mutably for the lifetime of ImageSurfaceData. This is needed to prevent cloning the surface after obtaining ImageSurfaceData and subsequently drawing on it with a Cairo context while the view into the surface memory still exists.

While this scheme does work, it offers only mutable, unique access to the pixel data. In the filter code, it’s much more convenient to have multiple references to the input surfaces with read-only memory access. Simply adding a struct similar to ImageSurfaceData which provides only a read-only view into the memory does not allow dropping the unique reference constraint, because it’s always possible to use the other reference to concurrently mutate the pixel data.

Shared Image Surface

To work around the constraints, I ended up creating a special wrapper for ImageSurfaces:

#[derive(Debug, Clone)]
struct SharedImageSurface {
    surface: ImageSurface,

    data_ptr: NonNull<u8>, // *const.
    width: i32,
    height: i32,
    stride: isize,
}

The idea is to wrap a unique ImageSurface and provide just read-only access to it, while allowing cloning of SharedImageSurface itself. This way there can be multiple SharedImageSurfaces without compromising soundness: without direct access to the underlying ImageSurface, it’s impossible to mutate it in any way from the outside.

Additionally, since we know the surface won’t be modified, we can cache some common properties to get rid of extra C calls which can’t be easily optimized away.

The SharedImageSurface constructor ensures the ImageSurface it receives is unique:

impl SharedImageSurface {
    pub fn new(
        surface: ImageSurface,
    ) -> Result<Self, cairo::Status> {
        // get_pixel() assumes ARgb32.
        assert_eq!(
            surface.get_format(),
            cairo::Format::ARgb32
        );

        // Ensure the access is unique.
        let reference_count = unsafe {
            cairo_sys::cairo_surface_get_reference_count(
                surface.to_raw_none(),
            )
        };
        assert_eq!(reference_count, 1);

        // Flush any pending drawing operations before
        // accessing the memory directly.
        surface.flush();
        if surface.status() != cairo::Status::Success {
            return Err(surface.status());
        }

        let data_ptr = NonNull::new(unsafe {
            cairo_sys::cairo_image_surface_get_data(
                surface.to_raw_none(),
            )
        }).unwrap();

        let width = surface.get_width();
        let height = surface.get_height();
        let stride = surface.get_stride() as isize;

        Ok(Self {
            surface,
            data_ptr,
            width,
            height,
            stride,
        })
    }
}

And other methods on SharedImageSurface provide access to various surface properties, as well as the pixel data:

impl SharedImageSurface {
    /// Returns the surface width.
    #[inline]
    pub fn width(&self) -> i32 {
        self.width
    }

    /// Returns the surface height.
    #[inline]
    pub fn height(&self) -> i32 {
        self.height
    }

    /// Retrieves the pixel value at the given coordinates.
    #[inline]
    pub fn get_pixel(&self, x: u32, y: u32) -> Pixel {
        assert!(x < self.width as u32);
        assert!(y < self.height as u32);

        let offset =
            y as isize * self.stride + x as isize * 4;

        // According to Cairo documentation, the pixel values
        // for the ARgb32 format should be read using a
        // platform-native u32. Offsetting the raw pointer is
        // itself an unsafe operation, so it goes inside the
        // unsafe block as well.
        let value = unsafe {
            let ptr =
                self.data_ptr.as_ptr().offset(offset);
            *(ptr as *const u32)
        };

        Pixel {
            r: ((value >> 16) & 0xFF) as u8,
            g: ((value >> 8) & 0xFF) as u8,
            b: (value & 0xFF) as u8,
            a: ((value >> 24) & 0xFF) as u8,
        }
    }
}
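
As a small usage sketch (hypothetical code, not from librsvg), the cached getters and get_pixel() compose naturally for read-only processing:

/// Computes the average alpha value of a surface.
fn average_alpha(surface: &SharedImageSurface) -> f64 {
    let mut total: u64 = 0;

    for y in 0..surface.height() as u32 {
        for x in 0..surface.width() as u32 {
            total += u64::from(surface.get_pixel(x, y).a);
        }
    }

    let pixel_count =
        surface.width() as f64 * surface.height() as f64;
    total as f64 / pixel_count
}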

Another nice property of wrapping ImageSurfaces is that it’s possible to provide extra utility methods for working with pixel data. Right now we have methods for extracting the alpha channel, resizing the surface, converting between linear sRGB and sRGB, and performing convolutions useful for primitives like Gaussian blur. The pixel iterators I showcased in the previous post were switched to take SharedImageSurfaces as input.

One downside is that any other Cairo APIs that take surfaces as read-only input need to be wrapped too. Fortunately, so far we have needed only the Context::set_source_surface() method, which allows using the given surface as the pixel source for drawing operations:

impl SharedImageSurface {
    /// Calls `set_source_surface()` on the given Cairo
    /// context.
    #[inline]
    pub fn set_as_source_surface(
        &self,
        cr: &cairo::Context,
        x: f64,
        y: f64,
    ) {
        cr.set_source_surface(&self.surface, x, y);
    }
}

Converting a SharedImageSurface back into a regular ImageSurface to return it from the filter code does a reference count check too: if the count isn’t equal to 1, there are other SharedImageSurfaces pointing at the same ImageSurface, which means we cannot simply return the surface, since it could then be used to modify the pixel data, breaking the SharedImageSurface invariants. In that case a copy of the surface is created and returned:

impl SharedImageSurface {
    /// Converts this `SharedImageSurface` back into a Cairo
    /// image surface.
    #[inline]
    pub fn into_image_surface(
        self,
    ) -> Result<ImageSurface, cairo::Status> {
        let reference_count = unsafe {
            cairo_sys::cairo_surface_get_reference_count(
                self.surface.to_raw_none(),
            )
        };

        if reference_count == 1 {
            Ok(self.surface)
        } else {
            // If there are any other references, copy the
            // underlying surface.
            let bounds = IRect {
                x0: 0,
                y0: 0,
                x1: self.width,
                y1: self.height,
            };

            self.copy_surface(bounds)
        }
    }
}

Alpha-Only Optimizations

The heaviest filter primitives are the ones doing image convolutions, mainly the Gaussian blur primitive. A convolution involves computing, for every input pixel, a weighted sum of the pixels within a surrounding rectangle.
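
Roughly speaking, for a kernel $k$ of size $w \times h$ with target point $(t_x, t_y)$, each output pixel is a weighted sum of the form (the exact indexing and kernel orientation follow the SVG spec; edge handling and premultiplication are ignored here):

$$\mathrm{output}(x, y) = \sum_{j=0}^{h-1} \sum_{i=0}^{w-1} k(i, j) \cdot \mathrm{input}(x - t_x + i,\; y - t_y + j)$$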

The Gaussian blur primitive is frequently used for creating shadows for other elements. In this case it takes the rasterized element’s alpha channel as an input and blurs it. Then the blurred surface is offset a little and drawn below the input surface to create a shadow.

A yellow circle with a shadow.

When a convolution is applied to an alpha-only image, there’s no need to compute the weighted sum for the other three color channels. However, going through the whole input image to check if it only contains meaningful data in the alpha channel is rather costly. Thankfully, we can avoid that step altogether by caching this property.

I added a field into SharedImageSurface to indicate whether the current surface is alpha-only, along with a special constructor:

struct SharedImageSurface {
    /* ... */

    /// Whether this surface contains meaningful data only
    /// in the alpha channel.
    ///
    /// This is used for optimizations, particularly in
    /// `convolve()` to skip processing other channels.
    alpha_only: bool,
}

impl SharedImageSurface {
    pub fn new(
        surface: ImageSurface,
    ) -> Result<Self, cairo::Status> {
        /* ... */

        Ok(Self {
            surface,
            data_ptr,
            width,
            height,
            stride,
            // Default to not alpha only.
            alpha_only: false,
        })
    }

    /// Creates a `SharedImageSurface` from a unique
    /// `ImageSurface` with meaningful data only in the alpha
    /// channel.
    #[inline]
    pub fn new_alpha_only(
        surface: ImageSurface,
    ) -> Result<Self, cairo::Status> {
        let mut rv = Self::new(surface)?;
        rv.alpha_only = true;
        Ok(rv)
    }
}

This constructor is used automatically in the extract_alpha() method used to separate the alpha channel of the input surface:

impl SharedImageSurface {
    /// Returns a surface with black background and alpha
    /// channel matching this surface.
    pub fn extract_alpha(
        &self,
        bounds: IRect,
    ) -> Result<SharedImageSurface, cairo::Status> {
        let mut output_surface = ImageSurface::create(
            cairo::Format::ARgb32,
            self.width,
            self.height,
        )?;

        let output_stride =
            output_surface.get_stride() as usize;
        {
            let mut output_data =
                output_surface.get_data().unwrap();

            for (x, y, Pixel { a, .. }) in
                Pixels::new(self, bounds)
            {
                let output_pixel = Pixel {
                    r: 0,
                    g: 0,
                    b: 0,
                    a,
                };

                output_data.set_pixel(
                    output_stride,
                    output_pixel,
                    x,
                    y,
                );
            }
        }

        // The returned surface is alpha-only!
        SharedImageSurface::new_alpha_only(output_surface)
    }
}

Finally, the pixel processing methods check the alpha-only flag to reduce the number of operations or skip unneeded processing altogether:

impl SharedImageSurface {
    /// Returns a surface with pre-multiplication of color
    /// values undone.
    pub fn unpremultiply(
        &self,
        bounds: IRect,
    ) -> Result<SharedImageSurface, cairo::Status> {
        // Unpremultiplication doesn't affect the alpha
        // channel.
        if self.alpha_only {
            return Ok(self.clone());
        }

        let mut output_surface = ImageSurface::create(
            cairo::Format::ARgb32,
            self.width,
            self.height,
        )?;

        let stride = output_surface.get_stride() as usize;
        {
            let mut data =
                output_surface.get_data().unwrap();

            for (x, y, pixel) in Pixels::new(self, bounds) {
                data.set_pixel(
                    stride,
                    pixel.unpremultiply(),
                    x,
                    y,
                );
            }
        }

        SharedImageSurface::new(output_surface)
    }

    /// Performs a convolution.
    pub fn convolve(
        &self,
        bounds: IRect,
        target: (i32, i32),
        kernel: &Matrix<f64>,
        edge_mode: EdgeMode,
    ) -> Result<SharedImageSurface, cairo::Status> {
        assert!(kernel.rows() >= 1);
        assert!(kernel.cols() >= 1);

        let mut output_surface = ImageSurface::create(
            cairo::Format::ARgb32,
            self.width,
            self.height,
        )?;

        let output_stride =
            output_surface.get_stride() as usize;
        {
            let mut output_data =
                output_surface.get_data().unwrap();

            if self.alpha_only {
                // Perform a convolution, taking a weighted
                // sum of only the alpha channel.
            } else {
                // Perform a convolution, taking a weighted
                // sum of all four color channels.
            }
        }

        if self.alpha_only {
            SharedImageSurface::new_alpha_only(
                output_surface,
            )
        } else {
            SharedImageSurface::new(output_surface)
        }
    }
}

Validating Performance

The main motivating example for the optimizations is this old mobile phone SVG file from a 9-year-old performance issue, which takes about 40 seconds to render on my desktop PC:

An old mobile phone.

To measure the performance impact, I built librsvg in release mode with debug info before doing alpha-only optimizations and used the perf tool:

perf record --call-graph=dwarf ./rsvg-convert -o /dev/null mobile_phone_01.svg

I then used the awesome FlameGraph tool to create a nice visualization of the data (click and open in a browser for an interactive SVG).
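
The usual pipeline is roughly the following (assuming the FlameGraph scripts are in the current directory):

perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg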

A flame graph of the rendering performance.

The large two-part column in the middle happens to be the Gaussian blur, taking up 50.63% of the total rendering time for the mobile phone. Turns out the blur operates on SourceAlpha, which contains just the alpha channel of the rasterized element the filter is applied to.

After adding the alpha-only optimizations to convolution and other parts of the code like linear sRGB ⇔ sRGB conversion, the rendering time dropped from 40 seconds to 29 seconds, and the performance graph now looks like this:

A flame graph of the rendering performance.

The percentage of time taken by Gaussian blur dropped from 50.63% to 36.98%. You can also see a slight drop in the narrow column to the left of the Gaussian blur part: that’s the sRGB linearization, which became a no-op on the alpha-only input images.

Conclusion

The project is coming along very well. I’ve just come back from GUADEC, where I really pushed the filter rustification during the hacking days, porting all of the remaining filters from C to Rust. Now that all filters are in Rust and thoroughly tested, I’m looking at improving the performance, starting with the alpha-only optimization described in this post.

Also, since all SVG nodes are now in Rust (the filter primitives were the last ones), I was able to clean up the remaining C interop code for nodes! It’s great to see the intermediate FFI code gradually disappear as entire subsystems get completely ported to Rust.