Featured

GSoC 2018: Parallelizing Filters with Rayon

Introduction

I’m working on SVG filter effects in librsvg, a GNOME library for rendering SVG files to Cairo surfaces. After finishing the port of all filters from C to Rust and adding tests, I started investigating filter performance. With the codebase converted to Rust, I can confidently apply important optimizations such as parallelization. In this post I’ll show how I parallelized two computation-intensive filter primitives.

Rayon

Rayon is a Rust crate for introducing parallelism into existing code. It utilizes Rust’s type system to guarantee memory safety and data-race-free execution. Rayon mimics standard Rust iterators, so it’s possible to convert existing iterator-based code with next to no changes. Compare the following single-threaded and parallel code:

// Single-threaded.
fn sum_of_squares(input: &[i32]) -> i32 {
    input.iter()
         .map(|i| i * i)
         .sum()
}

// Parallelized with rayon.
use rayon::prelude::*;
fn sum_of_squares(input: &[i32]) -> i32 {
    input.par_iter()
         .map(|i| i * i)
         .sum()
}

By merely using .par_iter() instead of .iter(), the computation is parallelized across a thread pool.

Parallelizing Lighting Filters

Continuing with analyzing and improving the performance of the infamous mobile phone test case, the two biggest time sinks were the lighting and the Gaussian blur filter primitives. This is easy to see on the callgrind graph from KCachegrind:

KCachegrind graph for filter rendering

Since I was working on optimizing the lighting filters, I decided to try out parallelization there first.

The lighting filter primitives in SVG (feDiffuseLighting and feSpecularLighting) can be used to cast light onto the canvas using a render of an existing SVG node as a bump map. The computation is quite involved, but it boils down to performing a number of arithmetic operations for each pixel of the input surface independently of the others—a perfect target for parallelization.

This is what the code initially looked like:

let mut output_surface = ImageSurface::create(
    cairo::Format::ARgb32,
    input_surface.width(),
    input_surface.height(),
)?;

let output_stride = output_surface.get_stride() as usize;
let mut output_data = output_surface.get_data().unwrap();

let mut compute_output_pixel = |x, y, normal: Normal| {
    let output_pixel = /* expensive computations */;

    output_data.set_pixel(output_stride, output_pixel, x, y);
};

// Compute the edge pixels
// <...>

// Compute the interior pixels
for y in bounds.y0 as u32 + 1..bounds.y1 as u32 - 1 {
    for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
        compute_output_pixel(
            x,
            y,
            interior_normal(&input_surface, bounds, x, y),
        );
    }
}

The edge pixel computation is separated out for optimization reasons and isn’t important here. We want to focus on the main loop over the interior pixels, since it takes up the most time.

What we’d like to do is take the outer loop over the image rows and run it in parallel on a thread pool. Since each row (indeed, each pixel) is computed independently of the others, we should be able to do this without much hassle. However, we cannot do it right away: the compute_output_pixel closure mutably borrows output_data, so sharing it across multiple threads would give them concurrent mutable access to all image pixels. That could result in a data race, so this code won’t pass the Rust compiler’s borrow checker.

Instead, we can split the output_data slice into row-sized non-overlapping chunks and feed each thread only those chunks that it needs to process. This way no thread can access the data of another thread.
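The non-overlap guarantee can be demonstrated standalone with the standard library’s chunks_mut(): each chunk is a disjoint mutable subslice, so handing different chunks to different threads cannot alias the same pixels. (A toy sketch, not librsvg code; the 4×3 “image” and the fill_rows function are made up for illustration.)

```rust
// Fill each row of a row-major buffer with its row index, using
// chunks_mut() to get disjoint mutable row slices (a toy sketch,
// not librsvg code).
fn fill_rows(data: &mut [u8], stride: usize) {
    for (row_index, row) in data.chunks_mut(stride).enumerate() {
        // Each `row` is a non-overlapping &mut [u8]; rayon's
        // par_chunks_mut() would hand these out to worker threads.
        for byte in row.iter_mut() {
            *byte = row_index as u8;
        }
    }
}

fn main() {
    // A toy 4x3 "image": 4 rows with a stride (row length) of 3.
    let mut data = [0u8; 12];
    fill_rows(&mut data, 3);
    println!("{:?}", data); // [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
}
```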

Let’s change the closure to accept the target slice (as opposed to borrowing it from the enclosing scope). Since the slices will start from the beginning of each row rather than the beginning of the whole image, we’ll also add an additional base_y argument to correct the offsets.

let compute_output_pixel =
    |mut output_slice: &mut [u8],
     base_y,
     x,
     y,
     normal: Normal| {
        let output_pixel = /* expensive computations */;

        output_slice.set_pixel(
            output_stride,
            output_pixel,
            x,
            y - base_y,
        );
    };

// Compute the interior pixels
for y in bounds.y0 as u32 + 1..bounds.y1 as u32 - 1 {
    for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
        compute_output_pixel(
            &mut output_data,
            0,
            x,
            y,
            interior_normal(&input_surface, bounds, x, y),
        );
    }
}

Now we can convert the outer loop to operate through iterators using the chunks_mut() method of a slice which does exactly what we want: returns the slice in evenly sized non-overlapping mutable chunks.

let compute_output_pixel =
    |mut output_slice: &mut [u8],
     base_y,
     x,
     y,
     normal: Normal| {
        let output_pixel = /* expensive computations */;

        output_slice.set_pixel(
            output_stride,
            output_pixel,
            x,
            y - base_y,
        );
    };

// Compute the interior pixels
let first_row = bounds.y0 as u32 + 1;
let one_past_last_row = bounds.y1 as u32 - 1;
let first_pixel = (first_row as usize) * output_stride;
let one_past_last_pixel =
    (one_past_last_row as usize) * output_stride;

output_data[first_pixel..one_past_last_pixel]
    .chunks_mut(output_stride)
    .zip(first_row..one_past_last_row)
    .for_each(|(slice, y)| {
        for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
            compute_output_pixel(
                slice,
                y,
                x,
                y,
                interior_normal(
                    &input_surface,
                    bounds,
                    x,
                    y,
                ),
            );
        }
    });

And finally, parallelize by simply changing chunks_mut() to par_chunks_mut():

use rayon::prelude::*;

output_data[first_pixel..one_past_last_pixel]
    .par_chunks_mut(output_stride)
    .zip(first_row..one_past_last_row)
    .for_each(|(slice, y)| {
        for x in bounds.x0 as u32 + 1..bounds.x1 as u32 - 1 {
            compute_output_pixel(
                slice,
                y,
                x,
                y,
                interior_normal(
                    &input_surface,
                    bounds,
                    x,
                    y,
                ),
            );
        }
    });

Let’s see if the parallelization worked! Here I’m using time to measure how long it takes to render the mobile phone SVG.

Before parallelization:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
6.95user 0.66system 0:07.62elapsed 99%CPU (0avgtext+0avgdata 270904maxresident)k
0inputs+2432outputs (0major+714373minor)pagefaults 0swaps

After parallelization:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
7.47user 0.63system 0:06.04elapsed 134%CPU (0avgtext+0avgdata 271328maxresident)k
0inputs+2432outputs (0major+714460minor)pagefaults 0swaps

Note that even though the user time went up, the elapsed time went down by 1.5 seconds, and the CPU utilization increased past 100%. Success!

Parallelizing Gaussian Blur

Next, I set out to parallelize Gaussian blur, the biggest time sink for the phone and arguably one of the most used SVG filters altogether.

The SVG specification hints that for all reasonable values of the standard deviation parameter, the blur can be implemented as three box blurs (taking the average value of the pixels) instead of the much more costly Gaussian kernel convolution. Pretty much every SVG rendering agent implements it this way since it’s much faster, and librsvg is no exception. This is why you’ll see functions called box_blur and not gaussian_blur.
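The spec’s recipe can be sketched as follows: for standard deviation s, the box size is d = floor(s · 3 · √(2π) / 4 + 0.5), and three box blurs of that size (with a small adjustment when d is even) approximate the Gaussian. A minimal sketch; box_blur_size is an illustrative name, not librsvg’s actual API:

```rust
// Sketch of the box-size computation hinted at by the SVG 1.1
// spec for feGaussianBlur: for standard deviation `s`, a box
// blur of size d = floor(s * 3 * sqrt(2 * pi) / 4 + 0.5),
// applied three times, approximates the Gaussian.
fn box_blur_size(std_deviation: f64) -> usize {
    let d = std_deviation * 3.0 * (2.0 * std::f64::consts::PI).sqrt() / 4.0 + 0.5;
    d.floor() as usize
}

fn main() {
    // For stddev = 2.0 the formula gives a box size of 4; since
    // 4 is even, the spec uses two blurs of size 4 and one of
    // size 5 to keep the result centered.
    println!("{}", box_blur_size(2.0)); // 4
}
```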

Both box blur and Gaussian blur are separable convolutions, which means they can be implemented as two passes: one loop blurs each row of the image, and another blurs each column, each row or column independently of the others. For box blur specifically, this allows for a much more optimized convolution implementation.
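The separability idea can be illustrated with a toy example (not librsvg’s implementation): a 3-tap box blur run over every row, then over every column via a transpose. For interior pixels this matches the full 3×3 box convolution while costing two cheap 1-D passes instead of one 2-D pass.

```rust
// Toy 3-tap box blur along each row, edges clamped
// (not librsvg's implementation).
fn blur_rows(img: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let w = img[0].len();
    img.iter()
        .map(|row| {
            (0..w)
                .map(|x| {
                    let left = row[x.saturating_sub(1)];
                    let right = row[(x + 1).min(w - 1)];
                    (left + row[x] + right) / 3.0
                })
                .collect()
        })
        .collect()
}

fn transpose(img: &[Vec<f64>]) -> Vec<Vec<f64>> {
    (0..img[0].len())
        .map(|x| img.iter().map(|row| row[x]).collect())
        .collect()
}

fn main() {
    // A single bright pixel in the middle of a 3x3 image.
    let img = vec![
        vec![0.0, 0.0, 0.0],
        vec![0.0, 9.0, 0.0],
        vec![0.0, 0.0, 0.0],
    ];

    // Horizontal pass, then (via transposes) the vertical pass.
    let blurred = transpose(&blur_rows(&transpose(&blur_rows(&img))));
    println!("{}", blurred[1][1]); // 1 (the energy spreads evenly)
}
```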

In librsvg, the box blur function contains an outer loop over the rows or the columns of the input image, depending on the vertical argument and an inner loop over the columns or the rows, respectively. It uses i and j for the outer and inner loop indices and has some helper functions to convert those to the actual coordinates, depending on the direction.

// Helper functions for getting and setting the pixels.
let pixel = |i, j| {
    let (x, y) = if vertical { (i, j) } else { (j, i) };

    input_surface.get_pixel_or_transparent(bounds, x, y)
};

let mut set_pixel = |i, j, pixel| {
    let (x, y) = if vertical { (i, j) } else { (j, i) };

    output_data.set_pixel(output_stride, pixel, x, y);
};

// Main loop
for i in other_axis_min..other_axis_max {
    // Processing the first pixel
    // <...>

    // Inner loop
    for j in main_axis_min + 1..main_axis_max {
        // <...>
    }
}

Trying to convert this code to use chunks_mut() just like the lighting filters, we stumble on an issue: if the outer loop is iterating over columns, rather than rows, the output slices for all individual columns overlap (because the pixels are stored in row-major order). We need some abstraction, like a matrix slice, that can be split into non-overlapping mutable subslices by rows or by columns.

The first thing that comes to mind is to try the Matrix type from the nalgebra crate, which does have that functionality. However, it turns out that nalgebra doesn’t currently support rayon or even offer by-row or by-column iterators. I tried implementing my own iterators, but that required some very non-obvious unsafe code with odd trait bound restrictions that I really wasn’t sure was correct. Thus, I scrapped that code and made my own wrapper for the ImageSurface which contains only what’s needed for this particular use case.

To reiterate, we need a wrapper that:

  • provides write access to the image pixels,
  • can be split by row or by column into non-overlapping chunks,
  • is Send, i.e. can be safely sent between threads (for parallelizing).

Here’s what I came up with:

struct UnsafeSendPixelData<'a> {
    width: u32,
    height: u32,
    stride: isize,
    ptr: NonNull<u8>,
    _marker: PhantomData<&'a mut ()>,
}

unsafe impl<'a> Send for UnsafeSendPixelData<'a> {}

impl<'a> UnsafeSendPixelData<'a> {
    /// Creates a new `UnsafeSendPixelData`.
    ///
    /// # Safety
    /// You must call `cairo_surface_mark_dirty()` on the
    /// surface once all instances of `UnsafeSendPixelData`
    /// are dropped to make sure the pixel changes are
    /// committed to Cairo.
    #[inline]
    unsafe fn new(
        surface: &mut cairo::ImageSurface,
    ) -> Self {
        assert_eq!(
            surface.get_format(),
            cairo::Format::ARgb32
        );
        let ptr = surface.get_data().unwrap().as_mut_ptr();

        Self {
            width: surface.get_width() as u32,
            height: surface.get_height() as u32,
            stride: surface.get_stride() as isize,
            ptr: NonNull::new(ptr).unwrap(),
            _marker: PhantomData,
        }
    }

    /// Sets a pixel value at the given coordinates.
    #[inline]
    fn set_pixel(&mut self, pixel: Pixel, x: u32, y: u32) {
        assert!(x < self.width);
        assert!(y < self.height);

        let value = pixel.to_u32();

        unsafe {
            let ptr = self.ptr.as_ptr().offset(
                y as isize * self.stride + x as isize * 4,
            ) as *mut u32;
            *ptr = value;
        }
    }

    /// Splits this `UnsafeSendPixelData` into two at the
    /// given row.
    ///
    /// The first one contains rows `0..index` (index not
    /// included) and the second one contains rows
    /// `index..height`.
    #[inline]
    fn split_at_row(self, index: u32) -> (Self, Self) {
        assert!(index <= self.height);

        (
            UnsafeSendPixelData {
                width: self.width,
                height: index,
                stride: self.stride,
                ptr: self.ptr,
                _marker: PhantomData,
            },
            UnsafeSendPixelData {
                width: self.width,
                height: self.height - index,
                stride: self.stride,
                ptr: NonNull::new(unsafe {
                    self.ptr
                        .as_ptr()
                        .offset(index as isize * self.stride)
                }).unwrap(),
                _marker: PhantomData,
            },
        )
    }

    /// Splits this `UnsafeSendPixelData` into two at the
    /// given column.
    ///
    /// The first one contains columns `0..index` (index not
    /// included) and the second one contains columns
    /// `index..width`.
    #[inline]
    fn split_at_column(self, index: u32) -> (Self, Self) {
        assert!(index <= self.width);

        (
            UnsafeSendPixelData {
                width: index,
                height: self.height,
                stride: self.stride,
                ptr: self.ptr,
                _marker: PhantomData,
            },
            UnsafeSendPixelData {
                width: self.width - index,
                height: self.height,
                stride: self.stride,
                ptr: NonNull::new(unsafe {
                    self.ptr
                        .as_ptr()
                        .offset(index as isize * 4)
                }).unwrap(),
                _marker: PhantomData,
            },
        )
    }
}

The wrapper contains a pointer to the data rather than a mutable slice of the data, so the intermediate pixels (which cannot be accessed through set_pixel()) are not mutably aliased between different instances of UnsafeSendPixelData.

Now it’s possible to implement an iterator over the rows or the columns for this wrapper; however, I went with a different, simpler approach: using rayon’s scope functionality, which allows spawning worker threads directly into rayon’s thread pool.

First, let’s change the existing code to operate on individual rows or columns, just like we did with the lighting filters:

// The following loop assumes the first row or column of
// `output_data` is the first row or column inside `bounds`.
let mut output_data = if vertical {
    output_data.split_at_column(bounds.x0 as u32).1
} else {
    output_data.split_at_row(bounds.y0 as u32).1
};

for i in other_axis_min..other_axis_max {
    // Split off one row or column and launch its processing
    // on another thread. Thanks to the initial split before
    // the loop, there's no special case for the very first
    // split.
    let (mut current, remaining) = if vertical {
        output_data.split_at_column(1)
    } else {
        output_data.split_at_row(1)
    };

    output_data = remaining;

    // Helper function for setting the pixels.
    let mut set_pixel = |j, pixel| {
        // We're processing rows or columns one-by-one, so
        // the other coordinate is always 0.
        let (x, y) = if vertical { (0, j) } else { (j, 0) };
        current.set_pixel(pixel, x, y);
    };

    // Processing the first pixel
    // <...>

    // Inner loop
    for j in main_axis_min + 1..main_axis_max {
        // <...>
    }
}

I could avoid passing the slice and base_i arguments to the set_pixel closure because the closure can be declared inside the loop, where current is in scope; in the lighting filter code, by contrast, the compute_output_pixel closure was also used outside the main loop and so had to be declared there.

Now it’s a simple change to split the work across rayon’s threads:

// The following loop assumes the first row or column of
// `output_data` is the first row or column inside `bounds`.
let mut output_data = if vertical {
    output_data.split_at_column(bounds.x0 as u32).1
} else {
    output_data.split_at_row(bounds.y0 as u32).1
};

// Establish a scope for the threads.
rayon::scope(|s| {
    for i in other_axis_min..other_axis_max {
        // Split off one row or column and launch its
        // processing on another thread. Thanks to the
        // initial split before the loop, there's no special
        // case for the very first split.
        let (mut current, remaining) = if vertical {
            output_data.split_at_column(1)
        } else {
            output_data.split_at_row(1)
        };

        output_data = remaining;

        // Spawn the thread for this row or column.
        s.spawn(move |_| {
            // Helper function for setting the pixels.
            let mut set_pixel = |j, pixel| {
                // We're processing rows or columns
                // one-by-one, so the other coordinate is
                // always 0.
                let (x, y) =
                    if vertical { (0, j) } else { (j, 0) };
                current.set_pixel(pixel, x, y);
            };

            // Processing the first pixel
            // <...>

            // Inner loop
            for j in main_axis_min + 1..main_axis_max {
                // <...>
            }
        });
    }
});

Let’s measure the performance. At the end of the previous section we had:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
7.47user 0.63system 0:06.04elapsed 134%CPU (0avgtext+0avgdata 271328maxresident)k
0inputs+2432outputs (0major+714460minor)pagefaults 0swaps

And now after the parallelization:

└─ time ./rsvg-convert -o temp.png mobile_phone_01.svg
10.32user 1.10system 0:04.57elapsed 250%CPU (0avgtext+0avgdata 272588maxresident)k
0inputs+2432outputs (0major+1009498minor)pagefaults 0swaps

We cut another 1.5 seconds and further increased the CPU utilization!

Conclusion

Rayon is an excellent crate which provides a multitude of ways to safely parallelize Rust code. It builds on idiomatic concepts such as iterators, but also offers a number of other convenient tools for when standard iterators aren’t sufficient or wouldn’t be very convenient. It uses Rust’s type system to statically guarantee the absence of data races, and it manages the low-level details on its own, allowing the programmer to focus on the actual computation.

The parallelized filters are now included in the latest development release of librsvg, to test whether using multiple threads leads to any issues downstream.

GSoC 2018: Overview

Introduction

Throughout the summer I was working on librsvg, a GNOME library for rendering SVG files to Cairo surfaces. This post is an overview of the work I did with relevant links.

My Results

For the project I was to port the SVG filter infrastructure of librsvg from C to Rust, adding all missing filter tests from the SVG test suite along the way. I was also expected to implement abstractions to make the filter implementation more convenient, including Rust iterators over the surface pixels.

Here’s a convenient link to see all merge requests accepted into librsvg as part of my GSoC project in GitLab: https://gitlab.gnome.org/GNOME/librsvg/merge_requests?scope=all&utf8=%E2%9C%93&state=all&author_username=YaLTeR&label_name[]=GSoC%202018

All of this code was accepted into the mainline and will appear in the next stable release of librsvg.

I also wrote several blog posts detailing some interesting things I worked on as part of the GSoC project.

Further Work

There are a couple of fixes which still need to be done for filters to be feature-complete:

  • Fixing filters operating on off-screen nodes. Currently all intermediate surfaces are limited to the original SVG view area so anything off-screen is inaccessible to filters even when it should be. This is blocked on some considerable refactoring in the main librsvg node drawing code which is currently underway.
  • Implementing the filterRes property. This property allows setting the pixel resolution for filter operations and is one of the ways of achieving more resolution-independent rendering results. While it could be implemented with the current code as is, it will be much more convenient to account for it while refactoring the code to fix the previous issue.
  • Implementing the enable-background property. The BackgroundImage filter input should adhere to this property when picking which nodes to include in the background image, whereas it currently doesn’t.

GSoC 2018: Safe Shared Access to Cairo Image Surfaces

Introduction

I’m working on librsvg, a GNOME SVG rendering library, to port the SVG filter effects and related infrastructure from C to Rust. Librsvg uses Cairo, a 2D graphics library, for most of its drawing operations. Cairo can draw to a number of different surfaces like XCB and Xlib windows and pixmaps, PDF documents and PostScript files.

Image Surfaces

Filter effects operate on rasterized bitmaps, so most of them need direct access to pixel data. There’s a special Cairo surface type for that: an image surface, backed by a memory buffer. Filters receive their input images in image surfaces, perform the necessary pixel operations and return new image surfaces with the results.

In the Rust bindings, image surfaces are represented by the ImageSurface struct. It has a get_data() method which returns an ImageSurfaceData, which in turn acts as a slice into the underlying memory.

Since ImageSurfaceData gives mutable access to the surface memory, it must ensure there are no other views into the same memory to comply with Rust’s ownership rules. This is achieved by checking that the ImageSurface reference count is equal to 1 in get_data(), which means that no other references to the ImageSurface exist. Furthermore, the ImageSurface is borrowed mutably for the lifetime of ImageSurfaceData. This is needed to prevent cloning the surface after obtaining ImageSurfaceData and subsequently drawing on it with a Cairo context while the view into the surface memory still exists.

While this scheme does work, it offers only mutable, unique access to the pixel data. In the filter code, it’s much more convenient to have multiple references to the input surfaces with read-only memory access. Simply adding a struct similar to ImageSurfaceData that provides only a read-only view into the memory does not allow dropping the unique-reference constraint, because the other reference could always be used to concurrently mutate the pixel data.

Shared Image Surface

To work around the constraints, I ended up creating a special wrapper for ImageSurfaces:

#[derive(Debug, Clone)]
struct SharedImageSurface {
    surface: ImageSurface,

    data_ptr: NonNull<u8>, // *const.
    width: i32,
    height: i32,
    stride: isize,
}

The idea is to wrap a unique ImageSurface and provide just read-only access to it, while allowing cloning of SharedImageSurface itself. This way there can be multiple SharedImageSurfaces without compromising soundness: without direct access to the underlying ImageSurface, it’s impossible to mutate it in any way from the outside.

Additionally, since we know the surface won’t be modified, we can cache some common properties to get rid of extra C calls which can’t be easily optimized away.

The SharedImageSurface constructor ensures the ImageSurface it receives is unique:

impl SharedImageSurface {
    pub fn new(
        surface: ImageSurface,
    ) -> Result<Self, cairo::Status> {
        // get_pixel() assumes ARgb32.
        assert_eq!(
            surface.get_format(),
            cairo::Format::ARgb32
        );

        // Ensure the access is unique.
        let reference_count = unsafe {
            cairo_sys::cairo_surface_get_reference_count(
                surface.to_raw_none(),
            )
        };
        assert_eq!(reference_count, 1);

        // Flush any pending drawing operations before
        // accessing the memory directly.
        surface.flush();
        if surface.status() != cairo::Status::Success {
            return Err(surface.status());
        }

        let data_ptr = NonNull::new(unsafe {
            cairo_sys::cairo_image_surface_get_data(
                surface.to_raw_none(),
            )
        }).unwrap();

        let width = surface.get_width();
        let height = surface.get_height();
        let stride = surface.get_stride() as isize;

        Ok(Self {
            surface,
            data_ptr,
            width,
            height,
            stride,
        })
    }
}

And other methods on SharedImageSurface provide access to various surface properties, as well as the pixel data:

impl SharedImageSurface {
    /// Returns the surface width.
    #[inline]
    pub fn width(&self) -> i32 {
        self.width
    }

    /// Returns the surface height.
    #[inline]
    pub fn height(&self) -> i32 {
        self.height
    }

    /// Retrieves the pixel value at the given coordinates.
    #[inline]
    pub fn get_pixel(&self, x: u32, y: u32) -> Pixel {
        assert!(x < self.width as u32);
        assert!(y < self.height as u32);

        let offset =
            y as isize * self.stride + x as isize * 4;

        // According to Cairo documentation, the pixel values
        // for the ARgb32 format should be read using a
        // platform-native u32. Offsetting the raw pointer and
        // dereferencing it both require `unsafe`.
        let value = unsafe {
            let ptr = self.data_ptr.as_ptr().offset(offset);
            *(ptr as *const u32)
        };

        Pixel {
            r: ((value >> 16) & 0xFF) as u8,
            g: ((value >> 8) & 0xFF) as u8,
            b: (value & 0xFF) as u8,
            a: ((value >> 24) & 0xFF) as u8,
        }
    }
}

Another nice property of wrapping ImageSurfaces is that it’s possible to provide extra utility methods for working with pixel data. Right now we have methods for extracting the alpha channel, resizing the surface, converting between linear sRGB and sRGB, and performing convolutions useful for primitives like Gaussian blur. The pixel iterators I showcased in the previous post were switched to take SharedImageSurfaces as input.
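The linear sRGB conversions mentioned above follow the standard sRGB transfer function. A minimal sketch under that assumption; librsvg’s helpers operate on whole surfaces, while these illustrative functions operate on a single channel value normalized to [0.0, 1.0]:

```rust
// Standard sRGB transfer functions (sketch; function names are
// illustrative, not librsvg's API).
fn srgb_to_linear(c: f64) -> f64 {
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

fn linear_to_srgb(c: f64) -> f64 {
    if c <= 0.003_130_8 {
        c * 12.92
    } else {
        1.055 * c.powf(1.0 / 2.4) - 0.055
    }
}

fn main() {
    // The two functions are inverses, so a round trip should be
    // (nearly) lossless.
    let round_trip = linear_to_srgb(srgb_to_linear(0.5));
    println!("{:.6}", round_trip); // 0.500000
}
```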

One downside is that any other Cairo APIs that take surfaces as a read-only input need to be wrapped too. Fortunately, so far we needed only the Context::set_source_surface() method which allows using the given surface as the pixel source for drawing operations:

impl SharedImageSurface {
    /// Calls `set_source_surface()` on the given Cairo
    /// context.
    #[inline]
    pub fn set_as_source_surface(
        &self,
        cr: &cairo::Context,
        x: f64,
        y: f64,
    ) {
        cr.set_source_surface(&self.surface, x, y);
    }
}

Converting a SharedImageSurface back into a regular ImageSurface to return it from the filter code does a reference count check too: if the count isn’t equal to 1, there are other SharedImageSurfaces pointing at the same ImageSurface. In that case we cannot simply return the surface, since it could then be used to modify the pixel data, breaking the SharedImageSurface invariants; instead, a copy of the surface is created and returned:

impl SharedImageSurface {
    /// Converts this `SharedImageSurface` back into a Cairo
    /// image surface.
    #[inline]
    pub fn into_image_surface(
        self,
    ) -> Result<ImageSurface, cairo::Status> {
        let reference_count = unsafe {
            cairo_sys::cairo_surface_get_reference_count(
                self.surface.to_raw_none(),
            )
        };

        if reference_count == 1 {
            Ok(self.surface)
        } else {
            // If there are any other references, copy the
            // underlying surface.
            let bounds = IRect {
                x0: 0,
                y0: 0,
                x1: self.width,
                y1: self.height,
            };

            self.copy_surface(bounds)
        }
    }
}

Alpha-Only Optimizations

The heaviest filter primitives are the ones doing image convolutions, mainly the Gaussian blur primitive. A convolution involves computing a weighted sum of the pixels within a rectangle around every input pixel.
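The cost can be made concrete with a naive single-channel convolution (a toy sketch, not librsvg’s convolve()): every output pixel needs kernel_width × kernel_height multiply-adds. Out-of-bounds samples are treated as zero here, whereas librsvg supports several edge modes.

```rust
// Naive single-channel convolution (toy sketch, not librsvg's
// convolve()). `img` is row-major with width `w` and height `h`.
fn convolve(img: &[f64], w: i32, h: i32, kernel: &[f64], kw: i32, kh: i32) -> Vec<f64> {
    let mut out = vec![0.0; (w * h) as usize];

    for y in 0..h {
        for x in 0..w {
            let mut sum = 0.0;
            for ky in 0..kh {
                for kx in 0..kw {
                    // Sample coordinate under the kernel,
                    // centered on (x, y).
                    let sx = x + kx - kw / 2;
                    let sy = y + ky - kh / 2;
                    if sx >= 0 && sx < w && sy >= 0 && sy < h {
                        sum += img[(sy * w + sx) as usize]
                            * kernel[(ky * kw + kx) as usize];
                    }
                }
            }
            out[(y * w + x) as usize] = sum;
        }
    }

    out
}

fn main() {
    // A 3x3 averaging kernel applied to a single bright pixel.
    let img = vec![0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0];
    let kernel = vec![1.0 / 9.0; 9];
    let out = convolve(&img, 3, 3, &kernel, 3, 3);
    println!("{}", out[4]); // the center pixel becomes ~1.0
}
```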

The Gaussian blur primitive is frequently used for creating shadows for other elements. In this case it takes the rasterized element’s alpha channel as an input and blurs it. Then the blurred surface is offset a little and drawn below the input surface to create a shadow.

A yellow circle with a shadow.

When a convolution is applied to an alpha-only image, there’s no need to compute the weighted sum for the other three color channels. However, going through the whole input image to check if it only contains meaningful data in the alpha channel is rather costly. Thankfully, we can avoid that step altogether by caching this property.

I added a field into SharedImageSurface to indicate whether the current surface is alpha-only, along with a special constructor:

struct SharedImageSurface {
    /* ... */

    /// Whether this surface contains meaningful data only
    /// in the alpha channel.
    ///
    /// This is used for optimizations, particularly in
    /// `convolve()` to skip processing other channels.
    alpha_only: bool,
}

impl SharedImageSurface {
    pub fn new(
        surface: ImageSurface,
    ) -> Result<Self, cairo::Status> {
        /* ... */

        Ok(Self {
            surface,
            data_ptr,
            width,
            height,
            stride,
            // Default to not alpha only.
            alpha_only: false,
        })
    }

    /// Creates a `SharedImageSurface` from a unique
    /// `ImageSurface` with meaningful data only in the alpha
    /// channel.
    #[inline]
    pub fn new_alpha_only(
        surface: ImageSurface,
    ) -> Result<Self, cairo::Status> {
        let mut rv = Self::new(surface)?;
        rv.alpha_only = true;
        Ok(rv)
    }
}

This constructor is used automatically in the extract_alpha() method used to separate the alpha channel of the input surface:

impl SharedImageSurface {
    /// Returns a surface with black background and alpha
    /// channel matching this surface.
    pub fn extract_alpha(
        &self,
        bounds: IRect,
    ) -> Result<SharedImageSurface, cairo::Status> {
        let mut output_surface = ImageSurface::create(
            cairo::Format::ARgb32,
            self.width,
            self.height,
        )?;

        let output_stride =
            output_surface.get_stride() as usize;
        {
            let mut output_data =
                output_surface.get_data().unwrap();

            for (x, y, Pixel { a, .. }) in
                Pixels::new(self, bounds)
            {
                let output_pixel = Pixel {
                    r: 0,
                    g: 0,
                    b: 0,
                    a,
                };

                output_data.set_pixel(
                    output_stride,
                    output_pixel,
                    x,
                    y,
                );
            }
        }

        // The returned surface is alpha-only!
        SharedImageSurface::new_alpha_only(output_surface)
    }
}

Finally, the pixel processing methods check the alpha-only flag to reduce the number of operations or skip unneeded processing altogether:

impl SharedImageSurface {
    /// Returns a surface with pre-multiplication of color
    /// values undone.
    pub fn unpremultiply(
        &self,
        bounds: IRect,
    ) -> Result<SharedImageSurface, cairo::Status> {
        // Unpremultiplication doesn't affect the alpha
        // channel.
        if self.alpha_only {
            return Ok(self.clone());
        }

        let mut output_surface = ImageSurface::create(
            cairo::Format::ARgb32,
            self.width,
            self.height,
        )?;

        let stride = output_surface.get_stride() as usize;
        {
            let mut data =
                output_surface.get_data().unwrap();

            for (x, y, pixel) in Pixels::new(self, bounds) {
                data.set_pixel(
                    stride,
                    pixel.unpremultiply(),
                    x,
                    y,
                );
            }
        }

        SharedImageSurface::new(output_surface)
    }

    /// Performs a convolution.
    pub fn convolve(
        &self,
        bounds: IRect,
        target: (i32, i32),
        kernel: &Matrix<f64>,
        edge_mode: EdgeMode,
    ) -> Result<SharedImageSurface, cairo::Status> {
        assert!(kernel.rows() >= 1);
        assert!(kernel.cols() >= 1);

        let mut output_surface = ImageSurface::create(
            cairo::Format::ARgb32,
            self.width,
            self.height,
        )?;

        let output_stride =
            output_surface.get_stride() as usize;
        {
            let mut output_data =
                output_surface.get_data().unwrap();

            if self.alpha_only {
                // Perform a convolution, taking a weighted
                // sum of only the alpha channel.
            } else {
                // Perform a convolution, taking a weighted
                // sum of all four color channels.
            }
        }

        if self.alpha_only {
            SharedImageSurface::new_alpha_only(
                output_surface,
            )
        } else {
            SharedImageSurface::new(output_surface)
        }
    }
}
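The elided branches above boil down to a weighted sum per pixel. Here is a rough, self-contained sketch of the alpha-only case. This is an illustration over a plain RGBA byte buffer rather than the real SharedImageSurface, with edge handling simplified to clamping and a fixed 3×3 kernel:

```rust
/// A sketch of an alpha-only convolution over a tightly packed RGBA buffer
/// (stride = width * 4). The real librsvg code works on a cairo image
/// surface and supports arbitrary kernels and edge modes.
fn convolve_alpha_only(
    data: &[u8],
    width: i32,
    height: i32,
    kernel: &[[f64; 3]; 3],
) -> Vec<u8> {
    let stride = (width * 4) as usize;
    let mut output = vec![0u8; data.len()];

    for y in 0..height {
        for x in 0..width {
            let mut sum = 0.0;

            for (ky, row) in kernel.iter().enumerate() {
                for (kx, weight) in row.iter().enumerate() {
                    // Clamp sample coordinates to the image edges.
                    let sx = (x + kx as i32 - 1).clamp(0, width - 1) as usize;
                    let sy = (y + ky as i32 - 1).clamp(0, height - 1) as usize;

                    // Only the alpha channel (offset 3) contributes.
                    sum += weight * data[sy * stride + sx * 4 + 3] as f64;
                }
            }

            // R, G and B stay zero; only alpha is written back.
            output[y as usize * stride + x as usize * 4 + 3] =
                sum.round().clamp(0.0, 255.0) as u8;
        }
    }

    output
}
```

Compared to the general branch, three quarters of the per-pixel multiplications simply disappear, which is where the savings measured below come from.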

Validating Performance

The main motivating example for the optimizations is this old mobile phone SVG file from a 9-year-old performance issue, which takes about 40 seconds to render on my desktop PC:

An old mobile phone.

To measure the performance impact, I built librsvg in release mode with debug info before doing alpha-only optimizations and used the perf tool:

perf record --call-graph=dwarf ./rsvg-convert -o /dev/null mobile_phone_01.svg

I then used the awesome FlameGraph tool to create a nice visualization of the data (click and open in a browser for an interactive SVG):

A flame graph of the rendering performance.

The large two-part column in the middle happens to be the Gaussian blur, taking up 50.63% of the total rendering time for the mobile phone. It turns out the blur operates on SourceAlpha, which contains just the alpha channel of the rasterized element the filter is applied to.

After adding the alpha-only optimizations to convolution and other parts of the code like linear sRGB ⇔ sRGB conversion, the rendering time dropped from 40 seconds to 29 seconds, and the performance graph now looks like this:

A flame graph of the rendering performance.

The percentage of time taken by Gaussian blur dropped from 50.63% to 36.98%. You can also see a slight drop in the narrow column to the left of the Gaussian blur part: that’s the sRGB linearizations which became no-ops on those input images that were alpha-only.

Conclusion

The project is coming along very well. I’ve just come back from GUADEC, where I really pushed the filter rustification during the hacking days, porting all of the remaining filters from C to Rust. Now that all filters are in Rust and thoroughly tested, I’m looking at improving the performance, starting with the alpha-only optimization described in this post.

Also, since all SVG nodes are now in Rust (the filter primitives were the last ones), I was able to clean up the remaining C interop code for nodes! It’s great to see the intermediate FFI code gradually disappear as entire subsystems get completely ported to Rust.

GSoC 2018: Filter Infrastructure

Introduction

This summer I’m working on librsvg, a GNOME library for rendering SVG files, particularly on porting the SVG filter effects from C to Rust. That involves separating the code for different filters from one huge C file into individual files for each filter, and then porting the filter rendering infrastructure and the individual filters.

Thankfully, in the large C file the code for different filters was divided by comment blocks, so several vim macros later I was done with the not so exciting splitting part.

Representing Filters in Rust

SVG filter effects are applied to an existing SVG element to produce a modified graphical result. Each filter consists of a number of filter primitives. The primitives take raster images (bitmaps) as input (this can be, for example, the rasterized element the filter was applied to, a background snapshot of the canvas at the time the filter was invoked, or the output of another filter primitive), do something with them (like moving pixels to a different position, applying a Gaussian blur, or blending two input images together), and produce a raster image as output.

Each filter primitive has a number of properties. The common properties include the bounds of the region where the filter primitive is doing its processing, the name assigned to the primitive’s result, and the input that the primitive operates on. I collected the common properties into the following types:

struct Primitive {
    x: Cell<Option<RsvgLength>>,
    y: Cell<Option<RsvgLength>>,
    width: Cell<Option<RsvgLength>>,
    height: Cell<Option<RsvgLength>>,
    result: RefCell<Option<String>>,
}

struct PrimitiveWithInput {
    base: Primitive,
    in_: RefCell<Option<Input>>,
}

Each filter primitive struct is meant to contain one of these two common types along with any extra properties as needed. The common types provide functions for parsing their respective properties, so that code doesn’t need to be duplicated in each filter.

Note that these properties are just “descriptions” of the final values to be used during rendering. For example, an RsvgLength can be equal to 2 or 50%, and the actual length in pixels is evaluated during rendering and depends on various rendering state such as the coordinate system in use and the size of the enclosing element.
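As a toy model of this two-stage behavior (the real RsvgLength in librsvg is considerably more involved and also handles units and directions, so treat the names here as hypothetical):

```rust
/// A hypothetical, minimal model of a length "description": either an
/// absolute pixel value or a percentage that can only be resolved against
/// a viewport dimension known at rendering time.
#[derive(Clone, Copy)]
enum Length {
    Pixels(f64),
    Percent(f64),
}

impl Length {
    /// Evaluates the length into a concrete pixel value.
    fn normalize(self, viewport_size: f64) -> f64 {
        match self {
            Length::Pixels(px) => px,
            Length::Percent(p) => p / 100.0 * viewport_size,
        }
    }
}
```

The same description thus yields different concrete values depending on the rendering state it is normalized against.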

The filter primitive processing behavior is nicely described as a trait:

trait Filter {
    fn render(&self, ctx: &FilterContext)
        -> Result<FilterResult, FilterError>;
}

Here FilterContext contains various filter state such as the rasterized bitmap representation of the SVG element the filter is being applied to and results of previously rendered primitives, and allows retrieving the necessary input bitmaps. Successful rendering results in a FilterResult which has the name assigned to the primitive and the output image, and errors (like non-existent input filter primitive) end up in FilterError.

When a filter is invoked, it goes through its child nodes (filter primitives) in order, render()s them and stores the results in the FilterContext.
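A simplified sketch of that invocation loop, with stand-in types (the names and the byte-buffer “surface” here are assumptions for illustration, not librsvg’s actual API):

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the real librsvg types:
// a "surface" is just a byte buffer here.
type Surface = Vec<u8>;

struct FilterResult {
    name: Option<String>,
    output: Surface,
}

#[derive(Debug)]
enum FilterError {
    InvalidInput,
}

#[derive(Default)]
struct FilterContext {
    results: HashMap<String, Surface>,
    last_result: Option<Surface>,
}

trait Filter {
    fn render(&self, ctx: &FilterContext) -> Result<FilterResult, FilterError>;
}

/// Renders each primitive in order and stores its result in the context,
/// so later primitives can refer to earlier ones by name.
fn run_filter(
    primitives: &[Box<dyn Filter>],
    ctx: &mut FilterContext,
) -> Result<(), FilterError> {
    for primitive in primitives {
        let result = primitive.render(ctx)?;

        if let Some(name) = &result.name {
            ctx.results.insert(name.clone(), result.output.clone());
        }

        // The unnamed implicit input for the next primitive.
        ctx.last_result = Some(result.output);
    }

    Ok(())
}

/// A dummy primitive that outputs a constant surface, for illustration.
struct Fill {
    name: Option<String>,
    value: u8,
}

impl Filter for Fill {
    fn render(&self, _ctx: &FilterContext) -> Result<FilterResult, FilterError> {
        Ok(FilterResult {
            name: self.name.clone(),
            output: vec![self.value; 4],
        })
    }
}
```

An error in any primitive aborts the whole chain via the ? operator, which matches the all-or-nothing way a broken filter reference behaves.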

Pixel Iteration

Since many filter primitives operate on a per-pixel basis, it’s important to have a convenient way of transforming the pixel values.

Librsvg uses image surfaces from Cairo, a 2D graphics library, for storing bitmaps. An image surface stores its pixel values in RGBA format in one large contiguous array, row by row, with optional padding between the rows (the row stride may be larger than the row width in bytes). The plain way of accessing the values is image[y * stride + x * 4 + ch], where ch is 0, 1, 2 and 3 for R, G, B and A respectively. However, writing this out is rather tedious and error-prone.
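To make the addressing formula concrete, here it is wrapped into a tiny helper (a hypothetical function, purely for illustration):

```rust
/// Reads channel `ch` (0 = R, 1 = G, 2 = B, 3 = A) of the pixel at (x, y)
/// from a packed RGBA buffer with the given row stride in bytes.
fn channel(image: &[u8], stride: usize, x: usize, y: usize, ch: usize) -> u8 {
    image[y * stride + x * 4 + ch]
}
```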

As the first step, I added a pixel value struct:

struct Pixel {
    pub r: u8,
    pub g: u8,
    pub b: u8,
    pub a: u8,
}

and extended cairo-rs’s image surface data accessor with the following methods:

fn get_pixel(
    &self,
    stride: usize,
    x: usize,
    y: usize,
) -> Pixel;

fn set_pixel(
    &mut self,
    stride: usize,
    pixel: Pixel,
    x: usize,
    y: usize,
);

using the known trick of declaring a trait containing the new methods and implementing it for the target type. Unfortunately, stride has to be passed in manually because the (foreign) data accessor type doesn’t offer a public way of retrieving it. Adding these methods to cairo-rs directly would make it possible to get rid of the extra argument.
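The extension-trait trick itself looks roughly like this, sketched against a plain byte slice instead of cairo-rs’s actual data accessor type:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
struct Pixel {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

// Declare a trait containing the new methods...
trait PixelAccess {
    fn get_pixel(&self, stride: usize, x: usize, y: usize) -> Pixel;
    fn set_pixel(&mut self, stride: usize, pixel: Pixel, x: usize, y: usize);
}

// ...and implement it for a type we don't own (here: [u8]). The methods
// then become available wherever the trait is in scope.
impl PixelAccess for [u8] {
    fn get_pixel(&self, stride: usize, x: usize, y: usize) -> Pixel {
        let base = y * stride + x * 4;
        Pixel {
            r: self[base],
            g: self[base + 1],
            b: self[base + 2],
            a: self[base + 3],
        }
    }

    fn set_pixel(&mut self, stride: usize, pixel: Pixel, x: usize, y: usize) {
        let base = y * stride + x * 4;
        self[base] = pixel.r;
        self[base + 1] = pixel.g;
        self[base + 2] = pixel.b;
        self[base + 3] = pixel.a;
    }
}
```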

Next, since the pattern of iterating over pixels of an image surface within the given bounds comes up rather frequently in filter primitives, I added a Pixels iterator inspired by the image crate. It allows writing code like this:

for (x, y, pixel) in Pixels::new(&image, bounds) {
    /* ... */
}

instead of the repetitive plain version:

for y in bounds.y0..bounds.y1 {
    for x in bounds.x0..bounds.x1 {
        let pixel = image.get_pixel(stride, x, y);
        /* ... */
    }
}
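Under the hood, such an iterator can be sketched as follows. This is a simplified version that walks a plain RGBA buffer in row-major order; the real Pixels wraps a shared image surface instead:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
struct Pixel { r: u8, g: u8, b: u8, a: u8 }

struct IRect { x0: usize, y0: usize, x1: usize, y1: usize }

/// A sketch of a Pixels-style iterator over a packed RGBA buffer.
struct Pixels<'a> {
    data: &'a [u8],
    stride: usize,
    bounds: IRect,
    x: usize,
    y: usize,
}

impl<'a> Pixels<'a> {
    fn new(data: &'a [u8], stride: usize, bounds: IRect) -> Self {
        let (x, y) = (bounds.x0, bounds.y0);
        Pixels { data, stride, bounds, x, y }
    }
}

impl<'a> Iterator for Pixels<'a> {
    type Item = (usize, usize, Pixel);

    fn next(&mut self) -> Option<Self::Item> {
        // Stop once the row pointer moves past the bottom bound (or the
        // bounds are empty to begin with).
        if self.y >= self.bounds.y1 || self.x >= self.bounds.x1 {
            return None;
        }

        let base = self.y * self.stride + self.x * 4;
        let pixel = Pixel {
            r: self.data[base],
            g: self.data[base + 1],
            b: self.data[base + 2],
            a: self.data[base + 3],
        };
        let item = (self.x, self.y, pixel);

        // Advance in row-major order, wrapping to the next row at the edge.
        self.x += 1;
        if self.x == self.bounds.x1 {
            self.x = self.bounds.x0;
            self.y += 1;
        }

        Some(item)
    }
}
```

Because it implements the standard Iterator trait, all of the usual combinators (map, filter, zip and friends) work on it for free.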

Filters with multiple input images can process pixels simultaneously in the following fashion using the standard Rust iterator combinators:

for (x, y, p, p2) in Pixels::new(&image, bounds)
    .map(|(x, y, p)| {
        (x, y, p, image2.get_pixel(stride, x, y))
    })
{
    let out_pixel = /* ... */;
    out_image.set_pixel(stride, out_pixel, x, y);
}

Benchmarking

Rust is known for its zero-cost abstractions; however, it’s still important to keep track of performance, because it’s entirely possible to write code in a way that’s hard for the compiler to optimize. Fortunately, a benchmarking facility is provided on nightly Rust out of the box: the test feature with the Bencher type.

Benchmark sources are usually placed in the benches/ subdirectory of the crate and look like this:

#![feature(test)]
extern crate rsvg_internals;
extern crate test;

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn my_benchmark_1(b: &mut Bencher) {
        /* initialization */

        b.iter(|| {
            /* code to be benchmarked */
        });
    }

    #[bench]
    fn my_benchmark_2(b: &mut Bencher) {
        /* ... */
    }

    /* ... */
}

After ensuring the crate’s crate-type includes "lib", you can run benchmarks with cargo +nightly bench.

I created three benchmarks, one for the straightforward iteration:

b.iter(|| {
    let mut r = 0;
    let mut g = 0;
    let mut b = 0;
    let mut a = 0;

    for y in BOUNDS.y0..BOUNDS.y1 {
        for x in BOUNDS.x0..BOUNDS.x1 {
            let base = y * stride + x * 4;

            r += image[base + 0] as usize;
            g += image[base + 1] as usize;
            b += image[base + 2] as usize;
            a += image[base + 3] as usize;
        }
    }

    (r, g, b, a)
})

One for iteration using get_pixel():

b.iter(|| {
    let mut r = 0;
    let mut g = 0;
    let mut b = 0;
    let mut a = 0;

    for y in BOUNDS.y0..BOUNDS.y1 {
        for x in BOUNDS.x0..BOUNDS.x1 {
            let pixel = image.get_pixel(stride, x, y);

            r += pixel.r as usize;
            g += pixel.g as usize;
            b += pixel.b as usize;
            a += pixel.a as usize;
        }
    }

    (r, g, b, a)
})

And one for the Pixels iterator:

b.iter(|| {
    let mut r = 0;
    let mut g = 0;
    let mut b = 0;
    let mut a = 0;

    for (_x, _y, pixel) in Pixels::new(&image, BOUNDS) {
        r += pixel.r as usize;
        g += pixel.g as usize;
        b += pixel.b as usize;
        a += pixel.a as usize;
    }

    (r, g, b, a)
})

Here are the results I’ve got:

test tests::bench_pixels                   ... bench:     991,137 ns/iter (+/- 62,654)
test tests::bench_straightforward          ... bench:     992,124 ns/iter (+/- 7,119)
test tests::bench_straightforward_getpixel ... bench:   1,034,037 ns/iter (+/- 11,121)

It looks like the abstractions indeed didn’t introduce any overhead!

Implementing a Filter Primitive

Let’s look at how to write a simple filter primitive in Rust. As an example I’ll show the offset filter primitive which moves its input on the canvas by a specified number of pixels.

Offset has an input and two additional properties for the offset amounts:

struct Offset {
    base: PrimitiveWithInput,
    dx: Cell<RsvgLength>,
    dy: Cell<RsvgLength>,
}

Since each filter primitive is an SVG node, it needs to implement NodeTrait which contains a function for parsing the node’s properties:

impl NodeTrait for Offset {
    fn set_atts(
        &self,
        node: &RsvgNode,
        handle: *const RsvgHandle,
        pbag: &PropertyBag,
    ) -> NodeResult {
        // Parse the common properties.
        self.base.set_atts(node, handle, pbag)?;

        // Parse offset-specific properties.
        for (_key, attr, value) in pbag.iter() {
            match attr {
                Attribute::Dx => self.dx.set(parse(
                    "dx",
                    value,
                    LengthDir::Horizontal,
                    None,
                )?),
                Attribute::Dy => self.dy.set(parse(
                    "dy",
                    value,
                    LengthDir::Vertical,
                    None,
                )?),
                _ => (),
            }
        }

        Ok(())
    }
}

Finally, we need to implement the Filter trait. Note that render() accepts an additional &RsvgNode argument, which refers to the filter primitive node. It’s different from &self in that it contains various common SVG node state.

impl Filter for Offset {
    fn render(
        &self,
        node: &RsvgNode,
        ctx: &FilterContext,
    ) -> Result<FilterResult, FilterError> {
        // Compute the processing region bounds.
        let bounds = self.base.get_bounds(ctx);

        // Compute the final property values.
        let cascaded = node.get_cascaded_values();
        let values = cascaded.get();

        let dx = self
            .dx
            .get()
            .normalize(&values, ctx.drawing_context());
        let dy = self
            .dy
            .get()
            .normalize(&values, ctx.drawing_context());

        // The final offsets depend on the currently active
        // affine transformation.
        let paffine = ctx.paffine();
        let ox = (paffine.xx * dx + paffine.xy * dy) as i32;
        let oy = (paffine.yx * dx + paffine.yy * dy) as i32;

        // Retrieve the input surface.
        let input_surface =
            get_surface(self.base.get_input(ctx))?;

        // input_bounds contains all pixels within bounds,
        // for which (x + ox) and (y + oy) also lie
        // within bounds.
        let input_bounds = IRect {
            x0: clamp(bounds.x0 - ox, bounds.x0, bounds.x1),
            y0: clamp(bounds.y0 - oy, bounds.y0, bounds.y1),
            x1: clamp(bounds.x1 - ox, bounds.x0, bounds.x1),
            y1: clamp(bounds.y1 - oy, bounds.y0, bounds.y1),
        };

        // Create an output surface.
        let mut output_surface =
            ImageSurface::create(
                cairo::Format::ARgb32,
                input_surface.get_width(),
                input_surface.get_height(),
            ).map_err(FilterError::OutputSurfaceCreation)?;

        let output_stride =
            output_surface.get_stride() as usize;

        // An extra scope is needed because output_data
        // borrows output_surface, but we need to move
        // out of it to return it.
        {
            let mut output_data =
                output_surface.get_data().unwrap();

            for (x, y, pixel) in
                Pixels::new(&input_surface, input_bounds)
            {
                let output_x = (x as i32 + ox) as usize;
                let output_y = (y as i32 + oy) as usize;
                output_data.set_pixel(
                    output_stride,
                    pixel,
                    output_x,
                    output_y,
                );
            }
        }

        // Return the result of the processing.
        Ok(FilterResult {
            name: self.base.result.borrow().clone(),
            output: FilterOutput {
                surface: output_surface,
                bounds,
            },
        })
    }
}

Conclusion

The project is coming along very nicely with a few simple filters already working in Rust and a couple of filter tests getting output closer to the reference images.

I’ll be attending this year’s GUADEC, so I hope to see you there in July!

GSoC 2018: Introduction

Hello!

I’m Ivan Molodetskikh, a student at Moscow State University, and I love Rust. I learned about Rust after a couple of years of mainly programming personal projects in C++ (with a little bit of C#, Java and others here and there). Making my way through writing my first Rust project (what better way to get familiar with a language than trying to make something interesting in it?) and trying out various features of the language, I frequently had moments of “This is actually so much better than the C++ alternative!”. Multiple times, after inspecting a borrow-related error, I realized that the same mistake had bitten me in the past in C++ in the form of a hard-to-find bug. Borrowing, lifetimes, iterators, enums, Option and Result in particular are great once you get a feel for how to use them. The Rust community is amazing too, with people always being positive and happy to help newcomers.

Fast-forward two years: I have a couple of small Rust projects and some contributions under my belt, and I’m continuing to enjoy the language. So, it should be no surprise that when I learned about GSoC I started looking for Rust-related projects. I applied to both Xi (a novel text editor with a fully async architecture) and librsvg (a GNOME library for rendering SVG files) and got accepted into librsvg on a project to help with the ongoing effort to port it to Rust, specifically the SVG filter effects.

The project

Librsvg is a small library for rendering SVG files in the GNOME ecosystem. Its goal is to be a low-footprint library with a minimal API suitable for rendering things like icons and other images that appear on the desktop.

Among other things SVG supports the so-called filter effects which generally (but not always) transform existing SVG elements in one way or another. For example, there’s a blur effect, an effect that moves its input by a fixed position, and an effect that loads an external raster image.

Currently all filter effects in librsvg are implemented entirely in C. Additionally, most of them aren’t covered by specification conformance tests.

For this project I’m going to be porting the filter infrastructure to Rust, adding all missing tests and making sure everything works correctly.

There are a number of things to get done as part of the project. I’ll start by separating the existing C filter code from one huge file into multiple small ones, one for each filter. This will make it easier to port the filters to Rust one by one later.

Next comes the most interesting part, experimenting with Rust abstractions over common filter actions, such as iterating over pixels in various ways, like one by one or using a square window. This has to be fast and ergonomic and support the different filter use cases.

Finally, I’ll be porting all filter code from C to Rust and adding the missing filter tests along the way to make sure everything works right.

Conclusion

I’ll be posting about my progress on this blog. If you want to contact me, I’m always in #rust on the GNOME IRC. And if you’re into video game speedruns, this summer I’ll be at the ESA hanging out and helping to commentate the Half-Life run. 🙂