
March 4, 2005

Visual Regression Tests

Over the last couple of weeks I've been working on automated visual regression testing for Gecko. The idea is to run Gecko through a battery of HTML test files and render each one to an image file. We can then compare the images produced by different Gecko versions; if a new version renders a page differently then we've probably either fixed a bug or introduced a new one (or both!). When layout engineers develop patches, we can see which testcases change and hopefully catch any unexpected regressions before checkin. (It's amazing we didn't have this years ago; we've survived until now because we have lots of great volunteers who download nightly builds, test them, and report bugs.)
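To make the comparison step concrete, here is a minimal Python sketch. It is not the actual script; it assumes each run leaves one image file per testcase in its own directory, with matching file names across runs. The function name `changed_images` and the directory layout are hypothetical. Because the approach relies on pixel-exact output, the sketch simply compares file bytes.

```python
import filecmp
from pathlib import Path


def changed_images(baseline_dir, candidate_dir):
    """Return the sorted names of images that differ between two runs.

    Assumption: both directories hold one image per testcase, and a
    testcase has the same file name in both. A file present in only
    one directory is reported as changed.
    """
    baseline_dir = Path(baseline_dir)
    candidate_dir = Path(candidate_dir)
    baseline = {p.name for p in baseline_dir.iterdir() if p.is_file()}
    candidate = {p.name for p in candidate_dir.iterdir() if p.is_file()}

    # Files that exist on one side only.
    changed = set(baseline ^ candidate)

    # Files on both sides: compare contents byte for byte.
    for name in baseline & candidate:
        if not filecmp.cmp(baseline_dir / name, candidate_dir / name, shallow=False):
            changed.add(name)

    return sorted(changed)
```

An empty result means the two runs rendered identically. Byte comparison is deliberately strict: it can only report a difference where the files really differ, though two files encoding the same pixels with different headers would also be flagged.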

We already have regression tests based on analyzing the frame tree geometry computed by the layout engine. Visual regression tests have some advantages over comparing those coordinates:

I initially tried to do this using Xvfb to create a virtual screen and taking screenshots of it (I tried both ImageMagick's 'import' and reading the Xvfb virtual screen directly through its exported framebuffer). I had horrible problems with synchronization, trying to ensure that the Xvfb server had actually rendered all the commands Gecko sent to it before I took the screenshots, and eventually gave up on that approach late last week.

Instead I just added a real offscreen rendering API to the view manager and added code to nsDocumentViewer to call it when the environment variable MOZ_FORCE_PAINT_AFTER_ONLOAD is set. With this patch, if you set MOZ_FORCE_PAINT_AFTER_ONLOAD=/tmp/foo, then every time we load a document Gecko spits out a message "GECKO: PAINT FORCED AFTER ONLOAD [url] [file] ([status])", where [url] is the loaded URL, [file] is of the form /tmp/foo-NNN and names a file in PPM format, and [status] is OK if we wrote a file or otherwise some barely-descriptive error token. This code should be useful for various hacks, and hopefully we'll be able to make the view manager API visible to JavaScript authors for use in the browser UI and extensions.
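A test harness could pick these messages out of Gecko's output with something like the following Python sketch. It assumes the three fields appear literally as space-separated text, with the status in parentheses and no spaces inside the URL or file name; the exact spacing of the real message may differ. The names `PaintRecord` and `parse_paint_line` are made up for this example.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "GECKO: PAINT FORCED AFTER ONLOAD <url> <file> (<status>)"
PAINT_LINE = re.compile(
    r"^GECKO: PAINT FORCED AFTER ONLOAD (\S+) (\S+) \((.+)\)$"
)


class PaintRecord(NamedTuple):
    url: str     # the loaded URL
    path: str    # the image file, e.g. /tmp/foo-NNN
    status: str  # "OK" or an error token


def parse_paint_line(line: str) -> Optional[PaintRecord]:
    """Parse one output line; return None if it is not a paint message."""
    match = PAINT_LINE.match(line.strip())
    if match is None:
        return None
    return PaintRecord(*match.groups())
```

A harness would feed each line of Gecko's output through `parse_paint_line`, skip the `None` results, and treat any record whose status is not `OK` as a failed render rather than as a rendering change.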

That solved a lot of problems, but I had another set of problems to grapple with this week: nondeterminism. Basically the regression test approach depends on the fact that every time Gecko renders a page in a window of fixed width and height, we get exactly the same image, down to the details of each pixel. But with trunk code we don't, and I just spent a few days figuring out exactly why. It boils down to the following issues:

Dealing with nondeterministic program behaviour is always troublesome and tracking these down was really hellish. I had to instrument the code involved with logging functions, run my 1300 test cases multiple times until I saw two runs with different behaviours, compare the logs to narrow down the cause of the variation, and repeat until I had located the problem. And 1300 testcases with a lot of logging enabled is very unwieldy to analyze.

Anyway after fixing all these issues I now have 100% reproducible rendering for the 1300 testcases in the Mozilla tree. Next week I'll try the 2500 testcases in various online test suites. Hopefully they won't uncover any new issues. Then I'll get this work submitted in Bugzilla and get back to fixing bugs with the help of my shiny new regression test engine.

Currently my script takes about 130 seconds to run 1300 testcases on my 3GHz Xeon dedicated test machine. I have included support for distributing tests across multiple processors and I'm looking forward to seeing how many tests per second I get on my big machine.

Posted by roc at March 4, 2005 5:05 PM

Comments

How did you fix that last bug, the Freetype2 issue?

Posted by: Ernst Persson at March 4, 2005 10:46 PM

Could you have something automatically compare 2 images? If they differ, raise a red light and inform a developer/whoever (if they are identical then either show a green light or there is no need for the system to contact anyone). I just think that if possible it is best to reduce the amount of human time used.

Posted by: RichCorb at March 4, 2005 11:09 PM

Impressive work! Are these tests intended to be run on the tinderboxen?

Posted by: Christian at March 4, 2005 11:55 PM

Ernst: a small patch to Freetype2.

RichCorb: yeah, the comparison is automated. I'll probably blog more about how the regression tests work.

Christian: something like that

Posted by: Robert O'Callahan at March 5, 2005 7:17 AM

Will this code make it possible to make an extension that let the user take a screenshot of a page (the complete page and not just the bit that fits to the screen)?

Posted by: Emil Hesslow at March 5, 2005 7:27 AM

Emil: yes, once we've done some work to expose the API to script.

Posted by: Robert O'Callahan at March 5, 2005 10:50 AM

Robert: we should talk at some point. Would be cool if we could in some way leverage this with the new reporter tool.

Posted by: Robert Accettura at March 5, 2005 12:06 PM

Robert, this is very cool stuff. I'm amazed the whole test suite can be run in only 2 minutes!

Posted by: Jafe at March 5, 2005 8:03 PM

Cool! When can I expect to see this feature with Firefox?

Posted by: Hugh at March 7, 2005 9:44 PM

How do you deal with platform-specific issues (i.e. a bug only triggered on, say, Mac OS X)?

Posted by: Ludovic at March 8, 2005 12:11 AM

Robert: sure, catch me on IRC #developers

Jafe: I just added a bunch of tests from online test suites. We're up to 3840 tests which run in 7 minutes 40 seconds.

Hugh: I've already checked in most of the Mozilla changes. Trunk builds of Firefox should support the MOZ_FORCE_PAINT_AFTER_ONLOAD environment variable to dump page images. Supporting it with JS API requires some discussion of what exactly that API should be.

Ludovic: It's a good point that this only catches regressions on Linux. Someone could run this tool on OSX (once I've checked in all the support scripts), but OSX might have its own issues with nondeterministic rendering, and someone would have to track them down.

I've actually found some new nondeterministic rendering issues that only show up in a few testcases belonging to Hixie. I'll try to track those down.

Posted by: Robert O'Callahan at March 8, 2005 9:59 AM

Will the tests in layout/html/tests be run through this eventually (when we get people to run through them and verify that current rendering serves as a correct baseline)? Or are those tests used in a different way that I've never heard discussed before?

Posted by: Jeff Walden at March 8, 2005 11:58 AM

I'm running through them already. What I currently do is treat the current rendering as a baseline, apply a patch, rerun the tests, and see what changed.

Posted by: Robert O'Callahan at March 8, 2005 2:14 PM