Three Monkeys, Three Typewriters, Two Days

February 14, 2010

The pitfalls of comparing performance across browsers

From the source of a popular open-source rendering engine:

  // Flush out layout so it's up-to-date by the time onload is called.

From the source of another popular open-source rendering engine:

  // Make sure both the initial layout and reflow happen after the onload
  // fires. This will improve onload scores, and other browsers do it.
  // If they wanna cheat, we can too.

So if you're doing any sort of performance timing using onload, you're comparing apples to oranges unless you flush out layout yourself: in a <script> at the very bottom of the page (because in some browsers loads of some subresources will actually start during layout), again right before taking your "load stop" timestamp (because there might be more pending layout at that point), and in all subframes of your document.
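As a rough illustration, here's a minimal sketch of that kind of flushing (the helper names are my own, not from any engine; it assumes a browser environment, where reading a layout-dependent property like offsetHeight forces a synchronous layout flush):

```javascript
// Hypothetical helpers: force a layout flush by reading a
// layout-dependent property, then take the "load stop" timestamp.
function flushLayout(doc) {
  // Reading offsetHeight makes the engine perform any pending layout
  // so the returned value is up to date.
  return doc.body.offsetHeight;
}

function recordLoadStop(win) {
  // Flush the top document and every subframe before timestamping,
  // so deferred layout work is included in the measurement.
  flushLayout(win.document);
  for (var i = 0; i < win.frames.length; i++) {
    flushLayout(win.frames[i].document);
  }
  return Date.now();
}
```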

Too bad this is a common thing for people to do.

Posted by bzbarsky at February 14, 2010 10:29 PM

Rather than crawling through source comments (which aren't available for a few browsers), isn't the solution to add detection for such cheats in an "Acid4" test?

Considering some of the nit-picky things the previous acid tests checked for, firing onload before the layout is done seems like a much bigger issue.

I'm a complete javascript noob (OK, I've read about it but haven't used it), but it seems like the test's onload handler could simply note the time it gets called, then ask the DOM for the position of some particular element towards the end of the page, note the time again, ask the DOM for the position of some other element, note the time again, and if (time2 - time1) >> (time3 - time2), throw a red flag.
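With the DOM reads abstracted away, the comparison in that idea might look like this (the helper name and the threshold for ">>" are my own illustrative choices, not part of any real test):

```javascript
// t1: when onload fired; t2: after querying the position of an
// element near the end of the page (which forces any deferred
// layout); t3: after querying a second element (layout is already
// up to date by then).  If the first query took vastly longer than
// the second, the engine probably deferred layout past onload.
// The factor of 10 is an arbitrary stand-in for ">>".
function deferredLayoutSuspected(t1, t2, t3) {
  var firstQuery = t2 - t1;
  var secondQuery = t3 - t2;
  return firstQuery > 10 * secondQuery;
}
```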

I'd also like to see an Acid3 revision which kept the same category of test for each specific test, but changed the actual test cases to trip up browsers that have special-cased functionality for Acid3.

Posted by: Jim B on February 15, 2010 1:03 AM

Jim, last time I suggested to Ian that an Acid4 test focusing on issues with CSS and dynamic DOM changes would be a good idea, he said that there was no point in working on an Acid4 until all the common browsers were close to passing Acid3. Given that this would require a native SVG implementation in IE, among other things, it'll probably be several years at best before there's anything resembling Acid4.

Posted by: Boris on February 15, 2010 3:14 PM

Well, most of the common browsers *are* passing or close to passing Acid3. Internet Explorer is of course the exception, but then, the same was true when Acid3 was released - IE at the time certainly wasn't close to passing Acid2...

Posted by: Simon on February 15, 2010 5:28 PM

Simon, it might be good to get your dates straight: IE8 beta 1 was released in March 2008 and passed Acid2. It was well-known that IE8 would aim to pass Acid2: that had been announced in December 2007. Acid3 was released in March 2008.

In contrast, there's been no indication from Microsoft that they plan to target Acid3 for any particular time (with good reason, imo; parts of Acid3 are really silly).

Posted by: Boris on February 15, 2010 5:35 PM

Although the original motivation for deferring layout in WebKit until after onload was benchmarks, it turns out to be good to do so. By deferring the initial layout, you avoid doing a double layout in the case where onload makes some further changes.

An even better topic to harp on is why people insist on using onload as a performance metric at all. When a page is displayed and can be interacted with is far more important than when it and all its subresources have finished loading, and it affects a user's perception of a browser's speed much more than onload time does.

Posted by: Dave Hyatt on February 15, 2010 6:07 PM

Dave, I agree that onload is a terrible performance metric. People insist on using it for the simple reason that it's easy. That's the story of benchmarks in general, actually: people measure what's easy to measure, not what's useful to measure.

Posted by: Boris on February 15, 2010 7:04 PM

I like how people say "although the original idea was to...", as if the original idea were not significant.

It actually was: the silly speed war has dragged the whole field in a misguided direction.

The cheating should never have been encouraged, just like WebKit racing to pass Acid3 with targeted implementations of selected items.

A culture of cheating collectively wastes resources, with everybody racing to find ways of getting high scores.

The RDF, trademark of Apple, is collectively lowering the intelligence of the general population, if you ask me.

Posted by: chris on February 15, 2010 8:08 PM

It's clear to me that the only ones who can write a really good performance test are browser developers themselves. Take the most successful current benchmark, Sunspider: it was created by the Apple JS team focusing on areas they wanted to improve in their VM, and thus had the credibility of being unbiased. Unfortunately there are no other really good benchmarks, and so everyone holds up Sunspider as a measure of overall browser performance because they have no other good measure.

There is no really good DOM benchmark (Dromaeo has too many weak points, and unfortunately now contains Sunspider and V8 tests), and a number of other parts of the engine, like layout and graphics, are barely tested at all. (Although Peacekeeper has a graphics test now, I think.)

The only way I see to stop people from using onload is the creation of a really good browser benchmark.

Posted by: Jordan on February 16, 2010 12:25 AM

Jordan, Sunspider isn't as good as you seem to think it is (starting with bugs in it that are fixed in the VCS version but not the public version and going through limited coverage of the commonly-used language features, non-deterministic test behavior, and somewhat unrealistic running conditions), though it's obviously better than most of the garbage that passes as benchmarks on the web. You're correct that Dromaeo is not all that much better.

The fundamental problem with creating benchmarks, which I keep meaning to blog about, is that any method of reducing "performance" in a very heterogeneous environment to a single number leads to situations where it's easy to game the setup as well as situations where the benchmark scores mean less and less as browsers work on optimizing to that benchmark....
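As a toy example of how reducing everything to a single number hides per-test behavior (the subtest times below are made up, and the summing harness is just a Sunspider-style sketch):

```javascript
// Summing subtest times into one headline number lets a large
// regression in one area be masked by gains elsewhere.
function totalTime(times) {
  return times.reduce(function (sum, t) { return sum + t; }, 0);
}

var before = [100, 100, 100]; // three subtests, in ms (made-up numbers)
var after  = [150, 60, 80];   // one subtest regressed 50%, two improved

totalTime(before); // 300
totalTime(after);  // 290 - the headline number "improves" anyway
```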

What it comes down to is that creating a benchmark that's actually _good_ is a pretty significant time investment; probably several man years' worth of work at least for any category as broad as "DOM" or "Javascript". Assuming it can be done at all.

I welcome your thoughts on what you think a DOM or browser benchmark "should" measure, of course! Gathering such data is the first step to creating a reasonable benchmark.

Posted by: Boris on February 16, 2010 12:36 AM

Well, I think Sunspider is very good for its original intended purpose - the problem is that it's not at all a good measure of overall browser performance, or even of just Javascript performance. I certainly don't think it's perfect, but I hope you'll agree that perfection is an impossible goal in a subject as complex as evaluating the performance of a system like a browser. The problem is that without anything better, everyone uses it for marketing (how else can a browser claim to be 'faster' than something else?), and thus the press uses it, and thus marketing leans on it even more, and so on. And so it gets all of the attention.

(Sidebar: a truly horrendous example of what the press will try on their own is Betanews' benchmark suite. Their numbers can show 10% changes between security releases; particularly sad was the story lamenting the performance decrease between the final RC of FF3.6 and the final release.)

The vacuum of good performance benchmarks in a browser war driven by performance means the decent ones out there get overused. So I think Sunspider is a victim of its own success: every browser does so well on it that we really need to move on to something new, and since there isn't anything new, we see regexp JITs instead of more attention paid to the DOM.

I completely agree that making a good performance test is incredibly hard. In fact, that is my main point - I'm trying to say that a good performance benchmark isn't going to just appear and take the focus off flawed benchmarks, or off those that have had too much attention paid to them. Basically, if you want performance to be judged more fairly, a browser vendor like Mozilla is going to have to do the work. My contention is that nobody besides a browser developer is going to be able to make a really great browser performance test, and one of my points in bringing up the success of Sunspider is that it did come from a browser developer.

I'm certainly not qualified to give you advice on how to test performance. But since you asked, I suppose I'll take Hyatt's suggestion of "when a page is displayed and able to be interacted with" - though getting a number out of that is tough, and it doesn't touch trying to evaluate subsystems like the DOM, etc.

Posted by: Jordan on February 16, 2010 4:07 AM