The Inside Track on Firefox Development.
April 26, 2006
A Journey Through Feed Handling
One of the areas of focus for us for Firefox 2 is Feed Handling. With this release we are seeking to make feeds more useful to more users, and along the way to that goal improve on some of the shortcomings of Firefox 1.x. My plans are outlined in this newsgroup posting.
This post, however, is not about UI design but about implementation. This has been a very interesting journey so far, and I’ve learned a lot about our networking APIs in the process. Thanks much to darin, bz and biesi for the help.
Here I’m going to focus on the first of two interesting aspects of the requirements described in my newsgroup posting: showing a display page when feed links are loaded.
Towards the end of Firefox 1.5, a prototype feed pretty printer landed. It had many problems and was removed. The approach was a hack: it observed every page load and tried to guess whether the content was a feed. It often guessed wrong because of the various types feeds are (incorrectly) served as, and it was jarring to use, since it appeared only after the feed document had initially loaded and potentially displayed some content to the user.
For Firefox 2, I wanted to approach this from a different angle. I wanted to integrate this well, using the APIs exposed by our system, for clean code, but also to prove that it could be done.
Content Sniffing
The problem with detecting feeds is that many feeds are served with incorrect or overly generic Content-Types. Some are served as text/html, which is clearly wrong, but others are served as application/xml or text/xml, which is not incorrect, just not specific enough. We can’t attempt to parse every candidate Content-Type as a feed just to see if it is one, since that would significantly impact our performance. We also can’t restrict ourselves to the dedicated feed types, since that would miss a lot of real feeds and so still give the wrong answer.
So, what we needed to do was implement a content sniffer. Biesi added support to nsBaseChannel and nsHttpChannel for third party content sniffers to be consulted during document loads. By adding an entry under the net-content-sniffers category, a component implementing nsIContentSniffer can be asked, during any load that includes the flag LOAD_CALL_TYPE_SNIFFERS (i.e. any load within a docshell or frame), for its take on what the Content-Type of the document is.
The content sniffer is given a chunk of the data. It was here that I discovered that data can be compressed with Content-Encoding: gzip, and that the sniffer runs at a low enough level that the data has not yet been decompressed. So I had to invoke another stream converter on the compressed data to get the real data out.
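To illustrate the decompression step, here is a minimal Python sketch. It is not the Gecko stream converter, which this post does not show. It assumes the sniffer only sees the leading bytes of the response, so the gzip stream is usually truncated; `peek_decoded` is a hypothetical helper name.

```python
import zlib
from typing import Optional


def peek_decoded(chunk: bytes, content_encoding: Optional[str]) -> bytes:
    """Return whatever decoded bytes can be recovered from a leading chunk.

    Hypothetical helper: the name and signature are illustrative only.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding not in ("gzip", "x-gzip"):
        return chunk  # not compressed: sniff the bytes as they are

    # MAX_WBITS | 16 makes zlib expect a gzip header instead of a zlib one.
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    try:
        # A streaming decoder tolerates a truncated stream and simply
        # returns the bytes it has been able to decode so far.
        return decoder.decompress(chunk)
    except zlib.error:
        return b""  # the header claimed gzip but the bytes are not gzip
```

A streaming decoder is used rather than a one-shot call because a one-shot call would reject an incomplete stream, and the sniffer never gets the whole document.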
At this point, I could run some checks against the data, based on the heuristic defined in Microsoft’s Windows RSS Publisher’s Guide. (For web content interoperability, copying is good). If the data looked to be feed content, I informed the caller that the content type was really a special type, application/vnd.mozilla.maybe.feed.
I also take care not to coerce the content to my special type when the URI is loaded by view-source, since doing so would break syntax highlighting.
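The shape of such a sniffer can be sketched in a few lines of Python. This is only an illustration: the marker patterns, the 512-byte window and the function name are assumptions of mine, not the heuristic from the Publisher's Guide and not the code that landed in Firefox. Only the special type name and the view-source exception come from the description above.

```python
import re
from typing import Optional

MAYBE_FEED_TYPE = "application/vnd.mozilla.maybe.feed"
SNIFF_WINDOW = 512  # assumed number of leading bytes to inspect

# Illustrative markers only; a real heuristic needs more care
# (namespace prefixes, other encodings such as UTF-16, and so on).
_RSS_ROOT = re.compile(rb"<rss[\s>]")
_ATOM_ROOT = re.compile(rb"<feed[\s>]")
_RDF_ROOT = re.compile(rb"<rdf:RDF[\s>]")
_RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_RSS1_NS = b"http://purl.org/rss/1.0/"


def sniff_feed_type(uri: str, data: bytes) -> Optional[str]:
    """Return the special feed type if data looks like a feed, else None."""
    if uri.startswith("view-source:"):
        return None  # never coerce the type for view-source loads

    head = data[:SNIFF_WINDOW]
    if _RSS_ROOT.search(head) or _ATOM_ROOT.search(head):
        return MAYBE_FEED_TYPE
    # RSS 1.0 is RDF, so require both namespaces before calling it a feed.
    if _RDF_ROOT.search(head) and _RDF_NS in head and _RSS1_NS in head:
        return MAYBE_FEED_TYPE
    return None
```

Returning None means "no opinion", which leaves the Content-Type the server sent untouched.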
Stream Converters
Gecko handles various types internally, but my maybe.feed type wasn’t one of them. I needed to find a way of saying that content with that type should be loaded in the browser window, but with some special modifications. I needed to implement a stream converter, and register it somehow. After some time spent messing around with the browser’s content listener implementation in browser.js, I found out that that code wasn’t really used at all and should be gotten rid of.
I had a conversation with Boris and he described the way the URILoader handles unknown types. It’s a six step process that involves a couple of different interfaces and begins in nsDocumentOpenInfo::DispatchContent. This function handles figuring out where content is loaded and what happens to it.
The first check is to see if the content must be handled by a helper application, as specified by a Content-Disposition response header. If this is set, the entire process below is short-circuited and we skip directly to step 6, handling with the External Helper Application Service (henceforth referred to as “EHAS”).
- First, the DocShell’s URIContentListener is given a chance to handle it. This takes care of types internal to Gecko. If the DSURIContentListener wants to handle it, then DispatchContent returns successfully. We are not an internal type, so we do not want to register our feed handler here.
- Second, the list of URIContentListeners held by the URILoader is enumerated, and each is asked if it wants to handle the content. These content listeners have no reference back to the DocShell that was handling the load, so this is not useful to us either, as we cannot output anything that the user will be able to see.
- Third, the set of URIContentListeners registered using the “external-uricontentlisteners” category is tried. This step has the same limitation as the second step, no access to the DocShell, and is thus useless to us for the feed case.
- The fourth step is to see if there is a component implementing nsIContentListener registered under the @mozilla.org/uriloader/content-handler;1?type=<type> contract id. If there is one, its handleContent method is invoked. This is not useful to us.
- The fifth step is the one that works for us. If nothing so far has handled the content, the URILoader tries to find a stream converter from the unknown type to the “wildcard” type, */*. If one exists, it will construct it and use it to display data in the DocShell. Voilà!
- For the interested, the sixth step, should the previous five have failed, is to hand the process over to the External Helper App Service for handling by a desktop application or download to disk.
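The steps above amount to a first-match-wins chain. Here is a small Python model of that control flow. It is a sketch of the order of checks as described, not of nsDocumentOpenInfo::DispatchContent itself; the step names, the predicate signatures and the sample type tables are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

EHAS = "external helper app service"
Step = Tuple[str, Callable[[str], bool]]


def dispatch_content(content_type: str,
                     forced_to_helper: bool,
                     steps: Sequence[Step]) -> str:
    """Return the name of the first step that accepts content_type."""
    if forced_to_helper:
        # Content-Disposition asked for a helper app: skip straight to step 6.
        return EHAS
    for name, accepts in steps:
        if accepts(content_type):
            return name
    return EHAS  # step 6: nothing else wanted the content


# Illustrative tables only; the real sets live inside Gecko.
INTERNAL_TYPES = {"text/html", "text/xml", "application/xml"}
CONVERTERS = {("application/vnd.mozilla.maybe.feed", "*/*")}

STEPS: Sequence[Step] = (
    ("docshell content listener", lambda t: t in INTERNAL_TYPES),
    ("uriloader content listeners", lambda t: False),
    ("category content listeners", lambda t: False),
    ("content handler component", lambda t: False),
    ("stream converter to */*", lambda t: (t, "*/*") in CONVERTERS),
)
```

With these tables, `dispatch_content("application/vnd.mozilla.maybe.feed", False, STEPS)` falls through the first four steps and is picked up by the stream converter step.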
So, I wrote a component implementing nsIStreamConverter and registered it for the conversion from my maybe.feed type to */*. Sure enough, it was invoked. In my asyncConvert implementation, I cache a reference to the nsIStreamListener that is passed as an argument. This is the DocShell’s listener, and it is needed later to display the XUL page.
Parsing Data
With bug 325080, Robert Sayre has been working on an all-purpose multi-format Feed parser for Mozilla applications. Part of the API exposed through nsIFeedProcessor is the ability to asynchronously funnel data through it. In my nsIStreamListener methods on my Stream Converter, I forward the data on to the Feed Processor, which does a lot of magic to abstract away the differences between the various feed formats.
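The flow through the converter can be modelled roughly as follows. This Python sketch only mirrors the shape described here (cache the listener, forward each chunk, collect a result at the end). The class and method names are hypothetical, and the toy processor stands in for the real feed parser, whose API is not shown in this post.

```python
from typing import List, Optional


class ToyFeedProcessor:
    """Stand-in for the real feed parser: it only collects the bytes."""

    def __init__(self) -> None:
        self._chunks: List[bytes] = []

    def feed(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def finish(self) -> bytes:
        return b"".join(self._chunks)


class FeedConverter:
    """Hypothetical converter from the maybe.feed type to the wildcard type."""

    def __init__(self, processor: ToyFeedProcessor) -> None:
        self._processor = processor
        self.cached_listener: Optional[object] = None

    def async_convert_data(self, from_type: str, to_type: str,
                           listener: object) -> None:
        # Keep the DocShell's listener; it is needed later for the preview.
        self.cached_listener = listener

    def on_data_available(self, chunk: bytes) -> None:
        # Forward raw feed bytes to the parser instead of the DocShell.
        self._processor.feed(chunk)

    def on_stop_request(self) -> bytes:
        # All data has arrived: ask the parser for its result.
        return self._processor.finish()
```

The point of the design is that the DocShell never sees the raw feed bytes; it only receives whatever the converter later sends to the cached listener.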
Opening the Preview Page
Once the data is parsed, I open a channel to the XUL page using the cached DocShell listener above. This results in the page being loaded in the browser window. I set the originalURI property of the chrome channel to the URI of the feed, which makes sure that the location bar shows the URI of the feed, not the URI of the XUL page.
From the Preview Page, I needed to be able to get at the feed data so that I could show the title, etc. However the feed was parsed in the stream converter, and there was no conduit between the converter and the resulting XUL document. So I used a singleton service for registering parsed feed documents and used that to effectively pass the parsed feed over to the XUL document. One problem remained – if I register a parsed nsIFeedResult with my result service for a specific URI, how do I get that result again from the XUL document, if I don’t know the URI anymore?
The truth was amusing – I did know the URI. Since I set the originalURI property to be the URI of the feed document, window.location.href in the XUL document actually pointed to the feed URI. So I was able to pass that to my result service to get the parsed result.
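The hand-off can be pictured with a short Python sketch. Everything here is a simplified model under my own naming: `FeedResultService`, `Channel`, `open_preview` and the chrome URL are hypothetical, and a plain dictionary stands in for whatever the real service stores.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical address of the preview page; the real URL is not given here.
PREVIEW_PAGE = "chrome://example/content/feed-preview.xul"


@dataclass
class Channel:
    uri: str           # what is actually loaded (the preview page)
    original_uri: str  # what the page reports as its location


class FeedResultService:
    """Registry of parsed feeds, keyed by feed URI."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def add_result(self, feed_uri: str, result: object) -> None:
        self._results[feed_uri] = result

    def get_result(self, feed_uri: str) -> Optional[object]:
        return self._results.get(feed_uri)


RESULT_SERVICE = FeedResultService()  # single shared instance


def open_preview(feed_uri: str, parsed_feed: object) -> Channel:
    """Converter side: register the result, then load the preview page."""
    RESULT_SERVICE.add_result(feed_uri, parsed_feed)
    return Channel(uri=PREVIEW_PAGE, original_uri=feed_uri)


def lookup_from_page(channel: Channel) -> Optional[object]:
    """Page side: the reported location is the feed URI, so use it as key."""
    location_href = channel.original_uri
    return RESULT_SERVICE.get_result(location_href)
```

Because the key is the same URI on both sides, the converter and the page never need a direct reference to each other.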
That’s A Wrap
And that’s it. There are many details involved in the construction of the XUL page, and the various handling options available to feeds, but those are outside the scope of this post. I just wanted to share my experiences using the Gecko APIs for content handling. With a few tweaks to support content sniffing and some bug fixes elsewhere, this entire system was developed in the application layer, meaning that an extension could implement this kind of thing if it wanted. That’s impressive flexibility!
Later on, I’ll talk about how I implemented support for handling feeds with web services.
Posted by ben at April 26, 2006 6:29 PM
Comments
Please tell me you're not using exactly the same sniffing heuristic as Microsoft. They won't recognise RSS 1.0 feeds with a content-type of application/rdf+xml (the RECOMMENDED content-type for RDF Site Summary). They won't recognise Atom feeds using a namespace prefix. They won't recognise anything encoded as UTF-16 unless it uses application/atom+xml or application/rss+xml as the content-type. I'm sure Firefox can do better than that.
Posted by: James Holderness at April 27, 2006 1:57 AM
James,
Do you have some test cases where IE7 falls down?
Posted by: Ben at April 27, 2006 6:41 AM
Am I right in thinking that none of this stuff is used when something uses the right MIME type? That it's an RSS quirks mode?
Posted by: ant at April 27, 2006 7:21 AM
Correct.
Posted by: Ben at April 27, 2006 7:30 AM
Shouldn't the content type of "suspected" feeds be something like application/vnd.mozilla.maybe.feed+xml? After all, it's still XML.
Posted by: Daniel Schierbeck at April 27, 2006 9:08 AM
I have a number of test cases that should fail based on the heuristic Microsoft describes in their Publisher's Guide, however I haven't tested on the latest version. The version I have installed obviously uses a different algorithm as it's quite capable of sniffing feeds with a content-type of application/rdf+xml, but seems to fail on anything with a content-type of text/xml.
If you want I can email you some test feeds.
Posted by: James Holderness at April 27, 2006 12:48 PM
What should Firefox do if you go to http://www.mnot.net/ and click on the orange feed icon in the top right?
It is a valid RSS 1.0 feed. Served using the MIME type recommended by the RSS 1.0 spec. And contains all the tell-tale marker strings in the first 512 bytes.
IE7 beta 2 does not recognize it as a feed, and asks you where you want to download it.
Posted by: Sam Ruby at April 28, 2006 3:54 AM
Sam: Cool! I just tested that in my build, and it detected it as a feed... not sure what IE7's doing differently.
Posted by: Ben at April 28, 2006 9:06 AM
How about this:
http://www.kanban.ru/rss/new365.xml
It's a valid RSS 2.0 feed. Served using the MIME type recommended by Dave Winer (the spec itself doesn't recommend anything).
Will Firefox recognise that as a feed?
Posted by: James Holderness at April 29, 2006 11:14 PM
i am sorry, i did not know how else to contact!
suggestion:
allow firefox to hide the File/edit/view (etc) bar of the browser, such as the new version of IE allows.
It allows for a nice increase of space in the browser window. I do not think many people spend too much time using those options up there- so it would be nice to toggle them off during times when we do not use them, as to allow more browser space!
Posted by: mike at April 30, 2006 1:05 AM
©1997-2006 Ben Goodger. All Rights Reserved.
Opinions expressed here are my own, and not those of any organization that I may be affiliated with.
