I've discovered an interesting HTML parsing foible. HTML entities, such as &amp;, are used to encode special and extended characters in HTML. But, as web page authors are sloppy, both IE and Firefox attempt to "do the right thing" when faced with unterminated (no semicolon) HTML entities like:
"I used Sun's java compiler"
However, if HTML sanitisers don't do exactly the same thing when decoding, you may be able to slip some script past them. So, for example:
<img src="javascript:alert('Oops');">
could be the name of a valid image on the webserver. However, IE and Firefox will decode it to:
<img src="javascript:alert('Oops');">
IE executes the JavaScript immediately; in Firefox you need to do a View Image before it runs (not sure why). Perhaps this isn't a big deal, but I have no idea whether any commonly-used HTML sanitisers would fall for this one...
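To make the mismatch concrete, here is a minimal Python sketch of the kind of gap described above. It is not modelled on any real sanitiser: `strict_decode`, `lenient_decode`, `looks_like_script_url` and `encode_unterminated` are hypothetical names, and the decimal references are just one assumed way of writing the payload. The lenient side uses the standard library's `html.unescape`, which expands numeric references even when the semicolon is missing.

```python
import html
import re

# Hypothetical strict decoder: only expands references that end in ';'.
_TERMINATED_REF = re.compile(r"&#(\d+);")


def strict_decode(value: str) -> str:
    """Expand only semicolon-terminated decimal character references."""
    return _TERMINATED_REF.sub(lambda m: chr(int(m.group(1))), value)


def lenient_decode(value: str) -> str:
    """Expand references the way a forgiving parser would (';' optional)."""
    return html.unescape(value)


def looks_like_script_url(value: str) -> bool:
    """Naive check a sanitiser might apply after decoding."""
    return value.strip().lower().startswith("javascript:")


def encode_unterminated(text: str) -> str:
    """Write each character as a decimal reference with no trailing ';'."""
    return "".join(f"&#{ord(ch)}" for ch in text)


# Assumed payload: the scheme is hidden behind unterminated references.
payload = encode_unterminated("javascript:") + "alert('Oops');"

if __name__ == "__main__":
    print(looks_like_script_url(strict_decode(payload)))   # False
    print(looks_like_script_url(lenient_decode(payload)))  # True
```

The point of the sketch is only that the check passes or fails depending on which decoding rules run before it, so a sanitiser would need to decode at least as leniently as the browsers it protects.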
Posted by gerv at February 22, 2005 01:51 PM
One of the first bugs I found had a very unique way of interpreting something similar:
https://bugzilla.mozilla.org/show_bug.cgi?id=188278
A crash.
Posted by: Robert Accettura at February 22, 2005 02:13 PM
Oddly enough, Firefox executes it immediately if you use it in a data: URL, but only one time per tab. Also, if I try to put an if statement in to restrict it to Internet Explorer only, I get an uncaught exception in the JS console...
I think this will turn up a lot of very obscure bugs, if anyone bothers to check.
Posted by: dolphinling at February 22, 2005 03:38 PM
Nitpicking: Those are not entities but (numeric) character references.
SGML/HTML allows the omission of the semicolon in certain situations. I don't have the Handbook here, but this might be one of the situations where it is permissible to omit the semicolon.
Posted by: Henri Sivonen at February 22, 2005 05:11 PM
Henri,
There's something about that in the HTML 4.01 specification:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
Posted by: Evan Nemerson at February 23, 2005 06:24 AM
I frankly don’t see the point; whether or not the refc delimiter (reference close, usually ‘;’) is used, the attribute value passed to the script engine by the markup parser should be the same. If you wanted an ampersand in a path component you would have to escape it just as you would have to in a query component.
As to parsing rules, SGML delimiters are context sensitive, which seems to cause a lot of confusion. In your example, the context is unambiguous[0] and the character references (I wouldn’t file the distinction from entity references under nitpicking, so &deity help me) can be expanded.
[0] The quoted W3C prose is by no means unambiguous, I’m afraid, especially in gaining the general character reference potpourri noise – e.g. for a character reference the “middle of a word” would have to be a function name while this particular part of that note is quite likely twaddling about entity references (who would want to use a function character “in the middle of a word” ;-). Just in the idle hope to make someone’s head explode, the mentioned scenario “at a line break” is a special case in itself and does not mean that the refc delimiter is even omitted. Look out:
In the more common case
Hello world
the character reference ends at the space character and should be parsed as
Hello world
while this case
Hello
world
should be parsed as
Helloworld
because in this context the line break is a refc delimiter too (RE, ‘record end’) and as such absorbed.
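The two cases above can be sketched in Python. This is only an illustration of the rule as stated in this comment, not an SGML parser: `expand_sgml_style` is a hypothetical name, and `&#33` (an exclamation mark) is an assumed stand-in for whatever reference one wants to test. A ";" or a line break directly after the digits is treated as the refc delimiter and absorbed; any other character, such as a space, merely ends the reference and stays in the output.

```python
import re

# Digits, then an optional reference close: ';' or a record end (line break).
# The optional group is consumed with the reference; anything else is not.
_CHAR_REF = re.compile(r"&#(\d+)(;|\r?\n)?")


def expand_sgml_style(text: str) -> str:
    """Expand decimal character references, absorbing ';' or a line break."""
    return _CHAR_REF.sub(lambda m: chr(int(m.group(1))), text)


if __name__ == "__main__":
    print(repr(expand_sgml_style("Hello&#33 world")))    # 'Hello! world'
    print(repr(expand_sgml_style("Hello&#33\nworld")))   # 'Hello!world'
    print(repr(expand_sgml_style("Hello&#33;\nworld")))  # 'Hello!\nworld'
```

The third call shows the contrast: with an explicit ";" the line break is no longer the delimiter, so it survives as ordinary data.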
Posted by: Eric Bednarz at February 23, 2005 11:47 PM