February 3, 2006

GuessDTD

I notice that the wmlbrowser extension for Firefox has a problem; some WML sites render better if wmlbrowser has access to the WML DTD, but wmlbrowser can't ship it for licensing reasons.

That got me thinking: surely it's possible, particularly for XML-based languages where conformance to the schema is a requirement, to reverse engineer the contents of the schema if you have enough documents which conform to it? Or, at least, you could make a good guess.

For example, if the root element is always <wml>, you could guess that as the root. Likewise, if an element only ever contained children from a given list, or a particular element only ever appeared once, you could infer those constraints too, and so on. Is this feasible? If so, has anyone already written "guess-schema"?
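To make the idea concrete, here is a rough sketch of what such guessing might look like. It is not an existing tool: the name guess_schema and the sample documents are made up for illustration. It only records which root elements were seen and, for each element, which children appeared and the most times each appeared under a single parent.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def guess_schema(documents):
    """Guess root names and per-element child constraints from sample documents.

    Hypothetical helper, for illustration only. Returns (roots, children) where
    children maps a parent tag to {child tag: max occurrences under one parent}.
    """
    roots = set()
    children = defaultdict(dict)
    for text in documents:
        root = ET.fromstring(text)
        roots.add(root.tag)
        # Visit every element and count its direct children by tag.
        for parent in root.iter():
            counts = Counter(child.tag for child in parent)
            seen = children[parent.tag]
            for tag, n in counts.items():
                seen[tag] = max(seen.get(tag, 0), n)
    return roots, dict(children)


# Made-up sample documents standing in for a real corpus of WML pages.
docs = [
    "<wml><card><p>Hello</p></card></wml>",
    "<wml><head/><card><p>One</p><p>Two</p></card><card/></wml>",
]
roots, children = guess_schema(docs)
print(roots)
print(children)
```

A maximum count of 1 suggests "appears at most once", and anything higher suggests the element is repeatable. A guess like this is only as good as the sample: it says nothing about ordering or attributes, and it cannot tell "never allowed" from "never seen".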

Posted by gerv at February 3, 2006 12:42 PM
Comments

I know Trang can infer a RELAX-NG schema given a set of conforming documents. I think it can output DTDs as well.

Posted by: Ted Mielczarek at February 3, 2006 2:05 PM

Why couldn't the extension just load the DTD off the web? It is on the web somewhere, isn't it?

Posted by: Benjamin Smedberg at February 3, 2006 2:26 PM

The XML editor in Eclipse's WTP project has an option to infer the schema from the current document to provide content assist. I have only briefly used it but it seems to work.

Posted by: Mossop at February 3, 2006 2:32 PM

bsmedberg: I believe it's the other side of a click-to-accept licence agreement. So wmlbrowser has an option to take you to the site to agree to the terms. But it's all a bit obnoxious.

You'd need to see the site for exact details.

Posted by: Gerv at February 3, 2006 2:48 PM

The extension can load the DTD off the web. I made it so that you have to tick a box saying that you accept the terms and conditions, which is a pain (you have to open the options window first), but I think that covers the bases legally.

Technically I only need the DTDs for the entity declarations (&nbsp; etc.); it's not the schema I care about at all. So perhaps I should just ship with a "fake" DTD containing only the entities.

The other obstacle is that DTDs have to be stored in browser chrome, not in user profiles. I guess I should raise a bug on this (and maybe even try to fix it).

Matthew (wmlbrowser author)

Posted by: Matthew Wilson at February 3, 2006 4:04 PM

Trang is the only tool I have found that will do it all. Unfortunately (or fortunately), it's in Java:
http://thaiopensource.com/relaxng/trang.html


Posted by: Shane Caraveo at February 3, 2006 7:05 PM

Ah, look at that, if only I paid closer attention to the first post :)

Posted by: Shane Caraveo at February 3, 2006 7:06 PM

"The extension can load the DTD off the web. I made it so that you have to tick a box saying that you accept the terms and conditions, which is a pain (you have to open the options window first), but I think that covers the bases legally."

That is strange: if a DTD reference is provided, any validating XML parser will automatically retrieve it! So how do they suppose that would work?


~Grauw

Posted by: Laurens Holst at February 5, 2006 1:48 AM

Why does the WML browser need the DTD? As far as I can tell, the only things in the DTD that could make the lack of the DTD a problem are the entity definitions for nbsp and shy. Creating a pseudo-DTD for two entity definitions is not difficult. (However, I consider DTD-based entities harmful in the Web context. When Mozilla gave in to XHTML entities, other browsers had to follow, too, which runs against the idea that interactive browsers that have non-validating parsers were supposed to be relieved from processing DTDs. If Mozilla had firmly rejected the entities, perhaps the XML DTD-based character entities could have been eradicated on the Web.)
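As a sketch of how small such a pseudo-DTD could be, the snippet below declares only nbsp and shy in an internal DTD subset and parses a made-up WML fragment with Python's standard parser. The names PSEUDO_DTD and parse_wml are hypothetical, and this illustrates the entity mechanism only, not how wmlbrowser or Mozilla actually load DTDs. It also assumes the document body carries no XML declaration or DOCTYPE of its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity-only "pseudo-DTD": no element declarations at all,
# just the two character entities, given as numeric character references.
PSEUDO_DTD = (
    "<!DOCTYPE wml [\n"
    '  <!ENTITY nbsp "&#160;">\n'
    '  <!ENTITY shy "&#173;">\n'
    "]>\n"
)


def parse_wml(body):
    """Parse a WML fragment with the entity-only DOCTYPE prepended.

    Assumes `body` has no XML declaration or DOCTYPE of its own.
    """
    return ET.fromstring(PSEUDO_DTD + body)


doc = parse_wml("<wml><card><p>a&nbsp;b&shy;c</p></card></wml>")
card_text = doc.find("card/p").text
print(repr(card_text))
```

Without the declarations, the same parser rejects the document with an "undefined entity" error, which is the failure the two entity definitions are there to avoid.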

Posted by: Henri Sivonen at February 5, 2006 9:49 AM