
Re: XML for linguists?

From: David G. Durand <david@...>
Date: Friday, November 19, 1999, 0:44
>On Wed, 10 Nov 1999, Don Blaheta wrote:
>> and this would in fact resolve one or two infelicities in the system
>> having to do with null constituents (traces and the like). The problem
>> is, even though this is much better for all the reasons XML usually is,
>> it wouldn't be accepted because it would triple or quadruple the size of
>> the corpus, for no "obvious" gain.
>
>Well, it needn't - I wouldn't store the texts in XML form, but in a
>relational database. XML texts can be readily mapped to a normalized
>database, and then take far less space, and they can be extracted and
>put into a DOM form just as easily as if they would be read from
>a text file in xml format.

This is true, but if you have highly recursive structures in your XML (like nested clauses), relational representations are provably inferior, since they cannot easily express notions like "ancestor" or "descendant." In particular, the various ways that these notions can be encoded are either difficult or impossible to edit efficiently, or are difficult to query. Some academic projects have extended relational theory with appropriate fixed-point operators in their query languages, but this is definitely still research, not common practice.

My experience is that it's not too hard to represent the surface structure of an XML document in a database, or of a database in an XML document, but that they offer rather different data models and processing possibilities.

Of course, hybrid approaches are often the best, as relational engines are often good at flexibly representing linking structures, while XML is better at representing the static structure of a single document or piece of information. When XLink is finally issued, I expect people to start building interesting systems based on structured text in XML using XML tools, combined with relational engines to handle much (but not all) of the heavy lifting of link management.

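To make the ancestor/descendant point concrete, here is a minimal sketch in Python, assuming a toy parse tree and an adjacency-list table (one row per element, with a pointer to its parent). The table name `node`, its columns, and the sample XML are all made up for illustration. Asking for the descendants of a node needs a recursive query on the relational side (written here with SQLite's `WITH RECURSIVE`, which plays the role of the fixed-point operator mentioned above), while on the XML side it is a single tree traversal.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical toy parse: a sentence whose VP contains a nested clause.
XML = "<S><NP/><VP><V/><S><NP/><VP><V/></VP></S></VP></S>"


def load_tree(conn, root):
    """Store each element as a row (id, parent_id, tag): an adjacency list."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT)"
    )
    counter = 0

    def walk(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter  # ids are assigned in document (pre-order) order
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?)", (node_id, parent_id, elem.tag)
        )
        for child in elem:
            walk(child, node_id)

    walk(root, None)


def descendants_sql(conn, node_id):
    """Tags of all descendants of node_id, via a recursive (fixed-point) query."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id, tag) AS (
            SELECT id, tag FROM node WHERE parent_id = ?
            UNION ALL
            SELECT n.id, n.tag FROM node AS n JOIN sub ON n.parent_id = sub.id
        )
        SELECT tag FROM sub ORDER BY id
        """,
        (node_id,),
    )
    return [tag for (tag,) in rows]


def descendants_xml(elem):
    """The same question asked of the tree itself: one traversal call."""
    return [e.tag for e in elem.iter() if e is not elem]


root = ET.fromstring(XML)
conn = sqlite3.connect(":memory:")
load_tree(conn, root)

outer_vp = root.find("VP")        # the VP that contains the nested clause
sql_result = descendants_sql(conn, 3)   # row 3 is that VP in document order
xml_result = descendants_xml(outer_vp)
print(sql_result)
print(xml_result)
```

Both calls list the same elements. The difference is that a plain join can only reach a fixed number of levels, so without the recursive step the relational query has to know in advance how deep the clauses nest.
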
>> Also, there is some question of what
>> level of information to put into the tag name and how much to leave in
>> the arguments. That is,
>>   <constituent type="SINV" function="ADV">
>> or
>>   <constituent type="S" subtype="INV" function="ADV">
>> or
>>   <S subtype="INV" function="ADV">
>> or
>>   <SINV function="ADV">
>> ? In any case, I'll ask my advisor (and some other people around here)
>> to see if any work in this direction has been done.
>
>Yes, that's one of my quandaries (if that's the word I want), too. If I
>normalize everything then I don't keep anything between the opening and
>closing tags...

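One practical way to compare the four encodings quoted above is to look at how the same question ("find the inverted sentences") has to be phrased against each. The sketch below is only an illustration in Python using the standard library's limited XPath support; the `<corpus>` wrapper is hypothetical, and the quoted open tags are written as empty elements so that each sample is well-formed.

```python
import xml.etree.ElementTree as ET

# Each entry: (sample document, path that selects the inverted sentence).
# The <corpus> wrapper element is an assumption made for this example.
ENCODINGS = {
    "generic tag, fused type": (
        '<corpus><constituent type="SINV" function="ADV"/></corpus>',
        ".//constituent[@type='SINV']",
    ),
    "generic tag, split type": (
        '<corpus><constituent type="S" subtype="INV" function="ADV"/></corpus>',
        ".//constituent[@type='S'][@subtype='INV']",
    ),
    "category as tag": (
        '<corpus><S subtype="INV" function="ADV"/></corpus>',
        ".//S[@subtype='INV']",
    ),
    "fused category as tag": (
        '<corpus><SINV function="ADV"/></corpus>',
        ".//SINV",
    ),
}


def count_matches(xml_text, path):
    """Number of elements in xml_text selected by the given path."""
    return len(ET.fromstring(xml_text).findall(path))


counts = {name: count_matches(doc, path) for name, (doc, path) in ENCODINGS.items()}
for name, n in counts.items():
    print(f"{name}: {n} match(es) for {ENCODINGS[name][1]}")
```

The more information moves into the tag name, the shorter the query gets, but the harder it becomes to ask for "any sentence, whatever its subtype" without listing every fused name.
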
This is an old argument in the SGML community; the best thing would be to read some good books on DTD design. Of course, I'm not sure what they are, having learned the hard way. I find Maler and El-Andaloussi to be the most useful (sorry, the title is missing from my teeny neural buffer, but El-Andaloussi is unusual enough that it can't be hard to find).

The public parts of the W3C's XML archives have some good discussion of this point, around the issue of whether attributes should have been included in XML at all.

  -- David
_________________________________________
David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com
http://www.cs.bu.edu/students/grads/dgd/ \ Director of Development
Graduate Student no more! \ Dynamic Diagrams
--------------------------------------------\ http://www.dynamicDiagrams.com/
MAPA: mapping for the WWW \__________________________