Re: XML for linguists?
From: David G. Durand <david@...>
Date: Friday, November 19, 1999, 0:44
>On Wed, 10 Nov 1999, Don Blaheta wrote:
>> and this would in fact resolve one or two infelicities in the system
>> having to do with null constituents (traces and the like). The problem
>> is, even though this is much better for all the reasons XML usually is,
>> it wouldn't be accepted because it would triple or quadruple the size of
>> the corpus, for no "obvious" gain.
>
>Well, it needn't - I wouldn't store the texts in XML form, but in a
>relational database. XML texts can be readily mapped to a normalized
>database, and then take far less space, and they can be extracted and
>put into a DOM form just as easily as if they were read from
>a text file in xml format.
This is true, but if you have highly recursive structures in your XML (like
nested clauses), relational representations are provably weaker: plain
relational algebra has no transitive closure, so it cannot directly express
notions like "ancestor" or "descendant" at arbitrary depth. The various
encodings that work around this are either difficult or impossible to edit
efficiently, or are difficult to query.
Some academic projects have extended relational theory with appropriate
fixed-point operators in their query languages, but this is definitely
still research, not common practice.
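
To make the ancestor/descendant point concrete, here is a minimal sketch,
not a recommendation. It stores a toy parse as a plain (id, parent, tag)
table and then asks for all descendants of a node. The table layout, the
element names, and the helper names load and descendant_tags are all my own
assumptions for illustration. The query leans on WITH RECURSIVE, which is a
fixed-point operator of the kind described above, and it assumes an SQLite
build that supports it (3.8.3 or later).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical toy parse with one clause nested inside another.
XML = ("<S><NP><N>dogs</N></NP>"
       "<VP><V>chase</V><S><NP><N>cats</N></NP></S></VP></S>")


def load(conn, root):
    """Store every element as (id, parent, tag): a plain adjacency list.

    Ids are handed out in document order (preorder traversal).
    """
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT)"
    )
    next_id = 0
    stack = [(root, None)]
    while stack:
        elem, parent_id = stack.pop()
        next_id += 1
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?)", (next_id, parent_id, elem.tag)
        )
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(elem)):
            stack.append((child, next_id))


def descendant_tags(conn, node_id):
    """Return the tags of all descendants of node_id, in document order.

    Without the recursive clause this would need one self-join per level
    of nesting, and the depth is not known in advance.
    """
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id) AS (
            SELECT id FROM node WHERE parent = ?
            UNION ALL
            SELECT node.id FROM node JOIN sub ON node.parent = sub.id
        )
        SELECT node.tag FROM node JOIN sub ON node.id = sub.id
        ORDER BY node.id
        """,
        (node_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

On the XML side the same question is a single tree walk, for example
list(elem.iter())[1:] in ElementTree. The adjacency list is easy to edit but
needs the recursive query; encodings that make the query easy (stored paths,
interval numbering) tend to make edits expensive instead.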
My experience is that it's not too hard to represent the surface structure
of an XML document in a database, or of a database in an XML document, but
that they offer rather different data models and processing possibilities.
Of course, hybrid approaches are often the best, as relational engines are
often good at flexibly representing linking structures, while XML is better
at representing the static structure of a single document or piece of
information.
When XLink is finally issued, I expect people to start building interesting
systems based on structured text in XML using XML tools, combined with
relational engines to handle much (but not all) of the heavy lifting of
link management.
>> Also, there is some question of what
>> level of information to put into the tag name and how much to leave in
>> the arguments. That is,
>> <constituent type="SINV" function="ADV">
>> or
>> <constituent type="S" subtype="INV" function="ADV">
>> or
>> <S subtype="INV" function="ADV">
>> or
>> <SINV function="ADV">
>> ? In any case, I'll ask my advisor (and some other people around here)
>> to see if any work in this direction has been done.
>
>Yes, that's one of my quandaries (if that's the word I want), too. If I
>normalize everything then I don't keep anything between the opening and
>closing tags...
This is an old argument in the SGML community; the best thing would be to
read some good books on DTD design. Of course, I'm not sure what they are,
having learned the hard way. I find Maler and El-Andaloussi to be the most
useful (sorry the title is missing from my teeny neural buffer, but
El-Andaloussi is unusual enough that it can't be hard to find).
The public parts of the W3C's XML archives have some good discussion of
this point, around the issue of whether attributes should have been
included in XML at all.
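
One practical way to weigh the four encodings quoted above is to see what a
query for "inverted sentences" looks like under each. The following is only
a sketch using Python's xml.etree.ElementTree. The <doc> wrapper, the
dictionary keys, and the helpers category and select_inverted are
hypothetical names of mine, and I am assuming that type plus subtype
concatenate to give the full label.

```python
import xml.etree.ElementTree as ET

# The four encodings from the quoted message, each wrapped in a
# hypothetical <doc> root so there is something to search under.
ENCODINGS = {
    "generic": '<doc><constituent type="SINV" function="ADV"/></doc>',
    "split": '<doc><constituent type="S" subtype="INV" function="ADV"/></doc>',
    "tag+attr": '<doc><S subtype="INV" function="ADV"/></doc>',
    "tag": '<doc><SINV function="ADV"/></doc>',
}

# The path needed to select inverted sentences under each encoding.
PATHS = {
    "generic": ".//constituent[@type='SINV']",
    "split": ".//constituent[@type='S'][@subtype='INV']",
    "tag+attr": ".//S[@subtype='INV']",
    "tag": ".//SINV",
}


def category(elem):
    """Recover the full label (e.g. 'SINV') whichever encoding was used."""
    base = elem.get("type", "") if elem.tag == "constituent" else elem.tag
    return base + elem.get("subtype", "")


def select_inverted(name):
    """Parse the named encoding and return the matching elements."""
    root = ET.fromstring(ENCODINGS[name])
    return root.findall(PATHS[name])
```

The more information sits in the tag name, the shorter the path and the more
a DTD can constrain the content of each category. The more sits in
attributes, the easier it is to ask for "any S, whatever the subtype"
without enumerating tag names.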
-- David
_________________________________________
David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com
http://www.cs.bu.edu/students/grads/dgd/ \ Director of Development
Graduate Student no more! \ Dynamic Diagrams
--------------------------------------------\ http://www.dynamicDiagrams.com/
MAPA: mapping for the WWW \__________________________