
Re: XML for linguists?

From: Charles <catty@...>
Date: Wednesday, November 10, 1999, 19:36
Don Blaheta wrote:

> I'll ask my advisor, but I don't think that there's an XML standard in
> the field yet. There is a parsing format standard, though, as initiated
> by the people at UPenn, generally known as the "Penn treebank format".
> It's a very lisp-y sort of format... here's a sample sentence:
>
> ( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
>      (VP (MD may)
>          (VP (AUX be)
>              (NP-PRD (NP (DT the)
>                          (NNP August)
>                          (NN merchandise)
>                          (NN trade)
>                          (NN deficit))
>                      (ADJP (JJ due) (ADVP (IN out)) (NP-TMP (NN tomorrow))))))
>      (. .)))
>
> Most of that should be pretty self-explanatory. Now, it's well nigh
> trivial to map this to something like
>
> <sentence>
>   <constituent type="S">
>     <constituent type="NP" function="SBJ">
>       <word type="DT">The</word>
>       <constituent type="ADJP">
>       ...
>
> and this would in fact resolve one or two infelicities in the system
> having to do with null constituents (traces and the like).
Cool! and thanks.
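
For anyone who wants to try the mapping Don describes, here is a minimal sketch in Python. It is not an established converter, only an illustration under some assumptions: the element names (sentence, constituent, word) and the type/function attributes are taken from Don's example, the function names tokenize, parse, to_xml and convert are made up for this post, and everything after the first hyphen in a label is treated as the function tag, which ignores trace indices and other refinements of the real treebank.

```python
import xml.etree.ElementTree as ET

# The sample sentence from the quoted message; whitespace is not significant.
SAMPLE = """
( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
     (VP (MD may)
         (VP (AUX be)
             (NP-PRD (NP (DT the) (NNP August) (NN merchandise)
                         (NN trade) (NN deficit))
                     (ADJP (JJ due) (ADVP (IN out)) (NP-TMP (NN tomorrow))))))
     (. .)))
"""


def tokenize(text):
    """Split bracketed text into '(', ')' and label/word tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse the node opening at tokens[pos]; return ((label, children), next_pos)."""
    assert tokens[pos] == "("
    pos += 1
    label = None  # the outermost wrapper has no label
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a word under a part-of-speech tag
            pos += 1
    return (label, children), pos + 1


def to_xml(node):
    """Map a parsed node to an Element (assumes words only appear as sole children)."""
    label, children = node
    if label is None:
        elem = ET.Element("sentence")
    elif len(children) == 1 and isinstance(children[0], str):
        elem = ET.Element("word", type=label)
        elem.text = children[0]
        return elem
    else:
        # Assumption: "NP-SBJ" means category NP with function SBJ.
        category, _, function = label.partition("-")
        elem = ET.Element("constituent", type=category)
        if function:
            elem.set("function", function)
    for child in children:
        elem.append(to_xml(child))
    return elem


def convert(text):
    tree, _ = parse(tokenize(text))
    return to_xml(tree)


print(ET.tostring(convert(SAMPLE), encoding="unicode"))
```

The recursive-descent parser is about the simplest thing that works for balanced brackets; a real converter would also need to decide what to do with null elements and coindexed labels.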
> The problem
> is, even though this is much better for all the reasons XML usually is,
> it wouldn't be accepted because it would triple or quadruple the size of
> the corpus, for no "obvious" gain.
Size doesn't matter though; quoting from ...
http://www.w3.org/XML/1999/XML-in-10-points

: 5. XML is verbose, but that is not a problem
:
: Since XML is a text format, and it uses tags to delimit the data,
: XML files are nearly always larger than comparable binary formats.
: That was a conscious decision by the XML developers.
: The advantages of a text format are evident (see 3 above),
: and the disadvantages can easily be solved at a different level.
: Disk space isn't as expensive anymore as it used to be,
: and programs like zip and gzip can compress files very well and very fast.
: Those programs are available for nearly all platforms (and are usually free).
: In addition, communication protocols such as modem protocols and HTTP/1.1
: (the core protocol of the Web) can compress data on the fly,
: thus saving bandwidth as effectively as a binary format.
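
One way to check the compression point on your own data is to measure it. The sketch below is only a toy: it generates a small synthetic "corpus" from made-up word lists (NOUNS and VERBS are invented for this example, as are the helper names), writes each sentence once in bracketed form and once in the XML form discussed above, and prints raw and gzip sizes. Synthetic text this repetitive compresses far better than a real treebank would, so the numbers show how to measure, not what to expect.

```python
import gzip
import random

# Hypothetical vocabulary, just to get some variation between sentences.
NOUNS = ["report", "deficit", "trade", "corpus", "parser"]
VERBS = ["rose", "fell", "stalled", "grew"]


def bracketed(noun, verb):
    """One sentence in Penn-treebank-style brackets."""
    return f"( (S (NP-SBJ (DT the) (NN {noun})) (VP (VBD {verb})) (. .)))\n"


def as_xml(noun, verb):
    """The same sentence using the element names from the quoted example."""
    return (
        '<sentence><constituent type="S">'
        '<constituent type="NP" function="SBJ">'
        f'<word type="DT">the</word><word type="NN">{noun}</word>'
        "</constituent>"
        f'<constituent type="VP"><word type="VBD">{verb}</word></constituent>'
        '<word type=".">.</word>'
        "</constituent></sentence>\n"
    )


def sizes(text):
    """Return (raw byte count, gzip-compressed byte count)."""
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


rng = random.Random(0)  # fixed seed so runs are repeatable
pairs = [(rng.choice(NOUNS), rng.choice(VERBS)) for _ in range(2000)]
plain = "".join(bracketed(n, v) for n, v in pairs)
xml = "".join(as_xml(n, v) for n, v in pairs)

for name, text in (("bracketed", plain), ("XML", xml)):
    raw, packed = sizes(text)
    print(f"{name:10s} raw={raw:8d} bytes  gzip={packed:8d} bytes")
```

Running the same sizes() function over a real treebank file and its XML conversion would give a fairer comparison.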
> Also, there is some question of what
> level of information to put into the tag name and how much to leave in
> the arguments. That is,
>   <constituent type="SINV" function="ADV">
> or
>   <constituent type="S" subtype="INV" function="ADV">
> or
>   <S subtype="INV" function="ADV">
> or
>   <SINV function="ADV">
Yes, I tripped, fell, and splattered on that one already.
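
To make the four options concrete, here is a small Python sketch that builds each of them from the same label. The style names and the SUBTYPES table are hypothetical (splitting "SINV" into "S" plus "INV" needs some table like it, and nothing in the treebank itself supplies one); the point is only that the variants carry the same information and can be generated mechanically from one another.

```python
import xml.etree.ElementTree as ET

# Hypothetical table saying which categories are a base category plus a subtype.
SUBTYPES = {"SINV": ("S", "INV"), "SQ": ("S", "Q")}


def encode(label, style):
    """Build one element for a label such as 'SINV-ADV' in the given style."""
    category, _, function = label.partition("-")
    base, subtype = SUBTYPES.get(category, (category, None))

    if style == "type":            # <constituent type="SINV" function="ADV">
        tag, attrs = "constituent", {"type": category}
    elif style == "type+subtype":  # <constituent type="S" subtype="INV" ...>
        tag, attrs = "constituent", {"type": base}
        if subtype:
            attrs["subtype"] = subtype
    elif style == "tag+subtype":   # <S subtype="INV" function="ADV">
        tag, attrs = base, {}
        if subtype:
            attrs["subtype"] = subtype
    elif style == "tag":           # <SINV function="ADV">
        tag, attrs = category, {}
    else:
        raise ValueError(f"unknown style: {style}")

    if function:
        attrs["function"] = function
    return ET.Element(tag, attrs)


for style in ("type", "type+subtype", "tag+subtype", "tag"):
    print(ET.tostring(encode("SINV-ADV", style), encoding="unicode"))
```

One practical difference: the generic-tag styles let a single query pick out every constituent, while the category-as-tag styles make the set of element names grow with the tag set.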
> ? In any case, I'll ask my advisor (and some other people around here)
> to see if any work in this direction has been done.
Results eagerly awaited here.