
Re: XML for linguists?

From: Don Blaheta <blahedo@...>
Date: Friday, November 12, 1999, 0:52

Quoth John Cowan:
> Don Blaheta wrote:
> [Lisp data]
> > ( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
> > (VP (MD may)
> > (VP (AUX be)
> > [...]
> > [proposed XML:]
> > <sentence>
> > <constituent type="S">
> > <constituent type="NP" function="SBJ">
> > <word type="DT">The</word>
> > <constituent type="ADJP">
>
> Methinks a better version is:
>
> <top><S><NP-SBJ><DT>The</DT><ADJP><RBS>most</RBS><JJ>troublesome</JJ></ADJP>
> <NN>report</NN></NP-SUBJ>
> <VP><MD>may</MD><VP><AUX>be</AUX>
> [...]

It surely shouldn't include top-level tags like "NP-SBJ"; at the least that should be "<NP type="SBJ">" or something. But even so, you run into a very large problem: what are the acceptable tag names?

Within the treebank, there are 38 part-of-speech tags and 40 internal constituent tags. But in the Brown corpus, there were a total of 179 tags, and the CLAWS set used in the British National Corpus and others had as many as 166 at its height. Moreover, these are all English corpora; other languages have entirely different sets of parts of speech and of grammatical constituents...

-- 
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
Do you have lysdexia?
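
To make the tag-name point concrete, here is a rough Python sketch of the attribute-based encoding: the element names stay fixed ("constituent" and "word", as in the proposed XML quoted above), and the corpus-specific labels travel as attribute values, so a corpus with a different tag set needs no new element names. The function names parse and to_xml are made up for illustration, and splitting a label at its first hyphen is a simplifying assumption that would need more care for real treebank labels.

```python
import re
import xml.etree.ElementTree as ET

# Tokens are "(", ")", or any run of characters without whitespace or parentheses.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either nested tuples or plain strings (the words).
    Hypothetical helper; assumes every node has the form "(LABEL child ...)".
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def to_xml(tree):
    """Convert a parsed tree to XML with a fixed, corpus-independent element set."""
    label, children = tree
    # A node whose only child is a string is a tagged word.
    if len(children) == 1 and isinstance(children[0], str):
        elem = ET.Element("word", {"type": label})
        elem.text = children[0]
        return elem
    # Assumption: "NP-SBJ" means category "NP" with function "SBJ".
    category, _, function = label.partition("-")
    attrs = {"type": category}
    if function:
        attrs["function"] = function
    elem = ET.Element("constituent", attrs)
    for child in children:
        if isinstance(child, str):
            raise ValueError("bare word inside constituent %r" % label)
        elem.append(to_xml(child))
    return elem


# The subject noun phrase from the example sentence quoted above.
fragment = "(NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))"
root = to_xml(parse(fragment))
print(ET.tostring(root, encoding="unicode"))
```

Nothing in the code mentions NP, DT, or any other label, so the same converter would accept Brown or CLAWS tags unchanged; a tag-per-element scheme would instead need every label declared as an element name.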