Re: XML for linguists?
From: Don Blaheta <blahedo@...>
Date: Friday, November 12, 1999, 0:52
Quoth John Cowan:
> Don Blaheta wrote:
> [Lisp data]
> > ( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
> > (VP (MD may)
> > (VP (AUX be)
> > [...]
>
> [proposed XML:]
> > <sentence>
> > <constituent type="S">
> > <constituent type="NP" function="SBJ">
> > <word type="DT">The</word>
> > <constituent type="ADJP">
>
> Methinks a better version is:
>
> <top><S><NP-SBJ><DT>The</DT><ADJP><RBS>most</RBS><JJ>troublesome</JJ></ADJP>
> <NN>report</NN></NP-SBJ>
> <VP><MD>may</MD><VP><AUX>be</AUX>
> [...]
It surely shouldn't include top-level tags like "NP-SBJ"; at the least
that should be <NP type="SBJ"> or something. But even so, you run
into a very large problem: what are the acceptable tag names? Within
the treebank, there are 38 part-of-speech tags and 40 internal
constituent tags. But in the Brown corpus, there were a total of 179
tags, and the CLAWS set used in the British National Corpus and others
had as many as 166 at its height. Moreover, these are all English
corpora; other languages have entirely different sets of parts of speech
and of grammatical constituents....
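
To make the alternative concrete, here is a minimal sketch in Python of the generic-element encoding from the proposed XML quoted above: every node becomes a <constituent> or <word>, and the tagset appears only in attribute values, so the element vocabulary stays fixed whichever tagset a corpus uses. The function names (tokenize, parse, to_xml) are made up for illustration, and splitting a label at its first hyphen into type and function is an assumption that real treebank labels will not always satisfy.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    # Parentheses are their own tokens; everything else splits on whitespace.
    return re.findall(r"[()]|[^\s()]+", text)

def parse(tokens, pos=0):
    # Parse one "(LABEL ...)" node starting at tokens[pos].
    # Returns (element, position just past the closing parenthesis).
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    if tokens[pos] != "(":
        # Leaf node, e.g. (DT The): keep the tag whole as an attribute value.
        elem = ET.Element("word", {"type": label})
        elem.text = tokens[pos]
        pos += 1
    else:
        # Internal node, e.g. NP-SBJ: assumed split at the first hyphen
        # into a category ("NP") and an optional function ("SBJ").
        category, _, function = label.partition("-")
        elem = ET.Element("constituent", {"type": category})
        if function:
            elem.set("function", function)
        while tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            elem.append(child)
    if tokens[pos] != ")":
        raise ValueError("expected ')' at token %d" % pos)
    return elem, pos + 1

def to_xml(text):
    # Convert one fully bracketed expression to an element tree.
    elem, _ = parse(tokenize(text))
    return elem

# Fragment taken from the bracketed example quoted above.
fragment = "(NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))"
subject = to_xml(fragment)
print(ET.tostring(subject, encoding="unicode"))
```

Nothing in this code enumerates the part-of-speech or constituent tags, so switching tagsets changes only the attribute values it emits. Checking those values against a particular tagset would still have to be done somewhere else.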
--
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
Do you have lysdexia?