
Re: XML for linguists?

From: Don Blaheta <blahedo@...>
Date: Wednesday, November 10, 1999, 10:48
Quoth Charles:
> I'm wondering if there is or should be some kind of
> XML definition for language parsing.
Probably, yes.
> Anybody know anything about this?
> What I vaguely have in mind is something like:
>
> <sentence>
> <np case=subject>
> ...
> </np>
> <vp voice=antiantiantipassive>
> ...
> </vp>
> </sentence>
I'll ask my advisor, but I don't think that there's an XML standard in the field yet. There is a parsing format standard, though, as initiated by the people at UPenn, generally known as the "Penn treebank format". It's a very lisp-y sort of format... here's a sample sentence:

( (S
    (NP-SBJ (DT The)
            (ADJP (RBS most) (JJ troublesome))
            (NN report))
    (VP (MD may)
        (VP (AUX be)
            (NP-PRD (NP (DT the) (NNP August) (NN merchandise) (NN trade) (NN deficit))
                    (ADJP (JJ due)
                          (ADVP (IN out))
                          (NP-TMP (NN tomorrow))))))
    (. .)))

Most of that should be pretty self-explanatory. Now, it's well nigh trivial to map this to something like

<sentence>
  <constituent type="S">
    <constituent type="NP" function="SBJ">
      <word type="DT">The</word>
      <constituent type="ADJP">
        ...

and this would in fact resolve one or two infelicities in the system having to do with null constituents (traces and the like). The problem is, even though this is much better for all the reasons XML usually is, it wouldn't be accepted because it would triple or quadruple the size of the corpus, for no "obvious" gain.

Also, there is some question of what level of information to put into the tag name and how much to leave in the arguments. That is,

<constituent type="SINV" function="ADV">
or
<constituent type="S" subtype="INV" function="ADV">
or
<S subtype="INV" function="ADV">
or
<SINV function="ADV">
?

In any case, I'll ask my advisor (and some other people around here) to see if any work in this direction has been done.

--
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
When two airplanes almost collide why do they call it a near miss?
It sounds like a near hit to me!
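
P.S. For the curious, here is a rough sketch of the treebank-to-XML mapping described above, using only the Python standard library. It is an illustration under assumptions, not anything the treebank people ship: the function names parse and to_xml are made up for this example, the element and attribute names follow the first of the tagging options above, and everything after the first hyphen in a label is treated as the function tag (so stacked tags or trace indices would need more care).

```python
import re
import xml.etree.ElementTree as ET

# Tokens are "(", ")", or any run of characters that is neither
# whitespace nor a parenthesis (labels and words).
TOKEN = re.compile(r"\(|\)|[^\s()]+")

SAMPLE = """
( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
     (VP (MD may)
         (VP (AUX be)
             (NP-PRD (NP (DT the) (NNP August) (NN merchandise)
                         (NN trade) (NN deficit))
                     (ADJP (JJ due) (ADVP (IN out))
                           (NP-TMP (NN tomorrow))))))
     (. .)))
"""

def parse(text):
    """Read one bracketed tree into nested lists; leaves are strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening parenthesis"
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing parenthesis
        return items

    return node()

def to_xml(tree):
    """Turn a nested-list tree into an ElementTree element."""
    if isinstance(tree[0], list):
        # Unlabelled outer bracket: wrap its contents in <sentence>.
        elem = ET.Element("sentence")
        for child in tree:
            elem.append(to_xml(child))
        return elem
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        # Preterminal such as (DT The): the label is the word's tag.
        elem = ET.Element("word", {"type": label})
        elem.text = children[0]
        return elem
    # Phrase label such as NP-SBJ: split off the function tag (assumption).
    ctype, _, function = label.partition("-")
    attrs = {"type": ctype}
    if function:
        attrs["function"] = function
    elem = ET.Element("constituent", attrs)
    for child in children:
        elem.append(to_xml(child))
    return elem

xml_root = to_xml(parse(SAMPLE))
xml_text = ET.tostring(xml_root, encoding="unicode")
print(xml_text)
```

Comparing len(xml_text) against the length of the whitespace-collapsed bracketed string gives a feel for the size overhead on this one sentence, though a single example says little about a whole corpus.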