Re: XML for linguists?
From: Don Blaheta <blahedo@...>
Date: Wednesday, November 10, 1999, 10:48
Quoth Charles:
> I'm wondering if there is or should be some kind of
> XML definition for language parsing.
Probably, yes.
> Anybody know anything about this?
> What I vaguely have in mind is something like:
>
> <sentence>
>   <np case=subject>
>     ...
>   </np>
>   <vp voice=antiantiantipassive>
>     ...
>   </vp>
> </sentence>
I'll ask my advisor, but I don't think that there's an XML standard in
the field yet. There is a parsing format standard, though, as initiated
by the people at UPenn, generally known as the "Penn treebank format".
It's a very lisp-y sort of format... here's a sample sentence:
( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
     (VP (MD may)
         (VP (AUX be)
             (NP-PRD (NP (DT the)
                         (NNP August)
                         (NN merchandise)
                         (NN trade)
                         (NN deficit))
                     (ADJP (JJ due) (ADVP (IN out)) (NP-TMP (NN tomorrow))))))
     (. .)))
Most of that should be pretty self-explanatory. Now, it's well nigh
trivial to map this to something like
<sentence>
  <constituent type="S">
    <constituent type="NP" function="SBJ">
      <word type="DT">The</word>
      <constituent type="ADJP">
        ...
and this would in fact resolve one or two infelicities in the system
having to do with null constituents (traces and the like). The problem
is, even though this is much better for all the reasons XML usually is,
it wouldn't be accepted because it would triple or quadruple the size of
the corpus, for no "obvious" gain. Also, there is some question of how
much information to put into the tag name and how much to leave in the
attributes. That is,
<constituent type="SINV" function="ADV">
or
<constituent type="S" subtype="INV" function="ADV">
or
<S subtype="INV" function="ADV">
or
<SINV function="ADV">
? In any case, I'll ask my advisor (and some other people around here)
to see if any work in this direction has been done.
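
For what it's worth, here is a rough sketch of the treebank-to-XML mapping
described above, in Python. Everything in it is an assumption for
illustration: the names split_label, treebank_to_xml and tag_style are made
up, the element and attribute names simply follow the fragment above, and
treating everything after the first hyphen of a label as the "function" is
a simplification. It does nothing special for traces or other null
constituents. The tag_style switch covers two of the naming options listed
above (the generic <constituent type="..."> form and the <NP ...> form).

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """
( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
     (VP (MD may)
         (VP (AUX be)
             (NP-PRD (NP (DT the)
                         (NNP August)
                         (NN merchandise)
                         (NN trade)
                         (NN deficit))
                     (ADJP (JJ due) (ADVP (IN out)) (NP-TMP (NN tomorrow))))))
     (. .)))
"""


def split_label(label):
    """Split 'NP-SBJ' into ('NP', 'SBJ').

    Simplification (an assumption of this sketch): everything after the
    first hyphen is treated as the function.
    """
    category, _, function = label.partition("-")
    return category, function


def treebank_to_xml(text, tag_style="generic"):
    """Convert one well-formed bracketed sentence to an XML string.

    tag_style="generic" gives <constituent type="NP" function="SBJ">;
    tag_style="named"   gives <NP function="SBJ">.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    root = ET.Element("sentence")
    stack = [root]  # the innermost open element is stack[-1]
    i = 0
    while i < len(tokens):
        if tokens[i] == ")":
            stack.pop()
            i += 1
        elif tokens[i + 1] in ("(", ")"):
            # Unlabeled wrapper bracket: it gets no element of its own, but
            # push the parent again so the matching ")" has something to pop.
            stack.append(stack[-1])
            i += 1
        elif tokens[i + 2] not in ("(", ")") and tokens[i + 3] == ")":
            # "(TAG word)": a tagged word
            word = ET.SubElement(stack[-1], "word", {"type": tokens[i + 1]})
            word.text = tokens[i + 2]
            i += 4
        else:
            # "(LABEL ...": a constituent with children to follow
            category, function = split_label(tokens[i + 1])
            if tag_style == "named":
                tag, attrs = category, {}
            else:
                tag, attrs = "constituent", {"type": category}
            if function:
                attrs["function"] = function
            stack.append(ET.SubElement(stack[-1], tag, attrs))
            i += 2
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(treebank_to_xml(SAMPLE))
    print(treebank_to_xml(SAMPLE, tag_style="named"))
```

Comparing len(treebank_to_xml(SAMPLE)) with the length of the bracketed
original gives a quick feel for the size overhead on a single sentence.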
--
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
When two airplanes almost collide why do they call it a near miss? It
sounds like a near hit to me!