Re: XML for linguists?
From: Charles <catty@...>
Date: Wednesday, November 10, 1999, 19:36
Don Blaheta wrote:
> I'll ask my advisor, but I don't think that there's an XML standard in
> the field yet. There is a parsing format standard, though, as initiated
> by the people at UPenn, generally known as the "Penn treebank format".
> It's a very lisp-y sort of format... here's a sample sentence:
>
> ( (S (NP-SBJ (DT The) (ADJP (RBS most) (JJ troublesome)) (NN report))
>      (VP (MD may)
>          (VP (AUX be)
>              (NP-PRD (NP (DT the)
>                          (NNP August)
>                          (NN merchandise)
>                          (NN trade)
>                          (NN deficit))
>                      (ADJP (JJ due) (ADVP (IN out)) (NP-TMP (NN tomorrow))))))
>      (. .)))
>
> Most of that should be pretty self-explanatory. Now, it's well nigh
> trivial to map this to something like
>
> <sentence>
>   <constituent type="S">
>     <constituent type="NP" function="SBJ">
>       <word type="DT">The</word>
>       <constituent type="ADJP">
>         ...
>
> and this would in fact resolve one or two infelicities in the system
> having to do with null constituents (traces and the like).
Cool! and thanks.
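For what it's worth, here is a minimal Python sketch of that mapping, using only the standard library. The element and attribute names (sentence, constituent, word, type, function) are copied from your example; the function names parse, build and to_xml are my own invention. Splitting a label at its first hyphen is a simplifying assumption: it does not separate index suffixes or multiple function tags, and it assumes well-formed input.

```python
import re
import xml.etree.ElementTree as ET

# A token is an open paren, a close paren, or a run of other non-space characters.
TOKEN = re.compile(r"[()]|[^\s()]+")


def parse(text):
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing paren
        return items

    return node()


def build(items):
    """Turn one parsed node into a <word> or <constituent> element."""
    label, rest = items[0], items[1:]
    if len(rest) == 1 and isinstance(rest[0], str):
        # Preterminal such as (DT The): keep the tag whole, store the token as text.
        elem = ET.Element("word", {"type": label})
        elem.text = rest[0]
        return elem
    # Assumption: text before the first hyphen is the category, the rest the function.
    category, _, function = label.partition("-")
    attrs = {"type": category}
    if function:
        attrs["function"] = function
    elem = ET.Element("constituent", attrs)
    for child in rest:
        elem.append(build(child))
    return elem


def to_xml(text):
    """Wrap the tree(s) found inside the outer unlabeled parens in <sentence>."""
    top = parse(text)
    # The sample wraps each sentence in an extra, unlabeled pair of parens.
    trees = top if top and isinstance(top[0], list) else [top]
    sentence = ET.Element("sentence")
    for tree in trees:
        sentence.append(build(tree))
    return sentence


small = "( (S (NP-SBJ (DT The) (NN report)) (VP (MD may))) )"
print(ET.tostring(to_xml(small), encoding="unicode"))
```

Going the other way (XML back to brackets) would be a similar walk over the element tree, so nothing is lost by keeping both forms around.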
> The problem
> is, even though this is much better for all the reasons XML usually is,
> it wouldn't be accepted because it would triple or quadruple the size of
> the corpus, for no "obvious" gain.
Size doesn't matter though; quoting from ...
http://www.w3.org/XML/1999/XML-in-10-points
: 5. XML is verbose, but that is not a problem
:
: Since XML is a text format, and it uses tags to delimit the data,
: XML files are nearly always larger than comparable binary formats.
: That was a conscious decision by the XML developers.
: The advantages of a text format are evident (see 3 above),
: and the disadvantages can easily be solved at a different level.
: Disk space isn't as expensive anymore as it used to be,
: and programs like zip and gzip can compress files very well and very fast.
: Those programs are available for nearly all platforms (and are usually free).
: In addition, communication protocols such as modem protocols and HTTP/1.1
: (the core protocol of the Web) can compress data on the fly,
: thus saving bandwidth as effectively as a binary format.
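If anyone wants to check that claim against their own data, a rough way to measure it is sketched below in Python. The helper name sizes is hypothetical, and the two test strings are just one fragment repeated many times, which exaggerates how well gzip does; real figures would have to come from a real corpus in both encodings.

```python
import gzip


def sizes(text):
    """Return (raw_bytes, gzipped_bytes) for a string encoded as UTF-8."""
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Two encodings of the same fragment, repeated to stand in for a corpus.
# Assumption: repetition overstates compressibility; use real data for real numbers.
bracketed = "(NP-SBJ (DT The) (NN report))\n" * 1000
tagged = (
    '<constituent type="NP" function="SBJ">'
    '<word type="DT">The</word><word type="NN">report</word>'
    "</constituent>\n"
) * 1000

for name, text in (("bracketed", bracketed), ("XML", tagged)):
    raw, packed = sizes(text)
    print(f"{name}: {raw} bytes raw, {packed} bytes gzipped")
```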
> Also, there is some question of what
> level of information to put into the tag name and how much to leave in
> the arguments. That is,
> <constituent type="SINV" function="ADV">
> or
> <constituent type="S" subtype="INV" function="ADV">
> or
> <S subtype="INV" function="ADV">
> or
> <SINV function="ADV">
Yes, I tripped, fell, and splattered on that one already.
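One thing that helped me think about it was looking at how each choice reads on the query side. Below is a small Python sketch using the limited XPath support in the standard xml.etree.ElementTree module. The document is a made-up fragment that mixes two of your styles purely for comparison; nobody would mix them in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: attribute style and tag-name style side by side.
doc = ET.fromstring(
    "<sentence>"
    '<constituent type="S" subtype="INV" function="ADV"/>'
    '<constituent type="S"/>'
    '<SINV function="ADV"/>'
    "</sentence>"
)

# Attribute style: one query finds every kind of S, a second narrows to inverted ones.
all_s = doc.findall(".//constituent[@type='S']")
inverted = doc.findall(".//constituent[@type='S'][@subtype='INV']")

# Tag-name style: short to write, but each fused label needs its own query.
sinv = doc.findall(".//SINV")

print(len(all_s), len(inverted), len(sinv))
```

With the category split from the subtype, "all clauses" is a single query; with fused names like SINV, the query has to list every label that counts as a clause. Whether that matters depends on what people actually search for.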
> ? In any case, I'll ask my advisor (and some other people around here)
> to see if any work in this direction has been done.
Results eagerly awaited here.