Re: XML for linguists?
From: Boudewijn Rempt <bsarempt@...>
Date: Friday, November 19, 1999, 18:40
On Thu, 18 Nov 1999, David G. Durand wrote:
>
> The Text Encoding Initiative faced this problem (electronic version of the
> guidelines is at
> http://etext.virginia.edu/TEI.html). The result was that
> they created a fairly complex facility that allows one to declare a
> linguistic representation, and then to mark up text in conjunction with
> that declaration.
>
I did look at TEI, but I decided to be practical for once and not to try
to implement it - indeed, I decided not to try to understand it... I
really want to get something working, and the TEI stuff looks seriously
complicated! I don't need too much flexibility, since I can easily
rewrite everything when I find out I do need it - one ends up rewriting
things anyway ;-).
> For simple projects, devising your own DTD might be simpler than using
> the full TEI mechanisms ("feature structures"). There are also simpler
> tags that can be used to attach basic grammatical information like glosses
> to a text (confusingly, grammatical items are called "tags" in the corpus
> linguistics community).
>
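Something that simple is indeed what I have in mind - glosses just hung
off the word. A quick Python sketch of the shape, where the element and
attribute names (and the values) are all made up, nothing from the TEI
or from my own format:

    from xml.etree.ElementTree import Element, tostring

    # One glossed word; 'word', 'gloss' and 'pos' are placeholder names,
    # and so are the values - this is only meant to show the shape.
    w = Element('word', gloss='1SG', pos='pron')
    w.text = 'someword'
    print(tostring(w, encoding='unicode'))
    # prints: <word gloss="1SG" pos="pron">someword</word>
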
I've been experimenting this week, and I've succeeded in writing a nice,
compliant xml file from the data in the database, and, more importantly,
reading in some minimally edited interlinear text, and producing xml
from it. Next step will be inserting/updating the contents of an xml
document into the database. (After that, I will be integrating my
interlinear viewer into the main interface, making a nice GUI to edit
lexemes, devising a way of linking lexemes to words in a text, finding a
way of splitting words into morphemes that works well, working towards a
bit of code that makes it possible to browse the grammar from a
webserver, and so on... You see why I wouldn't want to spend a month or
more looking at the TEI documents.)
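For the curious: the words-to-xml step really is nothing deep. This
little Python sketch gives the rough idea - the two-line words/glosses
input and all the element names are placeholders, not my actual format:

    from xml.etree.ElementTree import Element, SubElement, tostring

    def example_to_xml(word_line, gloss_line):
        # Turn one aligned pair of lines (words above, glosses below)
        # into an <example> element. All names here are placeholders.
        example = Element('example')
        for form, gloss in zip(word_line.split(), gloss_line.split()):
            word = SubElement(example, 'word', gloss=gloss)
            word.text = form
        return example

    ex = example_to_xml('word1 word2 word3', 'GLOSS1 GLOSS2 GLOSS3')
    print(tostring(ex, encoding='unicode'))
    # prints: <example><word gloss="GLOSS1">word1</word>...</example>
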
> There are various multilingual corpus tools available from the University
> of Edinburgh, deriving from the MULTEXT project. Those tools deal with the
> general problem of representing texts plus part of speech information plus
> segmentations (sentence and clause level word groups) and alignments
> (correlated segments in different language variants of a text). The tools
> are generic XML tools, but the tags used in the project are variations of
> the TEI tagset.
>
I'll go to Edinburgh this weekend and look at their tools - I could use
some input on working with segments of speech smaller than a complete
example but larger than a word.
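What I'm after is roughly this: wrapping a few words in a segment
element with an id, so that a gloss, a free translation or an aligned
version can point at the group rather than at single words. Again a
Python sketch with made-up names - the real MULTEXT/TEI tags will no
doubt look different:

    from xml.etree.ElementTree import Element, SubElement, tostring

    # A phrase-level segment with an id, plus a crude alignment element
    # that refers to segments in two versions of a text by those ids.
    # All names and ids here are invented for the sake of the example.
    text = Element('text')
    seg = SubElement(text, 'seg', id='a1', type='phrase')
    for form in ('word2', 'word3'):
        w = SubElement(seg, 'word')
        w.text = form
    SubElement(text, 'align', source='a1', target='b1')
    print(tostring(text, encoding='unicode'))
    # prints the whole <text> element on one line: the <seg> with its
    # two <word> children first, then the <align> element.
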
Boudewijn Rempt | http://denden.conlang.org