Re: XML for linguists?
From: Boudewijn Rempt <bsarempt@...>
Date: Friday, November 19, 1999, 18:40
On Thu, 18 Nov 1999, David G. Durand wrote:
>
> The Text Encoding Initiative faced this problem (electronic version of the
> guidelines is at
> http://etext.virginia.edu/TEI.html). The result was that
> they created a fairly complex facility that allows one to declare a
> linguistic representation, and then to mark up text in conjunction with
> that declaration.
>
I did look at TEI, but I decided to be practical for once and not to try
to implement it - indeed, I decided not to try to understand it... I
really want to get something working, and the TEI stuff looks seriously
complicated! I don't need too much flexibility, since I can easily
rewrite everything when I find out I do need it - one ends up rewriting
things anyway ;-).
> For simple projects, devising your own DTD might be simpler than using
> the full TEI mechanisms ("feature structures"). There are also simpler
> tags that can be used to attach basic grammatical information like glosses
> to a text (confusingly, grammatical items are called "tags" in the corpus
> linguistics community).
>
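Something that simple is indeed what I have in mind - glosses just hung
off the word. A quick Python sketch of the shape, where the element and
attribute names (and the values) are all made up, nothing from the TEI
or from my own format:

    from xml.etree.ElementTree import Element, tostring

    # One glossed word; 'word', 'gloss' and 'pos' are placeholder names,
    # and so are the values - this is only meant to show the shape.
    w = Element('word', gloss='1SG', pos='pron')
    w.text = 'someword'
    print(tostring(w, encoding='unicode'))
    # prints: <word gloss="1SG" pos="pron">someword</word>
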
I've been experimenting this week, and I've succeeded in writing a nice,
compliant xml file from the data in the database, and, more importantly,
reading in some minimally edited interlinear text, and producing xml
from it. Next step will be inserting/updating the contents of an xml
document into the database. (After that, I will be integrating my
interlinear viewer into the main interface, making a nice GUI to edit
lexemes, devising a way of linking lexemes to words in a text, finding a
way of splitting words into morphemes that works well, working towards a
bit of code that makes it possible to browse the grammar from a
webserver, and so on... You see why I wouldn't want to spend a month or
more looking at the TEI documents.)
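For the curious: the words-to-xml step really is nothing deep. This
little Python sketch gives the rough idea - the two-line words/glosses
input and all the element names are placeholders, not my actual format:

    from xml.etree.ElementTree import Element, SubElement, tostring

    def example_to_xml(word_line, gloss_line):
        # Turn one aligned pair of lines (words above, glosses below)
        # into an <example> element. All names here are placeholders.
        example = Element('example')
        for form, gloss in zip(word_line.split(), gloss_line.split()):
            word = SubElement(example, 'word', gloss=gloss)
            word.text = form
        return example

    ex = example_to_xml('word1 word2 word3', 'GLOSS1 GLOSS2 GLOSS3')
    print(tostring(ex, encoding='unicode'))
    # prints: <example><word gloss="GLOSS1">word1</word>...</example>
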
> There are various multilingual corpus tools available from the University
> of Edinburgh, deriving from the MULTEXT project. Those tools deal with the
> general problem of representing texts plus part of speech information plus
> segmentations (sentence and clause level word groups) and alignments
> (correlated segments in different language variants of a text). The tools
> are generic XML tools, but the tags used in the project are variations of
> the TEI tagset.
>
I'll go to Edinburgh this weekend and look at their tools - I could use
some input on working with segments of speech smaller than a complete
example but larger than a word.
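What I'm after is roughly this: wrapping a few words in a segment
element with an id, so that a gloss, a free translation or an aligned
version can point at the group rather than at single words. Again a
Python sketch with made-up names - the real MULTEXT/TEI tags will no
doubt look different:

    from xml.etree.ElementTree import Element, SubElement, tostring

    # A phrase-level segment with an id, plus a crude alignment element
    # that refers to segments in two versions of a text by those ids.
    # All names and ids here are invented for the sake of the example.
    text = Element('text')
    seg = SubElement(text, 'seg', id='a1', type='phrase')
    for form in ('word2', 'word3'):
        w = SubElement(seg, 'word')
        w.text = form
    SubElement(text, 'align', source='a1', target='b1')
    print(tostring(text, encoding='unicode'))
    # prints the whole <text> element on one line: the <seg> with its
    # two <word> children first, then the <align> element.
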
Boudewijn Rempt | http://denden.conlang.org