Re: XML for linguists?
From: Brook Conner <nellardo@...>
Date: Tuesday, November 9, 1999, 19:06
Boudewijn Rempt writes:
> It's not the same, but I'm currently experimenting with the use of
> xml for importing texts into, handling texts inside and exporting
> texts out of Kura, based on my current datamodel.
Excellent application of XML, I would think.
> Something like:
>
> <?xml version="1.0" encoding="UTF-8" standalone='yes' ?>
> <!DOCTYPE kura_interlinear_text [
> <!ELEMENT kura_interlinear_text (#PCDATA)>
> ]>
For something that is a little more structured than the below (which
should make nifty processing easier), consider:
<kura_text>
  <kura_source title="Contentious" language="Quenya">
    Nai lambelya maruva sinome
  </kura_source>
  <kura_xlation language="English">
    May it be that your language will dwell here.
  </kura_xlation>
  <kura_interlinear>
    <kura_il_pair>
      <kura_src_line>Nai</kura_src_line>
      <kura_xl_line>May it be that</kura_xl_line>
    </kura_il_pair>
    <kura_il_pair>
      <kura_src_line>lambelya</kura_src_line>
      <kura_xl_line>language:your</kura_xl_line>
    </kura_il_pair>
    etc....
  </kura_interlinear>
</kura_text>
In other words, a kura_text contains a source, a translation, and an
interlinear. Nothing else. The interlinear contains a list of pairs,
again, and nothing else. Each interlinear (il) pair contains a source
line and a translated line. And that's it.
Extend the DTD as you add features to Kura, e.g., alternative
translations. But I'd go for a fairly rigorous DTD - makes the map
from the DTD to internal data structures much clearer. E.g., (in
Haskell, but it should be clear):
type Kura_text = ( Kura_source, Kura_xlation, Kura_interlinear )
type Kura_string = ( String, Kura_language, Maybe Kura_title )
type Kura_source = Kura_string -- both source and xlation have same format
type Kura_xlation = Kura_string
type Kura_interlinear = [ Kura_pair ] -- [] is list syntax
etc.
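As a rough sketch of how the "etc." might be filled in: the definitions
of Kura_language, Kura_title and Kura_pair below are my assumptions (plain
strings and a pair of strings), not anything Kura actually defines, and
type synonyms are used just to keep it short. The sample value is the
Quenya fragment from the XML above.

```haskell
-- Assumed fill-ins for the names left undefined in the sketch.
type Kura_language = String
type Kura_title    = String

type Kura_string      = (String, Kura_language, Maybe Kura_title)
type Kura_source      = Kura_string
type Kura_xlation     = Kura_string
type Kura_pair        = (String, String)   -- (source line, translated line)
type Kura_interlinear = [Kura_pair]
type Kura_text        = (Kura_source, Kura_xlation, Kura_interlinear)

-- The Quenya sample, written out as a value of the top-level type.
sample :: Kura_text
sample =
  ( ("Nai lambelya maruva sinome", "Quenya", Just "Contentious")
  , ("May it be that your language will dwell here.", "English", Nothing)
  , [ ("Nai", "May it be that")
    , ("lambelya", "language:your")
    ]
  )

-- Pull the source-side words back out of the interlinear.
sourceWords :: Kura_text -> [String]
sourceWords (_, _, pairs) = map fst pairs

main :: IO ()
main = print (sourceWords sample)
```

Because the types mirror the elements one for one, anything the DTD
forbids (say, an interlinear pair with no translated line) simply has no
representation on the Haskell side either.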
> <kura_interlinear_text>
> <text title="Lamay Neranmen"
> description="Wander Song"
> language="denden">
> <stream text="Edo qoiqoi sümzi nerananmen" language="denden">
> <e>edo
> <tag name="TR">my</tag>
> <e>e
> <tag name="GL">poss</tag></e>
> <e>do
> <tag name="GL">1sMGH</tag></e>
> </e>
> <e text="qoiqoi">...</e>
> <e text="sümzi">...</e>
> <e text="nerananmen">...</e>
> </stream>
> <stream text="Süsü-ümen edi hod-atahl par" language="denden">
> </stream>
> </text>
> <text title="Lama Hosame">
> </text>
> </kura_interlinear_text>
> However, that makes for long documents, and it has little to do with
> natural language parsing. Besides, I lack a good reference guide to
> XML since O'Reilly has only a 100-page booklet, and I am loth to buy
> from another publisher - so this xml text isn't actually valid :-(.
Goldfarb should be good, even if it isn't O'Reilly - he literally
wrote the book on SGML.
> Taliessin pointed me to an interesting paper at www.sil.org about
> interlinear glossing:
>=20
> http://www.sil.org/silewp/1997/003/SILEWP1997-003.html
Have to check this out.....
> And they take a more line-oriented approach, tagging per line instead
> of per element. However, parsing XML is really easy, as is using DOM
> structures. Going from XML to HTML might be a bit more difficult - I
> need to translate all the <e> elements with their <tag> sub-elements
> into something parallel. My problem remains that there is a linear
Nested lists? Tables? Just depends on what you want it to look like,
or use style sheets and spit out something more structural.
> flow of complex elements where each sub-element has its own place in
> the flow.
That's why I'd advocate a more rigorous DTD. It should make it easier
to keep track of those sorts of things.
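On the tables idea: once the interlinear is just a list of pairs, laying
it out in parallel is a few lines. A minimal sketch, assuming a pair is
simply (source line, translated line); the names Kura_pair, escape, row
and toTable are made up for illustration, and the HTML is the plainest
two-row table I could think of, not a recommendation for the final look.

```haskell
-- Hypothetical pair type: (source line, translated line).
type Kura_pair = (String, String)

-- Escape the characters that are special in HTML text.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- One table row: each string becomes one cell.
row :: [String] -> String
row cells = "<tr>" ++ concatMap cell cells ++ "</tr>"
  where
    cell s = "<td>" ++ escape s ++ "</td>"

-- Source words on the top row, glosses aligned underneath them.
toTable :: [Kura_pair] -> String
toTable pairs =
  "<table>" ++ row (map fst pairs) ++ row (map snd pairs) ++ "</table>"

main :: IO ()
main = putStrLn (toTable [("Nai", "May it be that"), ("lambelya", "language:your")])
```

Each pair ends up in its own column, so a word and its gloss stay lined
up no matter how long either one is.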
Brook
---------
% got a light?
No match.
---------
Fancy. Myth. Magic.
http://www.concentric.net/~nellardo/