Re: XML for linguists?

From:	Brook Conner <nellardo@...>
Date:	Tuesday, November 9, 1999, 19:06
Boudewijn Rempt writes:
 > It's not the same, but I'm currently experimenting with the use of
 > xml for importing texts into, handling texts inside and exporting
 > texts out of Kura, based on my current datamodel.

Excellent application of XML, I would think.

 > Something like:
 >
 > <?xml version="1.0" encoding="UTF-8" standalone='yes' ?>
 > <!DOCTYPE kura_interlinear_text [
 >   <!ELEMENT kura_interlinear_text (#PCDATA)>
 > ]>

For something that is a little more structured than the below (which
should make nifty processing easier), consider:

<kura_text>
  <kura_source title="Contentious" language="Quenya">
    Nai lambelya maruva sinome
  </kura_source>
  <kura_xlation language="English">
    May it be that your language will dwell here.
  </kura_xlation>
  <kura_interlinear>
    <kura_il_pair>
      <kura_src_line>Nai</kura_src_line>
      <kura_xl_line>May it be that</kura_xl_line>
    </kura_il_pair>
    <kura_il_pair>
      <kura_src_line>lambelya</kura_src_line>
      <kura_xl_line>language:your</kura_xl_line>
    </kura_il_pair>
    <!-- etc.... -->
  </kura_interlinear>
</kura_text>

In other words, a kura_text contains a source, a translation, and an
interlinear. Nothing else. The interlinear contains a list of pairs,
again, and nothing else.  Each interlinear (il) pair contains a source
line and a translated line. And that's it.

Extend the DTD as you add features to Kura, e.g., alternative
translations.  But I'd go for a fairly rigorous DTD - makes the map
from the DTD to internal data structures much clearer.  E.g., (in
Haskell, but it should be clear):

type Kura_text = ( Kura_source, Kura_xlation, Kura_interlinear )

type Kura_string = ( String, Kura_language, Maybe Kura_title )

type Kura_source  = Kura_string -- both source and xlation have same format

type Kura_xlation = Kura_string

type Kura_interlinear = [ Kura_pair ] -- [] is list syntax

etc.
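
To make the DTD-to-data-structure map a bit more concrete, here is a
minimal sketch that fills in the pieces left under "etc." and builds
the Quenya example as a value. I've used records rather than tuples so
the fields are named. The field names, the Kura_pair type, and the
plain-String aliases for language and title are my own assumptions for
illustration, not anything taken from Kura's actual data model.

```haskell
-- Hypothetical completion of the types sketched above. Field names and
-- the sample value are illustrative assumptions, not Kura's real model.

type Kura_language = String
type Kura_title    = String

-- A piece of text tagged with its language and an optional title.
data Kura_string = Kura_string
  { ks_text     :: String
  , ks_language :: Kura_language
  , ks_title    :: Maybe Kura_title
  } deriving (Show, Eq)

-- One interlinear pair: a source word or phrase and its gloss.
data Kura_pair = Kura_pair
  { src_line :: String
  , xl_line  :: String
  } deriving (Show, Eq)

-- A kura_text is exactly a source, a translation, and an interlinear.
data Kura_text = Kura_text
  { source      :: Kura_string
  , xlation     :: Kura_string
  , interlinear :: [Kura_pair]
  } deriving (Show, Eq)

-- The example document from earlier in this message, as a value.
example :: Kura_text
example = Kura_text
  { source  = Kura_string "Nai lambelya maruva sinome" "Quenya"
                          (Just "Contentious")
  , xlation = Kura_string "May it be that your language will dwell here."
                          "English" Nothing
  , interlinear =
      [ Kura_pair "Nai" "May it be that"
      , Kura_pair "lambelya" "language:your"
      ]
  }
```

Each element in the DTD lines up with one type, and each "contains
nothing else" rule lines up with a record that has exactly those
fields, which is the clarity I'm arguing for.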


 > <kura_interlinear_text>
 > <text title="Lamay Neranmen"
 >   description="Wander Song"
 >   language="denden">
 >  <stream text="Edo qoiqoi sümzi nerananmen" language="denden">
 >    <e>edo
 >      <tag name="TR">my</tag>
 >      <e>e
 >      <tag name="GL">poss</tag></e>
 >      <e>do
 >      <tag name="GL">1sMGH</tag></e>
 >    </e>
 >    <e text="qoiqoi">...</e>
 >    <e text="sümzi">...</e>
 >    <e text="nerananmen">...</e>
 >  </stream>
 >  <stream  text="Süsü-ümen edi hod-atahl par" language="denden">
 >  </stream>
 > </text>
 > <text title="Lama Hosame">
 > </text>
 > </kura_interlinear_text>

 > However, that makes for long documents, and it has little to do with
 > natural language parsing. Besides, I lack a good reference guide to
 > XML since O'Reilly has only a 100-page booklet, and I am loth to buy
 > from another publisher - so this xml text isn't actually valid :-(.

Goldfarb should be good, even if it isn't O'Reilly - he literally
wrote the book on SGML.

 > Taliessin pointed me to an interesting paper at www.sil.org about
 > interlinear glossing:
 >
 >   http://www.sil.org/silewp/1997/003/SILEWP1997-003.html

Have to check this out.....

 > And they take a more line-oriented approach, tagging per line instead
 > of per element. However, parsing XML is really easy, as is using DOM
 > structures. Going from XML to HTML might be a bit more difficult - I
 > need to translate all the <e> elements with their <tag> sub-elements
 > into something parallel. My problem remains that there is a linear

Nested lists? Tables? Just depends on what you want it to look like,
or use style sheets and spit out something more structural.
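
As one way the table option might look, here is a rough sketch that
turns a list of interlinear pairs into a two-row HTML table, source
words on top and glosses underneath, so each gloss sits in the same
column as its word. The names (Pair, escape, row, toTable) are made up
for this example, and it assumes the flat pair structure I suggested
above rather than your nested <e> elements.

```haskell
-- Hypothetical helper: renders interlinear pairs as a two-row HTML
-- table. All names here are illustrative, not part of Kura.

-- One interlinear pair: (source word or phrase, gloss).
type Pair = (String, String)

-- Escape the characters that are special in HTML text.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- Wrap each cell in <td> and the whole list in <tr>.
row :: [String] -> String
row cells = "<tr>" ++ concatMap cell cells ++ "</tr>"
  where
    cell s = "<td>" ++ escape s ++ "</td>"

-- Column i of the table holds pair i, so each gloss lines up
-- under its source word.
toTable :: [Pair] -> String
toTable pairs =
  "<table>" ++ row (map fst pairs) ++ row (map snd pairs) ++ "</table>"
```

Something like putStrLn (toTable [("Nai", "May it be that"),
("lambelya", "language:your")]) would then print the table for the
first two words of the example.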

 > flow of complex elements where each sub-element has its own place in
 > the flow.

That's why I'd advocate a more rigorous DTD.  It should make it easier
to keep track of those sorts of things.


Brook

---------
% got a light?
No match.

---------
Fancy. Myth. Magic.
http://www.concentric.net/~nellardo/