Re: XML for linguists?

From: Boudewijn Rempt <bsarempt@...>
Date: Tuesday, November 9, 1999, 21:10
On Tue, 9 Nov 1999, Charles wrote:

> I'm wondering if there is or should be some kind of
> XML definition for language parsing.
It's not the same, but I'm currently experimenting with the use of xml for importing texts into, handling texts inside and exporting texts out of Kura, based on my current datamodel. Something like:

    <?xml version="1.0" encoding="UTF-8" standalone='yes' ?>
    <!DOCTYPE kura_interlinear_text [
      <!ELEMENT kura_interlinear_text (#PCDATA)>
    ]>
    <kura_interlinear_text>
      <text title="Lamay Neranmen"
            description="Wander Song"
            language="denden">
        <stream text="Edo qoiqoi sümzi nerananmen" language="denden">
          <e>edo
            <tag name="TR">my</tag>
            <e>e <tag name="GL">poss</tag></e>
            <e>do <tag name="GL">1sMGH</tag></e>
          </e>
          <e text="qoiqoi">...</e>
          <e text="sümzi">...</e>
          <e text="nerananmen">...</e>
        </stream>
        <stream text="Süsü-ümen edi hod-atahl par" language="denden">
        </stream>
      </text>
      <text title="Lama Hosame">
      </text>
    </kura_interlinear_text>

However, that makes for long documents, and it has little to do with natural language parsing. Besides, I lack a good reference guide to XML since O'Reilly has only a 100-page booklet, and I am loth to buy from another publisher - so this xml text isn't actually valid :-(.

Taliessin pointed me to an interesting paper at www.sil.org about interlinear glossing:

http://www.sil.org/silewp/1997/003/SILEWP1997-003.html

And they take a more line-oriented approach, tagging per line instead of per element.

However, parsing XML is really easy, as is using DOM structures. Going from XML to HTML might be a bit more difficult - I need to translate all the <e> elements with their <tag> sub-elements into something parallel. My problem remains that there is a linear flow of complex elements where each sub-element has its own place in the flow.

Boudewijn Rempt | http://denden.conlang.org/~bsarempt
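
A minimal sketch of the XML-to-HTML step described above, assuming a trimmed, well-formed version of the sample (no DOCTYPE, first word only) and Python's standard xml.dom.minidom. The helper names (child_elements, form_of, tags_of, word_to_html, stream_to_html) and the table layout are hypothetical choices for illustration, not part of Kura. Each word-level <e> becomes a small table, so that forms, glosses and the translation sit in parallel rows:

```python
from html import escape
from xml.dom import minidom

# Trimmed, well-formed variant of the sample above (an assumption of this sketch).
SAMPLE = """\
<kura_interlinear_text>
  <text title="Lamay Neranmen" description="Wander Song" language="denden">
    <stream text="Edo qoiqoi sümzi nerananmen" language="denden">
      <e>edo
        <tag name="TR">my</tag>
        <e>e <tag name="GL">poss</tag></e>
        <e>do <tag name="GL">1sMGH</tag></e>
      </e>
    </stream>
  </text>
</kura_interlinear_text>
"""


def child_elements(node, name):
    """Direct child elements called `name` (getElementsByTagName would recurse)."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]


def own_text(node):
    """Text directly inside `node`, ignoring text of nested elements."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE).strip()


def form_of(e):
    """Surface form: the text attribute if present, else the element's own text."""
    return e.getAttribute("text") if e.hasAttribute("text") else own_text(e)


def tags_of(e):
    """Map tag name -> tag text for the <tag> children of a single <e>."""
    return {t.getAttribute("name"): own_text(t) for t in child_elements(e, "tag")}


def word_to_html(word):
    """Render one word-level <e>: morpheme forms, then glosses, then translation."""
    parts = child_elements(word, "e") or [word]  # unsegmented word: one column
    forms = "".join(f"<td>{escape(form_of(p))}</td>" for p in parts)
    glosses = "".join(f"<td>{escape(tags_of(p).get('GL', ''))}</td>" for p in parts)
    translation = escape(tags_of(word).get("TR", ""))
    return ('<table class="word">'
            f"<tr>{forms}</tr><tr>{glosses}</tr>"
            f'<tr><td colspan="{len(parts)}">{translation}</td></tr>'
            "</table>")


def stream_to_html(stream):
    """One table per word, in document order, so the linear flow is preserved."""
    return "\n".join(word_to_html(w) for w in child_elements(stream, "e"))


doc = minidom.parseString(SAMPLE)
stream = doc.getElementsByTagName("stream")[0]
html = stream_to_html(stream)
```

Keeping one table per word means the words can still flow left to right (for instance as inline blocks in a stylesheet), while each sub-element keeps its own column inside its word. When a word is segmented, this sketch shows only the morpheme forms, not the whole-word form.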