Re: Hyperlinking a dictionary to a corpus
From: Jim Henry <jimhenry1973@...>
Date: Saturday, December 3, 2005, 18:58
On 12/2/05, Carsten Becker <naranoieati@...> wrote:
> On Thu, 01 Dec 2005, 12:43 CET, Gary Shannon wrote:
> database and PHP? It sounds like a nifty idea. How does the
> program know, though, how to assemble example sentences?
> I.e. AFAIU your Elomi is isolating and rather simplistic
> (at least it seems so at first sight), but what about
> more complex agglutinating or even inflecting languages?
> To link words in example sentences, you'd need a script that
> would split at *morpheme* boundaries and when the respective
> morpheme already exists in the dictionary, a link to this
> one is provided. Well, but I don't quite know how to tell a
> program how to split on morpheme boundaries the way the
> sentence is meant ... for example, when there's a prefix
> _a-_ and there are words beginning with _a_, all first a's
> of such words would be linked to the entry for that prefix.
Unless the conlang has a self-segregating morphology,
you probably need to go through the example sentences
file and manually mark them up with hyphens or some other
divider characters between morphemes. Then your
conversion script would produce links around
the morphemes and delete the hyphen characters.
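Something along these lines, say -- a rough Python sketch, not my
actual script, and the morphemes and anchor names below are invented
placeholders rather than real gzb:

# Hypothetical lexicon table: each morpheme (in its citation form)
# maps to its anchor in the HTML-ized lexicon.  These entries are
# made up for illustration.
lexicon_anchors = {
    "a": "lexicon.htm#a_prefix",
    "kim": "lexicon.htm#kim",
}

def link_word(marked_word):
    """Turn a hyphen-marked word like "a-kim" into HTML with one link
    per morpheme, dropping the hyphens from the displayed text."""
    pieces = []
    for morpheme in marked_word.split("-"):
        anchor = lexicon_anchors.get(morpheme)
        if anchor:
            pieces.append('<a href="%s">%s</a>' % (anchor, morpheme))
        else:
            pieces.append(morpheme)  # no entry yet, so leave it plain
    return "".join(pieces)

def link_sentence(marked_sentence):
    # Words are space-separated; morphemes within a word are hyphen-marked.
    return " ".join(link_word(w) for w in marked_sentence.split())

The same loop could also collect the list of linked words, which is
what the cross-reference output I describe further down would need.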
My sample sentences in gzb already have the hyphens
between morphemes; the tricky bit is that the
base version of the lexicon is in gzb's ASCII orthography
and the sample sentences are in Unicode. So
the conversion script needs to convert the sample
sentences from Unicode back to ASCII in memory
to match them to the dictionary's link anchors,
and vice versa.
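In outline that round-trip is just a character table; the mappings
below are made up for illustration and are not the real gzb
orthography:

# Invented correspondences for illustration only -- the real table
# would have one entry per gzb character.  Unicode characters in the
# sample sentences map to the ASCII sequences used in the lexicon file.
UNI_TO_ASCII = {
    "\u014b": "ng",   # placeholder
    "\u0254": "o'",   # placeholder
}
ASCII_TO_UNI = {ascii_seq: uni for uni, ascii_seq in UNI_TO_ASCII.items()}

def to_ascii(text):
    """Convert a Unicode sample sentence to the ASCII orthography so it
    can be matched against the lexicon's link anchors."""
    return "".join(UNI_TO_ASCII.get(ch, ch) for ch in text)

def to_unicode(text):
    """The reverse direction; replace longer ASCII sequences first so
    that digraphs aren't split."""
    for seq in sorted(ASCII_TO_UNI, key=len, reverse=True):
        text = text.replace(seq, ASCII_TO_UNI[seq])
    return text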
Today I modified the script that formats glossed
sentences so it will automatically link each word
to the dictionary entry. I did that for the new
"Danti and the Donkey" text, but I don't think
I'll be redoing all the other texts that way
because most of them have had hand-edited
corrections to the HTML version that didn't
go back into the ASCII sources.
I also need to make it generate output to a second
stream in some format like
<gzb word> <target doc>.htm#<anchor for sample sentence>
The output of that second stream can then be appended
to a table of cross-references that can be merged
with the lexicon table and used by the lexicon
HTMLization script. I've started working
on that but it's not finished, and it will only
work on new glosses; I'll need another script
to generate anchor lists for existing sentences.
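The cross-reference step itself should be simple enough; roughly this,
again in Python purely as a sketch, with made-up file and function
names:

from collections import defaultdict

def append_xrefs(gzb_words, target_doc, anchor, xref_file="xrefs.txt"):
    """Append one line per linked word, in the
    <gzb word> <target doc>.htm#<anchor for sample sentence> format."""
    with open(xref_file, "a") as f:
        for word in gzb_words:
            f.write("%s %s.htm#%s\n" % (word, target_doc, anchor))

def load_xref_table(xref_file="xrefs.txt"):
    """Read the accumulated cross-references into a table keyed by gzb
    word, ready to be merged with the lexicon table before the lexicon
    HTMLization script runs."""
    table = defaultdict(list)
    with open(xref_file) as f:
        for line in f:
            word, target = line.split(None, 1)
            table[word].append(target.strip())
    return table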
--
Jim Henry
http://www.pobox.com/~jimhenry/gzb/gzb.htm
...Mind the gmail Reply-to: field