Re: Hyperlinking a dictionary to a corpus
From: Jim Henry <jimhenry1973@...>
Date: Saturday, December 3, 2005, 18:58
On 12/2/05, Carsten Becker <naranoieati@...> wrote:
> On Thu, 01 Dec 2005, 12:43 CET, Gary Shannon wrote:
> database and PHP? It sounds like a nifty idea. How does the
> program know, though, how to assemble example sentences?
> I.e. AFAIU your Elomi is isolating and rather simplistic
> (at least it seems so at first sight), but what about
> more complex agglutinating or even inflecting languages?
> To link words in example sentences, you'd need a script that
> would split at *morpheme* boundaries and when the respective
> morpheme already exists in the dictionary, a link to this
> one is provided. Well, but I don't quite know how to tell a
> program how to split on morpheme boundaries the way the
> sentence is meant ... for example, when there's a prefix
> _a-_ and there are words beginning with _a_, all first a's
> of such words would be linked to the entry for that prefix.
Unless the conlang has a self-segregating morphology,
you probably need to go through the example sentences
file and manually mark them up with hyphens or some other
divider characters between morphemes. Then your
conversion script would produce links around
the morphemes and delete the hyphen characters.
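Something along these lines, say -- a rough Python sketch, not my
actual script, and the morphemes and anchor names below are invented
placeholders rather than real gzb:

# Hypothetical lexicon table: each morpheme (in its citation form)
# maps to its anchor in the HTML-ized lexicon.  These entries are
# made up for illustration.
lexicon_anchors = {
    "a": "lexicon.htm#a_prefix",
    "kim": "lexicon.htm#kim",
}

def link_word(marked_word):
    """Turn a hyphen-marked word like "a-kim" into HTML with one link
    per morpheme, dropping the hyphens from the displayed text."""
    pieces = []
    for morpheme in marked_word.split("-"):
        anchor = lexicon_anchors.get(morpheme)
        if anchor:
            pieces.append('<a href="%s">%s</a>' % (anchor, morpheme))
        else:
            pieces.append(morpheme)  # no entry yet, so leave it plain
    return "".join(pieces)

def link_sentence(marked_sentence):
    # Words are space-separated; morphemes within a word are hyphen-marked.
    return " ".join(link_word(w) for w in marked_sentence.split())

The same loop could also collect the list of linked words, which is
what the cross-reference output I describe further down would need.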
My sample sentences in gzb already have the hyphens
between morphemes; the tricky bit is that the
base version of the lexicon is in gzb's ASCII orthography
and the sample sentences are in Unicode. So
the conversion script needs to convert the sample
sentences from Unicode back to ASCII in memory
to match them to the dictionary's link anchors,
and vice versa.
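In outline that round-trip is just a character table; the mappings
below are made up for illustration and are not the real gzb
orthography:

# Invented correspondences for illustration only -- the real table
# would have one entry per gzb character.  Unicode characters in the
# sample sentences map to the ASCII sequences used in the lexicon file.
UNI_TO_ASCII = {
    "\u014b": "ng",   # placeholder
    "\u0254": "o'",   # placeholder
}
ASCII_TO_UNI = {ascii_seq: uni for uni, ascii_seq in UNI_TO_ASCII.items()}

def to_ascii(text):
    """Convert a Unicode sample sentence to the ASCII orthography so it
    can be matched against the lexicon's link anchors."""
    return "".join(UNI_TO_ASCII.get(ch, ch) for ch in text)

def to_unicode(text):
    """The reverse direction; replace longer ASCII sequences first so
    that digraphs aren't split."""
    for seq in sorted(ASCII_TO_UNI, key=len, reverse=True):
        text = text.replace(seq, ASCII_TO_UNI[seq])
    return text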
Today I modified the script that formats glossed
sentences so it will automatically link each word
to the dictionary entry. I did that for the new
"Danti and the Donkey" text, but I don't think
I'll be redoing all the other texts that way
because most of them have had hand-edited
corrections to the HTML version that didn't
go back into the ASCII sources.
I also need to make it generate output to a second
stream in some format like
<gzb word> <target doc>.htm#<anchor for sample sentence>
The output of that second stream can then be appended
to a table of cross-references that can be merged
with the lexicon table and used by the lexicon
HTMLization script. I've started working
on that but it's not finished, and it will only
work on new glosses; I'll need another script
to generate anchor lists for existing sentences.
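The cross-reference step itself should be simple enough; roughly this,
again in Python purely as a sketch, with made-up file and function
names:

from collections import defaultdict

def append_xrefs(gzb_words, target_doc, anchor, xref_file="xrefs.txt"):
    """Append one line per linked word, in the
    <gzb word> <target doc>.htm#<anchor for sample sentence> format."""
    with open(xref_file, "a") as f:
        for word in gzb_words:
            f.write("%s %s.htm#%s\n" % (word, target_doc, anchor))

def load_xref_table(xref_file="xrefs.txt"):
    """Read the accumulated cross-references into a table keyed by gzb
    word, ready to be merged with the lexicon table before the lexicon
    HTMLization script runs."""
    table = defaultdict(list)
    with open(xref_file) as f:
        for line in f:
            word, target = line.split(None, 1)
            table[word].append(target.strip())
    return table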
--
Jim Henry
http://www.pobox.com/~jimhenry/gzb/gzb.htm
...Mind the gmail Reply-to: field