Ideas for a conlang-friendly Unicode text editor (long)

From:	Herman Miller <hmiller@...>
Date:	Thursday, May 11, 2000, 0:09
I've been thinking about writing a Unicode-based text editor with special
features for conlangs. But it's already looking like a huge task, and I
don't know if I can find the time for it. Two areas of this project might
be of interest to readers of this list: customizable keyboard input and
support for scripts with special requirements (for vowel mark placement,
ligatures, and so on). It won't be more than a very simple text editor to
start out with, but I might add more features such as dictionary lookup if
I have the time.

I'd like to make it as easy as possible for conlangers to customize the
editor to add their own scripts and languages. So before I even start
writing any code, I'd like to figure out the best way to do this.

First there's the issue of finding a place for your new script in the
character set. A number of scripts already have a space reserved in the
ConScript Unicode Registry (http://www.ccil.org/~cowan/csur/). But if
you're like me, you probably have a number of scripts in various stages of
development that would be convenient to have available, even if they're
incomplete or no longer in use. Fortunately, there's plenty of room in the
private-use planes 15 and 16 (F0000-10FFFF), which UTF-16 reaches through
surrogate pairs. The CSUR has F0000-F16AF already reserved, so probably the
best thing for temporary and personal usage would be to start at 10FFFD
(the last two code points of each plane are noncharacters) and work
downwards. Alternatively, values outside the Unicode range could be used.
The range from 110000-1FFFFF (not legal in Unicode) fits the four-byte
UTF-8 bit pattern, although strict UTF-8 decoders will reject it.
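
As a rough check on those ranges, here is a minimal Python sketch. The
function names are made up for illustration. It classifies a code point and
hand-encodes it in the four-byte UTF-8 bit pattern, since Python's own
chr() refuses anything above 10FFFF.

```python
def classify(cp):
    """Say which of the ranges discussed above a code point falls in."""
    if 0xF0000 <= cp <= 0x10FFFF:
        # The last two code points of each plane are noncharacters.
        if (cp & 0xFFFF) >= 0xFFFE:
            return "noncharacter"
        return "private use (planes 15-16)"
    if 0x110000 <= cp <= 0x1FFFFF:
        return "outside Unicode"
    return "other"


def encode_4byte(cp):
    """Encode a value in 10000-1FFFFF using the four-byte UTF-8 pattern."""
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("needs a value that takes four bytes")
    return bytes([
        0xF0 | (cp >> 18),            # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),           # 10xxxxxx: low 6 bits
    ])
```

For values up to 10FFFF this agrees with a standard encoder. For 110000
and above, the bytes are only readable by software that knows about the
extension.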

After you assign character values to each character in the script, you need
to figure out a way to type it on the keyboard. Keyboards are associated
with particular languages, since more than one language may use the same
script and have different typing conventions. Many languages have a simple
one-to-one correspondence between a key on the keyboard and a character in
the script, for instance Chispa:

a (MIZARIAN LETTER A)
e (MIZARIAN LETTER E)
i (MIZARIAN LETTER I)
u (MIZARIAN LETTER U)
r (MIZARIAN LETTER R)
z (MIZARIAN LETTER Z)
s (MIZARIAN LETTER S)
p (MIZARIAN LETTER P)
t (MIZARIAN LETTER T)
c (MIZARIAN LETTER C)
k (MIZARIAN LETTER K)
' (MIZARIAN LETTER GLOTTAL STOP)

Sequences of keystrokes can be associated with sequences of characters, for
instance Zirinka:

'a (ZIRINKA LETTER A HIGH RISING)
`a (ZIRINKA LETTER A HIGH RISING)(ZIRINKA LETTER A LOW RISING)
/a (ZIRINKA LETTER A LOW RISING)
\a (ZIRINKA LETTER A LOW FALLING)
'aa (ZIRINKA LETTER A LOW RISING)(ZIRINKA LETTER A HIGH RISING)
`aa (ZIRINKA LETTER A HIGH RISING)(ZIRINKA LETTER A LOW FALLING)
/aa (ZIRINKA LETTER A LOW FALLING)(ZIRINKA LETTER A LOW RISING)
\aa (ZIRINKA LETTER A LOW FALLING)(ZIRINKA LETTER A LOW FALLING)

The actual keyboard definition files for these languages would have a
similar format: each line contains a sequence of keys followed by a space
and then, in the simplest cases, the sequence of Unicode characters that
the keystrokes produce, delimited by < ... >.
An example using the Latin alphabet, emulating a portion of the US
International keyboard:

a <a>
'a <á>
`a <à>
^a <â>
~a <ã>
' <'>
` <`>
^ <^>
~ <~>
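
To show that this format is easy to process, here is a minimal Python
sketch of a reader for it. The names parse_keyboard, type_keys and US_INTL
are invented for this example, and the longest-match rule is my assumption
about how overlapping sequences such as ' and 'a should be resolved.

```python
def parse_keyboard(text):
    """Read 'keys <characters>' lines into a dict."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        keys, _, rest = line.partition(" ")
        # The output is whatever sits between the first '<' and the last '>'.
        table[keys] = rest[rest.index("<") + 1 : rest.rindex(">")]
    return table


def type_keys(table, keystrokes):
    """Convert keystrokes to text, preferring the longest key sequence."""
    longest = max(len(k) for k in table)
    out = []
    i = 0
    while i < len(keystrokes):
        for n in range(min(longest, len(keystrokes) - i), 0, -1):
            chunk = keystrokes[i : i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            out.append(keystrokes[i])  # unmapped key passes through as is
            i += 1
    return "".join(out)


US_INTL = """\
a <a>
'a <á>
`a <à>
^a <â>
~a <ã>
' <'>
` <`>
^ <^>
~ <~>
"""
```

Calling type_keys(parse_keyboard(US_INTL), "'a`a~") gives "áà~". A real
editor would see keys one at a time and would have to hold a pending ' or `
until the next key arrives; the sketch sidesteps that by working on a whole
string.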

Converting keystrokes to characters isn't always so simple, though. The
Tsolyáni script has distinct characters for word-initial vowels and vowel
marks written above or below consonants (like Devanagari, although in other
respects the Tsolyáni script is more like Arabic). The Tengwar script has
different writing conventions for Quenya and Sindarin. In Sindarin, vowels
are attached to the *following* consonant. This means that the keyboards
for Tsolyáni and Sindarin need to be sensitive to the context of the
surrounding characters.

Clearly it would be convenient to be able to represent sets of characters,
such as the Tsolyáni consonants, and conditionally generate one character
or another based on the preceding character. For example:

{C} = <bcdfghjklmnpqrstvwxyz> // define a set of consonants

a [{C} (VOWEL MARK A)] (LETTER A)
// the keypress 'a' generates a vowel mark if immediately following a
// consonant, otherwise an independent letter
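
In code, a rule like this comes down to looking at the last character
typed. A minimal Python sketch follows; the buffer is a list, and the
placeholder strings stand in for real code points, which I haven't
assigned.

```python
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

# Placeholder names, not real code points.
VOWEL_MARK_A = "(VOWEL MARK A)"
LETTER_A = "(LETTER A)"


def press_a(buffer):
    """Append the right form of 'a' to a list of already-typed characters."""
    if buffer and buffer[-1] in CONSONANTS:
        buffer.append(VOWEL_MARK_A)   # attaches to the preceding consonant
    else:
        buffer.append(LETTER_A)       # word-initial or after another vowel
    return buffer
```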

In the case of Devanagari, pressing a vowel key in the context of a
consonant with a virama attached to it should actually replace the virama
with a vowel mark. I can't think of any conlangs that require this, but
there must be some out there somewhere. So I might as well provide a way to
do this.

a [(VIRAMA)* (VOWEL MARK A)] (LETTER A)
// asterisk means that the virama is replaced
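
Here is a minimal Python sketch of the replacement, using real Devanagari
code points. It uses the key 'i' rather than 'a', since Devanagari has no
separate sign for short a. It also assumes a keyboard on which each
consonant key has already produced consonant plus virama; that is my
assumption for the example, not how any particular input method works.

```python
VIRAMA = "\u094D"        # DEVANAGARI SIGN VIRAMA
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I
LETTER_I = "\u0907"      # DEVANAGARI LETTER I
KA = "\u0915"            # DEVANAGARI LETTER KA


def press_i(text):
    """Handle the 'i' key: replace a trailing virama, else add a letter."""
    if text.endswith(VIRAMA):
        return text[:-1] + VOWEL_SIGN_I   # the virama is replaced
    return text + LETTER_I                # independent vowel letter
```

With this, press_i(KA + VIRAMA) yields KA followed by the vowel sign, and
press_i("") yields the independent letter.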

Here's an example of the Sindarin case (illustration of the consonant 's'):

{V} = <aeiouy>
{I} = (SHORT CARRIER)(LONG CARRIER)

s [{I}*{V}** (TENGWAR LETTER SILME)**] (TENGWAR LETTER SILME)
// vowel carrier is removed, vowel is removed from its position before the
// consonant and attached after the consonant
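
Spelled out as a minimal Python sketch, with the buffer as a list of
placeholder names: press_vowel is my guess at what the vowel keys would do
(put the vowel on a short carrier until a consonant arrives), and press_s
applies the rule for 's'.

```python
VOWELS = set("aeiouy")
CARRIERS = {"(SHORT CARRIER)", "(LONG CARRIER)"}
SILME = "(TENGWAR LETTER SILME)"


def press_vowel(buffer, vowel):
    """Assumption: a newly typed vowel first appears on a short carrier."""
    buffer.extend(["(SHORT CARRIER)", vowel])
    return buffer


def press_s(buffer):
    """Handle 's': drop a pending carrier and move its vowel onto silme."""
    if len(buffer) >= 2 and buffer[-2] in CARRIERS and buffer[-1] in VOWELS:
        vowel = buffer.pop()
        buffer.pop()                    # remove the carrier
        buffer.extend([SILME, vowel])   # vowel now follows the consonant
    else:
        buffer.append(SILME)
    return buffer
```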

(Obviously the notation needs a lot of work. This is just to illustrate the
general idea of what's necessary.)

Then there's the question of what to do with languages like Korean that
have large sets of regularly-defined syllables. (Kinya is a conlang in that
category.) It would be nice to take advantage of that regularity and encode
some simple formulas into the keyboard input program.
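
Unicode's own precomposed Hangul syllables show what such a formula looks
like: the block starts at AC00 and is ordered by initial consonant, then
vowel, then final consonant. A minimal Python sketch follows (the function
name is made up). A conlang syllabary laid out with the same regularity
could use the same arithmetic with its own base and counts.

```python
HANGUL_BASE = 0xAC00
LEAD_COUNT = 19    # initial consonants
VOWEL_COUNT = 21   # medial vowels
TAIL_COUNT = 28    # final consonants, counting "no final" as index 0


def compose_syllable(lead, vowel, tail=0):
    """Return the precomposed syllable for the given zero-based indices."""
    if not (0 <= lead < LEAD_COUNT
            and 0 <= vowel < VOWEL_COUNT
            and 0 <= tail < TAIL_COUNT):
        raise ValueError("index out of range")
    return chr(HANGUL_BASE + (lead * VOWEL_COUNT + vowel) * TAIL_COUNT + tail)
```

So three small tables of keys-to-indices plus one line of arithmetic would
replace more than eleven thousand separate keyboard entries.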

Okay, now some thoughts on converting characters to glyphs. The editor will
have font definition files in a similar format to the keyboard definition
files. These files specify the conversion of Unicode characters to glyphs
in a particular font or family of fonts that share the same encoding.

Most of these will be fairly straightforward: stuff like

(OLAETYAN LETTER A) <a>
(OLAETYAN LETTER NI)(OLAETYAN LETTER GA) <N> // NG ligature glyph

A common problem is positioning of vowel marks. This is something that
(like ligatures) ideally should be handled by the font technology, but we
have to live with what's available. Example:

(ACUTE) (ACUTE) [[<w> x-50, <t> y+50]]
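
Read as data, that rule is a table of per-glyph offsets. Here is a minimal
Python sketch; the function name, the units, and the choice of (0, 0) for
glyphs without an entry are all assumptions on my part.

```python
# Offsets for the acute glyph, keyed by the glyph it sits on.
# Values mirror the rule above: <w> x-50, <t> y+50.
ACUTE_OFFSETS = {"w": (-50, 0), "t": (0, 50)}


def place_mark(base_glyph, base_x, base_y, offsets):
    """Return the (x, y) at which to draw a mark over base_glyph."""
    dx, dy = offsets.get(base_glyph, (0, 0))  # no entry: no adjustment
    return base_x + dx, base_y + dy
```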

In the Niskloz script, ligatures of vowel marks occur and need to be
specifically positioned to attach to the consonant or vowel holder in the
correct place!

<ai> (ai-ligature) [[(vowel-holder) x+25 y-50, <n> x-25, <w> y-30]] ...

Clearly this kind of notation can get tedious really fast.

A number of languages have different forms of letters that can attach to
other letters on each end, like the initial, medial, final, and isolated
forms of Arabic letters. Intervening vowel marks are ignored. Niskloz
ligatures have similar problems. The Káshtri style of the Tsolyáni script
has special variant forms of certain consonants as initial or final
elements of a cluster. Kazvarad has variant forms of consonants with
shorter ascenders to avoid running into nearby consonants.
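
The Arabic-style case can be sketched in a few lines of Python. This is a
simplification: it assumes every letter can join on both sides (which isn't
true of real Arabic), and the function names and the letter and mark sets
in the usage note are invented for the example.

```python
def _has_neighbour(text, i, step, letters, marks):
    """Look left (step -1) or right (step +1) for a letter, skipping marks."""
    j = i + step
    while 0 <= j < len(text) and text[j] in marks:
        j += step
    return 0 <= j < len(text) and text[j] in letters


def joining_forms(text, letters, marks):
    """Return one form name per character; non-letters get None."""
    forms = []
    for i, ch in enumerate(text):
        if ch not in letters:
            forms.append(None)
            continue
        before = _has_neighbour(text, i, -1, letters, marks)
        after = _has_neighbour(text, i, +1, letters, marks)
        if before and after:
            forms.append("medial")
        elif after:
            forms.append("initial")
        elif before:
            forms.append("final")
        else:
            forms.append("isolated")
    return forms
```

With letters "bkt" and marks "aiu", the text "kitab b" comes out as
initial, None, medial, None, final, None, isolated: the vowel marks don't
interrupt the joining, but the space does.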

These and similar technicalities are the reason I haven't designed any
complex scripts like Niskloz and Kazvarad in recent years: since around
1990 I've been doing most of my conlanging on the computer. More recent
scripts like Mizarian and Zirinka are much easier to deal with!

--
languages of Azir------> ----<http://www.io.com/~hmiller/languages.html>---
    h i l r i . o         "If all Printers were determin'd not to print any
     m l e @ o c m       thing till they were sure it would offend no body,
   (Herman Miller)       there would be very little printed." -Ben Franklin