(LONG) on conlang software (was Re: Lojban program and conlang software ideas)
From: Brook Conner <nellardo@...>
Date: Monday, May 8, 2000, 16:21
Peter Clark wrote:
[... details from freshmeat snipped ....]
I'd settle for "open source," but putting general tools for conlanging
under the GPL would probably be a good thing - these are fundamentally
our own ideas we're creating here, and providing someone else the
ability to profit off of our own ideas (esp. when we know the commercial
possibilities of the languages themselves are somewhat limited) just
doesn't sit well with me. But then, I'm something of an anarchist (at
least, when I feel optimistic about people :-)
> Anywho, if this interests anyone, check out:
> Next order of business: has anyone considered starting up an Open
> Source/Free Software (take your pick of terms)
> line of conlang software
> and tools? How many programmers do we have on the list? I really wish that
I suspect there's a fair amount - for me, at least, conlangs and
proglangs are simply different points in a space.
> a.) I had more time
Hear hear! :-)
> and b.) knew a decent programming language (I am still
> teaching myself C in my officially non-existent free time),
Ugh - I can't think of a PL still in widespread use that could possibly
be much worse for conlang work than C. If you want elegance and power,
try Haskell or (if you can stand the parens) Scheme. Both are widely
cross-platform, with free compilers and interpreters for *everything* -
win, mac, unix, even palmtops (e.g., PocketScheme). Haskell is
especially beautiful as a proglang, whose facility with lists and
strings is really quite nice and *readable* (if you name your functions
something reasonable). The list comprehensions in Haskell are
particularly nice. If you really *must* have something with a complex
lexical structure that explores the farthest reaches of the printable
parts of ASCII, then at least try Perl (which, IMNSHO, is a blight upon
the world of proglang design, which doesn't change the fact that it is a
useful tool). Perl's extensive support for string processing is
well-suited to conlanging.
Someone else recommended Python, which is also a good choice, as is Java
(though it has some of the lexical complexity of the whole ALGOL family:
C, C++, Perl, and various "practical" scripting languages).
> o Random word generator - I have found several on the web, but aside from
> LangMake, these are primitive at best. Of course, I do my word generation[...]
> transformation feature), but why not take a good thing and make it better?
Just to plug my favorite PL :-), here's something simple to generate gismu:
gismu = ccvcvGismu ++ cvccvGismu

ccvcvGismu = [ [c1, c2, v1, c3, v2] |
                 c1 <- lojbanConsonants, c2 <- lojbanConsonants,
                 c3 <- lojbanConsonants,
                 v1 <- lojbanVowels, v2 <- lojbanVowels ]

cvccvGismu = [ [c1, v1, c2, c3, v2] |
                 c1 <- lojbanConsonants, c2 <- lojbanConsonants,
                 c3 <- lojbanConsonants,
                 v1 <- lojbanVowels, v2 <- lojbanVowels ]
"gismu" is a list of ccvcv gismu, followed by cvccv gismu. Each of the
sublists of gismu is defined as "the list of all lists of five letters
such that c1, c2, and c3 are lojban consonants and v1 and v2 are lojban
vowels", with the letters in the appropriate order.
One of the nice things about Haskell is what I didn't say - lojbanVowels
would be a list of characters, e.g., "aeiou", but if it were a list of
something else, the code wouldn't need to change. You can do the same
kind of thing with more abstract types - phonemes, tones, what have you.
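For instance, nothing in the comprehension depends on the inventories
being characters. A quick sketch with a made-up Phoneme type (all the
names here are invented for illustration):

```haskell
-- A hypothetical Phoneme type; the comprehension below has exactly
-- the shape of the gismu ones above, only the element type differs.
data Phoneme = Phoneme { symbol :: String, voiced :: Bool }
  deriving (Eq, Show)

myConsonants :: [Phoneme]
myConsonants = [Phoneme "p" False, Phoneme "b" True, Phoneme "t" False]

myVowels :: [Phoneme]
myVowels = [Phoneme "a" True, Phoneme "i" True]

-- All CV "words" over those inventories:
cvWords :: [[Phoneme]]
cvWords = [ [c, v] | c <- myConsonants, v <- myVowels ]
```

Swap the inventories for tones, features, whatever - the generator
itself stays the same.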
Want something more general? Here's one that simply needs a function
that returns true if the word is part of the language and false if it isn't:
wordlist characters isAWord = [a | a <- perms characters, isAWord a ]
For lojban, isAWord would be something like this:
isAWord [a, b, c, d, e] = consonant a
                          && ((consonant b && vowel c) || (vowel b && consonant c))
                          && consonant d && vowel e
isAWord _               = False
"perms" is the function that returns a list of all possible permutations
of a list. Obviously, this is somewhat brute force, but not bad for a
first cut.
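"perms" isn't in the Prelude, but it's only a few lines (the List
library also ships a permutations function, which is the better choice
in practice):

```haskell
import Data.List (delete)

-- All permutations of a list - a naive version of Data.List.permutations.
-- (With duplicate elements it produces duplicate permutations.)
perms :: Eq a => [a] -> [[a]]
perms [] = [[]]
perms xs = [ x : rest | x <- xs, rest <- perms (delete x xs) ]

-- The brute-force word list from the text: keep the permutations
-- the predicate accepts.
wordlist :: Eq a => [a] -> ([a] -> Bool) -> [[a]]
wordlist characters isAWord = [ a | a <- perms characters, isAWord a ]
```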
> o Dictionary program - something where the user could type in the word and
> the translation, and the program would then generate a Conlang<->Natlang
> dictionary. It would definitely have to handle multiple meanings; grabbing
> an example from Russian, if I type in "jazyk" for the Russian word and
> "tongue" and "language" for the English definitions, I should be able to
> find "jazyk" under both "tongue" and "language" in the English section.
This part is relatively simple, if the data is in the right format....
Let's say that dictionary entries are a pair of lists - possible
meanings in lang a on one side, possible meanings in lang b on the
other. A list of such pairs is the raw data for a dictionary. So
(["jazyk"], ["tongue", "language"]) for the example above....
langAtoLangB :: [([a], [b])] -> [(a, [b])]
langAtoLangB []              = []   -- empty dictionary
langAtoLangB ((a, b) : rest) = [ (x, b) | x <- a ] ++ langAtoLangB rest
And similarly for the other direction. Sorting is a standard library
function (sortBy, in the List module), though we need a little function
for ordering ("data" is a reserved word in Haskell, so call the raw
entry list something like entries):

order (a, _) (b, _) = compare a b
aToBDictionary = sortBy order (langAtoLangB entries)
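To make "the other direction" concrete, here's a sketch using the jazyk
entry from above (the entry-list name entries is mine):

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- The reverse mapping: one entry per lang-B headword.
langBtoLangA :: [([a], [b])] -> [(b, [a])]
langBtoLangA dict = [ (y, a) | (a, b) <- dict, y <- b ]

-- The "jazyk" entry from the text:
entries :: [([String], [String])]
entries = [ (["jazyk"], ["tongue", "language"]) ]

-- "jazyk" now appears under both "tongue" and "language":
englishToRussian :: [(String, [String])]
englishToRussian = sortBy (comparing fst) (langBtoLangA entries)
```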
> should also work the other way as well; if I type in "probovat'" and later
> "starat'sja", I should find both under "try." It should also be able to
> indicate special forms, like "djen'gi" becoming "djenjeg" in the genitive
The simple routine above wouldn't note that "djenjeg" was genitive
plural, but you could certainly include it as a possible translation of
"money" (djen'gi, unless I've forgotten more Russian than I thought).
> Plus, it should have an Export To HTML feature, for that web page
> that I keep meaning to create... :)
Pretty printing is an exercise for the reader :-)
More seriously, the problem in a dictionary generator is more one of
specifying the data format than anything else - too much specification,
and you might as well write the dictionary by hand. Too little, and it
isn't so useful.
So a more general dictionary generator would include:
* many-to-many word mappings
* automatic generation of declensions and/or conjugations (which may
require tagging words by part of speech, etc, or may require rules for
determining such), with provisions for exceptions for irregular forms.
* a template-based generation of output (e.g., XML with random style
sheets - it just occurred to me that a suitably robust style sheet
processor might be able to do the same as the Haskell code above if
given the right style sheet).
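As a very rough sketch of the template idea, even naive string building
gets you an HTML definition list (the names and the output format are
made up; there's no escaping, and it assumes each entry has at least
one meaning):

```haskell
-- One dictionary entry as an HTML <dt>/<dd> pair.
entryToHtml :: (String, [String]) -> String
entryToHtml (word, meanings) =
  "<dt>" ++ word ++ "</dt><dd>" ++ commas meanings ++ "</dd>"
  where
    -- join meanings with ", " (assumes a non-empty list)
    commas = foldr1 (\m acc -> m ++ ", " ++ acc)

-- The whole dictionary as one <dl>.
dictionaryToHtml :: [(String, [String])] -> String
dictionaryToHtml es = "<dl>" ++ concatMap entryToHtml es ++ "</dl>"
```

A real template system would factor the tags out into data, but the
shape of the problem is the same.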
> o Transformer (I can't think of a better name--it's getting late) - this
> would apply regular sound changes across the board.
A series of rules that replace sounds? And of course, the ability to
have those rules be conditional on context, e.g., this vowel only
changes if preceded by this kind of consonant.....
I presume this is for automatic generation of the kind of evolutionary
threads seen in Tolkien's development of Quenya and Sindarin from
"earlier" forms.
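A sketch of what such a rule might look like - a segment to rewrite
plus a predicate on the preceding segment (all names invented, and real
rules would want richer context than one segment):

```haskell
-- A sound-change rule: rewrite `target` to `result`, but only when
-- the preceding segment satisfies `context` (Nothing = word start).
data Rule = Rule
  { target  :: Char
  , result  :: Char
  , context :: Maybe Char -> Bool
  }

-- Apply one rule across a word, pairing each segment with its
-- predecessor.
applyRule :: Rule -> String -> String
applyRule r word = zipWith rewrite (Nothing : map Just word) word
  where
    rewrite prev c
      | c == target r && context r prev = result r
      | otherwise                       = c

-- Example (made up): 'a' > 'e', but only after 'k'.
aAfterK :: Rule
aAfterK = Rule 'a' 'e' (== Just 'k')
```

A sound-change engine is then just a fold of applyRule over an ordered
list of rules.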
> o Grammar generator - This would be incredibly cool if someone could
> actually manage to pull it off. The program would run through a list of
> different grammar options (nominative/ergative/active/mixed; SVO, SOV,
> VSO, etc.; isolating/agglutinating/fusional/polysynthetic; and so on) and
> spit out a grammar. Of course, listing all the millions of different
> variables would be a nightmare...
Now this one sounds rather interesting, and programmatically somewhat
more complex, especially if you want it to generate *parsers* from the
particular permutations. Hmmm. This is a neat one. I want to chew on it
for a while.
Anyone want to suggest more variables/options here?
Okay, having thought some more (but not enough), it seems that you first
want to divide the options up into orthogonal "dimensions" (SVO etc.
being one). Each point on any given dimension corresponds to a
"mini-parser" - a combinator of some sort. Functional composition of the
"mini-parsers" produces a full parser. Getting the types right in this
would probably be a real pain, especially since they're so abstract.
E.g., what does SVO expect? A simple list of words makes it too hard for
SVO to decide whether the sentence parses (it would have to check parts
of speech etc.) No, SVO needs to be passed "words" that have already
been tagged by part of speech, normalized to have all words in base
form, with differences such as affixes, prepositions, et al factored
out. Let's see if this makes sense:
parseConlang = checkWordOrder . identifyPartOfSpeech . normalizeWords
             . breakIntoSVO . findWords
"findWords" takes a string and returns a list of words and punctuation.
- "I kissed the boy." becomes ["I", "kissed", "the", "boy", "."]
"breakIntoSVO" groups that list so that all words that are part of the
subject are together, the verb are together, etc. ["I", "kissed", "the",
"boy", "."] becomes [["I"], ["kissed"], ["the", "boy"], ["."]]
normalizeWords turns modified forms into base forms - [["I"],
["kissed"], ["the", "boy"], ["."]] becomes [["I"], ["kiss", "-ed"],
["the", "boy"], ["."]]
identifyPartOfSpeech labels normalized words by part of speech: [["I"],
["kiss", "-ed"], ["the", "boy"], ["."]] becomes [[pronoun, "I"], [verb,
"kiss", "-ed"], [noun, [article, "the"], [noun, "boy"]], [punctuation, "."]]
checkWordOrder then sees if the things are in the right order - e.g.,
for SVO, noun verb noun punct.
And so on......
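As a sketch of just the first stage (the tokenizer body is mine;
punctuation becomes its own token, spaces are dropped):

```haskell
import Data.Char (isAlpha)

-- "findWords" takes a string and returns a list of words and
-- punctuation, e.g. "I kissed the boy." ->
-- ["I", "kissed", "the", "boy", "."]
findWords :: String -> [String]
findWords = foldr step []
  where
    step c acc
      | isAlpha c = case acc of
          (w : ws) | all isAlpha w -> (c : w) : ws  -- extend current word
          ws                       -> [c] : ws      -- start a new word
      | c == ' '  = acc                             -- drop spaces
      | otherwise = [c] : acc                       -- punctuation token
```

The later stages (breakIntoSVO, normalizeWords, and so on) each need a
lexicon, so they're not a one-liner - but each is just another function
in the composition.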
> o Simulator - Since I am now officially dreaming, imagine a simulator
> where the computer takes two or more languages and builds a simulation of
> how they would change and interact with each other. How close would the
> computer come to Brithenig? What would have happened if Alexander the
> Great had conquered Japan and left a significant number of speakers of
> Greek (or
> Macedonian--is there a difference?) in Kyoto?
If you had the generator mentioned previously, then you'd have the basis
for a genetic algorithm for conlangs. The different variables become
"genes" in the "genome". Mix and match according to some sort of
objective (e.g., the "dominant" language is more likely to be selected).
You'd have to do something similar for words, borrowings, and sound changes.
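A toy version of the "genes" idea - a record of grammar options and a
mask-based crossover (types and names invented; a real simulator would
select by dominance and mutate as well):

```haskell
-- Two "dimensions" of the grammar genome, as in the quoted list.
data WordOrder  = SVO | SOV | VSO                     deriving (Eq, Show)
data Morphology = Isolating | Agglutinating | Fusional deriving (Eq, Show)

data Genome = Genome
  { order      :: WordOrder
  , morphology :: Morphology
  } deriving (Eq, Show)

-- Crossover: take each "gene" from one parent or the other,
-- according to a mask (True = take it from the first parent).
crossover :: (Bool, Bool) -> Genome -> Genome -> Genome
crossover (takeOrder, takeMorph) g1 g2 = Genome
  { order      = if takeOrder then order g1 else order g2
  , morphology = if takeMorph then morphology g1 else morphology g2
  }
```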
> Ok, well these last two are probably unreasonable, but since the
> first three already have a precedent in one form or another, they should
> not prove too difficult.
I think the last two are possible - I think no one's bothered to do it
before because, well, there just aren't that many people partaking of
the Secret Vice.
> I think it would be nice if the core could be in
> ANSI C or some other standardized language that is available across the
Uck. It would be a maintenance nightmare to build this kind of stuff in
C. Compilers are written in C only because compiler writers often care
about speed - when they don't care so much, they don't write them in C.
C is like assembly language - only the syntax is more complicated and it
is an ANSI standard (a line I first heard from Gregor Kiczales, of CLOS fame).
> whole width of Win/Mac/Un*x platforms. The GUI could then be a separate
> program that calls the core functions; that way, instead of having to
> write a separate version for each operating system, only the GUI would
> need to be re-written. (Plus, this would allow both a QT/KDE and GTK/GNOME
> GUI for Linux--hey, you could write a GUI in Tk...)
Write the GUI using your favorite CGI script equivalent instead. Write
it once, let everyone use it.
> Mmm...just think about piping a list of syllable structures and
> phonemes into a word generator, which pipes its output to a dictionary
> program which randomly assigns meanings to words, then proceeds to pipe
> the resulting dictionary to a transformer which creates half a dozen
> daughter languages.
Yep - just imagine - every sci-fi novel on the planet could have a
different language for the alien race(s) within it :-)