Re: Using word generators (was Re: Semitic root word list?)
From: H. S. Teoh <hsteoh@...>
Date: Tuesday, January 9, 2007, 21:28
On Tue, Jan 09, 2007 at 01:03:16PM -0800, David J. Peterson wrote:
[...]
> H. S. Teoh:
> <<
> I agree. English itself shows this: J, Z, and X occur very rarely
> compared to, say, E. This does not seem to be a problem in practice.
> :-)
> >>
>
> Right, but this is a slightly different matter. Would you contend
> that the sounds /dZ/, /z/ and /ks/ rarely occur in English? /z/, at
> the very least, occurs in...what, almost 50% of plural nouns?
I think /ks/ is still relatively rare, although you're right, there is a
discrepancy between the orthography and the actual phonemes. Mea culpa.
> The problem is that my alphabet is pretty much phonemic. Unless I use
> the letter for the bilabial click, there is no bilabial click.
True that.
> Jörg wrote:
> <<
> Second, you can easily avoid and correct imbalances by looking at what
> you have already invented, and use the underrepresented phonemes more
> frequently and the overrepresented ones less frequently as you
> progress.
> >>
>
> I don't know about the "easily" part... I wonder: is there a simple
> way to calculate letter frequency in one's vocabulary? I bet
> there probably is, but not for folks like me that use a word processing
> document for a dictionary... I'd switch to a spreadsheet, but it's
> just so ugly... And too practical! ;)
[...]
Text files forever! ;-) I keep the Ebisédian lexicon as a set of LaTeX
source files, and have written a utility for parsing and building a
lookup table out of it. I believe I've actually written a frequency
analysis function for it, too. :-) For Tatari Faran, since the
orthography is not terribly ugly, I keep it as a formatted plaintext
file, with a Perl script for doing lexicon searches and various such
things. It already has a way to output a list of words (bare word, no
IPA, no definition, etc.), which should be easy to pipe through another
Perl script that does the frequency analysis.
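Something along these lines would do it (just a sketch, not a polished
tool: it assumes the word list arrives on stdin, one bare word per line,
in a plain a-z orthography; digraphs or diacritics would need a smarter
split):

  #!/usr/bin/perl -w
  # Count letter frequencies in a word list read from stdin,
  # one bare word per line.
  use strict;

  my (%freq, $total);
  while (my $word = <STDIN>) {
      chomp $word;
      for my $ch (split //, lc $word) {
          next unless $ch =~ /[a-z]/;   # ignore hyphens, spaces, etc.
          $freq{$ch}++;
          $total++;
      }
  }
  die "No letters counted.\n" unless $total;

  # Letters sorted by descending frequency, with percentages.
  for my $ch (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
      printf "%s  %6d  %5.1f%%\n", $ch, $freq{$ch}, 100 * $freq{$ch} / $total;
  }

Pipe the word-list output into that and you get a ranking straight away;
counting actual phonemes rather than letters would only need a slightly
smarter tokeniser for digraphs and the like.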
The problem with using a full-fledged word-processor format like .doc is
that (1) it's binary, and therefore very difficult to process
automatically with scripts, (2) it contains formatting codes in addition
to the text, making extraction of words rather tedious, and (3) in the
case of MS Word, the format is proprietary, so you have to
reverse-engineer it in order to get any information out of it. The only
recourse is to write a VB script or some such that does what needs to be
done within Word itself. I suspect that's still doable, but I'd much
rather use Perl's ready-made arsenal of text-parsing features than
implement a lexical analyser in VB. :-P
Unfortunately, in either case, programming expertise seems to be a
requirement, unless you use a commonly used format like Shoebox for
which others have written such utilities. (I seem to remember Gary
proposing some sort of conlanging system recently---that would work,
too. But then porting everything over is always a tedious job.)
T
--
Let's not fight disease by killing the patient. -- Sean 'Shaleh' Perry