Re: Word length as a function of word frequency

From: Sally Caves <scaves@...>
Date: Saturday, May 31, 2003, 16:18

Dear Dirk,

Can you explain your beautiful chart below in layman's terms?  I'm unsure
what a "segment" is as distinguished from a syllable, or how we are to read
the chart.  Also, what linguistic rule has observed that the more frequent
word tends to be the shorter one?   More Anglophones will say or write the
words advertisement, conversation, and peculiarly than they will say or
write fop, doff, and dap.  Granted, the examples I gave of the "long" common
words are "borrowed"-- are borrowings excluded?  The short words are English
in origin, but have an air of archaism to them or jargon ("dap" is a fishing
term).  Are archaisms excluded?  What are the parameters of the study?  And
granted, English is given to the efficient monosyllable.

I was once told by a non-linguist, who thought he knew something on the
subject, that I should make sure that my "function" words in Teonaht were
monosyllabic; that it was a linguistic rule that function words and pronouns
were short, across the board.  I was tremendously skeptical, even though I
had already made sure of that years ago--for Teonaht.  It was the "across
the board" remark that cooled me.

An "As For My Conlang" remark:
The "short" words in Teonaht do tend to be function words, or base-words on
which I've built longer words.  Some of the "long" words, though, in Teonaht
are the oldest, and cannot be broken down into units. (tatilynakose,
Erahenahil).
"Intuitively" aware of this rule, I've tried shortening some of these, or
extending their meanings (haromal, for instance, which I discussed in
another thread, was reduced to mal--"now"; haromal has come to mean
"contemporarily," "these days").  But many commonly used Welsh words are on
the long side.  Adeiladwyd being a famous example.  Esgusodwch fi (a
borrowing, granted). Gwybodaeth, "knowledge."

Most Teonaht words are two syllables, a few words are five syllables, very
few longer than that.  Tatilynakose, "disgusting," is one of my very early
words, and I can't for the life of me get rid of it.  It's very frequently
used. :)  Erahenahil means "paradise," and that one, too, is pretty common.

Sally Caves
scaves@frontiernet.net
Eskkoat ol ai sendran, rohsan nuehra celyil takrem bomai nakuo.
"My shadow follows me, putting strange, new roses into the world."



----- Original Message -----
From: "Dirk Elzinga" <dirk_elzinga@...>
To: <CONLANG@...>
Sent: Friday, May 30, 2003 10:57 AM
Subject: Re: Word length as a function of word frequency


> Jeffrey:
>
> I have a 20,000 word electronic dictionary which includes: i)
> transcription (in a sensible but non-standard ASCII scheme), ii) stress
> pattern, iii) syllable count, iv) spelling, v) frequency (I think
> that's what the number is), and vi) part of speech. I can send it along
> to you and other interested parties, but its about 600K; I don't know
> what that will do to email accounts.
>
> I did do a little project to see how segment count and syllable count
> are related; it was inspired by a similar graph I saw for German in an
> old article in _Language_. Here is my graph for the 20,000 word
> dictionary (use a monowidth font to view the graph):
>
> Distribution of lexical items according to syllable (x-axis) and
> segment (y-axis) count
>
>           1     2     3     4     5     6     7     8  total
>
>    17                                   1     2     1      4
>    16                                   3     3            6
>    15                             2     7     4           13
>    14                       3    13    20     5           41
>    13                      12    57    40     8          117
>    12                 3    69   172    47                291
>    11                29   242   256    33                560
>    10           8   165   621   287     9               1090
>     9          47   532   922   138     2               1641
>     8         271  1248   721    24                     2264
>     7     1   936  1525   273                           2735
>     6    39  1922   937    40                           2938
>     5   431  2343   293                                 3067
>     4  1480  1406    14                                 2900
>     3  1514   127                                       1641
>     2   215                                              215
>     1     5                                                5
>
> total  3685  7060  4746  2903   949   162    22     1  19528
>
> I'm not sure what it means, but it's a pretty picture.
>
> Dirk
>
> On Thursday, May 29, 2003, at 09:29 PM, Jeffrey Henning wrote:
>
> > I thought I had read a web page addressing word length as a function of
> > word frequency before, but after a half-hour of searching Google I gave
> > up and did a quick analysis of this English corpus in Excel:
> > http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/2_3_writtenspoken.txt
> >
> > Length of word - Average frequency of words with this length
> > 1 - 1835.5
> > 2 - 1790.7
> > 3 - 900.2
> > 4 - 211.3
> > 5 - 110.7
> > 6 - 78.6
> > 7 - 71.9
> > 8 - 63.1
> > 9 - 59.5
> > 10 - 53.6
> > 11 - 49.9
> > 12 - 47.1
> > 13 - 48.7
> > 14 - 36.4
> > 15 - 33.0
> > 16 - 30.0
> >
> > I haven't scrubbed the corpus (and it looks like it could use it), but
> > this quick and dirty analysis was all I needed for my conlanging
> > activities of the moment, and proved my hypothesis correct. The more
> > frequent words in my conlang should be shorter than less frequent
> > words, but frequency declines more gradually than I anticipated for
> > words of 7 or more letters.
> >
> > Has anyone seen a more rigorous analysis?
> >
> > I had toyed with converting the words to phonetic representations but
> > decided it wasn't worth my time. Obviously, the number of phonemes in
> > a word is a stronger function of word frequency than the length of the
> > English spelling of the word, but I didn't feel like using SOUNDEX or
> > Zompist.com's English spelling algorithm (56 rules! --
> > http://www.zompist.com/spell.html) to come up with approximations of
> > the phonetic length.
> >
> > Anyone inspired to do a more statistically thorough analysis?
> >
> > Best regards,
> >
> > Jeffrey
> >
>
> --
> Dirk Elzinga
> Dirk_Elzinga@byu.edu
>
> "I believe that phonology is superior to music. It is more variable and
> its pecuniary possibilities are far greater." - Erik Satie
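
For anyone who wants to repeat Jeffrey's quick tally outside Excel, the
"average frequency of words with this length" figure can be sketched in a
few lines of Python. This is only a minimal sketch, not his actual
procedure: the function name and the toy word/frequency pairs are invented
for illustration, and the code does not attempt to parse the BNC list file,
whose layout would have to be checked first.

```python
from collections import defaultdict


def average_frequency_by_length(word_freqs):
    """Return {spelling length: mean frequency of the words with that length}.

    word_freqs is an iterable of (word, frequency) pairs. Length is counted
    in letters of the spelling, not in phonemes or syllables.
    """
    totals = defaultdict(float)  # summed frequency per length
    counts = defaultdict(int)    # number of words per length
    for word, freq in word_freqs:
        n = len(word)
        totals[n] += freq
        counts[n] += 1
    # Sort by length so the result reads like the list in the post.
    return {n: totals[n] / counts[n] for n in sorted(totals)}


# Toy data, invented for illustration only (not taken from any corpus).
sample = [
    ("a", 2000.0),
    ("of", 3000.0),
    ("in", 1000.0),
    ("the", 6000.0),
    ("fop", 0.0),
    ("word", 200.0),
]

for length, avg in average_frequency_by_length(sample).items():
    print(f"{length} - {avg:.1f}")
```

Note that this averages over word types of each length, so one very
frequent short word can dominate its row; a scrubbed corpus or a median
might tell a somewhat different story.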
