Word length as a function of word frequency

From: Jeffrey Henning <jeffrey@...>
Date: Friday, May 30, 2003, 3:29
I thought I had read a web page addressing word length as a function of
word frequency before, but after a half-hour of searching Google I gave
up and did a quick analysis of this English corpus in Excel:
http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/2_3_writtenspoken.txt

Length of word - Average frequency of words with this length
 1 - 1835.5
 2 - 1790.7
 3 - 900.2
 4 - 211.3
 5 - 110.7
 6 - 78.6
 7 - 71.9
 8 - 63.1
 9 - 59.5
10 - 53.6
11 - 49.9
12 - 47.1
13 - 48.7
14 - 36.4
15 - 33.0
16 - 30.0

I haven't scrubbed the corpus (and it looks like it could use it), but
this quick and dirty analysis was all I needed for my conlanging
activities of the moment, and it supported my hypothesis: the more
frequent words in my conlang should be shorter than the less frequent
ones.  However, average frequency declines more gradually than I
anticipated for words of 7 or more letters.
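
For anyone who wants to repeat the tabulation outside Excel, here is a
minimal Python sketch of the same calculation: group words by spelling
length and average their frequencies.  The function name and the toy
data are made up for illustration, and the sketch assumes the corpus
has already been reduced to (word, frequency) pairs; it does not parse
the BNC file itself.

```python
from collections import defaultdict


def mean_frequency_by_length(pairs):
    """Return {word length: mean frequency of the words with that length}."""
    totals = defaultdict(float)  # summed frequency per length
    counts = defaultdict(int)    # number of words per length
    for word, freq in pairs:
        n = len(word)
        totals[n] += freq
        counts[n] += 1
    return {n: totals[n] / counts[n] for n in sorted(totals)}


# Toy data invented for illustration only; NOT taken from the corpus.
sample = [("a", 20.0), ("I", 10.0), ("of", 30.0), ("the", 60.0), ("and", 40.0)]

result = mean_frequency_by_length(sample)
for length, mean in result.items():
    # Same "length - average frequency" layout as the list above.
    print(f"{length:2d} - {mean:.1f}")
```

Scrubbing the corpus first (dropping punctuation tokens, merging
duplicate entries and so on) would happen before the pairs are passed
in.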

Has anyone seen a more rigorous analysis?

I had toyed with converting the words to phonetic representations but
decided it wasn't worth my time.  Obviously, the number of phonemes in a
word is more closely tied to word frequency than the length of the
word's English spelling is, but I didn't feel like using SOUNDEX or
Zompist.com's English spelling algorithm (56 rules! --
http://www.zompist.com/spell.html) to come up with approximations of the
phonetic length.
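
As a rough illustration of what "approximating the phonetic length"
could look like, here is a deliberately crude Python sketch.  It is
neither SOUNDEX nor the Zompist.com rule set: the handful of letter
groups below is a hypothetical, incomplete list chosen only for the
example, and it will miscount many words ("through", for instance).

```python
import re

# Hypothetical, incomplete set of letter groups that usually spell a single
# sound in English.  "tch" is listed before "th" and "ch" so it wins first.
MULTIGRAPHS = re.compile(r"tch|sh|ch|th|ph|ng|ck|ea|ai|ou")


def approx_phoneme_count(word):
    """Very rough phoneme count estimated from English spelling."""
    w = word.lower()
    # Drop a final "e" after a consonant ("make"), but leave short words
    # such as "the" alone.
    if len(w) > 3 and w.endswith("e") and w[-2] not in "aeiou":
        w = w[:-1]
    # Collapse doubled letters ("tt", "ee") into one.
    w = re.sub(r"([a-z])\1", r"\1", w)
    # Count each listed letter group as a single sound.
    w = MULTIGRAPHS.sub("_", w)
    return len(w)


for example in ["ship", "make", "watch", "letter"]:
    print(example, approx_phoneme_count(example))
```

Feeding these counts instead of spelling lengths into the averaging
above would give a first approximation of frequency by phonetic length,
with all the caveats of so small a rule list.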

Anyone inspired to do a more statistically thorough analysis?

Best regards,

Jeffrey

Replies

JS Bangs <jaspax@...>
Dirk Elzinga <dirk_elzinga@...>
And Rosta <a.rosta@...>