
Re: Word length as a function of word frequency

From: Dirk Elzinga <dirk_elzinga@...>
Date: Friday, May 30, 2003, 14:59

Jeffrey:

I have a 20,000 word electronic dictionary which includes: i)
transcription (in a sensible but non-standard ASCII scheme), ii) stress
pattern, iii) syllable count, iv) spelling, v) frequency (I think
that's what the number is), and vi) part of speech. I can send it along
to you and other interested parties, but it's about 600K; I don't know
what that will do to email accounts.

I did do a little project to see how segment count and syllable count
are related; it was inspired by a similar graph I saw for German in an
old article in _Language_. Here is my graph for the 20,000 word
dictionary (use a monowidth font to view the graph):

Distribution of lexical items according to syllable (x-axis) and
segment (y-axis) count

             1     2     3     4     5     6     7     8     total

17                                        1     2     1         4
16                                        3     3               6
15                                  2     7     4              13
14                            3    13    20     5              41
13                           12    57    40     8             117
12                      3    69   172    47                   291
11                     29   242   256    33                   560
10                8   165   621   287     9                  1090
9                47   532   922   138     2                  1641
8               271  1248   721    24                        2264
7           1   936  1525   273                              2735
6          39  1922   937    40                              2938
5         431  2343   293                                    3067
4        1480  1406    14                                    2900
3        1514   127                                          1641
2         215                                                 215
1           5                                                   5

total    3685  7060  4746  2903   949   162    22    1    19528

I'm not sure what it means, but it's a pretty picture.
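
For anyone who wants to build the same kind of table from their own word list, here is a rough Python sketch of the cross-tabulation. The record layout (spelling, segment count, syllable count) and the sample entries are assumptions made up for illustration; they are not the actual format or contents of the dictionary described above.

```python
from collections import Counter

# Hypothetical records: (spelling, segment_count, syllable_count).
# This layout and these entries are illustrative assumptions only.
entries = [
    ("cat", 3, 1),
    ("strength", 6, 1),
    ("banana", 6, 3),
    ("happy", 4, 2),
    ("window", 5, 2),
]


def crosstab(records):
    """Count entries for each (segment count, syllable count) pair."""
    return Counter((segments, syllables) for _, segments, syllables in records)


def format_table(cells):
    """Render counts with syllables across and segments down, longest first."""
    syllables = range(1, max(s for _, s in cells) + 1)
    segments = range(max(g for g, _ in cells), 0, -1)
    lines = [" " * 5 + "".join(f"{s:6d}" for s in syllables) + "   total"]
    for g in segments:
        row = [cells.get((g, s), 0) for s in syllables]
        # Leave empty cells blank, as in the table above.
        body = "".join(f"{n:6d}" if n else " " * 6 for n in row)
        lines.append(f"{g:<5d}{body}{sum(row):8d}")
    column_totals = [sum(cells.get((g, s), 0) for g in segments) for s in syllables]
    grand_total = sum(column_totals)
    lines.append("total" + "".join(f"{n:6d}" for n in column_totals) + f"{grand_total:8d}")
    return "\n".join(lines)


cells = crosstab(entries)
print(format_table(cells))
```

Keeping the counts in a Counter keyed by (segments, syllables) means the row and column totals fall out of the same data, so they can be checked against each other.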

Dirk

On Thursday, May 29, 2003, at 09:29  PM, Jeffrey Henning wrote:

> I thought I had read a web page addressing word length as a function of
> word frequency before, but after a half-hour of searching Google I gave
> up and did a quick analysis of this English corpus in Excel:
> http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/2_3_writtenspoken.txt
>
> Length of word - Average frequency of words with this length
> 1 - 1835.5
> 2 - 1790.7
> 3 - 900.2
> 4 - 211.3
> 5 - 110.7
> 6 - 78.6
> 7 - 71.9
> 8 - 63.1
> 9 - 59.5
> 10 - 53.6
> 11 - 49.9
> 12 - 47.1
> 13 - 48.7
> 14 - 36.4
> 15 - 33.0
> 16 - 30.0
>
> I haven't scrubbed the corpus (and it looks like it could use it), but
> this quick and dirty analysis was all I needed for my conlanging
> activities of the moment, and proved my hypothesis correct. The more
> frequent words in my conlang should be shorter than less frequent words,
> but frequency declines more gradually than I anticipated for words of 7
> or more letters.
>
> Has anyone seen a more rigorous analysis?
>
> I had toyed with converting the words to phonetic representations but
> decided it wasn't worth my time. Obviously, the number of phonemes in a
> word is a stronger function of word frequency than the length of the
> English spelling of the word, but I didn't feel like using SOUNDEX or
> Zompist.com's English spelling algorithm (56 rules! --
> http://www.zompist.com/spell.html) to come up with approximations of the
> phonetic length.
>
> Anyone inspired to do a more statistically thorough analysis?
>
> Best regards,
>
> Jeffrey

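
The length/frequency averages quoted above were done in Excel, but the same grouping is only a few lines of Python. This is a rough sketch, not a reconstruction of the original spreadsheet: the (word, frequency) pairs are invented for illustration and are not taken from the BNC list, whose column layout I am not assuming here.

```python
from collections import defaultdict

# Hypothetical (word, frequency) pairs; the numbers are made up
# for illustration and do not come from the BNC frequency list.
word_freqs = [
    ("a", 2000.0),
    ("of", 3000.0),
    ("to", 1000.0),
    ("the", 6000.0),
    ("cat", 40.0),
]


def mean_frequency_by_length(pairs):
    """Average the frequencies of all words sharing the same spelled length."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for word, freq in pairs:
        totals[len(word)] += freq
        counts[len(word)] += 1
    return {n: totals[n] / counts[n] for n in sorted(totals)}


averages = mean_frequency_by_length(word_freqs)
for length, mean in averages.items():
    print(f"{length} - {mean:.1f}")
```

Swapping len(word) for a phoneme count would give the phonetic version of the same table, provided a transcription is available for each word.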
--
Dirk Elzinga
Dirk_Elzinga@byu.edu

"I believe that phonology is superior to music. It is more variable and
its pecuniary possibilities are far greater." - Erik Satie
