Re: Word length as a function of word frequency
From: Dirk Elzinga <dirk_elzinga@...>
Date: Friday, May 30, 2003, 14:59
Jeffrey:
I have a 20,000-word electronic dictionary which includes: i)
transcription (in a sensible but non-standard ASCII scheme), ii) stress
pattern, iii) syllable count, iv) spelling, v) frequency (I think
that's what the number is), and vi) part of speech. I can send it along
to you and other interested parties, but it's about 600K; I don't know
what that will do to email accounts.
I did do a little project to see how segment count and syllable count
are related; it was inspired by a similar graph I saw for German in an
old article in _Language_. Here is my graph for the 20,000-word
dictionary (use a monospaced font to view the graph):
Distribution of lexical items according to syllable (x-axis) and
segment (y-axis) count
          1     2     3     4     5     6     7     8  total
17                                      1     2     1      4
16                                      3     3            6
15                                2     7     4           13
14                          3    13    20     5           41
13                         12    57    40     8          117
12                    3    69   172    47                291
11                   29   242   256    33                560
10              8   165   621   287     9               1090
9              47   532   922   138     2               1641
8             271  1248   721    24                     2264
7         1   936  1525   273                           2735
6        39  1922   937    40                           2938
5       431  2343   293                                 3067
4      1480  1406    14                                 2900
3      1514   127                                       1641
2       215                                              215
1         5                                                5
total  3685  7060  4746  2903   949   162    22     1  19528
I'm not sure what it means, but it's a pretty picture.
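For anyone who wants to build the same kind of table from their own
word list, here is a rough sketch in Python. It assumes each entry has
already been reduced to a (segment count, syllable count) pair; that
input shape and the function names are made up for illustration and
say nothing about the dictionary's actual file format.

```python
from collections import Counter


def cross_tabulate(entries):
    """Count entries per (segment_count, syllable_count) cell.

    `entries` is an iterable of (segment_count, syllable_count) pairs.
    This input shape is an assumption, not the dictionary's real format.
    """
    cells = Counter(entries)
    row_totals = Counter()  # totals per segment count
    col_totals = Counter()  # totals per syllable count
    for (segments, syllables), n in cells.items():
        row_totals[segments] += n
        col_totals[syllables] += n
    return cells, row_totals, col_totals


def format_table(cells, row_totals, col_totals, width=6):
    """Render the counts as fixed-width text, longest words first."""
    syllable_counts = sorted(col_totals)
    header = "".join(f"{s:>{width}}" for s in syllable_counts)
    lines = [f"{'':>{width}}{header}{'total':>{width}}"]
    for segments in sorted(row_totals, reverse=True):
        # Empty cells are left blank rather than printed as 0.
        row = "".join(
            f"{cells.get((segments, s), ''):>{width}}" for s in syllable_counts
        )
        lines.append(f"{segments:>{width}}{row}{row_totals[segments]:>{width}}")
    totals = "".join(f"{col_totals[s]:>{width}}" for s in syllable_counts)
    grand_total = sum(row_totals.values())
    lines.append(f"{'total':>{width}}{totals}{grand_total:>{width}}")
    return "\n".join(lines)


# Tiny made-up sample, not taken from the dictionary.
sample = [(3, 1), (3, 1), (4, 1), (4, 2), (5, 2)]
cells, rows, cols = cross_tabulate(sample)
table = format_table(cells, rows, cols)
print(table)
```

The row and column totals are kept separately so they can be checked
against each other, the same way the margins of the table above can.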
Dirk
On Thursday, May 29, 2003, at 09:29 PM, Jeffrey Henning wrote:
> I thought I had read a web page addressing word length as a function of
> word frequency before, but after a half-hour of searching Google I gave
> up and did a quick analysis of this English corpus in Excel:
>
> http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/2_3_writtenspoken.txt
>
> Length of word - Average frequency of words with this length
> 1 - 1835.5
> 2 - 1790.7
> 3 - 900.2
> 4 - 211.3
> 5 - 110.7
> 6 - 78.6
> 7 - 71.9
> 8 - 63.1
> 9 - 59.5
> 10 - 53.6
> 11 - 49.9
> 12 - 47.1
> 13 - 48.7
> 14 - 36.4
> 15 - 33.0
> 16 - 30.0
>
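The averaging step itself is short. A minimal sketch in Python,
assuming the corpus has already been parsed into (word, frequency)
pairs; the parsing is left out because the file's column layout is not
described here, and the function name and sample values are invented
for illustration.

```python
from collections import defaultdict


def mean_frequency_by_length(word_freqs):
    """Average frequency of the words at each spelling length.

    `word_freqs` is an iterable of (word, frequency) pairs; this input
    shape is an assumption, not the layout of the corpus file.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for word, freq in word_freqs:
        sums[len(word)] += freq
        counts[len(word)] += 1
    return {n: sums[n] / counts[n] for n in sorted(sums)}


# Made-up illustrative values, not taken from the corpus.
sample = [("a", 10.0), ("I", 20.0), ("of", 30.0), ("the", 60.0), ("cat", 2.0)]
averages = mean_frequency_by_length(sample)
for length, avg in averages.items():
    print(f"{length} - {avg:.1f}")
```

Note that this is an unweighted mean over word types, so one very
frequent word can dominate the average for its length.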
> I haven't scrubbed the corpus (and it looks like it could use it), but
> this quick and dirty analysis was all I needed for my conlanging
> activities of the moment, and proved my hypothesis correct. The more
> frequent words in my conlang should be shorter than less frequent
> words, but frequency declines more gradually than I anticipated for
> words of 7 or more letters.
>
> Has anyone seen a more rigorous analysis?
>
> I had toyed with converting the words to phonetic representations but
> decided it wasn't worth my time. Obviously, the number of phonemes in
> a word is a stronger function of word frequency than the length of the
> English spelling of the word, but I didn't feel like using SOUNDEX or
> Zompist.com's English spelling algorithm (56 rules! --
> http://www.zompist.com/spell.html) to come up with approximations of
> the phonetic length.
>
> Anyone inspired to do a more statistically thorough analysis?
>
> Best regards,
>
> Jeffrey
>
>
--
Dirk Elzinga
Dirk_Elzinga@byu.edu
"I believe that phonology is superior to music. It is more variable and
its pecuniary possibilities are far greater." - Erik Satie