Re: Language comparison
From: Gary Shannon <fiziwig@...>
Date: Sunday, January 9, 2005, 17:14
--- Andreas Johansson <andjo@...> wrote:
<snip>
> I don't really see how anyone could arrive at "~1
> bit per character", tho,
> assuming that 'character' corresponds to 'phoneme';
> even in an ultra-simple
> phonology with rigid CV syllable structure, eight
> consonants and four vowels
> (making for 32 distinct syllables; less than
> Rotokas's inventory), you'd need a
> minimum of 2.5 bits per 'character' just to specify
> the sequence of segmental
> phonemes.
If each character were equally likely in a given
context then that would be true. However, given the
context "esta_lishment" the missing character has such
a high probability of being 'b' that its presence in
that context carries virtually no information and can
be encoded with close to zero bits.
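(A quick sketch in Python, with made-up probability figures just
for illustration: the information carried by a character is -log2
of its probability in context, so a nearly certain character costs
almost nothing.)

    import math

    def self_information(p):
        """Bits carried by an event that occurs with probability p."""
        return -math.log2(p)

    # Hypothetical figures: if context makes 'b' all but certain...
    print(self_information(0.999))   # ~0.0014 bits
    # ...versus a character drawn uniformly from 26 letters:
    print(self_information(1 / 26))  # ~4.70 bits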
17 bits would encode 131,072 unique words if each were
equally likely in a given context. Given that the
average word is five characters long, that implies 17/5
= 3.4 bits per character on average. However, in a
given context there are relatively few words that are
possible. For example, "I read an interesting ______
yesterday." Clearly the vast majority of the words in
the dictionary cannot be meaningfully placed in the
blank: "I read an interesting favorable yesterday." "I
read an interesting pumice yesterday." So the
information content of whatever word actually shows up
there is less than 17 bits, and less than 3.4 bits per
character.
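(Here is that arithmetic as a small Python sketch; the 100-word
figure for a constrained context is purely hypothetical, just to
show the direction of the effect.)

    import math

    vocab_size = 131_072      # 2**17 equally likely words
    avg_word_length = 5       # characters per word, as assumed above

    print(math.log2(vocab_size) / avg_word_length)   # 3.4 bits per character

    # If context narrowed the choice to, say, 100 plausible words
    # (a hypothetical figure), the cost per character drops sharply:
    print(math.log2(100) / avg_word_length)          # ~1.33 bits per character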
English has so much redundancy, in fact, that
something in the neighborhood of 1.3 bits per
character, on average, is all it takes to encode
typical English text.
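(One crude way to see this for yourself is to run some English text
through a general-purpose compressor. This only gives a loose upper
bound on the entropy, and zlib typically lands nearer 2-3 bits per
character than 1.3, but it is already far below the raw 8 bits per
character. The file name below is just a placeholder.)

    import zlib

    text = open("sample.txt", "rb").read()   # any chunk of ordinary English text
    compressed = zlib.compress(text, 9)

    # Bits per character achieved by the compressor -- an upper bound
    # on the true entropy rate of the text.
    print(8 * len(compressed) / len(text), "bits per character")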
--gary