
Re: Language comparison

From:Gary Shannon <fiziwig@...>
Date:Sunday, January 9, 2005, 17:14
--- Andreas Johansson <andjo@...> wrote:

<snip>
> I don't really see how anyone could arrive at "~1 bit per character", tho,
> assuming that 'character' corresponds to 'phoneme'; even in an ultra-simple
> phonology with rigid CV syllable structure, eight consonants and four vowels
> (making for 32 distinct syllables; less than Rotokas's inventory), you'd need a
> minimum of 2.5 bits per 'character' just to specify the sequence of segmental
> phonemes.
If each character were equally likely in a given context, then that would be true. However, given the context "esta_lishment", the missing character has such a high probability of being 'b' that the presence of "b" in that context carries virtually no information whatsoever and can be encoded with close to zero bits.

17 bits would encode 131,072 unique words if each were equally likely in a given context. Given that the average word is five characters long, that implies 17/5 = 3.4 bits per character on average. However, in a given context there are relatively few words that are possible. For example:

"I read an interesting ______ yesterday."

Clearly the vast majority of the words in the dictionary cannot be meaningfully placed in the blank location:

"I read an interesting favorable yesterday."
"I read an interesting pumice yesterday."

So the information content of whatever word shows up there is less than 17 bits, and less than 3.4 bits per character. English has so much redundancy, in fact, that something in the neighborhood of 1.3 bits per character, on average, is all it takes to encode typical English text.

--gary
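The arithmetic in this exchange can be checked with a short sketch. The uniform figures (32 syllables, 131,072 words, five characters per word) come straight from the messages above; the probability assigned to 'b' in "esta_lishment" is a made-up illustrative value, not a measurement, and the function names are hypothetical.

```python
import math


def uniform_bits(n_outcomes: int) -> float:
    """Bits needed to pick one of n equally likely outcomes."""
    return math.log2(n_outcomes)


def entropy_bits(probs) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Quoted example: 8 consonants x 4 vowels = 32 equally likely CV syllables.
syllable_bits = uniform_bits(8 * 4)   # 5.0 bits per syllable
per_char_cv = syllable_bits / 2       # 2.5 bits per segment

# 131,072 equally likely words, averaging five characters each.
word_bits = uniform_bits(131_072)     # 17.0 bits per word
per_char_word = word_bits / 5         # 3.4 bits per character

# ASSUMPTION: in the context "esta_lishment", 'b' has probability 0.999
# and the other 25 letters share the remaining 0.001 equally.
context_probs = [0.999] + [0.001 / 25] * 25
blank_bits = entropy_bits(context_probs)  # far below the uniform case

print(per_char_cv, per_char_word, blank_bits)
```

Under that assumed distribution the blank carries only a small fraction of a bit, compared with log2(26), about 4.7 bits, if all 26 letters were equally likely there. Averaging over many contexts, some highly predictable and some not, is how an overall figure well below 3.4 bits per character can arise.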

Replies

Chris Bates <chris.maths_student@...>
Andreas Johansson <andjo@...>