Data compression based on sentence structure analysis
From: Jim Henry <jimhenry1973@...>
Date: Thursday, January 19, 2006, 19:52
I'm forwarding this with permission.
---------- Forwarded message ----------
From: Robert L. Read <read@...>
Date: Jan 19, 2006 10:54 AM
Subject: About conlangish stuff..
To: Jim Henry <jimhenry1973@...>
Thank you for your announcement about the planned-language conference.
I doubt that I will be able to develop it before this conference, but I
wanted to mention an idea that really fascinates me and would make an
excellent academic paper.
I think we could build a model-based codec for Esperanto that would utilize
sentence structure as part of the model. I conjecture that one could obtain
extraordinarily high compression rates of Esperanto text in this way.
As you know, Huffman encoding works at the "letter" level, and other
dictionary-based algorithms work at the pattern level, but basically
English is too complicated to utilize grammatical information in the model.
Not so Esperanto. For example, one can imagine a 2- or 3-bit pattern
that specifies SVO, SOV, and so forth, and then the subject, verb, and
direct object do not need the letters that mark those positions. If you
combine this with a good dictionary of radicals, and use a Huffman code
on the radicals, I conjecture that one would get a surprisingly tight
compression of Esperanto text.
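
To make the "Huffman code on the radicals" part concrete, here is a minimal
sketch in Python. The radical frequencies are made-up placeholders, not
measured from any corpus, and the function name huffman_code is only
illustrative. It shows nothing more than the general idea that frequent
radicals get short codes.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free code {symbol: bitstring} from symbol frequencies."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes
        # with 0 and 1 respectively.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in codes1.items()}
        merged.update({s: "1" + b for s, b in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical frequencies for a handful of Esperanto radicals.
RADICAL_FREQS = {"est": 40, "hav": 20, "dir": 15, "am": 10, "vid": 10, "kat": 5}

codes = huffman_code(RADICAL_FREQS)
for radical, bits in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(radical, bits)
```

With a real dictionary the table would hold thousands of radicals, and the
frequencies would have to come from an actual body of Esperanto text.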
So: "Mi vin amas" might be:
01 (SOV) (00 - most common pronoun) (01 - another common pronoun)
(1010 - "am" as root) (01 - present tense marker.)
So the whole 3-word sentence, 11 characters counting the spaces, is maybe
12 bits. (Of course, one would have to fully develop the encoding model
before any assertion could really be made.)
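
As a rough sketch of the example above, the following Python packs and
unpacks that one sentence. Every table here is an assumption made for
illustration: only the bit patterns 01 (SOV), 00, 01, 1010 ("am") and 01
(present tense) come from the example, the other entries are filler, and the
fixed 4-bit root code stands in for what would really be a variable-length
code. The field order inside the bit string (word order, subject, object,
root, tense) is also just one possible choice.

```python
# Toy code tables. Only the entries used in the "Mi vin amas" example come
# from the text above; everything else is a hypothetical placeholder.
WORD_ORDER = {"SVO": "00", "SOV": "01", "VSO": "10", "OSV": "11"}
PRONOUN = {"mi": "00", "vi": "01", "li": "10", "ni": "11"}
ROOT = {"am": "1010"}
TENSE = {"is": "00", "as": "01", "os": "10"}

def invert(table):
    """Swap keys and values so a bit pattern maps back to its meaning."""
    return {code: key for key, code in table.items()}

def encode_clause(order, subject, obj, root, tense):
    """Concatenate the codes for word order, subject, object, root, tense."""
    return (WORD_ORDER[order] + PRONOUN[subject] + PRONOUN[obj]
            + ROOT[root] + TENSE[tense])

def decode_clause(bits):
    """Rebuild the sentence from a 12-bit string of fixed-width fields."""
    order = invert(WORD_ORDER)[bits[0:2]]
    subject = invert(PRONOUN)[bits[2:4]]
    # The accusative -n is not stored; it is restored from the object slot.
    obj = invert(PRONOUN)[bits[4:6]] + "n"
    verb = invert(ROOT)[bits[6:10]] + invert(TENSE)[bits[10:12]]
    words = {"S": subject, "O": obj, "V": verb}
    return " ".join(words[role] for role in order)

bits = encode_clause("SOV", "mi", "vi", "am", "as")
print(bits, len(bits))      # 010001101001 12
print(decode_clause(bits))  # mi vin amas
```

For comparison, the same 11 characters stored as 8-bit text take 88 bits,
though that says nothing yet about how a full model would do on real text.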
This would be scientifically interesting in several ways (if it succeeds):
1) The best compression of any language text is of interest, because
even if we cannot construct such a compressor for English, we are exploring
a bound on the problem.
2) Depending on the degree of success, it could motivate attempts to
find structural compressors for other languages.
3) It has some slight practical significance.