Lexicon storage methodology (was: Lexicon counting)

From:	Iain E. Davis <feaelin@...>
Date:	Thursday, September 7, 2006, 3:13


Arthaey:
> Since MySQL version 4.1:
>
>     http://dev.mysql.com/doc/refman/4.1/en/charset-unicode.html
I thought they had, or at least had it planned. It must have been a long
time ago that I was tinkering with this.

Carsten:
> Unfortunately, MySQL 5 addresses this issue only
> half-heartedly as it seems to me, I don't know which version
> I have exactly, but in my version at least Unicode does not
> really seem to work. The system somehow can only handle
> iso-8859-1 and changes my UTF-8 characters to _Â°_ and such.
> When giving out the word list in HTML, I can specify that
> this should be reencoded to UTF-8 with HTML's meta-tags. On
> the other hand, it may be just because I have WinXP and not Linux.
Hmmm. Maybe it is time I re-tinkered, just to see what the limitations are.

Actually, looking at what I had written, I seem to have made it work....

Ah. I got around it by running the input from the form through htmlentities,
which converts any "special" characters to HTML entities. Thus, there is no
actual Unicode to store in the database.

For example, the IPA symbol represented by SAMPA /6/ (U+0250) becomes &#592;
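
A minimal sketch of that conversion, written in Python rather than the PHP
mentioned above, so it illustrates the idea and is not the original code. The
names to_entities and from_entities and the sample word are made up for
illustration. Unlike PHP's htmlentities, this sketch only rewrites non-ASCII
characters and leaves &, < and > alone.

```python
import html


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    Hypothetical helper: the result is pure ASCII, so it can be stored in a
    database column that cannot hold Unicode.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_entities(stored: str) -> str:
    """Turn character references back into the characters they stand for."""
    return html.unescape(stored)


# Made-up sample word containing U+0250 (the vowel that SAMPA writes as /6/).
word = "p\u0250t"
stored = to_entities(word)
print(stored)  # p&#592;t
```

The stored form is what causes the "contamination": anything that reads the
data outside a browser has to call something like from_entities first.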

A search becomes a little more complex, since you have to run the search
word through htmlentities before querying the database.
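
The search step can be sketched the same way. This is an assumption-laden
example, not the original PHP: it uses Python with an in-memory sqlite3
database standing in for MySQL, and the table name lexicon, its columns, and
the sample entry are all hypothetical.

```python
import sqlite3


def to_entities(text: str) -> str:
    """Hypothetical helper: non-ASCII characters become numeric references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# In-memory database standing in for the MySQL lexicon table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (word TEXT, gloss TEXT)")

# Entries are converted on the way in, so only ASCII is ever stored.
conn.execute(
    "INSERT INTO lexicon (word, gloss) VALUES (?, ?)",
    (to_entities("p\u0250t"), "example gloss"),
)


def search(connection: sqlite3.Connection, term: str) -> list:
    """Convert the search term exactly as the stored words were, then query."""
    cursor = connection.execute(
        "SELECT word, gloss FROM lexicon WHERE word = ?",
        (to_entities(term),),
    )
    return cursor.fetchall()


hits = search(conn, "p\u0250t")
print(hits)
```

Searching for the raw Unicode string without converting it first would find
nothing, because the table only ever contains the entity form.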

This is not ideal, of course, since I'm effectively contaminating my data
with HTML-isms. But it got me around the limitations of MySQL at the time.
:)

I think I got tired of fiddling with PHP not too long after that, so it
never really went anywhere. :)

Iain
