Re: Um...help with unicode?

From: Herman Miller <hmiller@...>
Date: Monday, November 4, 2002, 4:07

On Mon, 4 Nov 2002 02:16:55 +0000, Mat McVeagh <matmcv@...> wrote:

>Encoding systems are originally based around character sets (i.e. new
>encoding systems were devised to cover character sets that previous ones
>didn't). But with Unicode the idea is to have one single encoding system
>that covers all character sets - laudable.
>
>But it seems a lot of things can't or don't use Unicode. So if we use
>Unicode will it always work? I.e., will we be able to read what other people
>are writing, and vice versa? If not, how do you get around it?
If you're talking about email, you're pretty much limited to ASCII, or if you're lucky, ISO 8859-1. People do use other character sets (I regularly get spam in Korean, for instance), but not every email program will be able to see them, so it's best to avoid them. I managed to figure out how to get Agent to read Esperanto and Turkish, but most people probably won't bother trying.

As far as the web goes, it's not hard to find Unicode-compatible browsers these days (my preferred browser is Mozilla). See the "Unicode Support in Your Browser" page at http://home.att.net/~jameskass/ (also the home of the Code2000 and Code2001 fonts) for some info on using Unicode on the web.

For other uses, it really depends on the program. Some programs can only handle 8-bit text regardless of what operating system they're running on. For other programs, like Notepad, it depends on the operating system; Notepad does handle Unicode in some versions of Windows, but not others. Even the programs that do support Unicode often only support the "easy" character sets (like Latin or Chinese), and not the more complicated ones (the right-to-left scripts, or ones with complicated ligatures). And even with the Latin script, software has a long way to go before we can put arbitrary combinations of diacritics on any character; pretty much the only characters that are safe to use are the precomposed Latin-1 and Latin Extended-A characters, plus the extra characters needed for Pinyin and Vietnamese.

(This does have some relevance to conlanging: the spelling of Lindiga, as well as the romanization of all my other recent langs, was designed with the limitations of typical Windows fonts in mind. So I use a dot under the vowel to distinguish [e] from [E], [o] from [O], and a circumflex for vowel length, because the combination of e or o with a dot and a circumflex is found in Vietnamese. I also use cedillas or commas instead of dots for retroflex consonants on my web pages, since typical Windows fonts don't include the characters with dots under them, but in my own documentation I use the dots under the characters.)
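
One way to check whether a letter-plus-diacritics combination already exists as a single precomposed character is to normalize it to NFC and see whether it collapses to one code point. The following is only a rough sketch in Python using the standard unicodedata module; the helper name precomposed_form is made up for illustration and is not part of any tool mentioned above.

```python
import unicodedata

def precomposed_form(base, *mark_names):
    """Return the single precomposed character for base + combining marks,
    or None if Unicode has no such character.

    precomposed_form is a hypothetical helper name used for this sketch.
    """
    # Look up each combining mark by its official Unicode character name.
    marks = "".join(unicodedata.lookup(name) for name in mark_names)
    # NFC reorders the marks canonically and composes whatever it can.
    composed = unicodedata.normalize("NFC", base + marks)
    return composed if len(composed) == 1 else None

# e + dot below + circumflex is used in Vietnamese, so a precomposed form exists.
e_form = precomposed_form("e", "COMBINING DOT BELOW", "COMBINING CIRCUMFLEX ACCENT")
print("U+%04X" % ord(e_form))  # prints U+1EC7

# q + dot below + circumflex has no precomposed form, so the result is None.
print(precomposed_form("q", "COMBINING DOT BELOW", "COMBINING CIRCUMFLEX ACCENT"))
```

Note that this only tells you whether Unicode defines the precomposed character. Whether a particular font actually includes a glyph for it is a separate question.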
>Secondly, suppose you plump for Unicode as I now am doing. (I am planning to
>be writing in languages with lots of different accents, IPA, and it would be
>nice to do e.g. Greek. I don't want to have to switch between encoding
>systems or character sets. I don't really know how to.) That doesn't mean
>you can just type or read everything. Oh no. You have to have special fonts
>installed. All the old fonts are useless.
Some programs make an attempt to fill the gaps in old fonts by substituting characters from other fonts, with varying degrees of success. I've seen programs substitute characters even if the character does exist in the current font, with predictably ugly results (and apparently no way to prevent it from happening).
>And, seemingly, there are not fonts yet for all areas of Unicode.
Code2000 covers quite a few of them (an amazing achievement!)
>Next... you get Unicode up, you've got the fonts installed... now how do you
>type the characters? You need a special 'keyboard'. I.e. a protocol for
>interpreting keystrokes on what physical keyboard you have as characters.
>(Of course you could use Character Map or an equivalent but let's face it
>that is hopelessly laborious and fiddly.) So... you need to download special
>keyboard drivers that link in to particular characters. If you are using
>Unicode of course these must be Unicode keyboards; no other keyboards will
>do. And it seems you can only type in some fonts if you have the appropriate
>keyboard for them, etc. etc.
And there's also the issue that some software won't work at all with Unicode keyboards. These are typically the same programs that don't display Unicode in the first place, but Unitype Global Writer 98 refuses to work with Keyman. (It has its own keyboards, but these aren't programmable.) One recent version of Word tries to guess what language you're typing in, and it frequently guesses wrong (switching to Hebrew every time I tried to type a thorn, and changing the font at the same time). I gave up and went back to Word 2000, which seems to be more reliable.
>OK. You have Unicode, relevant fonts to display your chosen character sets
>with, relevant keyboards to type the characters with nice and easy. Now...
>where do you type them? Any old where? NO! You cannot do Unicode at all with
>Notepad. Someone suggested you might be able to with Wordpad, but I have yet
>to see how.
What version of Windows are you running?
>Let's suppose you have found a way to compose neatly in Unicode, and can do
>textfiles, word-processed documents, webpages, typing in different fonts and
>character sets in the same piece, and hence can mix ordinary text with your
>accented conlang and phonetic transcriptions. Now... will your browser show
>it properly? Will it handle Unicode properly? Or all the relevant fonts?
Mozilla does a pretty good job of handling Unicode.
>And will your readers, to whom you have sent your masterpiece, or who are
>browsing your site?
There will still be people with outdated browsers, but Unicode-enabled browsers have been out for long enough now that there probably aren't many people who still use the old ones.
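
For web pages, it generally helps to tell the browser explicitly which encoding the page uses, so it doesn't have to guess. Here is a minimal sketch in Python, assuming you want the page stored as UTF-8; the function name build_utf8_page and the sample text are invented for illustration, and the meta tag is the standard HTML way of declaring the charset.

```python
from html import escape

def build_utf8_page(title, body_text):
    """Return the bytes of a small HTML page that declares itself as UTF-8.

    build_utf8_page is a hypothetical helper name used for this sketch.
    """
    page = (
        "<html>\n"
        "<head>\n"
        # The charset declaration tells the browser how to decode the bytes.
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "<title>" + escape(title) + "</title>\n"
        "</head>\n"
        "<body>\n"
        # escape() protects &, < and > in the text; other characters pass through.
        "<p>" + escape(body_text) + "</p>\n"
        "</body>\n"
        "</html>\n"
    )
    # The bytes must really be UTF-8 for the declaration to be true.
    return page.encode("utf-8")

# Sample text: e with circumflex and dot below, schwa, Greek alpha.
sample = build_utf8_page("Sample", "\u1ec7 \u0259 \u03b1")
print(len(sample), "bytes")
```

Writing the returned bytes to a file in binary mode gives a page that a Unicode-capable browser should decode correctly, although whether every character displays still depends on the fonts installed.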
>I find it all hopelessly confusing. I want to get this sorted out before I
>start entering lots of data re my conlangs because I don't want to have to
>retype a lot of stuff because of some incompatibility. But the trouble is, I
>don't even really know what I don't know. I don't know what questions to
>ask.
As bad as the situation is with Unicode, there really isn't any better choice. The good news is that things seem to have been gradually getting better over the years. Unfortunately, the rate of improvement is still very slow. TrueType Open support for complex scripts has been available in theory for years now, but it's only been very recently that _some_ software is starting to use a subset of the features required for _some_ scripts.

--
languages of Azir------> ---<http://www.io.com/~hmiller/lang/index.html>---
hmiller (Herman Miller)   "If all Printers were determin'd not to print any
@io.com  email password:   thing till they were sure it would offend no body,
\ "Subject: teamouse" /   there would be very little printed." -Ben Franklin