
Re: OT: Question: Unicode

From: Mark J. Reed <markjreed@...>
Date: Sunday, May 18, 2003, 20:39
On Sun, May 18, 2003 at 01:15:11PM -0400, Roger Mills wrote:
> Another question-- where can I find a listing of the Unicode nos. for the
> chars. from (hex) 0100 thru approx 02B8 (LatinA, B and IPA)? If I start
> making my own html I'll need that, as it would be cumbersome to keep jumping
> going from Notepad to Word just to look them up.
www.unicode.org has code charts.
> On first reading of Carlos' reply, I concluded that I should specify UTF8.
> On second reading, I'm not so sure....
If you include any characters whose numbers are above 255, you technically need to declare the file to be Unicode of some variety. If you don't, the browser isn't required to honor higher-numbered entities. However, if you declare it to be UTF-8, then the characters whose numbers are between 128 and 255, such as á, ¡, etc., can no longer appear as single bytes: you need to either save them in their UTF-8 form or use HTML entities for them.

Traditional character sets like ASCII, Latin-1, and Windows-1252 use a single byte to represent each character. Since there are only 256 distinct bytes, such character sets can represent only 256 different characters. (And ASCII only uses half of these.)

Unicode has room for over a million characters, so it needs more bytes per character. The most straightforward way to represent it is to just use four bytes for every character (actually, three would be enough, but most computers are designed to access memory in 4-byte chunks and would waste so much time trying to deal with 3-byte chunks that the space savings wouldn't be worth it). This is called UCS-4 (4 bytes) or UTF-32 (4 bytes x 8 bits/byte = 32 bits). But that means a given web page suddenly takes up quadruple the space to represent the same text.

You can actually represent nearly all of the characters in common use with only two bytes per character, since there were originally only going to be 65,536 characters before the Consortium realized that wouldn't be enough. So part of the under-65536 range is reserved to be used in pairs to represent the higher-numbered characters. This is called UTF-16 and is the internal representation used by most Unicode software. But it still doubles the size of text files, and if the text is all or mostly in the under-256 range that's a lot of wasted space.

So many applications use UTF-8 for external representations - in files, when transmitting web pages over the net, etc. UTF-8 is a variable-length encoding that uses a single byte for characters in the under-128 range, so all-ASCII files are legal UTF-8 files already.
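To make the size trade-off concrete, here is a minimal Python sketch that counts the bytes one character takes in each encoding. The sample characters and the helper name encoded_sizes are my own choices for illustration, and the "-be" codec variants are used only so that Python doesn't add a byte-order mark to the count.

```python
# Sample code points picked for illustration (assumed, not from the thread):
# U+0041 LATIN CAPITAL LETTER A, U+00E1 a-acute, U+0259 IPA schwa,
# U+1D11E MUSICAL SYMBOL G CLEF (outside the under-65536 range).
samples = ["\u0041", "\u00e1", "\u0259", "\U0001d11e"]

def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding."""
    return {
        "utf-32": len(ch.encode("utf-32-be")),  # always 4 bytes
        "utf-16": len(ch.encode("utf-16-be")),  # 2 bytes, or 4 for a pair
        "utf-8": len(ch.encode("utf-8")),       # 1 to 4 bytes
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The plain "A" should come out as one byte in UTF-8, the two characters from the Latin-1 and IPA ranges as two bytes each, and the character above 65,535 as four bytes in all three encodings.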
But since bytes in the 128-255 range are used in combinations to represent the rest of the Unicode range, they're not available to represent the Latin-1/Windows-1252 characters. The characters in the 128-255 range are encoded differently (as two bytes each), so normal one-byte text which uses them is not compatible with UTF-8. That's why, in UTF-8 text, you have to either save those characters in their UTF-8 form or write them as HTML entities.

-Mark
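As a rough illustration of the incompatibility described above, this Python sketch checks whether the Latin-1 byte for a-acute (U+00E1) is valid UTF-8, and shows that an HTML entity gets to the same character using ASCII only. The helper is_valid_utf8 is a hypothetical name made up for this example; html.unescape is from the Python standard library.

```python
import html

latin1_bytes = "\u00e1".encode("latin-1")  # the single byte 0xE1
utf8_bytes = "\u00e1".encode("utf-8")      # the two bytes 0xC3 0xA1

def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))

# Entities are plain ASCII, so they survive in any of these encodings.
print(html.unescape("&aacute;") == "\u00e1")
print(html.unescape("&#601;") == "\u0259")
```

The lone 0xE1 byte is rejected because in UTF-8 it announces a multi-byte sequence that never arrives, which is the reason a Latin-1 file can't simply be relabeled as UTF-8.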

Reply

John Cowan <cowan@...>