Re: OT: Question: Unicode
From: Mark J. Reed <markjreed@...>
Date: Sunday, May 18, 2003, 20:39
On Sun, May 18, 2003 at 01:15:11PM -0400, Roger Mills wrote:
> Another question-- where can I find a listing of the Unicode nos. for the
> chars. from (hex) 0100 thru approx 02B8 (LatinA, B and IPA)? If I start
> making my own html I'll need that, as it would be cumbersome to keep jumping
> from Notepad to Word just to look them up.
www.unicode.org has code charts.
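
If you'd rather have a local listing than keep flipping through the
charts, you could generate one. This is only a sketch, assuming Python
and its standard unicodedata module (the names it reports come from
whatever Unicode version your Python ships with); the function name
list_range is made up for illustration:

```python
import unicodedata

def list_range(start, end):
    """Return (hex code, character, name) for each named code point in start..end."""
    rows = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, None)  # None for code points without a name
        if name is not None:
            rows.append((f"{cp:04X}", ch, name))
    return rows

if __name__ == "__main__":
    # Latin Extended-A, Latin Extended-B and IPA Extensions, roughly the range asked about
    for code, ch, name in list_range(0x0100, 0x02B8):
        # The middle column is the hexadecimal HTML reference for the character
        print(f"U+{code}  &#x{code};  {ch}  {name}")
```

Redirect the output to a file and you have a lookup table you can keep
open next to Notepad.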
> On first reading of Carlos' reply, I concluded that I should specify UTF8.
> On second reading, I'm not so sure....
If you include any characters whose numbers are above 255, you
technically need to declare the file to be Unicode of some variety.
If you don't, the browser isn't required to honor higher-numbered
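entities. However, if you declare it to be UTF-8, then the characters
whose numbers are between 128 and 255, such as á and ¡, can no longer
appear as single Latin-1 bytes. Unless your editor really saves the
file as UTF-8, you need to use HTML entities for them (&aacute;,
&iexcl;, etc.).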
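
If you go the entity route, the conversion can be automated. A minimal
sketch, assuming Python; to_entities is a hypothetical helper name, and
it produces decimal numeric references rather than named entities like
&aacute;:

```python
def to_entities(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    # "xmlcharrefreplace" is a standard Python codec error handler
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

if __name__ == "__main__":
    print(to_entities("\u00a1Ol\u00e9!"))  # prints &#161;Ol&#233;!
```

The result is pure ASCII, so it reads the same whether the page is
declared as Latin-1, Windows-1252, or UTF-8.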
Traditional character sets like ASCII, Latin-1, and Windows-1252
use a single byte to represent each character. Since there are
only 256 distinct bytes, such character sets can represent only
256 different characters. (And ASCII only uses half of these.)
Unicode has room for over a million characters, so it needs more bytes per
character. The most straightforward way to represent it is to just
use four bytes for every character (actually, three would be enough,
but most computers are designed to access memory in 4-byte chunks
and would waste so much time trying to deal with 3-byte chunks that
the space savings wouldn't be worth it). This is called UCS-4
(4 bytes) or UTF-32 (4 bytes x 8 bits/byte = 32 bits). But that
means a given web page suddenly takes up quadruple the space to
represent the same text.
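
To see the quadrupling for yourself, here is a small sketch, assuming
Python and its built-in "latin-1" and "utf-32-be" codecs (the sample
string is arbitrary):

```python
text = "Hello, world"

latin1 = text.encode("latin-1")    # one byte per character
utf32 = text.encode("utf-32-be")   # four bytes per character; the explicit
                                   # byte order means no byte-order mark is added

print(len(text), len(latin1), len(utf32))  # prints 12 12 48
```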
You can actually represent nearly all of the characters in common
use with only two bytes per character, since there were
originally only going to be 65,536 characters. So part of
the under-65536 range is reserved to be used in pairs to represent
the higher-numbered characters. This is called UTF-16 and is the
internal representation used by most Unicode software.
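
The pairing arithmetic is simple enough to sketch. This assumes Python;
surrogate_pair is a made-up name, and the constants are the ones the
UTF-16 scheme reserves (D800-DBFF for the first half of a pair,
DC00-DFFF for the second):

```python
def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000              # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the second unit
    return high, low

if __name__ == "__main__":
    high, low = surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
    print(f"{high:04X} {low:04X}")       # prints D834 DD1E
```

Each half carries 10 bits, which is how two 16-bit units reach the
million or so code points beyond the original 65,536.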
But it still doubles the size of text files, and if the text is all
or mostly in the under-256 range that's a lot of wasted space. So
many applications use UTF-8 for external representations - in
files, when transmitting web pages over the net, etc. UTF-8 is a
variable-length encoding that uses a single byte for characters
in the under-128 range, so all-ASCII files are legal UTF-8
files already. But since bytes in the 128-255 range are used in
combinations to represent the rest of the Unicode range, they're
not available to represent the Latin-1/Windows-1252 characters.
The characters in the 128-255 range take two bytes each in UTF-8, so
normal one-byte text which uses them is not compatible with UTF-8.
That's why you have to use HTML entities for them in UTF-8 text,
unless your editor saves the file as real UTF-8.
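
Here is the incompatibility in concrete terms. A sketch, assuming
Python's built-in "latin-1" and "utf-8" codecs; is_valid_utf8 is a
hypothetical helper written just for this example:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

a_acute = "\u00e1"  # the character á, number 225

if __name__ == "__main__":
    print("A".encode("utf-8"))          # prints b'A' (ASCII is unchanged)
    print(a_acute.encode("latin-1"))    # prints b'\xe1' (one byte)
    print(a_acute.encode("utf-8"))      # prints b'\xc3\xa1' (two bytes)
    print(is_valid_utf8(b"caf\xe9"))    # Latin-1 bytes for "café": prints False
    print(is_valid_utf8(b"caf\xc3\xa9"))  # UTF-8 bytes for "café": prints True
```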
-Mark