Re: ASCIIifying
From: Mark J. Reed <markjreed@...>
Date: Tuesday, May 6, 2003, 21:37
On Tue, May 06, 2003 at 04:20:55PM -0400, Robert B Wilson wrote:
> hmm... it's called latin-1 (or western european) everywhere in windows...
> there is a windows-1252, but that seems to have the same character values
> as the "ascii" in the qbasic help file...
> i guess i shouldn't trust microsoft at all (instead of trusting them
> about as much as i trust bill clinton...)
Technically, ASCII only defines 128 characters, numbered 0
through 127, of which only 95 are "printable" - letters, numbers,
punctuation, and such. (Though some folks include tab, newline,
and carriage return in the "printable" group since they have
visible effects within text files.) The rest are called "control
characters", and mostly have metatextual functions originally geared
toward issuing mechanical controls to teletypewriter terminals,
separating records on magnetic tape, etc.
Looking at your Windows Character Map accessory, printable ASCII
starts with the space character (position 32), followed by !, ",
#, $, etc., all the way up to ~ (position 126). The last position,
127, is another control character, DELETE.
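
To make the counts above concrete, here is a minimal Python sketch that rebuilds the two groups from the position numbers alone. The variable names are just for illustration.

```python
# Positions 32..126 are the printable ASCII characters;
# positions 0..31 and 127 (DELETE) are the control characters.
printable = [chr(cp) for cp in range(32, 127)]
controls = [cp for cp in range(128) if cp < 32 or cp == 127]

print(len(printable))                       # 95
print(len(controls))                        # 33
print(repr(printable[0]), printable[-1])    # ' ' ~
```

The two groups together account for all 128 positions.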
"Latin-1" is a nickname for the ISO-8859-1 character set
(International Organization for Standardization standard number 8859,
part 1). Each of the character sets within ISO-8859 defines 256
characters, exactly twice as many as ASCII, and in all of them
the first 128 are the same as ASCII. They have another feature in
common, which is that the first 32 characters of the second half -
that is, positions 128 through 159 - are designated as more control
characters. The printable characters start with 160. In Latin-1,
position 160 is a "non breaking space" - looks like a space, but
also conveys "don't break across lines here". I think that's true
of all the ISO 8859 character sets, actually. Latin-1 continues
with the inverted exclamation point (¡) in position 161, the cent
sign (¢ - a terrible oversight that it was left out of ASCII)
in position 162, etc.
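
If you have Python handy, you can check the Latin-1 layout described above. This is only a sketch: it relies on Python's built-in "latin-1" codec and the standard unicodedata module, and the variable names are made up for the example.

```python
import unicodedata

# Decode the upper half (positions 128..255) as ISO-8859-1.
upper = bytes(range(128, 256)).decode("latin-1")

# Positions 128..159 should all come out as control characters
# (Unicode general category "Cc").
c1_controls = [ch for ch in upper[:32] if unicodedata.category(ch) == "Cc"]
print(len(c1_controls))  # 32

# The first few printable positions, by their official character names.
names = {pos: unicodedata.name(bytes([pos]).decode("latin-1"))
         for pos in (160, 161, 162)}
for pos, name in names.items():
    print(pos, name)
```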
The character set used by Windows 9x is neither ASCII nor Latin-1.
It's a nonstandard variant of Latin-1 called Windows-1252. The
ASCII half is the same, as are all of the characters from 160 up.
But in between, instead of more control characters, it puts extra
printable characters such as the O-E ligature, which do not appear
in Latin-1. If you use those characters, then your email is
readable only on another Windows system or on a system which knows
how to convert Windows-1252 to its own character set. The latter
works only if your mail software takes care to mark your message as
Windows-1252, and only if the receiving character set is something
like Unicode, which includes all of the characters in Windows-1252.
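
Here is a small Python sketch of the mismatch, assuming Python's built-in "cp1252" and "latin-1" codecs as stand-ins for the two character sets. The same two bytes come out as the O-E ligatures under Windows-1252 but as invisible control characters under Latin-1.

```python
# The bytes Windows-1252 uses for the upper- and lower-case O-E ligature.
raw = b"\x8c\x9c"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

print([hex(ord(c)) for c in as_cp1252])  # ['0x152', '0x153']
print([hex(ord(c)) for c in as_latin1])  # ['0x8c', '0x9c']

# The ASCII half and everything from 160 up decode identically in both.
same_ascii = bytes(range(128)).decode("cp1252") == bytes(range(128)).decode("latin-1")
same_upper = bytes(range(160, 256)).decode("cp1252") == bytes(range(160, 256)).decode("latin-1")
print(same_ascii, same_upper)  # True True
```

This is why a message has to be labeled with its character set: the bytes alone do not say which interpretation was meant.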
Unicode is the granddaddy of all character sets; it is what
Windows NT, 2000, and XP use internally. Support on other systems
varies, but I have my Linux boxes set up so that I have complete Unicode
support not only in my browser but in all my xterm windows.
Instead of a mere 128 or 256 characters, Unicode has room for over
a million (of which only around 100,000 are currently assigned),
and the goal is to represent every writing system on the planet.
It's not there yet, but it's close enough for just about every
practical use. The first 256 characters are the same as Latin-1,
including the two control areas; the extra characters from Windows-1252
are included, but in different positions outside of the 0-255 range.
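
The numbers in this paragraph can be checked with a short Python sketch. It assumes Python's "latin-1" and "cp1252" codecs. Note that Python's cp1252 codec rejects a handful of positions in 128..159 that Windows-1252 leaves unassigned, so the loop skips those.

```python
import sys

# Total number of Unicode code points: "room for over a million".
print(sys.maxunicode + 1)  # 1114112

# Every Latin-1 position maps to the Unicode code point with the same number.
latin1_matches = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print(latin1_matches)  # True

# Collect the extra Windows-1252 characters from positions 128..159.
extras = {}
for b in range(128, 160):
    try:
        extras[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        pass  # position unassigned in Python's cp1252 table

# All of them live outside the 0-255 range in Unicode.
all_outside = all(ord(ch) > 255 for ch in extras.values())
print(all_outside)  # True
```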
There's so much more in Unicode, though. You want alphabets?
How about Cyrillic, Greek, Arabic, Hebrew, Korean, Tamil, Devanagari,
Gujarati, Thai, and more? How about all of the above in the
same text file, without having to include any sort of meta-textual
instructions for switching fonts? How about the real, full IPA? How about
a bunch of mathematical symbols, just about every imaginable
diacritical mark, the most common Chinese ideographs used in Chinese,
Japanese, and Korean writing, every symbol in Zapf Dingbats (again,
without having to switch fonts), and so much more?
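
As a rough illustration of mixing scripts in one piece of text, here is a Python sketch. The sample characters are arbitrary picks, and the use of UTF-8 as the file encoding is an assumption for the example, not something discussed above.

```python
import unicodedata

# One string holding Cyrillic, Greek, Hebrew, Thai, and IPA (Latin) letters,
# written with escapes so the example itself stays plain ASCII.
sample = "\u044f\u03bb\u05e9\u0e01\u0259"

# The first word of each official character name identifies its script.
scripts = [unicodedata.name(ch).split()[0] for ch in sample]
print(scripts)

# The whole string survives a round trip through a single encoding,
# with no font or mode switches embedded in the data.
round_trip = sample.encode("utf-8").decode("utf-8")
print(round_trip == sample)  # True
```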
You want constructed scripts? There's a private use area where folks
can agree to put things that aren't part of Unicode proper, and there's
a separate "standard" set of assignments within that range called
ConScript which includes Cirth, Tengwar, and KLI pIqaD, among others.
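
For the curious, the main private use area can be located with Python's unicodedata module, which reports general category "Co" for private-use code points. This sketch only looks at the first 65,536 code points and says nothing about which ConScript scripts sit where.

```python
import unicodedata

# Find every private-use code point (category "Co") below U+10000.
private_use = [cp for cp in range(0x10000)
               if unicodedata.category(chr(cp)) == "Co"]

print(hex(private_use[0]), hex(private_use[-1]))  # 0xe000 0xf8ff
print(len(private_use))                           # 6400
```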
As I type this mail I have the ability to include any Unicode
character within it, but of course that doesn't do any good if
the environment in which you read email doesn't support Unicode.
Email is still largely an ASCII medium, although most folks have
support for at least Latin-1 and probably a few other character sets.
Sorry, got a little excited there. The point is, there's no such thing
as 8-bit ASCII - just various and sundry 8-bit character sets based on
ASCII, two of which are Latin-1 and Windows-1252, which are not quite the
same as each other. :)
-Mark