
Re: ASCIIifying

From: Mark J. Reed <markjreed@...>
Date: Tuesday, May 6, 2003, 21:37
On Tue, May 06, 2003 at 04:20:55PM -0400, Robert B Wilson wrote:
> hmm... it's called latin-1 (or western european) everywhere in windows...
> there is a windows-1252, but that seems to have the same character values
> as the "ascii" in the qbasic help file...
> i guess i shouldn't trust microsoft at all (instead of trusting them
> about as much as i trust bill clinton...)
Technically, ASCII only defines 128 characters, numbered 0 through 127, of which only 95 are "printable" - letters, numbers, punctuation, and such. (Though some folks include tab, newline, and carriage return in the "printable" group, since they have visible effects within text files.) The rest are called "control characters", and mostly have metatextual functions originally geared toward issuing mechanical controls to teletypewriter terminals, separating records on magnetic tape, etc.

Looking at your Windows Character Map accessory, printable ASCII starts with the space character (position 32), followed by !, ", #, $, etc., all the way up to ~ (position 126). The last position, 127, is another control character, DELETE.

"Latin-1" is a nickname for the ISO-8859-1 character set (International Organization for Standardization standard 8859, part 1). Each of the character sets within ISO-8859 defines 256 positions, exactly twice as many as ASCII, and they all have in common that the first 128 are the same as ASCII. They have another feature in common, which is that the first 32 positions of the second half - that is, positions 128 through 159 - are designated as more control characters. The printable characters start at 160. In Latin-1, position 160 is a "non-breaking space" - it looks like a space, but also conveys "don't break across lines here". I think that's true of all the ISO-8859 character sets, actually. Latin-1 continues with the inverted exclamation point (¡) in position 161, the cent sign (¢ - a terrible oversight that it was left out of ASCII) in position 162, etc.

The character set used by Windows 9x is neither ASCII nor Latin-1. It's a nonstandard variant of Latin-1 called Windows-1252. The ASCII half is the same, as are all of the characters from 160 up. But in between, instead of more control characters, it puts extra printable characters, such as the O-E ligature, which do not appear in Latin-1.
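
To make the layout above concrete, here is a rough sketch in Python. It is only an illustration, not anything from the original discussion: the codec names ("ascii", "latin-1", "cp1252") are Python's own labels for these character sets, and the byte 0x9C (position 156) and the helper is_ascii are picked purely as examples.

```python
# Illustrative sketch only; the sample byte and the helper name are assumptions.

# ASCII covers positions 0-127; positions 32 through 126 are the printable ones.
printable = [chr(i) for i in range(32, 127)]
control = [i for i in range(128) if i < 32 or i == 127]

# The same byte means different things in the two 8-bit sets.
raw = bytes([0x9C])                  # position 156, inside the 128-159 block
as_latin1 = raw.decode("latin-1")    # a control character in Latin-1
as_cp1252 = raw.decode("cp1252")     # the lowercase o-e ligature in Windows-1252

# From position 160 up, the two sets agree.
upper_agree = all(
    bytes([b]).decode("latin-1") == bytes([b]).decode("cp1252")
    for b in range(160, 256)
)


def is_ascii(data: bytes) -> bool:
    """Return True if every byte falls within ASCII's 0-127 range."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(len(printable), len(control))   # 95 printable, 33 control
print(repr(as_latin1), repr(as_cp1252))
print(upper_agree, is_ascii(b"hello"), is_ascii(raw))
```

The byte at position 156 is the interesting case: it is one of the Windows-1252 extras that sit where Latin-1 has control characters.
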
If you use those characters, then your email is only readable on another Windows system or, provided your mail software takes care to mark your message as Windows-1252, on a system which knows how to convert Windows-1252 to its own character set. That in turn assumes its character set is something like Unicode, which includes all of the characters in Windows-1252 in the first place.

Unicode is the granddaddy of all character sets; it is what Windows NT, 2000, and XP use internally. Support on other systems varies, but I have my Linux boxes set up so that I have complete Unicode support not only in my browser but in all my xterm windows. Instead of a mere 128 or 256 characters, Unicode has room for over a million (of which only around 100,000 are currently assigned), and the goal is to represent every writing system on the planet. It's not there yet, but it's close enough for just about every practical use. The first 256 characters are the same as Latin-1, including the two control areas; the extra characters from Windows-1252 are included, but in different positions outside of the 0-255 range.

There's so much more in Unicode, though. You want alphabets? How about Cyrillic, Greek, Arabic, Hebrew, Korean, Tamil, Devanagari, Gujarati, Thai, and more? How about all of the above in the same text file, without having to include any sort of meta-textual instructions for switching fonts? How about the real, full IPA? How about a bunch of mathematical symbols, just about every imaginable diacritical mark, the most common ideographs used in Chinese, Japanese, and Korean writing, every symbol in Zapf Dingbats (again, without having to switch fonts), and so much more?

You want constructed scripts? There's a private use area where folks can agree to put things that aren't part of Unicode proper, and there's a separate "standard" set of assignments within that range called ConScript which includes Cirth, Tengwar, and KLI pIqaD, among others.
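
A few of these claims can be checked with a short Python sketch. Again, this is just an illustration under stated assumptions: it relies on Python's built-in "latin-1" and "cp1252" codecs and its unicodedata module, and the variable names are made up for the example. Python's cp1252 table leaves a handful of positions in the 128-159 block unassigned, so the loop skips those.

```python
# Illustrative sketch only; variable names are assumptions for this example.
import sys
import unicodedata

# The first 256 Unicode code points line up with Latin-1 position for position.
latin1_matches = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(256)
)

# The Windows-1252 extras (positions 128-159) map to code points above 255.
extras = {}
for b in range(128, 160):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # unassigned in Python's cp1252 table
    extras[b] = ord(ch)

# Several scripts in one string, with no font-switching instructions needed.
mixed = "Latin, Ελληνικά, Кириллица, עברית, ไทย"
round_trip = mixed.encode("utf-8").decode("utf-8")

# The main private use area starts at U+E000; its general category is "Co".
private_use_category = unicodedata.category(chr(0xE000))

# Total number of code points Unicode has room for.
total_code_points = sys.maxunicode + 1

print(latin1_matches, all(cp > 255 for cp in extras.values()))
print(hex(extras[0x9C]), unicodedata.name(chr(extras[0x9C])))
print(round_trip == mixed, private_use_category, total_code_points)
```

Whether the mixed-script string displays properly still depends on the fonts and terminal at hand; the sketch only shows that a single text encoding can carry all of it.
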
As I type this mail I have the ability to include any Unicode character within it, but of course that doesn't do any good if the environment in which you read email doesn't support Unicode. Email is still largely an ASCII medium, although most folks have support for at least Latin-1 and probably a few other character sets.

Sorry, got a little excited there. The point is, there's no such thing as 8-bit ASCII - just various and sundry 8-bit character sets based on ASCII, two of which are Latin-1 and Windows-1252, which are not quite the same as each other. :)

-Mark

Replies

Tristan McLeay <kesuari@...>
taliesin the storyteller <taliesin@...>