Re: TECH: Unicode Private Use Area
From: Philip Newton <philip.newton@...>
Date: Saturday, February 2, 2008, 10:14
On Feb 2, 2008 9:54 AM, David J. Peterson <dedalvs@...> wrote:
> This is how I understand the private use area:
I'll reply with my understanding, which may or may not be correct.
> (1) Unicode has designated that #'s X through X+n are for private
> use, and won't be used for natural language scripts by Unicode
> at any time in the future.
Yep. (The ranges are U+E000-U+F8FF, U+F0000-U+FFFFD, and U+100000-U+10FFFD.)
(ISO 10646 had even more such private use codepoints, including entire
groups of planes, but dropped them when it synchronised with Unicode,
which meant that the highest possible codepoint was 0x10FFFF.)
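As a rough illustration (a sketch only, not anything taken from the
Standard's text), the three ranges can be checked in a few lines of
Python. The function and constant names are made up for this example;
the only library fact assumed is that Python's unicodedata module
reports the general category "Co" for private-use code points.

```python
import unicodedata

# The three private-use ranges listed above (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def is_private_use(code_point: int) -> bool:
    """Return True if code_point falls inside one of the ranges above."""
    return any(lo <= code_point <= hi for lo, hi in PRIVATE_USE_RANGES)


def is_private_use_by_category(code_point: int) -> bool:
    """Same question, answered from the Unicode Character Database."""
    return unicodedata.category(chr(code_point)) == "Co"


if __name__ == "__main__":
    for cp in (0x0061, 0xE123, 0xF8FF, 0xFFFFE, 0x10FFFD):
        print(f"U+{cp:04X}", is_private_use(cp), is_private_use_by_category(cp))
```

Both checks should agree; note that U+FFFFE and U+FFFFF (and likewise
U+10FFFE and U+10FFFF) are noncharacters, which is why the ranges stop
at ...FFFD.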
"Characters in these areas will never be defined by the Unicode
Standard." (This and other quotations will be from "The Unicode
Standard 4.0", hereinafter TUS4, which is the newest version I have in
book form. This is from section 2.4, page 26.)
"All code points in the blocks of private-use characters in the
Unicode Standard are permanently designated for private use--no
assignment to a particular, standard set of characters will ever be
endorsed or documented by the Unicode Consortium for any of these code
points." (TUS4, s15.7 p398)
> (2) If you're a conlanger and have created a brand new script,
> you can go to Unicode, pick out some private use unicode #'s,
> and then map your script characters to those numbers.
Pretty much.
Though private use codepoints only work for data transmission if both
the sender and the recipient agree on the meaning of the code points.
"These code points can be freely used for characters of any purpose,
but successful interchange requires an agreement between sender and
receiver on their interpretation." (TUS4 s2.4 p26)
"[Their] use may be determined by private agreement among cooperating
users." (TUS4 s15.7 p398)
Interestingly, the Unicode Standard does provide for a rough partition
of the Private Use Area U+E000--U+F8FF:
"By convention, the primary Private Use Area is divided into a
corporate use subarea for platform writers, starting at U+F8FF and
extending downward in values, and an end user subarea, starting at
U+E000 and extending upward." (TUS4 s15.7 p398)
(For example, some Apple fonts encode the Apple logo--which was
historically often included in Mac-specific charsets--in the corporate
use subarea.)
It's only a convention, though, and anybody can assign any meaning to
any private-use codepoint.
> If this is accurate, then:
>
> (a) What happens if two conlangers with two separate scripts
> choose the same unicode numbers, in ignorance of one another's
> scripts?
Nothing, as long as nobody ever tries to interchange data that could
be one script or another.
It's all about cooperating partners -- if conlanger A sends data to
user X, he'll have to tell him that U+E123 is WONDERFUL-AUXLANG LETTER
ABC, and if conlanger B sends data to user X, he'll have to tell him
that U+E123 is GREAT-ENGELANG SYLLABLE SKLUMPTH.
If anyone tries to compose a document containing some text in
WonderfulAuxlang and some in GreatEngelang, he'll have a problem,
though.
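To make the clash concrete, here is a small sketch in Python. The two
tables stand in for the "private agreements" of the example above; the
character names are the invented ones from this message, and the helper
names are purely hypothetical.

```python
# Two hypothetical private agreements that happen to pick the same code point.
WONDERFUL_AUXLANG = {0xE123: "WONDERFUL-AUXLANG LETTER ABC"}
GREAT_ENGELANG = {0xE123: "GREAT-ENGELANG SYLLABLE SKLUMPTH"}


def clashes(a: dict, b: dict) -> list:
    """Return the code points that both agreements assign."""
    return sorted(a.keys() & b.keys())


def interpret(text: str, agreement: dict) -> list:
    """Name each character of text according to one private agreement."""
    return [
        agreement.get(ord(ch), f"U+{ord(ch):04X} (not covered by agreement)")
        for ch in text
    ]


if __name__ == "__main__":
    data = "\ue123"
    # The same data means different things under each agreement.
    print(interpret(data, WONDERFUL_AUXLANG))
    print(interpret(data, GREAT_ENGELANG))
    print([f"U+{cp:04X}" for cp in clashes(WONDERFUL_AUXLANG, GREAT_ENGELANG)])
```

The data itself carries no hint about which table applies; that
information has to travel out of band, which is exactly why a mixed
document is a problem.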
What they could try is to get their script registered in the ConScript
Unicode Registry ( http://www.evertype.com/standards/csur/ ), run by
Michael Everson and, formerly, also by John Cowan; it's a collection
of script ranges for conscripts, intended to avoid subrange clashing.
However, it's definitely non-normative, at least to the extent that
it's not endorsed by the Unicode Consortium, which means that anyone
can use ranges which clash with CSUR allocations -- it's just a
suggestion for people who want to use it to cooperate.
Also, CSUR is (obviously to me) not a good choice for every little
conscript someone comes up with, since it'd fill up pretty quickly if
every neography out there were registered. Only scripts with some use
-- preferably for interchange, rather than only for their creator's
private use -- should IMO start an attempt to be registered there.
(Again, though, you can ignore CSUR and simply say that
WonderfulAuxlang will use, for example, the block U+E120-U+E13F.)
> (b) Does this mean that if one creates a unicode-compliant font
> with one's script in the unicode #'s one has designed, then one
> can use the code to display one's script on a website?
Yes. (Provided your web browser and/or your operating system's text
output supports the code points -- which may be a problem especially
for Planes 15 and 16, which need more than 16 bits to encode them.)
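A quick way to see the "more than 16 bits" point, as a sketch in Python
(the helper name is made up for this example): a code point in
U+E000-U+F8FF fits in a single 16-bit UTF-16 code unit, while the
supplementary private-use code points need a surrogate pair.

```python
def utf16_units(code_point: int) -> int:
    """Number of 16-bit code units needed to encode code_point in UTF-16."""
    return len(chr(code_point).encode("utf-16-be")) // 2


if __name__ == "__main__":
    for cp in (0xE123, 0xF0000, 0x10FFFD):
        encoded = chr(cp).encode("utf-16-be")
        print(f"U+{cp:04X}: {utf16_units(cp)} code unit(s), bytes {encoded.hex()}")
```

Software that quietly assumes one 16-bit unit per character is the
kind that tends to trip over the supplementary ranges.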
> If (b) is the case, then...
>
> (c) Wouldn't a viewer of your site have to download your new
> unicode compliant font to view the page correctly?
Yes.
(Or make their own, or download someone else's Unicode-compliant font
that maps "your" codepoints to the appropriate glyphs.)
> And if that's
> the case, then what's the difference between that and just creating
> a regular old font with regular old mappings (e.g., the a keystroke
> = your glyph for the vowel [a:], or whatever), and making viewers
> of your website download *that* font to view your page?
Technically, none.
Semantically, though, U+0061 is defined by Unicode to be LATIN SMALL
LETTER A, with specific attributes, such as belonging to the Latin
alphabet, being a letter (and not punctuation or a digit or a dingbat
or ...), having an upper-case mapping to U+0041, etc.
If you map U+0061 to your conlang glyph, you're essentially claiming
that it's a Latin small letter a, which isn't the case.
It's -- how shall I compare it? Perhaps like using the HTML tag "ul"
(unordered list) merely to indent a paragraph, or "em" (emphasis)
merely to italicise something. Or like using the capital letter O
instead of digit 0 (perhaps because you grew up on typewriters where
this was necessary since they had no zero key). The visual effect may
be what you desire but the semantics are wrong.
Whereas if you map U+E123 to your conlang glyph, you're explicitly
saying, "This is a character with a meaning assigned by me, not by the
Unicode Consortium".
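The difference in semantics can be inspected directly. The following
Python sketch uses the standard unicodedata module; the helper name
"properties" is just for illustration. U+0061 comes with a name, a
category and a case mapping, whereas U+E123 has only the category
"private use" and nothing else.

```python
import unicodedata


def properties(ch: str) -> dict:
    """Collect a few standard-defined properties of a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, None),      # None if the standard defines no name
        "category": unicodedata.category(ch),    # "Ll" = lowercase letter, "Co" = private use
        "uppercase": f"U+{ord(ch.upper()):04X}",
    }


if __name__ == "__main__":
    print(properties("a"))
    print(properties("\ue123"))
```

Any software that relies on such properties (case conversion, word
breaking, spell checking, searching) will treat a conlang glyph
squatting on U+0061 as a Latin letter, whether you like it or not.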
> Are we
> hoping that one day some future version of unicode is going to
> support our personal scripts, or something?
No, I think it's just a matter of saying that we support Unicode and
want U+0061 to mean LATIN SMALL LETTER A, and if we need a separate
glyph which doesn't have the same semantics as LATIN SMALL LETTER A,
we don't re-use that code point. It's more a philosophical thing.
"Any prior use of a character as a private-use character has no
direct bearing on any eventual encoding decisions regarding whether and
how to encode that character." (TUS4, s15.7 p398) Which I read as
meaning that encoding your character as a private-use character is
neither beneficial nor detrimental to a later effort to get it encoded
officially in Unicode. If you want to get it into Unicode officially,
you'll still have to "follow the normal process for encoding of new
characters or scripts." (ibid.)
It may be interesting to note that a couple of scripts (Shavian and
Deseret) eventually moved from CSUR to being official Unicode
scripts.
It may or may not be significant that both were used to write a
natlang (English). Though Tolkien's Tengwar and Cirth scripts have
made it into the Unicode standardisation pipeline, and it's not
completely impossible they'll eventually be standardised, particularly
if the Unicode Consortium can be convinced that there are people who
have a legitimate use for those characters, i.e. they actively
interchange data using those characters. (Which is one thing that
killed Klingon, for example: it was proposed for addition, but IIRC
the consortium said that nearly everyone writing in Klingon used the
Latin alphabet to do so, not pIqaD. A bit of a chicken-and-egg
problem, to be sure, since it's difficult to write in it without good
font, display, and keyboard-input support, and if you want to create a
font and a keyboard mapping, you have to decide which code points to
use for it.)
Cheers,
--
Philip Newton <philip.newton@...>