
Re: TECH: Unicode Private Use Area

From: Philip Newton <philip.newton@...>
Date: Saturday, February 2, 2008, 10:14
On Feb 2, 2008 9:54 AM, David J. Peterson <dedalvs@...> wrote:
> This is how I understand the private use area:
I'll reply with my understanding, which may or may not be correct.
> (1) Unicode has designated that #'s X through X+n are for private use, and won't be used for natural language scripts by Unicode at any time in the future.
Yep. (The ranges are U+E000-U+F8FF, U+F0000-U+FFFFD, and U+100000-U+10FFFD.) (ISO 10646 had even more such private use codepoints, including entire groups of planes, but dropped them when it synchronised with Unicode, which meant that the highest possible codepoint was 0x10FFFF.) "Characters in these areas will never be defined by the Unicode Standard." (This and other quotations will be from "The Unicode Standard 4.0", hereinafter TUS4, which is the newest version I have in book form. This is from section 2.4, page 26.) "All code points in the blocks of private-use characters in the Unicode Standard are permanently designated for private use--no assignment to a particular, standard set of characters will ever be endorsed or documented by the Unicode Consortium for any of these code points." (TUS4, s15.7 p398)
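As a rough illustration of the three ranges listed above, here is a minimal Python sketch. The helper name `is_private_use` is my own invention, not anything defined by the standard. The loop at the end cross-checks the hard-coded ranges against the general category "Co" (private use) that Python's `unicodedata` module reports.

```python
import unicodedata

# The three private-use ranges quoted above (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # primary Private Use Area (in the BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]


def is_private_use(code_point: int) -> bool:
    """Return True if code_point lies in one of the private-use ranges.

    Hypothetical helper for illustration only.
    """
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


# Cross-check against the character database shipped with Python:
# private-use code points carry the general category "Co".
for cp in (0x0061, 0xE000, 0xE123, 0xF8FF, 0xF0000, 0xFFFFE, 0x10FFFD):
    assert is_private_use(cp) == (unicodedata.category(chr(cp)) == "Co")
```

Note that U+FFFFE and U+FFFFF (and likewise U+10FFFE and U+10FFFF) are not private-use code points, which is why the supplementary ranges end in ...FFFD.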
> (2) If you're a conlanger and have created a brand new script, you can go to Unicode, pick out some private use unicode #'s, and then map your script characters to those numbers.
Pretty much. Though private use codepoints only work for data transmission if both the sender and the recipient agree on the meaning of the code points. "These code points can be freely used for characters of any purpose, but successful interchange requires an agreement between sender and receiver on their interpretation." (TUS4 s2.4 p26) "[Their] use may be determined by private agreement among cooperating users." (TUS4 s15.7 p398) Interestingly, the Unicode Standard does provide for a rough partition of the Private Use Area U+E000--U+F8FF: "By convention, the primary Private Use Area is divided into a corporate use subarea for platform writers, starting at U+F8FF and extending downward in values, and an end user subarea, starting at U+E000 and extending upward." (TUS4 s15.7 p398) (For example, some Apple fonts encode the Apple logo--which was historically often included in Mac-specific charsets--in the corporate use subarea.) It's only a convention, though, and anybody can assign any meaning to any private-use codepoint.
> If this is accurate, then:
>
> (a) What happens if two conlangers with two separate scripts choose the same unicode numbers, in ignorance of one another's scripts?
Nothing, as long as nobody ever tries to interchange data that could be one script or another. It's all about cooperating partners -- if conlanger A sends data to user X, he'll have to tell him that U+E123 is WONDERFUL-AUXLANG LETTER ABC, and if conlanger B sends data to user X, he'll have to tell him that U+E123 is GREAT-ENGELANG SYLLABLE SKLUMPTH. If anyone tries to compose a document containing some text in WonderfulAuxlang and some in GreatEngelang, he'll have a problem, though. What they could try is to get their script registered in the ConScript Unicode Registry ( ), run by Michael Everson and, formerly, also by John Cowan; it's a collection of script ranges for conscripts, intended to avoid subrange clashing. However, it's definitely non-normative, at least to the extent that it's not endorsed by the Unicode Consortium, which means that anyone can use ranges which clash with CSUR allocations -- it's just a suggestion for people who want to use it to cooperate. Also, CSUR is (obviously to me) not a good choice for every little conscript someone comes up with, since it'd fill up pretty quickly if every neography out there were registered. Only scripts with some use -- preferably for interchange, rather than only for their creator's private use -- should IMO start an attempt to be registered there. (Again, though, you can ignore CSUR and simply say that WonderfulAuxlang will use, for example, the block U+E120-U+E13F.)
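To make the clash scenario concrete, here is a small Python sketch of the kind of overlap check a registry (or two cooperating conlangers) might run by hand. Everything in it is assumed for illustration: the function names, the GreatEngelang and ThirdScript blocks, and the idea of keeping allocations in a dictionary. Only the WonderfulAuxlang block U+E120-U+E13F comes from the example above. It does not reflect how CSUR itself is maintained.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, int]  # inclusive (first, last) code points


def blocks_overlap(a: Block, b: Block) -> bool:
    """Two inclusive ranges overlap if each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_clashes(allocations: Dict[str, Block]) -> List[Tuple[str, str]]:
    """Return every pair of script names whose blocks share a code point."""
    names = sorted(allocations)
    clashes = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if blocks_overlap(allocations[first], allocations[second]):
                clashes.append((first, second))
    return clashes


# Hypothetical allocations; only the first block is taken from the text.
allocations = {
    "WonderfulAuxlang": (0xE120, 0xE13F),
    "GreatEngelang": (0xE100, 0xE17F),   # assumed; contains U+E123 too
    "ThirdScript": (0xE200, 0xE22F),     # assumed; clashes with nobody
}

print(find_clashes(allocations))
# [('GreatEngelang', 'WonderfulAuxlang')]
```

A clash reported here only matters once someone wants both scripts in the same document or data stream; used separately, the two allocations coexist without trouble.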
> (b) Does this mean that if one creates a unicode-compliant font with one's script in the unicode #'s one has designed, then one can use the code to display one's script on a website?
Yes. (Provided your web browser and/or your operating system's text output supports the code points -- which may be a problem especially for Planes 15 and 16, which need more than 16 bits to encode them.)
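The "more than 16 bits" point can be seen by looking at how the code points come out in UTF-16. The sketch below uses a helper name of my own, `utf16_code_units`. A code point in the primary Private Use Area fits in one 16-bit unit, while one in the supplementary private-use planes needs a surrogate pair, which is exactly what older software sometimes mishandled.

```python
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Return the 16-bit code units UTF-16 uses for a single code point.

    Hypothetical helper for illustration only.
    """
    data = chr(code_point).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for cp in (0xE123, 0xF0000, 0x10FFFD):
    units = " ".join(f"{unit:04X}" for unit in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
# U+E123 -> E123
# U+F0000 -> DB80 DC00
# U+10FFFD -> DBFF DFFD
```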
> If (b) is the case, then...
>
> (c) Wouldn't a viewer of your site have to download your new unicode compliant font to view the page correctly?
Yes. (Or make their own, or download someone else's Unicode-compliant font that maps "your" codepoints to the appropriate glyphs.)
> And if that's the case, then what's the difference between that and just creating a regular old font with regular old mappings (e.g., the a keystroke = your glyph for the vowel [a:], or whatever), and making viewers of your website download *that* font to view your page?
Technically, none. Semantically, though, U+0061 is defined by Unicode to be LATIN SMALL LETTER A, with specific attributes, such as belonging to the Latin alphabet, being a letter (and not punctuation or a digit or a dingbat or ...), having an upper-case mapping to U+0041, etc. If you map U+0061 to your conlang glyph, you're essentially claiming that it's a Latin small letter a, which isn't the case. It's -- how shall I compare it? Perhaps like using the HTML tag "ul" (unordered list) merely to indent a paragraph, or "em" (emphasis) merely to italicise something. Or like using the capital letter O instead of digit 0 (perhaps because you grew up on typewriters where this was necessary since they had no zero key). The visual effect may be what you desire but the semantics are wrong. Whereas if you map U+E123 to your conlang glyph, you're explicitly saying, "This is a character with a meaning assigned by me, not by the Unicode Consortium".
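The semantic difference shows up directly in the character properties that software consults. As a rough sketch (the `describe` helper is hypothetical, and the exact set of properties shown is my choice), Python's `unicodedata` module reports a name, a letter category, and an upper-case mapping for U+0061, but nothing beyond "private use" for U+E123:

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few of the properties Unicode attaches to one character.

    Hypothetical helper for illustration only.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, None),       # None if no name is assigned
        "category": unicodedata.category(ch),     # e.g. "Ll" or "Co"
        "uppercase": ch.upper(),
    }


print(describe("a"))
# {'code_point': 'U+0061', 'name': 'LATIN SMALL LETTER A', 'category': 'Ll', 'uppercase': 'A'}

pua = describe("\ue123")
print(pua["code_point"], pua["name"], pua["category"])
# U+E123 None Co
```

Any program that sorts, case-folds, or spell-checks text will treat a glyph parked on U+0061 as a Latin "a", whatever the font draws; a glyph on U+E123 is left alone.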
> Are we hoping that one day some future version of unicode is going to support our personal scripts, or something?
No, I think it's just a matter of saying that we support Unicode and want U+0061 to mean LATIN SMALL LETTER A, and if we need a separate glyph which doesn't have the same semantics as LATIN SMALL LETTER A, we don't re-use that code point. It's more a philosophical thing. "Any prior use of a character as a private-use character has no direct bearing on any eventual encoding decisions regarding whether and how to encode that character." (TUS4, s15.7 p398) Which I read as meaning that encoding your character as a private-use character is neither beneficial nor detrimental to a later effort to get it encoded officially in Unicode. If you want to get it into Unicode officially, you'll still have to "follow the normal process for encoding of new characters or scripts." (ibid.) It may be interesting to note that a couple of scripts (Shavian and Deseret) eventually moved from CSUR to being official Unicode scripts. It may or may not be significant that both were used to write a natlang (English). Though Tolkien's Tengwar and Cirth scripts have made it into the Unicode standardisation pipeline, and it's not completely impossible they'll eventually be standardised, particularly if the Unicode Consortium can be convinced that there are people who have a legitimate use for those characters, i.e. they actively interchange data using those characters. (Which is one thing that killed Klingon, for example: it was proposed for addition, but IIRC the consortium said that nearly everyone writing in Klingon used the Latin alphabet to do so, not pIqaD. A bit of a chicken-and-egg problem, to be sure, since it's difficult to write in it without good font, display, and keyboard-input support, and if you want to create a font and a keyboard mapping, you have to decide which code points to use for it.)

Cheers,
--
Philip Newton <philip.newton@...>

