From: Mark J. Reed <markjreed@...>
Date: Wednesday, March 28, 2007, 13:17
On 3/28/07, John Vertical <johnvertical@...> wrote:
> I think we're talking about different levels of abstraction. Certainly
> Unicode does not define any characters in the terms of how *exactly* they
> look like - but they do have an underlying geometrical structure.
Unicode specifically defines abstract characters, not glyphs; it's
data, not presentation. Roman, Cyrillic, and Greek uppercase A may all
look alike, but they are in fact different pieces of data. Their
actual appearance is technically irrelevant (although of course the
appearance of glyphs was involved at some level in the decisions made
about whether and how to include characters).
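
To make the data-versus-presentation point concrete, here is a minimal Python sketch using the standard unicodedata module. The names lookalikes and describe are my own, purely for illustration; the three code points are the Latin, Greek, and Cyrillic capital A.

```python
import unicodedata

# Latin, Greek, and Cyrillic capital A: usually one shape, three characters.
lookalikes = ["\u0041", "\u0391", "\u0410"]

def describe(ch):
    """Return (code point, Unicode name, code point of the lowercase form)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), f"U+{ord(ch.lower()):04X}")

for ch in lookalikes:
    print(*describe(ch))

# No two of them are the same data, and each has its own lowercase partner.
assert len(set(lookalikes)) == 3
assert len({ch.lower() for ch in lookalikes}) == 3
```

Nothing in that script consults a font; the identities and the case mappings come from the character data alone.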
> Or would you say that the star symbol you get when pressing the, say, "r" key while
> using a dingbats font is still, on some abstract level, a letter "r"?
Absolutely not. Neither how you enter the character on a keyboard (or
even whether you can) nor what it looks like on the screen has anything
to do with its Unicode identity. U+0101 LATIN SMALL LETTER A WITH MACRON
is that letter whether you type it via option-_ a or
alt-0257 or control-k a - or whatever, and irrespective of what it
looks like on screen, even if it just shows up as a ? or "unknown
character" glyph because your font doesn't have it.
A character is not the same as a code point, however. Even though the
sequence U+0061 LATIN SMALL LETTER A followed by U+0304 COMBINING
MACRON is distinct codewise from the above, compliant Unicode software
is required to treat them as representing the same "character". But
that semantics is defined in the standard - U+0101 only exists for
round-trip compatibility with other character sets where that sequence
exists as a single character. It's not based on appearance. If
anything, the cause/effect relationship works the other way: the
appearance should be the same because the underlying abstract
character is the same (although many implementations fail to handle
combining characters appropriately, so the appearance is not the
same in practice).
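
The precomposed-versus-sequence equivalence described above can be checked with a short Python sketch. It assumes only the standard unicodedata module; the helper name canonically_equivalent is hypothetical, and comparing NFD forms is one common way to test canonical equivalence, not the only one.

```python
import unicodedata

precomposed = "\u0101"        # LATIN SMALL LETTER A WITH MACRON
decomposed = "\u0061\u0304"   # LATIN SMALL LETTER A + COMBINING MACRON

def canonically_equivalent(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# As raw code point sequences the two differ...
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# ...but they are canonically equivalent, and each form converts to the other.
assert canonically_equivalent(precomposed, decomposed)
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Note that a plain == on the strings says "different"; software has to normalize first if it wants to honor the equivalence.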
> It completely depends on how do you define "alloglyph". "Same Unicode
> entity" would be circularish logic, and dependant on the font anyway.
No, it's not. Unicode has nothing to do with fonts!
Consider these code points:
(1) U+0063 LATIN SMALL LETTER C
(2) U+00E7 LATIN SMALL LETTER C WITH CEDILLA
(3) U+0326 COMBINING COMMA BELOW
(4) U+0327 COMBINING CEDILLA
By definition, the sequence (1) + (4) represents the same character as
(2) by itself. But (1) + (3) is a completely different character.
Even if it happens to look the same on your screen.
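
A quick Python check of those four code points, as a sketch using the standard unicodedata module (the helper nfc and the variable names are mine, for illustration only):

```python
import unicodedata

def nfc(s):
    """Canonical composition: combine base + mark where Unicode defines it."""
    return unicodedata.normalize("NFC", s)

c_cedilla_seq = "\u0063\u0327"   # (1) + (4)
c_comma_seq = "\u0063\u0326"     # (1) + (3)
c_cedilla = "\u00E7"             # (2)

# (1) + (4) composes to (2): same character.
assert nfc(c_cedilla_seq) == c_cedilla

# (1) + (3) does not: it is a different character, however it is drawn.
assert nfc(c_comma_seq) != c_cedilla

# There is no precomposed "c with comma below", so it stays two code points.
assert nfc(c_comma_seq) == c_comma_seq
```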
The Romanian characters representing /S/ and /ts/ are a little more
problematic; the Romanian Academy considers use of the cedilla
incorrect, but the original Unicode standard did not include
precomposed characters for S and T with comma below, only with
cedilla. But that was remedied in 1999, so now we have all of the
following:
(5) U+0053 LATIN CAPITAL LETTER S
(6) U+0054 LATIN CAPITAL LETTER T
(7) U+0073 LATIN SMALL LETTER S
(8) U+0074 LATIN SMALL LETTER T
(9) U+015E LATIN CAPITAL LETTER S WITH CEDILLA
(10) U+015F LATIN SMALL LETTER S WITH CEDILLA
(11) U+0162 LATIN CAPITAL LETTER T WITH CEDILLA
(12) U+0163 LATIN SMALL LETTER T WITH CEDILLA
(13) U+0218 LATIN CAPITAL LETTER S WITH COMMA BELOW
(14) U+0219 LATIN SMALL LETTER S WITH COMMA BELOW
(15) U+021A LATIN CAPITAL LETTER T WITH COMMA BELOW
(16) U+021B LATIN SMALL LETTER T WITH COMMA BELOW
(3) U+0326 COMBINING COMMA BELOW
(4) U+0327 COMBINING CEDILLA
(5) + (4) is the same character as (9). (Hey, math works. :))
(5) + (3) is the same character as (13). (Sometimes.)
But (5)+(4) is not the same as either (5)+(3) or (13), nor is (5)+(3)
the same as (9).
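
The S and T equivalences can be tabulated the same way. This is a sketch with Python's standard unicodedata module; compose, CEDILLA, and COMMA_BELOW are names I made up for the example.

```python
import unicodedata

CEDILLA = "\u0327"       # (4) COMBINING CEDILLA
COMMA_BELOW = "\u0326"   # (3) COMBINING COMMA BELOW

def compose(base, mark):
    """Return the NFC form of base letter + combining mark."""
    return unicodedata.normalize("NFC", base + mark)

# Print the precomposed character each base + mark sequence is equivalent to.
for base in "STst":
    for mark in (CEDILLA, COMMA_BELOW):
        ch = compose(base, mark)
        print(f"{base} + U+{ord(mark):04X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")

assert compose("S", CEDILLA) == "\u015E"       # (5) + (4) is (9)
assert compose("S", COMMA_BELOW) == "\u0218"   # (5) + (3) is (13)
assert compose("S", CEDILLA) != compose("S", COMMA_BELOW)
```

The two marks never collapse into each other under normalization, which is the distinction the standard draws.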
So I think it's pretty clear that UNICODE most definitely
distinguishes between cedilla and comma below. I thought the question
before us concerned the historical validity of making that
distinction, not the fact that it is made (enforced politically,
Mark J. Reed <markjreed@...>
> usage", yes, but likewise c with caron, tee-esh digraph, etc; and I think
> adding "same origin" to that would giv an answer of "no", tho the diareses /
> umlaut precedent suggests that this combo is not the way to go. I see it
> coming down to "variations on a basic shape", and at that point my answer is
> "no". You can certainly use them as allo*graphs* (representations of the
> same sound), but they are no more allo*glyphs* than o-with-umlaut and "oe"
> are (synchronically.)
> >As for the three A's, they come from three separate alphabets. It
> >would be quite odd to mix alphabets within a single word.
> Not too long ago, letters like o-with-umlaut vs. n-with-tilde were also
> considered to come from separate alphabets (eg German vs. Spanish), and
> arguably they still are. Yet nobody says you cannot use both when devising a
> writing system for a language. Also, Japanese, IPA, etc.
> >More importantly, you would lose the ability to map each letter to its
> >lowercase equivalent.
> Which is probably the main reason the alphabets are considered distinct, tho
> I wonder how often it would come in handy. The distinction between different
> upper case <> lower case mappings, especially.
> John Vertical