From: Mark J. Reed <markjreed@...>
Date: Monday, April 16, 2007, 19:46
On 4/14/07, John Vertical <johnvertical@...> wrote:
> I think I have implied over the discussion that I'm not familiar with the
> details of the Unicode terminology.
Which is, I think, the only thing I've been talking about. ):
> IMO all of those are still the same "glyphs". Okay, let's say "letter
> shapes" instead. For example, "A" is the letter shape that consists of two
> lines that meet at the top and a horizontal line between them. But notice
> that this is still, despite allowing for different particular written forms,
> still a definition that's grounded in the actual appearance of the letter.
> (So it is certainly abstract in the sense that it's not a physical entity,
> but it's not abstract in the sense of not being even a possible property of
> physical entities.)
The term "abstract" in the Unicode use of "abstract characters" is not
meant to be interpreted in the latter sense.
> "LATIN SMALL LETTER A WITH MACRON" is not
> the same string as "LATIN SMALL LETTER A followed by COMBINING MACRON"
Right. Different sequences of code points.
> I'm still not sure whether there exists some level of coding, in the
> font renderer or thereabout, where the latter two will be necessarily
> treated equally but the A's still kept distinct.
The abstract character equivalence means that a Unicode-compliant
string comparison operation is required to treat them as equal. So if
you're in a text editor and you do a regular expression search that
matches one of those two sequences, it must also match the other.
Whether or not there is any point at which they are actually stored
the same is implementation-dependent; there are several normalization
forms to choose from, composed (such as NFC) or decomposed (such as NFD).
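To make that concrete, here is a minimal sketch in Python using the
standard unicodedata module. The helper name canonically_equal is my own
invention for illustration, and normalizing both sides to NFD is just one
way an implementation might get canonical-equivalence comparison:

```python
import unicodedata

precomposed = "\u0101"   # LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"   # LATIN SMALL LETTER A followed by COMBINING MACRON

def canonically_equal(s, t):
    """Compare two strings after normalizing both to NFD (hypothetical helper)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

# A plain comparison looks only at the code point sequences, which differ.
raw_equal = precomposed == decomposed

# After normalization the two sequences compare as the same text.
normalized_equal = canonically_equal(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

A naive == sees two different strings; the normalized comparison sees one
abstract character either way.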
> * C with cedilla and C with comma below are not the same letter shape.
Hard to argue one way or the other without knowing the evidence on
which that statement is based.
> * They may or may not have the same encoding (be it the binary data, the
> Unicode-point, or some hi'er level of encoding.)
No. From a Unicode perspective, if they're different "letter shapes"
(abstract characters), they *must* have different encodings. You can
have multiple ways of encoding a given abstract character, but you
cannot ever have more than one abstract character interpretation of a
given encoding sequence.
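A rough illustration with the C's in question, again in Python with the
standard unicodedata module (the variable names and the nfc helper are
mine): c with cedilla has two canonically equivalent encodings, while c
with comma below is a different sequence that never normalizes to either
of them.

```python
import unicodedata

c_cedilla = "\u00e7"        # LATIN SMALL LETTER C WITH CEDILLA
c_plus_cedilla = "c\u0327"  # c followed by COMBINING CEDILLA
c_plus_comma = "c\u0326"    # c followed by COMBINING COMMA BELOW

def nfc(s):
    """Normalize to the composed form NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", s)

# Two encodings, one abstract character: both normalize to U+00E7.
same_character = nfc(c_plus_cedilla) == nfc(c_cedilla)

# Comma below is not the cedilla, so this stays a distinct sequence.
different_character = nfc(c_plus_comma) != nfc(c_cedilla)

print(same_character, different_character)
```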
> * There is no _universal_ connection between letter shapes and encodings
There is a connection. It is not necessarily a straightforward or
symmetric one, but there is nevertheless a definite connection.
> * The word "character" or "letter" is an umbrella term that can mean either
> the letter shape, the encoding, or its meaning. Or a combination of some of
> these. As I demonstrated before, it cannot be used in a sense that's not
> based on any of these, and equating two different senses makes as much sense
> as equating two different senses of any homophone.
But in the case of the Unicode "abstract character", the basis on the
shape of the letter is only historical. At this point, not only is
"abstract character" a technical term with a technical definition,
but there is a finite list of them, along with exactly which sets of
code point sequences may be used to represent each one.
> >Actually, Cherokee throws in a different problem. The Cherokee alphabet
> >includes all but 4 of the characters of the Roman alphabet, then adds a
> >bunch of characters unique to Cherokee.
There's a different kink there. The Cherokee syllabary was created by
someone with no knowledge of which aspects of the appearance of the
Roman letters were significant, and has generally been treated as if
the exact typeface were important. Besides, really the connection
between, say, Latin R and Tsalagi /e/ is purely coincidental; I would
not say that they're even as close as the Latin/Cyrillic/Greek A's are
to each other. (And of course each member of the syllabary has its
own Unicode value; /e/ is U+13A1).
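If you want to check this yourself, a small Python sketch with the
standard unicodedata module will do. The Unicode charts list CHEROKEE
LETTER E at U+13A1; the variable names below are just mine:

```python
import unicodedata

cherokee_e = "\u13a1"   # CHEROKEE LETTER E, which resembles Latin R
latin_r = "R"

# Look-alike capital A's from three scripts, each with its own code point.
a_lookalikes = ["A", "\u0391", "\u0410"]

for ch in [cherokee_e, latin_r] + a_lookalikes:
    # Print the code point in U+XXXX form next to its official name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Similar shapes, but Unicode never treats them as the same character.
distinct = cherokee_e != latin_r and len(set(a_lookalikes)) == 3
```

The names alone show that the encoding follows the script and the
character's identity, not the shape on the page.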
> > One could describe Icelandic (for
> >example) in the same way, using most of the Latin characters, then adding
> >some unique to that language, pronouncing nearly all of the characters
> >different than say, English. What objective criteria should we use to call
> >Cherokee a separate alphabet, while keeping Icelandic under the Roman
> >alphabet? I think they're clearly separate, I'm just wondering what other
> >people think should be the distinguishing criteria.
Well, if you're wondering what the Unicode Consortium's criteria are,
you might start by reading http://www.unicode.org/notes/tn26/. That
specifically addresses the question of why they adopted Han
unification (same code points for corresponding Chinese-descended
characters in the writing systems of modern Chinese, Korean, and
Japanese) but kept Greek and Cyrillic separate from Latin...
Mark J. Reed <markjreed@...>