Re: Sorting PIE with IPA

From: Benct Philip Jonsson <conlang@...>
Date: Tuesday, May 22, 2007, 9:04
On 22.5.2007 Paul Bennett wrote:
 > Should I list the PIE symbols as if they were variants of
 > the IPA glyphs they're similar to, and order them in the
 > same fashion as the other "variant" glyphs in the
 > document? (I call this the "visual way").
 >

I think you should, for several reasons:

- People expect sorted word lists to be sorted by symbol
   rather than by sound.
- There is still quite some uncertainty about phonetic
   values of PIE phonemes, so a reference work should IMHO
   play it safe.
- Most importantly, if you start from a certain daughter
   language and try to guess at the PIE form of a word you
   may not be able to guess which PIE guttural it contains,
   so it makes sense to have all of *k *ky *kw sorted as *k.
- In fact I think that it would be best to use bi-level
   lexicographic sorting, so that all symbols derived
   from/similar to/equivalent to a certain Latin letter
   should sort as that Latin letter, since that would
   free readers from remembering a modified/extended
   sort order most of the time. See
   <http://cpan.uwinnipeg.ca/dist/Sort-ArbBiLex> /
   <http://cpan.uwinnipeg.ca/htdocs/Sort-ArbBiLex/Sort/ArbBiLex.html>
   for both the tool and the concept!

|> CONCEPTS
|>
|> Writing systems for different languages usually have
|> specific sort orders for the glyphs (characters, or
|> clusters of characters) that each writing system uses.
|> For well-known national languages, these different sort
|> orders (or someone's idea of them) are formalized in the
|> locale for each such language, on operating system
|> flavors that support locales. However, there are problems
|> with locales; cf. perllocale. Chief among the problems
|> relevant here are:
|>
|> * The basic concept of ``locale'' conflates
|>   language/dialect, writing system, and character set --
|>   and country/region, to a certain extent. This may be
|>   inappropriate for the text you want to sort. Notably,
|>   this assumes standardization where none may exist
|>   (what's THE sort order for a language that has five
|>   different Roman-letter-based writing systems in use?).
|>
|> * On many OS flavors, there is no locale support.
|>
|> * Even on many OS flavors that do support locales, the
|>   user cannot create his own locales as needed.
|>
|> * The ``scope'' of a locale may not be what the user
|>   wants -- if you want, in a single program, to sort the
|>   array @foo by one locale, and an array @bar by another
|>   locale, this may prove difficult or impossible.
|>
|> In other words, locales (even if available) may not sort
|> the way you want, and are not portable in any case.
|>
|> This module is meant to provide an alternative to locale-
|> based sorting.
|>
|> This module makes functions for you that implement bi-
|> level lexicographic sorting according to a sort order you
|> specify. ``Lexicographic sorting'' means comparing the
|> letters (or properly, ``glyphs'', as I'll call them here,
|> when a single glyph can encompass several letters, as
|> with digraphs) in strings, starting from the start of the
|> string (so that ``apple'' comes after ``apoplexy'', say)
|> -- as opposed to, say, sorting by numeric value.
|> ``Lexicographic sorting'' is sometimes used to mean just
|> ``ASCIIbetical sorting'', but I use it to mean the sort
|> order used by lexicographers, in dictionaries (at least
|> for alphabetic languages).
|>
|> Consider the words ``resume'' and ``résumé'' (the
|> latter should display on your POD viewer with acute
|> accents on the e's). If you declare a sort order such
|> that e-acute (``é'') is a letter after e (no accent),
|> then ``résumé'' (with accents) would sort after every
|> word starting with ``re'' (no accent) -- so ``résumé''
|> (with accents) would come after ``reward''.
|>
|> If, however, you treated e (no accent) and e-acute as the
|> same letter, the ordering of ``resume'' and ``résumé''
|> (with accents) would be unpredictable, since they would
|> count as the same thing -- whereas ``resume'' should
|> always come before ``résumé'' (with accents) in English
|> dictionaries.
|>
|> What bi-level lexicographic sorting means is that you can
|> stipulate that two letters like e (no accent) and e-acute
|> (``é'') generally count as the same letter (so that they
|> both sort before ``reward''), but that when there's a tie
|> based on comparison that way (like the tie between
|> ``resume'' and ``résumé'' (with accents)), the tie is
|> broken by a stipulation that at a second level, e (no
|> accent) does come before e-acute (``é'').
|>
|> (Some systems of sort order description allow for any
|> number of levels in sort orders -- but I can't imagine a
|> case where this gets you anything over a two-level sort.)
|>
|> Moreover, the units of sorting for a writing system may
|> not be characters exactly. In some forms of Spanish, ch,
|> while two characters, counts as one glyph -- a ``letter''
|> after c (at the first level, not just the second, like
|> the e in the paragraph above). So ``cuerno'' comes before
|> ``chile''. A character-based sort would not be able to
|> see that ``ch'' should count as anything but ``c'' and
|> ``h''. So this library doesn't assume that the units of
|> comparison are necessarily individual characters.
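
To make the quoted concept concrete, here is a minimal Python sketch of
bi-level sorting with multi-character glyphs. It is not the interface of
Sort::ArbBiLex (that module is Perl); the name make_bilevel_key, the
family lists and the sample words are all assumptions for illustration,
and the PIE strings are only illustrative spellings in the k/ky/kw
notation used above, not a claim about actual reconstructions.

```python
import re
import unicodedata


def make_bilevel_key(families):
    """Build a sort key from a list of glyph families.

    Each family is a list of glyphs that tie at the first level;
    their order inside the family is the second-level tie-breaker.
    """
    rank = {}
    for major, family in enumerate(families):
        for minor, glyph in enumerate(family):
            rank[unicodedata.normalize("NFC", glyph)] = (major, minor)

    # Try longer glyphs first, so that "ch" wins over "c" followed by "h".
    longest_first = sorted(rank, key=len, reverse=True)
    glyph_re = re.compile("|".join(re.escape(g) for g in longest_first))

    def key(word):
        # Assumption of this sketch: characters not in any family are ignored.
        glyphs = glyph_re.findall(unicodedata.normalize("NFC", word))
        majors = tuple(rank[g][0] for g in glyphs)  # first-level comparison
        minors = tuple(rank[g][1] for g in glyphs)  # used only to break ties
        return (majors, minors)

    return key


ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# English-style order: e and é tie at the first level, e before é at the second.
english = [["e", "é"] if c == "e" else [c] for c in ALPHABET]
english_words = sorted(["reward", "résumé", "resume"], key=make_bilevel_key(english))

# Traditional Spanish-style order: "ch" is its own first-level glyph after "c".
spanish = [[c] for c in "abc"] + [["ch"]] + [[c] for c in ALPHABET[3:]]
spanish_words = sorted(["chile", "cuerno"], key=make_bilevel_key(spanish))

# PIE-style order in this thread's ASCII notation: k, ky and kw all file under k.
pie = [["k", "ky", "kw"] if c == "k" else [c] for c in ALPHABET]
pie_words = sorted(["ker", "kyerd", "kwe"], key=make_bilevel_key(pie))
```

With these definitions english_words comes out as resume, résumé, reward;
spanish_words as cuerno, chile; and pie_words as kwe, ker, kyerd, whereas
a plain codepoint sort would put ker before kwe. One caveat of the
longest-match choice: a k followed by a separate y or w is always read as
the digraph, so a real notation would need an unambiguous way to write
the two apart.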


--

/BP 8^)
--
   B.Philip Jonsson mailto:melrochX@melroch.se (delete X)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Truth, Sir, is a cow which will give [skeptics] no more milk,
and so they are gone to milk the bull."
                                     -- Sam. Johnson (no rel. ;)