[nycphp-talk] utf-8, iso-8859-1...

Chris Snyder chsnyder at gmail.com
Thu May 6 13:42:25 EDT 2010


On Thu, May 6, 2010 at 1:26 PM, Paul A Houle <paul at devonianfarm.com> wrote:

>   There's also the issue that there really is no "Unicode Sort Order" that
> entirely makes sense.  For instance,  languages such as German and Swedish
> sort the same characters in a different order.  I'm currently working on a
> system that is predominantly English but contains many named entity names
> with latinoid characters:  the sort order for "English" might well sort the
> Polish "Dark L" (the l with a line through it) after Z,  but Poles sort
> "Dark L" after "Clear L" and most en-speakers will expect that too,  since
> we commonly squash Dark L -> Clear L in words like "Stanislaw."  Japanese
> people sort named entities phonetically,  which means you need to keep a
> furigana (phonetic) representation side by side with the conventional kanji
> representation...  In this age of statistical language translation,  I think
> kanji -> furigana translation could be largely automated,  but there are
> always 'words' that can't be read phonetically out of context.

Oh great, locale-specific sorting. To what extent are there Unix tools
to help you deal with that?
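
To make the "Dark L after Z" problem concrete, here is a minimal sketch
in Python. The name list and the squash_key helper are made up for
illustration. It only shows the "squash Dark L -> Clear L" behaviour
mentioned above; it is not real Polish collation, where Ł is a separate
letter that follows L. Proper locale rules would have to come from a
collation library such as ICU.

```python
# Illustrative data only; these names are not from any real system.
names = ["Zawadzki", "Łukasz", "Lublin", "Stanisław", "Stanislaw"]

# Plain sorted() compares Unicode code points, so "Ł" (U+0141) lands after "Z".
codepoint_order = sorted(names)

# Toy tailoring (an assumption, not a standard): treat "ł" as "l" for the
# main comparison. NFD normalization would not help here, because "ł" has
# no canonical decomposition into "l" plus a combining mark.
FOLD = str.maketrans({"Ł": "L", "ł": "l"})

def squash_key(name):
    # The original string is the tie-breaker, so "Stanislaw" sorts
    # just before "Stanisław" instead of in arbitrary order.
    return (name.translate(FOLD).casefold(), name)

squashed_order = sorted(names, key=squash_key)

print(codepoint_order)
print(squashed_order)
```

With the first ordering "Łukasz" ends up last, after "Zawadzki"; with
the squashed key it sits between "Lublin" and "Stanislaw", which is
roughly what an English-speaking reader would expect.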

When it comes to internationalization, it's always something. At least
with UTF-8 the encoding can be consistent if you control the inputs
(i.e., you're not scraping or accepting a mishmash from other systems).
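
If you don't fully control the inputs, one option is to check them at
the boundary. A rough sketch in Python, assuming you would rather
reject bad bytes than guess at their encoding (decode_utf8_or_none is a
hypothetical helper name):

```python
def decode_utf8_or_none(raw):
    """Return the decoded text, or None if raw is not well-formed UTF-8."""
    try:
        # bytes.decode() uses strict error handling by default, so any
        # malformed sequence raises UnicodeDecodeError.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

good = "café".encode("utf-8")      # b'caf\xc3\xa9'
bad = "café".encode("iso-8859-1")  # b'caf\xe9', a truncated sequence in UTF-8

print(decode_utf8_or_none(good))
print(decode_utf8_or_none(bad))
```

This only tells you that the bytes are valid UTF-8, not that they were
meant as UTF-8. Many ISO-8859-1 strings with accented characters fail
the check, as in this example, but pure ASCII text passes either way.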


