[nycphp-talk] utf-8, iso-8859-1...

Chris Snyder chsnyder at gmail.com
Thu May 6 13:42:25 EDT 2010


On Thu, May 6, 2010 at 1:26 PM, Paul A Houle <paul at devonianfarm.com> wrote:

>   There's also the issue that there really is no "Unicode Sort Order" that
> entirely makes sense.  For instance,  languages such as German and Swedish
> sort the same characters in a different order.  I'm currently working on a
> system that is predominantly English but contains many named entity names
> with latinoid characters:  the sort order for "English" might well sort the
> Polish "Dark L" (the l with a line through it) after Z,  but Poles sort
> "Dark L" after "Clear L" and most en-speakers will expect that too,  since
> we commonly squash Dark L -> Clear L in words like "Stanislaw."  Japanese
> people sort named entities phonetically,  which means you need to keep a
> furigana (phonetic) representation side by side with the conventional kanji
> representation...  In this age of statistical language translation,  I think
> kanji -> furigana translation could be largely automated,  but there are
> always 'words' that can't be read phonetically out of context.

Oh great, locale-specific sorting. To what extent are there unix tools
to help you deal with that?
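
To make the "Dark L" example concrete, here is a minimal sketch in Python (not something from the thread) that compares a plain code point sort with a hand-rolled Polish ordering. The alphabet string, the `polish_key` helper, and the sample names are all assumptions made up for illustration; a real system would more likely lean on the platform's collation support (for example `LC_COLLATE`, which GNU `sort` honors) or on ICU than on a table like this.

```python
# Illustrative only: a hand-written Polish alphabet order (lowercase).
# This table is an assumption for the sketch, not a complete collation.
POLISH_ORDER = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ORDER)}


def polish_key(word):
    """Build a sort key that ranks each letter by its place in POLISH_ORDER.

    Letters missing from the table sort after all known letters,
    ordered among themselves by code point.
    """
    return [(RANK.get(ch, len(POLISH_ORDER)), ch) for ch in word.lower()]


names = ["Zofia", "Łukasz", "Lech", "Marek"]

# Default sort compares code points, so "Ł" (U+0141) lands after "Z".
by_codepoint = sorted(names)

# With the custom key, "Ł" sorts right after "L", as Poles expect.
by_polish = sorted(names, key=polish_key)

print(by_codepoint)
print(by_polish)
```

The point of the sketch is only that the order comes from a per-language table rather than from the encoding, which is why German and Swedish can disagree about the same characters.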

When it comes to internationalization, it's always something. At least
with utf-8 the encoding can be consistent if you control the inputs
(i.e., you're not scraping or accepting a mishmash from other systems).
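
As a rough sketch of what "controlling the inputs" could look like, the Python below rejects bytes that are not valid utf-8 at the boundary instead of storing them. The `is_valid_utf8` helper and the sample strings are hypothetical names chosen for the example; what to do with rejected input (reject, transcode, log) is left open.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as utf-8, False otherwise."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded two ways, as it might arrive from different systems.
good = "café".encode("utf-8")        # b'caf\xc3\xa9'
bad = "café".encode("iso-8859-1")    # b'caf\xe9', not valid utf-8

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```

Note that the check only goes one way: almost any byte string decodes as iso-8859-1, so validation can confirm that input is well-formed utf-8 but cannot prove what encoding a mislabeled input really was.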


