[nycphp-talk] utf-8, iso-8859-1...
chsnyder at gmail.com
Thu May 6 13:42:25 EDT 2010
On Thu, May 6, 2010 at 1:26 PM, Paul A Houle <paul at devonianfarm.com> wrote:
> There's also the issue that there really is no "Unicode Sort Order" that
> entirely makes sense. For instance, languages such as German and Swedish
> sort the same characters in a different order. I'm currently working on a
> system that is predominantly English but contains many named entity names
> with latinoid characters: the sort order for "English" might well sort the
> Polish "Dark L" (the l with a line through it) after Z, but Poles sort
> "Dark L" after "Clear L" and most en-speakers will expect that too, since
> we commonly squash Dark L -> Clear L in words like "Stanislaw." Japanese
> people sort named entities phonetically, which means you need to keep a
> furigana (phonetic) representation side by side with the conventional kanji
> representation... In this age of statistical language translation, I think
> kanji -> furigana translation could be largely automated, but there are
> always 'words' that can't be read phonetically out of context.
Oh great, locale-specific sorting. To what extent are there unix tools
to help you deal with that?
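
For what it's worth, sort(1) and the C library's strcoll()/strxfrm() both honor LC_COLLATE, so on a box with the right locale installed you get that language's order for free. Below is a rough Python sketch of the two behaviours Paul describes: squashing "Dark L" to "Clear L" for English readers, and deferring to an installed locale. The names FOLD, folded_key and locale_sorted are made up for illustration, and the locale name you pass in (for example "pl_PL.UTF-8") is an assumption about what your system provides.

```python
import locale
import unicodedata

# Hypothetical fold table: the Polish l-with-stroke has no Unicode
# decomposition, so it has to be mapped to a plain l by hand.
FOLD = str.maketrans({"ł": "l", "Ł": "L"})


def folded_key(name):
    """Sort key that squashes accented letters onto their base letters."""
    decomposed = unicodedata.normalize("NFD", name.translate(FOLD))
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original string breaks ties so the order is deterministic.
    return (base.casefold(), name)


def locale_sorted(names, locale_name):
    """Sort with the C library's collation, or return None if the
    requested locale is not installed on this machine."""
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(names, key=locale.strxfrm)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


names = ["Zawadzki", "Łukasz", "Lem", "Stanisław", "Stanislaw"]
print(sorted(names))                  # code point order: Ł lands after Z
print(sorted(names, key=folded_key))  # Ł sorts alongside L
```

The folded key is the "what most en-speakers expect" order, not Polish order; for the latter you would want locale_sorted(names, "pl_PL.UTF-8") or a proper collation library, and it does nothing for the kanji/furigana problem.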
When it comes to internationalization, it's always something. At least
with utf-8 the encoding can be consistent if you control the inputs
(i.e., you're not scraping or accepting mishmash from other systems).
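
When you don't fully control the inputs, one common approach is to normalize everything to UTF-8 at the boundary. A minimal sketch, assuming the only other encoding you ever see is iso-8859-1 (that fallback is a guess, and to_utf8_text is a made-up name):

```python
def to_utf8_text(raw: bytes) -> str:
    """Decode incoming bytes as UTF-8, falling back to ISO-8859-1.

    ISO-8859-1 maps every byte to a character, so the fallback never
    fails; that also means it cannot tell you when the guess is wrong.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        return raw.decode("iso-8859-1")


print(to_utf8_text("Stanisław".encode("utf-8")))
print(to_utf8_text(b"caf\xe9"))
```

Trying strict UTF-8 first works because iso-8859-1 text with accented characters is rarely valid UTF-8 by accident.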