NYCPHP Meetup

NYPHP.org

[nycphp-talk] utf-8, iso-8859-1...

Paul A Houle paul at devonianfarm.com
Thu May 6 13:26:02 EDT 2010


Chris Snyder wrote:
> Dirty secret - MySQL latin-1 tables will happily store and retrieve
> utf-8 data. They won't sort it correctly, though I believe they will
> sort it consistently.
>
> So even if your MySQL was compiled without unicode support, you can
> put utf-8 in and get utf-8 out.
>
> Of course, if you're going to take the trouble to convert, you should
> do it right.
>   
    In fact, this is a dirty secret about PHP and the "Unix Way": to a 
large extent, systems that are 8-bit clean will pass UTF-8 data 
through correctly without modification... except when they don't.
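    A minimal sketch of why 8-bit-clean storage gets away with this 
(Python here rather than PHP, just for brevity; the point is about the 
bytes, not the language):

```python
# Latin-1 assigns a character to every byte value 0x00-0xFF, so any
# byte stream - including valid UTF-8 - survives a Latin-1
# decode/encode round trip byte-for-byte. This is the mechanism that
# lets a latin-1 MySQL column "store" UTF-8.
payload = "héllo wörld".encode("utf-8")      # raw UTF-8 bytes
round_tripped = payload.decode("latin-1").encode("latin-1")
assert round_tripped == payload
```

The bytes come back intact; what you lose is everything that needs to 
understand them as characters, e.g. sorting and length semantics.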

    Unfortunately, that's also the case with Perl, Java, .NET and 
other systems that advertise complex "Unicode support". Unicode is 
such a complicated thing that support for it is inevitably implemented 
with errors, and in those systems you're often really stuck, because 
you never get to see the raw byte stream.

    I remember a system where the language and database were chosen 
because the documentation said they "supported Unicode", but in 
practice all kinds of strange transformations were going on behind our 
backs... One day I actually looked at the database in the SQL monitor 
and found the whole thing was double-encoded.
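    That kind of double encoding can usually be reproduced, and 
undone, mechanically; a sketch in Python (again, Python just to keep 
it short):

```python
# Double encoding: UTF-8 bytes get read back as if they were Latin-1
# and then encoded to UTF-8 a second time, e.g. by a charset-confused
# database layer.
original = "Zürich"
double = original.encode("utf-8").decode("latin-1").encode("utf-8")
# The two UTF-8 bytes of "ü" have become four bytes of mojibake:
# b'Z\xc3\x83\xc2\xbcrich'

# Undo it by reversing the accidental steps:
repaired = double.decode("utf-8").encode("latin-1").decode("utf-8")
assert repaired == original
```

This only works cleanly if the data really was double-encoded 
consistently; mixed damage needs inspection of the actual bytes.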

    There's also the issue that there really is no single "Unicode 
sort order" that entirely makes sense. For instance, languages such as 
German and Swedish sort the same characters in different orders. I'm 
currently working on a system that is predominantly English but 
contains many named entities with Latin-derived characters: a sort 
order for "English" might well put the Polish dark L (the l with a 
stroke through it, ł) after Z, but Poles sort dark L after clear L, 
and most English speakers expect that too, since we commonly squash 
dark L -> clear L in words like "Stanislaw." Japanese people sort 
named entities phonetically, which means you need to keep a furigana 
(phonetic) representation side by side with the conventional kanji 
representation... In this age of statistical language translation, I 
think kanji -> furigana conversion could be largely automated, but 
there are always 'words' that can't be read phonetically out of 
context, like

"read"

    The "real" character encoding that you find in web documents is 
most closely described as ISO-Latin-1 and UTF-8 interspersed at 
random, no matter what charset a document officially claims. There are 
just too many cases where characters come in through form fields and 
other sources that aren't well controlled. Yes, ~you~ should publish 
good clean UTF-8, but if you're scraping on a large scale you'll find 
lots of crazy stuff that doesn't quite match what's in the books... 
and it's helpful to look at the byte stream in those cases.
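    When you do have to ingest that mix, one workable trick (sketched 
here in Python via a custom codec error handler) is to decode as UTF-8 
and fall back to Latin-1 only for the byte runs that don't parse:

```python
import codecs

def latin1_fallback(exc):
    """Error handler: interpret undecodable bytes as Latin-1."""
    bad = exc.object[exc.start:exc.end]
    return bad.decode("latin-1"), exc.end

codecs.register_error("latin1_fallback", latin1_fallback)

# A typical scraped mess: one field was Latin-1, the next UTF-8.
mixed = "café ".encode("latin-1") + "naïve".encode("utf-8")
text = mixed.decode("utf-8", errors="latin1_fallback")
assert text == "café naïve"
```

This is a heuristic, not a fix: valid UTF-8 that was *meant* as 
Latin-1 will still decode the wrong way, which is why eyeballing the 
actual bytes remains the court of last resort.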
