NYCPHP Meetup

NYPHP.org

[nycphp-talk] MySQL 4 vs. 5 and character encoding (?)

Mikko Rantalainen mikko.rantalainen at peda.net
Thu May 11 04:26:54 EDT 2006


Rahmin Pavlovic wrote:
> On 5/10/06 11:53 AM, "Daniel Convissor" <danielc at analysisandsolutions.com>
> wrote:
> 
>> On Wed, May 10, 2006 at 10:57:53AM -0400, Rahmin Pavlovic wrote:
>>> I already tried setting the charset / collation to utf-8 -- both on tables /
>>> fields -- but resaving data produces the same mangled results.
>> How did you do that?  ALTER TABLE?  If so, that's too late.
>>
>> Open up the SQL script created by mysqldump.  What's the default charset
>> defined in the CREATE TABLE statements?
>>
> 
> latin1_swedish_ci.  When I ALTER TABLE, it successfully updates to
> utf8_general_ci.  Why is it too late?  Do I need to re-dump?

Since you say that you changed it to utf8, I guess the original data 
was supposed to be with UTF-8 character set, but because MySQL5 
dumped it with latin1_swedish_ci you never did tell MySQL5 about the 
fact (this happens pretty much automatically if you upgrade from 
MySQL4 which pretty much didn't have a clue about UTF-8).

However, MySQL5 in its infinite wishdom replaces invalid characters 
in the mysqldump output with a literal "?" (U+003F) instead of 
leaving those bytes as is. Latin1 doesn't define byte range 128-159 
so MySQL5 replaces those bytes with literal "?" if you use mysqldump 
without correct settings (UTF-8 encoded strings may include bytes in 
range 128-159 and if you replace those bytes with a literal "?" 
you'll end up with invalid UTF-8 encoding). You will not notice this 
problem if you just do INSERT, UPDATE and SELECT from some other 
application because this modification doesn't happen in that case.

In short, yes, you have to re-dump from MySQL5. Hopefully you have 
the binary table data still available.

-- 
Mikko




More information about the talk mailing list