[nycphp-talk] Character set issues revisited

John Campbell jcampbell1 at gmail.com
Sun Oct 21 19:21:44 EDT 2007


> > And by '?' do you mean '?' or '�' ... there is a big difference.
> What is the difference? I have seen both?

'?' is typically caused by storing the data in MySQL as UTF-8 and
then reading it back over a connection whose character set is 8859-1
(latin1), which is typically the default unless the client asks for
something else.  It looks like a magical undocumented feature of the
PHP MySQL drivers, but it is a silent conversion.  Thus if you have a
character that is not in 8859-1 (e.g. a Korean character), MySQL will
convert it to a '?'.  A '�' is generated by the browser when it is
trying to render the page as UTF-8 and comes across an invalid byte
sequence.
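
Both symptoms can be reproduced outside of PHP and MySQL.  The sketch
below is in Python purely for illustration and only assumes that the
lossy conversion behaves like a standard "replace unmappable
characters" step; the variable names are made up for this example.

```python
# Symptom 1: a character outside ISO-8859-1 is converted lossily.
korean = "\ud55c"  # Hangul syllable "han", not representable in ISO-8859-1
lossy = korean.encode("iso-8859-1", errors="replace")
print(lossy)  # a single plain question mark byte

# Symptom 2: ISO-8859-1 bytes are rendered as if they were UTF-8.
latin1_bytes = "\u00e0".encode("iso-8859-1")  # "a grave" is the single byte 0xE0
rendered = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(rendered))  # U+FFFD, the replacement character
```

The first case loses the data for good (the '?' is a real question
mark from then on); the second case is usually just a mislabeled
encoding and the original bytes are still intact.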

The first thing to understand about character encoding is the overlap
between UTF-8 and 8859-1.  Below is a sample:
a - lower case a (same single byte in 8859-1 and UTF-8)
à - a grave (in both 8859-1 and UTF-8, but encoded as different bytes)
賜 - Chinese character (not in 8859-1; available in UTF-8)
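
To make the overlap concrete, here is a small sketch (Python, for
illustration only) that prints the bytes of each sample character in
both encodings.  The helper name byte_view is hypothetical.

```python
def byte_view(ch):
    """Return (iso-8859-1 hex, utf-8 hex); the first is None if unencodable."""
    utf8 = ch.encode("utf-8").hex(" ")
    try:
        latin1 = ch.encode("iso-8859-1").hex(" ")
    except UnicodeEncodeError:
        latin1 = None  # the character does not exist in ISO-8859-1
    return latin1, utf8

# lower case a, a grave, and the Chinese character from the sample
for ch in ["a", "\u00e0", "\u8cdc"]:
    print(ascii(ch), byte_view(ch))
```

The lower case a comes out as the same single byte in both encodings,
a grave is one byte in 8859-1 but two bytes in UTF-8, and the Chinese
character has no 8859-1 form at all.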

These days, you should really do everything in UTF-8.  There was a lot
of talk about PHP not being UTF-8 safe, but it is largely nonsense;
most of the problems arise because developers don't think about other
languages.  I personally never see the need for byte-oriented
functions like substr, and I don't use regular expressions like
[a-zA-Z0-9].
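
As a rough illustration of why those two habits bite (in Python, with
a byte slice standing in for a byte-oriented substr; the names are
invented for the example):

```python
import re

name = "caf\u00e9"         # "cafe" with an acute accent on the e
raw = name.encode("utf-8")  # 5 bytes: the accented e takes two

# A byte-wise cut at 4 splits the two-byte character in half.
cut_text = raw[:4].decode("utf-8", errors="replace")
print(ascii(cut_text))

# An ASCII-only character class rejects the accented letter;
# a Unicode-aware word class accepts it.
ascii_match = re.fullmatch(r"[a-zA-Z0-9]+", name)
unicode_match = re.fullmatch(r"\w+", name)
print(ascii_match is None, unicode_match is not None)
```

The cut string ends in the replacement character, which is exactly
the kind of invalid byte sequence a browser shows as '�'.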

One other piece of advice: don't ever use that stupid meta tag to
specify the content encoding.  It makes no sense to specify the
encoding of the content in the content itself.  The encoding should
be specified in the HTTP Content-Type header and only in the header.
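
A minimal sketch of declaring the encoding in the header, written as
a tiny WSGI app in Python for illustration (in PHP the equivalent
would be a header('Content-Type: text/html; charset=utf-8') call
before any output).  The app and fake_start_response names are made
up for this example.

```python
def app(environ, start_response):
    """Serve a UTF-8 page and declare the encoding in the HTTP header."""
    body = "<p>\u8cdc</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),  # encoding lives here
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]

# Call the app directly, without a server, to inspect what it would send.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

chunks = app({}, fake_start_response)
print(captured["headers"]["Content-Type"])
```

Note that the body contains no meta tag at all; the browser learns
the encoding before it reads a single byte of content.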

Regards,
John Campbell

