[nycphp-talk] utf-8, iso-8859-1...

David Mintz david at davidmintz.org
Thu May 6 11:46:03 EDT 2010


I don't really have a good understanding of issues around character sets,
encoding, what have you, though I am starting to work on it.

My problem involves a MySQL database and accented characters such as those
you find in Spanish and French. My web server sends a "content-type:
text/html; charset=iso-8859-1" header and my docs have an equivalent meta
tag. My MySQL config says

default-character-set = latin1
character_set_server = latin1
collation_server     = latin1_general_ci

and my data tables "SHOW CREATE" typically look like

CREATE TABLE `people` (
  `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `lastname` varchar(40) COLLATE latin1_general_ci NOT NULL,
  `firstname` varchar(40) COLLATE latin1_general_ci NOT NULL,
  /* etc */
) ENGINE=MyISAM AUTO_INCREMENT=546 DEFAULT CHARSET=latin1
COLLATE=latin1_general_ci

So what's the problem? Generally there is none. Characters like ó and ñ
render correctly. The snag I am hitting now is writing a regular expression
to whitelist the characters I can accept in proper names. I would think that
the regex

      /^[-a-zA-Z\xC0-\xFF ']+$/

would accept only "letters" from most Western European languages, plus
hyphens, spaces, and apostrophes. But my check reports an illegal character
in the name Barceló, which I would expect to pass.
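
One possible explanation, offered only as an assumption since the post does not confirm how the string reaches the regex: the name may arrive as UTF-8 bytes rather than ISO-8859-1. A minimal sketch of the difference, using a hypothetical helper isAllowedName() that wraps the regex above and applies it byte by byte (no /u modifier):

```php
<?php
// Hypothetical helper (not from the original post): wraps the whitelist regex.
// Without the /u modifier, PCRE compares the subject one byte at a time.
function isAllowedName($name)
{
    return preg_match('/^[-a-zA-Z\xC0-\xFF \']+$/', $name) === 1;
}

// "Barceló" in ISO-8859-1: ó is the single byte 0xF3.
$latin1 = "Barcel\xF3";

// "Barceló" in UTF-8: ó is the two-byte sequence 0xC3 0xB3.
$utf8 = "Barcel\xC3\xB3";

var_dump(isAllowedName($latin1)); // bool(true)
var_dump(isAllowedName($utf8));   // bool(false): 0xB3 is outside \xC0-\xFF

// Dumping the bytes of the real input shows which encoding it is in.
echo bin2hex($utf8), "\n";        // 42617263656cc3b3
```

If bin2hex() on the actual input shows c3b3 where f3 was expected, the data is UTF-8 at that point. Separately, note that in ISO-8859-1 the range \xC0-\xFF also includes × (0xD7) and ÷ (0xF7), which are not letters.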

Would this regex work if the data were utf-8? Should I consider converting
everything and working in utf-8, and if so, how painful is it to convert a
MySQL database? My initial research suggests that it isn't painless.
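
On the UTF-8 question: the byte range \xC0-\xFF would not carry over as written, because accented letters become multi-byte sequences in UTF-8. A minimal sketch of one alternative, assuming the input is valid UTF-8 and PCRE was built with Unicode property support (the helper name isAllowedNameUtf8() is hypothetical):

```php
<?php
// Hypothetical helper (not from the original post).
// /u tells PCRE to treat pattern and subject as UTF-8;
// \p{L} matches any Unicode letter, accented or not.
function isAllowedNameUtf8($name)
{
    // preg_match() returns false (not 0) when the subject is not valid UTF-8,
    // so the strict comparison rejects malformed input as well.
    return preg_match('/^[-\p{L} \']+$/u', $name) === 1;
}

var_dump(isAllowedNameUtf8("Barcel\xC3\xB3")); // bool(true): valid UTF-8 "Barceló"
var_dump(isAllowedNameUtf8("Barcel\xF3"));     // bool(false): ISO-8859-1 byte, invalid as UTF-8
var_dump(isAllowedNameUtf8("Mary-Jane O'Neil")); // bool(true)
```

\p{L} is broader than the original whitelist: it accepts letters from any script, not only Western European ones, so it is a design choice rather than a drop-in equivalent.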

-- 
Support real health care reform:
http://phimg.org/

--
David Mintz
http://davidmintz.org/