[nycphp-talk] Character set issues revisited

Cliff Hirsch cliff at pinestream.com
Fri Oct 19 15:44:05 EDT 2007


On 10/19/07 2:29 PM, "Michael B Allen" <ioplex at gmail.com> wrote:

> On 10/19/07, Cliff Hirsch <cliff at pinestream.com> wrote:
>> 
>>  There was recently a thread about some character set problem. I just found
>> a similar issue. I just transferred a site from a Windows XP dev. platform
>> to rhel. Everything looks fine except for a few special characters.
>> 
>>  Windows   -> rhel
>>  it's           -> it?s
>>  ‹            -> ? (should be the long dash, an em I think)
>>  'blahblah' -> ?blahblah?
>>  "                  -> ?
> 
> Hey Cliff,
> 
> That's actually not a character encoding issue. The '?' or an empty
> box is commonly displayed whenever a glyph associated with a character
> value is not available. Meaning the client doesn't have the necessary
> font. Also meaning, whatever editor was used to input those single
> quotes didn't input the more common ASCII single quote character value
> of 0x27. If you hexdump that content you'll see it's something else
> (it will probably be a multibyte UTF-8 sequence which when decoded
> will give you a Unicode value that you can look up in Adobe's glyph
> tables).
> 
> This is the sort of thing that happens when you create some content
> with a word processor and then copy and paste it into the web page.
> 
> The way to fix this problem is to just seek and destroy all of those
> characters and replace them with their more common equivalent values
> (e.g. the single quote 0x27 ASCII value).
> 
> Or you could install whatever wacked out font that has that character
> on every client that will ever visit the page but that's probably not
> the more desirable solution.
> 
>>  In phpMyAdmin I see: can't
>>  In my app, I see: can?t
>>  So phpMyAdmin is displaying things correctly on either platform.
> 
> That's odd. Maybe phpMyAdmin is doing some transliteration.
> 
>>  Where should I start looking? What is the best charset to use anyway?
>> Iso-8859-1 or utf-8?
> 
> Look at the page with hexdump to verify what the encoding is and
> what the Unicode value of one of the errant characters really is. Then
> you can start to figure out where things went wrong.
> 
> Mike
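
The two steps Mike suggests (look at the raw bytes to identify the errant
characters, then replace them with their ASCII equivalents) can be sketched
roughly as follows. This is an illustration in Python, not something from the
thread: the function names and the replacement table are hypothetical, and it
assumes the content really is UTF-8, which is exactly what the hexdump is
meant to confirm.

```python
import unicodedata

# Hypothetical replacement table, limited to the kinds of characters
# mentioned in the thread (curly quotes and long dashes).
ASCII_EQUIVALENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
}


def describe_non_ascii(data: bytes, encoding: str = "utf-8"):
    """List (code point, hex bytes, Unicode name) for each non-ASCII character.

    Raises UnicodeDecodeError if the bytes are not valid in `encoding`,
    which is itself a useful hint that the assumed encoding is wrong.
    """
    report = []
    for ch in data.decode(encoding):
        if ord(ch) > 0x7F:
            report.append((
                f"U+{ord(ch):04X}",
                ch.encode(encoding).hex(),
                unicodedata.name(ch, "UNKNOWN"),
            ))
    return report


def to_ascii_punctuation(text: str) -> str:
    """Replace the typographic characters in the table with ASCII ones."""
    return text.translate(str.maketrans(ASCII_EQUIVALENTS))


if __name__ == "__main__":
    sample = "it\u2019s".encode("utf-8")
    print(describe_non_ascii(sample))
    print(to_ascii_punctuation(sample.decode("utf-8")))
```

Replacing characters this way loses the typographic quotes, so it only makes
sense if plain ASCII punctuation is acceptable for the site.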

Mike:

Thanks. This is helpful. Here's another interesting puzzle. Why does the
page info in Firefox say encoding: UTF-8 while the Content-Type is
charset=iso-8859-1?

Ah, I think I see it. The encoding is how the page was saved. And as usual,
Microsoft butchers everything.

But this is PHP -- the page is dynamically generated. So is the encoding
picked up from my PHP script, index.php, or the template file index.tpl?
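
To make the mismatch concrete, here is a small sketch in Python (not from the
thread) of how one curly apostrophe looks as bytes under two encodings, and
what a strict decoder produces when the declared charset does not match. It
assumes the text was saved as either Windows-1252 or UTF-8; the thread does
not say which. The function names are hypothetical.

```python
APOSTROPHE = "\u2019"  # RIGHT SINGLE QUOTATION MARK, as in a curly "it's"


def bytes_as_saved(text: str, saved_as: str) -> bytes:
    """Bytes that end up in the file or database if text is stored as `saved_as`."""
    return text.encode(saved_as)


def decode_as_declared(data: bytes, declared_charset: str) -> str:
    """Decode bytes strictly according to a declared charset.

    Undecodable bytes become U+FFFD. Note that real browsers commonly
    treat an iso-8859-1 label as windows-1252, so this is stricter than
    what a browser would actually do.
    """
    return data.decode(declared_charset, errors="replace")


if __name__ == "__main__":
    for saved_as in ("cp1252", "utf-8"):
        raw = bytes_as_saved(APOSTROPHE, saved_as)
        for declared in ("iso-8859-1", "utf-8"):
            shown = decode_as_declared(raw, declared)
            print(f"saved as {saved_as} ({raw.hex()}), "
                  f"declared {declared}: {ascii(shown)}")
```

Only the combination where the declared charset matches the stored bytes
gives back the original character. Strict iso-8859-1 has no code for this
character at all, so something in the chain has to be UTF-8 or Windows-1252
for it to survive.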




