[nycphp-talk] Character set issues revisited

Cliff Hirsch cliff at pinestream.com
Fri Oct 19 15:44:05 EDT 2007


On 10/19/07 2:29 PM, "Michael B Allen" <ioplex at gmail.com> wrote:

> On 10/19/07, Cliff Hirsch <cliff at pinestream.com> wrote:
>> 
>>  There was recently a thread about some character set problem. I just found
>> a similar issue. I just transferred a site from a Windows XP dev. platform
>> to rhel. Everything looks fine except for a few special characters.
>> 
>>  Windows   -> rhel
>>  it's           -> it?s
>>  ‹            -> ? (should be the long dash, an em I think)
>>  'blahblah' -> ?blahblah?
>>  "                  -> ?
> 
> Hey Cliff,
> 
> That's actually not a character encoding issue. The '?' or an empty
> box is commonly displayed whenever a glyph associated with a character
> value is not available. Meaning the client doesn't have the necessary
> font. Also meaning, whatever editor was used to input those single
> quotes didn't input the more common ASCII single quote character value
> of 0x27. If you hexdump that content you'll see it's something else
> (it will probably be a multibyte UTF-8 sequence which, when decoded,
> will give you a Unicode value that you can look up in Adobe's glyph
> tables).
> 
> This is the sort of thing that happens when you create some content
> with a word processor and then copy and paste it into the web page.
> 
> The way to fix this problem is to just seek and destroy all of those
> characters and replace them with their more common equivalent values
> (e.g. the single quote 0x27 ASCII value).
> 
> Or you could install whatever wacked out font that has that character
> on every client that will ever visit the page but that's probably not
> the more desirable solution.
> 
>>  In phpMyAdmin I see: can't
>>  In my app, I see: can?t
>>  So phpMyAdmin is displaying things correctly on either platform.
> 
> That's odd. Maybe phpMyAdmin is doing some transliteration.
> 
>>  Where should I start looking? What is the best charset to use anyway?
>> Iso-8859-1 or utf-8?
> 
> Look at the page with hexdump to verify what the encoding is and
> what the Unicode value of one of the errant characters really is. Then
> you can start to figure out where things went wrong.
> 
> Mike
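
The hexdump-and-replace advice above can be sketched roughly as follows.
This is Python rather than PHP, chosen only for brevity. The function
names describe_non_ascii and to_ascii_punctuation and the replacement
table are assumptions made for illustration, not anything from the
thread, and the table covers only the kinds of characters listed above.

```python
import unicodedata

# Hypothetical mapping: typographic characters a word processor tends to
# insert, and the plain ASCII stand-ins to replace them with.
ASCII_EQUIVALENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
}


def describe_non_ascii(raw, encoding="utf-8"):
    """List (char, code point, Unicode name, hex bytes) per non-ASCII char.

    Assumes `raw` really is in `encoding`; decoding raises otherwise.
    """
    report = []
    for ch in raw.decode(encoding):
        if ord(ch) > 0x7F:
            report.append((
                ch,
                "U+%04X" % ord(ch),
                unicodedata.name(ch, "UNKNOWN"),
                ch.encode(encoding).hex(" "),  # needs Python 3.8+
            ))
    return report


def to_ascii_punctuation(text):
    """Replace the characters in ASCII_EQUIVALENTS with ASCII stand-ins."""
    return text.translate(str.maketrans(ASCII_EQUIVALENTS))


if __name__ == "__main__":
    sample = "can\u2019t".encode("utf-8")
    for row in describe_non_ascii(sample):
        print(row)
    print(to_ascii_punctuation(sample.decode("utf-8")))
```

The report shows which bytes are actually in the content, which is the
part worth checking before replacing anything.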

Mike:

Thanks. This is helpful. Here's another interesting puzzle. Why does the
page info in Firefox say encoding: UTF-8 while the Content-Type is
charset=iso-8859-1?

Ah, I think I see it. The encoding is how the page was saved. And as usual,
Microsoft butchers everything.

But this is PHP -- the page is dynamically generated. So is the encoding
picked up from my PHP script, index.php, or the template file index.tpl?
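
A rough way to see how a declared charset and the actual bytes can
disagree: the sketch below (Python again, for brevity) encodes the same
text two ways and decodes it the way a client trusting a given charset
label would. The helper name render_as is hypothetical, and the
assumption that the stored bytes are either UTF-8 or Windows-1252 is
only a guess at what a Windows XP editor might have produced.

```python
def render_as(raw, declared_charset):
    """Decode bytes as a client trusting `declared_charset` would.

    Bytes that are invalid in that charset become U+FFFD, which browsers
    often draw as a question mark or an empty box.
    """
    return raw.decode(declared_charset, errors="replace")


text = "can\u2019t"                    # "can't" with a curly apostrophe
utf8_bytes = text.encode("utf-8")      # 63 61 6e e2 80 99 74
cp1252_bytes = text.encode("cp1252")   # 63 61 6e 92 74

if __name__ == "__main__":
    # Label matches the bytes: the text survives.
    print(repr(render_as(utf8_bytes, "utf-8")))
    # UTF-8 bytes labelled iso-8859-1: three junk characters appear.
    print(repr(render_as(utf8_bytes, "iso-8859-1")))
    # Windows-1252 bytes labelled UTF-8: one replacement character.
    print(repr(render_as(cp1252_bytes, "utf-8")))
```

The last case gives "can" + U+FFFD + "t", which looks a lot like the
can?t symptom, though that alone does not prove it is what happened
here. Either way, the charset label only says how to read the bytes; it
does not change them, so the script, the template, and the database
content can each contribute bytes in a different encoding.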




