[nycphp-talk] [OT]Forcing charset on a shared webhost?

Greg Rundlett greg.rundlett at gmail.com
Thu Oct 20 23:09:26 EDT 2005


On 10/20/05, leam at reuel.net <leam at reuel.net> wrote:
> I've set up a static site for my client while I work on the shopping cart. The html files include double quotes which come across on his browser and mine as odd characters. I figured out how to set the encoding on my Mozilla to universal but the client and customers are not likely as computer literate.
>
> Is there a way to force the browser to see the double quotes properly? As always, if there's something you can point me to that i missed, lemme know.
>

You're on the right track.

The characters in question are probably inserted by a Microsoft
application that is configured (by default) to replace the plain ASCII
characters you type at the keyboard with 'smart quotes', em dashes,
trademark symbols and so on.  These characters are part of the
Windows-1252 character set (http://en.wikipedia.org/wiki/Windows-1252),
which stores them at byte values that do not exist in ASCII and that
mean something else, or nothing at all, in ISO-8859-1 and UTF-8.
There are many reasons why this is a 'dumb' idea.  The main one is
that not all user agents will be able to render such characters: the
browser may assume a different character set, or the character may
NOT be present in the font available on the client system, even IF the
appropriate character set is specified in the document header.  Even a
user on a Microsoft platform (using a non-Microsoft font) could see
incorrect or missing characters.
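
If you want to check whether a given file is affected, here is a
minimal PHP sketch.  The function name classify_bytes is made up for
illustration.  It assumes the PCRE extension (bundled with PHP) and
relies on the fact that a pattern with the /u modifier refuses input
that is not valid UTF-8.

```php
<?php
// Hypothetical helper: roughly classify a byte string as plain ASCII,
// valid UTF-8, or "other" (e.g. a single-byte Windows code page).
function classify_bytes(string $bytes): string
{
    // No byte above 0x7F means the text is pure ASCII.
    if (preg_match('/[\x80-\xFF]/', $bytes) !== 1) {
        return 'ascii';
    }
    // With the /u modifier, preg_match() returns false on invalid UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return 'utf-8';
    }
    return 'other';
}
```

A smart quote saved by a Windows application, such as the single byte
"\x93", should come back as 'other'; that is the case that shows up as
odd characters in the browser.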

The best solution would be to avoid these non-standard characters.

The second best solution would be to replace these troublesome
characters with their Unicode equivalents, save the files as UTF-8,
and then declare that character set in BOTH the HTTP header, like this:
    header("Content-type: text/html; charset=utf-8");
(assuming you're using PHP -- there are many variations depending on
the HTTP server and scripting language in use)
and in your document, using a tag like this for HTML:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
or this for XML:
    <?xml version="1.0" encoding="UTF-8"?>
Text is said to be in a Unicode encoding form if it is encoded in
UTF-8, UTF-16 or UTF-32.  If you want to use a character that you
cannot type or save directly, then you can always use a numeric or
character entity reference, such as &#8220; or &ldquo; for a left
double quote.  There is a great FAQ at the Unicode website here:
http://www.unicode.org/faq/unicode_web.html
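
As a rough sketch of that conversion (one way to do it, not the only
one): the code below assumes the pages were saved as Windows-1252 and
that PHP's iconv extension is available.  The names cp1252_to_utf8 and
send_as_utf8 are hypothetical.

```php
<?php
// Hypothetical helper: re-encode a Windows-1252 byte string as UTF-8.
// Assumes the input really is Windows-1252; text that is already
// UTF-8 would end up double-encoded.
function cp1252_to_utf8(string $bytes): string
{
    $utf8 = iconv('CP1252', 'UTF-8', $bytes);
    if ($utf8 === false) {
        // iconv() returns false when it meets a byte it cannot convert.
        throw new RuntimeException('Input is not valid Windows-1252');
    }
    return $utf8;
}

// Hypothetical helper: declare the charset in the HTTP header, then
// send the converted page.  Must run before any other output.
function send_as_utf8(string $cp1252Html): void
{
    header('Content-Type: text/html; charset=utf-8');
    echo cp1252_to_utf8($cp1252Html);
}
```

For static files it is probably simpler to run the conversion once,
save the result, and let the server send the charset header.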

There is also a 'demoroniser' script
(http://www.fourmilab.ch/webtools/demoroniser/) that will clean your
documents if you have lots of cleanup to do.
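
For a handful of files, the same kind of cleanup can be sketched in a
few lines of PHP.  This is not the demoroniser itself, just a minimal
illustration; the function name and the choice of ASCII stand-ins are
my own assumptions.  It expects single-byte Windows-1252 input and
should not be run on text that is already UTF-8, where these byte
values occur inside multi-byte sequences.

```php
<?php
// Hypothetical helper: swap common Windows-1252 "smart" punctuation
// bytes for plain ASCII stand-ins.
function plain_ascii_punctuation(string $bytes): string
{
    static $map = [
        "\x85" => '...',   // horizontal ellipsis
        "\x91" => "'",     // left single quote
        "\x92" => "'",     // right single quote / apostrophe
        "\x93" => '"',     // left double quote
        "\x94" => '"',     // right double quote
        "\x96" => '-',     // en dash
        "\x97" => '--',    // em dash
        "\x99" => '(TM)',  // trademark sign
    ];
    // strtr() with an array replaces each key wherever it occurs.
    return strtr($bytes, $map);
}
```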

A great tutorial can be found at
http://www.cs.tut.fi/~jkorpela/chars.html, with additional information
at http://www.cs.tut.fi/~jkorpela/html/chars.html, which might lead
you to think "this guy knows his stuff" (Korpela, not me).  You can
read more of his material at http://www.cs.tut.fi/~jkorpela/www.html

Then there is the Wikipedia set of articles on the topic:
http://en.wikipedia.org/wiki/Character_encoding

On setting the HTTP header:
http://www.w3.org/International/O-HTTP-charset.html

No list of references would be complete without a link to the W3C spec
http://www.w3.org/TR/charmod/

- Greg