[nycphp-talk] htmlentities charset bug

Michael B Allen ioplex at gmail.com
Wed Jan 23 14:20:31 EST 2008


On 1/23/08, Cliff Hirsch <cliff at pinestream.com> wrote:
> On 1/23/08 12:58 PM, "Michael B Allen" <ioplex at gmail.com> wrote:
> > Reason:
> > if the browser was really sophisticated about it
> > it could pop-up a dialog that warns you and asks you if you would like
> > to transliterate those characters to ISO-8859-1 equivalent glyphs.
> I wonder if there is any way to detect this on the server side. Htmlentities
> certainly catches the problem, but returns an empty string. Some sort of
> friendlier filter that strips characters that are the wrong charset would be
> very cool.

You could do that. You just have to run the input through the iconv
function with '//TRANSLIT' appended to the output charset.
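
As a rough sketch of the server-side check, assuming the input has
already been confirmed to be UTF-8: ISO-8859-1 covers exactly the code
points U+0000 through U+00FF, so a PCRE character class can tell you
whether anything would be lost before you convert. The function names
fits_latin1 and to_latin1 are made up for illustration.

```php
<?php
// Hypothetical helper: true if every character of a UTF-8 string has an
// ISO-8859-1 equivalent (code points U+0000 through U+00FF).
function fits_latin1($utf8)
{
    // The /u modifier makes PCRE treat the subject as UTF-8. preg_match
    // returns false (not 0) when the subject is not valid UTF-8, so
    // invalid input is reported as "does not fit".
    return preg_match('/[^\x{00}-\x{FF}]/u', $utf8) === 0;
}

// Hypothetical wrapper: convert strictly when nothing would be lost,
// otherwise fall back to //TRANSLIT. What the fallback produces depends
// on the iconv library PHP was built against.
function to_latin1($utf8)
{
    $charset = fits_latin1($utf8) ? 'ISO-8859-1' : 'ISO-8859-1//TRANSLIT';
    return iconv('UTF-8', $charset, $utf8);
}
```

With this, an application could warn the user only when fits_latin1
returns false instead of silently ending up with an empty string.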

One thing to watch out for, though: make sure that the form input is
really UTF-8 and not some Microsoft codepage like CP1250 or CP1252.
If it isn't really UTF-8, that's very annoying. You would have to run
the input through iconv with various possible input encodings and some
Unicode output encoding just to detect what the encoding really is.
Then you can call iconv with the detected input encoding.
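
A minimal sketch of that detection step, under some assumptions: the
helper name to_utf8 and the default candidate list are hypothetical,
and the UTF-8 validity test uses preg_match with the /u modifier rather
than a trial iconv call. Note that single-byte codepages accept almost
any byte sequence, so the first candidate that converts cleanly is only
a guess; this cannot tell CP1250 from CP1252.

```php
<?php
// Hypothetical helper: return the input as UTF-8, guessing the source
// encoding from a short, ordered list of candidates. Returns false if
// nothing converts cleanly.
function to_utf8($input, array $candidates = array('CP1252'))
{
    // preg_match returns false when the subject is not valid UTF-8,
    // and 1 when the empty pattern matches a valid UTF-8 subject.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }
    foreach ($candidates as $charset) {
        // iconv returns false (and raises a notice, silenced here) when
        // the bytes are not valid in the assumed input charset.
        $converted = @iconv($charset, 'UTF-8', $input);
        if ($converted !== false) {
            return $converted;
        }
    }
    return false;
}
```

Input that is already valid UTF-8 passes through untouched; anything
else is treated as the first candidate codepage that iconv accepts.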

The following example uses iconv with //TRANSLIT to convert CP1252 to
ISO-8859-1 and convert the curly quotes to regular quotes:

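#!/usr/bin/php
<?php
// The four CP1252 curly quotes: 0x91 0x92 (single), 0x93 0x94 (double).
$curly_quotes = "\x91\x92\x93\x94";
// A plain conversion to UTF-8 keeps them as curly quotes.
$output = iconv('CP1252', 'UTF-8', $curly_quotes);
echo "[$output]\n";
// ISO-8859-1 has no curly quotes, so //TRANSLIT substitutes regular ones.
$output = iconv('CP1252', 'ISO-8859-1//TRANSLIT', $curly_quotes);
echo "[$output]\n";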
--8<--
$ ./iconvtest.php
[‘’“”] (this should display as curly quotes)
[''""] (this should display as regular quotes)

But if you want your application to support Cyrillic or some other
non-Latin script, the transliteration approach doesn't work, because
ISO-8859-1 has no equivalents for those characters.

Instead, you could keep the text in UTF-8 and simply replace the
offending characters with str_replace, like:

#!/usr/bin/php
<?php
$curly_quotes = "\x91\x92\x93\x94";
$output = iconv('CP1252', 'UTF-8', $curly_quotes);
echo "[$output]\n";
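// Swap the UTF-8 encoded curly quotes for their ASCII equivalents.
$output = str_replace(
    array(
        "\xe2\x80\x98", // U+2018 left single quote
        "\xe2\x80\x99", // U+2019 right single quote
        "\xe2\x80\x9c", // U+201C left double quote
        "\xe2\x80\x9d"  // U+201D right double quote
    ),
    array(
        '\'',
        '\'',
        '"',
        '"'
    ),
    $output
);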
echo "[$output]\n";
--8<--
$ ./reptest.php
[‘’“”] (this should display as curly quotes)
[''""] (this should display as regular quotes)
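
To show why replacing in UTF-8 holds up where transliteration does not,
here is a small sketch of the same idea as a reusable function. The
name ascii_punctuation is hypothetical, and the dash and ellipsis
entries are assumptions about what else an application might want to
normalize; only the four quote mappings come from the example above.

```php
<?php
// Hypothetical helper: map UTF-8 "smart" punctuation to ASCII and leave
// every other character, including non-Latin text, untouched.
function ascii_punctuation($utf8)
{
    static $map = array(
        "\xe2\x80\x98" => "'",   // U+2018 left single quote
        "\xe2\x80\x99" => "'",   // U+2019 right single quote
        "\xe2\x80\x9c" => '"',   // U+201C left double quote
        "\xe2\x80\x9d" => '"',   // U+201D right double quote
        "\xe2\x80\x93" => '-',   // U+2013 en dash (assumed mapping)
        "\xe2\x80\x94" => '--',  // U+2014 em dash (assumed mapping)
        "\xe2\x80\xa6" => '...', // U+2026 ellipsis (assumed mapping)
    );
    // strtr with an array replaces every key in a single pass.
    return strtr($utf8, $map);
}

// Cyrillic text wrapped in curly double quotes: only the quotes change.
echo ascii_punctuation("\xe2\x80\x9cПривет\xe2\x80\x9d"), "\n";
```

Because the data never leaves UTF-8, the Cyrillic letters pass through
unchanged and only the listed punctuation is rewritten.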

Mike

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/


