NYCPHP Meetup

Tue Nov 29 16:16:34 EST 2005

Hi List,

With many thanks to Mikko for his help, I've finally figured this out 
well enough to come up with a way to test for valid Latin-1 input.  If I am 
right about this, then the following modification to Mikko's code will 
report whether or not a particular string, assumed to be UTF-8 encoded, 
is within the Latin-1 character set:

-----------8<-----------
function isValidLatin1String($Str) {
     $latinHex = array ('20', // (space)
     '21', // !
     '22', // "
     // snip for brevity...
     'c3bf' // ÿ
     );

     # While checking for valid UTF-8 stream, compile each character
     # as hex codes and match with latinHex array;
     # correct UTF-8 stream has every character starting with zero bit
     # or first byte has <length of encoding> high bits set and all
     # following bytes have highest bits set to 10.
     for ($i=0; $i<strlen($Str); $i++)
     {
         if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
         else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
         else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
         else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
         else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
         else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
         else return false; # invalid byte
         # verify that n bytes matching bit sequence 10bbbbbb
         # follow where bbbbbb is not 000000
         # failing this test means that input is "overlong UTF-8
         # encoding", which is not allowed.
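         $char = bin2hex($Str[$i]);
         for ($j=0; $j<$n; $j++) {
             # check for end of string before reading the next byte,
             # then require a continuation byte (10bbbbbb)
             if ((++$i == strlen($Str))
                 || ((ord($Str[$i]) & 0xC0) != 0x80)) {
                 return false;
             }
             # append to $char (was misspelled $chara, so the
             # continuation bytes never reached the table lookup)
             $char .= bin2hex($Str[$i]);
         }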
         if (!in_array($char, $latinHex)) {
             return false;
         }
     }
     # couldn't find errors, it's probably valid Latin-1 data.
     return true;
}
-----------8<-----------
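
Since the `$latinHex` table above is snipped, here is a rough sketch of how such a table could be generated instead of typed by hand. The helper name `buildLatin1HexTable` is hypothetical, and the range it covers (U+0020 to U+007E plus U+00A0 to U+00FF, skipping DEL and the C1 controls) is my assumption, not necessarily what the full array contains:

```php
<?php
// Hypothetical helper, not from the original post: returns the UTF-8
// byte sequences, as lowercase hex strings, for the printable Latin-1
// code points. The excluded range 0x7F..0x9F is an assumption.
function buildLatin1HexTable() {
    $table = array();
    for ($cp = 0x20; $cp <= 0xFF; $cp++) {
        if ($cp >= 0x7F && $cp <= 0x9F) {
            continue; // DEL and C1 control characters
        }
        if ($cp < 0x80) {
            $utf8 = chr($cp); // single byte: 0bbbbbbb
        } else {
            // two bytes: 110bbbbb 10bbbbbb
            $utf8 = chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        }
        $table[] = bin2hex($utf8);
    }
    return $table;
}
```

Every Latin-1 code point fits in at most two UTF-8 bytes, which is why only the first two branches of the encoding are needed here.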

Initial testing seems to confirm what I think I've figured out 
already.  Thanks to Mikko for lots of clues and advice from his own 
experience.

Mikko Rantalainen wrote:
> But the problem is that unless you're using UTF-8, you cannot always 
> identify between iso-8859-1 and say windows-1255. The safest way I 
> can think about is to require UTF-8 encoding and then check that the 
> real data I'm getting only uses characters that can be represented 
> with iso-8859-1. 
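
For comparison, the same idea (require UTF-8, then allow only characters that iso-8859-1 can represent) might also be expressed with a single PCRE pattern. This is only a sketch: the function name `isLatin1ViaRegex` is made up, and the allowed range (printable characters only, no control characters) is an assumption. It relies on `preg_match` with the `/u` modifier failing on malformed UTF-8 rather than matching:

```php
<?php
// Hypothetical alternative, not Mikko's or the original code.
// With /u the subject must be valid UTF-8; on malformed input
// preg_match() returns false, so the comparison with 1 fails.
// Assumed range: U+0020..U+007E and U+00A0..U+00FF.
function isLatin1ViaRegex($str) {
    // \z anchors at the very end, so a trailing newline is not ignored
    return preg_match('/^[\x{20}-\x{7E}\x{A0}-\x{FF}]*\z/u', $str) === 1;
}
```

Note that this version also rejects ASCII control characters such as newlines, whereas the loop in isValidLatin1String() skips the table lookup for single-byte characters.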

-- 
Allen Shaw
Polymer (http://polymerdb.org)

[nycphp-talk] enforcing Latin-1 input (follow-up)