NYCPHP Meetup

Tue Nov 22 13:57:25 EST 2005

Mikko Rantalainen wrote:
> The problem is that you cannot reliably distinguish different 8-bit 
> encodings from each other. Latin-1 (iso-8859-1) and Latin-9 
> (iso-8859-15) text may contain identical byte sequences that still 
> represent different content, so you have no way to know which one the 
> user intended to use.

> Some 8 bit encodings have different *probabilities* for different 
> byte sequences and you could make an educated guess which encoding 
> the user agent really used. That would still be just a guess.
> 
> The way I do it is that I send the html with UTF-8 encoding (I also 
> have <form accept-charset="UTF-8" ...> in case some user agent 
> supports that; most user agents just use the same encoding the page 
> with the form used) and I check that the user input is a valid UTF-8 
> byte sequence. [snip...]

I'm very curious how you test this.
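
For reference, one way such a check could look. This is only a sketch in 
Python, not necessarily what Mikko does: it treats the input as valid if 
a strict UTF-8 decode succeeds. The function name is_valid_utf8 is made 
up for illustration (in PHP, mb_check_encoding() plays a similar role).

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw request bytes decode as strict UTF-8.

    Hypothetical helper for illustration; the name is not from the thread.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same word sent by a UTF-8 page and by a Latin-1 page.
utf8_input = "caf\u00e9".encode("utf-8")      # e-acute becomes two bytes
latin1_input = "caf\u00e9".encode("latin-1")  # e-acute becomes one byte, 0xE9

print(is_valid_utf8(utf8_input))
print(is_valid_utf8(latin1_input))
```

Note that this can only reject input. Pure ASCII passes in either 
encoding, and some Latin-1 byte sequences happen to be well-formed UTF-8 
too, so a pass does not prove the user agent really sent UTF-8.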

Also, I'm continuing to read more on all of this (and cripes, there's a 
lot to read...), but just so I don't lose momentum here, I want to ask 
what you think of this half-baked idea:

A form on a document with iso-8859-1 encoding will apparently (according 
to a few quick tests) encode its user input into Latin-1 also.  If I put 
something else in there, say that Japanese string I gave you, it gets 
encoded into 
"&#22823;&#38442;&#24066;&#28010;&#36895;&#21306;&#12398;&#12510;&#12531;&#12471;&#12519;&#12531;"
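
The substitution seen in those tests can be imitated outside a browser. 
A minimal sketch, assuming Python: its "xmlcharrefreplace" error handler 
replaces any character the target encoding cannot represent with a 
decimal numeric reference. The function name is hypothetical, and real 
user agents are not guaranteed to behave identically.

```python
def encode_like_latin1_form(text: str) -> bytes:
    """Encode text as Latin-1, turning unencodable characters into
    decimal numeric character references (hypothetical helper)."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


# First two characters of the Japanese string above (U+5927, U+962A).
print(encode_like_latin1_form("\u5927\u962a"))

# A character that Latin-1 can represent is sent as a plain byte.
print(encode_like_latin1_form("caf\u00e9"))
```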

So, if I can find user input matching a regex pattern like '&#\d+;', I 
know the user is either intentionally typing HTML numeric entities into 
my form, or trying to enter some non-Latin-1 characters.  Of course, 
what to do with that information is an app-level question, but the 
question here is: is this really a valid test, or will I be getting 
false positives and false negatives?
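
To make the question concrete, here is a rough sketch of the test in 
Python, along with inputs where it might mislead. The helper name is 
made up, and the last case rests on an assumption: that the browser 
treats iso-8859-1 as windows-1252 and sends the euro sign as the single 
byte 0x80 rather than as a numeric reference.

```python
import re

# The pattern from the message: decimal numeric character references.
NUMERIC_ENTITY = re.compile(r"&#\d+;")


def looks_non_latin1(field: str) -> bool:
    """Return True if the field contains a decimal numeric reference
    (hypothetical helper implementing the proposed test)."""
    return NUMERIC_ENTITY.search(field) is not None


# Browser-substituted Japanese text: flagged, as intended.
substituted = "&#22823;&#38442;"

# A user literally typing a reference: flagged too, and not
# distinguishable from the case above.
typed_by_hand = "Use &#169; for the copyright sign"

# Hexadecimal references are not matched by \d+.
typed_hex = "&#x5927;"

# Assumption: euro sign sent as windows-1252 byte 0x80, then read
# by the application as Latin-1. No reference appears at all.
euro_as_cp1252 = "\u20ac".encode("cp1252").decode("latin-1")

for sample in (substituted, typed_by_hand, typed_hex, euro_as_cp1252):
    print(repr(sample), looks_non_latin1(sample))
```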

Any thoughts?

-- 
Allen Shaw
Polymer (http://polymerdb.org)

[nycphp-talk] enforcing Latin-1 input