[nycphp-talk] somewhat OT Re: validating proper name capitalization

Tedd Sperling tedd.sperling at gmail.com
Mon Oct 3 17:17:20 EDT 2011


On Oct 3, 2011, at 1:53 PM, Jerry B. Altzman wrote:

> Returning to my original obPHP (so that Hans doesn't get upset at me):  how do punycoded URLs and their Unicoded (or other-encoded) counterparts get dealt with in real life PHP?  Who is dealing with them, and how well does PHP+your underlying OS manage it?  Do you need to do environment-wrangling to make encoding issues go away?  Tedd's original response "by wishing Microsoft never existed" is glib but unhelpful. The homographic problem is also huge, no doubt, but computers by and large aren't fooled by the difference between А and A. (BTW the former is U+0410 'Cyrillic Capital Letter A', the latter is U+0041 'Latin Capital Letter A'.  Depending on the font you use in your reader, you may or may not see a difference between the two.)

Punycode is the ASCII form a URL takes for an IDN (Internationalized Domain Name). PHP doesn't have to deal with it any more or less than any other URL.
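For the cases where a script does need to convert between the two forms, here is a minimal sketch. It assumes the intl extension is installed (idn_to_ascii() and idn_to_utf8() live there, not in core PHP), and the domain name is a made-up example rather than anything from this thread.

```php
<?php
// Sketch only: requires the intl extension; the host name is hypothetical.
// "bücher.example" written with byte escapes so the UTF-8 encoding is explicit.
$unicodeHost = "b\xC3\xBCcher.example";

if (function_exists('idn_to_ascii')) {
    // Unicode form -> Punycode (ASCII) form
    $asciiHost = idn_to_ascii($unicodeHost, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    // Punycode form -> Unicode form
    $roundTrip = idn_to_utf8($asciiHost, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    echo $asciiHost, "\n";
    echo $roundTrip, "\n";
} else {
    echo "intl extension not available\n";
}
```

Once the host name is in its Punycode form it is plain ASCII, so the rest of the URL handling is no different from any other URL.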

For most browsers, entering a Punycode string is the only way to reach a domain name that contains non-ASCII Unicode characters (code points). Only in the Safari browser can a user enter a string containing something other than ASCII and have the browser accept that string as a real URL and direct the user to the proper site. Other browsers convulse.

None of the above has anything to do with PHP.

The following is from memory and may be flawed, but should be close:

Now, PHP string management (the string functions) is a different story. Please realise that the standard built-in PHP string functions, such as strstr(), work byte by byte: they assume one byte per character, which holds for standard ASCII strings but not for multibyte encodings.

If you are dealing with Unicode strings, they are handled differently, for example with mb_strstr(). The "mb_" prefix means multibyte, and Unicode strings are composed of multibyte characters.
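A minimal sketch of the difference between the byte-oriented functions and their mb_ counterparts. It assumes the mbstring extension is loaded and that the data is UTF-8; the sample name is hypothetical and is written with byte escapes so its encoding does not depend on how this message was saved.

```php
<?php
// Sketch only: assumes mbstring is loaded and the string is UTF-8.
// Hypothetical sample: "Zoë Åkesson" ("ë" and "Å" take two bytes each).
$name   = "Zo\xC3\xAB \xC3\x85kesson";
$needle = "\xC3\x85"; // "Å"

$byteLength = strlen($name);             // counts bytes
$charLength = mb_strlen($name, 'UTF-8'); // counts characters (code points)

$bytePos = strpos($name, $needle);                // offset in bytes
$charPos = mb_strpos($name, $needle, 0, 'UTF-8'); // offset in characters

// Cutting by bytes can split a multibyte character in half.
$byteCut = substr($name, 0, 3);             // "Zo" plus the first byte of "ë"
$charCut = mb_substr($name, 0, 3, 'UTF-8'); // "Zoë"

// mb_strstr() returns the rest of the string, starting at the needle.
$tail = mb_strstr($name, $needle, false, 'UTF-8'); // "Åkesson"

printf("bytes: %d, characters: %d\n", $byteLength, $charLength);
printf("byte offset: %d, character offset: %d\n", $bytePos, $charPos);
printf("byte cut is valid UTF-8: %s\n", mb_check_encoding($byteCut, 'UTF-8') ? 'yes' : 'no');
printf("character cut: %s, tail: %s\n", $charCut, $tail);
```

Searching with strstr() still finds a valid UTF-8 needle, because it compares bytes; the trouble starts when lengths, offsets, or cut points are treated as character counts.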

In other words, extending the character set from standard ASCII to all of Unicode (ASCII is actually a subset of Unicode) means that more information is required to address each code point. As a result, special functions had to be created to deal with the extended set of characters (code points) that the Unicode database provides.
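To make "more information per code point" concrete, here is a small sketch that prints the code point and the UTF-8 byte count for a few characters, including the two capital A's Jerry mentioned. It assumes the mbstring extension and PHP 7.2 or later (for mb_ord()); the choice of sample characters is mine.

```php
<?php
// Sketch only: assumes mbstring and PHP 7.2+ (mb_ord()).
// Sample characters are written as explicit UTF-8 byte sequences.
$samples = [
    'Latin A'    => "A",                // U+0041, plain ASCII
    'Cyrillic A' => "\xD0\x90",         // U+0410
    'Euro sign'  => "\xE2\x82\xAC",     // U+20AC
    'Emoji'      => "\xF0\x9F\x98\x80", // U+1F600
];

$report = [];
foreach ($samples as $label => $char) {
    $report[$label] = [
        'codepoint' => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
        'bytes'     => strlen($char), // bytes used by the UTF-8 encoding
    ];
}

foreach ($report as $label => $row) {
    printf("%-10s %-8s %d byte(s)\n", $label, $row['codepoint'], $row['bytes']);
}

// The two A's may look identical on screen, but they are different strings.
$sameLetter = ($samples['Latin A'] === $samples['Cyrillic A']);
var_dump($sameLetter);
```

The ASCII character keeps its single byte in UTF-8, while the others need two to four, which is why functions that count bytes and functions that count characters give different answers.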

HTH's

tedd

_____________________
tedd at sperling.com
http://sperling.com

