[nycphp-talk] somewhat OT Re: validating proper name capitalization

Tedd Sperling tedd.sperling at gmail.com
Tue Oct 4 14:37:05 EDT 2011


On Oct 4, 2011, at 1:13 PM, John Campbell wrote:

>> I understand that, but I'm asking something like: if you type in •.com into
>> your browser, what's getting passed to the server behind the scenes?
> 
> The ASCII-encoded string (xn--...).  It must be this way because the
> HTTP protocol requires the header to be completely US-ASCII.
> 
> It is best to think of punycode as just a browser address bar display hack.
> 
> -jc

It's not so much a hack as a way to send information that exceeds the capability of the medium.

The Internet is/was based upon a seven-bit character set, not eight bits or more, so plain ASCII was used from the beginning. Domain names composed of ASCII characters posed no problems -- after all, they're all English characters. However, when additional (non-English) characters were needed, there was no way to address them. After all, if you are held to only 128 possible characters (seven-bit), how can you address the more than 65,000 characters found in Unicode?

So, circa 2000 the IDNS WG was established to create a method of using seven-bit addressing to accomplish more than the DNS was originally designed to do. Among the early candidate algorithms were RACE and AMC-ACE, and the group finally settled on Punycode. These were simply algorithms that used a prefix (such as "xn--" in Punycode) to mark the characters that follow as an encoding of Unicode code points. For example, in the string "xn--19g", the "xn--" identifies the label as an internationalized domain name, and the Punycode algorithm decodes the "19g" to a square-root symbol (√).
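
To make the prefix-plus-encoding idea concrete, here is a minimal sketch in Python, which happens to ship a "punycode" codec in its standard library. The helper names to_ace and from_ace are made up for illustration, and the sketch skips the normalization steps that full IDNA processing applies before encoding.

```python
ACE_PREFIX = "xn--"  # marks a label as Punycode-encoded


def to_ace(label: str) -> str:
    """Encode one domain label (hypothetical helper, no IDNA normalization)."""
    if label.isascii():
        return label  # plain ASCII labels are sent unchanged
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(label: str) -> str:
    """Decode one label back to Unicode if it carries the prefix."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


print(to_ace("\u221a"))     # U+221A is the square-root symbol; prints xn--19g
print(from_ace("xn--19g"))  # prints the square-root symbol
```

Note that a label like "com" passes through both helpers untouched, which is why ordinary ASCII domains never needed any of this.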

I believe that browsers like Safari have the Punycode algorithm built in and thus can translate on the fly between the characters entered via the keyboard and what's transmitted via HTTP. Keep in mind that Punycode was never meant to be seen by the end user. All domain names were supposed to be seen in their native language.
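
As a rough picture of that on-the-fly translation, the following Python sketch converts a typed hostname into the ASCII form that would go into the HTTP Host header. The function name host_header is hypothetical, and the sketch only assumes the general shape of what a browser does; real browsers also apply IDNA mapping and validation rules that are left out here.

```python
def host_header(hostname: str) -> str:
    """Build an ASCII-only Host header line from a typed hostname (sketch)."""
    labels = []
    for label in hostname.split("."):
        if not label.isascii():
            # Non-ASCII label: Punycode-encode it and add the "xn--" prefix
            label = "xn--" + label.encode("punycode").decode("ascii")
        labels.append(label)
    return "Host: " + ".".join(labels)


# "\u221a" is the square-root symbol, so this is the hostname "<sqrt>.com"
print(host_header("\u221a.com"))  # prints Host: xn--19g.com
```

Each label is encoded separately, so only the part of the name that actually contains non-ASCII characters gets the prefix.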

Now, my understanding of the specific process may be flawed, but my description should be generally correct. Please understand that I lurked on the IDNS WG at the time (circa 2000), but did not fully understand everything that was discussed. There were some very smart people on that WG.

Cheers,

tedd

_____________________
tedd at sperling.com
http://sperling.com


