[nycphp-talk] Character set issues revisited

tedd tedd at sperling.com
Thu Oct 25 10:49:21 EDT 2007


At 11:12 AM -0400 10/23/07, Michael B Allen wrote:
>On 10/23/07, tedd <tedd at sperling.com> wrote:
>>  At 7:21 PM -0400 10/21/07, John Campbell wrote:
>>  >The first thing to understand about character encoding is the overlap
>>  >between UTF-8 and 8859-1.  Below is a sample
>>  >a - lower case a (Same in 8859-1 & UTF-8)
>>  >à - a grave (Available in 8859-1 & UTF-8 but different values)
>>  >éí - Chinese character (Not in 8859-1, in UTF-8)
>>
>>  A small clarification -- it's not really overlap,
>>  but rather UTF-8 is a super-set containing 8859-1
>>  like both contain ASCII.
>
>Well if you want to be pedantic about it, "overlap" is more accurate.
>UTF-8 is a multibyte encoding of the Unicode charset. ISO-8859-1 is a
>single byte encoding of the ISO-8859-1 charset. So yes, Unicode is a
>superset of ISO-8859-1 but the UTF-8 encoding of values above 0x7f are
>not the same.
>
>Mike


You are free to call it what you want.

True, the code-points for the ISO-8859-1 charset
above 0x7F (the M$ spin) are not the same as
UTF-* et al, but the glyphs are still included in
UTF-8 regardless of encoding differences -- is
that not true?

If this is true, then the term "overlap" would be
less correct than "super-set" because the two
sets do not overlap with respect to all
code-points -- but the larger one still contains
all the glyphs that the smaller one does (with the
exception of Apple's spin on that set, which
included adding their logo).
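
A rough way to check both halves of that claim, sketched in Python
rather than PHP purely for brevity (the variable names are made up
for illustration, and "ISO-8859-1" is taken literally here, not the
Windows or Apple variants): decode each of the 256 single-byte values
as ISO-8859-1, re-encode the resulting character as UTF-8, and see
which ones come out as the same bytes.

```python
# Sketch: compare ISO-8859-1 and UTF-8 for all 256 single-byte values.
same_bytes = []       # byte values whose UTF-8 encoding is the identical single byte
different_bytes = []  # byte values whose UTF-8 encoding differs

for value in range(256):
    raw = bytes([value])
    ch = raw.decode("iso-8859-1")  # every byte value maps to a character here
    utf8 = ch.encode("utf-8")      # succeeds for all of them: UTF-8 can represent each one
    if utf8 == raw:
        same_bytes.append(value)
    else:
        different_bytes.append(value)

print(len(same_bytes), len(different_bytes))  # 128 128

# The "a grave" example from the quoted sample, written as an escape
# so the script does not depend on the source file's own encoding.
print("\u00e0".encode("iso-8859-1"))  # b'\xe0'
print("\u00e0".encode("utf-8"))       # b'\xc3\xa0'
```

Every character survives the trip into UTF-8, which is the
"super-set" reading; only the ASCII half (0x00 to 0x7F) keeps the
same bytes, which is the "overlap" reading.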

That's the reason I'm free to call one a super-set of the other.

I believe it's easier to explain char-sets and 
code-points in terms of current Unicode standards 
than it is to point out historical differences 
that are diminishing in importance as more people 
convert.

Cheers,

tedd
-- 
-------
http://sperling.com  http://ancientstones.com  http://earthstones.com


