NYCPHP Meetup

[nycphp-talk] Charsets are still driving me nuts

Michael B Allen ioplex at gmail.com
Wed Mar 5 22:19:57 EST 2008


On 3/5/08, John Campbell <jcampbell1 at gmail.com> wrote:
>  Frankly, I rarely find the need to use the multibyte functions (or any
>  string functions for that matter) on user data.

Agreed. The multibyte functions are rarely necessary. The problem
area is when you want to iterate over a string and evaluate each
character independently. But in almost all of those scenarios you are
just searching for an ASCII character, and ASCII characters cannot
appear within a UTF-8 multibyte sequence [1] because every byte of
such a sequence has the high bit set. Standard bytewise iteration is
therefore ok (e.g. strchr is ok if the needle is a single ASCII
character).
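
To make that concrete, here is a minimal sketch that splits a UTF-8 string on an ASCII comma with the plain byte functions. The sample string and variable names are made up for illustration; the non-ASCII characters are written as byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "é" is 0xC3 0xA9 and "€" is 0xE2 0x82 0xAC in UTF-8.
$str = "caf\xC3\xA9,\xE2\x82\xAC5";

// Bytewise search for an ASCII needle. Every byte of a multibyte UTF-8
// sequence is >= 0x80, so 0x2C (",") can only match a real comma.
$pos = strpos($str, ',');           // byte offset, not character offset

// Cutting at that offset is safe: it falls on a character boundary.
$before = substr($str, 0, $pos);    // "café"
$after  = substr($str, $pos + 1);   // "€5"
```

Note that $pos counts bytes, so it is only meaningful as an argument to the other byte-oriented functions.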

Two operations that do go wrong with multibyte encodings are chopping
a string to a fixed length and doing a case-insensitive string
comparison. If you want to display a summary of some search results
(e.g. where the string is chopped off with an ellipsis at the end) you
cannot just use substr, because it counts bytes and may cut the last
character in half. In that case you would need to use
mb_substr($str, $start, $length, 'UTF-8'), which counts characters.
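
A rough sketch of the summary case, assuming the mbstring extension is loaded. The summarize() helper and the sample string are hypothetical, not something from this thread.

```php
<?php
// Hypothetical helper: cut $str to at most $max characters and append
// an ellipsis if anything was removed.
function summarize($str, $max)
{
    if (mb_strlen($str, 'UTF-8') <= $max) {
        return $str;
    }
    return mb_substr($str, 0, $max, 'UTF-8') . '...';
}

$title = "na\xC3\xAFve caf\xC3\xA9";   // "naïve café" as UTF-8 bytes

// Byte-based cut: keeps "na" plus only the first byte of "ï",
// which leaves an invalid UTF-8 string.
$bytes = substr($title, 0, 3);

// Character-based cut: keeps "naï" intact, then adds the ellipsis.
$chars = summarize($title, 3);
```

Passing the encoding explicitly avoids depending on whatever the internal mbstring encoding happens to be on a given server.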

Mike

[1] ASCII byte values can appear within multibyte sequences of
encodings other than UTF-8. For example, the backslash (\) byte can
appear as the second byte of a character in a certain Japanese
encoding (I don't recall which, I think it was Shift_JIS).
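
A small illustration of that footnote, assuming Shift_JIS is indeed the encoding in question. In Shift_JIS the katakana character "ソ" is encoded as the two bytes 0x83 0x5C, and 0x5C is also the ASCII backslash.

```php
<?php
// Shift_JIS bytes for the katakana "ソ": lead byte 0x83, trail byte 0x5C.
$sjis = "\x83\x5C";

// A bytewise search reports a backslash although the text contains none.
$pos = strpos($sjis, '\\');
```

Here $pos comes back as 1 rather than false, which is the kind of false match that cannot happen with UTF-8.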

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/


