NYCPHP Meetup

NYPHP.org

[nycphp-talk] PHP + UTF-8 + mb_string issue.

Mark Armendariz lists at enobrev.com
Wed Mar 21 06:35:37 EDT 2007


> -----Original Message-----
> [mailto:talk-bounces at lists.nyphp.org] On Behalf Of Anirudh Zala
> 
> Question is why PHP is not able to count length of given 
> string in practical way. I am aware that current PHP versions 
> are not aware of string, instead they just deal with bytes. 
> In that case output is correct but this is not practical 
> solution as length of word in Gujarati language is only "2" 
> (In Indic languages, we have primary characters like "?" and 
> secondary characters like "?", but there is not value of 
> secondary characters without primary
> characters) and not "4" even if it requires 4 bytes to store data.


It's my understanding that the mbstring extension doesn't replace PHP's
built-in string functions by default.  If you're using the extension, you'll
have to call the mb_* functions explicitly (mb_strlen in this case).
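
To make the difference concrete, here is a minimal sketch. It assumes the
mbstring extension is loaded and the string is valid UTF-8; the helper name
byte_and_char_length is made up for illustration. Note that mb_strlen counts
code points, so the encoding is passed explicitly rather than relying on the
configured default.

```php
<?php
// Hypothetical helper: report both the byte length and the character
// (code point) length of a UTF-8 string. Requires the mbstring extension.
function byte_and_char_length($s) {
    return array(
        'bytes' => strlen($s),             // strlen() counts bytes
        'chars' => mb_strlen($s, 'UTF-8'), // mb_strlen() counts code points
    );
}

// "hello" with an e-acute, written as raw UTF-8 bytes so the example
// does not depend on the encoding of the source file.
$word = "h\xC3\xA9llo";
$len  = byte_and_char_length($word);

echo $len['bytes'], "\n"; // 6
echo $len['chars'], "\n"; // 5
```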

On another note, here's something to use if you don't or can't use the extension:
http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php

I grabbed this while doing research for a project I haven't started yet - so
I haven't had the chance to try it out, but it comes well recommended.

Specific to your case (from the link):

/**
 * Unicode aware replacement for strlen()
 *
 * utf8_decode() converts characters that are not in ISO-8859-1
 * to '?', which, for the purpose of counting, is alright - It's
 * even faster than mb_strlen.
 *
 * @author <chernyshevsky at hotmail dot com>
 * @see    strlen()
 * @see    utf8_decode()
 */
function utf8_strlen($string){
  return strlen(utf8_decode($string));
}
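
For what it's worth, here is a rough usage sketch of that technique, assuming
valid UTF-8 input. The function is redefined under a made-up name
(utf8_strlen_demo) so it doesn't clash with the original. Two caveats I'm
fairly sure of: utf8_decode() is deprecated as of PHP 8.2, and the result is a
count of code points, so a Gujarati consonant followed by a vowel sign counts
as 2.

```php
<?php
// Same technique as utf8_strlen() above, under a hypothetical name.
// utf8_decode() turns each character outside ISO-8859-1 into a single '?',
// so the byte length of the result equals the number of code points.
function utf8_strlen_demo($string) {
    return strlen(utf8_decode($string));
}

// Raw UTF-8 bytes, so the example does not depend on the file encoding.
$latin    = "h\xC3\xA9llo";             // "hello" with an e-acute
$gujarati = "\xE0\xAA\x97\xE0\xAB\x81"; // U+0A97 followed by U+0AC1

echo strlen($gujarati), "\n";           // 6 (bytes)
echo utf8_strlen_demo($gujarati), "\n"; // 2 (code points)
echo utf8_strlen_demo($latin), "\n";    // 5 (code points)
```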



I hope that works for you.

Mark Armendariz



