[nycphp-talk] PHP + UTF-8 + mb_string issue.

Anirudh Zala arzala at gmail.com
Wed Mar 21 09:19:20 EDT 2007


On Wednesday 21 March 2007 16:05, Mark Armendariz wrote:
> > -----Original Message-----
> > [mailto:talk-bounces at lists.nyphp.org] On Behalf Of Anirudh Zala
> >
> > Question is why PHP is not able to count length of given
> > string in practical way. I am aware that current PHP versions
> > are not aware of string, instead they just deal with bytes.
> > In that case output is correct but this is not practical
> > solution as length of word in Gujarati language is only "2"
> > (In Indic languages, we have primary characters like "?" and
> > secondary characters like "?", but there is not value of
> > secondary characters without primary
> > characters) and not "4" even if it requires 4 bytes to store data.
>
> It's my understanding that the mbstring extension doesn't actually replace
> php functions.  If you're using the extension, you'll have to use the
> mb_string functions, (mb_strlen in this case).

It does if you set the directive "mbstring.func_overload=7" in php.ini or in 
the server configuration file.
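
As a rough sketch of what the value 7 means: the directive is a bitmask, 
where 1 overloads mail(), 2 overloads the string functions (strlen(), 
substr() and so on) and 4 overloads the ereg-style regex functions. The 
helper name below is hypothetical and only there for illustration. Note that 
mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so 
on those versions the check simply reports that nothing is overloaded.

```php
<?php
// Hypothetical helper: reports whether the string-function bit (2) is set
// in a mbstring.func_overload bitmask (1 = mail, 2 = string, 4 = regex).
function string_functions_overloaded(int $flags): bool
{
    return ($flags & 2) !== 0;
}

// ini_get() returns false when the directive does not exist (PHP 8.0+);
// the cast turns that into 0, meaning "no overloading".
$flags = (int) ini_get('mbstring.func_overload');

echo string_functions_overloaded($flags)
    ? "strlen() is overloaded by mb_strlen()\n"
    : "strlen() counts bytes\n";
```

With the value 7 all three bits are set, so strlen() would behave like 
mb_strlen() and count code points in the internal encoding.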

>
> On another note, something to use if you don't / can't use the extensions:
> http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
>
> I grabbed this while doing research for a project I haven't started yet -
> so I haven't had the chance to try it out, but it comes well recommended.
>
> Specific to your cause (from the link):
>
> /**
>  * Unicode aware replacement for strlen()
>  *
>  * utf8_decode() converts characters that are not in ISO-8859-1
>  * to '?', which, for the purpose of counting, is alright - It's
>  * even faster than mb_strlen.
>  *
>  * @author <chernyshevsky at hotmail dot com>
>  * @see    strlen()
>  * @see    utf8_decode()
>  */
> function utf8_strlen($string){
>   return strlen(utf8_decode($string));
> }
>

I use UTF-8 encoding throughout (input, processing, storage and output), 
so the data is already in UTF-8 format.

I have found that it is the UTF-8 encoding itself that does not properly 
support Indic languages, because of the way it has been designed. My test 
word requires 4 bytes to store in UTF-8, so in that sense php + mbstring 
works properly. I have concluded that this is no longer a php and/or 
mbstring issue, as the same thing happens when I calculate the length using 
SQL or any other scripting language.

In Indic languages, there are separate letters for vowels and consonants. 
The grammatical rule is that a vowel used alone counts as length 1, but 
when it is used together with a consonant, only the consonant is counted. 
That is how these languages are used in communication, writing, reading 
etc. But when accommodating the characters in UTF-8 encoding, the 
developers either thought about it differently or were told to do so.
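
The counting rule above can be approximated in PHP by counting grapheme 
clusters instead of bytes or code points. This is only a sketch under some 
assumptions: the sample word (RA + vowel sign AA + JA + vowel sign AA) is a 
hypothetical example and not the word from my test, grapheme_count() is a 
made-up helper name, the "\u{...}" escapes need PHP 7 or later, and 
func_overload is assumed to be off so that strlen() counts bytes. It relies 
on the PCRE escape \X, which matches one grapheme cluster, so a consonant 
followed by its dependent vowel sign counts once. Conjuncts joined with a 
virama are not covered and may be counted differently.

```php
<?php
// Hypothetical helper: counts user-perceived characters (grapheme clusters)
// in a UTF-8 string. \X matches one cluster; /u enables UTF-8 mode.
function grapheme_count(string $s): int
{
    $n = preg_match_all('/\X/u', $s);
    if ($n === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

// Hypothetical sample word: RA, vowel sign AA, JA, vowel sign AA.
$word = "\u{0AB0}\u{0ABE}\u{0A9C}\u{0ABE}";

printf("bytes:       %d\n", strlen($word));             // 12 (3 bytes each)
printf("code points: %d\n", mb_strlen($word, 'UTF-8')); // 4
printf("graphemes:   %d\n", grapheme_count($word));     // 2
```

For this word mb_strlen() reports 4 while the count a reader would expect is 
2, which is the same mismatch as described above.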

>
>
> I hope that works for you.
>
> Mark Armendariz
>

Thanks for the suggestions.

Anirudh Zala

(30% of Internet resources, 
used to deliver web-pages, 
are wasted by unnecessary 
tabs and spaces.)


