[nycphp-talk] PHP + UTF-8 + mb_string issue.

Anirudh Zala arzala at gmail.com
Wed Mar 21 02:40:24 EDT 2007


Subject: Re: [nycphp-talk] PHP + UTF-8 + mb_string issue.
Date: Wednesday 21 March 2007 12:07
From: Anirudh Zala <arzala at gmail.com>
To: Michael B Allen <mba2000 at ioplex.com>

On Wednesday 21 March 2007 11:49, you wrote:
> On Wed, 21 Mar 2007 11:48:20 +0530
>
> Anirudh Zala <arzala at gmail.com> wrote:
> > On Wednesday 21 March 2007 11:36, you wrote:
> > > On Wed, 21 Mar 2007 10:50:26 +0530
> > >
> > > Anirudh Zala <arzala at gmail.com> wrote:
> > > > Hello Everybody,
> > > >
> > > > While building a truly multilingual project, I am running into an
> > > > interesting problem with php5 + utf-8 + mb_string.
> > >
> > > <snip>
> > >
> > > > ____________  = 1 word; 4 bytes; 2 characters (______, ______); 4
> > > > key-strokes (___, ___, ___, ___); "strlen" should be 2 but is 4.
> > >
> > > Generally the libc-like functions exhibit libc behavior so 4 is the
> > > correct answer.
> > >
> > > Is mb_strlen not suitable for some reason? You have to use mb_*
> > > functions whenever you perform character-wise operations as opposed to
> > > byte-wise (and that assumes you're running in the UTF-8 locale).
> > >
> > > Mike
> >
> > I am using the mb_* functions with UTF-8 as the locale. Everything is
> > processed transparently in UTF-8 only. I tested the same thing with the
> > "iconv" extension and got the same results. It looks like this is the
> > behavior of PHP + mb_*.
>
> I don't understand. You used mb_strlen and got 4? If so, what are the
> 4 bytes that make up the 2 characters exactly?
>
> Mike

That is because the length of the string should be "2" in the way it is
actually used in communication, writing, speech, etc. But PHP needs 4 bytes to
store it and so gives the length as "4". As I said in my original mail, Indic
languages have primary characters (like ઝ, લ) and secondary characters
(like ા) that combine to create different meanings, but secondary characters
should not be counted when calculating the length of a word (even though each
one requires additional bytes to store). This is the issue.
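
To make the three different counts concrete, here is a minimal sketch. It is
only an illustration under assumptions: the sample word ઝાલા is hypothetical
(built from the characters mentioned above, not taken from the project), the
\u{...} escapes need PHP 7 or later, and grapheme_strlen() needs the intl
extension (the sketch falls back to the PCRE \X pattern when intl is not
loaded).

```php
<?php
// Hypothetical sample word built from the characters named in this mail:
// ઝ (U+0A9D), ા (U+0ABE), લ (U+0AB2), ા (U+0ABE).
$word = "\u{0A9D}\u{0ABE}\u{0AB2}\u{0ABE}";

// strlen() counts bytes of the UTF-8 encoding.
$bytes = strlen($word);

// mb_strlen() counts Unicode code points, so each vowel sign counts as one.
$codePoints = mb_strlen($word, 'UTF-8');

// Grapheme clusters group a base letter with its dependent vowel sign.
// Assumption: intl may not be loaded, so fall back to PCRE's \X.
$clusters = function_exists('grapheme_strlen')
    ? grapheme_strlen($word)
    : preg_match_all('/\X/u', $word);

printf("bytes=%d code_points=%d clusters=%d\n", $bytes, $codePoints, $clusters);
// prints: bytes=12 code_points=4 clusters=2
```

For this sample word the grapheme-cluster count is the "2" that a reader of
the script would expect, while mb_strlen() reports the 4 code points.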

--

Anirudh Zala

(30% of Internet resources,
used to deliver web-pages,
are wasted by unnecessary
tabs and spaces.)


