[nycphp-talk] iterating through a multibyte string

John Campbell jcampbell1 at gmail.com
Wed Jan 13 12:44:57 EST 2010


You forgot
mb_internal_encoding("UTF-8");

Without that, mb_substr falls back to the default internal encoding; if
that is a single-byte encoding, it behaves just like substr.
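
To make the encoding point concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and it passes the encoding explicitly so the result does not depend on whatever the global default happens to be. The variable names are only illustrative.

```php
<?php
// "åèö" written as raw UTF-8 bytes: 3 characters, 6 bytes.
$str = "\xC3\xA5\xC3\xA8\xC3\xB6";

// Treated as a single-byte encoding, "one character" is one byte,
// which is the same thing substr($str, 0, 1) would return.
$asLatin1 = mb_substr($str, 0, 1, 'ISO-8859-1');

// Treated as UTF-8, the first character is the full two-byte "å".
$asUtf8 = mb_substr($str, 0, 1, 'UTF-8');

echo strlen($asLatin1), "\n"; // 1
echo strlen($asUtf8), "\n";   // 2

// Setting the internal encoding once lets the short form do the same.
mb_internal_encoding('UTF-8');
echo mb_strlen($str), "\n";   // 3
```

With a single-byte internal encoding the benchmark loop walks bytes, not characters, so it does roughly the work of the plain substr version.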

my results look like:

normal iteration took 0.64724087715149
mb_substr method took 16.471849918365
mb_substr method with shortening the string took 21.613878965378
preg_split method took 1.927277803421

Dan is the winner.  preg_split always runs in linear time.  Both of
the mb_substr methods are O(N^2), because the first step in mb_substr
is splitting the string into an array.  Each call therefore costs O(N),
and the loop makes N calls.  It is not as intelligent as I initially
assumed.
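
For reference, a rough sketch of the split-once approach, assuming the input is valid UTF-8. The helper name utf8_chars is made up for this example; it is not a built-in function.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters in one pass.
function utf8_chars($str)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern matches between characters rather than between bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false when the subject is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

$str = "a\xC3\xA5b\xC3\xB6"; // "aåbö" as raw UTF-8 bytes
$out = '';
foreach (utf8_chars($str) as $ch) {
    $out .= $ch; // each character is visited exactly once
}
```

The string is scanned once up front and the loop then does constant work per character, which is where the linear behaviour comes from.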

Regards,
John Campbell

On Wed, Jan 13, 2010 at 11:37 AM, Rob Marscher
<rmarscher at beaffinitive.com> wrote:
> OK.  Here are the results of my rough benchmark.  Every time I ran it, the results were within about .025 seconds of each other so it seems accurate.  Surprisingly, my original mb_substr method won, with preg_split taking just a little bit longer.  John's method of grabbing the first character and then removing it from the string actually seems to take almost exponentially more time based on how long the string is.  I set $strSize to 1000 and had to kill it because I didn't want to wait so long.  There must be something pretty inefficient going on in mb_substr to make that the case.  I suppose we could look at the source to get to the bottom of it... but I think I've already spent as much time on this as I'm willing to.  Thanks again to you guys.
>
> $ php mbtest.php
> normal iteration took 0.8041729927063
> mb_substr method took 1.7228858470917
> mb_substr method with shortening the string took 7.9840841293335
> preg_split method took 2.1547298431396
>
> $ cat mbtest.php
> <?php
>
> $strSize = 100;
> $repeats = 1000;
>
> // make the string somewhat large
> $str = '';
> for ($i = 0; $i < $strSize; $i++) {
>        $str .= "string with utf-8 chars\n   åèö";
> }
>
> // non-multibyte iteration
> $start = microtime(true);
> for ($i = 0; $i < $repeats; $i++) {
>        $length = strlen($str);
>        $newStr = '';
>        for ($j = 0; $j < $length; $j++) {
>                $newStr .= $str[$j]; // byte offset; {} offsets were removed in PHP 8
>        }
> }
> $end = microtime(true);
> echo "normal iteration took " . ($end - $start) . "\n";
>
> // mb_substr method
> $start = microtime(true);
> for ($i = 0; $i < $repeats; $i++) {
>        $length = mb_strlen($str);
>        $newStr = '';
>        $rest = $str;
>        for ($j = 0; $j < $length; $j++) {
>                $newStr .= mb_substr($rest, $j, 1);
>        }
> }
> $end = microtime(true);
> echo "mb_substr method took " . ($end - $start) . "\n";
>
> // mb_substr method, shortening string
> $start = microtime(true);
> for ($i = 0; $i < $repeats; $i++) {
>        $length = mb_strlen($str);
>        $newStr = '';
>        $rest = $str;
>        // compare against '' so a remaining "0" does not end the loop early
>        while ($rest !== '') {
>                $newStr .= mb_substr($rest, 0, 1);
>                $rest = mb_substr($rest, 1);
>        }
> }
> $end = microtime(true);
> echo "mb_substr method with shortening the string took " . ($end - $start) . "\n";
>
> // preg_split method
> $start = microtime(true);
> for ($i = 0; $i < $repeats; $i++) {
>        $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
>        $length = count($chars);
>        $newStr = '';
>        for ($j = 0; $j < $length; $j++) {
>                $newStr .= $chars[$j]; // concatenate; += would do numeric addition
>        }
> }
> $end = microtime(true);
> echo "preg_split method took " . ($end - $start) . "\n";
>