[nycphp-talk] PCRE expression for tokenizing?

Dan Cech dcech at phpwerx.net
Mon Jul 21 18:45:42 EDT 2008


Michael B Allen wrote:
> On Mon, Jul 21, 2008 at 6:08 PM, Dan Cech <dcech at phpwerx.net> wrote:
>> Michael B Allen wrote:
> So is there any way to say "capture anything that didn't match" (aside
> from creating a sub-expression that explicitly excludes all of the
> tokens)?

AFAIK no, but you could probably do something like:

preg_match('@^(.*?)(~|\*\*|//|=====|====|===|==|=|$)@',$string,$m);

That would give you everything before the first token (or the rest of
the string if there are no more tokens) in $m[1], and the first token
(or an empty string if there are no more tokens) in $m[2].
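
A rough sketch of how that match could be run in a loop to walk the whole
string. The function name, the ['text'|'token', value] output shape, and
the added "sD" modifiers are assumptions for illustration, not something
from the thread: "s" lets . cross newlines and "D" makes $ match only at
the very end of the string, so each pass consumes at least one character.

```php
<?php
// Hypothetical helper: peel off "text, then token" pairs from the front
// of the string until nothing is left.
function tokenize_iteratively($string)
{
    // Same alternation as above; the s and D modifiers are additions.
    $pattern = '@^(.*?)(~|\*\*|//|=====|====|===|==|=|$)@sD';
    $out = [];
    while ($string !== '') {
        // Stop on no match or an empty match, so the loop cannot spin.
        if (!preg_match($pattern, $string, $m) || $m[0] === '') {
            break;
        }
        if ($m[1] !== '') {
            $out[] = ['text', $m[1]];   // everything before the token
        }
        if ($m[2] !== '') {
            $out[] = ['token', $m[2]];  // the token that matched
        }
        // Drop the consumed prefix and continue with the remainder.
        $string = (string) substr($string, strlen($m[0]));
    }
    return $out;
}
```

Each ['token', ...] entry tells a following state machine exactly which
token matched, and longer runs of = are tried before shorter ones because
of the order of the alternation.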

>>> Can someone recommend a good PCRE expression for tokenizing like this?
>> If you want to end up with everything in an array, you might want to look at
>> preg_split with the PREG_SPLIT_DELIM_CAPTURE argument.
>>
>> Something like:
>>
>> $tokens =
>> preg_split('@(~|\*\*|//|=====|====|===|==|=)@',$string,-1,PREG_SPLIT_DELIM_CAPTURE);
> 
> No, I need the tokens (or rather I need to know which token matched)
> for the state-machine that follows so preg_split probably isn't going
> to do the trick.

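That's what the PREG_SPLIT_DELIM_CAPTURE flag does: it returns the
delimiters along with the text between them.  Note that it goes in the
fourth argument of preg_split (the third is the limit, so pass -1).  You
can then iterate over the returned array, and each element will be
either a token or a piece of text.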
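
A minimal sketch of that iteration. The function name, the $known list
(copied from the pattern) and the ['text'|'token', value] output shape
are assumptions for illustration; PREG_SPLIT_NO_EMPTY is added so that
adjacent tokens don't produce empty text elements.

```php
<?php
// Hypothetical helper: split on the tokens and label each piece.
function tokenize_with_split($string)
{
    $pattern = '@(~|\*\*|//|=====|====|===|==|=)@';
    // Third argument is the limit (-1 = no limit); flags are the fourth.
    $parts = preg_split(
        $pattern,
        $string,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    // With NO_EMPTY the odd/even positions are no longer reliable, so
    // each piece is classified by comparing it to the known tokens.
    $known = ['~', '**', '//', '=====', '====', '===', '==', '='];
    $out = [];
    foreach ($parts as $part) {
        $kind = in_array($part, $known, true) ? 'token' : 'text';
        $out[] = [$kind, $part];
    }
    return $out;
}
```

A text piece can never be mistaken for a token here, because any
substring equal to a token would itself have been matched as a delimiter.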

An iterative preg_match setup will most likely be more memory efficient 
but also slower.

Dan


