NYCPHP Meetup

Mon Jul 21 18:08:08 EDT 2008

Michael B Allen wrote:
> I'm trying to write a Wiki syntax tokenizer using preg_match. Meaning I
> want to match any token like '~', '**', '//', '=====', ... etc but if
> none of those tokens match I want to match any valid printable string.
> 
> The expression I have so far is the following:
> 
>   @(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([[:print:]]*)@
> 
> The problem with this is that the [[:print:]] class matches the entire
> input. Strangely if I use [a-zA-Z0-9 ]* instead it works (but of
> course I want to support more than ASCII and a space).

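The reason for this is that your token characters are included in 
[[:print:]] but not in [a-zA-Z0-9 ]. Nothing in the pattern stops the 
greedy [[:print:]]* at a token, so once it starts matching it runs 
straight through '**' and '//' to the end of the input.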
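
To see the difference, here is a small sketch that runs both versions of
your expression once from offset 0. The sample string is made up for
illustration, and the variable names are mine:

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$input = 'The **fox** jumped';

// Original pattern: '*' is a printable character, so the last
// alternative does not stop at '**' and takes the whole input.
preg_match('@(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([[:print:]]*)@',
           $input, $m);
$withPrint = $m[0];

// Narrower class: '*' is not in [a-zA-Z0-9 ], so the run of text
// ends just before the first '**'.
preg_match('@(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([a-zA-Z0-9 ]*)@',
           $input, $m);
$withAlnum = $m[0];

var_dump($withPrint, $withAlnum);
```

The first call returns the entire string and the second returns only
"The ", which matches what you observed.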

> Meaning given the input:
> 
>   [The **fox** jumped //over// the fence]
> 
> I want each call to preg_match to return tokens (while advancing the
> offset accordingly of course):
> 
>   [The ]
>   [**]
>   [fox]
>   [**]
>   [ jumped ]
>   [//]
>   [over]
>   [//]
>   [ the fence]
> 
> Can someone recommend a good PCRE expression for tokenizing like this?
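
If you would rather keep the preg_match-with-offset loop, one possible
approach is to let the catch-all branch consume one printable character
at a time, with a negative lookahead that refuses to step onto the start
of a token. This is only a sketch: the function name wiki_tokens is made
up, ={1,5} is my shorthand for your five '=' alternatives, and \G is
used to pin each match to the current offset.

```php
<?php
// Sketch only: wiki_tokens() is a hypothetical helper, not part of PHP.
function wiki_tokens($input)
{
    // Markup tokens; ={1,5} stands in for =====|====|===|==|=
    $token = '~|\*\*|//|={1,5}';

    // Either a token, or a run of printable characters in which no
    // position starts a token. \G anchors the match at $offset.
    $pattern = '@\G(?:(' . $token . ')|((?:(?!' . $token . ')[[:print:]])+))@';

    $tokens = array();
    $offset = 0;
    $length = strlen($input);

    // The loop stops early if it hits a character that is neither a
    // token nor printable (a newline, for example).
    while ($offset < $length
           && preg_match($pattern, $input, $m, 0, $offset)) {
        $tokens[] = $m[0];
        $offset += strlen($m[0]);
    }

    return $tokens;
}

print_r(wiki_tokens('The **fox** jumped //over// the fence'));
```

A lone '*' or '/' is treated as ordinary text here, since only the
doubled forms are listed as tokens. Note that [[:print:]] without the u
modifier works on bytes, so UTF-8 input would probably need /u and a
Unicode-aware class instead; I have not tried that variant.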

If you want to end up with everything in an array, you might want to 
look at preg_split with the PREG_SPLIT_DELIM_CAPTURE argument.

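Something like the following may do what you're after. Note that the 
flags are the fourth argument to preg_split, so the limit has to be 
passed explicitly (-1 means no limit):

$tokens = preg_split('@(~|\*\*|//|=====|====|===|==|=)@', $string, -1,
                     PREG_SPLIT_DELIM_CAPTURE);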
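
As a rough illustration, here is that idea run against your sample
input. The third argument of preg_split is the limit, so the flags go in
the fourth position. PREG_SPLIT_NO_EMPTY is my addition, on the
assumption that you do not want empty strings where two tokens sit next
to each other or where the input starts with a token:

```php
<?php
$string = 'The **fox** jumped //over// the fence';

// -1 means "no limit"; the capture group around the alternation is
// what makes PREG_SPLIT_DELIM_CAPTURE return the delimiters as well.
$tokens = preg_split(
    '@(~|\*\*|//|=====|====|===|==|=)@',
    $string,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($tokens);
```

This should give the nine pieces you listed, from "The " through
" the fence", with each '**' and '//' as its own element.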

Dan

[nycphp-talk] PCRE expression for tokenizing?