[nycphp-talk] PCRE expression for tokenizing?

Michael B Allen ioplex at gmail.com
Mon Jul 21 18:32:02 EDT 2008


On Mon, Jul 21, 2008 at 6:08 PM, Dan Cech <dcech at phpwerx.net> wrote:
> Michael B Allen wrote:
>>
>> I'm trying to write a Wiki syntax tokenizer using preg_match. Meaning I
>> want to match any token like '~', '**', '//', '=====', ... etc but if
>> none of those tokens match I want to match any valid printable string.
>>
>> The expression I have so far is the following:
>>
>>  @(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([[:print:]]*)@
>>
>> The problem with this is that the [[:print:]] class matches the entire
>> input. Strangely if I use [a-zA-Z0-9 ]* instead it works (but of
>> course I want to support more than ASCII and a space).
>
> The reason for this is that your token characters are included in
> [[:print:]] but not in [a-zA-Z0-9 ].

Oh, yeah. Duh.

So is there any way to say "capture anything that didn't match" (aside
from creating a sub-expression that explicitly excludes all of the
tokens)?
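
One possible way to approximate "anything that didn't match" is a negative lookahead in front of each printable character, so the text alternative stops where a token begins. This is only a sketch: the lookahead list is an assumption that mirrors the tokens in the expression quoted in this thread, and the variable names are illustrative.

```php
<?php
// Sketch only: the last alternative consumes printable characters one at a
// time, but refuses to start on a position where a markup token begins.
$pattern = '@(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)'
         . '|((?:(?!~|\*\*|//|=)[[:print:]])+)@';

$input = 'The **fox** jumped //over// the fence';

// First call at offset 0: the text alternative (group 9) stops before '**'.
preg_match($pattern, $input, $m, 0, 0);

// Second call at the advanced offset: the '**' alternative (group 2) matches.
preg_match($pattern, $input, $m2, 0, strlen($m[0]));
```

A lone '*' or '/' that is not part of a token is still treated as text, because the lookahead only rejects the full two-character tokens.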

>> Meaning given the input:
>>
>>  [The **fox** jumped //over// the fence]
>>
>> I want each call to preg_match to return tokens (while advancing the
>> offset accordingly of course):
>>
>>  [The ]
>>  [**]
>>  [fox]
>>  [**]
>>  [ jumped ]
>>  [//]
>>  [over]
>>  [//]
>>  [ the fence]
>>
>> Can someone recommend a good PCRE expression for tokenizing like this?
>
> If you want to end up with everything in an array, you might want to look at
> preg_split with the PREG_SPLIT_DELIM_CAPTURE argument.
>
> Something like:
>
> $tokens = preg_split('@(~|\*\*|//|=====|====|===|==|=)@', $string, -1,
>     PREG_SPLIT_DELIM_CAPTURE);

No, I need the tokens (or rather I need to know which token matched)
for the state-machine that follows so preg_split probably isn't going
to do the trick.
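
For a state machine that needs to know which token matched, a loop along these lines might work. It is a hedged sketch, not a tested Wiki parser: the function name `wiki_tokens`, the group labels, and the collapsing of the five heading alternatives into `={1,5}` are all assumptions made for illustration. `\G` anchors each match at the current offset so no input is silently skipped.

```php
<?php
// Hypothetical helper: returns a list of [type, text] pairs.
function wiki_tokens(string $input): array
{
    // Named groups identify the token; the 'text' group uses a negative
    // lookahead so it stops where a markup token begins.
    $pattern = '@\G(?:(?<tilde>~)|(?<bold>\*\*)|(?<italic>//)|(?<heading>={1,5})'
             . '|(?<text>(?:(?!~|\*\*|//|=)[[:print:]])+))@';
    $types  = ['tilde', 'bold', 'italic', 'heading', 'text'];
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    // Stops at end of input, or at the first character nothing can match
    // (for example a newline, which is not in [[:print:]]).
    while ($offset < $length && preg_match($pattern, $input, $m, 0, $offset)) {
        foreach ($types as $type) {
            // Groups that did not participate are empty or absent.
            if (isset($m[$type]) && $m[$type] !== '') {
                $tokens[] = [$type, $m[$type]];
                break;
            }
        }
        $offset += strlen($m[0]); // advance past the token just consumed
    }
    return $tokens;
}

$tokens = wiki_tokens('The **fox** jumped //over// the fence');
```

For the sample input in this thread, the token texts should come out in the same order as the list quoted above, each paired with a label the state machine can switch on.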

Mike

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/


