[nycphp-talk] Regex for P Elements

justin justin at justinhileman.info
Wed Jan 12 10:00:49 EST 2011


On Wed, Jan 12, 2011 at 8:24 AM, Randal Rust <randalrust at gmail.com> wrote:
> I am admittedly not very good with regular expressions. I am trying to
> pull all of the paragraphs out of an article, so that I can create
> inline links. Here is my script:
>
> $blockpattern='/<p*[^>]*>.*?<\/p>/';
> $blocks=preg_match_all($blockpattern, $txt, $blockmatches);
>


You really don't want the * after that first p, because this:

    /<p*[^>]*>/

Means, essentially, "match a `<` character, then any number of `p`
characters (including zero), then any run of characters that aren't
`>`, then a `>`". Since zero `p`s is allowed, this regex will match
every `<...>` sequence, i.e. every opening or closing HTML tag in your
document, not just the paragraph tags.
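
To see this for yourself, here is a quick sketch. The sample markup in
`$html` is made up for illustration, it is not from your article:

```php
<?php
// Made-up sample markup: one paragraph, one <pre>, wrapped in a <div>.
$html = '<div><p class="x">Hi</p><pre>code</pre></div>';

// The original opening-tag pattern. `p*` allows zero p's, so any tag qualifies.
$count = preg_match_all('/<p*[^>]*>/', $html, $m);

// $m[0] holds every full match; all six tags show up, closing tags included.
print_r($m[0]);
```

Every tag in the string comes back as a match, which is why the `*`
has to go.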

Dropping the first * will get you closer:

    /<p[^>]*>/

But that's still not right, as it'll get false positives on `<pre>`
and `<param>` tags. Instead use this:

    /<p(\s+[^>]*)?>/

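Which only matches that "a bunch of things that aren't `>`" if there's
whitespace (`\s` covers spaces, tabs and newlines) between the `p` and
whatever comes next. The whole group is optional, so a bare `<p>` still
matches.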
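
A small comparison of the two opening-tag patterns, as a sketch. The
tags in `$tags` and the variable names `$loose` and `$strict` are just
picked for the example:

```php
<?php
// Example opening tags: two real paragraphs and two look-alikes.
$tags = ['<p>', '<p class="lead">', '<pre>', '<param name="x">'];

$loose  = '/<p[^>]*>/';        // anything starting with "<p"
$strict = '/<p(\s+[^>]*)?>/';  // "<p", then optionally whitespace plus attributes

// preg_grep() returns the array entries that match the pattern.
$looseHits  = array_values(preg_grep($loose, $tags));
$strictHits = array_values(preg_grep($strict, $tags));

print_r($looseHits);   // all four tags
print_r($strictHits);  // only the two <p> tags
```

The loose pattern accepts `<pre>` and `<param ...>`, while the strict
one keeps only the two paragraph tags.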

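The second half of your regex is right, but it does have the newline
problem you mentioned: by default `.` doesn't match newline characters,
so any paragraph that spans more than one line is skipped. To get `.`
to match newlines as well, use the `dotall` flag (PCRE_DOTALL) by
adding `s` after the final slash: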

    /<p(\s+[^>]*)?>.*?<\/p>/s

So that leaves us with:

    $blockpattern = '/<p(\s+[^>]*)?>.*?<\/p>/s';
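
Here is a rough end-to-end sketch using your variable names. The
contents of `$txt` are invented for the example, and `$withoutS` and
`$ignored` are names I made up to show what happens when the `s` flag
is left off:

```php
<?php
// Invented article text; the second paragraph spans two lines.
$txt = "<p>First paragraph.</p>\n"
     . "<pre>not a paragraph</pre>\n"
     . "<p class=\"note\">Second\nparagraph.</p>";

$blockpattern = '/<p(\s+[^>]*)?>.*?<\/p>/s';
$blocks = preg_match_all($blockpattern, $txt, $blockmatches);

// Same pattern without the `s` flag, for comparison: `.` stops at the
// newline, so the two-line paragraph is missed.
$withoutS = preg_match_all('/<p(\s+[^>]*)?>.*?<\/p>/', $txt, $ignored);

echo "with s: $blocks, without s: $withoutS\n";

// $blockmatches[0] holds the full paragraphs; $blockmatches[1] holds
// whatever the optional attribute group captured.
print_r($blockmatches[0]);
```

With the flag both paragraphs are found and the `<pre>` block is left
alone. Without it only the single-line paragraph matches.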

-- 
http://justinhileman.com


