PHP Recursive Patterns

tl;dr; This post deals with writing very simple parsers for nested data using recursive regular expressions.

There is a very good reason why regular expressions are called that – they are regular. You can get a lot done with them, but if your data has a complex structure, you’d often be advised to use a parser. A good example is processing data where parentheses can be nested. A simple regular expression:

/\([^\)]+/

fails quickly when a nested ( ) pair is encountered.

<?php

$string = '(Hello (World!))';
preg_match_all('/\([^\)]+/', $string, $groups);
var_export($groups);

# array (
#   0 => array (
#     0 => '(Hello (World!',
#   ),
# )

PCRE (the regular expressions engine behind preg_* functions) has support for dealing with those cases where you need to recurse and repeat the pattern. For example, a very simple CSS parser would need to balance opening { and closing }, i.e., taking into account @media queries also enclosed by a {..} pair. Let’s assume this document:

body { color: #888; }

@media print { body { color: #333; } }

code { color: blue; }

A non-greedy regular expression like /{.*?}/ would fail as it exits as soon as a closing } is encountered resulting in the following captured groups:

array (
  0 => '{ color: #888; }',
  1 => '{ body { color: #333; }',  # NOTE: the first opening { is not balanced.
  2 => '{ color: blue; }',
)

To deal with balanced pairs, we need a way to descend into a pair and repeat the pattern. In pseudo Basic-like regular expressions code this would mean:

10: expect an opening '{'
20:   read until
21:     '{' or '}' is encountered
22:     OR end of data, goto 50.
30:   if '{', start over; goto 10.
40:   if '}', goto 50.
50: expect a balanced closing '}'

PCRE supports (?R) which does exactly what is illustrated above: it repeats the whole pattern recursively. Let’s go back to the non-greedy pattern (and the sample CSS document):

/{.*?}/

and modify it so it starts a new group for each nested pair:

/
  {           # find the first opening '{'.
    (?:       # start a new group, this is so '|' below does not apply/affect the opening '{'.
      [^{}]+  # skip ahead happily if no '{' or '}'.
      |       #   ...otherwise...
      (?R)    # we may be at the start of a new group, repeat whole pattern.
    )
    *         # nesting can be many levels deep.
  }           # finally, expect a balanced closing '}'
/

Let’s convert this to an inline pattern and run it against our sample CSS document:

<?php

$string = <<<CSS
body { color: #888; }

@media print { body { color: #333; } }

code { color: blue; }
CSS;

$pattern = '/{(?:[^{}]+|(?R))*}/';

preg_match_all($pattern, $string, $groups);
var_export($groups);

# array (
#   0 =>
#   array (
#     0 => '{ color: #888; }',
#     1 => '{ body { color: #333; } }',
#     2 => '{ color: blue; }',
#   ),
# )

This is a great start to a simple CSS parser. You can now iterate over the results and run the expression again until you get a flattened list of all the properties.

Note again (?R) repeats the whole pattern. If you want to match all @media queries for example, you’d need to make sure the group is optional:

<?php

# [...]

$pattern = '/(?:@media[^{]+)?'     # @media is optional, e.g., when we have descended into it.
         . '{(?:[^{}]+|(?R))*}/s';

Why not a parser instead?

Parsers can be much more complex than a one-line regular expression. You’d most likely also need to include a dependency in your project. If all you need is a simple solution then I say try and use recursive regular expressions first. I have been hacking on a tool to merge @media queries produced by Sass and I got the job done with no complex parsers or dependencies involved.