PHP Recursive Patterns
tl;dr; This post deals with writing very simple parsers for nested data using recursive regular expressions.
There is a very good reason why regular expressions are called that – they are regular. You can get a lot done with them, but if your data has a complex structure, you’d often be advised to use a parser. A good example is processing data where parentheses can be nested. A simple regular expression:
fails quickly when a nested (
)
pair is encountered.
PCRE (the regular expressions engine behind preg_*
functions) has support for dealing with those cases where you need to recurse and repeat the pattern.
For example, a very simple CSS parser would need to balance opening {
and closing }
, i.e., taking into account @media
queries also enclosed by a {..}
pair. Let’s assume this document:
A non-greedy regular expression like /{.*?}/
would fail as it exits as soon as a closing }
is encountered resulting in the following captured groups:
To deal with balanced pairs, we need a way to descend into a pair and repeat the pattern. In pseudo Basic-like regular expressions code this would mean:
10: expect an opening '{'
20: read until
21: '{' or '}' is encountered
22: OR end of data, goto 50.
30: if '{', start over; goto 10.
40: if '}', goto 50.
50: expect a balanced closing '}'
PCRE supports (?R)
which does exactly what is illustrated above: it repeats the whole pattern recursively.
Let’s go back to the non-greedy pattern (and the sample CSS document):
and modify it so it starts a new group for each nested pair:
Let’s convert this to an inline pattern and run it against our sample CSS document:
This is a great start to a simple CSS parser. You can now iterate over the results and run the expression again until you get a flattened list of all the properties.
Note again (?R)
repeats the whole pattern. If you want to match all @media
queries for example, you’d need to make sure the group is optional:
Why not a parser instead?
Parsers can be much more complex than a one-line regular expression. You’d most likely also need to include a dependency in your project.
If all you need is a simple solution then I say try and use recursive regular expressions first. I have been hacking on a tool to merge @media
queries produced by Sass and I got the job done with no complex parsers or dependencies involved.