Finer points of PHP regular expressions
Even experienced programmers sometimes do not know everything about PHP regular expressions, and some very useful details are not really stressed in the official PHP manual. This article is intended for beginner PHP programmers: people who already know what regular expressions are but have not used them very often.
1. How to make PHP regular expressions ungreedy?
PHP regular expressions are "greedy" by default. This means that the quantifiers *, + and ? consume as many characters as possible. The quantifiers *, + and ? mean:
* means 0 or more repetitions of the preceding item, same as {0,}
+ means 1 or more repetitions of the preceding item, same as {1,}
? means 0 or 1 occurrence of the preceding item, same as {0,1}
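The equivalence between the short quantifiers and their brace forms can be checked directly. The following is only a small sketch: the patterns and the test strings ("ct", "cat", "caaat") are made up for illustration.

```php
<?php
// Each pair holds a short quantifier and its brace form.
$pairs = [
    ['/^ca*t$/', '/^ca{0,}t$/'],
    ['/^ca+t$/', '/^ca{1,}t$/'],
    ['/^ca?t$/', '/^ca{0,1}t$/'],
];

$allSame = true;
foreach (['ct', 'cat', 'caaat'] as $subject) {
    foreach ($pairs as [$short, $long]) {
        // preg_match() returns 1 on a match and 0 otherwise.
        $same = preg_match($short, $subject) === preg_match($long, $subject);
        $allSame = $allSame && $same;
        echo $short . ' vs ' . $long . ' on ' . $subject . ': '
            . ($same ? 'same result' : 'different result') . "\n";
    }
}
?>
```

Every line should report "same result". Note that the brace form is written without a space: {0,} and not {0, }.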
The "greediness" of the quantifiers *, +, ? can be illustrated by this example:
<?php
$string = "aaaaabbbbb";
preg_match("/^(.*)(b+)$/",$string,$matches);
echo "{$matches[1]}<br>{$matches[2]}";
?>
which would produce:
aaaaabbbb
b
As you see, the first capturing pattern (.*) has consumed 5 letters a and 4 letters b, i.e. as many characters as possible while still letting the rest of the pattern match. This is because it is "greedy".
To make the quantifiers *, +, ? "ungreedy", it is enough to put ? right after them. It means you have to use
*? instead of *,
+? instead of +,
?? instead of ?
So let us slightly modify the previous example:
<?php
$string = "aaaaabbbbb";
preg_match("/^(.*?)(b+)$/",$string,$matches);
echo "{$matches[1]}<br>{$matches[2]}";
?>
We only added ? after the * but now the example would produce:
aaaaa
bbbbb
This is because the first capturing pattern (.*?) is not "greedy" any more. Now it consumes as few characters as possible.
You could also make ALL the quantifiers in a regular expression "ungreedy" by using the U modifier. For instance, the following example
<?php
$string = "aaaaabbbbb";
preg_match("/^(.*)(b+)$/U",$string,$matches);
echo "{$matches[1]}<br>{$matches[2]}";
?>
would produce again:
aaaaa
bbbbb
Please be careful! The question mark ? changes the behavior of the quantifiers *, +, ? from "greedy" to "ungreedy" in a "greedy" regular expression. But the same question mark ? changes the behavior of the quantifiers *, +, ? from "ungreedy" to "greedy" in an "ungreedy" regular expression!
Let us illustrate this idea by the example:
<?php
$string = "aaaaabbbbb";
preg_match("/^(.*?)(b+)$/U",$string,$matches);
echo "{$matches[1]}<br>{$matches[2]}";
?>
It would produce:
aaaaabbbb
b
This is the same output as in the very first example of the article. Although we made the whole regular expression "ungreedy" by using the U modifier, the first capturing pattern (.*?) is still "greedy". It is "greedy" because adding a question mark ? to the quantifier * switched its behaviour from "ungreedy" back to "greedy".
In other words, a question mark ? after the quantifiers *, +, ? inverts their "greediness", whatever the default of the regular expression is.
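The four combinations discussed in this section can be put side by side. This is only a compact restatement of the examples above; the labels and the $results array are made up for illustration.

```php
<?php
$string = "aaaaabbbbb";

// Label => pattern. The patterns are single quoted, so "$" needs no special care.
$patterns = [
    '(.*)  without U' => '/^(.*)(b+)$/',
    '(.*?) without U' => '/^(.*?)(b+)$/',
    '(.*)  with U'    => '/^(.*)(b+)$/U',
    '(.*?) with U'    => '/^(.*?)(b+)$/U',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    preg_match($pattern, $string, $matches);
    $results[$label] = $matches[1] . ' | ' . $matches[2];
    echo $label . ' : ' . $results[$label] . "\n";
}
// (.*)  without U : aaaaabbbb | b
// (.*?) without U : aaaaa | bbbbb
// (.*)  with U    : aaaaa | bbbbb
// (.*?) with U    : aaaaabbbb | b
?>
```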
2. How to denote a backslash in a regular expression.
A very common task is to replace backslashes "\" in some string with forward slashes "/". Of course, if this is all that has to be done, using a regular expression would be overkill. You could do it with the function str_replace():
<?php
$string = "c:\somepath\somefile.php";
$string = str_replace('\\', '/', $string); // please notice that we use '\\', not '\'
echo $string;
?>
It would produce:
c:/somepath/somefile.php
Please notice that we used '\\', not '\' in str_replace(). '\' would produce a parse error. This is because the PHP parser would consider the second single quote in the string '\' escaped by the backslash "\". This is why we had to escape "\" with another "\".
Generally in single quoted strings we have only 2 characters which should be escaped to denote themselves. They are the single quote "'" and the backslash "\". Other characters in single quoted strings are not interpreted by the PHP parser. E.g. the single quoted string '\n' means 2 characters, \ and the letter n, and not a line break, as it would in a double quoted string.
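A quick way to check these quoting rules is to look at string lengths. The following is a minimal sketch; the variable names are made up for illustration.

```php
<?php
$single = strlen('\n');    // backslash + letter n
$double = strlen("\n");    // one line break character
$slash  = strlen('\\');    // one backslash
$quote  = strlen('\'');    // one single quote

var_dump($single);         // int(2)
var_dump($double);         // int(1)
var_dump($slash);          // int(1)
var_dump($quote);          // int(1)

// In single quotes, '\\n' and '\n' are the same two characters.
var_dump('\\n' === '\n');  // bool(true)
?>
```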
Still it could be useful to know how to replace backslashes "\" in strings with common slashes "/" with a regular expression.
The following code:
<?php
$string = "c:\somepath\somefile.php";
$string = preg_replace("/\/","/",$string);
echo $string;
?>
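would not work. The string "/\/" reaches the regular expression engine unchanged, and there \/ is read as an escaped slash, so the pattern has no closing delimiter: preg_replace() emits a "No ending delimiter" warning and returns NULL.
To replace "\" with "/" correctly, we have to use the following code: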
<?php
$string = "c:\somepath\somefile.php";
$string = preg_replace("/\\\\/","/",$string);
echo $string;
?>
It would produce:
c:/somepath/somefile.php
In this example we had to use 4 backslashes "\\\\" in the regular expression to denote just 1 backslash. This is because every backslash in a PHP string literal must be escaped by one more backslash, so we get 2 backslashes instead of 1. But each literal backslash in a regular expression must be escaped by another backslash too, so we get 4 backslashes.
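The two layers of escaping can be made visible by printing the pattern before using it. This is only a sketch; the "#" delimiter in the last call is an illustrative choice and is not required.

```php
<?php
$string = 'c:\somepath\somefile.php';

// Layer 1: PHP string parsing turns 4 backslashes into 2.
$pattern = "/\\\\/";
echo $pattern, "\n";                        // prints /\\/

// Layer 2: the regular expression engine reads \\ as one literal backslash.
$viaDouble = preg_replace($pattern, '/', $string);
echo $viaDouble, "\n";                      // prints c:/somepath/somefile.php

// A single quoted pattern needs the same 4 backslashes.
$viaSingle = preg_replace('/\\\\/', '/', $string);
echo $viaSingle, "\n";

// Another delimiter does not change the number of backslashes.
$viaHash = preg_replace('#\\\\#', '/', $string);
echo $viaHash, "\n";
?>
```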
3. How to match a variable name in a regular expression.
Sometimes it is necessary to match a variable name in a regular expression. Not the variable value, but the variable name. E.g. it could be necessary to match a string like this:
$string = "\$a";
or (the same):
$string = '$a';
Of course we could match it with a regular expression like this:
<?php
$string = '$a';
if (preg_match('/\$a/', $string)) {
    echo "matched";
} else {
    echo "not matched";
}
?>
This would produce "matched". We used the single quoted string '/\$a/' as the regular expression, so we had to place only 1 backslash before "$". The backslash is necessary because "$" has a special meaning in regular expressions (it denotes the end of a string).
But sometimes it could be necessary for us to use double quoted strings in regular expressions. In this case the code would look like this:
<?php
$string = '$a';
if (preg_match("/\\\$a/", $string)) {
    echo "matched";
} else {
    echo "not matched";
}
?>
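Here we have 3 backslashes before "$a". When the string is parsed by the PHP parser, "\\\$a" becomes \$a: the first two backslashes collapse into one, and "\$" becomes a literal "$" instead of starting the variable $a. So we still have /\$a/ as the regular expression pattern, like in the previous case.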
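When the text to match comes from a variable, counting backslashes by hand can be avoided with preg_quote(), which escapes regular expression metacharacters (and, optionally, the delimiter). The following is a small sketch; the variable names and the subject string are made up for illustration.

```php
<?php
$name = '$a';

// preg_quote() puts a backslash before "$"; the second argument names the delimiter.
$pattern = '/' . preg_quote($name, '/') . '/';
echo $pattern, "\n";            // prints /\$a/

$found = preg_match($pattern, 'the value of $a is 5');
var_dump($found);               // int(1)
?>
```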
4. How to use a binary zero in regular expressions.
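Sometimes we have to process binary data files with regular expressions. It is not unheard of to meet a binary zero in such data. In a double quoted PHP string, "\x00" is turned into a real zero byte before the pattern reaches the regular expression engine, and older PHP versions reject or truncate a pattern which contains a zero byte (a C-like string ends at the first zero byte). So to use a binary zero in a regular expression, we have to write it like this: "\\x00" in a double quoted string, or '\x00' in a single quoted one. I.e. we have to escape the backslash, so that the regular expression engine itself receives \x00 and interprets it as a zero byte.
5. How to use recursion in regular expressions.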
Recursion can be used to match a recursively repeated pattern in a string. The most common example of a recursive pattern is the nested parentheses problem. To understand recursion in PHP regular expressions, let us consider a simple example.
Imagine you have a pattern enclosed in correctly nested brackets somewhere inside the text. The nesting depth of the brackets could be unlimited. You'd like to capture this bracketed pattern.
You could do it like this:
<?php
$string = "some text (a(b(c)d)e) more text";
if (preg_match("/\(([^()]+|(?R))*\)/", $string, $matches)) {
    echo '<pre>' . print_r($matches, true) . '</pre>';
}
?>
This example produces:
Array ( [0] => (a(b(c)d)e) [1] => e )
As you see, the array element $matches[0] captures the piece of text we have been looking for (the piece of text enclosed in correctly nested brackets).
Let's consider how it works.
We made the regular expression recursive by adding (?R) to it. (?R) means a recursive call of the entire regular expression: at each level of recursion the regular expression engine behaves as if the entire regexp "\(([^()]+|(?R))*\)" were substituted in place of (?R).
So in our particular case almost the same result could be obtained by using the regular expression:
"/\(([^()]+|\(([^()]+|\(([^()]+)*\))*\))*\)/"
But it works only for up to 3 levels of nested brackets. So if we do not know the nesting depth in advance, we have to use
"/\(([^()]+|(?R))*\)/"
instead which allows us unlimited nesting depth and simplifies the regular expression syntax.
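The difference between the two patterns shows up as soon as the nesting is deeper than 3 levels. The following sketch compares them; the second subject string (4 levels deep) is made up for illustration.

```php
<?php
// The hand-unrolled pattern (3 levels) and the recursive one from the text.
$fixed     = '/\(([^()]+|\(([^()]+|\(([^()]+)*\))*\))*\)/';
$recursive = '/\(([^()]+|(?R))*\)/';

$subjects = [
    'x (a(b(c)d)e) y',       // 3 levels of nesting
    'x (a(b(c(d)c)b)a) y',   // 4 levels of nesting
];

$found = [];
foreach ($subjects as $subject) {
    preg_match($fixed, $subject, $f);
    preg_match($recursive, $subject, $r);
    $found[] = ['fixed' => $f[0], 'recursive' => $r[0]];
    echo 'fixed: ' . $f[0] . '   recursive: ' . $r[0] . "\n";
}
// fixed: (a(b(c)d)e)   recursive: (a(b(c)d)e)
// fixed: (b(c(d)c)b)   recursive: (a(b(c(d)c)b)a)
?>
```

For the deeper string the unrolled pattern cannot match from the outermost bracket, so it settles for the inner part that is only 3 levels deep, while the recursive pattern still captures the whole bracketed text.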
Let us check manually how the pattern "/\(([^()]+|(?R))*\)/" matches the string "(a(b(c)d)e)":
- "(c)" is matched by "\(([^()]+)*\)", that is by the whole pattern without ever taking the (?R) alternative.
This means the engine can use (?R) to match "(c)" at the outer levels of recursion.
- "(b(c)d)" is matched like this:
  - "(" is matched by "\("
  - "b" is matched by "[^()]+"
  - "(c)" is matched by (?R) (see above)
  - "d" is matched by "[^()]+"
  - ")" is matched by "\)"
So the string "(b(c)d)" is matched by the whole pattern.
This means the engine can use (?R) to match "(b(c)d)" at the outer level of recursion.
- "(a(b(c)d)e)" is matched like this:
  - "(" is matched by "\("
  - "a" is matched by "[^()]+"
  - "(b(c)d)" is matched by (?R) (see above)
  - "e" is matched by "[^()]+"
  - ")" is matched by "\)"
This explains why the second array element $matches[1] is equal to "e". At the outermost level the group ([^()]+|(?R)) is repeated three times, matching "a", "(b(c)d)" and "e" in turn, and only the value captured at the last iteration is saved in the array.
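The rule that a repeated capturing group keeps only its last iteration is not specific to recursion. A minimal sketch with a made-up pattern and subject string:

```php
<?php
// The group (\d) is repeated three times: "1", then "2", then "3".
preg_match('/(\d)+/', '123', $digits);

echo $digits[0], "\n";   // prints 123 (the whole match)
echo $digits[1], "\n";   // prints 3   (only the last iteration is kept)
?>
```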
If we want to capture only $matches[0], we could do it like:
<?php
$string = "some text (a(b(c)d)e) more text";
if (preg_match("/\((?:[^()]+|(?R))*\)/", $string, $matches)) {
    echo '<pre>' . print_r($matches, true) . '</pre>';
}
?>
which produces:
Array ( [0] => (a(b(c)d)e) )
Here we changed the capturing brackets "()" to non-capturing ones "(?:)".
Or we could do it even better:
<?php
$string = "some text (a(b(c)d)e) more text";
if (preg_match("/\((?>[^()]+|(?R))*\)/", $string, $matches)) {
    echo '<pre>' . print_r($matches, true) . '</pre>';
}
?>
which produces the same result:
Array ( [0] => (a(b(c)d)e) )
Here we used a so-called once-only pattern "(?>)" (which is also non-capturing) instead of the capturing brackets "()". Once the engine has matched a once-only pattern, it does not go back into it to try other alternatives. Using once-only patterns (where possible) is recommended by the PHP manual. Using them should make the regular expression faster.
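The difference between "(?:)" and "(?>)" can be seen on a small example. This is only a sketch; the patterns and the subject string "abc" are made up for illustration.

```php
<?php
$subject = 'abc';

// Non-capturing group: "bc" is tried first, the final "c" then fails,
// so the engine goes back into the group and tries "b" instead.
$plain = preg_match('/a(?:bc|b)c/', $subject);

// Once-only group: after "bc" has matched, the engine may not go back
// into the group, so the final "c" can never match.
$onceOnly = preg_match('/a(?>bc|b)c/', $subject);

var_dump($plain);      // int(1)
var_dump($onceOnly);   // int(0)
?>
```

Giving up those extra attempts is what saves time, but as the example shows it can also change what matches, so a once-only pattern is safe only where the discarded alternatives could not have led to a match anyway.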
Once-only patterns are quite simple, so I do not describe them in detail here. They are explained in the official PHP manual.
For the sake of simplicity, I did not use once-only patterns from the start in the examples.
If you'd like to learn more about Perl Compatible Regular Expressions (PCRE), see the PCRE chapter of the official PHP manual.