Regular Expression Metacharacters

This table is from the website

Metacharacter Description
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
^ Matches the position at the beginning of the input string.
$ Matches the position at the end of the input string.
* Matches the preceding subexpression zero or more times.
+ Matches the preceding subexpression one or more times.
? Matches the preceding subexpression zero or one time.
{n} Matches exactly n times, where n is a non-negative integer.
{n,} Matches at least n times, where n is a non-negative integer.
{n,m} Matches at least n and at most m times, where m and n are non-negative integers and n <= m.
. Matches any single character except “\n”.
[xyz] A character set. Matches any one of the enclosed characters.
x|y Matches either x or y.
[^xyz] A negative character set. Matches any character not enclosed.
[a-z] A range of characters. Matches any character in the specified range.
[^a-z] A negative range of characters. Matches any character not in the specified range.
\b Matches a word boundary, that is, the position between a word and a space.
\B Matches a nonword boundary. ‘er\B’ matches the ‘er’ in “verb” but not the ‘er’ in “never”.
\d Matches a digit character.
\D Matches a non-digit character.
\f Matches a form-feed character.
\n Matches a newline character.
\r Matches a carriage return character.
\s Matches any whitespace character including space, tab, form-feed, etc.
\S Matches any non-whitespace character.
\t Matches a tab character.
\v Matches a vertical tab character.
\w Matches any word character including underscore.
\W Matches any non-word character.
\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol.
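
A few of the entries above can be tried out directly. The following is a minimal sketch in Python, assuming its standard re module; the table itself is not tied to a particular regex flavor, and the sample strings and variable names are made up for illustration.

```python
import re

# \d+ : one or more digit characters
digits = re.findall(r"\d+", "room 101, floor 7")

# \B : nonword boundary; 'er' must be followed by another word character
verb = re.search(r"er\B", "verb")      # 'er' is followed by 'b', so it matches
never = re.search(r"er\B", "never")    # 'er' ends the word, so no match

# [^a-z] : any character outside the range a-z
non_lower = re.findall(r"[^a-z]", "abC1d")

# x|y : alternation
pets = re.findall(r"cat|dog", "cat, dog, cow")

# {n,m} : at least n and at most m repetitions
two_or_three = re.fullmatch(r"a{2,3}", "aaa")
too_many = re.fullmatch(r"a{2,3}", "aaaa")

# . : any single character except a newline
no_newline = re.findall(r".", "a\nb")

# \un : a character given as four hexadecimal digits
copyright_sign = re.search(r"\u00A9", "\u00A9 2024")

print(digits, non_lower, pets, no_newline)
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslashes before the regex engine sees them.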

The 12 punctuation characters that make regular expressions work their magic are called metacharacters. If you want your regex to match them literally, you need to escape them by placing a backslash in front of them. Thus, the regex ‹\$\(\)\*\+\.\?\[\\\^\{\|› matches the text $()*+.?[\^{|.

Notably absent from the list are the closing square bracket ], the hyphen -, and the closing curly bracket }. The first two become metacharacters only after an unescaped [, and the } only after an unescaped {. There’s no need to ever escape }.
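
The escaping rules described above can be checked with a short sketch. It assumes Python's re module, which is only one of many flavors; the variable names are illustrative.

```python
import re

# The twelve metacharacters, each escaped with a backslash.
pattern = r"\$\(\)\*\+\.\?\[\\\^\{\|"
text = "$()*+.?[\\^{|"   # the Python string $()*+.?[\^{| (one backslash)
all_twelve = re.fullmatch(pattern, text) is not None

# Outside a character class, ], - and } are ordinary characters here.
plain = re.fullmatch(r"a]-}", "a]-}") is not None

# Inside a character class the hyphen forms a range unless it is escaped.
as_range = re.findall(r"[a-z]", "a-mz")     # a through z
as_literal = re.findall(r"[a\-z]", "a-mz")  # only 'a', '-' and 'z'

print(all_twelve, plain, as_range, as_literal)
```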

Block Escape

Perl, PCRE and Java support the regex tokens ‹\Q› and ‹\E›. ‹\Q› suppresses the meaning of all metacharacters, including the backslash, until ‹\E›. If you omit ‹\E›, all characters after the ‹\Q› until the end of the regex are treated as literals. Though Java 4 and 5 support this feature, you should not use it. Bugs in the implementation cause regular expressions with ‹\Q⋯\E› to match different things from what you intended, and from what PCRE, Perl, or Java 6 would match. These bugs were fixed in Java 6, making it behave the same way as PCRE and Perl.
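
Not every flavor has ‹\Q› and ‹\E›. As a rough sketch of the same idea in a flavor without them: Python's re module rejects ‹\Q› as a bad escape in recent versions, and its re.escape function is the closest counterpart, escaping a whole string in one step. The sample text below is made up for illustration.

```python
import re

# \Q...\E is not part of Python's regex syntax; compiling it raises an error.
try:
    re.compile(r"\Q1+1=2\E")
    supports_block_escape = True
except re.error:
    supports_block_escape = False

# re.escape backslash-escapes every metacharacter in the string.
user_text = "1+1=2 (approx.)"
subject = "so 1+1=2 (approx.) holds"

escaped_match = re.search(re.escape(user_text), subject)

# Without escaping, +, ( ) and . keep their special meaning and the
# search no longer finds the literal text.
raw_match = re.search(user_text, subject)

print(supports_block_escape, escaped_match is not None, raw_match is None)
```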
