perlrecharclass - Perl Regular Expression Character Classes
The top level documentation about Perl regular expressions is found in perlre.
This manual page discusses the syntax and use of character classes in Perl regular expressions.
A character class is a way of denoting a set of characters such that one character of the set is matched. It is important to remember that matching a character class consumes exactly one character in the source string. (The source string is the string the regular expression is matched against.)
There are three types of character classes in Perl regular expressions: the dot, backslash sequences, and the form enclosed in square brackets. Keep in mind, though, that often the term "character class" is used to mean just the bracketed form. Certainly, most Perl documentation does that.
The dot (or period), ., is probably the most used, and certainly the best-known, character class. By default, a dot matches any character except the newline. That default can be changed so that the dot also matches the newline by using the single line modifier: for the entire regular expression with the /s modifier, or locally with (?s) (and even globally within the scope of use re '/s'). (The \N backslash sequence, described below, matches any character except newline without regard to the single line modifier.)
Here are some examples:
    "a"  =~ /./       # Match
    "."  =~ /./       # Match
    ""   =~ /./       # No match (dot has to match a character)
    "\n" =~ /./       # No match (dot does not match a newline)
    "\n" =~ /./s      # Match (global 'single line' modifier)
    "\n" =~ /(?s:.)/  # Match (local 'single line' modifier)
    "ab" =~ /^.$/     # No match (dot matches one character)
A backslash sequence is a sequence of characters, the first one of which is a backslash. Perl ascribes special meaning to many such sequences, and some of these are character classes. That is, they match a single character each, provided that the character belongs to the specific set of characters defined by the sequence.
Here's a list of the backslash sequences that are character classes. They are discussed in more detail below. (For the backslash sequences that aren't character classes, see perlrebackslash.)
    \d             Match a decimal digit character.
    \D             Match a non-decimal-digit character.
    \w             Match a "word" character.
    \W             Match a non-"word" character.
    \s             Match a whitespace character.
    \S             Match a non-whitespace character.
    \h             Match a horizontal whitespace character.
    \H             Match a character that isn't horizontal whitespace.
    \v             Match a vertical whitespace character.
    \V             Match a character that isn't vertical whitespace.
    \N             Match a character that isn't a newline.
    \pP, \p{Prop}  Match a character that has the given Unicode property.
    \PP, \P{Prop}  Match a character that doesn't have the Unicode property.
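As a rough illustration of the list above, the following sketch reports which of the backslash classes match a given single character. The @classes table and the classes_for() helper are hypothetical names introduced only for this example, and the code assumes a Perl recent enough to support \h, \v, and \N (v5.12 or later).

```perl
use strict;
use warnings;

# Hypothetical lookup table: a label for each class plus a compiled pattern.
my @classes = (
    [ '\d' => qr/\d/ ], [ '\D' => qr/\D/ ],
    [ '\w' => qr/\w/ ], [ '\W' => qr/\W/ ],
    [ '\s' => qr/\s/ ], [ '\S' => qr/\S/ ],
    [ '\h' => qr/\h/ ], [ '\H' => qr/\H/ ],
    [ '\v' => qr/\v/ ], [ '\V' => qr/\V/ ],
    [ '\N' => qr/\N/ ],
);

# Return the labels of every class that matches the single character $ch.
sub classes_for {
    my ($ch) = @_;
    return join ' ', map { $_->[0] } grep { $ch =~ $_->[1] } @classes;
}

print classes_for("7"),  "\n";   # \d \w \S \H \V \N
print classes_for("\t"), "\n";   # \D \W \s \h \V \N
print classes_for("\n"), "\n";   # \D \W \s \H \v
```

Each character matches exactly one class of every complementary pair, except that a newline matches neither \N nor any class that would exclude it from vertical whitespace.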
\N, available starting in v5.12, like the dot, matches any character that is not a newline. The difference is that \N is not influenced by the single line regular expression modifier (see "The dot" above). Note that the form \N{...} may mean something completely different. When the {...} is a quantifier, it means to match a non-newline character that many times. For example, \N{3} means to match 3 non-newlines; \N{5,} means to match 5 or more non-newlines. But if {...} is not a legal quantifier, it is presumed to be a named character. See charnames for those. For example, none of \N{COLON}, \N{4F}, and \N{F4} contain legal quantifiers, so Perl will try to find characters whose names are respectively COLON, 4F, and F4.
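The three readings of \N described above can be sketched as follows. The variable names are illustrative only, and the explicit use charnames line is an assumption made so that the example also works on Perls that do not load character names automatically.

```perl
use strict;
use warnings;
use charnames ':full';   # assumed here for older Perls; newer ones load it on demand

# \N ignores the single line modifier: it never matches a newline.
my $dot_s = "\n" =~ /./s  ? 1 : 0;
my $n_s   = "\n" =~ /\N/s ? 1 : 0;

# {3} is a legal quantifier, so \N{3} means "three non-newlines".
my $three_ok  = "abc"   =~ /^\N{3}\z/ ? 1 : 0;
my $three_bad = "ab\nc" =~ /^\N{3}/   ? 1 : 0;

# {COLON} is not a quantifier, so \N{COLON} is the character named COLON.
my $colon = "a:b" =~ /^a\N{COLON}b\z/ ? 1 : 0;

print "dot_s=$dot_s n_s=$n_s three_ok=$three_ok three_bad=$three_bad colon=$colon\n";
# prints: dot_s=1 n_s=0 three_ok=1 three_bad=0 colon=1
```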
\d matches a single character considered to be a decimal digit. If the /a regular expression modifier is in effect, it matches [0-9]. Otherwise, it matches anything that is matched by \p{Digit}, which includes [0-9]. (An unlikely possible exception is that under locale matching rules, the current locale might not have [0-9] matched by \d, and/or might match other characters whose code point is less than 256. The only such locale definitions that are legal would be to match [0-9] plus another set of 10 consecutive digit characters; anything else would be in violation of the C language standard, but Perl doesn't currently assume anything in regard to this.)
What this means is that unless the /a modifier is in effect, \d matches not only the digits '0' through '9', but also Arabic, Devanagari, and digits from other writing systems. This may cause some confusion, and some security issues.
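A minimal sketch of this difference, assuming Perl v5.14 or later for the /a modifier; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $ascii      = "7";
my $devanagari = "\x{0969}";   # DEVANAGARI DIGIT THREE

# Default rules: \d matches decimal digits from any script.
my $d_ascii = $ascii      =~ /\d/ ? 1 : 0;
my $d_deva  = $devanagari =~ /\d/ ? 1 : 0;

# Under /a, \d is restricted to [0-9].
my $a_ascii = $ascii      =~ /\d/a ? 1 : 0;
my $a_deva  = $devanagari =~ /\d/a ? 1 : 0;

print "default: ascii=$d_ascii devanagari=$d_deva\n";   # default: ascii=1 devanagari=1
print "with /a: ascii=$a_ascii devanagari=$a_deva\n";   # with /a: ascii=1 devanagari=0
```

Code that must accept only ASCII digits can therefore either add /a or spell out [0-9] explicitly.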
Some digits that \d matches look like some of the [0-9] ones, but have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks very much like an ASCII DIGIT EIGHT (U+0038), and LEPCHA DIGIT SIX (U+1C46) looks very much like an ASCII DIGIT FIVE (U+0035). An application that is expecting only the ASCII digits might be misled, or if the match is \d+, the matched string might contain a mixture of digits from different writing systems that look like they signify a number different from the one they actually do. The num() function in Unicode::UCD can be used to safely calculate the value, returning undef if the input string contains such a mixture. Otherwise, for example, a displayed price might be deliberately different from what it appears to be.
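The following sketch shows how num() from Unicode::UCD might be used to reject a mixed-script digit string that \d+ alone would accept. The sample strings and variable names are made up for illustration, and the code assumes a Perl whose Unicode::UCD provides num() (v5.14 or later).

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $ascii = "123";
my $deva  = "\x{0967}\x{0968}\x{0969}";   # Devanagari digits one, two, three
my $mixed = "1\x{0968}3";                 # ASCII, Devanagari, ASCII

# All three strings consist solely of \d characters under the default rules ...
my $matching = grep { /^\d+\z/ } $ascii, $deva, $mixed;

# ... but num() yields a value only when the digits come from one script.
my $ascii_val = num($ascii);
my $deva_val  = num($deva);
my $mixed_val = num($mixed);

print "strings matching \\d+: $matching\n";
print "ascii=$ascii_val devanagari=$deva_val mixed=",
      ( defined $mixed_val ? $mixed_val : 'undef' ), "\n";
```

Checking defined-ness of the return value, rather than truth, matters here because a legitimate result of 0 is false in Perl.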
What \p{Digit} means (and hence \d except under the /a modifier) is \p{General_Category=Decimal_Number}, or synonymously, \p{General_Category=Digit}. Starting with Unicode version 4.1, this is the same set of characters matched by \p{Numeric_Type=Decimal}. But Unicode also has a different property with a similar name, \p{Numeric_Type=Digit}, which matches a completely different set of characters. These characters are things such as CIRCLED DIGIT ONE or subscripts, or are from writing systems that lack all ten digits.
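To make the distinction between the two similarly named properties concrete, here is a small sketch that tests an ASCII digit and CIRCLED DIGIT ONE against each. The %result hash and its keys are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

my $five    = "5";
my $circled = "\x{2460}";   # CIRCLED DIGIT ONE

# Record, as 1 or 0, whether each character matches each property.
my %result = (
    five_decimal    => ( $five    =~ /\p{Numeric_Type=Decimal}/ ? 1 : 0 ),
    five_digit      => ( $five    =~ /\p{Numeric_Type=Digit}/   ? 1 : 0 ),
    circled_decimal => ( $circled =~ /\p{Numeric_Type=Decimal}/ ? 1 : 0 ),
    circled_digit   => ( $circled =~ /\p{Numeric_Type=Digit}/   ? 1 : 0 ),
    circled_d       => ( $circled =~ /\d/                       ? 1 : 0 ),
);

print "$_=$result{$_}\n" for sort keys %result;
```

Only the ASCII digit has Numeric_Type=Decimal, and only the circled digit has Numeric_Type=Digit, so \d matches the former and not the latter.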
The design intent is for \d to exactly match the set of characters that can safely be used with "normal" big-endian positional decimal syntax, where, for example, 123 means one 'hundred', plus two 'tens', plus three 'ones'. This positional notation does not necessarily apply to characters that match the other type of "digit", \p{Numeric_Type=Digit}, and so \d doesn't match them.
The Tamil digits (U+0BE6 - U+0BEF) can also legally be used in old-style Tamil numbers in which they would appear no more than one in a row, separated by characters that mean "times 10", "times 100", etc. (See https://www.unicode.org/notes/tn21.)
Any character not matched by \d is matched by \D.
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit); or a connecting punctuation character, such as an underscore ("_"); or a "mark" character (like some sort of accent) that attaches to one of those. It does not match a whole word. To match a whole word, use \w+. This isn't the same thing as matching an English word, but in the ASCII range it is the same as a string of Perl-identifier characters.
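The difference between \w and \w+ can be sketched as below; the sample text and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "foo_bar baz-qux 42";

# A single \w consumes exactly one character ...
my ($first) = $text =~ /(\w)/;

# ... while \w+ grabs each maximal run of word characters.
my @words = $text =~ /(\w+)/g;

print "first=$first\n";     # first=f
print "words=@words\n";     # words=foo_bar baz qux 42
```

The underscore keeps foo_bar together, while the hyphen, which is not a word character, splits baz-qux in two.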
If the /a modifier is in effect, \w matches the 63 characters [a-zA-Z0-9_].
Otherwise, for code points above 255, \w matches the same as \p{Word} matches in this range. That is, it matches Thai letters, Greek letters, etc. This includes connector punctuation (like the underscore), which connects two words together, as well as diacritics, such as a COMBINING TILDE, and the modifier letters, which are generally used to add auxiliary markings to letters.
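A brief sketch of both behaviours, assuming Perl v5.14 or later for the /a modifier. The Greek letter and combining mark are sample characters picked for this example, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Under /a, exactly the 63 characters [a-zA-Z0-9_] match \w.
my $ascii_word_count = grep { chr($_) =~ /\w/a } 0 .. 255;

my $alpha = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $tilde = "\x{0303}";   # COMBINING TILDE

# Without /a, letters and marks outside ASCII are word characters too.
my $alpha_w = $alpha =~ /\w/ ? 1 : 0;
my $tilde_w = $tilde =~ /\w/ ? 1 : 0;

# With /a, the same characters no longer match.
my $alpha_wa = $alpha =~ /\w/a ? 1 : 0;
my $tilde_wa = $tilde =~ /\w/a ? 1 : 0;

print "ascii_word_count=$ascii_word_count\n";            # ascii_word_count=63
print "default: alpha=$alpha_w tilde=$tilde_w\n";        # default: alpha=1 tilde=1
print "with /a: alpha=$alpha_wa tilde=$tilde_wa\n";      # with /a: alpha=0 tilde=0
```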