There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE
operator, the more recent SIMILAR TO
operator (added in SQL:1999), and POSIX-style regular expressions. Aside from the basic “does this string match this pattern?” operators, functions are available to extract or replace matching substrings and to split a string at matching locations.
If you have pattern matching needs that go beyond this, consider writing a user-defined function in Perl or Tcl.
While most regular-expression searches can be executed very quickly, regular expressions can be contrived that take arbitrary amounts of time and memory to process. Be wary of accepting regular-expression search patterns from hostile sources. If you must do so, it is advisable to impose a statement timeout.
Searches using SIMILAR TO
patterns have the same security hazards, since SIMILAR TO
provides many of the same capabilities as POSIX-style regular expressions.
LIKE
searches, being much simpler than the other two options, are safer to use with possibly-hostile pattern sources.
The pattern matching operators of all three kinds do not support nondeterministic collations. If required, apply a different collation to the expression to work around this limitation.
LIKE
string
LIKEpattern
[ESCAPEescape-character
]string
NOT LIKEpattern
[ESCAPEescape-character
]
The LIKE
expression returns true if the string
matches the supplied pattern
. (As expected, the NOT LIKE
expression returns false if LIKE
returns true, and vice versa. An equivalent expression is NOT (
.)string
LIKE pattern
)
If pattern
does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE
acts like the equals operator. An underscore (_
) in pattern
stands for (matches) any single character; a percent sign (%
) matches any sequence of zero or more characters.
Some examples:
'abc' LIKE 'abc' true 'abc' LIKE 'a%' true 'abc' LIKE '_b_' true 'abc' LIKE 'c' false
LIKE
pattern matching always covers the entire string. Therefore, if it's desired to match a sequence anywhere within a string, the pattern must start and end with a percent sign.
To match a literal underscore or percent sign without matching other characters, the respective character in pattern
must be preceded by the escape character. The default escape character is the backslash but a different one can be selected by using the ESCAPE
clause. To match the escape character itself, write two escape characters.
If you have standard_conforming_strings turned off, any backslashes you write in literal string constants will need to be doubled. See Section 4.1.2.1 for more information.
It's also possible to select no escape character by writing ESCAPE ''
. This effectively disables the escape mechanism, which makes it impossible to turn off the special meaning of underscore and percent signs in the pattern.
The key word ILIKE
can be used instead of LIKE
to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.
The operator ~~
is equivalent to LIKE
, and ~~*
corresponds to ILIKE
. There are also !~~
and !~~*
operators that represent NOT LIKE
and NOT ILIKE
, respectively. All of these operators are PostgreSQL-specific. You may see these operator names in EXPLAIN
output and similar places, since the parser actually translates LIKE
et al. to these operators.
The phrases LIKE
, ILIKE
, NOT LIKE
, and NOT ILIKE
are generally treated as operators in PostgreSQL syntax; for example they can be used in expression
operator
ANY (subquery
) constructs, although an ESCAPE
clause cannot be included there. In some obscure cases it may be necessary to use the underlying operator names instead.
There is also the prefix operator ^@
and corresponding starts_with
function which covers cases when only searching by beginning of the string is needed.
SIMILAR TO
Regular Expressionsstring
SIMILAR TOpattern
[ESCAPEescape-character
]string
NOT SIMILAR TOpattern
[ESCAPEescape-character
]
The SIMILAR TO
operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE
, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE
notation and common regular expression notation.
Like LIKE
, the SIMILAR TO
operator succeeds only if its pattern matches the entire string; this is unlike common regular expression behavior where the pattern can match any part of the string. Also like LIKE
, SIMILAR TO
uses _
and %
as wildcard characters denoting any single character and any string, respectively (these are comparable to .
and .*
in POSIX regular expressions).
In addition to these facilities borrowed from LIKE
, SIMILAR TO
supports these pattern-matching metacharacters borrowed from POSIX regular expressions:
|
denotes alternation (either of two alternatives).
*
denotes repetition of the previous item zero or more times.
+
denotes repetition of the previous item one or more times.
?
denotes repetition of the previous item zero or one time.
{
m
}
denotes repetition of the previous item exactly m
times.
{
m
,}
denotes repetition of the previous item m
or more times.
{
m
,
n
}
denotes repetition of the previous item at least m
and not more than n
times.
Parentheses ()
can be used to group items into a single logical item.
A bracket expression [...]
specifies a character class, just as in POSIX regular expressions.
Notice that the period (.
) is not a metacharacter for SIMILAR TO
.
As with LIKE
, a backslash disables the special meaning of any of these metacharacters; or a different escape character can be specified with ESCAPE
.
Some examples:
'abc' SIMILAR TO 'abc' true 'abc' SIMILAR TO 'a' false 'abc' SIMILAR TO '%(b|d)%' true 'abc' SIMILAR TO '(b|c)%' false
The substring
function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to SQL99 syntax:
substring(string
frompattern
forescape-character
)
or as a plain three-argument function:
substring(string
,pattern
,escape-character
)
As with SIMILAR TO
, the specified pattern must match the entire data string, or else the function fails and returns null. To indicate the part of the pattern for which the matching data sub-string is of interest, the pattern should contain two occurrences of the escape character followed by a double quote ("
). The text matching the portion of the pattern between these separators is returned when the match is successful.
The escape-double-quote separators actually divide substring
's pattern into three independent regular expressions; for example, a vertical bar (|
) in any of the three sections affects only that section. Also, the first and third of these regular expressions are defined to match the smallest possible amount of text, not the largest, when there is any ambiguity about how much of the data string matches which pattern. (In POSIX parlance, the first and third regular expressions are forced to be non-greedy.)
As an extension to the SQL standard, PostgreSQL allows there to be just one escape-double-quote separator, in which case the third regular expression is taken as empty; or no separators, in which case the first and third regular expressions are taken as empty.
Some examples, with #"
delimiting the return string:
substring('foobar' from '%#"o_b#"%' for '#') oob substring('foobar' from '#"o_b#"%' for '#') NULL
Table 9.15 lists the available operators for pattern matching using POSIX regular expressions.
Table 9.15. Regular Expression Match Operators
Operator | Description | Example |
---|---|---|
~ |
Matches regular expression, case sensitive | 'thomas' ~ '.*thomas.*' |
~* |
Matches regular expression, case insensitive | 'thomas' ~* '.*Thomas.*' |
!~ |
Does not match regular expression, case sensitive | 'thomas' !~ '.*Thomas.*' |
!~* |
Does not match regular expression, case insensitive | 'thomas' !~* '.*vadim.*' |
POSIX regular expressions provide a more powerful means for pattern matching than the LIKE
and SIMILAR TO
operators. Many Unix tools such as egrep
, sed
, or awk
use a pattern matching language that is similar to the one described here.
A regular expression is a character sequence that is an abbreviated definition of a set of strings (a regular set). A string is said to match a regular expression if it is a member of the regular set described by the regular expression. As with LIKE
, pattern characters match string characters exactly unless they are special characters in the regular expression language — but regular expressions use different special characters than LIKE
does. Unlike LIKE
patterns, a regular expression is allowed to match anywhere within a string, unless the regular expression is explicitly anchored to the beginning or end of the string.
Some examples:
'abc' ~ 'abc' true 'abc' ~ '^a' true 'abc' ~ '(b|d)' true 'abc' ~ '^(b|c)' false
The POSIX pattern language is described in much greater detail below.
The substring
function with two parameters, substring(
, provides extraction of a substring that matches a POSIX regular expression pattern. It returns null if there is no match, otherwise the portion of the text that matched the pattern. But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned. You can put parentheses around the whole expression if you want to use parentheses within it without triggering this exception. If you need parentheses in the pattern before the subexpression you want to extract, see the non-capturing parentheses described below.string
from pattern
)
Some examples:
substring('foobar' from 'o.b') oob substring('foobar' from 'o(.)b') o
The regexp_replace
function provides substitution of new text for substrings that match POSIX regular expression patterns. It has the syntax regexp_replace
(source
, pattern
, replacement
[, flags
]). The source
string is returned unchanged if there is no match to the pattern
. If there is a match, the source
string is returned with the replacement
string substituted for the matching substring. The replacement
string can contain \
n
, where n
is 1 through 9, to indicate that the source substring matching the n
'th parenthesized subexpression of the pattern should be inserted, and it can contain \&
to indicate that the substring matching the entire pattern should be inserted. Write \\
if you need to put a literal backslash in the replacement text. The flags
parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag i
specifies case-insensitive matching, while flag g
specifies replacement of each matching substring rather than only the first one. Supported flags (though not g
) are described in Table 9.23.
Some examples:
regexp_replace('foobarbaz', 'b..', 'X') fooXbaz regexp_replace('foobarbaz', 'b..', 'X', 'g') fooXX regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g') fooXarYXazY
The regexp_match
function returns a text array of captured substring(s) resulting from the first match of a POSIX regular expression pattern to a string. It has the syntax regexp_match
(string
, pattern
[, flags
]). If there is no match, the result is NULL
. If a match is found, and the pattern
contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the pattern
contains parenthesized subexpressions, then the result is a text array whose n
'th element is the substring matching the n
'th parenthesized subexpression of the pattern
(not counting “non-capturing” parentheses; see below for details). The flags
parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Supported flags are described in Table 9.23.
Some examples:
SELECT regexp_match('foobarbequebaz', 'bar.*que'); regexp_match -------------- {barbeque} (1 row) SELECT regexp_match('foobarbequebaz', '(bar)(beque)'); regexp_match -------------- {bar,beque} (1 row)
In the common case where you just want the whole matching substring or NULL
for no match, write something like
SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1]; regexp_match -------------- barbeque (1 row)
The regexp_matches
function returns a set of text arrays of captured substring(s) resulting from matching a POSIX regular expression pattern to a string. It has the same syntax as regexp_match
. This function returns no rows if there is no match, one row if there is a match and the g
flag is not given, or N
rows if there are N
matches and the g
flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern
, just as described above for regexp_match
. regexp_matches
accepts all the flags shown in Table 9.23, plus the g
flag which commands it to return all matches, not just the first one.
Some examples:
SELECT regexp_matches('foo', 'not there'); regexp_matches ---------------- (0 rows) SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g'); regexp_matches ---------------- {bar,beque} {bazil,barf} (2 rows)
In most cases regexp_matches()
should be used with the g
flag, since if you only want the first match, it's easier and more efficient to use regexp_match()
. However, regexp_match()
only exists in PostgreSQL version 10 and up. When working in older versions, a common trick is to place a regexp_matches()
call in a sub-select, for example:
SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
This produces a text array if there's a match, or NULL
if not, the same as regexp_match()
would do. Without the sub-select, this query would produce no output at all for table rows without a match, which is typically not the desired behavior.
The regexp_split_to_table
function splits a string using a POSIX regular expression pattern as a delimiter. It has the syntax regexp_split_to_table
(string
, pattern
[, flags
]). If there is no match to the pattern
, the function returns the string
. If there is at least one match, for each match it returns the text from the end of the last match (or the beginning of the string) to the beginning of the match. When there are no more matches, it returns the text from the end of the last match to the end of the string. The flags
parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. regexp_split_to_table
supports the flags described in Table 9.23.
The regexp_split_to_array
function behaves the same as regexp_split_to_table
, except that regexp_split_to_array
returns its result as an array of text
. It has the syntax regexp_split_to_array
(string
, pattern
[, flags
]). The parameters are the same as for regexp_split_to_table
.
Some examples:
SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '\s+') AS foo; foo ------- the quick brown fox jumps over the lazy dog (9 rows) SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+'); regexp_split_to_array ----------------------------------------------- {the,quick,brown,fox,jumps,over,the,lazy,dog} (1 row) SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo; foo ----- t h e q u i c k b r o w n f o x (16 rows)
As the last example demonstrates, the regexp split functions ignore zero-length matches that occur at the start or end of the string or immediately after a previous match. This is contrary to the strict definition of regexp matching that is implemented by regexp_match
and regexp_matches
, but is usually the most convenient behavior in practice. Other software systems such as Perl use similar definitions.
PostgreSQL's regular expressions are implemented using a software package written by Henry Spencer. Much of the description of regular expressions below is copied verbatim from his manual.
Regular expressions (REs), as defined in POSIX 1003.2, come in two forms: extended REs or EREs (roughly those of egrep
), and basic REs or BREs (roughly those of ed
). PostgreSQL supports both forms, and also implements some extensions that are not in the POSIX standard, but have become widely used due to their availability in programming languages such as Perl and Tcl. REs using these non-POSIX extensions are called advanced REs or AREs in this documentation. AREs are almost an exact superset of EREs, but BREs have several notational incompatibilities (as well as being much more limited). We first describe the ARE and ERE forms, noting features that apply only to AREs, and then describe how BREs differ.
PostgreSQL always initially presumes that a regular expression follows the ARE rules. However, the more limited ERE or BRE rules can be chosen by prepending an embedded option to the RE pattern, as described in Section 9.7.3.4. This can be useful for compatibility with applications that expect exactly the POSIX 1003.2 rules.
A regular expression is defined as one or more branches, separated by |
. It matches anything that matches one of the branches.
A branch is zero or more quantified atoms or constraints, concatenated. It matches a match for the first, followed by a match for the second, etc; an empty branch matches the empty string.
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a match for the atom. With a quantifier, it can match some number of matches of the atom. An atom can be any of the possibilities shown in Table 9.16. The possible quantifiers and their meanings are shown in Table 9.17.
A constraint matches an empty string, but matches only when specific conditions are met. A constraint can be used where an atom could be used, except it cannot be followed by a quantifier. The simple constraints are shown in Table 9.18; some more constraints are described later.
Table 9.16. Regular Expression Atoms
Atom | Description |
---|---|
( re ) |
(where re is any regular expression) matches a match for re , with the match noted for possible reporting |
(?: re ) |
as above, but the match is not noted for reporting (a “non-capturing” set of parentheses) (AREs only) |
. |
matches any single character |
[ chars ] |
a bracket expression, matching any one of the chars (see Section 9.7.3.2 for more detail) |
\ k |
(where k is a non-alphanumeric character) matches that character taken as an ordinary character, e.g., \\ matches a backslash character |
\ c |
where c is alphanumeric (possibly followed by other characters) is an escape, see Section 9.7.3.3 (AREs only; in EREs and BREs, this matches c ) |
{ |
when followed by a character other than a digit, matches the left-brace character { ; when followed by a digit, it is the beginning of a bound (see below) |
x |
where x is a single character with no other significance, matches that character |
An RE cannot end with a backslash (\
).
If you have standard_conforming_strings turned off, any backslashes you write in literal string constants will need to be doubled. See Section 4.1.2.1 for more information.
Table 9.17. Regular Expression Quantifiers
Quantifier | Matches |
---|---|
* |
a sequence of 0 or more matches of the atom |
+ |
a sequence of 1 or more matches of the atom |
? |
a sequence of 0 or 1 matches of the atom |
{ m } |
a sequence of exactly m matches of the atom |
{ m ,} |
a sequence of m or more matches of the atom |
{ m , n } |
a sequence of m through n (inclusive) matches of the atom; m cannot exceed n |
*? |
non-greedy version of * |
+? |
non-greedy version of + |
?? |
non-greedy version of ? |
{ m }? |
non-greedy version of { m } |
{ m ,}? |
non-greedy version of { m ,} |
{ m , n }? |
non-greedy version of { m , n } |
The forms using {
...
}
are known as bounds. The numbers m
and n
within a bound are unsigned decimal integers with permissible values from 0 to 255 inclusive.
Non-greedy quantifiers (available in AREs only) match the same possibilities as their corresponding normal (greedy) counterparts, but prefer the smallest number rather than the largest number of matches. See