Regular expressions

Applies to: Databricks SQL Databricks Runtime

A regular expression (regex) is a pattern that describes a set of strings. Several Databricks SQL functions and operators take a regular expression to match, locate, count, extract, replace, or split text:

rlike operator and regexp operator: test whether a string matches a pattern.
regexp_like function: test whether a string matches a pattern.
regexp_count function: count how many times a pattern matches.
regexp_instr function: return the position of the first match.
regexp_substr function: return the first matching substring.
regexp_extract and regexp_extract_all functions: return a capture group from the first match or from all matches.
regexp_replace function: replace matches with a replacement string.
split function: split a string around matches of a pattern.

The like and ilike operators do not use regular expressions. They use a simpler pattern language based on _ and %.

Regex engine

Databricks evaluates regular expressions using the Java java.util.regex engine. Patterns must follow Java regular expression syntax, which is broadly compatible with Perl-style regular expressions but differs in some details. For the complete and authoritative reference, see the Java Pattern class documentation.

The following sections describe the most commonly used constructs.

Character classes

Construct	Matches
`.`	Any single character except a line terminator (unless the `s` flag is set).
`[abc]`	Any one of the listed characters: `a`, `b`, or `c`.
`[^abc]`	Any single character that is not `a`, `b`, or `c`.
`[a-z]`	Any character in the range `a` through `z`.
`\d`, `\D`	A digit (`[0-9]`); `\D` matches any non-digit.
`\w`, `\W`	A word character (`[a-zA-Z_0-9]`); `\W` matches any non-word character.
`\s`, `\S`	A whitespace character; `\S` matches any non-whitespace character.

Anchors and boundaries

Construct	Matches
`^`	The start of the input (or of a line with the `m` flag).
`$`	The end of the input (or of a line with the `m` flag).
`\b`, `\B`	A word boundary; `\B` matches a non-word boundary.

Quantifiers

Quantifiers specify how many times the preceding element must match. By default, quantifiers are greedy and match as much as possible. Append ? to make a quantifier reluctant (match as little as possible), for example .*?.

Construct	Matches the preceding element…
`*`	Zero or more times.
`+`	One or more times.
`?`	Zero or one time.
`{n}`	Exactly `n` times.
`{n,}`	At least `n` times.
`{n,m}`	Between `n` and `m` times.

Groups and alternation

Construct	Description
`(...)`	A capturing group. Use the group index with regexp_extract.
`(?:...)`	A non-capturing group. Groups without capturing the match.
`a\|b`	Alternation. Matches either `a` or `b`.

Inline flags

Place a flag at the start of the pattern to change how the whole pattern is matched.

Flag	Effect
`(?i)`	Case-insensitive matching.
`(?s)`	Dotall mode. `.` matches line terminators.
`(?m)`	Multiline mode. `^` and `$` match at line breaks.
`(?x)`	Comments mode. Ignores unescaped whitespace in the pattern.

For example, (?i)ste(v\|ph)en matches Steven, steven, and STEPHEN regardless of case.

Escaping and string literals

To match a character that is also a regex metacharacter (such as ., *, (, [, or \), precede it with a backslash. For example, \. matches a literal period, and \\ matches a literal backslash.

Because a backslash is also the escape character in ordinary SQL string literals, you must double each backslash that you intend to pass to the regex engine. For example, to pass the pattern \d+ to a function, write it as '\\d+' in a regular string literal.

To avoid this double escaping, use a raw-literal (r prefix), which disables escape-character preprocessing. In a raw literal, write the pattern exactly as the regex engine expects it:

SQL
-- Regular string literal: backslashes must be doubled.
> SELECT regexp_substr('item 42 in stock', '\\d+');
 42

-- Raw literal: write the pattern as-is.
> SELECT regexp_substr('item 42 in stock', r'\d+');
 42

Common patterns

The following patterns cover frequent tasks. They are written as raw literals so that backslashes pass through to the regex engine unchanged. The email, URL, and IP patterns are intentionally simple illustrations, not strict validators.

Goal	Pattern
Integer, optionally signed	`r'-?\d+'`
Decimal number	`r'-?\d+(\.\d+)?'`
A word	`r'\w+'`
The whole word `cat` only	`r'\bcat\b'`
Email address (simple)	`r'[\w.%+-]+@[\w.-]+\.\w{2,}'`
Host in a URL	`r'https?://([^/]+)'`
IPv4 address (does not check 0–255)	`r'\d{1,3}(\.\d{1,3}){3}'`
ISO date (`YYYY-MM-DD`)	`r'\d{4}-\d{2}-\d{2}'`
Leading or trailing whitespace	`r'^\s+\|\s+$'`
US phone number (simple)	`r'\d{3}-\d{3}-\d{4}'`

Word boundaries (\b) match a whole word rather than a substring:

SQL
-- 'cat' as a standalone word
> SELECT 'the cat sat on the mat' rlike r'\bcat\b';
 true

-- 'cat' only as part of a larger word
> SELECT 'category' rlike r'\bcat\b';
 false

Combine a literal prefix with a quantifier to extract a structured token, such as an order ID:

SQL
> SELECT regexp_substr('Ref: ORD-12345 shipped on 2024-03-15', r'ORD-\d+');
 ORD-12345

Choose the right function

Use this table to pick the function that matches your goal.

You want to…	Use
Test whether a string matches.	rlike / regexp / regexp_like
Return the first matching substring.	regexp_substr
Return a capture group.	regexp_extract
Return all matches.	regexp_extract_all
Return the position of a match.	regexp_instr
Count the number of matches.	regexp_count
Replace matches with other text.	regexp_replace
Split a string around matches.	split
Match a simple `_` / `%` pattern.	like / ilike

Regex engine​

Character classes​

Anchors and boundaries​

Quantifiers​

Groups and alternation​

Inline flags​

Escaping and string literals​

Common patterns​

Choose the right function​

Related articles​