Regular expressions
Applies to: Databricks SQL
Databricks Runtime
A regular expression (regex) is a pattern that describes a set of strings. Several Databricks SQL functions and operators take a regular expression to match, locate, count, extract, replace, or split text:
- rlike operator and regexp operator: test whether a string matches a pattern.
- regexp_like function: test whether a string matches a pattern.
- regexp_count function: count how many times a pattern matches.
- regexp_instr function: return the position of the first match.
- regexp_substr function: return the first matching substring.
- regexp_extract and regexp_extract_all functions: return a capture group from the first match or from all matches.
- regexp_replace function: replace matches with a replacement string.
- split function: split a string around matches of a pattern.
The like and ilike operators do not use regular expressions. They use a simpler pattern language based on _ and %.
Regex engine
Databricks evaluates regular expressions using the Java java.util.regex engine.
Patterns must follow Java regular expression syntax, which is broadly compatible with Perl-style regular expressions but differs in some details.
For the complete and authoritative reference, see the Java Pattern class documentation.
The following sections describe the most commonly used constructs.
Character classes
Construct | Matches |
|---|---|
| Any single character except a line terminator (unless the |
| Any one of the listed characters: |
| Any single character that is not |
| Any character in the range |
| A digit ( |
| A word character ( |
| A whitespace character; |
Anchors and boundaries
Construct | Matches |
|---|---|
| The start of the input (or of a line with the |
| The end of the input (or of a line with the |
| A word boundary; |
Quantifiers
Quantifiers specify how many times the preceding element must match.
By default, quantifiers are greedy and match as much as possible. Append ? to make a quantifier reluctant (match as little as possible), for example .*?.
Construct | Matches the preceding element… |
|---|---|
| Zero or more times. |
| One or more times. |
| Zero or one time. |
| Exactly |
| At least |
| Between |
Groups and alternation
Construct | Description |
|---|---|
| A capturing group. Use the group index with regexp_extract. |
| A non-capturing group. Groups without capturing the match. |
| Alternation. Matches either |
Inline flags
Place a flag at the start of the pattern to change how the whole pattern is matched.
Flag | Effect |
|---|---|
| Case-insensitive matching. |
| Dotall mode. |
| Multiline mode. |
| Comments mode. Ignores unescaped whitespace in the pattern. |
For example, (?i)ste(v\|ph)en matches Steven, steven, and STEPHEN regardless of case.
Escaping and string literals
To match a character that is also a regex metacharacter (such as ., *, (, [, or \), precede it with a backslash.
For example, \. matches a literal period, and \\ matches a literal backslash.
Because a backslash is also the escape character in ordinary SQL string literals, you must double each backslash that you intend to pass to the regex engine.
For example, to pass the pattern \d+ to a function, write it as '\\d+' in a regular string literal.
To avoid this double escaping, use a raw-literal (r prefix), which disables escape-character preprocessing.
In a raw literal, write the pattern exactly as the regex engine expects it:
-- Regular string literal: backslashes must be doubled.
> SELECT regexp_substr('item 42 in stock', '\\d+');
42
-- Raw literal: write the pattern as-is.
> SELECT regexp_substr('item 42 in stock', r'\d+');
42
Common patterns
The following patterns cover frequent tasks. They are written as raw literals so that backslashes pass through to the regex engine unchanged. The email, URL, and IP patterns are intentionally simple illustrations, not strict validators.
Goal | Pattern |
|---|---|
Integer, optionally signed |
|
Decimal number |
|
A word |
|
The whole word |
|
Email address (simple) |
|
Host in a URL |
|
IPv4 address (does not check 0–255) |
|
ISO date ( |
|
Leading or trailing whitespace |
|
US phone number (simple) |
|
Word boundaries (\b) match a whole word rather than a substring:
-- 'cat' as a standalone word
> SELECT 'the cat sat on the mat' rlike r'\bcat\b';
true
-- 'cat' only as part of a larger word
> SELECT 'category' rlike r'\bcat\b';
false
Combine a literal prefix with a quantifier to extract a structured token, such as an order ID:
> SELECT regexp_substr('Ref: ORD-12345 shipped on 2024-03-15', r'ORD-\d+');
ORD-12345
Choose the right function
Use this table to pick the function that matches your goal.
You want to… | Use |
|---|---|
Test whether a string matches. | rlike / regexp / regexp_like |
Return the first matching substring. | |
Return a capture group. | |
Return all matches. | |
Return the position of a match. | |
Count the number of matches. | |
Replace matches with other text. | |
Split a string around matches. | |
Match a simple |