Regular Expressions

Analyzer provides two functions, REGEXFIND() and REGEXREPLACE(), that let you use regular expressions to find or replace data in a character string.

The phrase “regular expressions” refers to a common core syntax for representing the patterns to search for within a character string. Many common programming languages use regular expressions.

Each character in a regular expression (that is, each character used in describing a pattern) is understood to be either:

a meta-character (with a special meaning), or
a regular character (with a literal meaning)

Together, they can be used to identify a given pattern, or process a number of instances of a pattern. Pattern-matches can vary from a precise equality to a very general similarity (controlled by the meta-characters). The meta-character syntax is designed specifically to easily represent patterns in a concise and flexible way.

A very simple use of a regular expression is to locate the same word spelled two different ways. For example, the regular expression "familiari[sz]e" matches both "familiarise" and "familiarize".
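
The spelling example can be tried out with a minimal sketch. It assumes Python's re module as a stand-in for Analyzer's regular expression engine and does not call REGEXFIND() itself. The sample words are hypothetical.

```python
import re

# Assumption: Python's re module stands in for Analyzer's engine.
pattern = re.compile(r"familiari[sz]e")

# Hypothetical sample words: two accepted spellings and one misspelling.
words = ["familiarise", "familiarize", "familiarite"]

# search() returns a match object when the pattern occurs in the string, else None.
results = {word: pattern.search(word) is not None for word in words}

for word, found in results.items():
    print(f"{word}: {found}")
```

Only the two spellings that contain "s" or "z" at the bracketed position are reported as matches.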

Regular expressions consist of constants and operator symbols that denote sets of strings and operations over these sets, respectively.

A full explanation of regular expression meta-character syntax is beyond the scope of this section. Numerous resources explaining regular expressions are available on the Internet. Analyzer uses the ECMAScript implementation of regular expressions, and most implementations share a common core syntax.

The table below lists the most common regular expression meta-characters and describes the operation that each one performs.

Regular Expressions - Lists and Description of Meta-characters

Meta-character

Description

.

Matches any character (except a new line character)

?

Matches 0 or 1 occurrences of the immediately preceding literal, meta-character, or element

*

Matches 0 or more occurrences of the immediately preceding literal, meta-character, or element

+

Matches 1 or more occurrences of the immediately preceding literal, meta-character, or element

{}

Matches the specified number of occurrences of the immediately preceding literal, meta-character, or element. You can specify an exact number, a range, or an open-ended range.

For example:

a{3} matches "aaa"

X{0,2}L matches "L", "XL", and "XXL"

AB-\d{2,}-YZ matches any alphanumeric identifier with the prefix "AB-", the suffix "-YZ", and two or more numbers in the body of the identifier
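
The three {} examples above can be checked with a short sketch. It assumes Python's re module as a stand-in for Analyzer's engine. The helper matches_whole() and the sample values are hypothetical.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Exact count: exactly three occurrences of "a".
exact = matches_whole(r"a{3}", "aaa")

# Range: zero, one, or two occurrences of "X" before "L".
sizes = [s for s in ["L", "XL", "XXL", "XXXL"] if matches_whole(r"X{0,2}L", s)]

# Open-ended range: two or more digits between the prefix and the suffix.
ids = ["AB-7-YZ", "AB-42-YZ", "AB-12345-YZ"]
valid_ids = [s for s in ids if matches_whole(r"AB-\d{2,}-YZ", s)]

print(exact, sizes, valid_ids)
```

The sketch matches against the whole string, so "XXXL" is rejected. A search anywhere in the string would still find "XXL" inside "XXXL".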

[]

Matches any single character inside the brackets

For example:

[aeiou] matches a, or e, or i, or o, or u

[^aeiou] matches any character that is not a, or e, or i, or o, or u

[A-G] matches any uppercase letter from A to G

[A-Ga-g] matches any uppercase letter from A to G, or any lowercase letter from a to g

[5-9] matches any number from 5 to 9

()

Creates a group that defines a sequence or block of characters, which can then be treated as a single unit.

For example:

S(ch)?mid?th? matches "Smith" or "Schmidt"

(56A.*){2} matches any alphanumeric identifier in which the sequence "56A" occurs at least twice

(56A).*-.*\1 matches any alphanumeric identifier in which the sequence "56A" occurs at least twice, with a hyphen located between two of the occurrences
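
The grouping examples above might be exercised as follows. This is a sketch that assumes Python's re module in place of Analyzer's engine, and the identifiers are hypothetical sample data.

```python
import re

# Optional group: (ch)? makes the two-character block "ch" optional as a unit.
surname = re.compile(r"S(ch)?mid?th?")
surname_hits = [n for n in ["Smith", "Schmidt", "Smyth"] if surname.fullmatch(n)]

# Hypothetical identifiers used for the two "56A" patterns.
identifiers = ["56A-X-56A", "56A56A", "56A-XYZ"]

# Repeated group: the block "56A followed by anything" must occur twice.
repeated = re.compile(r"(56A.*){2}")
repeated_hits = [s for s in identifiers if repeated.search(s)]

# Group plus \1: "56A", a hyphen somewhere after it, then "56A" again.
hyphenated = re.compile(r"(56A).*-.*\1")
hyphenated_hits = [s for s in identifiers if hyphenated.search(s)]

print(surname_hits, repeated_hits, hyphenated_hits)
```

"56A56A" satisfies the repeated group but not the hyphenated pattern, because no hyphen separates the two occurrences.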

\

An escape character that specifies that the character immediately following is a literal. Use the escape character if you want to literally match meta-characters. For example, \( finds a left parenthesis, and \\ finds a backslash.

Use the escape character if you want to literally match any of the following characters:

^ $ . * + ? = ! : | \ ( ) [ ] { }

Other punctuation characters such as the ampersand (&) or the "at sign" (@) do not require the escape character.
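
The effect of the escape character can be seen in a brief sketch. It assumes Python's re module as a stand-in for Analyzer's engine, and the sample strings are hypothetical.

```python
import re

# Unescaped, "." is a meta-character and matches any character.
loose = re.search(r"12.50", "12x50") is not None

# Escaped, "\." matches only a literal period.
strict_miss = re.search(r"12\.50", "12x50") is not None
strict_hit = re.search(r"12\.50", "Total: 12.50") is not None

# "\(" and "\)" match literal parentheses; "\\" matches a single backslash.
paren_hit = re.search(r"\(USD\)", "Total (USD)") is not None
backslash_hit = re.search(r"\\", "C:\\temp") is not None

# "&" and "@" are not meta-characters, so they need no escape character.
plain_hit = re.search(r"R&D@", "R&D@example") is not None

print(loose, strict_miss, strict_hit, paren_hit, backslash_hit, plain_hit)
```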

\int

Specifies that a group, previously defined with parentheses (), recurs. int is an integer that identifies the sequential position of the previously defined group in relation to any other groups. This meta-character can be used in the pattern parameter in both REGEXFIND() and REGEXREPLACE().

For example:

(123).*\1 matches any identifier in which the group of digits “123” occurs at least twice

^(\d{3}).*\1 matches any identifier in which the first 3 digits recur

^(\d{3}).*\1.*\1 matches any identifier in which the first 3 digits recur at least twice

^(\D)(\d)-.*\2\1 matches any identifier in which the alphanumeric prefix recurs with the alpha and numeric characters reversed
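
Each of the \int patterns above can be paired with one identifier that should match and one that should not. The sketch below assumes Python's re module as a stand-in for Analyzer's engine, and the identifiers are hypothetical.

```python
import re

# (pattern, identifier expected to match, identifier expected not to match)
cases = [
    (r"(123).*\1", "123-456-123", "123-456"),
    (r"^(\d{3}).*\1", "555-AB-555", "555-AB-556"),
    (r"^(\d{3}).*\1.*\1", "555-555-555", "555-AB-555"),
    (r"^(\D)(\d)-.*\2\1", "A1-xyz-1A", "A1-xyz-A1"),
]

outcomes = []
for pattern, good, bad in cases:
    # \1 and \2 must repeat the exact text captured by groups 1 and 2.
    outcomes.append((re.search(pattern, good) is not None,
                     re.search(pattern, bad) is not None))

for (pattern, good, bad), (hit, miss) in zip(cases, outcomes):
    print(f"{pattern}: {good} -> {hit}, {bad} -> {miss}")
```

In the last case, "A1-xyz-A1" fails because \2\1 requires the reversed prefix "1A", not a repeat of "A1".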

$int

Specifies that a group found in a target string is used in a replacement string. int is an integer that identifies the sequential position of the group in the target string in relation to any other groups. This meta-character can only be used in the new_string parameter in REGEXREPLACE().

For example:

If the pattern (\d{3})[ -]?(\d{3})[ -]?(\d{4}) is used to match a variety of different telephone number formats, the new_string ($1)-$2-$3 can be used to replace the numbers with themselves, and standardize the formatting. 999 123-4567 and 9991234567 both become (999)-123-4567.
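
The telephone number example can be reproduced in a sketch that assumes Python's re module in place of REGEXREPLACE(). One difference is worth noting: Python writes group references in the replacement string as \1, \2, \3, where the Analyzer new_string uses $1, $2, $3. The third sample number is hypothetical.

```python
import re

pattern = r"(\d{3})[ -]?(\d{3})[ -]?(\d{4})"

# Python equivalent of the Analyzer new_string ($1)-$2-$3.
new_string = r"(\1)-\2-\3"

numbers = ["999 123-4567", "9991234567", "999-123-4567"]
standardized = [re.sub(pattern, new_string, n) for n in numbers]

for before, after in zip(numbers, standardized):
    print(f"{before} -> {after}")
```

All three input formats produce the same output, (999)-123-4567, because the optional [ -]? separators are matched but not captured.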

|

Matches the character, block of characters, or expression before or after the pipe (|)

For example:

a|b matches a or b

abc|def matches "abc" or "def"

Sm(i|y)th matches "Smith" or "Smyth"

[a-c]|[Q-S]|[x-z] matches any of the following letters: a, b, c, Q, R, S, x, y, z

\s|- matches a space or a hyphen
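
The alternation examples above can be sketched as follows, assuming Python's re module as a stand-in for Analyzer's engine. The sample strings are hypothetical.

```python
import re

# Alternation inside a group: only the grouped character varies.
smith = re.compile(r"Sm(i|y)th")
smith_hits = [n for n in ["Smith", "Smyth", "Smeth"] if smith.fullmatch(n)]

# Alternation between whole sequences: "abc" or "def".
word_hits = re.findall(r"abc|def", "abcdefabd")

# Alternation between bracket expressions: any letter from the three ranges.
letters = re.findall(r"[a-c]|[Q-S]|[x-z]", "a-m-Q-T-z")

# A space or a hyphen, used here as a separator when splitting.
parts = re.split(r"\s|-", "999 123-4567")

print(smith_hits, word_hits, letters, parts)
```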

\w

Matches any word character (a to z, A to Z, 0 to 9, and the underscore character _ )

\W

Matches any non-word character (not a to z, A to Z, 0 to 9, or the underscore character _ )

\d

Matches any number (any decimal digit)

\D

Matches any non-number (any character that is not a decimal digit)

\s

Matches a whitespace character (such as a space or a tab)

\S

Matches any non-whitespace character

\b

Matches a word boundary (between \w and \W characters)

Word boundaries consume no space themselves. For example:

"United Equipment" contains 4 word boundaries: one on either side of the space, and one each at the start and the end of the string. "United Equipment" is matched by the regular expression \b\w*\b\W\b\w*\b

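The word boundary example can be verified with a sketch that assumes Python's re module as a stand-in for Analyzer's engine. It lists the positions at which \b matches and confirms that the quoted expression matches the whole string.

```python
import re

text = "United Equipment"

# \b matches an empty position, so every match has zero length.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# The expression from the example above, applied to the whole string.
whole = re.fullmatch(r"\b\w*\b\W\b\w*\b", text) is not None

# \b keeps a search from matching part of a longer word.
inside = re.search(r"Equip", text) is not None
standalone = re.search(r"\bEquip\b", text) is not None

print(boundaries, whole, inside, standalone)
```

The four boundary positions are 0, 6, 7, and 16: the start of the string, either side of the space, and the end of the string.
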
^

Matches the start of a string

Inside brackets [ ], ^ negates the contents

$

Matches the end of a string
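
The two anchors, and the second meaning of ^ inside brackets, can be illustrated with a short sketch. It assumes Python's re module as a stand-in for Analyzer's engine, and the sample strings are hypothetical.

```python
import re

# ^ anchors the pattern to the start of the string.
starts_with_ab = [s for s in ["AB-123", "XAB-123"] if re.search(r"^AB", s)]

# $ anchors the pattern to the end of the string.
ends_with_yz = [s for s in ["AB-123-YZ", "AB-123-YZ1"] if re.search(r"YZ$", s)]

# Both anchors together force the pattern to describe the entire string.
five_digits = [s for s in ["12345", "123456", "A12345"] if re.search(r"^\d{5}$", s)]

# Inside brackets, ^ negates the set: find the first non-digit character.
first_non_digit = re.search(r"[^0-9]", "2024-05").group()

print(starts_with_ab, ends_with_yz, five_digits, first_non_digit)
```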