Regular Expressions

A regular expression is a specification of a pattern frequently used to clean up or extract pieces of data. You can search for a pattern and replace it with a different string or extract specific parts of the string. Define the pattern in the Regex() or Regex Match() function.

Regex()

Regex() searches for a pattern within a source string and returns a string. It simply identifies a pattern in a string or transforms a string into another string.

Regex(source, pattern, (<replacementString>, <GLOBALREPLACE>), <format>, <IGNORECASE>);

IGNORECASE disregards case. GLOBALREPLACE repeats the match until the entire string is processed. format is a backreference to the matched group. Regex() returns missing if the match fails.
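The optional arguments above can be sketched as follows. The sample sentence and variable names are made up for illustration, and placing IGNORECASE after an explicit \0 format is an assumption based on the syntax line above rather than a confirmed calling convention.

```jsl
sentence = "I took the Bus to work.";

// Regex() is case sensitive by default, so "bus" does not match "Bus".
// A failed match is described above as returning missing.
strict = Regex( sentence, "bus" );
Show( Is Missing( strict ) ); // expected: 1

// "\0" is the default format (the whole match); IGNORECASE disregards case.
relaxed = Regex( sentence, "bus", "\0", IGNORECASE );
Show( relaxed ); // expected: "Bus"
```

Checking the result with Is Missing() before using it guards against treating a failed match as a string.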

Example of Matching a String

bus|car is the regular expression (in quotation marks because it is a string). The expression means match “bus” or “car”.

sentence = "I took the bus to work.";

vehicle = Regex( sentence, "bus|car" );

"bus"

Examples of Replacing a String

The third optional argument in Regex() is a specification of the result string. The default value, \0, is a backreference to everything that was matched by the regular expression. In the preceding example, the word “bus” is matched in sentence. Because the result is built from the default third argument, \0, Regex() returns only the matched text, “bus”, instead of the entire sentence.
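As a small sketch of how the third argument shapes the result, the following lines pass \0 explicitly and then surround it with literal text. The label text "vehicle: " is a made-up illustration, and writing \0 explicitly is assumed to behave the same as omitting the argument.

```jsl
sentence = "I took the bus to work.";

// Explicit \0: the result is just the matched text.
Show( Regex( sentence, "bus|car", "\0" ) );          // expected: "bus"

// Literal text in the format is copied into the result around \0.
Show( Regex( sentence, "bus|car", "vehicle: \0" ) ); // expected: "vehicle: bus"
```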

A more interesting variation uses parentheses to create additional backreferences.

sentence = "I took the bus to work.";

Regex( sentence, "(.*) bus (.*)", "\1 car \2" );

"I took the car to work."

The (.*) before and after bus are part of the regular expression. The parentheses create a capturing group. The . matches any character. The * matches zero or more of the previous expression. As a result, the first parenthesis pair matches everything before bus, and the second parenthesis pair matches everything after bus. The third argument, \1 car \2, reassembles the text; it leaves out bus and substitutes car.

See Backreferences and Capturing Groups for more information.

Example of Global Replacement

GLOBALREPLACE changes the behavior of Regex(). If the match succeeds, the entire source string is returned with substitutions made for each place where the pattern matches. If there are no matches, an unchanged source string is returned.
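The difference in the no-match case can be sketched like this. The sentence is a made-up example, and the expected values simply restate the behavior described in this section.

```jsl
text = "I walked to work.";

// With GLOBALREPLACE, a pattern that never matches leaves the source unchanged.
Show( Regex( text, "bus", "car", GLOBALREPLACE ) ); // expected: "I walked to work."

// Without GLOBALREPLACE, a failed match returns missing.
Show( Is Missing( Regex( text, "bus", "car" ) ) );  // expected: 1
```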

sentence = "I took the red bus followed by the blue bus to get to work today.";

Regex( sentence, "bus", "car", GLOBALREPLACE);

"I took the red car followed by the blue car to get to work today."

You can also use backreferences. This example starts with a different sentence.

sentence = "I took the red bus followed by the blue car to get to work today.";

Regex(
	sentence,
	"(\w*) (bus|car)",
	"bicycle (not \2) that was \1",
	GLOBALREPLACE
);

"I took the bicycle (not bus) that was red followed by the bicycle (not car) that was blue to get to work today."

The \w* matches zero or more word characters and becomes backreference 1 because of the parentheses. bus|car becomes backreference 2 because of the parentheses. The third argument, bicycle (not \2) that was \1, describes how to build the substitution text for the part of the source text that was matched.

Notice how the backreferences can be used to swap data positions. This might be useful for swapping the position of first names and last names.
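A minimal sketch of the name swap mentioned above, assuming names are stored as "Last, First" with single-word parts. The sample name is hypothetical.

```jsl
name = "Smith, Jane";

// \1 captures the last name and \2 the first name; the format reverses them.
swapped = Regex( name, "(\w+), (\w+)", "\2 \1" );
Show( swapped ); // expected: "Jane Smith"
```

Names with spaces or hyphens would need a broader pattern than \w+.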

Regex Match()

Regex Match() returns a list. If the match fails, the list is empty (zero items). If the match succeeds, the first item is the text of the entire match (backreference 0), the second item is the text that matches backreference 1, and so on.

Regex Match(source, pattern, <NULL>, <MATCHCASE>);

Unlike Regex(), Regex Match() is case insensitive. Include MATCHCASE for a case-sensitive match. Include NULL if you want to match case but there is no replacement text.
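The following sketch contrasts the default behavior with MATCHCASE, using a made-up sentence. The expected values follow from the description above and are not captured output.

```jsl
text = "I took the BUS to work.";

// Case insensitive by default, so "bus" matches "BUS".
Show( Regex Match( text, "bus" ) );                  // expected: {"BUS"}

// NULL stands in for the replacement text; MATCHCASE makes the match case sensitive.
Show( Regex Match( text, "bus", NULL, MATCHCASE ) ); // expected: {}
```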

Example of Parsing Name-Value Pairs

The following example parses pairs of names and values.

Regex Match(
	"person=Fred id=77 friend= favorite=tea",
	"(\w+)=(\S*) (\w+)=(\S*) (\w+)=(\S*) (\w+)=(\S*)"
);

{"person=Fred id=77 friend= favorite=tea", "person", "Fred", "id", "77", "friend", "", "favorite", "tea"}

The \w+ matches one or more word characters. The \S* matches zero or more characters that are not spaces. In the resulting JSL list, the field names (person, id, friend, favorite) and their corresponding values (Fred, 77, "", tea) are separate strings.
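One way to consume the returned list is to walk it two items at a time. This is only a sketch; the variable name pairs and the " -> " output format are arbitrary choices.

```jsl
pairs = Regex Match(
	"person=Fred id=77 friend= favorite=tea",
	"(\w+)=(\S*) (\w+)=(\S*) (\w+)=(\S*) (\w+)=(\S*)"
);

// Item 1 is the whole match, so names are at items 2, 4, 6, 8
// and each value is the item that follows its name.
For( i = 2, i < N Items( pairs ), i += 2,
	Write( pairs[i] || " -> " || pairs[i + 1] || "\!N" )
);
```

The value for friend is an empty string, so that line ends right after the arrow.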

If the first argument to Regex Match() is a variable and a third argument specifies the replacement value, the matched text is replaced in the variable.
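A hedged sketch of the in-place replacement described above. The variable text, the pattern, and the replacement string are made up, and the comments restate the documented behavior rather than verified output.

```jsl
text = "id=77 name=Fred";

// The first argument is a variable and the third argument is replacement text.
found = Regex Match( text, "id=\d+", "id=###" );

Show( found ); // the list of matched text, as for any Regex Match() call
Show( text );  // per the description above, the matched text should now read id=###
```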

Comparing Regex() and Regex Match()

Regex() and Regex Match() match a pattern in a given string but return different results. To transform your string into another string, use Regex(). To identify the substrings that match specific parts of the pattern, use Regex Match().

This example shows the efficiency of Regex Match() compared to Regex(). The source is a list of six strings. The goal is to extract portions of those six strings into the subject, verb, and object columns of a data table (Final Data Table).

Final Data Table

source = {"the cat ate the chicken", "the dog chased the cat", "did ralph like mary", "the girl pets the dog", "these words are strange", "the cat was chased by the dog"};

// Create the data table.
dt = New Table( "English 101",
	New Column( "subject", character ),
	New Column( "verb", character ),
	New Column( "object", character )
);

// Iterate through the strings in the list.
For( i = 1, i <= N Items( source ), i++,
	// Assign the result of each match to matchList.
	matchList = Regex Match(
		source[i],
		// Scan each string. Match zero or more characters
		// and one item in each group.
		// The pattern line is missing from the source text; this pattern is an
		// assumption chosen to reproduce the results described in this section.
		".*(cat|dog|ralph|girl).*(ate|chased|like|pets).*(chicken|cat|mary|dog)"
	);
	// If matchList has zero items (string 5), don't add a row
	// to the table. Put each matched string in separate
	// data table cells.
	If( N Items( matchList ) > 0,
		dt << Add Rows( 1 );
		dt:subject = matchList[2]; // Text captured by the first pair of parentheses.
		dt:verb = matchList[3]; // Text captured by the second pair of parentheses.
		dt:object = matchList[4]; // Text captured by the third pair of parentheses.
	);
);

For the sixth string, Regex Match() returns {"the cat was chased by the dog", "cat", "chased", "dog"} in a single call, with each answer in a separate string. Compare this example to a similar one using Regex(), which returns one answer at a time and builds each result string using a backreference.

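// The pattern and the three Regex() calls are missing from the source text;
// they are assumed here, using the same groups as the Regex Match() example.
pattern = ".*(cat|dog|ralph|girl).*(ate|chased|like|pets).*(chicken|cat|mary|dog)";
For( i = 1, i <= N Items( source ), i++,
	s = Regex( source[i], pattern, "\1" );
	v = Regex( source[i], pattern, "\2" );
	o = Regex( source[i], pattern, "\3" );
	If( !Is Missing( s ) & !Is Missing( v ) & !Is Missing( o ),
		dt << Add Rows( 1 );
		dt:subject = s; // Return the match for \1.
		dt:verb = v; // Return the match for \2.
		dt:object = o; // Return the match for \3.
	);
);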

Backreferences are discussed in Backreferences and Capturing Groups.

Special Characters in Regular Expressions

Special characters are commonly used in regular expressions. The period is a special character that matches any single character. It must be escaped with a backslash to be interpreted as a literal period. In the following expression, the period is replaced with an exclamation point.

Regex( "Bicycling makes traveling to work fun.", "\.", "!", GLOBALREPLACE );

"Bicycling makes traveling to work fun!"

Special Characters in Regular Expressions describes the special characters and provides examples.

Special Characters in Regular Expressions

\	Precedes a literal character.	<\/a> interprets the forward slash literally in the end HTML anchor tag.
\	Precedes an escape sequence.	\n matches a newline character.
^	Matches the beginning of a string, not including the newline character.	^apple matches “apple” at the beginning of a string.
$	Matches the end of a string, not including the newline character.	apple$ matches “apple” at the end of a string.
.	Matches any single character including a newline character.	.apple matches any single character and then “apple”.
|	Represents a logical OR to separate alternative values.	(apple|orange|banana) matches “apple”, “orange”, or “banana”.
?	Matches zero or one instance.	apple (pie)? matches “apple ” followed by zero or one instance of “pie”.
*	Matches zero or more instances.	
+	Matches one or more instances.	
( )	Encloses a sub-expression.	(apple|orange|banana) matches “apple”, “orange”, or “banana”.
		^(\w+) matches the beginning of a line and then one or more word characters.
[ ]	Encloses an expression that matches a set of characters.	[\s\d] matches a whitespace character or a digit.
		[a-z0-9] matches “a” through “z” and numbers “0” through “9”.
{ }	Encloses an expression that represents repetition of the preceding item.	apple{3} matches “appl” followed by “e” repeated exactly three times.
		apple{3,} repeats the “e” at least three times and as many times as possible.
		apple{3,10} repeats the “e” at least three times but no more than 10 times.

Append a question mark to indicate repeating as few times as possible. For example, apple{3,}? repeats at least three times as few times as possible.
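A few of the special characters above can be tried directly. This is a sketch with made-up sample strings; the expected values follow from standard regular expression semantics, in which a repetition operator applies to the item immediately before it (here, the final “e”).

```jsl
// Repetition with braces: greedy by default, reluctant with a trailing ?.
Show( Regex( "appleeeee", "apple{3}" ) );   // expected: "appleee"
Show( Regex( "appleeeee", "apple{3,}" ) );  // expected: "appleeeee"
Show( Regex( "appleeeee", "apple{3,}?" ) ); // expected: "appleee"

// The ? operator: the group is optional.
Show( Regex( "apple pie", "apple (pie)?" ) );  // expected: "apple pie"
Show( Regex( "apple tart", "apple (pie)?" ) ); // expected: "apple " (with the trailing space)

// An unescaped period matches any character; an escaped one matches only a period.
Show( Regex( "fun.", ".", "!", GLOBALREPLACE ) );  // expected: "!!!!"
Show( Regex( "fun.", "\.", "!", GLOBALREPLACE ) ); // expected: "fun!"
```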

Escaped Characters in Regular Expressions

The backslash in a regular expression precedes a literal character. You also escape certain letters that represent common character classes, such as \w for a word character or \s for a space. The following example matches word characters (alphanumeric and underscores) and spaces.

Regex(
	"Are you there, Alice?, asked Jerry.", // source
	"(here|there).+(\w+).+(said|asked)(\s)(\w+)\." ); // regular expression

"there, Alice?, asked Jerry."


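(here|there).+	Matches “there” and then, because .+ is greedy, as much of the following text as the rest of the pattern allows: “, Alic”.
(\w+)	Matches the one word character that the greedy .+ leaves available, the final “e” in “Alice”.
.+	Matches “?, ”.
(said|asked)(\s)	Matches “asked” followed by a space. Without (\s), the match would fail at this point, because “asked” is followed by a space in the source string.
(\w+)\.	Matches “Jerry” and a period.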

Escaped Characters describes the escaped characters supported in JMP. \C, \G, \X, and \z are not supported.

Escaped Characters
\A	start of a string
\b	word boundary. The zero-length string between \w and \W or \W and \w.
\B	not at a word boundary
\cX	ASCII control character
\d	single digit [0-9]
\D	single character that is NOT a digit [^0-9]
\l	single lowercase character [a-z]
\L	single character that is not lowercase [^a-z]
\s	single whitespace character
\S	single character that is NOT white space
\u	single uppercase character [A-Z]
\U	single character that is not uppercase [^A-Z]
\w	word character [a-zA-Z0-9_]
\W	single character that is NOT a word character [^a-zA-Z0-9_]
\x00-\xFF	hexadecimal character
\x{0000}-\x{FFFF}	Unicode code point
\Z	end of a string before the line break
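Some of the escaped characters in the table can be sketched against a made-up sample string. The expected values follow from the definitions in the table.

```jsl
text = "Order 66 shipped";

Show( Regex( text, "\d+" ) );       // expected: "66" (first run of digits)
Show( Regex( text, "\w+" ) );       // expected: "Order" (first run of word characters)
Show( Regex( text, "\S+\s\d+" ) );  // expected: "Order 66" (non-space run, one space, digits)
Show( Regex( text, "\b\w+\Z" ) );   // expected: "shipped" (last word, anchored at the end)
```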

Greedy and Reluctant Regular Expressions

The ?, *, and + operators are greedy by default: they match as many instances of the preceding expression as possible. Appending ? to an operator makes it reluctant: ?? tries zero instances first and then one if needed; +? matches one instance and then more only if needed; *? matches zero instances and then more only if needed.
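The reluctant ?? form is the least familiar of these, so here is a minimal sketch using a made-up word. The expected values follow from the rule that ?? tries zero instances first.

```jsl
// Greedy ?: takes the optional "u" when it is available.
Show( Regex( "colour", "colou?" ) );   // expected: "colou"

// Reluctant ??: takes zero instances because nothing forces it to take one.
Show( Regex( "colour", "colou??" ) );  // expected: "colo"

// Reluctant ?? followed by "r": zero instances fails, so one "u" is matched.
Show( Regex( "colour", "colou??r" ) ); // expected: "colour"
```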

The following example starts at the letter n and compares it to the first \d (digits) in the pattern. No digit matches. Because the pattern does not begin with ^ (start of line), the matcher advances to u. The process repeats until the 3 matches the first \d and the 2 matches the second \d.

Regex( "number=32.5", "\d\d" );

"32"

Change the pattern to use the greedy + (match one or more).

Regex( "number=324.5", "\d+" );

"324"

The preceding example begins much the same, but as soon as the 3 is found and the \d matches, the + greedily matches the 2 and the 4.

Usually, the greedy behavior makes pattern matching faster because the string is consumed sooner. Sometimes a reluctant behavior is better. Adding the ? after the * or + changes them from greedy to reluctant.

Regex( "number=324.5", "\d+?" );

"3"

Here, + requires at least one match of a digit character, but ? changes it from “as many as possible” to “as few as possible”. It stopped after the 3 because the pattern was satisfied.

Compare the following results.

Greedy:

Regex( "number=324.5", "(\d+)(\d+)\.", "first=\1 second=\2" );

"first=32 second=4"

Reluctant:

Regex( "number=324.5", "(\d+?)(\d+?)\.", "first=\1 second=\2" );

"first=3 second=24"

In the greedy example above, the matcher greedily matched 3, 2, and 4 for the first \d+. The matcher then had to give back the 4 so that the second \d+ could match something. The reluctant example followed a different path to get a different answer. Initially, the second value was 2, but the pattern could not match the period to the 4, so the second \d+? reluctantly matched the 4 as well.

Use the Reluctant Match for Speed

The greedy and reluctant matches usually produce the same result but not always. See the previous section. One reason you might need the reluctant match is for speed. Suppose that you have a million-character string that begins “The quick fox…” and you want to find the word before “fox”. You might write the following expression and expect \1 to contain “quick”.

The (.+) fox

\1 might contain “quick” eventually, after the .+ grabs the million characters to the end of the string and then gives them up, one at a time, until “fox” is found. If there is more than one “fox”, it will be the last fox, not this one. To speed it up and make sure we get the first fox, add the ? operator.

The (.+?) fox

The ? advances one character at a time to get past “quick” and find the first “fox”. This method is much faster than going too far.

Typically, the + or * operator is applied to a more restrictive expression such as \d* to match a run of digits, and greedy is faster than reluctant.

Aside from the multiple fox possibility, greedy and reluctant eventually get the same answer. Using the right operator speeds up the match. The right one might be greedy, or it might be reluctant. It depends on what is being matched.
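To see the speed difference on your own system, a rough timing sketch like the following might help. The filler text and repeat count are arbitrary, and using HP Time() (a microsecond timer) to bracket each call is one possible way to measure, not a prescribed benchmark. No timings are claimed here.

```jsl
// Build a long string with a single "fox" near the start.
big = "The quick fox " || Repeat( "jumped over the lazy dog ", 4000 );

t0 = HP Time();
greedy = Regex( big, "The (.+) fox", "\1" );     // runs to the end, then backs up
t1 = HP Time();
reluctant = Regex( big, "The (.+?) fox", "\1" ); // advances one character at a time
t2 = HP Time();

// Both are expected to return "quick"; compare the elapsed microseconds.
Show( greedy, reluctant, t1 - t0, t2 - t1 );
```

Increase the repeat count if both calls finish too quickly to compare.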

The greedy .* finds the last fox after backing up.

Regex(
	"The quick fox saw another fox eating grapes",
	"The (.*) fox",
	"\1"
);

"quick fox saw another"

The reluctant .*? stops on the first fox.

Regex(
	"The quick fox saw another fox eating grapes",
	"The (.*?) fox",
	"\1"
);

"quick"

The greedy .* has to back up a lot. There is no second fox.

Regex(
	"The quick fox saw another animal eating grapes",
	"The (.*) fox",
	"\1"
);

"quick"

The greedy word character match is an even better choice for this problem.

Regex(
	"The quick fox saw another fox eating grapes",
	"The (\w*) fox",
	"\1"
);

"quick"

Backreferences and Capturing Groups

A regular expression can consist of patterns grouped in parentheses, also known as capturing groups. In ([a-zA-Z])\s([0-9]), ([a-zA-Z]) is the first capturing group; ([0-9]) is the second capturing group.

Use a backreference to replace the pattern matched by a capturing group. In Perl, these groups are represented by the special variables $1, $2, $3, and so on. ($1 indicates text matched by the first parenthetical group.) In JMP, use a backslash followed by the group number (\1, \2, \3).

The following example includes a third argument that specifies the replacement text and backreferences.

Regex(
	" Are you there, Alice?, asked Jerry.", // source
	" (here|there).+ (\w+).+(said|asked) (\w+)\.", // regular expression
	" I am \1, \4, replied \2." ); // optional format argument

" I am there, Jerry, replied Alice."

" I am \1,	Creates the text “I am”, a space, and then the text captured by the first group, “there”, followed by a comma.
\4,	Inserts “Jerry”, the text captured by the fourth group, (\w+), followed by a comma.
replied \2."	Creates the text “replied” and a space, and then inserts “Alice”, the text captured by the second group, (\w+), followed by a period.
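Backreferences can also reorder captured pieces. The following sketch rearranges a date string; the sample date and the target layout are made up for illustration.

```jsl
iso = "2024-01-15";

// \1 = year, \2 = month, \3 = day; the format places them in a new order.
us = Regex( iso, "(\d{4})-(\d{2})-(\d{2})", "\2/\3/\1" );
Show( us ); // expected: "01/15/2024"
```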

Lookaround Assertions

Lookaround assertions check for a pattern but do not return that pattern in the results. Lookaheads look forward for a pattern. Lookbehinds look back for a pattern.
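A minimal sketch of the "check but do not return" behavior, using a made-up price string. The lookahead requires " USD" after the digits but leaves it out of the returned text.

```jsl
text = "price: 42 USD";

// Without a lookahead, the unit is part of the match.
Show( Regex( text, "\d+ USD" ) );     // expected: "42 USD"

// With a positive lookahead, the unit is required but not returned.
Show( Regex( text, "\d+(?= USD)" ) ); // expected: "42"
```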

Negative Lookahead Example

Negative lookaheads check that a pattern does not appear immediately after a specific pattern. (?!...) indicates a negative lookahead. The following expression matches a comma that is not followed by a digit or a space and replaces it with a comma and a space:

Regex( "one,two 1,234 cat,dog,duck fish, and chips,to go",
	",(?!\d|\s)", ", ", GLOBALREPLACE );

"one, two 1,234 cat, dog, duck fish, and chips, to go"

Positive Lookahead Example

Positive lookaheads check that a pattern appears immediately after a specific pattern. (?=...) indicates a positive lookahead. The following expression has the same result as the preceding negative lookahead example but matches a comma followed by any lowercase letter:

Regex( "one,two 1,234 cat,dog,duck fish, and chips,to go",
	",(?=[a-z])", ", ", GLOBALREPLACE );

"one, two 1,234 cat, dog, duck fish, and chips, to go"

Positive Lookbehind Example

In this example, the positive lookbehind regular expression matches the “ssn=” or “salary=” keywords without including the keyword in the matched text. The matched text is the string of characters that consists of zero or more dollar signs, digits, and hyphens.

data = "name=bill salary=$5 ssn=123-45-6789 age=13,name=mary salary=$6 ssn=987-65-4321 age=14";

redacted = Regex(data, "(?<=(ssn=)|(salary=))[$\d-]*", "###", GLOBALREPLACE);

"name=bill salary=### ssn=### age=13,name=mary salary=### ssn=### age=14"

Here is another way to get the same result using a backreference substitution. ((ssn=)|(salary=)) is the capturing group. "\1" is the backreference to that group.

data = "name=bill salary=$5 ssn=123-45-6789 age=13,name=mary salary=$6 ssn=987-65-4321 age=14";

redacted = Regex(data, "((ssn=)|(salary=))[$\d-]*", "\1###", GLOBALREPLACE);

"name=bill salary=### ssn=### age=13,name=mary salary=### ssn=### age=14"

Backreferences are discussed in Backreferences and Capturing Groups.