Regex() searches for a pattern within a source string and returns a string. It simply identifies a pattern in a string or transforms a string into another string.
IGNORECASE disregards case. GLOBALREPLACE repeats the match until the entire string is processed. format is a backreference to the matched group. Regex() returns missing if the match fails.
bus|car is the regular expression (in quotation marks because it is a string). The expression means match “bus” or “car”.
The optional third argument in Regex() specifies the result string. The default value, \0, is a backreference to everything that was matched by the regular expression. In the preceding example, the word “bus” is matched in sentence. Because the default third argument is \0, Regex() returns only the matched text, “bus”, rather than the entire sentence.
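As a rough illustration of the default \0 behavior, here is a minimal sketch in Python's re module rather than JSL. The sample sentence is hypothetical, and Python's group(0) and None stand in for JMP's \0 backreference and missing value.

```python
import re

# Hypothetical sample text; the original example sentence is not shown here.
sentence = "I took the bus to work"

# group(0) is the whole match, the counterpart of the \0 backreference.
match = re.search(r"bus|car", sentence)
whole_match = match.group(0) if match else None

# A failed search yields None, loosely analogous to Regex() returning missing.
failed = re.search(r"bus|car", "I walked to work")

print(whole_match)
print(failed)
```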
The (.*) before and after bus are part of the regular expression. The parentheses create a capturing group. The . matches any character. The * matches zero or more of the previous expression. As a result, the first parenthesis pair matches everything before bus, and the second parenthesis pair matches everything after bus. The third argument, \1 car \2, reassembles the text; it leaves out bus and substitutes car.
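The same reassembly idea can be sketched in Python's re module (an analogue, not JSL). The sentence is a hypothetical sample, and the template omits the spaces around car because the captured groups in this sample already carry them.

```python
import re

# Hypothetical sample text.
sentence = "I took the bus to work"

# Two capturing groups: everything before "bus" and everything after it.
m = re.search(r"(.*)bus(.*)", sentence)

# expand() fills the \1 and \2 backreferences in the template.
rebuilt = m.expand(r"\1car\2") if m else None

print(rebuilt)
```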
GLOBALREPLACE changes the behavior of Regex(). If the match succeeds, the entire source string is returned with substitutions made for each place where the pattern matches. If there are no matches, an unchanged source string is returned.
The \w* matches zero or more word characters and becomes backreference 1 because of the parentheses. bus|car becomes backreference 2 because of the parentheses. The third argument, bicycle (not \2) that was \1, describes how to build the substitution text for the part of the source text that was matched.
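For comparison, a minimal sketch of the same global substitution using Python's re.sub (an analogue of GLOBALREPLACE, not JSL). The source string is a hypothetical sample.

```python
import re

# Hypothetical sample text with two places where the pattern matches.
source = "I took the red bus and the blue car"

pattern = r"(\w*) (bus|car)"                   # \1 = preceding word, \2 = bus or car
template = r"bicycle (not \2) that was \1"     # substitution text from the passage

# re.sub replaces every match and returns the whole string.
replaced = re.sub(pattern, template, source)

# With no match, the source string comes back unchanged.
unchanged = re.sub(pattern, template, "I walked")

print(replaced)
print(unchanged)
```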
Regex Match() returns an empty list with zero elements if the match fails. If the match succeeds, the first list element is the text of the entire match (backreference 0). The second list element is the text that matches backreference 1, and so on.
Unlike Regex(), Regex Match() is case insensitive by default. Include MATCHCASE for a case-sensitive match. Include NULL as the third argument if you want to match case but there is no replacement text.
The \w+ matches one or more word characters. The \S* matches zero or more characters that are not whitespace. In the resulting JSL list, the field names (person, id, friend, favorite) and their corresponding values (Fred, 77, "", tea) are separate strings.
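The list-of-groups result can be approximated in Python's re module as follows. This is a sketch, not JSL: the input string and the helper name match_list are assumptions made for illustration, and re.IGNORECASE mimics the case-insensitive default of Regex Match().

```python
import re

def match_list(pattern, text):
    """Hypothetical helper: [whole match, group 1, group 2, ...] or [] on failure."""
    m = re.search(pattern, text, re.IGNORECASE)
    return [m.group(0), *m.groups()] if m else []

# Assumed input; the original source string is not shown in this section.
record = "person=Fred id=77 friend= favorite=tea"

# Four name=value pairs; \S* may match nothing, as it does for friend=.
pattern = r"(\w+)=(\S*) (\w+)=(\S*) (\w+)=(\S*) (\w+)=(\S*)"

fields = match_list(pattern, record)
print(fields)
print(match_list(pattern, "no fields here"))
```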
If the first argument to Regex Match() is a variable and a third argument specifies the replacement value, the matched text is replaced in the variable.
Regex() and Regex Match() match a pattern in a given string but return different results. To transform your string into another string, use Regex(). To identify the substrings that match specific parts of the pattern, use Regex Match().
This example shows the efficiency of Regex Match() compared to Regex(). The source is a list of six strings. The goal is to extract portions of those six strings into the subject, verb, and object columns of a data table (Final Data Table).
Final Data Table
Regex Match() returns {"the cat was chased by the dog", "cat", "chased", "dog"} in a single try with each answer in a separate string. Compare this example to a similar one using Regex(), which returns one answer at a time and builds the final string using backreferences.
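A minimal Python sketch of the single-pass extraction follows (an analogue of Regex Match(), not JSL). Only the first sentence comes from the text above. The second sentence, the pattern, and the name extract_row are hypothetical, because the original six strings and pattern are not shown here.

```python
import re

# Hypothetical pattern for passive sentences of the form
# "the <subject> was <verb> by the <object>".
PATTERN = re.compile(r"the (\w+) was (\w+) by the (\w+)", re.IGNORECASE)

def extract_row(sentence):
    """Return one table row as a dict, or None when the sentence does not match."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    subject, verb, obj = m.groups()  # all three answers from a single match
    return {"subject": subject, "verb": verb, "object": obj}

row = extract_row("the cat was chased by the dog")
no_row = extract_row("it rained all day")  # hypothetical non-matching input

print(row)
print(no_row)
```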
Special Characters in Regular Expressions describes the special characters and provides examples.
<\/a> interprets the forward slash literally in the end HTML anchor tag.
\n matches a newline character.
^apple matches “apple” at the beginning of a string.
apple$ matches “apple” at the end of a string.
.apple matches any single character and then “apple”.
(apple|orange|banana) matches “apple”, “orange”, or “banana”.
apple (pie)? matches “apple ” followed by zero or one instance of “pie”.
(apple|orange|banana) matches “apple”, “orange”, or “banana”.
^(\w+) matches the beginning of a line and then one or more word characters.
[\s\d] matches a whitespace character or a digit.
[a-z0-9] matches “a” through “z” and numbers “0” through “9”.
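apple{3} matches “appl” followed by exactly three “e” characters, because the quantifier applies only to the preceding element. To repeat the whole word three times, group it: (apple){3}.
apple{3,} repeats the preceding element at least three times, as many times as possible.
apple{3,10} repeats the preceding element at least three times but no more than 10 times.
Append a question mark to indicate repeating as few times as possible. For example, apple{3,}? repeats the preceding element at least three times, as few times as possible.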
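The counted quantifiers can be checked with a short sketch in Python's re module, which handles {n}, {n,}, and {n,m} the same way for these cases. The test strings are made up for illustration. The sketch relies on the fact that a quantifier binds only to the element immediately before it.

```python
import re

# {3} applies to the final "e" only; grouping repeats the whole word.
single_char = re.fullmatch(r"apple{3}", "appleee") is not None
whole_word = re.fullmatch(r"(apple){3}", "appleappleapple") is not None

# Greedy {3,} takes as many as possible; reluctant {3,}? takes as few as possible.
greedy = re.match(r"a{3,}", "aaaaa").group(0)
reluctant = re.match(r"a{3,}?", "aaaaa").group(0)

# {3,10} stops at the upper bound even when more characters are available.
bounded = re.match(r"a{3,10}", "a" * 12).group(0)

print(single_char, whole_word, greedy, reluctant, len(bounded))
```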
The backslash in a regular expression escapes a special character so that it is interpreted literally. You also escape certain letters that represent common character classes, such as \w for a word character or \s for a whitespace character. The following example matches word characters (alphanumeric and underscores) and spaces.
(\w+)
Escaped Characters describes the escaped characters supported in JMP. \C, \G, \X, and \z are not supported.
word character [a-zA-Z0-9_]
\x00-\xFF
\x{0000}-\x{FFFF}
The ?, *, and + operators are greedy by default. They match as many instances of the preceding element as possible. Appending ? to any of them makes it reluctant: ?? matches 0 instances, then 1 if needed; +? matches 1 and then additional characters only if needed; *? matches 0 and then additional characters only if needed.
The following example starts at the letter n and compares it to the first \d (digits) in the pattern. No digit matches. Because the pattern does not begin with ^ (start of line), the matcher advances to u. The process repeats until the 3 matches the first \d and the 2 matches the second \d.
The preceding example begins much the same, but as soon as the 3 is found and the \d matches, the + greedily matches the 2 and the 4.
Here, + requires at least one match of a digit character, but ? changes it from “as many as possible” to “as few as possible”. It stopped after the 3 because the pattern was satisfied.
In the greedy example above, the matcher greedily matched 3, 2, and 4 for the first \d+. The matcher then had to give back the 4 so that the second \d+ could match something. The reluctant example followed a different path to get a different answer. Initially, the second value was 2, but the pattern could not match the period to the 4, so the second \d+? reluctantly matched the 4 as well.
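The greedy and reluctant paths described above can be traced with Python's re module, which backtracks in the same way for these patterns. The source string "number=324." is an assumption reconstructed from the description (digits 3, 2, 4 followed by a period). The original example may differ.

```python
import re

# Assumed source string: letters first, then the digits 3, 2, 4 and a period.
source = "number=324."

two_digits = re.search(r"\d\d", source).group(0)       # first two digits found
greedy_run = re.search(r"\d+", source).group(0)        # + takes every digit
reluctant_one = re.search(r"\d+?", source).group(0)    # +? stops after one digit

# Greedy: the first \d+ takes 3, 2, 4, then gives back the 4 for the second group.
greedy_pair = re.search(r"(\d+)(\d+)\.", source).groups()

# Reluctant: the second \d+? starts with 2 but must also take 4 to reach the period.
reluctant_pair = re.search(r"(\d+?)(\d+?)\.", source).groups()

print(two_digits, greedy_run, reluctant_one, greedy_pair, reluctant_pair)
```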
\1 might contain “quick” eventually, after the .+ grabs the million characters to the end of the string and then gives them up, one at a time, until “fox” is found. If there is more than one “fox”, it will be the last fox, not this one. To speed it up and make sure we get the first fox, add the ? operator.
The reluctant .+? advances one character at a time to get past “quick” and find the first “fox”. This method is much faster than going too far.
Typically, the + or * operator is applied to a more restrictive expression such as \d* to match a run of digits, and greedy is faster than reluctant.
The greedy .* finds the last fox after backing up.
The reluctant .*? stops on the first fox.
The greedy .* has to back up a lot. There is no second fox.
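The difference between the three cases above can be sketched in Python's re module. The sentences are hypothetical and short; with a very long string, the greedy version would do far more backing up.

```python
import re

# Hypothetical text with two occurrences of "fox".
two_foxes = "the quick brown fox jumps over the lazy fox"

# Greedy .* runs to the end, then backs up to the LAST fox.
greedy_prefix = re.search(r"(.*)fox", two_foxes).group(1)

# Reluctant .*? grows one character at a time and stops at the FIRST fox.
reluctant_prefix = re.search(r"(.*?)fox", two_foxes).group(1)

# With a single fox, greedy backs up past all trailing text to reach it.
one_fox = "the quick brown fox jumps over the lazy dog"
greedy_single = re.search(r"(.*)fox", one_fox).group(1)

print(repr(greedy_prefix))
print(repr(reluctant_prefix))
print(repr(greedy_single))
```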
A regular expression can consist of patterns grouped in parentheses, also known as capturing groups. In ([a-zA-Z])\s([0-9]), ([a-zA-Z]) is the first capturing group; ([0-9]) is the second capturing group.
Use a backreference to replace the pattern matched by a capturing group. In Perl, these groups are represented by the special variables $1, $2, $3, and so on. ($1 indicates text matched by the first parenthetical group.) In JMP, use a backslash followed by the group number (\1, \2, \3).
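As an illustration, Python's re module uses the same backslash-number form for backreferences in replacement text. This sketch swaps the two groups from the pattern above, and the input string is a made-up sample.

```python
import re

# Group 1 is a letter, group 2 is a digit, separated by one whitespace character.
pattern = r"([a-zA-Z])\s([0-9])"

# \2 and \1 in the replacement refer back to the captured groups.
swapped = re.sub(pattern, r"\2 \1", "x 7")

print(swapped)
```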
Lookaround assertions check for a pattern but do not return that pattern in the results. Lookaheads look forward for a pattern. Lookbehinds look back for a pattern.
Negative lookaheads check that a pattern does not follow the current position in the string. (?!...) indicates a negative lookahead. The following expression matches a comma not followed by a number or space and replaces the pattern with a comma and space:
Positive lookaheads check that a pattern follows the current position in the string. (?=...) indicates a positive lookahead. The following expression has the same result as the preceding negative lookahead but matches a comma followed by any lowercase character:
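Both lookahead forms can be sketched in Python's re module, which supports the same (?!...) and (?=...) syntax. The input string and the exact patterns are assumptions based on the descriptions above, not the original JSL examples.

```python
import re

# Hypothetical input: one comma before a letter, one before a space, one before a digit.
text = "red,green, blue,42"

# Negative lookahead: a comma NOT followed by a digit or whitespace.
negative = re.sub(r",(?![\d\s])", ", ", text)

# Positive lookahead: a comma that IS followed by a lowercase letter.
positive = re.sub(r",(?=[a-z])", ", ", text)

# The lookahead text itself is never consumed, so only the comma is replaced.
print(negative)
print(positive)
```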
Here is another way to get the same result using a backreference substitution. ((ssn=)|(salary=)) is the capturing group. "\1" is the backreference to that group.