Publication date: 08/13/2020

Greedy and Reluctant Regular Expressions

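The ?, *, and + operators are greedy by default: each matches as many occurrences of the preceding element as possible. Appending ? to one of these operators makes it reluctant, so that it matches as few occurrences as possible. ?? matches 0 occurrences, then 1 if needed; +? matches 1 occurrence and then additional ones only if needed; *? matches 0 occurrences and then additional ones only if needed.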
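As a minimal sketch of the ?? and *? forms described above, the following compares each greedy operator with its reluctant counterpart. The input string "324" and the variable names are chosen only for illustration.

```jsl
// The leading \d always matches the 3; the operator after the second \d
// decides whether any further digits are taken.
greedyOptional = Regex( "324", "\d\d?" );      // ? takes one more digit when it can
reluctantOptional = Regex( "324", "\d\d??" );  // ?? takes none unless it has to
greedyStar = Regex( "324", "\d\d*" );          // * takes every remaining digit
reluctantStar = Regex( "324", "\d\d*?" );      // *? takes none unless it has to
Show( greedyOptional, reluctantOptional, greedyStar, reluctantStar );
// greedyOptional = "32";
// reluctantOptional = "3";
// greedyStar = "324";
// reluctantStar = "3";
```

Nothing follows the reluctant operator in these patterns, so it is never forced to match more than its minimum.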

The following example starts at the letter n and tests it against the first \d (a digit) in the pattern. The n is not a digit, so the match fails at that position. Because the pattern does not begin with ^ (start of line), the matcher advances to u and tries again. The process repeats until the 3 matches the first \d and the 2 matches the second \d.

Regex( "number=32.5", "\d\d" );

"32"

Change the pattern to use the greedy + (match one or more).

Regex( "number=324.5", "\d+" );

"324"

The preceding example begins much the same, but as soon as the 3 is found and the \d matches, the + greedily matches the 2 and the 4.

Usually, the greedy behavior makes pattern matching faster because the string is consumed sooner. Sometimes a reluctant behavior is better. Adding the ? after the * or + changes them from greedy to reluctant.

Regex( "number=324.5", "\d+?" );

"3"

Here, + requires at least one match of a digit character, but ? changes it from “as many as possible” to “as few as possible”. The match stops after the 3 because the pattern is already satisfied.
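“As few as possible” still means enough for the whole pattern to match. As a small sketch, adding a literal period after the reluctant \d+? forces it to keep extending until the period can match. The variable names are illustrative only.

```jsl
// With nothing after it, \d+? is satisfied by a single digit.
alone = Regex( "number=324.5", "\d+?" );
// With \. after it, \d+? must extend through 3, 2, and 4 before the period matches.
beforePeriod = Regex( "number=324.5", "\d+?\." );
Show( alone, beforePeriod );
// alone = "3";
// beforePeriod = "324.";
```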

Compare the following results.

Greedy:

Regex( "number=324.5", "(\d+)(\d+)\.", "first=\1 second=\2" );

"first=32 second=4"

Reluctant:

Regex( "number=324.5", "(\d+?)(\d+?)\.", "first=\1 second=\2" );

"first=3 second=24"

In the greedy example above, the matcher greedily matched 3, 2, and 4 for the first \d+. The matcher then had to give back the 4 so that the second \d+ could match something. The reluctant example followed a different path to get a different answer. Initially, the second value was 2, but the pattern could not match the period to the 4, so the second \d+? reluctantly matched the 4 as well.
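To see both paths side by side, the following sketch uses Regex Match(), which returns a list that holds the whole matched text followed by each captured group. It assumes the same source string and patterns as the two examples above. Both patterns match the same overall text and differ only in how the digits are divided between the groups.

```jsl
greedy = Regex Match( "number=324.5", "(\d+)(\d+)\." );
reluctant = Regex Match( "number=324.5", "(\d+?)(\d+?)\." );
Show( greedy, reluctant );
// greedy = {"324.", "32", "4"};
// reluctant = {"324.", "3", "24"};
```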

Use the Reluctant Match for Speed

The greedy and reluctant matches usually produce the same result but not always. See the previous section. One reason you might need the reluctant match is for speed. Suppose that you have a million-character string that begins “The quick fox…” and you want to find the word before “fox”. You might write the following expression and expect \1 to contain “quick”.

The (.+) fox

\1 might contain “quick” eventually, after the .+ grabs the million characters to the end of the string and then gives them up, one at a time, until “fox” is found. If there is more than one “fox”, the match ends at the last one, not the first. To speed up the match and make sure that it ends at the first “fox”, add the ? operator.

The (.+?) fox

With the ?, the .+ extends one character at a time, just far enough to get past “quick” and find the first “fox”. This method is much faster than running to the end of the string and backing up.
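The difference can be measured with a rough timing sketch like the one below. The filler text, the variable names, and the number of copies are assumptions made for illustration. Actual timings depend on the machine and the JMP version, so none are shown.

```jsl
// Hypothetical test data: one sentence followed by about a million
// characters of filler that never contains " fox".
nCopies = 40000;  // 40000 copies of a 25-character phrase; reduce if the greedy match is too slow
text = "The quick fox" || Repeat( " and then some more words", nCopies );

t0 = HP Time();  // HP Time() reports microseconds
greedy = Regex( text, "The (.+) fox", "\1" );
t1 = HP Time();
reluctant = Regex( text, "The (.+?) fox", "\1" );
t2 = HP Time();

Show( greedy, reluctant );
Show( (t1 - t0) / 1e6, (t2 - t1) / 1e6 );  // elapsed seconds, greedy then reluctant
```

Because this string contains only one “fox”, both patterns should capture the same word. Only the amount of work differs.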

Typically, the + or * operator is applied to a more restrictive expression than the . wildcard, such as \d in \d* to match a run of digits. In that case, greedy is usually faster than reluctant.

When the string contains only one “fox”, greedy and reluctant eventually get the same answer, and choosing the right operator simply speeds up the match. The right one might be greedy, or it might be reluctant. It depends on what is being matched.

The greedy .* finds the last fox after backing up.

Regex(
	"The quick fox saw another fox eating grapes",
	"The (.*) fox",
	"\1"
);

"quick fox saw another"

The reluctant .*? stops on the first fox.

Regex(
	"The quick fox saw another fox eating grapes",
	"The (.*?) fox",
	"\1"
);

"quick"

When there is no second fox, the greedy .* runs to the end of the string and has to back up a long way to find the only fox.

Regex(
	"The quick fox saw another animal eating grapes",
	"The (.*) fox",
	"\1"
);

"quick"

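The greedy word character match is an even better choice for this problem. Because \w cannot match a space, \w* stops at the end of the word and never runs past the first fox.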

Regex(
	"The quick fox saw another fox eating grapes",
	"The (\w*) fox",
	"\1"
);

"quick"
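The same restriction means that \w* captures exactly one word. As a sketch with a hypothetical input that has two words between “The” and “fox”, the pattern above finds no match, and Regex() returns a missing value. Dropping the leading “The ” lets the pattern find the single word in front of “fox”.

```jsl
// "The (\w*) fox" needs exactly one word between "The " and " fox".
noMatch = Regex( "The quick brown fox", "The (\w*) fox", "\1" );
// Without the leading "The ", the match can start at any word.
wordBeforeFox = Regex( "The quick brown fox", "(\w*) fox", "\1" );
Show( Is Missing( noMatch ), wordBeforeFox );
// Is Missing(noMatch) = 1;
// wordBeforeFox = "brown";
```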
