Create an account


Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Tut] Python Regex Greedy vs Non-Greedy Quantifiers

#1
Python Regex Greedy vs Non-Greedy Quantifiers

Ready to earn the black belt of your regex superpower? This tutorial shows you the subtle but important difference between greedy and non-greedy regex quantifiers.

But first things first: what are “quantifiers” anyway? Great question – I’m glad you asked! So let’s dive into Python’s three main regex quantifiers.

Python Regex Quantifiers


The word “quantifier” originates from latin: it’s meaning is quantus = how much / how often.

This is precisely what a regular expression quantifier means: you tell the regex engine how often you want to match a given pattern.

If you think you don’t define any quantifier, you do it implicitly: no quantifier means to match the regular expression exactly once.

So what are the regex quantifiers in Python?


Quantifier Meaning
A? Match regular expression A zero or one times
A* Match regular expression A zero or more times
A+ Match regular expression A one or more times
A{m} Match regular expression A exactly m times
A{m,n} Match regular expression A between m and n times (included)

Note that in this tutorial, I assume you have at least a remote idea of what regular expressions actually are. If you haven’t, no problem, check out my detailed regex tutorial on this blog.

You see in the table that the quantifiers ?, *, +, {m}, and {m,n} define how often you repeat the matching of regex A.

Let’s have a look at some examples—one for each quantifier:

>>> import re
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a{3}', 'aaaa')
['aaa']
>>> re.findall('a{1,2}', 'aaaa')
['aa', 'aa']

In each line, you try a different quantifier on the same text 'aaaa'. And, interestingly, each line leads to a different output:

  • The zero-or-one regex 'a?' matches four times one 'a'. Note that it doesn’t match zero characters if it can avoid doing so.
  • The zero-or-more regex 'a*' matches once four 'a's and consumes them. At the end of the string, it can still match the empty string.
  • The one-or-more regex 'a+' matches once four 'a's. In contrast to the previous quantifier, it cannot match an empty string.
  • The repeating regex 'a{3}' matches up to three 'a's in a single run. It can do so only once.
  • The repeating regex 'a{1,2}' matches one or two 'a's. It tries to match as many as possible.

You’ve learned the basic quantifiers of Python regular expressions. Now, it’s time to explore the meaning of the term greedy. Shall we?

Python Regex Greedy Match


A greedy match means that the regex engine (the one which tries to find your pattern in the string) matches as many characters as possible.

For example, the regex 'a+' will match as many 'a's as possible in your string 'aaaa'. Although the substrings 'a', 'aa', 'aaa' all match the regex 'a+', it’s not enough for the regex engine. It’s always hungry and tries to match even more.


In other words, the greedy quantifiers give you the longest match from a given position in the string.

As it turns out, all default quantifiers ?, *, +, {m}, and {m,n} you’ve learned above are greedy: they “consume” or match as many characters as possible so that the regex pattern is still satisfied.

Here are the above examples again that all show how greedy the regex engine is:

>>> import re
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a{3}', 'aaaa')
['aaa']
>>> re.findall('a{1,2}', 'aaaa')
['aa', 'aa']

In all cases, a shorter match would also be valid. But as the regex engine is greedy per default, those are not enough for the regex engine.

Okay, so how can we do a non-greedy match?

Python Regex Non-Greedy Match


A non-greedy match means that the regex engine matches as few characters as possible—so that it still can match the pattern in the given string.

For example, the regex 'a+?' will match as few 'a's as possible in your string 'aaaa'. Thus, it matches the first character 'a' and is done with it. Then, it moves on to the second character (which is also a match) and so on.

In other words, the non-greedy quantifiers give you the shortest possible match from a given position in the string.

You can make the default quantifiers ?, *, +, {m}, and {m,n} non-greedy by appending a question mark symbol '?' to them: ??, *?, +?, and {m,n}?. they “consume” or match as few characters as possible so that the regex pattern is still satisfied.

Here are some examples that show how non-greedy matching works:

Non-Greedy Question Mark Operator (??)


Let’s start with the question mark (zero-or-one operator):

>>> import re
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a??', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']

In the first instance, you use the zero-or-one regex 'a?'. It’s greedy so it matches one 'a' character if possible.

In the second instance, you use the non-greedy zero-or-one version 'a??'. It matches zero 'a's if possible. Note that it moves from left to right so it matches the empty string and “consumes” it. Only then, it cannot match the empty string anymore so it is forced to match the first 'a' character. But after that, it’s free to match the empty string again. This pattern of first matching the empty string and only then matching the 'a' if it is absolutely needed repeats. That’s why this strange pattern occurs.

Non-Greedy Asterisk Operator (*?)


Let’s start with the asterisk (zero-or-more operator):

>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a*?', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']

First, you use the zero-or-more asterisk regex 'a*'. It’s greedy so it matches as many 'a' characters as it can.

Second, you use the non-greedy zero-or-one version 'a*?'. Again, it matches zero 'a's if possible. Only if it has already matched zero characters at a certain position, it matches one character at that position, “consumes” it, and moves on.

Non-Greedy Plus Operator (+?)


Let’s start with the plus (one-or-more operator):

>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a+?', 'aaaa')
['a', 'a', 'a', 'a']

First, you use the one-or-more plus regex 'a+'. It’s greedy so it matches as many 'a' characters as it can (but at least one).

Second, you use the non-greedy one-or-more version 'a+?'. In this case, the regex engine matches only one character 'a', consumes it, and moves on with the next match.

Let’s summarize what you’ve learned so far:

Greedy vs Non-Greedy Match – What’s the Difference?


Given a pattern with a quantifier (e.g. the asterisk operator) that allows the regex engine to match the pattern multiple times.

A given string may match the regex in multiple ways. For example, both substrings 'a' and 'aaa' are valid matches when matching the pattern 'a*' in the string 'aaaa'.

So the difference between the greedy and the non-greedy match is the following: The greedy match will try to match as many repetitions of the quantified pattern as possible. The non-greedy match will try to match as few repetitions of the quantified pattern as possible.

Examples Greedy vs Non-Greedy Match


Let’s consider a range of examples that help you understand the difference between greedy and non-greedy matches in Python:

>>> import re
>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a+?', 'aaaa')
['a', 'a', 'a', 'a']
>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a*?', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a??', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']

Make sure you completely understand those examples before you move on. If you don’t, please read the previous paragraphs again.

Which is Faster: Greedy vs Non-Greedy?


Considering that greedy quantifiers match a maximal and non-greedy a minimal number of patterns, is there any performance difference?

Great question!

Indeed, some benchmarks suggest that there’s a significant performance difference: the greedy quantifier is 100% slower in realistic experiments on benchmark data.

So if you optimize for speed and you don’t care about greedy or non-greedy matches—and you don’t know anything else—go for the non-greedy quantifier!

However, the truth is not as simple. For example, consider the following basic experiment that falsifies the previous hypothesis that the non-greedy version is faster:

>>> import timeit
>>> timeit.timeit('import re;re.findall("a*", "aaaaaaaaaaaa")')
1.0579840000000331
>>> timeit.timeit('import re;re.findall("a*?", "aaaaaaaaaaaa")')
3.7830938000000742

I used the speed testing tool timeit that allows to throw in some simple Python statements and check how long they run. Per default, the passed statement is executed 1,000,000 times.

You can see a notable performance difference of more than 300%! The non-greedy version is three times slower than the greedy version.

Why is that?

The reason is the re.findall() method that returns a list of matching substrings. Here’s the output both statements would produce:

>>> re.findall("a*", "aaaaaaaaaaaa")
['aaaaaaaaaaaa', '']
>>> re.findall("a*?", "aaaaaaaaaaaa")
['', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '']

You can see that the greedy version finds one match and is done with it. The non-greedy version finds 25 matches which leads to far more processing and memory overhead.

So what happens if you use the re.search() method that returns only the first match rather than the re.findall() method that returns all matches?

>>> timeit.timeit('import re;re.search("a*", "aaaaaaaaaaaa")')
0.8420328999998219
>>> timeit.timeit('import re;re.search("a*?", "aaaaaaaaaaaa")')
0.7955709000000297

As expected, this changes things again. Both regex searches yield a single result, but the non-greedy match is much shorter: it matches the empty string '' rather than the whole string 'aaaaaaaaaaaa'. Of course, this is a bit faster.

However, the difference is negligible in this minimal example.

There’s More: Greedy, Docile, Lazy, Helpful, Possessive Match


In this article, I’ve classified the regex world into greedy and non-greedy quantifiers. But you can differentiate the “non-greedy” class even more!

Next, I’ll give you a short overview based on this great article of the most important terms in this regard:

  • Greedy: match as many instances of the quantified pattern as you can.
  • Docile: match as many instances of the quantified pattern as long as it still matches the overall pattern—if this is possible. Note that what I called “greedy” in this article is really “docile”.
  • Lazy: match as few instances of the quantified pattern as needed. This is what I called “non-greedy” in this article.
  • Possessive: never gives up a partial match. So the regex engine may not even find a match that actually exist—just because it’s so greedy. This is very unusual and you won’t see it a lot in practice.

If you want to learn more about those, I’d recommend that you read this excellent online tutorial.

Where to Go From Here


Summary: You’ve learned that the greedy quantifiers ?, *, and + match as many repetitions of the quantified pattern as possible. The non-greedy quantifiers ??, *?, and +? match as few repetitions of the quantified pattern as possible.

If you want to master Python and regular expressions, join my free email academy—it’s fun!



https://www.sickgaming.net/blog/2020/02/...antifiers/
Reply



Possibly Related Threads…
Thread Author Replies Views Last Post
  [Tut] Python Tuple to Integer xSicKxBot 0 1 10 hours ago
Last Post: xSicKxBot
  [Tut] Python List max() xSicKxBot 0 3 Yesterday, 11:01 AM
Last Post: xSicKxBot
  [Tut] What Does “if __name__ == ‘__main__’” Do in Python? xSicKxBot 0 7 07-01-2020, 08:31 AM
Last Post: xSicKxBot
  [Tut] The Most Pythonic Way to Compare Two Lists in Python xSicKxBot 0 7 06-28-2020, 04:18 PM
Last Post: xSicKxBot
  [Tut] How to Test Multiple Variables Against a Value in Python? xSicKxBot 0 9 06-23-2020, 08:00 AM
Last Post: xSicKxBot
  [Tut] The Most Pythonic Way to Check if a File Exists in Python xSicKxBot 0 19 06-20-2020, 09:22 AM
Last Post: xSicKxBot
  [Tut] How to Get a List Slice with Arbitrary Indices in Python? xSicKxBot 0 20 06-18-2020, 03:52 PM
Last Post: xSicKxBot
  [Tut] How to Use Python’s Join() on a List of Objects (Not Strings)? xSicKxBot 0 20 06-17-2020, 02:14 PM
Last Post: xSicKxBot
  [Tut] Python Join List of DataFrames xSicKxBot 0 22 06-16-2020, 10:16 AM
Last Post: xSicKxBot
  [Tut] Python Join List Slice xSicKxBot 0 20 06-14-2020, 06:37 AM
Last Post: xSicKxBot

Forum Jump:

[-]
Upcoming Events

[-]
Discord

[-]
Latest Threads
News - Loot Boxes Are The Same As Gambli...
Last Post: xSicKxBot
Today 07:06 PM
» Replies: 0
» Views: 0
News - Madden NFL 21's Closed Beta Begin...
Last Post: xSicKxBot
Today 07:06 PM
» Replies: 0
» Views: 0
[Tut] Python Tuple to Integer
Last Post: xSicKxBot
Today 12:13 PM
» Replies: 0
» Views: 1
(Indie Deal) Crazy Sausage Bundle & Metr...
Last Post: xSicKxBot
Today 12:13 PM
» Replies: 0
» Views: 1
News - NBA 2K21’s Legend Edition Will Pa...
Last Post: xSicKxBot
Today 12:12 PM
» Replies: 0
» Views: 1
News - Hyper Scape Impressions: Is Ubiso...
Last Post: xSicKxBot
Today 12:12 PM
» Replies: 0
» Views: 0
News - This Week At Bungie – 7/2/2020
Last Post: xSicKxBot
Today 06:51 AM
» Replies: 0
» Views: 2
Steam - Weekend Deal – The Elder Scrolls...
Last Post: xSicKxBot
Today 06:49 AM
» Replies: 0
» Views: 3
Xbox Wire - Xbox Insider Release Notes –...
Last Post: xSicKxBot
Today 06:49 AM
» Replies: 0
» Views: 1
Substance Alchemist Hands-On
Last Post: xSicKxBot
Today 05:40 AM
» Replies: 0
» Views: 2

[-]
Twitter

[-]
Sponsored
Get the Deal of the Week at RefurBees.com

Copyright © SickGaming.net 2012-2019