Create an account


Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Tut] Python Regex Greedy vs Non-Greedy Quantifiers

#1
Python Regex Greedy vs Non-Greedy Quantifiers

<div><p>Ready to earn the black belt of your <a href="https://blog.finxter.com/python-regex/" target="_blank" rel="noreferrer noopener" aria-label="regex superpower (opens in a new tab)">regex superpower</a>? This tutorial shows you the subtle but important difference between greedy and non-greedy regex quantifiers. </p>
<p>But first things first: <strong>what are “quantifiers” anyway?</strong> Great question – I’m glad you asked! So let’s dive into Python’s three main regex quantifiers.</p>
<h2>Python Regex Quantifiers</h2>
<p>The word “<a href="https://www.merriam-webster.com/dictionary/quantity" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">quantifier</a>” originates from latin: it’s meaning is <strong>quantus = how much / how often</strong>.</p>
<p><strong>This is precisely what a regular expression quantifier means: you tell the regex engine how often you want to match a given pattern. </strong></p>
<p>If you think you don’t define any quantifier, you do it implicitly: no quantifier means to match the regular expression exactly once.</p>
<p>So what are the regex quantifiers in Python?</p>
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td>Quantifier</td>
<td>Meaning</td>
</tr>
<tr>
<td><code>A?</code></td>
<td>Match regular expression <code>A</code> zero or one times</td>
</tr>
<tr>
<td><code>A*</code></td>
<td>Match regular expression <code>A</code> zero or more times</td>
</tr>
<tr>
<td><code>A+</code></td>
<td>Match regular expression <code>A</code> one or more times</td>
</tr>
<tr>
<td><code>A{m}</code></td>
<td>Match regular expression <code>A</code> exactly m times</td>
</tr>
<tr>
<td><code>A{m,n}</code></td>
<td>Match regular expression <code>A</code> between m and n times (included)</td>
</tr>
</tbody>
</table>
</figure>
<p>Note that in this tutorial, I assume you have at least a remote idea of what regular expressions actually are. If you haven’t, no problem, check out my <a rel="noreferrer noopener" aria-label="detailed regex tutorial on this blog (opens in a new tab)" href="https://blog.finxter.com/python-regex/" target="_blank">detailed regex tutorial on this blog</a>.</p>
<p>You see in the table that the quantifiers <code>?</code>, <code>*</code>, <code>+</code>, <code>{m}</code>, and <code>{m,n}</code> define how often you repeat the matching of regex <code>A</code>. </p>
<p>Let’s have a look at some examples—one for each quantifier:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a{3}', 'aaaa')
['aaa']
>>> re.findall('a{1,2}', 'aaaa')
['aa', 'aa']</pre>
<p>In each line, you try a different quantifier on the same text <code>'aaaa'</code>. And, interestingly, each line leads to a different output:</p>
<ul>
<li>The <a rel="noreferrer noopener" aria-label="zero-or-one (opens in a new tab)" href="https://blog.finxter.com/python-re-question-mark/" target="_blank">zero-or-one</a> regex <code>'a?'</code> matches four times one <code>'a'</code>. Note that it doesn’t match zero characters if it can avoid doing so.</li>
<li>The <a rel="noreferrer noopener" href="https://blog.finxter.com/python-re-question-mark/" target="_blank">zero-or-more</a> regex <code>'a*'</code> matches once four <code>'a'</code>s and consumes them. At the end of the string, it can still match the empty string.</li>
<li>The <a rel="noreferrer noopener" href="https://blog.finxter.com/python-re-question-mark/" target="_blank">one-or-more</a> regex <code>'a+'</code> matches once four <code>'a'</code>s. In contrast to the previous quantifier, it cannot match an empty string.</li>
<li>The repeating regex <code>'a{3}'</code> matches up to three <code>'a'</code>s in a single run. It can do so only once.</li>
<li>The repeating regex <code>'a{1,2}'</code> matches one or two <code>'a'</code>s. It tries to match as many as possible.</li>
</ul>
<p>You’ve learned the basic quantifiers of Python regular expressions. Now, it’s time to explore the meaning of the term greedy. Shall we?</p>
<h2>Python Regex Greedy Match</h2>
<p><strong>A greedy match means that the regex engine (the one which tries to find your pattern in the string) matches as many characters as possible.</strong></p>
<p>For example, the regex <code>'a+'</code> will match as many <code>'a'</code>s as possible in your string <code>'aaaa'</code>. Although the substrings <code>'a'</code>, <code>'aa'</code>, <code>'aaa'</code> all match the regex <code>'a+'</code>, it’s not enough for the regex engine. It’s always hungry and tries to match even more.</p>
<figure class="wp-block-image size-large is-resized"><img src="https://66.media.tumblr.com/afcca057d2243b8de45c61035d32ae9a/tumblr_mxpukxBkoH1qbt7i1o1_640.gifv" alt="" width="176" height="228"/></figure>
<p>In other words, the greedy quantifiers give you the <strong>longest match</strong> from a given position in the string.</p>
<p>As it turns out, all default quantifiers <code>?</code>, <code>*</code>, <code>+</code>, <code>{m}</code>, and <code>{m,n}</code> you’ve learned above are greedy: they “consume” or match as many characters as possible so that the regex pattern is still satisfied.</p>
<p>Here are the above examples again that all show how greedy the regex engine is:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a{3}', 'aaaa')
['aaa']
>>> re.findall('a{1,2}', 'aaaa')
['aa', 'aa']</pre>
<p>In all cases, a shorter match would also be valid. But as the regex engine is greedy per default, those are not enough for the regex engine.</p>
<p>Okay, so how can we do a non-greedy match?</p>
<h2>Python Regex Non-Greedy Match</h2>
<p><strong>A non-greedy match means that the regex engine matches as few characters as possible—so that it still can match the pattern in the given string.</strong></p>
<p>For example, the regex <code>'a+?'</code> will match as few <code>'a'</code>s as possible in your string <code>'aaaa'</code>. Thus, it matches the first character <code>'a'</code> and is done with it. Then, it moves on to the second character (which is also a match) and so on.</p>
<p>In other words, the non-greedy quantifiers give you the <strong>shortest possible match</strong> from a given position in the string.</p>
<p>You can make the default quantifiers <code>?</code>, <code>*</code>, <code>+</code>, <code>{m}</code>, and <code>{m,n}</code> non-greedy by appending a question mark symbol <code>'?'</code> to them: <code>??</code>, <code>*?</code>, <code>+?</code>, and <code>{m,n}?</code>. they “consume” or match as few characters as possible so that the regex pattern is still satisfied.</p>
<p>Here are some examples that show how non-greedy matching works:</p>
<h2>Non-Greedy Question Mark Operator (??)</h2>
<p>Let’s start with the question mark (zero-or-one operator):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a??', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']</pre>
<p>In the first instance, you use the zero-or-one regex <code>'a?'</code>. It’s greedy so it matches one <code>'a'</code> character if possible. </p>
<p>In the second instance, you use the non-greedy zero-or-one version <code>'a??'</code>. It matches zero <code>'a'</code>s if possible. Note that it moves from left to right so it matches the empty string and “consumes” it. Only then, it cannot match the empty string anymore so it is forced to match the first <code>'a'</code> character. But after that, it’s free to match the empty string again. This pattern of first matching the empty string and only then matching the <code>'a'</code> if it is absolutely needed repeats. That’s why this strange pattern occurs.</p>
<h2>Non-Greedy Asterisk Operator (*?)</h2>
<p>Let’s start with the asterisk (zero-or-more operator):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a*?', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']</pre>
<p>First, you use the zero-or-more asterisk regex <code>'a*'</code>. It’s greedy so it matches as many <code>'a'</code> characters as it can. </p>
<p>Second, you use the non-greedy zero-or-one version <code>'a*?'</code>. Again, it matches zero <code>'a'</code>s if possible. Only if it has already matched zero characters at a certain position, it matches one character at that position, “consumes” it, and moves on.</p>
<h2>Non-Greedy Plus Operator (+?)</h2>
<p>Let’s start with the plus (one-or-more operator):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a+?', 'aaaa')
['a', 'a', 'a', 'a']</pre>
<p>First, you use the one-or-more plus regex <code>'a+'</code>. It’s greedy so it matches as many <code>'a'</code> characters as it can (but at least one). </p>
<p>Second, you use the non-greedy one-or-more version <code>'a+?'</code>. In this case, the regex engine matches only one character <code>'a'</code>, consumes it, and moves on with the next match.</p>
<p>Let’s summarize what you’ve learned so far:</p>
<h2>Greedy vs Non-Greedy Match – What’s the Difference?</h2>
<p>Given a pattern with a quantifier (e.g. the <a href="https://blog.finxter.com/python-re-asterisk/" target="_blank" rel="noreferrer noopener" aria-label="asterisk operator (opens in a new tab)">asterisk operator</a>) that allows the regex engine to match the pattern multiple times. </p>
<p>A given string may match the regex in multiple ways. For example, both substrings <code>'a'</code> and <code>'aaa'</code> are valid matches when matching the pattern <code>'a*'</code> in the string <code>'aaaa'</code>. </p>
<p>So the difference between the greedy and the non-greedy match is the following: The greedy match will try to match as many repetitions of the quantified pattern as possible. The non-greedy match will try to match as few repetitions of the quantified pattern as possible. </p>
<h2>Examples Greedy vs Non-Greedy Match</h2>
<p>Let’s consider a range of examples that help you understand the difference between greedy and non-greedy matches in Python:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('a+', 'aaaa')
['aaaa']
>>> re.findall('a+?', 'aaaa')
['a', 'a', 'a', 'a']
>>> re.findall('a*', 'aaaa')
['aaaa', '']
>>> re.findall('a*?', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']
>>> re.findall('a?', 'aaaa')
['a', 'a', 'a', 'a', '']
>>> re.findall('a??', 'aaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '']</pre>
<p>Make sure you completely understand those examples before you move on. If you don’t, please read the previous paragraphs again.</p>
<h2>Which is Faster: Greedy vs Non-Greedy?</h2>
<p>Considering that greedy quantifiers match a maximal and non-greedy a minimal number of patterns, is there any performance difference?</p>
<p>Great question! </p>
<p>Indeed, some benchmarks suggest that there’s a significant performance difference: the greedy quantifier is 100% slower in <a href="https://www.rexegg.com/regex-quantifiers.html">realistic experiments on benchmark data</a>. </p>
<p><strong>So if you optimize for speed and you don’t care about greedy or non-greedy matches—and you don’t know anything else—go for the non-greedy quantifier!</strong></p>
<p>However, the truth is not as simple. For example, consider the following basic experiment that falsifies the previous hypothesis that the non-greedy version is faster:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import timeit
>>> timeit.timeit('import re;re.findall("a*", "aaaaaaaaaaaa")')
1.0579840000000331
>>> timeit.timeit('import re;re.findall("a*?", "aaaaaaaaaaaa")')
3.7830938000000742</pre>
<p>I used the speed testing tool <a rel="noreferrer noopener" aria-label=" (opens in a new tab)" href="https://docs.python.org/3.8/library/timeit.html" target="_blank">timeit </a>that allows to throw in some simple Python statements and check how long they run. Per default, the passed statement is executed 1,000,000 times. </p>
<p>You can see a notable performance difference of more than 300%! The non-greedy version is three times slower than the greedy version. </p>
<p>Why is that?</p>
<p>The reason is the re.findall() method that returns a list of matching substrings. Here’s the output both statements would produce:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall("a*", "aaaaaaaaaaaa")
['aaaaaaaaaaaa', '']
>>> re.findall("a*?", "aaaaaaaaaaaa")
['', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '']</pre>
<p>You can see that the greedy version finds one match and is done with it. The non-greedy version finds 25 matches which leads to far more processing and memory overhead.</p>
<p>So what happens if you use the <a rel="noreferrer noopener" aria-label="re.search() (opens in a new tab)" href="https://blog.finxter.com/python-regex-search/" target="_blank">re.search()</a> method that returns only the first match rather than the <a rel="noreferrer noopener" aria-label="re.findall() (opens in a new tab)" href="https://blog.finxter.com/python-re-findall/" target="_blank">re.findall()</a> method that returns all matches?</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> timeit.timeit('import re;re.search("a*", "aaaaaaaaaaaa")')
0.8420328999998219
>>> timeit.timeit('import re;re.search("a*?", "aaaaaaaaaaaa")')
0.7955709000000297</pre>
<p>As expected, this changes things again. Both regex searches yield a single result, but the non-greedy match is much shorter: it matches the empty string <code>''</code> rather than the whole string <code>'aaaaaaaaaaaa'</code>. Of course, this is a bit faster.</p>
<p>However, the difference is negligible in this minimal example.</p>
<h2>There’s More: Greedy, Docile, Lazy, Helpful, Possessive Match</h2>
<p>In this article, I’ve classified the regex world into greedy and non-greedy quantifiers. But you can differentiate the “non-greedy” class even more!</p>
<p>Next, I’ll give you a short overview based on <a href="https://www.rexegg.com/regex-quantifiers.html">this great article</a> of the most important terms in this regard:</p>
<ul>
<li><strong>Greedy</strong>: match as many instances of the quantified pattern as you can.</li>
<li><strong>Docile</strong>: match as many instances of the quantified pattern as long as it still matches the overall pattern—if this is possible. Note that what I called “greedy” in this article is really “docile”.</li>
<li><strong>Lazy</strong>: match as few instances of the quantified pattern as needed. This is what I called “non-greedy” in this article.</li>
<li><strong>Possessive</strong>: never gives up a partial match. So the regex engine may not even find a match that actually exist—just because it’s so greedy. This is very unusual and you won’t see it a lot in practice.</li>
</ul>
<p>If you want to learn more about those, I’d recommend that you read this excellent online tutorial<a rel="noreferrer noopener" aria-label=". (opens in a new tab)" href="https://www.rexegg.com/regex-quantifiers.html" target="_blank">. </a></p>
<h2>Where to Go From Here</h2>
<p><strong>Summary: </strong>You’ve learned that the greedy quantifiers <code>?</code>, <code>*</code>, and <code>+</code> match as many repetitions of the quantified pattern as possible. The non-greedy quantifiers <code>??</code>, <code>*?</code>, and <code>+?</code> match as few repetitions of the quantified pattern as possible. </p>
<p>If you want to master Python and regular expressions, <a href="https://blog.finxter.com/subscribe/" target="_blank" rel="noreferrer noopener" aria-label="join my free email academy (opens in a new tab)">join my free email academy</a>—it’s fun!</p>
</div>


https://www.sickgaming.net/blog/2020/02/...antifiers/
Reply



Forum Jump:


Users browsing this thread:
1 Guest(s)

Forum software by © MyBB Theme © iAndrew 2016