[Tut] The Python Re Plus (+) Symbol in Regular Expressions


<div><p>This article is all about the <strong>plus “+” symbol in Python’s <a rel="noreferrer noopener" target="_blank" href="https://docs.python.org/3/library/re.html">re library</a>. </strong>Study it carefully and master this important piece of knowledge once and for all!</p>
<h2>What’s the Python Re + Quantifier?</h2>
<p>Say you have any regular expression <strong>A</strong>. The regular expression (regex) <strong>A+</strong> then matches one or more occurrences of <strong>A</strong>. We call the “+” symbol the at-least-once quantifier because it requires at least one occurrence of the preceding regex. For example, the regular expression <strong>‘yes+’</strong> matches the strings <strong>‘yes’</strong>, <strong>‘yess’</strong>, and <strong>‘yesssssss’</strong>. But it matches neither the string <strong>‘ye’</strong> nor the empty string <strong>‘’</strong>, because the plus quantifier <strong>+</strong> does not apply to the whole regex <strong>‘yes’</strong> but only to the preceding regex <strong>‘s’</strong>, which must occur at least once after <strong>‘ye’</strong>. </p>
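<p>The claim above can be checked with a short sketch. It uses <code>re.fullmatch()</code> so that only whole-string matches count, and it adds a grouped variant <code>(yes)+</code> (not part of the original example) to show how parentheses change what the quantifier applies to.</p>

```python
import re

# fullmatch() succeeds only if the entire string fits the pattern.
candidates = ['yes', 'yess', 'yesssssss', 'ye', '']
plus_on_s = [s for s in candidates if re.fullmatch('yes+', s)]
print(plus_on_s)  # ['yes', 'yess', 'yesssssss']

# Parentheses make the quantifier apply to the whole group 'yes'
# instead of only the single character 's'.
grouped = [s for s in ['yes', 'yesyes', 'yess'] if re.fullmatch('(yes)+', s)]
print(grouped)  # ['yes', 'yesyes']
```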
<p>Let’s study some examples to help you gain a deeper understanding.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('a+b', 'aaaaaab')
['aaaaaab']
>>> re.findall('ab+', 'aaaaaabb')
['abb']
>>> re.findall('ab+', 'aaaaaabbbbb')
['abbbbb']
>>> re.findall('ab+?', 'aaaaaabbbbb')
['ab']
>>> re.findall('ab+', 'aaaaaa')
[]
>>> re.findall('[a-z]+', 'hello world')
['hello', 'world']</pre>
<p>Next, we’ll explain those examples one by one.</p>
<h2>Examples 1 and 2: Greedy Plus (+) Quantifiers</h2>
<p>Here’s the first example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('a+b', 'aaaaaab')
['aaaaaab']</pre>
<p>You use the re.findall() method. In case you don’t know it, here’s the definition from the <a href="https://blog.finxter.com/python-re-findall/">Finxter blog article</a>:</p>
<p><strong>The re.findall(pattern, string) method finds all occurrences of the pattern in the string and returns a list of all matching substrings.</strong></p>
<p><a href="https://blog.finxter.com/python-re-findall/">Please consult the blog article to learn everything you need to know about this fundamental Python method.</a></p>
<p>The first argument is the regular expression pattern <strong>‘a+b’</strong> and the second argument is the string to be searched. In plain English, you want to find all substrings that start with at least one, but possibly many, characters ‘a’, followed by the character ‘b’. </p>
<p>The findall() method returns the matching substring: <strong>‘aaaaaab’</strong>. The plus quantifier + is greedy. This means that it tries to match as many occurrences of the preceding regex as possible. So in our case, it matches as many ‘a’ characters as possible while still allowing the rest of the pattern to match. Therefore, the regex engine “consumes” the whole string.</p>
<p>The second example is similar:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('ab+', 'aaaaaabb')
['abb']</pre>
<p>You search for the character ‘a’ followed by at least one character ‘b’. As the plus (+) quantifier is greedy, it matches as many ‘b’s as it can lay its hands on.</p>
<h2>Examples 3 and 4: Non-Greedy Plus (+) Quantifiers</h2>
<p>But what if you want to match at least one occurrence of a regex in a non-greedy manner? In other words, you don’t want the regex engine to consume as much as it can; you want it to return as soon as the pattern is matched.</p>
<p>Again, here’s the example of the greedy match:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('ab+', 'aaaaaabbbbb')
['abbbbb']</pre>
<p>The regex engine starts with the first character ‘a’ and finds that it’s a partial match. So, it moves on to match the second ‘a’—which violates the pattern ‘ab+’ that allows only for a single character ‘a’. So it moves on to the third character, and so on, until it reaches the last character ‘a’ in the string ‘aaaaaabbbbb’. It’s a partial match, so it moves on to the first occurrence of the character ‘b’. It realizes that the ‘b’ character can be matched by the part of the regex ‘b+’. Thus, the engine starts matching ‘b’s. And it greedily matches ‘b’s until it cannot match any further character. At this point it looks at the result and sees that it has found a matching substring which is the result of the operation.</p>
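<p>To see where the engine ends up after this scan, here is a minimal sketch that uses <code>re.search()</code> and the match object’s <code>span()</code>. The string is built from explicit counts (six ‘a’s and five ‘b’s) so the indices are easy to verify.</p>

```python
import re

text = 'a' * 6 + 'b' * 5  # 'aaaaaabbbbb'
match = re.search('ab+', text)

# The match starts at the last 'a' (index 5) and greedily
# runs through all five 'b' characters to the end of the string.
start, end = match.span()
print(start, end)      # 5 11
print(match.group())   # abbbbb
```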
<p>However, it could have stopped far earlier to produce a non-greedy match after matching the first character ‘b’. Here’s an example of the non-greedy quantifier ‘+?’ (both symbols together form one regex expression).</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('ab+?', 'aaaaaabbbbb')
['ab']</pre>
<p>Now, the regex engine does not greedily “consume” as many ‘b’ characters as possible. Instead, it stops as soon as the pattern is matched (non-greedy).</p>
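<p>The difference matters most when something follows the quantifier. The sketch below uses a made-up HTML-like string (an assumption for illustration, not taken from the examples above) and also shows that a non-greedy quantifier still grows when the rest of the pattern requires it.</p>

```python
import re

text = '<b>bold</b>'

# Greedy: '.+' runs to the last '>' in the string.
greedy = re.findall('<.+>', text)
# Non-greedy: '.+?' stops at the first '>' it can reach.
lazy = re.findall('<.+?>', text)
print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']

# 'b+?' prefers a single 'b', but must take all three here,
# otherwise the trailing 'c' could not be matched.
forced = re.findall('ab+?c', 'abbbc')
print(forced)  # ['abbbc']
```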
<h2>Examples 5 and 6</h2>
<p>For the sake of your thorough understanding, let’s have a look at the other given example: </p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('ab+', 'aaaaaa')
[]</pre>
<p>You can see that the plus (+) quantifier requires at least one occurrence of the preceding regex. In the example, the string contains no character ‘b’, so the part ‘b+’ can never be matched. The result is the empty list, indicating that no matching substring was found.</p>
<p>Another interesting example is the following:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[a-z]+', 'hello world')
['hello', 'world']</pre>
<p>You use the plus (+) quantifier in combination with a character class that defines specifically which characters are valid matches. </p>
<p><em><strong>Note Character Class</strong>: Within the character class, you can define character ranges. For example, the character range [a-z] matches one lowercase character in the alphabet while the character range [A-Z] matches one uppercase character in the alphabet. </em></p>
<p>The space character is not part of the given character class [a-z], so it won’t be matched in the text. Thus, the result is the list of maximal runs of lowercase letters: ‘hello’ and ‘world’.</p>
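<p>A small sketch, using a sample string chosen here for illustration, shows how the choice of character class changes what <code>+</code> collects. Note that [a-z] skips uppercase letters, so a capitalized word loses its first character.</p>

```python
import re

text = 'Hello World 42'

lower_runs = re.findall('[a-z]+', text)
letter_runs = re.findall('[A-Za-z]+', text)
digit_runs = re.findall('[0-9]+', text)

print(lower_runs)   # ['ello', 'orld']
print(letter_runs)  # ['Hello', 'World']
print(digit_runs)   # ['42']
```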
<h2>What If You Want to Match the Plus (+) Symbol Itself?</h2>
<p>You know that the plus quantifier matches at least one of the preceding regular expression. But what if you search for the plus (+) symbol itself? How can you search for it in a string?</p>
<p>The answer is simple: escape the plus symbol in your regular expression using the backslash. In particular, use ‘\+’ instead of ‘+’. Here’s an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> text = '2 + 2 = 4'
>>> re.findall(' + ', text)
[]
>>> re.findall(r' \+ ', text)
[' + ']
>>> re.findall(r' \++ ', '2 ++++ 2 = 4')
[' ++++ ']</pre>
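<p>If you want to find the ‘+’ symbol in your string, you need to escape it with a backslash. If you don’t, the Python regex engine interprets it as the normal “at-least-once” quantifier. That is why the unescaped pattern ‘ + ’ finds nothing here: it asks for a space, repeated at least once, followed by another space, and the text never contains two spaces in a row. Of course, you can combine the escaped plus symbol ‘\+’ with the “at-least-once” quantifier to search for at least one occurrence of the plus symbol.</p>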
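<p>Besides the backslash, there are two other common ways to match a literal plus; the following sketch shows all three side by side. Writing the pattern as a raw string (<code>r'...'</code>) is a suggestion here, since it keeps Python from treating the backslash as a string escape.</p>

```python
import re

text = '2 ++++ 2 = 4'

# 1) Backslash escape, written as a raw string.
with_backslash = re.findall(r'\++', text)

# 2) Inside a character class, '+' has no special meaning.
with_class = re.findall('[+]+', text)

# 3) re.escape() builds the escaped form for you.
escaped = re.escape('+')
with_escape = re.findall(escaped + '+', text)

print(with_backslash)  # ['++++']
print(with_class)      # ['++++']
print(escaped)         # \+
print(with_escape)     # ['++++']
```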
<h2>[Collection] What Are The Different Python Re Quantifiers?</h2>
<p>The plus quantifier—Python re +—is only one of many regex operators. If you want to use (and understand) regular expressions in practice, you’ll need to know all of them by heart!</p>
<p>So let’s dive into the other operators:</p>
<p>A regular expression is a decades-old concept in computer science. Regular expressions were invented in the 1950s by the famous mathematician Stephen Cole Kleene, and the decades of evolution since then have brought a huge variety of operations. Collecting all operations and writing up a comprehensive list would result in a very thick and unreadable book by itself.</p>
<p>Fortunately, you don’t have to learn all regular expressions before you can start using them in your practical code projects. Next, you’ll get a quick and dirty overview of the most important regex operations and how to use them in Python. In follow-up chapters, you’ll then study them in detail — with many practical applications and code puzzles.</p>
<p>Here are the most important regex quantifiers:</p>
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td><strong>Quantifier</strong></td>
<td><strong>Description</strong></td>
<td><strong>Example</strong></td>
</tr>
<tr>
<td><code>.</code></td>
<td>The <strong>wild-card</strong> (‘dot’) matches any character in a string except the newline character ‘\n’.</td>
<td>Regex ‘...’ matches all words with three characters such as ‘abc’, ‘cat’, and ‘dog’.</td>
</tr>
<tr>
<td><code>*</code></td>
<td>The <strong>zero-or-more</strong> asterisk matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex.</td>
<td>Regex ‘cat*’ matches the strings ‘ca’, ‘cat’, ‘catt’, ‘cattt’, and ‘catttttttt’.</td>
</tr>
<tr>
<td><code>?</code></td>
<td>The <strong>zero-or-one</strong> matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. </td>
<td>Regex ‘cat?’ matches both strings ‘ca’ and ‘cat’ — but not ‘catt’, ‘cattt’, and ‘catttttttt’.</td>
</tr>
<tr>
<td><code>+</code></td>
<td>The <strong>at-least-one</strong> matches one or more occurrences of the immediately preceding regex. </td>
<td>Regex ‘cat+’ does not match the string ‘ca’ but matches all strings with at least one trailing character ‘t’ such as ‘cat’, ‘catt’, and ‘cattt’.</td>
</tr>
<tr>
<td><code>^</code></td>
<td>The <strong>start-of-string</strong> matches the beginning of a string. </td>
<td>Regex ‘^p’ matches the strings ‘python’ and ‘programming’ but not ‘lisp’ and ‘spying’ where the character ‘p’ does not occur at the start of the string.</td>
</tr>
<tr>
<td><code>$</code></td>
<td>The <strong>end-of-string</strong> matches the end of a string. </td>
<td>Regex ‘py$’ would match the strings ‘main.py’ and ‘pypy’ but not the strings ‘python’ and ‘pypi’.</td>
</tr>
<tr>
<td><code>A|B</code></td>
<td>The <strong>OR</strong> matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. </td>
<td>Regex ‘(hello)|(hi)’ matches strings ‘hello world’ and ‘hi python’. It wouldn’t make sense to try to match both of them at the same time.</td>
</tr>
<tr>
<td><code>AB</code></td>
<td>The <strong>AND</strong> matches first the regex A and second the regex B, in this sequence. </td>
<td>We’ve already seen it trivially in the regex ‘ca’ that matches first regex ‘c’ and second regex ‘a’.</td>
</tr>
</tbody>
</table>
</figure>
<p>Note that I gave the above operators some more meaningful names (in bold) so that you can immediately grasp the purpose of each regex. For example, the ‘^’ operator is usually denoted as the ‘caret’ operator. Those names are not descriptive so I came up with more kindergarten-like words such as the “start-of-string” operator.</p>
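<p>The table rows can be checked with a short sketch. The helper names <code>full_matches</code> and <code>contains</code> are hypothetical, introduced only for this illustration; the patterns and sample words come from the table.</p>

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches as a whole."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

def contains(pattern, candidates):
    """Return the candidates in which the pattern matches somewhere."""
    return [w for w in candidates if re.search(pattern, w)]

words = ['ca', 'cat', 'catt', 'cattt']
star = full_matches('cat*', words)      # zero-or-more 't'
optional = full_matches('cat?', words)  # zero-or-one 't'
plus = full_matches('cat+', words)      # at-least-one 't'
print(star)      # ['ca', 'cat', 'catt', 'cattt']
print(optional)  # ['ca', 'cat']
print(plus)      # ['cat', 'catt', 'cattt']

starts_with_p = contains('^p', ['python', 'programming', 'lisp', 'spying'])
ends_with_py = contains('py$', ['main.py', 'pypy', 'python', 'pypi'])
greetings = contains('(hello)|(hi)', ['hello world', 'hi python', 'bye'])
print(starts_with_p)  # ['python', 'programming']
print(ends_with_py)   # ['main.py', 'pypy']
print(greetings)      # ['hello world', 'hi python']

# The dot matches any single character except the newline.
dot_matches_newline = re.fullmatch('.', '\n') is not None
print(dot_matches_newline)  # False
```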
<p>We’ve already seen many examples but let’s dive into even more!</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''

print(re.findall('.a!', text))
'''
Finds all occurrences of an arbitrary character that is
followed by the character sequence 'a!'.
['Ha!']
'''

print(re.findall('is.*and', text))
'''
Finds all occurrences of the word 'is',
followed by an arbitrary number of characters
and the word 'and'.
['is settled, and']
'''

print(re.findall('her:?', text))
'''
Finds all occurrences of the word 'her',
followed by zero or one occurrences of the colon ':'.
['her:', 'her', 'her']
'''

print(re.findall('her:+', text))
'''
Finds all occurrences of the word 'her',
followed by one or more occurrences of the colon ':'.
['her:']
'''

print(re.findall('^Ha.*', text))
'''
Finds all occurrences where the string starts with
the character sequence 'Ha', followed by an arbitrary
number of characters except for the new-line character.
Can you figure out why Python doesn't find any?
[]
'''

print(re.findall('\n$', text))
'''
Finds all occurrences where the new-line character '\n'
occurs at the end of the string.
['\n']
'''

print(re.findall('(Life|Death)', text))
'''
Finds all occurrences of either the word 'Life' or the
word 'Death'.
['Life', 'Death']
'''
</pre>
<p>In these examples, you’ve already seen the special symbol ‘\n’ which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions.</p>
<h2>What’s the Difference Between Python Re + and ? Quantifiers?</h2>
<p>You can read the Python Re A? quantifier as <strong>zero-or-one regex</strong>: the preceding regex A is matched either zero times or exactly once. But it’s not matched more often.</p>
<p>Analogously, you can read the Python Re A+ operator as the <strong>at-least-once regex</strong>: the preceding regex A is matched an arbitrary number of times but at least once (as the name suggests).</p>
<p>Here’s an example that shows the difference:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('ab?', 'abbbbbbb')
['ab']
>>> re.findall('ab+', 'abbbbbbb')
['abbbbbbb']</pre>
<p>The regex ‘ab?’ matches the character ‘a’ in the string, followed by character ‘b’ if it exists (which it does in the code). </p>
<p>The regex ‘ab+’ matches the character ‘a’ in the string, followed by as many characters ‘b’ as possible (and at least one).</p>
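<p>Another way to compare the two quantifiers is to test whole strings with <code>re.fullmatch()</code>. The sketch below, with three sample strings chosen for illustration, records which strings each pattern accepts in full.</p>

```python
import re

samples = ['a', 'ab', 'abb']

# For each pattern, note whether it matches each sample completely.
table = {
    pattern: {s: bool(re.fullmatch(pattern, s)) for s in samples}
    for pattern in ['ab?', 'ab+']
}

print(table['ab?'])  # {'a': True, 'ab': True, 'abb': False}
print(table['ab+'])  # {'a': False, 'ab': True, 'abb': True}
```

<p>Only ‘ab’ is accepted by both: ‘ab?’ rejects a second ‘b’, while ‘ab+’ rejects a missing ‘b’.</p>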
<h2>What’s the Difference Between Python Re * and + Quantifiers?</h2>
<p>You can read the Python Re A* quantifier as <strong>zero-or-more regex</strong>: the preceding regex A is matched an arbitrary number of times.</p>
<p>Analogously, you can read the Python Re A+ operator as the <strong>at-least-once regex</strong>: the preceding regex A is matched an arbitrary number of times too—but at least once.</p>
<p>Here’s an example that shows the difference:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('ab*', 'aaaaaaaa')
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
>>> re.findall('ab+', 'aaaaaaaa')
[]</pre>
<p>The regex ‘ab*’ matches the character ‘a’ in the string, followed by an arbitrary number of occurrences of character ‘b’. The substring ‘a’ perfectly matches this formulation. Therefore, you find that the regex matches eight times in the string.</p>
<p>The regex ‘ab+’ matches the character ‘a’, followed by as many characters ‘b’ as possible—but at least one. However, the character ‘b’ does not exist so there’s no match.</p>
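<p>In practice, the choice between the two often comes down to whether empty input should be accepted. Here is a minimal sketch with two hypothetical validators (the function names are assumptions for this example) that differ only in the quantifier.</p>

```python
import re

def digits_or_empty(s):
    """Accept strings of digits, including the empty string ('*')."""
    return re.fullmatch('[0-9]*', s) is not None

def digits_required(s):
    """Accept strings of digits with at least one digit ('+')."""
    return re.fullmatch('[0-9]+', s) is not None

for s in ['2024', '', '20a4']:
    print(repr(s), digits_or_empty(s), digits_required(s))
# '2024' True True
# '' True False
# '20a4' False False
```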
<h2>What are Python Re <code>*?</code>, <code>+?</code>, <code>??</code> Quantifiers?</h2>
<p>You’ve learned about the three quantifiers:</p>
<ul>
<li>The quantifier A* matches an arbitrary number of patterns A.</li>
<li>The quantifier A+ matches at least one pattern A.</li>
<li>The quantifier A? matches zero-or-one pattern A.</li>
</ul>
<p>Those three are all <strong>greedy</strong>: they match as many occurrences of the pattern as possible. Here’s an example that shows their greediness:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('a*', 'aaaaaaa')
['aaaaaaa', '']
>>> re.findall('a+', 'aaaaaaa')
['aaaaaaa']
>>> re.findall('a?', 'aaaaaaa')
['a', 'a', 'a', 'a', 'a', 'a', 'a', '']</pre>
<p>The code shows that all three quantifiers *, +, and ? match as many ‘a’ characters as possible.</p>
<p>So, the logical question is: how to match as few as possible? We call this <strong>non-greedy </strong>matching. You can append the question mark after the respective quantifiers to tell the regex engine that you intend to match as few patterns as possible: *?, +?, and ??.</p>
<p>Here’s the same example but with the non-greedy quantifiers:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('a*?', 'aaaaaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '']
>>> re.findall('a+?', 'aaaaaaa')
['a', 'a', 'a', 'a', 'a', 'a', 'a']
>>> re.findall('a??', 'aaaaaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '']</pre>
<p>In this case, the code shows that all three quantifiers *?, +?, and ?? match as few ‘a’ characters as possible. </p>
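<p>The <code>findall()</code> output above is a bit noisy because of the empty matches. As a complementary sketch, <code>re.search()</code> returns only the first match, which makes the greedy and non-greedy variants easy to compare on a short string.</p>

```python
import re

text = 'aaa'
patterns = ['a*', 'a+', 'a?', 'a*?', 'a+?', 'a??']

# Map each pattern to the text of its first match.
first = {p: re.search(p, text).group() for p in patterns}

for p in patterns:
    print(p, repr(first[p]))
# a* 'aaa'
# a+ 'aaa'
# a? 'a'
# a*? ''
# a+? 'a'
# a?? ''
```

<p>Note that <code>a+?</code> still returns one character, because the plus quantifier always needs at least one occurrence.</p>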
<h2>Related Re Methods</h2>
<p>There are seven important regular expression methods which you should master:</p>
<ul>
<li>The <strong>re.findall(pattern, string)</strong> method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li>
<li>The <strong>re.search(pattern, string)</strong> method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>.</li>
<li>The <strong>re.match(pattern, string)</strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li>
<li>The <strong>re.fullmatch(pattern, string)</strong> method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>.</li>
<li>The <strong>re.compile(pattern)</strong> method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in <a href="https://blog.finxter.com/python-regex-compile/">our blog tutorial</a>.</li>
<li>The <strong>re.split(pattern, string)</strong> method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in <a href="https://blog.finxter.com/python-regex-split/">our blog tutorial</a>.</li>
<li>The <strong>re.sub(pattern, repl, string, count=0, flags=0)</strong> method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in <a href="https://blog.finxter.com/python-regex-sub/">our blog tutorial</a>.</li>
</ul>
<p>These seven methods are 80% of what you need to know to get started with Python’s regular expression functionality.</p>
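<p>To tie the methods back to the plus quantifier, here is a short sketch that runs all seven with the same pattern <code>[0-9]+</code>. The sample string is made up for this illustration.</p>

```python
import re

text = 'a1bb22ccc333'
pattern = '[0-9]+'  # one or more digits

found = re.findall(pattern, text)           # all matches as strings
first = re.search(pattern, text).group()    # first match anywhere
at_start = re.match(pattern, text)          # None: text starts with 'a'
whole = re.fullmatch(pattern, '2020').group()  # entire string must match
digits = re.compile(pattern)                # reusable regex object
found_again = digits.findall(text)
parts = re.split(pattern, text)             # split along the matches
masked = re.sub(pattern, '#', text)         # replace every match

print(found)        # ['1', '22', '333']
print(first)        # 1
print(at_start)     # None
print(whole)        # 2020
print(found_again)  # ['1', '22', '333']
print(parts)        # ['a', 'bb', 'ccc', '']
print(masked)       # a#bb#ccc#
```

<p>The trailing empty string in the <code>re.split()</code> result appears because the text ends with a match.</p>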
<h2>Where to Go From Here?</h2>
<p>You’ve learned everything you need to know about the plus quantifier + in this regex tutorial. </p>
<p><strong>Summary</strong>: <em>Regex A+ matches one or more occurrences of regex A. The “+” symbol is the at-least-once quantifier because it requires at least one occurrence of the preceding regex. The non-greedy version of the at-least-once quantifier is A+? with the trailing question mark.</em></p>
</div>


https://www.sickgaming.net/blog/2020/01/...pressions/