[Tut] Python Regex Or – A Simple Illustrated Guide


<div><p>This tutorial is all about the <strong>or | operator of Python’s <a rel="noreferrer noopener" target="_blank" href="https://docs.python.org/3/library/re.html">re library</a>.</strong> You can also play the tutorial video while you read:</p>
<figure class="wp-block-embed-youtube wp-block-embed is-type-rich is-provider-embed-handler wp-embed-aspect-16-9 wp-has-aspect-ratio">
<div class="wp-block-embed__wrapper">
<div class="ast-oembed-container"><iframe title="Python Regex Or – A Simple Illustrated Guide" width="1100" height="619" src="https://www.youtube.com/embed/wx9PGvSxQRs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>
</div>
</figure>
<h2>What’s the Python Regex Or | Operator?</h2>
<p><strong>Say you are given a string and your goal is to find all substrings that match either the string <code>'iPhone'</code> or the string <code>'iPad'</code>. How can you achieve this?</strong></p>
<p><strong>The easiest way is the regex or operator <code>|</code>, used in the regular expression pattern <code>(iPhone|iPad)</code>. </strong></p>
<p><strong>Here’s an example:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> text = 'Buy now: iPhone only $399 with free iPad'
>>> re.findall('(iPhone|iPad)', text)
['iPhone', 'iPad']</pre>
<p>You have the (salesy) text that contains both strings <code>'iPhone'</code> and <code>'iPad'</code>. </p>
<p>You use the re.findall() method. In case you don’t know it, here’s the definition from the <a href="https://blog.finxter.com/python-re-findall/">Finxter blog article</a>:</p>
<p><strong><em>The re.findall(pattern, string) method finds all occurrences of the pattern in the string and returns a list of all matching substrings.</em></strong></p>
<p><a href="https://blog.finxter.com/python-re-findall/">Please consult the blog article to learn everything you need to know about this fundamental Python method.</a></p>
<p>The first argument is the pattern <code>(iPhone|iPad)</code>. It either matches the first part right in front of the or symbol <code>|</code>—which is <code>iPhone</code>—or the second part after it—which is <code>iPad</code>. </p>
<p>The second argument is the text <code>'Buy now: iPhone only $399 with free iPad'</code> which you want to search for the pattern. </p>
<p>The result shows that there are two matching substrings in the text: <code>'iPhone'</code> and <code>'iPad'</code>. </p>
<h2>Python Regex Or: Examples</h2>
<p>Let’s study some more examples to teach you all the possible uses and border cases—one after another.</p>
<p>You start with the previous example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> text = 'Buy now: iPhone only $399 with free iPad'
>>> re.findall('(iPhone|iPad)', text)
['iPhone', 'iPad']</pre>
<p>What happens if you don’t use the parentheses?</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> text = 'iPhone iPhone iPhone iPadiPad'
>>> re.findall('(iPhone|iPad)', text)
['iPhone', 'iPhone', 'iPhone', 'iPad', 'iPad']
>>> re.findall('iPhone|iPad', text)
['iPhone', 'iPhone', 'iPhone', 'iPad', 'iPad']</pre>
<p>In the second example, you just skipped the parentheses using the regex pattern <code>iPhone|iPad</code> rather than <code>(iPhone|iPad)</code>. But no problem–it still works and generates the exact same output!</p>
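<p>The parentheses start to matter once the alternation is only one part of a larger pattern. The following is a minimal sketch with a made-up sample text; it assumes nothing beyond the standard <code>re</code> module:</p>

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'Buy now: free iPhone or free iPad'

# Without parentheses, | splits the WHOLE pattern:
# either 'free iPhone' or just 'iPad'.
whole = re.findall('free iPhone|iPad', text)

# A non-capturing group (?:...) limits the alternation to the product name.
grouped = re.findall('free (?:iPhone|iPad)', text)

# With a capturing group, findall() returns only the group's content.
captured = re.findall('free (iPhone|iPad)', text)

print(whole)     # ['free iPhone', 'iPad']
print(grouped)   # ['free iPhone', 'free iPad']
print(captured)  # ['iPhone', 'iPad']
```

<p>So the or operator has the lowest precedence of all regex operators: everything to its left and right belongs to one alternative unless a group says otherwise.</p>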
<p>But what happens if you leave one side of the or operation empty?</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('iPhone|', text)
['iPhone', '', 'iPhone', '', 'iPhone', '', '', '', '', '', '', '', '', '', '']
</pre>
<p>The output is not as strange as it seems. The or operator allows empty operands. At each position, the engine first tries the non-empty alternative on the left. Only if that fails does it fall back to the empty alternative, which matches anywhere. That is why you get an empty string at every remaining position where <code>'iPhone'</code> does not match.</p>
<p>The previous example also shows that the engine still matches the non-empty string if possible. But what if the trivial empty match is on the left side of the or operator?</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('|iPhone', text)
['', 'iPhone', '', '', 'iPhone', '', '', 'iPhone', '', '', '', '', '', '', '', '', '', '']</pre>
<p>This shows some subtleties of the regex engine. The engine scans the text from left to right, and at each position it also tries the alternatives from left to right. Here, the empty left alternative succeeds immediately at every position. After an empty match, the engine does not report a second empty match at the same position, so it tries the right alternative there. That is how the non-empty string <code>'iPhone'</code> still gets matched wherever it occurs.</p>
<p>Think of it this way: the regex engine moves from left to right, one position at a time, and reports an empty match at each position. Because an empty match consumes no characters, the engine can try again at the same position, this time looking for a non-empty match. A non-empty match does consume its substring, and those characters cannot be matched again. That’s why the first match is the empty string and the second match is the substring <code>'iPhone'</code>. </p>
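<p>To see where each match sits, you can ask for the match positions instead of only the matched strings. This is a small sketch that assumes Python 3.7 or newer, where a non-empty match may start right after an empty one; older versions handle empty matches differently. The sample text <code>'iPhone!'</code> is made up for illustration:</p>

```python
import re

# Hypothetical short text so the positions are easy to follow.
text = 'iPhone!'

# Collect (start, end, matched text) for every match.
right_empty = [(m.start(), m.end(), m.group())
               for m in re.finditer('iPhone|', text)]
left_empty = [(m.start(), m.end(), m.group())
              for m in re.finditer('|iPhone', text)]

print(right_empty)  # [(0, 6, 'iPhone'), (6, 6, ''), (7, 7, '')]
print(left_empty)   # [(0, 0, ''), (0, 6, 'iPhone'), (6, 6, ''), (7, 7, '')]
```

<p>With the empty alternative on the left, position 0 yields two matches: first the empty string, then <code>'iPhone'</code> starting at the same position.</p>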
<h2>How to Nest the Python Regex Or Operator?</h2>
<p>Okay, you’re not easily satisfied, are you? Let’s try nesting the Python regex or operator <code>|</code>. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> text = 'xxx iii zzz iii ii xxx'
>>> re.findall('xxx|iii|zzz', text)
['xxx', 'iii', 'zzz', 'iii', 'xxx']</pre>
<p>So you can use multiple or operators in a row. Of course, you can also use the grouping (parentheses) operator to nest an arbitrarily complicated construct of or operations:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('x(i|(zz|ii|(x| )))', text)
[('x', 'x', 'x'), (' ', ' ', ' '), ('x', 'x', 'x')]</pre>
<p>But this seldom leads to clean and readable code, and it can usually be avoided by putting a bit of thought into your regex design.</p>
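<p>One possible cleanup, sketched here under the assumption that you only care about the matched text and not about the individual groups, is to flatten the nested alternatives into a single non-capturing group:</p>

```python
import re

text = 'xxx iii zzz iii ii xxx'

# Nested capturing groups: findall() returns one tuple per match.
nested = re.findall('x(i|(zz|ii|(x| )))', text)

# Same alternatives in the same order, flattened into one
# non-capturing group: findall() returns the full matches.
flat = re.findall('x(?:i|zz|ii|x| )', text)

print(nested)  # [('x', 'x', 'x'), (' ', ' ', ' '), ('x', 'x', 'x')]
print(flat)    # ['xx', 'x ', 'xx']
```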
<h2>Python Regex Or: Character Class</h2>
<p>If you only want to match a single character out of a set of characters, the character class is a much better way of doing it:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> text = 'hello world'
>>> re.findall('[abcdefghijklmnopqrstuvwxyz]+', text)
['hello', 'world']</pre>
<p>A shorter and more concise version would be to use the range operator within character classes:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[a-z]+', text)
['hello', 'world']</pre>
<p>The character class is enclosed in the bracket notation <code>[ ]</code> and it literally means “match exactly one of the symbols in the class”. Thus, it carries the same semantics as the or operator <code>|</code>. However, if you try to do something along those lines…</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+', text)
['o', 'd']</pre>
<p>… you’ll, first, write much less concise code and, second, risk getting confused by the output. The reason is that the parentheses form a capturing group: they capture the position and substring that matches the regex. When a pattern contains one capturing group, findall() returns only the content of that group, and for a repeated group this is the last repetition it matched. This turns out to be the last character of the word <code>'hello'</code> and the last character of the word <code>'world'</code>. </p>
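<p>If you do want to keep the alternation, a non-capturing group <code>(?:...)</code> avoids the surprise. The sketch below uses a shortened alternation that covers just the letters of the sample text:</p>

```python
import re

text = 'hello world'

# Capturing group: findall() returns only the last repetition of the group.
captured = re.findall('(h|e|l|o|w|r|d)+', text)

# Non-capturing group: findall() returns the whole match.
whole = re.findall('(?:h|e|l|o|w|r|d)+', text)

print(captured)  # ['o', 'd']
print(whole)     # ['hello', 'world']
```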
<h2>How to Match the Or Character (Vertical Line ‘|’)?</h2>
<p>So if the character <code>'|'</code> stands for the <strong>or </strong>character in a given regex, the question arises how to match the vertical line symbol <code>'|'</code> itself?</p>
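<p>The answer is simple: escape the or character in your regular expression with a backslash. In particular, use the pattern <code>A\|B</code> instead of <code>A|B</code> to match the string <code>'A|B'</code> itself. In Python source code, it is safest to write such a pattern as a raw string, <code>r'A\|B'</code>, so that the backslash reaches the regex engine unchanged. Here’s an example:</p>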
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('A|B', 'AAAA|BBBB')
['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
>>> re.findall(r'A\|B', 'AAAA|BBBB')
['A|B']</pre>
<p>Do you really understand the outputs of this code snippet? In the first example, you’re searching for either character <code>'A'</code> or character <code>'B'</code>. In the second example, you’re searching for the string <code>'A|B'</code> (which contains the <code>'|'</code> character).</p>
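<p>There are a couple of other ways to match a literal vertical line. The sketch below compares the backslash escape with a one-character class and with <code>re.escape()</code>, which escapes regex special characters in a string for you:</p>

```python
import re

text = 'AAAA|BBBB'

# Backslash escape, written as a raw string.
escaped = re.findall(r'A\|B', text)

# Inside a character class, | is an ordinary character.
in_class = re.findall('A[|]B', text)

# re.escape() builds the escaped pattern from a literal string.
auto_pattern = re.escape('A|B')
auto = re.findall(auto_pattern, text)

print(escaped)       # ['A|B']
print(in_class)      # ['A|B']
print(auto_pattern)  # A\|B
print(auto)          # ['A|B']
```

<p><code>re.escape()</code> is handy when the text to match comes from user input and may contain special characters you did not anticipate.</p>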
<h2>Python Regex And</h2>
<p>If there’s a Python regex “or”, there must also be an “and” operator, right?</p>
<p>Correct! But think about it for a moment: say, you want one regex to occur alongside another regex. In other words, you want to match regex A and regex B. So what do you do? You match regex AB.</p>
<p>You’ve already seen many examples of the “Python regex AND” operator—but here’s another one: </p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('AB', 'AAAACAACAABAAAABAAC')
['AB', 'AB']</pre>
<p>The simple concatenation of regex A and B already performs an implicit “and operation”. </p>
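<p>Concatenation fixes the order: A must come first and B directly after it. If you instead need “both occur somewhere, in any order”, one common approach (not covered in this tutorial, so treat it as a side note) is to chain lookaheads. The helper name <code>contains_both</code> below is made up for this sketch:</p>

```python
import re

def contains_both(text):
    # Hypothetical helper: True if 'iPhone' and 'iPad' both occur
    # somewhere in the text, in any order. Each lookahead checks one
    # condition from the start of the string without consuming anything.
    pattern = '^(?=.*iPhone)(?=.*iPad)'
    return re.search(pattern, text, flags=re.DOTALL) is not None

# Concatenation with a gap: 'iPhone' must come before 'iPad'.
ordered = re.search('iPhone.*iPad', 'free iPad with every iPhone')

print(ordered)                                       # None
print(contains_both('free iPad with every iPhone'))  # True
print(contains_both('iPhone only'))                  # False
```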
<h2>Python Regex Not</h2>
<p>How can you search a string for substrings that do NOT match a given pattern? In other words, what’s the “negative pattern” in Python regular expressions?</p>
<p>The answer is two-fold:</p>
<ul>
<li>If you want to match all characters except a set of specific characters, you can use the negative character class <code>[^...]</code>. </li>
<li>If you want to match all substrings except the ones that match a regex pattern, you can use the feature of <a href="https://www.regular-expressions.info/lookaround.html">negative lookahead</a> <code>(?!...)</code>. </li>
</ul>
<p>Here’s an example for the negative character class:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('[^a-m]', 'aaabbbaababmmmnoopmmaa')
['n', 'o', 'o', 'p']</pre>
<p>And here’s an example for the negative lookahead pattern to match all “words that are not followed by words”:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[a-z]+(?![a-z]+)', 'hello world')
['hello', 'world']</pre>
<p>The negative lookahead <code>(?![a-z]+)</code> doesn’t consume (<em>match</em>) any character. It just checks whether the pattern <code>[a-z]+</code> does NOT match at a given position. After a greedy <code>[a-z]+</code>, this holds in only two places: just before the space and at the end of the string.</p>
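<p>Here is a slightly more practical sketch of both “not” variants. The product list and the vowel example are made up for illustration:</p>

```python
import re

# Negative lookahead: words starting with 'iP' that do NOT continue with 'ad'.
products = 'iPhone iPad iPod'
not_ipad = re.findall(r'iP(?!ad)\w+', products)

# Negative character class: runs of characters that are neither
# vowels nor spaces.
consonants = re.findall('[^aeiou ]+', 'hello world')

print(not_ipad)    # ['iPhone', 'iPod']
print(consonants)  # ['h', 'll', 'w', 'rld']
```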
<h2>[Collection] What Are The Different Python Re Quantifiers?</h2>
<p>The “and”, “or”, and “not” operators are not the only regular expression operators you need to understand. So what are other operators?</p>
<p>Next, you’ll get a quick and dirty overview of the most important regex operations and how to use them in Python. Here are the most important regex quantifiers:</p>
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td><strong>Quantifier</strong></td>
<td><strong>Description</strong></td>
<td><strong>Example</strong></td>
</tr>
<tr>
<td><code>.</code></td>
<td>The <strong>wild-card</strong> (‘dot’) matches any character in a string except the newline character ‘\n’.</td>
<td>Regex ‘...’ matches all words with three characters such as ‘abc’, ‘cat’, and ‘dog’.</td>
</tr>
<tr>
<td><code>*</code></td>
<td>The <strong>zero-or-more</strong> asterisk matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex.</td>
<td>Regex ‘cat*’ matches the strings ‘ca’, ‘cat’, ‘catt’, ‘cattt’, and ‘catttttttt’.</td>
</tr>
<tr>
<td><code>?</code></td>
<td>The <strong>zero-or-one</strong> matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. </td>
<td>Regex ‘cat?’ matches both strings ‘ca’ and ‘cat’ — but not ‘catt’, ‘cattt’, and ‘catttttttt’.</td>
</tr>
<tr>
<td><code>+</code></td>
<td>The <strong>at-least-one</strong> matches one or more occurrences of the immediately preceding regex. </td>
<td>Regex ‘cat+’ does not match the string ‘ca’ but matches all strings with at least one trailing character ‘t’ such as ‘cat’, ‘catt’, and ‘cattt’.</td>
</tr>
<tr>
<td><code>^</code></td>
<td>The <strong>start-of-string</strong> matches the beginning of a string. </td>
<td>Regex ‘^p’ matches the strings ‘python’ and ‘programming’ but not ‘lisp’ and ‘spying’ where the character ‘p’ does not occur at the start of the string.</td>
</tr>
<tr>
<td><code>$</code></td>
<td>The <strong>end-of-string</strong> matches the end of a string. </td>
<td>Regex ‘py$’ would match the strings ‘main.py’ and ‘pypy’ but not the strings ‘python’ and ‘pypi’.</td>
</tr>
<tr>
<td><code>A|B</code></td>
<td>The <strong>OR</strong> matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. </td>
<td>Regex ‘(hello)|(hi)’ matches strings ‘hello world’ and ‘hi python’. It wouldn’t make sense to try to match both of them at the same time.</td>
</tr>
<tr>
<td><code>AB</code></td>
<td>The <strong>AND</strong> matches first the regex A and second the regex B, in this sequence. </td>
<td>We’ve already seen it trivially in the regex ‘ca’ that matches first regex ‘c’ and second regex ‘a’.</td>
</tr>
</tbody>
</table>
</figure>
<p>Note that I gave the above operators some more meaningful names (in bold) so that you can immediately grasp the purpose of each regex. For example, the ‘^’ operator is usually denoted as the ‘caret’ operator. Those names are not descriptive so I came up with more kindergarten-like words such as the “start-of-string” operator.</p>
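<p>You can check the table rows yourself. The sketch below uses two small helpers whose names (<code>fully_matches</code> and <code>found</code>) are made up here: one tests whether a pattern matches a whole string, the other whether it matches anywhere in the string.</p>

```python
import re

def fully_matches(pattern, string):
    # True if the pattern matches the entire string.
    return re.fullmatch(pattern, string) is not None

def found(pattern, string):
    # True if the pattern matches anywhere in the string.
    return re.search(pattern, string) is not None

star = [fully_matches('cat*', s) for s in ('ca', 'cat', 'cattt')]
optional = [fully_matches('cat?', s) for s in ('ca', 'cat', 'catt')]
plus = [fully_matches('cat+', s) for s in ('ca', 'cat', 'catt')]
caret = [found('^p', s) for s in ('python', 'lisp', 'spying')]
dollar = [found('py$', s) for s in ('main.py', 'pypy', 'python', 'pypi')]
either = [found('(hello)|(hi)', s) for s in ('hello world', 'hi python')]

print(star)      # [True, True, True]
print(optional)  # [True, True, False]
print(plus)      # [False, True, True]
print(caret)     # [True, False, False]
print(dollar)    # [True, True, False, False]
print(either)    # [True, True]
```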
<p>We’ve already seen many examples but let’s dive into even more!</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''

print(re.findall('.a!', text))
'''
Finds all occurrences of an arbitrary character that is
followed by the character sequence 'a!'.
['Ha!']
'''

print(re.findall('is.*and', text))
'''
Finds all occurrences of the word 'is',
followed by an arbitrary number of characters
and the word 'and'.
['is settled, and']
'''

print(re.findall('her:?', text))
'''
Finds all occurrences of the word 'her',
followed by zero or one occurrences of the colon ':'.
['her:', 'her', 'her']
'''

print(re.findall('her:+', text))
'''
Finds all occurrences of the word 'her',
followed by one or more occurrences of the colon ':'.
['her:']
'''

print(re.findall('^Ha.*', text))
'''
Finds all occurrences where the string starts with
the character sequence 'Ha', followed by an arbitrary
number of characters except for the new-line character.
Can you figure out why Python doesn't find any?
[]
'''

print(re.findall('\n$', text))
'''
Finds all occurrences where the new-line character '\n'
occurs at the end of the string.
['\n']
'''

print(re.findall('(Life|Death)', text))
'''
Finds all occurrences of either the word 'Life' or the
word 'Death'.
['Life', 'Death']
'''
</pre>
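<p>About the puzzle with <code>'^Ha.*'</code> above: a likely explanation is that the triple-quoted text begins with a line break, so the string does not actually start with <code>'Ha'</code>. The sketch below assumes exactly that and shows two ways around it, using a shortened copy of the text:</p>

```python
import re

# Shortened copy of the text; note the line break right after
# the opening quotes.
text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
'''

# By default, ^ matches only at the very start of the string,
# which here is a new-line character.
default = re.findall('^Ha.*', text)

# With re.MULTILINE, ^ also matches right after every new-line.
multiline = re.findall('^Ha.*', text, flags=re.MULTILINE)

# Alternatively, strip the surrounding whitespace first.
stripped = re.findall('^Ha.*', text.strip())

print(default)    # []
print(multiline)  # ["Ha! let me see her: out, alas! he's cold:"]
print(stripped)   # ["Ha! let me see her: out, alas! he's cold:"]
```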
<p>In these examples, you’ve already seen the special symbol ‘\n’ which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions. Next, we’ll discover the most important special symbols.</p>
<h2>Related Re Methods</h2>
<p>There are seven important regular expression methods which you must master:</p>
<ul>
<li>The <strong>re.findall(pattern, string)</strong> method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li>
<li>The <strong>re.search(pattern, string)</strong> method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>.</li>
<li>The <strong>re.match(pattern, string)</strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li>
<li>The <strong>re.fullmatch(pattern, string)</strong> method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>.</li>
<li>The <strong>re.compile(pattern)</strong> method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in <a href="https://blog.finxter.com/python-regex-compile/">our blog tutorial</a>.</li>
<li>The <strong>re.split(pattern, string)</strong> method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in <a href="https://blog.finxter.com/python-regex-split/">our blog tutorial</a>.</li>
<li>The <strong>re.sub(pattern, repl, string, count=0, flags=0)</strong> method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in <a href="https://blog.finxter.com/python-regex-sub/">our blog tutorial</a>.</li>
</ul>
<p>These seven methods are 80% of what you need to know to get started with Python’s regular expression functionality.</p>
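<p>As a quick tour, here is a sketch that runs the or pattern from this tutorial through each of these methods. It compiles the pattern once and then calls the methods on the compiled regex object, which mirror the module-level functions listed above:</p>

```python
import re

text = 'Buy now: iPhone only $399 with free iPad'
pattern = re.compile('iPhone|iPad')

all_matches = pattern.findall(text)     # every match as a string
first = pattern.search(text)            # first match anywhere in the text
at_start = pattern.match(text)          # match only at the beginning
whole = pattern.fullmatch('iPad')       # match covering the whole string
parts = pattern.split(text)             # text pieces between the matches
replaced = pattern.sub('gadget', text)  # every match replaced

print(all_matches)   # ['iPhone', 'iPad']
print(first.group()) # iPhone
print(at_start)      # None
print(whole.group()) # iPad
print(parts)         # ['Buy now: ', ' only $399 with free ', '']
print(replaced)      # Buy now: gadget only $399 with free gadget
```

<p>Note the empty string at the end of the split result: the text ends with a match, so nothing is left after it.</p>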
<h2>Where to Go From Here?</h2>
<p>You’ve learned everything you need to know about the <strong><em>Python Regex Or</em></strong> Operator. </p>
<p><em><strong>Summary</strong>: </em></p>
<p><strong>Say you are given a string and your goal is to find all substrings that match either the string <code>'iPhone'</code> or the string <code>'iPad'</code>. How can you achieve this?</strong></p>
<p><strong>The easiest way is the regex or operator <code>|</code>, used in the regular expression pattern <code>(iPhone|iPad)</code>. </strong></p>
</div>


https://www.sickgaming.net/blog/2020/02/...ted-guide/