Create an account


Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Tut] Python Re Dot

#1
Python Re Dot

<div><p>You’re about to learn one of the most frequently used regex operators: the dot regex <strong>.</strong> in Python’s <a href="https://docs.python.org/3/library/re.html">re library</a>.</p>
<h2>What’s the Dot Regex in Python’s Re Library?</h2>
<p>The dot regex <code>.</code> matches all characters except the newline character. For example, the regular expression <strong>‘…’</strong> matches strings <strong>‘hey’ </strong>and <strong>‘tom’</strong>. But it does not match the string <strong>‘yo\ntom’</strong> which contains the newline character <strong>‘\n’</strong>.</p>
<p>Let’s study some basic examples to help you gain a deeper understanding.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> >>> text = '''But then I saw no harm, and then I heard
Each syllable that breath made up between them.'''
>>> re.findall('B..', text)
['But']
>>> re.findall('heard.Each', text)
[]
>>> re.findall('heard\nEach', text)
['heard\nEach']
>>> </pre>
<p>You first import Python’s re library for regular expression handling. Then, you create a multi-line text using the triple string quotes.</p>
<p>Let’s dive into the first example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('B..', text)
['But']</pre>
<p>You use the re.findall() method. Here’s the definition from the <a href="https://blog.finxter.com/python-re-findall/">Finxter blog article</a>:</p>
<p><strong>The re.findall(pattern, string) method finds all occurrences of the pattern in the string and returns a list of all matching substrings.</strong></p>
<p><a href="https://blog.finxter.com/python-re-findall/">Please consult the blog article to learn everything you need to know about this fundamental Python method.</a></p>
<p>The first argument is the regular expression pattern <strong>‘B..’</strong>. The second argument is the string to be searched for the pattern. You want to find all patterns starting with the ‘B’ character, followed by two arbitrary characters except the newline character.</p>
<p>The findall() method finds only one such occurrence: the string ‘But’. </p>
<p>The second example shows that the dot operator does not match the newline character:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('heard.Each', text)
[]</pre>
<p>In this example, you’re looking at the simple pattern ‘heard.Each’. You want to find all occurrences of string ‘heard’ followed by an arbitrary non-whitespace character, followed by the string ‘Each’. </p>
<p>But such a pattern does not exist! Many coders intuitively read the dot regex as an <em>arbitrary character</em>. You must be aware that the correct definition of the dot regex is an <em>arbitrary character except the newline</em>. This is a source of many bugs in regular expressions.</p>
<p>The third example shows you how to explicitly match the newline character ‘\n’ instead:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('heard\nEach', text)
['heard\nEach']</pre>
<p>Now, the regex engine matches the substring.</p>
<p>Naturally, the following relevant question arises:</p>
<h2>How to Match an Arbitrary Character (Including Newline)?</h2>
<p>The dot regex . matches a single arbitrary character—except the newline character. But what if you do want to match the newline character, too? There are two main ways to accomplish this.</p>
<ul>
<li>Use the <strong>re.DOTALL</strong> flag.</li>
<li>Use a character class <strong>[.\n]</strong>.</li>
</ul>
<p>Here’s the concrete example showing both cases:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> >>> s = '''hello
python'''
>>> re.findall('o.p', s)
[]
>>> re.findall('o.p', s, flags=re.DOTALL)
['o\np']
>>> re.findall('o[.\n]p', s)
['o\np']
</pre>
<p>You create a multi-line string. Then you try to find the regex pattern ‘o.p’ in the string. But there’s no match because the dot operator does not match the newline character per default. However, if you define the flag re.DOTALL, the newline character will also be a valid match. </p>
<p><a href="https://blog.finxter.com/python-regex-flags/">Learn more about the different flags in my Finxter blog tutorial.</a></p>
<p>An alternative is to use the slightly more complicated regex pattern [.\n]. The square brackets enclose a <strong>character class</strong>—a set of characters that are all a valid match. Think of a character class as an OR operation: exactly one character must match.</p>
<h2>What If You Actually Want to Match a Dot?</h2>
<p>If you use the character <strong>‘.’</strong> in a regular expression, Python assumes that it’s the dot operator you’re talking about. But what if you actually want to match a dot—for example to match the period at the end of a sentence?</p>
<p>Nothing simpler than that: escape the dot regex by using the backslash: <strong>‘\.’</strong>. The backslash nullifies the meaning of the special symbol ‘.’ in the regex. The regex engine now knows that you’re actually looking for the dot character, not an arbitrary character except newline.</p>
<p>Here’s an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> text = 'Python. Is. Great. Period.'
>>> re.findall('\.', text)
['.', '.', '.', '.']
</pre>
<p>The findall() method returns all four periods in the sentence as matching substrings for the regex <strong>‘\.’</strong>. </p>
<p>In this example, you’ll learn how you can combine it with other regular expressions:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('\.\s', text)
['. ', '. ', '. ']</pre>
<p>Now, you’re looking for a period character followed by an arbitrary whitespace. There are only three such matching substrings in the text.</p>
<p>In the next example, you learn how to combine this with a character class:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[st]\.', text)
['s.', 't.']</pre>
<p>You want to find either character ‘s’ or character ‘t’ followed by the period character ‘.’. Two substrings match this regex.</p>
<p>Note that skipping the backslash is required. If you forget this, it can lead to strange behavior:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[st].', text)
['th', 's.', 't.']</pre>
<p>As an arbitrary character is allowed after the character class, the substring ‘th’ also matches the regex.</p>
<h2>[Collection] What Are The Different Python Re Quantifiers?</h2>
<p>If you want to use (and understand) regular expressions in practice, you’ll need to know the most important quantifiers that can be applied to any regex (including the dot regex)!</p>
<p>So let’s dive into the other regexes:</p>
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td><strong>Quantifier</strong></td>
<td><strong>Description</strong></td>
<td><strong>Example</strong></td>
</tr>
<tr>
<td><code>.</code></td>
<td>The <strong>wild-card</strong> (‘dot’) matches any character in a string except the newline character ‘n’.</td>
<td>Regex ‘…’ matches all words with three characters such as ‘abc’, ‘cat’, and ‘dog’.</td>
</tr>
<tr>
<td><code>*</code></td>
<td>The <strong>zero-or-more</strong> asterisk matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex.</td>
<td>Regex ‘cat*’ matches the strings ‘ca’, ‘cat’, ‘catt’, ‘cattt’, and ‘catttttttt’.</td>
</tr>
<tr>
<td><code>?</code></td>
<td>The <strong>zero-or-one</strong> matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. </td>
<td>Regex ‘cat?’ matches both strings ‘ca’ and ‘cat’ — but not ‘catt’, ‘cattt’, and ‘catttttttt’.</td>
</tr>
<tr>
<td><code>+</code></td>
<td>The <strong>at-least-one</strong> matches one or more occurrences of the immediately preceding regex. </td>
<td>Regex ‘cat+’ does not match the string ‘ca’ but matches all strings with at least one trailing character ‘t’ such as ‘cat’, ‘catt’, and ‘cattt’.</td>
</tr>
<tr>
<td><code>^</code></td>
<td>The <strong>start-of-string</strong> matches the beginning of a string. </td>
<td>Regex ‘^p’ matches the strings ‘python’ and ‘programming’ but not ‘lisp’ and ‘spying’ where the character ‘p’ does not occur at the start of the string.</td>
</tr>
<tr>
<td><code>$</code></td>
<td>The <strong>end-of-string</strong> matches the end of a string. </td>
<td>Regex ‘py$’ would match the strings ‘main.py’ and ‘pypy’ but not the strings ‘python’ and ‘pypi’.</td>
</tr>
<tr>
<td><code>A|B</code></td>
<td>The <strong>OR</strong> matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. </td>
<td>Regex ‘(hello)|(hi)’ matches strings ‘hello world’ and ‘hi python’. It wouldn’t make sense to try to match both of them at the same time.</td>
</tr>
<tr>
<td><code>AB</code></td>
<td>&nbsp;The <strong>AND</strong> matches first the regex A and second the regex B, in this sequence. </td>
<td>We’ve already seen it trivially in the regex ‘ca’ that matches first regex ‘c’ and second regex ‘a’.</td>
</tr>
</tbody>
</table>
</figure>
<p>Note that I gave the above operators some more meaningful names (in bold) so that you can immediately grasp the purpose of each regex. For example, the ‘^’ operator is usually denoted as the ‘caret’ operator. Those names are not descriptive so I came up with more kindergarten-like words such as the “start-of-string” operator.</p>
<p>We’ve already seen many examples but let’s dive into even more!</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re text = ''' Ha! let me see her: out, alas! he's cold: Her blood is settled, and her joints are stiff; Life and these lips have long been separated: Death lies on her like an untimely frost Upon the sweetest flower of all the field. ''' print(re.findall('.a!', text)) '''
Finds all occurrences of an arbitrary character that is
followed by the character sequence 'a!'.
['Ha!'] ''' print(re.findall('is.*and', text)) '''
Finds all occurrences of the word 'is',
followed by an arbitrary number of characters
and the word 'and'.
['is settled, and'] ''' print(re.findall('her:?', text)) '''
Finds all occurrences of the word 'her',
followed by zero or one occurrences of the colon ':'.
['her:', 'her', 'her'] ''' print(re.findall('her:+', text)) '''
Finds all occurrences of the word 'her',
followed by one or more occurrences of the colon ':'.
['her:'] ''' print(re.findall('^Ha.*', text)) '''
Finds all occurrences where the string starts with
the character sequence 'Ha', followed by an arbitrary
number of characters except for the new-line character. Can you figure out why Python doesn't find any?
[] ''' print(re.findall('n$', text)) '''
Finds all occurrences where the new-line character 'n'
occurs at the end of the string.
['n'] ''' print(re.findall('(Life|Death)', text)) '''
Finds all occurrences of either the word 'Life' or the
word 'Death'.
['Life', 'Death'] '''
</pre>
<p>In these examples, you’ve already seen the special symbol ‘\n’ which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions.</p>
<h2>Related Re Methods</h2>
<p>There are five important regular expression methods which you should master:</p>
<ul>
<li>The <strong>re.findall(pattern, string)</strong> method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li>
<li>The <strong>re.search(pattern, string)</strong> method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>.</li>
<li>The <strong>re.match(pattern, string)</strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li>
<li>The <strong>re.fullmatch(pattern, string)</strong> method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>.</li>
<li>The <strong>re.compile(pattern)</strong> method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in <a href="https://blog.finxter.com/python-regex-compile/">our blog tutorial</a>.</li>
<li>The<strong> re.split(pattern, string)</strong> method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in <a href="https://blog.finxter.com/python-regex-split/">our blog tutorial</a>.</li>
<li>The <strong>re.sub(The re.sub(pattern, repl, string, count=0, flags=0)</strong> method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in <a href="https://blog.finxter.com/python-regex-sub/">our blog tutorial</a>.</li>
</ul>
<p>These seven methods are 80% of what you need to know to get started with Python’s regular expression functionality.</p>
<h2>Where to Go From Here?</h2>
<p>You’ve learned everything you need to know about the dot regex <code>.</code> in this regex tutorial. </p>
<p><em><strong>Summary</strong>: The dot regex <code>.</code> matches all characters except the newline character. For example, the regular expression <strong>‘…’</strong> matches strings <strong>‘hey’ </strong>and <strong>‘tom’</strong>. But it does not match the string <strong>‘yo\ntom’</strong> which contains the newline character <strong>‘\n’</strong>.</em></p>
<p><strong>Want to earn money while you learn Python?</strong> Average Python programmers earn more than $50 per hour. You can certainly become average, can’t you?</p>
<p>Join the free webinar that shows you how to become a thriving coding business owner online!</p>
<p><a href="https://blog.finxter.com/webinar-freelancer/">[Webinar] Become a Six-Figure Freelance Developer with Python</a></p>
<p>Join us. It’s fun! <img src="https://s.w.org/images/core/emoji/12.0.0-1/72x72/1f642.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
</div>


https://www.sickgaming.net/blog/2020/01/...on-re-dot/
Reply



Forum Jump:


Users browsing this thread:
1 Guest(s)

Forum software by © MyBB Theme © iAndrew 2016