[Tut] Python Regex Compile


<div><p>Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!</p>
<p>This article is all about the <strong>re.compile(pattern)</strong> method of Python’s <a rel="noreferrer noopener" target="_blank" href="https://docs.python.org/3/library/re.html">re library</a>. Before we dive into re.compile(), let’s get an overview of the four related methods you must understand:</p>
<ul>
<li>The <strong>findall(pattern, string)</strong> method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li>
<li>The <strong>search(pattern, string)</strong> method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>. </li>
<li>The <strong>match(pattern, string)</strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li>
<li>The <strong>fullmatch(pattern, string)</strong> method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>. </li>
</ul>
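<p>The four methods above can be compared side by side in a minimal sketch. The sample text is made up for illustration; only the standard re module is used.</p>

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'Python is great, Pyth0n too'

# findall: list of all non-overlapping matches as strings
all_matches = re.findall('Py...n', text)

# search: match object for the first match anywhere in the string
first = re.search('great', text)

# match: succeeds only if the pattern matches at the start of the string
at_start = re.match('great', text)

# fullmatch: succeeds only if the pattern matches the entire string
whole = re.fullmatch('Py...n', 'Python')

print(all_matches)    # ['Python', 'Pyth0n']
print(first.span())   # (10, 15)
print(at_start)       # None
print(whole.group())  # Python
```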
<p>Equipped with this quick overview of the most critical regex methods, let’s answer the following question:</p>
<h2>How Does re.compile() Work in Python?</h2>
<p><strong>The re.compile(pattern) method returns a regular expression object (see next section).</strong></p>
<p><strong>You then use the object to call important regex methods such as search(string), match(string), fullmatch(string), and findall(string). </strong></p>
<p><strong>In short: You compile the pattern first. You search the pattern in a string second.</strong></p>
<p>This two-step approach is more efficient than calling, say, search(pattern, string) at once. That is, <em>IF you call the search() method multiple times on the same pattern</em>. Why? Because you can reuse the compiled pattern multiple times.</p>
<p>Here’s an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

# These two lines ...
regex = re.compile('Py...n')
match = regex.search('Python is great')

# ... are equivalent to ...
match = re.search('Py...n', 'Python is great')</pre>
<p>In both instances, the match variable contains the following match object:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;re.Match object; span=(0, 6), match='Python'></pre>
<p>But in the first case, we can find the pattern not only in the string ‘Python is great’ but also in other strings—without any redundant work of compiling the pattern again and again.</p>
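<p>As a small sketch of that reuse, the compiled pattern can be applied to several strings in a loop. The sample strings are hypothetical and only serve as illustration.</p>

```python
import re

regex = re.compile('Py...n')  # compile once

# Hypothetical sample strings, chosen only for illustration
lines = ['Python is great', 'I like Pyth0n', 'No snakes here']

results = []
for line in lines:
    match = regex.search(line)  # reuse the compiled pattern
    results.append(match.group() if match else None)

print(results)  # ['Python', 'Pyth0n', None]
```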
<p><strong>Specification</strong>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.compile(pattern, flags=0)</pre>
<p>The method has up to two arguments.</p>
<ul>
<li><strong>pattern</strong>: the regular expression pattern that you want to match.</li>
<li><strong>flags </strong>(optional argument): a more advanced modifier that allows you to customize the behavior of the function. Want to know <a href="https://blog.finxter.com/python-regex-flags/">how to use those flags? Check out this detailed article</a> on the Finxter blog.</li>
</ul>
<p>We’ll explore those arguments in more detail later. </p>
<p><strong>Return Value:</strong></p>
<p>The re.compile(pattern, flags) method returns a regular expression object. You may ask (and rightly so):</p>
<h2>What’s a Regular Expression Object?</h2>
<p>Python internally creates a <a href="https://docs.python.org/3/library/re.html#re-objects">regular expression object</a> (from the <code>Pattern</code> class) to prepare the pattern matching process. You can call the following methods on the regex object:</p>
<figure class="wp-block-table is-style-stripes">
<table>
<thead>
<tr>
<th>Method </th>
<th>Description </th>
</tr>
</thead>
<tbody>
<tr>
<td> <code>Pattern.search</code>(<em>string</em>[, <em>pos</em>[, <em>endpos</em>]])</td>
<td>Searches the regex anywhere in the string and returns a match object or None. You can define start and end positions of the search.</td>
</tr>
<tr>
<td> <code>Pattern.match</code>(<em>string</em>[, <em>pos</em>[, <em>endpos</em>]])</td>
<td>Searches the regex at the beginning of the string and returns a match object or None. You can define start and end positions of the search. </td>
</tr>
<tr>
<td> <code>Pattern.fullmatch</code>(<em>string</em>[, <em>pos</em>[, <em>endpos</em>]])</td>
<td>Matches the regex with the whole string and returns a match object or None. You can define start and end positions of the search. </td>
</tr>
<tr>
<td> <code>Pattern.split</code>(<em>string</em>, <em>maxsplit=0</em>) </td>
<td>Divides the string into a list of substrings. The regex is the delimiter. You can define a maximum number of splits.</td>
</tr>
<tr>
<td> <code>Pattern.findall</code>(<em>string</em>[, <em>pos</em>[, <em>endpos</em>]]) </td>
<td>Searches the regex anywhere in the string and returns a list of matching substrings. You can define start and end positions of the search.</td>
</tr>
<tr>
<td> <code>Pattern.finditer</code>(<em>string</em>[, <em>pos</em>[, <em>endpos</em>]]) </td>
<td>Returns an iterator that goes over all matches of the regex in the string (returns one match object after another). You can define the start and end positions of the search.</td>
</tr>
<tr>
<td> <code>Pattern.sub</code>(<em>repl</em>, <em>string</em>, <em>count=0</em>) </td>
<td>Returns a new string by replacing the first <em>count </em>occurrences of the regex in the string (from left to right) with the replacement string <em>repl</em>.</td>
</tr>
<tr>
<td> <code>Pattern.subn</code>(<em>repl</em>, <em>string</em>, <em>count=0</em>) </td>
<td>Works like <code>Pattern.sub</code>, but returns a tuple: the new string (with the first <em>count </em>occurrences of the regex replaced by <em>repl</em>) as the first value and the number of replacements made as the second value.</td>
</tr>
</tbody>
</table>
</figure>
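<p>To make the table more concrete, here is a short sketch that calls each of these methods on one compiled pattern. The pattern and the sample text are assumptions picked for illustration.</p>

```python
import re

regex = re.compile(r'\d+')   # one or more digits
text = 'a1b22c333'           # hypothetical sample text

print(regex.search(text).group())           # 1
print(regex.search(text, 2).group())        # 22 (search starts at index 2)
print(regex.match(text))                    # None (no digit at index 0)
print(regex.match(text, 1).group())         # 1
print(regex.fullmatch(text, 3, 5).group())  # 22 (only text[3:5] is considered)
print(regex.split(text))                    # ['a', 'b', 'c', '']
print(regex.split(text, maxsplit=1))        # ['a', 'b22c333']
print(regex.findall(text))                  # ['1', '22', '333']
print([m.span() for m in regex.finditer(text)])  # [(1, 2), (3, 5), (6, 9)]
print(regex.sub('#', text))                 # a#b#c#
print(regex.sub('#', text, count=2))        # a#b#c333
print(regex.subn('#', text))                # ('a#b#c#', 3)
```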
<p>If you’re familiar with the most basic regex methods, you’ll realize that all of them appear in this table. But there’s one distinction: you don’t have to define the pattern as an argument. For example, the regex method re.search(pattern, string) will internally compile a regex object p and then call p.search(string).</p>
<p>You can see this fact in the official implementation of the <a href="https://blog.finxter.com/python-regex-search/">re.search(pattern, string) method</a>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)</pre>
<p><a href="https://github.com/python/cpython/blob/master/Lib/re.py"><em>(Source: GitHub repository of the re package)</em></a></p>
<p>The re.search(pattern, string) method is a mere wrapper for compiling the pattern first and calling the p.search(string) function on the compiled regex object p.</p>
<h2>Is It Worth Using Python’s re.compile()?</h2>
<p>No, in the vast majority of cases, it’s not worth the extra line.</p>
<p>Consider the following example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

# These two lines ...
regex = re.compile('Py...n')
match = regex.search('Python is great')

# ... are equivalent to ...
match = re.search('Py...n', 'Python is great')</pre>
<p>Don’t get me wrong. Compiling a pattern once and using it many times throughout your code (e.g., in a loop) comes with a big performance benefit. In some anecdotal cases, compiling the pattern first led to a <a href="https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile">10x to 50x speedup</a> compared to compiling it again and again.</p>
<p>But the reason it is not worth the extra line is that Python’s re library ships with an internal cache. At the time of this writing, the cache holds up to 512 compiled regex objects. So as long as your program uses no more than 512 distinct patterns, every call of re.search(pattern, string) after the first one for a given pattern finds the compiled pattern in the cache already.</p>
<p>Here’s the relevant code snippet from <a href="https://github.com/python/cpython/blob/14a0e16c8805f7ba7c98132ead815dcfdf0e9d33/Lib/re.py">re’s GitHub repository</a>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># --------------------------------------------------------------------
# internals

_cache = {}  # ordered!

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    if isinstance(pattern, Pattern):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags &amp; DEBUG):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest item
            try:
                del _cache[next(iter(_cache))]
            except (StopIteration, RuntimeError, KeyError):
                pass
        _cache[type(pattern), pattern, flags] = p
    return p</pre>
<p>Can you find the spots where the cache is initialized and used?</p>
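<p>One way to observe the cache from the outside is to compile the same pattern twice and compare object identity. This is only a sketch: it relies on a CPython implementation detail (the internal cache shown above), not on documented behavior, so treat the result as an assumption about your interpreter.</p>

```python
import re

re.purge()  # clear the module's internal cache of compiled patterns

first = re.compile('Py...n')
second = re.compile('Py...n')

# In CPython, the second call is expected to be served from the cache,
# so both names should refer to the very same Pattern object.
print(first is second)

# Passing an already compiled pattern returns it unchanged.
print(re.compile(first) is first)
```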
<p>While in most cases, you don’t need to compile a pattern, in some cases, you should. These follow directly from the previous implementation:</p>
<ul>
<li>You’ve got more than MAXCACHE patterns in your code.</li>
<li>You’ve got more than MAXCACHE <em>different </em>patterns between two <em>same </em>pattern instances. Only in this case, you will see “cache misses” where the cache has already flushed the seemingly stale pattern instances to make room for newer ones.</li>
<li>You reuse the pattern multiple times. If you don’t, it makes no sense to spend scarce memory on keeping the compiled pattern around.</li>
<li>(Even then, it may only be useful if the patterns are relatively complicated. Otherwise, you won’t see a lot of performance benefits in practice.)</li>
</ul>
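<p>If you want to check whether compiling pays off for your own patterns, a rough timing sketch along these lines may help. The pattern, the text, and the repetition count are arbitrary assumptions, and the absolute numbers depend on your machine and Python version, so no expected output is shown.</p>

```python
import re
import timeit

pattern = 'Py...n'
text = 'Python is great'
compiled = re.compile(pattern)

# Time 100,000 searches each way (the count is an arbitrary choice)
t_module = timeit.timeit(lambda: re.search(pattern, text), number=100_000)
t_compiled = timeit.timeit(lambda: compiled.search(text), number=100_000)

print(f're.search(pattern, text): {t_module:.3f}s')
print(f'compiled.search(text):    {t_compiled:.3f}s')
```

<p>The module-level call still goes through the cache lookup on every call, so any difference you measure here reflects that lookup overhead rather than repeated compilation.</p>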
<p>To summarize, compiling the pattern first and storing the compiled pattern in a variable for later use is often nothing but “premature optimization”—one of the deadly sins of beginner and intermediate programmers.</p>
<h2>What Does re.compile() Really Do?</h2>
<p>It doesn’t seem like a lot, does it? My intuition was that the real work is in finding the pattern in the text, which happens after compilation. And, of course, matching the pattern <strong><em>is </em></strong>the hard part. But a sensible compilation helps a lot in preparing the pattern to be matched efficiently by the regex engine. Without it, the engine would have to do this work itself during matching.</p>
<p>Regex’s compile() method does a lot of things such as:</p>
<ul>
<li>Combine two subsequent characters in the regex if they together indicate a special symbol such as certain Greek symbols.</li>
<li>Prepare the regex to ignore uppercase and lowercase.</li>
<li>Check for certain (smaller) patterns in the regex.</li>
<li>Analyze matching groups in the regex enclosed in parentheses.</li>
</ul>
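<p>Some results of this preparation are visible on the compiled object itself. The sketch below inspects a few documented attributes of a Pattern object; the date-like pattern is a made-up example.</p>

```python
import re

# Hypothetical pattern with one named and one unnamed group
regex = re.compile(r'(?P<year>\d{4})-(\d{2})', re.IGNORECASE)

print(regex.pattern)                      # (?P<year>\d{4})-(\d{2})
print(regex.groups)                       # 2
print(dict(regex.groupindex))             # {'year': 1}
print(bool(regex.flags & re.IGNORECASE))  # True
```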
<p>Here’s the implementation of the internal _compile() function behind compile(). It looks more complicated than expected, no?</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def _compile(code, pattern, flags):
    # internal: compile a (sub)pattern
    emit = code.append
    _len = len
    LITERAL_CODES = _LITERAL_CODES
    REPEATING_CODES = _REPEATING_CODES
    SUCCESS_CODES = _SUCCESS_CODES
    ASSERT_CODES = _ASSERT_CODES
    iscased = None
    tolower = None
    fixes = None
    if flags &amp; SRE_FLAG_IGNORECASE and not flags &amp; SRE_FLAG_LOCALE:
        if flags &amp; SRE_FLAG_UNICODE:
            iscased = _sre.unicode_iscased
            tolower = _sre.unicode_tolower
            fixes = _ignorecase_fixes
        else:
            iscased = _sre.ascii_iscased
            tolower = _sre.ascii_tolower
    for op, av in pattern:
        if op in LITERAL_CODES:
            if not flags &amp; SRE_FLAG_IGNORECASE:
                emit(op)
                emit(av)
            elif flags &amp; SRE_FLAG_LOCALE:
                emit(OP_LOCALE_IGNORE[op])
                emit(av)
            elif not iscased(av):
                emit(op)
                emit(av)
            else:
                lo = tolower(av)
                if not fixes:  # ascii
                    emit(OP_IGNORE[op])
                    emit(lo)
                elif lo not in fixes:
                    emit(OP_UNICODE_IGNORE[op])
                    emit(lo)
                else:
                    emit(IN_UNI_IGNORE)
                    skip = _len(code); emit(0)
                    if op is NOT_LITERAL:
                        emit(NEGATE)
                    for k in (lo,) + fixes[lo]:
                        emit(LITERAL)
                        emit(k)
                    emit(FAILURE)
                    code[skip] = _len(code) - skip
        elif op is IN:
            charset, hascased = _optimize_charset(av, iscased, tolower, fixes)
            if flags &amp; SRE_FLAG_IGNORECASE and flags &amp; SRE_FLAG_LOCALE:
                emit(IN_LOC_IGNORE)
            elif not hascased:
                emit(IN)
            elif not fixes:  # ascii
                emit(IN_IGNORE)
            else:
                emit(IN_UNI_IGNORE)
            skip = _len(code); emit(0)
            _compile_charset(charset, flags, code)
            code[skip] = _len(code) - skip
        elif op is ANY:
            if flags &amp; SRE_FLAG_DOTALL:
                emit(ANY_ALL)
            else:
                emit(ANY)
        elif op in REPEATING_CODES:
            if flags &amp; SRE_FLAG_TEMPLATE:
                raise error("internal: unsupported template operator %r" % (op,))
            if _simple(av[2]):
                if op is MAX_REPEAT:
                    emit(REPEAT_ONE)
                else:
                    emit(MIN_REPEAT_ONE)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                emit(SUCCESS)
                code[skip] = _len(code) - skip
            else:
                emit(REPEAT)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                code[skip] = _len(code) - skip
                if op is MAX_REPEAT:
                    emit(MAX_UNTIL)
                else:
                    emit(MIN_UNTIL)
        elif op is SUBPATTERN:
            group, add_flags, del_flags, p = av
            if group:
                emit(MARK)
                emit((group-1)*2)
            # _compile_info(code, p, _combine_flags(flags, add_flags, del_flags))
            _compile(code, p, _combine_flags(flags, add_flags, del_flags))
            if group:
                emit(MARK)
                emit((group-1)*2+1)
        elif op in SUCCESS_CODES:
            emit(op)
        elif op in ASSERT_CODES:
            emit(op)
            skip = _len(code); emit(0)
            if av[0] >= 0:
                emit(0) # look ahead
            else:
                lo, hi = av[1].getwidth()
                if lo != hi:
                    raise error("look-behind requires fixed-width pattern")
                emit(lo) # look behind
            _compile(code, av[1], flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is CALL:
            emit(op)
            skip = _len(code); emit(0)
            _compile(code, av, flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is AT:
            emit(op)
            if flags &amp; SRE_FLAG_MULTILINE:
                av = AT_MULTILINE.get(av, av)
            if flags &amp; SRE_FLAG_LOCALE:
                av = AT_LOCALE.get(av, av)
            elif flags &amp; SRE_FLAG_UNICODE:
                av = AT_UNICODE.get(av, av)
            emit(av)
        elif op is BRANCH:
            emit(op)
            tail = []
            tailappend = tail.append
            for av in av[1]:
                skip = _len(code); emit(0)
                # _compile_info(code, av, flags)
                _compile(code, av, flags)
                emit(JUMP)
                tailappend(_len(code)); emit(0)
                code[skip] = _len(code) - skip
            emit(FAILURE) # end of branch
            for tail in tail:
                code[tail] = _len(code) - tail
        elif op is CATEGORY:
            emit(op)
            if flags &amp; SRE_FLAG_LOCALE:
                av = CH_LOCALE[av]
            elif flags &amp; SRE_FLAG_UNICODE:
                av = CH_UNICODE[av]
            emit(av)
        elif op is GROUPREF:
            if not flags &amp; SRE_FLAG_IGNORECASE:
                emit(op)
            elif flags &amp; SRE_FLAG_LOCALE:
                emit(GROUPREF_LOC_IGNORE)
            elif not fixes:  # ascii
                emit(GROUPREF_IGNORE)
            else:
                emit(GROUPREF_UNI_IGNORE)
            emit(av-1)
        elif op is GROUPREF_EXISTS:
            emit(op)
            emit(av[0]-1)
            skipyes = _len(code); emit(0)
            _compile(code, av[1], flags)
            if av[2]:
                emit(JUMP)
                skipno = _len(code); emit(0)
                code[skipyes] = _len(code) - skipyes + 1
                _compile(code, av[2], flags)
                code[skipno] = _len(code) - skipno
            else:
                code[skipyes] = _len(code) - skipyes + 1
        else:
            raise error("internal: unsupported operand type %r" % (op,))</pre>
<p>Don’t worry, you don’t need to understand the code. Just note that all this work would have to be done by the regex engine at “matching runtime” if you didn’t compile the pattern first. If we can do it only once, it’s certainly low-hanging fruit for performance optimization, especially for long regular expression patterns.</p>
<h2>How to Use the Optional Flag Argument?</h2>
<p>As you’ve seen in the specification, the compile() method comes with an optional second argument, ‘flags’:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.compile(pattern, flags=0)</pre>
<p>What’s the purpose of the <a href="https://blog.finxter.com/python-regex-flags/">flags argument</a>?</p>
<p>Flags allow you to control the regular expression engine. Because regular expressions are so powerful, they are a useful way of switching on and off certain features (for example, whether to ignore capitalization when matching your regex). </p>
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td><strong>Syntax</strong></td>
<td><strong>Meaning</strong></td>
</tr>
<tr>
<td> <strong>re.ASCII</strong></td>
<td>If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s and \S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters, as the name suggests. </td>
</tr>
<tr>
<td> <strong>re.A</strong> </td>
<td>Same as re.ASCII </td>
</tr>
<tr>
<td> <strong>re.DEBUG</strong> </td>
<td>If you use this flag, Python will print some useful information to the shell that helps you debug your regex. </td>
</tr>
<tr>
<td> <strong>re.IGNORECASE</strong> </td>
<td>If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z]. </td>
</tr>
<tr>
<td> <strong>re.I</strong> </td>
<td>Same as re.IGNORECASE </td>
</tr>
<tr>
<td> <strong>re.LOCALE</strong> </td>
<td>Avoid this flag. The official documentation discourages its use: the idea was to make case-insensitive matching (and symbols such as \w and \b) depend on your current locale, but the locale mechanism is unreliable. </td>
</tr>
<tr>
<td> <strong>re.L</strong> </td>
<td>Same as re.LOCALE </td>
</tr>
<tr>
<td> <strong>re.MULTILINE</strong> </td>
<td>This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now matches also at the end of each line in a multi-line string. </td>
</tr>
<tr>
<td> <strong>re.M</strong> </td>
<td>Same as re.MULTILINE </td>
</tr>
<tr>
<td> <strong>re.DOTALL</strong> </td>
<td>Without using this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character. </td>
</tr>
<tr>
<td> <strong>re.S</strong> </td>
<td>Same as re.DOTALL </td>
</tr>
<tr>
<td> <strong>re.VERBOSE</strong> </td>
<td>To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: whitespace in the pattern is ignored (except inside a character class or when escaped with a backslash), and everything from an unescaped ‘#’ to the end of the line is treated as a comment. </td>
</tr>
<tr>
<td> <strong>re.X</strong> </td>
<td>Same as re.VERBOSE </td>
</tr>
</tbody>
</table>
</figure>
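<p>The effect of several flags from the table can be seen in a brief sketch. All sample strings and patterns are assumptions picked for illustration.</p>

```python
import re

# re.MULTILINE: '^' also matches at the start of each line
text = 'one\ntwo\nthree'
print(re.findall(r'^\w+', text))                # ['one']
print(re.findall(r'^\w+', text, re.MULTILINE))  # ['one', 'two', 'three']

# re.DOTALL: '.' also matches the newline character
print(re.search('a.b', 'a\nb'))                                # None
print(re.search('a.b', 'a\nb', re.DOTALL).group() == 'a\nb')  # True

# re.ASCII: \w matches only ASCII word characters
word = 'caf\u00e9'  # 'café' with a non-ASCII last letter
print(re.findall(r'\w+', word) == [word])  # True
print(re.findall(r'\w+', word, re.ASCII))  # ['caf']

# re.VERBOSE: whitespace and comments inside the pattern are ignored
date = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.VERBOSE)
print(date.match('2020-01').groups())  # ('2020', '01')
```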
<p>Here’s how you’d use it in a practical example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

text = 'Python is great (python really is)'
regex = re.compile('Py...n', flags=re.IGNORECASE)
matches = regex.findall(text)
print(matches)
# ['Python', 'python']</pre>
<p>Although the regex ‘Py...n’ starts with an uppercase ‘P’, the flag re.IGNORECASE makes it match the lowercase ‘python’ as well.</p>
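<p>Flags can also be combined with the bitwise OR operator (|). The following is a minimal sketch with a made-up two-line string, not part of the original example.</p>

```python
import re

# Hypothetical two-line sample text
text = 'python first\nPython second'

# Ignore case AND let '^' match at the start of every line
regex = re.compile('^py...n', flags=re.IGNORECASE | re.MULTILINE)
print(regex.findall(text))  # ['python', 'Python']
```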
<h2>Where to Go From Here?</h2>
<p><strong>You’ve learned about the re.compile(pattern) method that prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code.</strong></p>
</div>


https://www.sickgaming.net/blog/2020/01/...x-compile/