This tutorial explains everything you need to know about matching groups in Python's re package for regular expressions. You may also have read the term "capture groups," which refers to the same concept.
So let’s start with the basics:
Matching Group ()
What’s a matching group?
Just as you use parentheses to structure mathematical expressions, (2 + 2) * 2 versus 2 + (2 * 2), you use parentheses to structure regular expressions. An example regex that does this is 'a(b|c)'. The whole content enclosed in the opening and closing parentheses is called a matching group (or capture group). You can have multiple matching groups in a single regex. And you can even have hierarchical matching groups, for example 'a(b|(cd))'.
One big advantage of a matching group is that it captures the matched substring. You can retrieve it in other parts of the regular expression—or after analyzing the result of the whole regex matching.
Let’s have a short example for the most basic use of a matching group—to structure the regex.
Say you create the regex b?(a.)* with the matching group (a.). It matches all strings that start with zero or one occurrence of the character 'b', followed by an arbitrary number of two-character sequences starting with the character 'a'. Hence, the strings 'bacacac', 'aaaa', and '' (the empty string) all match your regex. The string 'Xababab' doesn't match as a whole, but it contains the matching substring 'ababab'.
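To see this in action, here's a minimal sketch using re.fullmatch() to test whole-string matches (the example strings are chosen for illustration):

```python
import re

pattern = r'b?(a.)*'

# Each of these strings matches the pattern as a whole:
for s in ['bacacac', 'aaaa', '']:
    print(repr(s), '->', re.fullmatch(pattern, s) is not None)

# 'Xababab' doesn't match as a whole (it starts with 'X'):
print(re.fullmatch(pattern, 'Xababab') is None)
```

Note that the capture group (a.) retains only its last repetition, 'ac' in the first string.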
The use of the parentheses for structuring the regular expression is intuitive and should come naturally to you because the same rules apply as for arithmetic operations. However, there’s a more advanced use of regex groups: retrieval.
You can retrieve the matched content of each matching group. So the next question naturally arises:
How to Get the First Matching Group?
There are two scenarios when you want to access the content of your matching groups:
Access the matching group in the regex pattern to reuse partially matched text from one group somewhere else.
Access the matching group after the whole match operation to analyze the matched text in your Python code.
In the first case, you refer to a matching group inside the pattern with the \number special sequence. For example, to reuse the text matched by the first group, you'd use the \1 special sequence. Here's an example:
>>> import re
>>> re.search(r'(j.n) is \1','jon is jon')
<re.Match object; span=(0, 10), match='jon is jon'>
You’ll use this feature a lot because it gives you much more expressive power: for example, you can search for a name in a text based on a given pattern and then process specifically this name in the rest of the text (and not all other names that would also fit the pattern).
Note that the numbering of the groups starts with \1, not with \0: a rare exception to the rule that in programming, all numbering starts with 0.
In the second case, you want to know the contents of the first group after the whole match. How do you do that?
The answer is also simple: use the m.group(1) method on the match object m. Here’s an example:
>>> import re
>>> m = re.search(r'(j.n)','jon is jon')
>>> m.group(1)
'jon'
The numbering works consistently with the previously introduced regex group numbering: start with identifier 1 to access the contents of the first group.
How to Get All Other Matching Groups?
Again, there are two different intentions when asking this question:
Access the matching group in the regex pattern to reuse partially matched text from one group somewhere else.
Access the matching group after the whole match operation to analyze the matched text in your Python code.
In the first case, you use the special sequence \2 to access the second matching group, \3 to access the third matching group, and so on, up to \99 for the ninety-ninth matching group.
Here’s an example:
>>> import re
>>> re.search(r'(j..) (j..)\s+\2', 'jon jim jim')
<re.Match object; span=(0, 11), match='jon jim jim'>
>>> re.search(r'(j..) (j..)\s+\2', 'jon jim jon')
>>>
As you can see, the special sequence \2 refers to the matching contents of the second group 'jim'.
In the second case, you can simply increase the identifier too to access the other matching groups in your Python code:
>>> import re
>>> m = re.search(r'(j..) (j..)\s+\2', 'jon jim jim')
>>> m.group(0)
'jon jim jim'
>>> m.group(1)
'jon'
>>> m.group(2)
'jim'
This code also shows an interesting feature: if you use the identifier 0 as an argument to the m.group(0) method, the regex module will give you the contents of the whole match. You can think of it as the first group being the whole match.
Named Groups: (?P<name>…) and (?P=name)
Accessing the captured group using the notation \number is not always convenient and sometimes not even possible (for example if you have more than 99 groups in your regex). A major disadvantage of regular expressions is that they tend to be hard to read. It’s therefore important to know about the different tweaks to improve readability.
One such optimization is a named group. It’s really just that: a matching group that captures part of the match but with one twist: it has a name. Now, you can use this name to access the captured group at a later point in your regular expression pattern. This can improve readability of the regular expression.
import re
pattern = '(?P<quote>["\']).*(?P=quote)'
text = 'She said "hi"'
print(re.search(pattern, text))
# <re.Match object; span=(9, 13), match='"hi"'>
The code searches for substrings that are enclosed in either single or double quotes. You first match the opening quote with the character set ["\']. The single quote is escaped as \' only so that Python's string parser does not (wrongly) treat it as the end of the string literal. You then use (?P=quote) to match a closing quote of the same type that the named group captured (either a single or a double quote).
Non-Capturing Groups (?:…)
In the previous examples, you’ve seen how to match and capture groups with the parentheses (...). You’ve learned that each match of this basic group operator is captured so that you can retrieve it later in the regex with the special commands \1, \2, …, \99 or after the match on the matched object m with the method m.group(1), m.group(2), and so on.
But what if you don’t need that? What if you just need to keep your regex pattern in order—but you don’t want to capture the contents of a matching group?
The simple solution is the non-capturing group operation (?: ... ). You can use it just like the capturing group operation ( ... ). Here’s an example:
>>> import re
>>> re.search('(?:python|java) is great', 'python is great. java is great.')
<re.Match object; span=(0, 15), match='python is great'>
The non-capturing group exists for the sole purpose of structuring the regex. You cannot use its content later:
>>> m = re.search('(?:python|java) is great', 'python is great. java is great.')
>>> m.group(1)
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    m.group(1)
IndexError: no such group
>>>
If you try to access the contents of the non-capturing group, the regex engine will throw an IndexError: no such group.
Of course, there’s a straightforward alternative to non-capturing groups. You can simply use the normal (capturing) group but don’t access its contents. Only rarely will the performance penalty of capturing a group that’s not needed have any meaningful impact on your overall application.
Positive Lookahead (?=…)
The concept of lookahead is a very powerful one and any advanced coder should know it. A friend recently told me that he had written a complicated regex that ignores the order of occurrences of two words in a given text. It’s a challenging problem and without the concept of lookahead, the resulting code will be complicated and hard to understand. However, the concept of lookahead makes this problem simple to write and read.
But first things first: how does the lookahead assertion work?
In normal regular expression processing, the regex is matched from left to right. The regex engine “consumes” partially matching substrings. The consumed substring cannot be matched by any other part of the regex.
Figure: A simple example of lookahead. The regular expression engine matches ("consumes") the string partially. Then it checks whether the remaining pattern could be matched without actually matching it.
Think of the lookahead assertion as a non-consuming pattern match. The regex engine moves from left to right, searching for the pattern. At each step, it has one "current" position and checks whether that position starts a match of the remaining pattern. In other words, the regex engine tries to "consume" the next character as a (partial) match of the pattern.
The advantage of the lookahead expression is that it doesn’t consume anything. It just “looks ahead” starting from the current position whether what follows would theoretically match the lookahead pattern. If it doesn’t, the regex engine cannot move on. Next, it “backtracks”—which is just a fancy way of saying: it goes back to a previous decision and tries to match something else.
Positive Lookahead Example: How to Match Two Words in Arbitrary Order?
What if you want to search a given text for pattern A AND pattern B—but in no particular order? If both patterns appear anywhere in the string, the whole string should be returned as a match.
Now, this is a bit more complicated because any regular expression pattern is ordered from left to right. A simple solution is to use the lookahead assertion (?=.*A) to check whether regex A appears anywhere in the string. (Note we assume a single-line string, as the .* pattern doesn't match the newline character by default.)
Let’s first have a look at the minimal solution to check for two patterns anywhere in the string (say, patterns ‘hi’ AND ‘you’).
>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are yo?')
[]
>>> re.findall(pattern, 'hi how are you?')
['']
In the first example, the word 'you' doesn't appear, so there's no match. In the second example, both words appear.
Let’s go back to the expression (?=.*hi)(?=.*you) to match strings that contain both ‘hi’ and ‘you’. Why does it work?
The reason is that the lookahead expressions don’t consume anything. You first search for an arbitrary number of characters .*, followed by the word hi. But because the regex engine hasn’t consumed anything, it’s still in the same position at the beginning of the string. So, you can repeat the same for the word you.
Note that this method doesn’t care about the order of the two words:
>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are you?')
['']
>>> re.findall(pattern, 'you are how? hi!')
['']
No matter which word “hi” or “you” appears first in the text, the regex engine finds both.
You may ask: why’s the output the empty string? The reason is that the regex engine hasn’t consumed any character. It just checked the lookaheads. So the easy fix is to consume all characters as follows:
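For example, appending .* after the two lookaheads (a sketch using the same pattern as above) consumes the rest of the string:

```python
import re

# The two lookaheads check for both words; the trailing .* then
# consumes (and returns) the whole matching string:
pattern = '(?=.*hi)(?=.*you).*'
print(re.findall(pattern, 'hi how are you?'))
# ['hi how are you?']
```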
Now, the whole string is a match because after checking the lookahead with ‘(?=.*hi)(?=.*you)’, you also consume the whole string ‘.*’.
Negative Lookahead (?!…)
The negative lookahead works just like the positive lookahead—only it checks that the given regex pattern does not occur going forward from a certain position.
Here’s an example:
>>> import re
>>> re.search('(?!.*hi.*)', 'hi say hi?')
<re.Match object; span=(8, 8), match=''>
The negative lookahead pattern (?!.*hi.*) ensures that, going forward in the string, there’s no occurrence of the substring 'hi'. The first position where this holds is position 8 (right after the second 'h'). Like the positive lookahead, the negative lookahead does not consume any character so the result is the empty string (which is a valid match of the pattern).
You can even combine multiple negative lookaheads like this:
>>> re.search('(?!.*hi.*)(?!\?).', 'hi say hi?')
<re.Match object; span=(8, 9), match='i'>
You search for a position where neither ‘hi’ is in the lookahead, nor does the question mark character follow immediately. This time, we consume an arbitrary character so the resulting match is the character 'i'.
Group Flags (?aiLmsux:…) and (?aiLmsux)
You can control the regex engine with the flags argument of the re.findall(), re.search(), or re.match() methods. For example, if you don’t care about capitalization of your matched substring, you can pass the re.IGNORECASE flag to the regex methods:
>>> re.findall('PYTHON', 'python is great', flags=re.IGNORECASE)
['python']
But using a global flag for the whole regex is not always optimal. What if you want to ignore the capitalization only for a certain subregex?
You can do this with the group flags: a, i, L, m, s, u, and x. Each group flag has its own meaning:
a: If you don't use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s, and \S match Unicode characters. If you use this flag, those special symbols match only ASCII characters, as the name suggests.
i: If you use this flag, the regex engine performs case-insensitive matching. So if you're searching for [A-Z], it also matches [a-z].
L: Don't use this flag, ever. It's deprecated: the idea was to perform matching depending on your current locale, but it isn't reliable.
m: This flag switches on the following feature: the start-of-the-string regex '^' matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex '$' that now also matches at the end of each line in a multi-line string.
s: Without this flag, the dot regex '.' matches all characters except the newline character '\n'. Switch on this flag to really match all characters, including the newline character.
u: Unicode matching. This is the default in Python 3, so the flag is redundant there.
x: To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: whitespace is ignored in the regex, and everything from a '#' character to the end of the line is treated as a comment.
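For instance, here's a minimal sketch of the m flag in action (the sample text is an assumption for illustration):

```python
import re

text = 'first line\nsecond line'

# Without the multiline flag, '^' anchors only at the start of the string:
print(re.findall(r'^\w+', text))      # ['first']

# With the inline multiline flag, '^' also matches after each newline:
print(re.findall(r'(?m)^\w+', text))  # ['first', 'second']
```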
For example, if you want to switch off the differentiation of capitalization, you’ll use the i flag as follows:
>>> re.findall('(?i:PYTHON)', 'python is great')
['python']
You can also switch off the capitalization for the whole regex with the “global group flag” (?i) as follows:
>>> re.findall('(?i)PYTHON', 'python is great')
['python']
Where to Go From Here?
Summary: You’ve learned about matching groups to structure the regex and capture parts of the matching result. You can then retrieve the captured groups with the \number syntax within the regex pattern itself and with the m.group(i) syntax in the Python code at a later stage.
To learn the Python basics, check out my free Python email academy with many advanced courses—including a regex video tutorial in your INBOX.
Goal: Given a string that is either Morse code or normal text, write a function that transforms the string into the other language: Morse code should be translated to normal text, and normal text should be translated to Morse code.
Output Example: Create a function morse(txt) that takes an input string argument txt and returns its translation (see the example calls below).
Note that Morse code doesn’t differentiate lowercase or uppercase characters. So you just use uppercase characters as default translation output.
Algorithm Idea: A simple algorithm is enough to solve the problem:
Detect if a string is Morse code or normal text. The simple but not perfect solution is to check if the first character is either the dot symbol '.' or the minus symbol '-'. Note that you can easily extend this by checking if all characters are either the dot symbol or the minus symbol (a simple regular expression will be enough).
Prepare a dictionary that maps all “normal text” symbols to their respective Morse code translations. Use the inverse dictionary (or create it ad-hoc) to get the inverse mapping.
Iterate over all characters in the string and use the dictionary to translate each character separately.
Implementation: Here’s the Python implementation of the above algorithm for Morse code translation:
def morse(txt):
    '''Morse code encryption and decryption'''
    d = {'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.',
         'F': '..-.', 'G': '--.', 'H': '....', 'I': '..', 'J': '.---',
         'K': '-.-', 'L': '.-..', 'M': '--', 'N': '-.', 'O': '---',
         'P': '.--.', 'Q': '--.-', 'R': '.-.', 'S': '...', 'T': '-',
         'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-', 'Y': '-.--',
         'Z': '--..', ' ': '.....'}
    translation = ''

    # Decrypt Morse code to normal text:
    if txt.startswith('.') or txt.startswith('-'):
        # Swap keys/values in d:
        d_decrypt = dict([(v, k) for k, v in d.items()])
        # Morse code symbols are separated by empty space chars
        for x in txt.split(' '):
            translation += d_decrypt.get(x)
    # Encrypt normal text to Morse code:
    else:
        for x in txt.upper():
            translation += d.get(x) + ' '
    return translation.strip()

print(morse('python'))
# .--. -.-- - .... --- -.
print(morse('.--. -.-- - .... --- -.'))
# PYTHON
print(morse(morse('HEY')))
# HEY
Algorithmic complexity: The runtime complexity is linear in the length of the input string to be translated: one translation operation per character. Dictionary lookups have constant runtime complexity. The memory overhead is also linear in the input text, as all the characters have to be held in memory.
Alternative Implementation: Albrecht also proposed a much shorter alternative:
def morse(txt):
    encrypt = {'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.',
               'F': '..-.', 'G': '--.', 'H': '....', 'I': '..', 'J': '.---',
               'K': '-.-', 'L': '.-..', 'M': '--', 'N': '-.', 'O': '---',
               'P': '.--.', 'Q': '--.-', 'R': '.-.', 'S': '...', 'T': '-',
               'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-', 'Y': '-.--',
               'Z': '--..', ' ': '.....'}
    decrypt = {v: k for k, v in encrypt.items()}
    if '-' in txt:
        return ''.join(decrypt[i] for i in txt.split())
    return ' '.join(encrypt[i] for i in txt.upper())

print(morse('python'))
# .--. -.-- - .... --- -.
print(morse('.--. -.-- - .... --- -.'))
# PYTHON
print(morse(morse('HEY')))
# HEY
It uses dict comprehension and generator expressions to make it much more concise.
Ready to earn the black belt of your regex superpower? This tutorial shows you the subtle but important difference between greedy and non-greedy regex quantifiers.
But first things first: what are “quantifiers” anyway? Great question – I’m glad you asked! So let’s dive into Python’s three main regex quantifiers.
Python Regex Quantifiers
The word “quantifier” originates from Latin: its meaning is quantus = how much / how often.
This is precisely what a regular expression quantifier means: you tell the regex engine how often you want to match a given pattern.
Even if you think you don’t define any quantifier, you define one implicitly: no quantifier means matching the regular expression exactly once.
So what are the regex quantifiers in Python?
A?: Match regular expression A zero or one time.
A*: Match regular expression A zero or more times.
A+: Match regular expression A one or more times.
A{m}: Match regular expression A exactly m times.
A{m,n}: Match regular expression A between m and n times (inclusive).
Note that in this tutorial, I assume you have at least a remote idea of what regular expressions actually are. If you haven’t, no problem, check out my detailed regex tutorial on this blog.
You see in the table that the quantifiers ?, *, +, {m}, and {m,n} define how often you repeat the matching of regex A.
Let’s have a look at some examples—one for each quantifier:
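Here's a sketch of all five quantifiers applied to the same text 'aaaa' (outputs shown are from a recent Python 3 interpreter):

```python
import re

print(re.findall('a?', 'aaaa'))      # ['a', 'a', 'a', 'a', '']
print(re.findall('a*', 'aaaa'))      # ['aaaa', '']
print(re.findall('a+', 'aaaa'))      # ['aaaa']
print(re.findall('a{3}', 'aaaa'))    # ['aaa']
print(re.findall('a{1,2}', 'aaaa'))  # ['aa', 'aa']
```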
In each line, you try a different quantifier on the same text 'aaaa'. And, interestingly, each line leads to a different output:
The zero-or-one regex 'a?' matches four single 'a's (plus one empty string at the very end of the string). Note that it doesn’t match zero characters if it can avoid doing so.
The zero-or-more regex 'a*' matches once four 'a's and consumes them. At the end of the string, it can still match the empty string.
The one-or-more regex 'a+' matches once four 'a's. In contrast to the previous quantifier, it cannot match an empty string.
The repeating regex 'a{3}' matches exactly three 'a's in a single run. It can do so only once in 'aaaa'.
The repeating regex 'a{1,2}' matches one or two 'a's. It tries to match as many as possible.
You’ve learned the basic quantifiers of Python regular expressions. Now, it’s time to explore the meaning of the term greedy. Shall we?
Python Regex Greedy Match
A greedy match means that the regex engine (the one which tries to find your pattern in the string) matches as many characters as possible.
For example, the regex 'a+' will match as many 'a's as possible in your string 'aaaa'. Although the substrings 'a', 'aa', 'aaa' all match the regex 'a+', it’s not enough for the regex engine. It’s always hungry and tries to match even more.
In other words, the greedy quantifiers give you the longest match from a given position in the string.
As it turns out, all default quantifiers ?, *, +, {m}, and {m,n} you’ve learned above are greedy: they “consume” or match as many characters as possible so that the regex pattern is still satisfied.
Here are the above examples again that all show how greedy the regex engine is:
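As a sketch, here are the greedy matches again, this time retrieved with re.search():

```python
import re

# Each greedy quantifier grabs the longest possible match at position 0:
print(re.search('a?', 'aaaa').group())      # 'a'
print(re.search('a*', 'aaaa').group())      # 'aaaa'
print(re.search('a+', 'aaaa').group())      # 'aaaa'
print(re.search('a{1,3}', 'aaaa').group())  # 'aaa'
```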
In all cases, a shorter match would also be valid. But as the regex engine is greedy per default, those are not enough for the regex engine.
Okay, so how can we do a non-greedy match?
Python Regex Non-Greedy Match
A non-greedy match means that the regex engine matches as few characters as possible—so that it still can match the pattern in the given string.
For example, the regex 'a+?' will match as few 'a's as possible in your string 'aaaa'. Thus, it matches the first character 'a' and is done with it. Then, it moves on to the second character (which is also a match) and so on.
In other words, the non-greedy quantifiers give you the shortest possible match from a given position in the string.
You can make the default quantifiers ?, *, +, {m}, and {m,n} non-greedy by appending a question mark symbol '?' to them: ??, *?, +?, {m}?, and {m,n}?. They “consume” or match as few characters as possible so that the regex pattern is still satisfied.
Here are some examples that show how non-greedy matching works:
Non-Greedy Question Mark Operator (??)
Let’s start with the question mark (zero-or-one operator):
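A minimal sketch comparing the two (the findall output assumes Python 3.7 or later):

```python
import re

# Greedy zero-or-one: prefers to match one 'a':
print(re.search('a?', 'aaaa'))   # <re.Match object; span=(0, 1), match='a'>

# Non-greedy zero-or-one: prefers the empty match:
print(re.search('a??', 'aaaa'))  # <re.Match object; span=(0, 0), match=''>

# With findall, empty matches alternate with forced single-'a' matches:
print(re.findall('a??', 'aaaa'))
```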
In the first instance, you use the zero-or-one regex 'a?'. It’s greedy so it matches one 'a' character if possible.
In the second instance, you use the non-greedy zero-or-one version 'a??'. It matches zero 'a's if possible. Note that it moves from left to right so it matches the empty string and “consumes” it. Only then, it cannot match the empty string anymore so it is forced to match the first 'a' character. But after that, it’s free to match the empty string again. This pattern of first matching the empty string and only then matching the 'a' if it is absolutely needed repeats. That’s why this strange pattern occurs.
Non-Greedy Asterisk Operator (*?)
Let’s continue with the asterisk (zero-or-more operator):
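Here's a sketch for the asterisk:

```python
import re

# Greedy: consumes all four 'a's at once (plus a final empty match):
print(re.findall('a*', 'aaaa'))  # ['aaaa', '']

# Non-greedy: prefers the empty match at each position:
print(re.search('a*?', 'aaaa'))  # <re.Match object; span=(0, 0), match=''>
```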
First, you use the zero-or-more asterisk regex 'a*'. It’s greedy so it matches as many 'a' characters as it can.
Second, you use the non-greedy zero-or-more version 'a*?'. Again, it matches zero 'a's if possible. Only if it has already matched zero characters at a certain position, it matches one character at that position, “consumes” it, and moves on.
Non-Greedy Plus Operator (+?)
Let’s finish with the plus (one-or-more operator):
First, you use the one-or-more plus regex 'a+'. It’s greedy so it matches as many 'a' characters as it can (but at least one).
Second, you use the non-greedy one-or-more version 'a+?'. In this case, the regex engine matches only one character 'a', consumes it, and moves on with the next match.
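In code, this difference looks as follows:

```python
import re

print(re.findall('a+', 'aaaa'))   # ['aaaa']
print(re.findall('a+?', 'aaaa'))  # ['a', 'a', 'a', 'a']
```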
Let’s summarize what you’ve learned so far:
Greedy vs Non-Greedy Match – What’s the Difference?
Say you’re given a pattern with a quantifier (e.g. the asterisk operator) that allows the regex engine to match the pattern multiple times.
A given string may match the regex in multiple ways. For example, both substrings 'a' and 'aaa' are valid matches when matching the pattern 'a*' in the string 'aaaa'.
So the difference between the greedy and the non-greedy match is the following: The greedy match will try to match as many repetitions of the quantified pattern as possible. The non-greedy match will try to match as few repetitions of the quantified pattern as possible.
Examples Greedy vs Non-Greedy Match
Let’s consider a range of examples that help you understand the difference between greedy and non-greedy matches in Python:
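The following sketch contrasts each greedy quantifier with its non-greedy counterpart on the string 'aaaa':

```python
import re

s = 'aaaa'
print(re.search('a?', s).group())       # 'a'     (greedy)
print(re.search('a??', s).group())      # ''      (non-greedy)
print(re.search('a*', s).group())       # 'aaaa'  (greedy)
print(re.search('a*?', s).group())      # ''      (non-greedy)
print(re.search('a+', s).group())       # 'aaaa'  (greedy)
print(re.search('a+?', s).group())      # 'a'     (non-greedy)
print(re.search('a{1,3}', s).group())   # 'aaa'   (greedy)
print(re.search('a{1,3}?', s).group())  # 'a'     (non-greedy)
```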
Make sure you completely understand those examples before you move on. If you don’t, please read the previous paragraphs again.
Which is Faster: Greedy vs Non-Greedy?
Considering that greedy quantifiers match a maximal and non-greedy a minimal number of patterns, is there any performance difference?
Great question!
Indeed, some benchmarks suggest that there can be a significant performance difference: in one realistic experiment on benchmark data, the greedy quantifier was roughly 100% slower, i.e., it took about twice as long.
So if you optimize for speed and you don’t care about greedy or non-greedy matches—and you don’t know anything else—go for the non-greedy quantifier!
However, the truth is not as simple. For example, consider the following basic experiment that falsifies the previous hypothesis that the non-greedy version is faster:
I used the speed testing tool timeit, which lets you pass in simple Python statements and measures how long they take to run. By default, the passed statement is executed 1,000,000 times.
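The experiment can be sketched as follows (the exact numbers depend on your machine; number is reduced here to keep the run short):

```python
import timeit

# Time the greedy and the non-greedy variant of the same search:
greedy = timeit.timeit("re.findall('a*', 'aaaaaaaaaaaa')",
                       setup='import re', number=10000)
non_greedy = timeit.timeit("re.findall('a*?', 'aaaaaaaaaaaa')",
                           setup='import re', number=10000)
print(f'greedy: {greedy:.4f}s, non-greedy: {non_greedy:.4f}s')
```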
You can see a notable performance difference: the non-greedy version is about three times slower than the greedy version.
Why is that?
The reason is the re.findall() method that returns a list of matching substrings. Here’s the output both statements would produce:
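Here's a sketch of those outputs (the findall result for the non-greedy pattern assumes Python 3.7 or later):

```python
import re

print(re.findall('a*', 'aaaaaaaaaaaa'))
# ['aaaaaaaaaaaa', '']

# The non-greedy version produces 25 matches: empty matches
# alternating with single-'a' matches (Python 3.7+):
print(len(re.findall('a*?', 'aaaaaaaaaaaa')))
# 25
```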
You can see that the greedy version finds one big match (plus a trailing empty match) and is done with it. The non-greedy version finds 25 matches, which leads to far more processing and memory overhead.
So what happens if you use the re.search() method that returns only the first match rather than the re.findall() method that returns all matches?
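Here's the same comparison with re.search(), as a sketch:

```python
import re

# re.search() returns only the first match:
print(re.search('a*', 'aaaaaaaaaaaa'))
# <re.Match object; span=(0, 12), match='aaaaaaaaaaaa'>
print(re.search('a*?', 'aaaaaaaaaaaa'))
# <re.Match object; span=(0, 0), match=''>
```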
As expected, this changes things again. Both regex searches yield a single result, but the non-greedy match is much shorter: it matches the empty string '' rather than the whole string 'aaaaaaaaaaaa'. Of course, this is a bit faster.
However, the difference is negligible in this minimal example.
There’s More: Greedy, Docile, Lazy, Helpful, Possessive Match
In this article, I’ve classified the regex world into greedy and non-greedy quantifiers. But you can differentiate the “non-greedy” class even more!
Next, I’ll give you a short overview based on this great article of the most important terms in this regard:
Greedy: match as many instances of the quantified pattern as you can.
Docile: match as many instances of the quantified pattern as possible, as long as the overall pattern still matches. Note that what I called “greedy” in this article is really “docile”.
Lazy: match as few instances of the quantified pattern as needed. This is what I called “non-greedy” in this article.
Helpful: match as few instances of the quantified pattern as possible, but expand the match if the overall pattern requires it.
Possessive: never gives up a partial match. So the regex engine may not even find a match that actually exists, just because it’s so greedy. This is very unusual and you won’t see it a lot in practice.
If you want to learn more about those, I’d recommend that you read this excellent online tutorial.
Where to Go From Here
Summary: You’ve learned that the greedy quantifiers ?, *, and + match as many repetitions of the quantified pattern as possible. The non-greedy quantifiers ??, *?, and +? match as few repetitions of the quantified pattern as possible.
This tutorial makes you a master of character sets in Python. (I know, I know, it feels awesome to see your deepest desires finally come true.)
As I wrote this article, I saw a lot of different terms describing this same powerful concept, such as “character class”, “character range”, or “character group”. However, the most precise term is “character set” as introduced in the official Python regex docs. So in this tutorial, I’ll use this term throughout.
Python Regex – Character Set
So, what is a character set in regular expressions?
The character set is (surprise) a set of characters: if you use a character set in a regular expression pattern, you tell the regex engine to choose one arbitrary character from the set. As you may know, a set is an unordered collection of unique elements. So each character in a character set is unique and the order doesn’t really matter (with a few minor exceptions).
Here’s an example of a character set as used in a regular expression:
>>> import re
>>> re.findall('[abcde]', 'hello world!')
['e', 'd']
You use the re.findall(pattern, string) method to match the pattern '[abcde]' in the string 'hello world!'. You can think of all characters a, b, c, d, and e as being in an OR relation: either of them would be a valid match.
The regex engine goes from the left to the right, scanning over the string 'hello world!' and simultaneously trying to match the (character set) pattern. Two characters from the text 'hello world!' are in the character set: they are valid matches and returned by the re.findall() method.
You can simplify many character sets by using the range symbol '-' that has a special meaning within square brackets: [a-z] reads “match any character from a to z”, while [0-9] reads “match any character from 0 to 9”.
You can even combine multiple character ranges in a single character set:
>>> re.findall('[a-eA-E0-4]', 'hello WORLD 42!')
['e', 'D', '4', '2']
Here, you match three ranges: lowercase characters from a to e, uppercase characters from A to E, and numbers from 0 to 4. Note that the ranges are inclusive so both start and stop symbols are included in the range.
Python Regex Negative Character Set
But what if you want to match all characters—except some? You can achieve this with a negative character set!
The negative character set works just like a character set, but with one difference: it matches all characters that are not in the character set.
Here’s an example where you match all sequences of characters that do not contain the characters a, b, c, d, or e:
>>> import re
>>> re.findall('[^a-e]+', 'hello world')
['h', 'llo worl']
We use the “at-least-once quantifier +” in the example that matches at least one occurrence of the preceding regex (if you’re unsure about how it works, check out my detailed Finxter tutorial about the plus operator).
There are only two such sequences: the one-character sequence 'h' and the eight-character sequence 'llo worl'. You can see that even the space character matches the negative character set.
Summary: the negative character set matches all characters that are not enclosed in the brackets.
How to Fix “re.error: unterminated character set at position”?
Now that you know character sets, you can probably fix this error easily: it occurs if you use an unescaped opening bracket '[' in your pattern. Maybe you want to match the character '[' in your string?
But Python assumes that you’ve just opened a character set, and that you forgot to close it.
Here’s an example:
>>> re.findall('[', 'hello [world]')
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    re.findall('[', 'hello [world]')
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 532, in _parse
    source.tell() - here)
re.error: unterminated character set at position 0
The error happens because you used the bracket character '[' as if it were a normal symbol.
So, how to fix it? Just escape the special bracket character with a single backslash, ideally inside a raw string: r'\['.
>>> re.findall(r'\[', 'hello [world]')
['[']
This removes the “special” meaning of the bracket symbol.
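If you’d rather not escape metacharacters by hand, the standard library helper re.escape() does it for you. A small sketch:

```python
import re

# re.escape() backslash-escapes every regex metacharacter in a string,
# turning arbitrary literal text into a safe pattern.
pattern = re.escape('[world]')
print(re.findall(pattern, 'hello [world]'))
# ['[world]']
```

This is especially handy when the text you want to match literally comes from user input and may contain any number of special characters.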
Related Re Methods
There are seven important regular expression methods which you must master:
The re.findall(pattern, string) method returns a list of string matches. Read more in our blog tutorial.
The re.search(pattern, string) method returns a match object of the first match. Read more in our blog tutorial.
The re.match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Read more in our blog tutorial.
The re.fullmatch(pattern, string) method returns a match object if the regex matches the whole string. Read more in our blog tutorial.
The re.compile(pattern) method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in our blog tutorial.
The re.split(pattern, string) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in our blog tutorial.
The re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in our blog tutorial.
You’ve learned everything you need to know about the Python Regex Character Set Operator.
Summary:
If you use a character set [XYZ] in a regular expression pattern, you tell the regex engine to choose one arbitrary character from the set: X, Y, or Z.
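A minimal sketch of the character set operator:

```python
import re

# [XYZ] tells the regex engine to match exactly one character
# out of the set: X, Y, or Z.
print(re.findall('[XYZ]', 'XYZ matches X or Y or Z'))
# ['X', 'Y', 'Z', 'X', 'Y', 'Z']
```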
You may already know Python’s and operator when applied to two Booleans:
>>> True and False
False
>>> True and True
True
Simple enough. Yet, that’s not the whole story: you can use the and operator even on complex data types such as lists or custom objects. So you may ask (and rightly so):
What If You Apply the AND Operator To Two Objects?
To understand the output, you have to understand two things:
How does the and operator work?
What’s the truth value of any object – such as a list?
You must understand the deeper meaning of those definitions: all of them are short-circuit, which means that they stop evaluating as soon as the result is determined.
In the x and y operation, if x evaluates to False, Python simply returns x without even looking at y. If x evaluates to True, Python returns the value of y. For Booleans, this is exactly the expected behavior: if x is False, x and y must be False no matter what y is; if x is True, then y alone determines whether x and y is True.
This leads to the interesting behavior: if x and y are objects, the result of the operation x and y will be an object, too! (And not a Boolean value.)
In combination with the next piece of Python knowledge, this leads to an interesting behavior:
What’s the truth value of any object – such as a list?
The Python convention is simple: if the object is “empty”, the truth value is False. Otherwise, it’s True. So an empty list, an empty string, or a 0 integer value are all False. Most other values will be True.
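A quick sketch of the convention:

```python
# Python's truthiness convention: "empty" objects evaluate to False,
# everything else to True.
print(bool([]))       # empty list
print(bool(''))       # empty string
print(bool(0))        # zero
print(bool([1, 2]))   # non-empty list
print(bool('hi'))     # non-empty string
```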
Now, you’re equipped with the basics to understand the answer to the following question:
What If You Apply the AND Operator To Two Objects?
Say, you’ve got two non-Boolean objects x and y. What’s the result of the operation x and y?
The answer is simple: the result is y if x is non-empty (and, thus, evaluates to True); otherwise, the result is x.
What If You Apply the AND Operator To Two Lists?
Here’s an example for two list objects:
>>> [1, 2, 3] and [0, 0, 0, 0]
[0, 0, 0, 0]
The first argument of the and operation is non-empty and evaluates to True. Therefore, the result of the operation is the second list argument [0, 0, 0, 0].
But what if the first argument is empty?
>>> [] and [0, 0, 0, 0]
[]
The result is the first argument (and not a Boolean value False). If you’re in doubt why, consult the above definition again:
x and y: if x is false, then x, else y
Summary
You’ve learned that the and operator returns the first operand if it evaluates to False, otherwise the second operand.
You’ve also learned that you can use the and operator even for non-Boolean types in which case the result will be an object, not a Boolean value.
Finally, you’ve also learned that an empty object usually evaluates to False.
If you find this interesting, feel free to check out my upcoming Python book that shows you hundreds of small Python tricks like this one:
Congratulations – you’re about to become a regular expression master. I’ve not only written the most comprehensive free regular expression tutorial on the web (16812 words) but also added a lot of tutorial videos wherever I saw fit.
So take your cup of coffee, scroll through the tutorial, and enjoy your brain cells getting active!
Note that I use both the term “regular expression” and the more concise “regex” in this tutorial.
Regex Methods Overview
Python’s re module comes with a number of regular expression methods that help you achieve more with less.
Think of those methods as the framework connecting regular expressions with the Python programming language. Every programming language comes with its own way of handling regular expressions. For example, the Perl programming language has many built-in mechanisms for regular expressions—you don’t need to import a regular expression library—while the Java programming language provides regular expressions only within a library. This is also the approach of Python.
These are the most important regular expression methods of Python’s re module:
re.findall(pattern, string): Checks if the string matches the pattern and returns all occurrences of the matched pattern as a list of strings.
re.search(pattern, string): Checks if the string matches the regex pattern and returns only the first match as a match object. The match object is just that: an object that stores meta information about the match such as the matching position and the matched substring.
re.match(pattern, string): Checks if the regex pattern matches at the beginning of the string and returns a match object.
re.fullmatch(pattern, string): Checks if the whole string matches the regex pattern and returns a match object.
re.compile(pattern): Creates a regular expression object from the pattern to speed up the matching if you want to use the regex pattern multiple times.
re.split(pattern, string): Splits the string wherever the pattern regex matches and returns a list of strings. For example, you can split a string into a list of words by using whitespace characters as separators.
re.sub(pattern, repl, string): Replaces (substitutes) all occurrences of the regex pattern with the replacement string repl and returns a new string.
Example: Let’s have a look at some examples of all the above functions:
import re

text = '''
LADY CAPULET

    Alack the day, she's dead, she's dead, she's dead!

CAPULET

    Ha! let me see her: out, alas! she's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.

Nurse

    O lamentable day!
'''

print(re.findall('she', text))
'''
Finds the pattern 'she' four times in the text:
['she', 'she', 'she', 'she']
'''

print(re.search('she', text))
'''
Finds the first match of 'she' in the text:
<re.Match object; span=(34, 37), match='she'>
The match object contains important information
such as the matched position.
'''

print(re.match('she', text))
'''
Tries to match any string prefix -- but nothing found:
None
'''

print(re.fullmatch('she', text))
'''
Fails to match the whole string with the pattern 'she':
None
'''

print(re.split('\n', text))
'''
Splits the whole string on the new line delimiter '\n':
['', 'LADY CAPULET', '', "    Alack the day, she's dead, she's dead, she's dead!",
 '', 'CAPULET', '', "    Ha! let me see her: out, alas! she's cold:",
 '    Her blood is settled, and her joints are stiff;',
 '    Life and these lips have long been separated:',
 '    Death lies on her like an untimely frost',
 '    Upon the sweetest flower of all the field.',
 '', 'Nurse', '', '    O lamentable day!', '']
'''

print(re.sub('she', 'he', text))
'''
Replaces all occurrences of 'she' with 'he':

LADY CAPULET

    Alack the day, he's dead, he's dead, he's dead!

CAPULET

    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.

Nurse

    O lamentable day!
'''
Now, you know the most important regular expression functions. You know how to apply regular expressions to strings. But you don’t know how to write your regex patterns in the first place. Let’s dive into regular expressions and fix this once and for all!
Basic Regex Operations
A regular expression is a decades-old concept in computer science. Invented in the 1950s by famous mathematician Stephen Cole Kleene, the decades of evolution brought a huge variety of operations. Collecting all operations and writing up a comprehensive list would result in a very thick and unreadable book by itself.
Fortunately, you don’t have to learn all regular expressions before you can start using them in your practical code projects. Next, you’ll get a quick and dirty overview of the most important regex operations and how to use them in Python. In follow-up chapters, you’ll then study them in detail — with many practical applications and code puzzles.
Here are the most important regex operators:
. The wild-card operator (‘dot’) matches any character in a string except the newline character ‘\n’. For example, the regex ‘...’ matches all words with three characters such as ‘abc’, ‘cat’, and ‘dog’.
* The zero-or-more asterisk operator matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex. For example, the regex ‘cat*’ matches the strings ‘ca’, ‘cat’, ‘catt’, ‘cattt’, and ‘catttttttt’.
? The zero-or-one operator matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. For example, the regex ‘cat?’ matches both strings ‘ca’ and ‘cat’ — but not ‘catt’, ‘cattt’, and ‘catttttttt’.
+ The at-least-one operator matches one or more occurrences of the immediately preceding regex. For example, the regex ‘cat+’ does not match the string ‘ca’ but matches all strings with at least one trailing character ‘t’ such as ‘cat’, ‘catt’, and ‘cattt’.
^ The start-of-string operator matches the beginning of a string. For example, the regex ‘^p’ would match the strings ‘python’ and ‘programming’ but not ‘lisp’ and ‘spying’ where the character ‘p’ does not occur at the start of the string.
$ The end-of-string operator matches the end of a string. For example, the regex ‘py$’ would match the strings ‘main.py’ and ‘pypy’ but not the strings ‘python’ and ‘pypi’.
A|B The OR operator matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. For example, the regex ‘(hello)|(hi)’ matches strings ‘hello world’ and ‘hi python’. It wouldn’t make sense to try to match both of them at the same time.
AB The AND operator matches first the regex A and second the regex B, in this sequence. We’ve already seen it trivially in the regex ‘ca’ that matches first regex ‘c’ and second regex ‘a’.
Note that I gave the above operators some more meaningful names (in bold) so that you can immediately grasp the purpose of each regex. For example, the ‘^’ operator is usually denoted as the ‘caret’ operator. Those names are not descriptive so I came up with more kindergarten-like words such as the “start-of-string” operator.
We’ve already seen many examples but let’s dive into even more!
import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.
'''

print(re.findall('.a!', text))
'''
Finds all occurrences of an arbitrary character that is
followed by the character sequence 'a!'.
['Ha!']
'''

print(re.findall('is.*and', text))
'''
Finds all occurrences of the word 'is',
followed by an arbitrary number of characters
and the word 'and'.
['is settled, and']
'''

print(re.findall('her:?', text))
'''
Finds all occurrences of the word 'her',
followed by zero or one occurrences of the colon ':'.
['her:', 'her', 'her']
'''

print(re.findall('her:+', text))
'''
Finds all occurrences of the word 'her',
followed by one or more occurrences of the colon ':'.
['her:']
'''

print(re.findall('^Ha.*', text))
'''
Finds all occurrences where the string starts with
the character sequence 'Ha', followed by an arbitrary
number of characters except for the new-line character.
Can you figure out why Python doesn't find any?
[]
'''

print(re.findall('\n$', text))
'''
Finds all occurrences where the new-line character '\n'
occurs at the end of the string.
['\n']
'''

print(re.findall('(Life|Death)', text))
'''
Finds all occurrences of either the word 'Life' or the
word 'Death'.
['Life', 'Death']
'''
In these examples, you’ve already seen the special symbol ‘\n’ which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions. Next, we’ll discover the most important special symbols.
Special Symbols
Regular expressions need special symbols like you need air to breathe. Some symbols such as the newline character ‘\n’ are vital for writing effective regular expressions in practice. Other symbols such as the word symbol ‘\w’ make your code more readable and concise, being a one-symbol solution for the longer regex [a-zA-Z0-9_].
Many of those symbols are also available in other regex languages such as Perl. Thus, studying this list carefully will improve your conceptual strength in using regular expressions—independent from the concrete tool you use.
Let’s get a quick overview of the four most important special symbols in Python’s re library!
\n The newline symbol is not a special symbol of the regex library, it’s a standard character. However, you’ll see the newline character so often that I just couldn’t write this list without including it. For example, the regex ‘hello\nworld’ matches a string where the string ‘hello’ is placed in one line and the string ‘world’ is placed into the second line.
\t The tabular character is, like the newline character, not a special symbol of the regex library. It simply encodes the tab character, which is different from a sequence of whitespace characters (even if it may not look different). For example, the regex ‘hello\n\tworld’ matches the string that consists of ‘hello’ in the first line and ‘world’ in the second line, preceded by a leading tab character.
\s The whitespace character is, in contrast to the newline character, a special symbol of the regex libraries. You’ll find it in many other programming languages, too. The problem is that you often don’t know which type of whitespace is used: tabular characters, simple whitespaces, or even newlines. The whitespace character ‘\s’ simply matches any of them. For example, the regex ‘\s+hello\s+world’ would match the string ‘ \t \n hello \n \n \t world’, as well as ‘hello world’.
\w The word character regex simplifies text processing significantly. If you want to match any word but you don’t want to write complicated subregexes to match a word character, you can simply use the word character regex \w, which matches any Unicode word character (letters, digits, and the underscore). For example, the regex ‘\w+’ matches the strings ‘hello’, ‘bye’, ‘Python’, and ‘Python_is_great’.
\W The negative word character. It matches any character that is not a word character.
\b The word boundary regex is also a special symbol used in many regex tools. You can use it to match (as the name suggests) a word boundary between the \w and the \W character. But note that it matches only the empty string! You may ask: why does it exist if it doesn’t match any character? The reason is that it doesn’t “consume” the character right in front or right after a word. This way, you can search for whole words (or parts of words) and return only the word but not the delimiting character itself.
\d The digit character matches all numeric symbols between 0 and 9. You can use it to match integers with an arbitrary number of digits: the regex ‘\d+’ matches integer numbers ‘10’, ‘1000’, ‘942’, and ‘99999999999’.
These are the most important special symbols and characters. A detailed examination follows in subsequent tutorials.
But before we move on, let’s understand them better by studying some examples!
import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.
'''

print(re.findall('\w+\W+\w+', text))
'''
Matches each pair of words in the text:
['Ha! let', 'me see', 'her: out', 'alas! he', 's cold', 'Her blood',
 'is settled', 'and her', 'joints are', 'stiff;\n    Life', 'and these',
 'lips have', 'long been', 'separated:\n    Death', 'lies on', 'her like',
 'an untimely', 'frost\n    Upon', 'the sweetest', 'flower of', 'all the']
Note that it also matches across new lines: 'stiff;\n    Life' is
matched, too! Note also that what is already matched is "consumed" and
doesn't match again. This is why the combination 'let me' is not
a matching substring.
'''

print(re.findall('\d', text))
'''
No integers in the text:
[]
'''

print(re.findall('\n\t', text))
'''
Match all occurrences where a tab follows a newline:
[]
No match because each line starts with a sequence of four
whitespaces rather than the tab character.
'''

print(re.findall('\n    ', text))
'''
Match all occurrences where four whitespaces '    ' follow a newline:
['\n    ', '\n    ', '\n    ', '\n    ', '\n    ']
Matches all five lines.
'''
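One symbol the examples above didn’t demonstrate is the word boundary \b. A small sketch (the example string is chosen for illustration):

```python
import re

# \b matches the empty string at a word boundary, so 'cat' is found
# only as a whole word -- not inside 'catalog' or 'bobcat'.
print(re.findall(r'\bcat\b', 'cat catalog bobcat cat!'))
# ['cat', 'cat']
```

Note the raw string r'...' so that Python doesn’t interpret ‘\b’ as the backspace character before the regex engine ever sees it.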
Regex Methods
You’ve already studied the regex functions superficially at the beginning of this tutorial. Now, you’re going to learn everything about those important functions in great detail.
findall()
The findall() method is the most basic way of using regular expressions in Python. So how does the re.findall() method work?
Let’s study its specification.
How Does the findall() Method Work in Python?
The re.findall(pattern, string) method finds all occurrences of the pattern in the string and returns a list of all matching substrings.
Specification:
re.findall(pattern, string, flags=0)
The re.findall() method has up to three arguments.
pattern: the regular expression pattern that you want to match.
string: the string which you want to search for the pattern.
flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function.
You’ll dive into each of them in a moment.
Return Value:
The re.findall() method returns a list of strings. Each string element is a matching substring of the string argument.
Let’s check out a few examples!
Examples: re.findall()
First, you import the re module and create the text string to be searched for the regex patterns:
import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.
'''
Let’s say, you want to search the text for the string ‘her’:
>>> re.findall('her', text)
['her', 'her', 'her']
The first argument is the pattern you look for. In our case, it’s the string ‘her’. The second argument is the text to be analyzed. You stored the multi-line string in the variable text—so you take this as the second argument. You don’t need to define the optional third argument flags of the findall() method because you’re fine with the default behavior in this case.
Also note that the findall() function returns a list of all matching substrings. In this case, this may not be too useful because we only searched for an exact string. But if we search for more complicated patterns, this may actually be very useful:
The regex ‘\\bf\\w+\\b’ matches all words that start with the character ‘f’.
You may ask: why enclose the regex with a leading and trailing ‘\\b’? This is the word boundary character that matches the empty string at the beginning or at the end of a word. You can define a word as a sequence of characters that are not whitespace characters or other delimiters such as ‘.:,?!’.
In the previous example, you need to escape the boundary character with a double backslash ‘\\b’ because in a normal Python string, the character sequence ‘\b’ denotes the backspace character, not the regex word boundary. A cleaner alternative is a raw string, r'\bf\w+\b', where Python leaves the backslashes untouched.
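Here’s a small sketch of the raw-string style, using the text variable from this section:

```python
import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.
'''

# In a raw string r'...', backslashes are not interpreted by Python,
# so \b reaches the regex engine as the word-boundary symbol:
matches = re.findall(r'\bf\w+\b', text)
print(matches)
# ['frost', 'flower', 'field']
```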
Summary
You now know that the re.findall(pattern, string) method matches all occurrences of the regex pattern in a given string—and returns a list of all matches as strings.
Intermezzo: Python Regex Flags
In many functions, you see a third argument flags. What are they and how do they work?
Flags allow you to control the regular expression engine. Because regular expressions are so powerful, they are a useful way of switching on and off certain features (e.g. whether to ignore capitalization when matching your regex).
For example, here’s how the third argument flags is used in the re.findall() method:
re.findall(pattern, string, flags=0)
So the flags argument seems to be an integer argument with the default value of 0. To control the default regex behavior, you simply use one of the predefined integer values. You can access these predefined values via the re library:
re.ASCII (short: re.A): If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s and \S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters, as the name suggests.
re.DEBUG: If you use this flag, Python will print some useful information to the shell that helps you debug your regex.
re.IGNORECASE (short: re.I): If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z].
re.LOCALE (short: re.L): Don’t use this flag, ever. It’s deprecated: the idea was to perform case-insensitive matching depending on your current locale, but it isn’t reliable.
re.MULTILINE (short: re.M): This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now also matches at the end of each line in a multi-line string.
re.DOTALL (short: re.S): Without this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character.
re.VERBOSE (short: re.X): To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: all whitespace characters and lines that start with the character ‘#’ are ignored in the regex.
How to Use These Flags?
Simply include the flag as the optional flag argument as follows:
import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''

print(re.findall('HER', text, flags=re.IGNORECASE))
# ['her', 'Her', 'her', 'her']
As you see, the re.IGNORECASE flag ensures that all occurrences of the string ‘her’ are matched—no matter their capitalization.
How to Use Multiple Flags?
Simply add them together (sum them up) as follows:

import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''

print(re.findall(' HER # Ignored', text,
                 flags=re.IGNORECASE + re.VERBOSE))
# ['her', 'Her', 'her', 'her']
You use both flags re.IGNORECASE (all occurrences of lower- or uppercase string variants of ‘her’ are matched) and re.VERBOSE (ignore comments and whitespace in the regex). You sum them together, re.IGNORECASE + re.VERBOSE, to indicate that you want both. (Combining them with the bitwise OR operator, as in re.IGNORECASE | re.VERBOSE, is equivalent and more common.)
search()
This article is all about the search() method. To learn about the easy-to-use but less powerful findall() method that returns a list of string matches, check out our article about the similar findall() method.
So how does the re.search() method work? Let’s study the specification.
How Does re.search() Work in Python?
The re.search(pattern, string) method matches the first occurrence of the pattern in the string and returns a match object.
Specification:
re.search(pattern, string, flags=0)
The re.search() method has up to three arguments.
pattern: the regular expression pattern that you want to match.
string: the string which you want to search for the pattern.
flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function.
The re.search() method returns a match object. You may ask (and rightly so):
What’s a Match Object?
If a regular expression matches a part of your string, there’s a lot of useful information that comes with it: what’s the exact position of the match? Which regex groups were matched—and where?
The match object is a simple wrapper for this information. Some regex methods of the re package in Python—such as search()—automatically create a match object upon the first pattern match.
At this point, you don’t need to explore the match object in detail. Just know that we can access the start and end positions of the match in the string by calling the methods m.start() and m.end() on the match object m:
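The code snippet this paragraph refers to seems to have been lost in formatting. A minimal reconstruction (assuming the pattern 'h...o' and the string 'hello world' discussed next) could look like this:

```python
import re

# Create a match object m: 'h...o' matches 'h', three arbitrary
# characters, and 'o', i.e. the substring 'hello'.
m = re.search('h...o', 'hello world')

print(m.start())  # 0
print(m.end())    # 5

# Slice the original string to recover the matched substring:
print('hello world'[m.start():m.end()])  # hello
```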
In the first line, you create a match object m by using the re.search() method. The pattern ‘h…o’ matches in the string ‘hello world’ at start position 0. You use the start and end position to access the substring that matches the pattern (using the popular Python technique of slicing).
Now, you know the purpose of the match() object in Python. Let’s check out a few examples of re.search()!
A Guided Example for re.search()
First, you import the re module and create the text string to be searched for the regex patterns:
>>> import re
>>> text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''
Let’s say you want to search the text for the string ‘her’:
The first argument is the pattern to be found. In our case, it’s the string ‘her’. The second argument is the text to be analyzed. You stored the multi-line string in the variable text—so you take this as the second argument. You don’t need to define the optional third argument flags of the search() method because you’re fine with the default behavior in this case.
Look at the output: it’s a match object! The match object gives the span of the match, that is, the start and end indices of the match. We can also directly access those boundaries by using the start() and end() methods of the match object:
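The interactive snippet appears to be missing here; a sketch of what it likely showed (with the text variable shortened for brevity) is:

```python
import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
'''

m = re.search('her', text)
print(m)                        # a match object for the first 'her'
print(m.start(), m.end())       # start and end index of the match
print(text[m.start():m.end()])  # her
```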
The problem is that the search() method only retrieves the first occurrence of the pattern in the string. If you want to find all matches in the string, you may want to use the findall() method of the re library.
What’s the Difference Between re.search() and re.findall()?
There are two differences between the re.search(pattern, string) and re.findall(pattern, string) methods:
re.search(pattern, string) returns a match object while re.findall(pattern, string) returns a list of matching strings.
re.search(pattern, string) returns only the first match in the string while re.findall(pattern, string) returns all matches in the string.
Both can be seen in the following example:
>>> text = 'Python is superior to Python'
>>> re.search('Py...n', text)
<re.Match object; span=(0, 6), match='Python'>
>>> re.findall('Py...n', text)
['Python', 'Python']
The string ‘Python is superior to Python’ contains two occurrences of ‘Python’. The search() method only returns a match object of the first occurrence. The findall() method returns a list of all occurrences.
What’s the Difference Between re.search() and re.match()?
The methods re.search(pattern, string) and re.match(pattern, string) both return a match object of the first match. However, re.match() attempts to match at the beginning of the string while re.search() matches anywhere in the string.
You can see this difference in the following code:
>>> text = 'Slim Shady is my name'
>>> re.search('Shady', text)
<re.Match object; span=(5, 10), match='Shady'>
>>> re.match('Shady', text)
>>>
The re.search() method retrieves the match of the ‘Shady’ substring as a match object. But if you use the re.match() method, there is no match and no return value because the substring ‘Shady’ does not occur at the beginning of the string ‘Slim Shady is my name’.
match()
The Python re.match() method is the third most-used regex method in Python. Let’s study the specification in detail.
How Does re.match() Work in Python?
The re.match(pattern, string) method matches the pattern at the beginning of the string and returns a match object.
Specification:
re.match(pattern, string, flags=0)
The re.match() method has up to three arguments.
pattern: the regular expression pattern that you want to match.
string: the string which you want to search for the pattern.
flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function.
We’ll explore them in more detail later.
Return Value:
The re.match() method returns a match object. You can access the start and end positions of the match in the string by calling the methods m.start() and m.end() on the match object m:
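The snippet this paragraph refers to seems to be missing; a minimal reconstruction (assuming the pattern 'h...o' and the string 'hello world' discussed next) is:

```python
import re

# re.match() anchors the pattern at the beginning of the string,
# so m.start() is always 0 on a successful match.
m = re.match('h...o', 'hello world')

print(m.start(), m.end())                # 0 5
print('hello world'[m.start():m.end()])  # hello
```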
In the first line, you create a match object m by using the re.match() method. The pattern ‘h…o’ matches in the string ‘hello world’ at start position 0. You use the start and end position to access the substring that matches the pattern (using the popular Python technique of slicing). But note that as the match() method always attempts to match only at the beginning of the string, the m.start() method will always return zero.
Now, you know the purpose of the match() object in Python. Let’s check out a few examples of re.match()!
A Guided Example for re.match()
First, you import the re module and create the text string to be searched for the regex patterns:
>>> import re
>>> text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''
Let’s say you want to search the text for the string ‘lips’:
>>> re.match('lips', text)
>>>
The first argument is the pattern to be found: the string ‘lips’. The second argument is the text to be analyzed. You stored the multi-line string in the variable text—so you take this as the second argument. The third argument flags of the match() method is optional.
There’s no output! This means that the re.match() method did not return a match object. Why? Because at the beginning of the string, there’s no match for the regex pattern ‘lips’.
So how can we fix this? Simple, by matching all the characters that precede the string ‘lips’ in the text:
>>> re.match('(.|\n)*lips', text)
<re.Match object; span=(0, 122), match="\n Ha! let me see her: out, alas! he's cold:\n>
The regex (.|\n)*lips matches all prefixes (an arbitrary number of characters including new lines) followed by the string ‘lips’. This results in a new match object that matches a huge substring from position 0 to position 122. Note that the match object doesn’t print the whole substring to the shell. If you access the matched substring, you’ll get the following result:
>>> m = re.match('(.|\n)*lips', text)
>>> text[m.start():m.end()]
"\n Ha! let me see her: out, alas! he's cold:\n Her blood is settled, and her joints are stiff;\n Life and these lips"
Interestingly, you can also achieve the same thing by specifying the third flag argument as follows:
>>> m = re.match('.*lips', text, flags=re.DOTALL)
>>> text[m.start():m.end()]
"\n Ha! let me see her: out, alas! he's cold:\n Her blood is settled, and her joints are stiff;\n Life and these lips"
The re.DOTALL flag ensures that the dot operator . matches all characters including the new line character.
What’s the Difference Between re.match() and re.findall()?
There are two differences between the re.match(pattern, string) and re.findall(pattern, string) methods:
re.match(pattern, string) returns a match object while re.findall(pattern, string) returns a list of matching strings.
re.match(pattern, string) returns only the first match in the string—and only at the beginning—while re.findall(pattern, string) returns all matches in the string.
Both can be seen in the following example:
>>> text = 'Python is superior to Python'
>>> re.match('Py...n', text)
<re.Match object; span=(0, 6), match='Python'>
>>> re.findall('Py...n', text)
['Python', 'Python']
The string ‘Python is superior to Python’ contains two occurrences of ‘Python’. The match() method only returns a match object of the first occurrence. The findall() method returns a list of all occurrences.
What’s the Difference Between re.match() and re.search()?
The methods re.search(pattern, string) and re.match(pattern, string) both return a match object of the first match. However, re.match() attempts to match at the beginning of the string while re.search() matches anywhere in the string.
You can see this difference in the following code:
>>> text = 'Slim Shady is my name'
>>> re.search('Shady', text)
<re.Match object; span=(5, 10), match='Shady'>
>>> re.match('Shady', text)
>>>
The re.search() method retrieves the match of the ‘Shady’ substring as a match object. But if you use the re.match() method, there is no match and no return value because the substring ‘Shady’ does not occur at the beginning of the string ‘Slim Shady is my name’.
fullmatch()
This section is all about the re.fullmatch(pattern, string) method of Python’s re library. There are two similar methods to help you use regular expressions:
The findall(pattern, string) method returns a list of string matches. Check out our blog tutorial.
The search(pattern, string) method returns a match object of the first match. Check out our blog tutorial.
The match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Check out our blog tutorial.
So how does the re.fullmatch() method work? Let’s study the specification.
How Does re.fullmatch() Work in Python?
The re.fullmatch(pattern, string) method returns a match object if the pattern matches the whole string.
Specification:
re.fullmatch(pattern, string, flags=0)
The re.fullmatch() method has up to three arguments.
pattern: the regular expression pattern that you want to match.
string: the string which you want to search for the pattern.
flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function.
The re.fullmatch() method returns a match object. Let’s check out a few examples of re.fullmatch()!
A Guided Example for re.fullmatch()
First, you import the re module and create the text string to be searched for the regex patterns:
>>> import re
>>> text = '''
Call me Ishmael. Some years ago--never mind how long precisely
--having little or no money in my purse, and nothing particular
to interest me on shore, I thought I would sail about a little
and see the watery part of the world. '''
Let’s say you want to match the full text with this regular expression:
>>> re.fullmatch('Call(.|\n)*', text)
>>>
The first argument is the pattern to be found: ‘Call(.|\n)*’. The second argument is the text to be analyzed. You stored the multi-line string in the variable text—so you take this as the second argument. The third argument flags of the fullmatch() method is optional and we skip it in the code.
There’s no output! This means that the re.fullmatch() method did not return a match object. Why? Because at the beginning of the string, there’s no match for the ‘Call’ part of the regex. The string starts with an empty line!
So how can we fix this? Simple, by matching a new line character ‘\n’ at the beginning of the string.
>>> re.fullmatch('\nCall(.|\n)*', text)
<re.Match object; span=(0, 229), match='\nCall me Ishmael. Some years ago--never mind how>
The regex (.|\n)* matches an arbitrary number of characters (newline characters or not) after the prefix ‘\nCall’. This matches the whole text, so the result is a match object. Note that the match spans 229 characters, so the string shown in the resulting match object is only a prefix of the whole matching string. This fact is often overlooked by beginner coders.
What’s the Difference Between re.fullmatch() and re.match()?
The methods re.fullmatch(pattern, string) and re.match(pattern, string) both return a match object. Both attempt to match at the beginning of the string. The only difference is that re.fullmatch() also attempts to match the end of the string as well: it wants to match the whole string!
You can see this difference in the following code:
>>> text = 'More with less'
>>> re.match('More', text)
<re.Match object; span=(0, 4), match='More'>
>>> re.fullmatch('More', text)
>>>
The re.match(‘More’, text) method matches the string ‘More’ at the beginning of the string ‘More with less’. But the re.fullmatch(‘More’, text) method does not match the whole text. Therefore, it returns the None object—nothing is printed to your shell!
What’s the Difference Between re.fullmatch() and re.findall()?
There are two differences between the re.fullmatch(pattern, string) and re.findall(pattern, string) methods:
re.fullmatch(pattern, string) returns a match object while re.findall(pattern, string) returns a list of matching strings.
re.fullmatch(pattern, string) can only match the whole string, while re.findall(pattern, string) can return multiple matches in the string.
Both can be seen in the following example:
>>> text = 'the 42th truth is 42'
>>> re.fullmatch('.*?42', text)
<re.Match object; span=(0, 20), match='the 42th truth is 42'>
>>> re.findall('.*?42', text)
['the 42', 'th truth is 42']
Note that the regex .*? matches an arbitrary number of characters but it attempts to consume as few characters as possible. This is called “non-greedy” match (the *? operator). The fullmatch() method only returns a match object that matches the whole string. The findall() method returns a list of all occurrences. As the match is non-greedy, it finds two such matches.
What’s the Difference Between re.fullmatch() and re.search()?
The methods re.fullmatch() and re.search(pattern, string) both return a match object. However, re.fullmatch() attempts to match the whole string while re.search() matches anywhere in the string.
You can see this difference in the following code:
>>> text = 'Finxter is fun!'
>>> re.search('Finxter', text)
<re.Match object; span=(0, 7), match='Finxter'>
>>> re.fullmatch('Finxter', text)
>>>
The re.search() method retrieves the match of the ‘Finxter’ substring as a match object. But the re.fullmatch() method has no return value because the substring ‘Finxter’ does not match the whole string ‘Finxter is fun!’.
Summary
Now you know the re.fullmatch(pattern, string) method that attempts to match the whole string—and returns a match object if it succeeds or None if it doesn’t.
compile()
This article is all about the re.compile(pattern) method of Python’s re library. Before we dive into re.compile(), let’s get an overview of the four related methods you must understand:
The findall(pattern, string) method returns a list of string matches.
The search(pattern, string) method returns a match object of the first match.
The match(pattern, string) method returns a match object if the regex matches at the beginning of the string.
The fullmatch(pattern, string) method returns a match object if the regex matches the whole string.
Equipped with this quick overview of the most critical regex methods, let’s answer the following question:
How Does re.compile() Work in Python?
The re.compile(pattern) method returns a regular expression object (see next section).
You then use the object to call important regex methods such as search(string), match(string), fullmatch(string), and findall(string).
In short: You compile the pattern first. You search the pattern in a string second.
This two-step approach is more efficient than calling, say, search(pattern, string) at once. That is, IF you call the search() method multiple times on the same pattern. Why? Because you can reuse the compiled pattern multiple times.
Here’s an example:
import re

# These two lines ...
regex = re.compile('Py...n')
match = regex.search('Python is great')

# ... are equivalent to ...
match = re.search('Py...n', 'Python is great')
In both instances, the match variable contains the following match object:
<re.Match object; span=(0, 6), match='Python'>
But in the first case, we can find the pattern not only in the string ‘Python is great’ but also in other strings—without any redundant work of compiling the pattern again and again.
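To make this concrete, here is a small sketch (the input strings are made up) that compiles the pattern once and reuses it on several strings:

```python
import re

regex = re.compile('Py...n')  # compile the pattern once ...

# ... and search it in many strings without recompiling:
for s in ['Python is great', 'I love Python', 'Java only']:
    match = regex.search(s)
    print(match.group(0) if match else 'no match')
```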
Specification:
re.compile(pattern, flags=0)
The method has up to two arguments.
pattern: the regular expression pattern that you want to match.
flags (optional argument): a more advanced modifier that allows you to customize the behavior of the method.
We’ll explore those arguments in more detail later.
Return Value:
The re.compile(pattern, flags) method returns a regular expression object. You may ask (and rightly so):
What’s a Regular Expression Object?
Python internally creates a regular expression object (from the Pattern class) to prepare the pattern matching process. You can call the following methods on the regex object:
Pattern.search(string[, pos[, endpos]]): Searches the regex anywhere in the string and returns a match object or None. You can define start and end positions of the search.
Pattern.match(string[, pos[, endpos]]): Searches the regex at the beginning of the string and returns a match object or None. You can define start and end positions of the search.
Pattern.fullmatch(string[, pos[, endpos]]): Matches the regex with the whole string and returns a match object or None. You can define start and end positions of the search.
Pattern.split(string, maxsplit=0): Divides the string into a list of substrings. The regex is the delimiter. You can define a maximum number of splits.
Pattern.findall(string[, pos[, endpos]]): Searches the regex anywhere in the string and returns a list of matching substrings. You can define start and end positions of the search.
Pattern.finditer(string[, pos[, endpos]]): Returns an iterator that goes over all matches of the regex in the string (returns one match object after another). You can define the start and end positions of the search.
Pattern.sub(repl, string, count=0): Returns a new string by replacing the first count occurrences of the regex in the string (from left to right) with the replacement string repl.
Pattern.subn(repl, string, count=0): Returns a new string by replacing the first count occurrences of the regex in the string (from left to right) with the replacement string repl. However, it returns a tuple with the replaced string as the first and the number of successful replacements as the second tuple value.
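Here is a quick sketch (with made-up input strings) showing a few of these Pattern methods in action:

```python
import re

p = re.compile('a.c')  # 'a', any character, 'c'

print(p.findall('abc axc a-c'))        # ['abc', 'axc', 'a-c']
print(p.sub('X', 'abc axc', count=1))  # replace only the first match
print(p.subn('X', 'abc axc'))          # ('X X', 2)
print(p.split('1abc2axc3'))            # ['1', '2', '3']
```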
If you’re familiar with the most basic regex methods, you’ll realize that all of them appear in this table. But there’s one distinction: you don’t have to define the pattern as an argument. For example, the regex method re.search(pattern, string) will internally compile a regex object p and then call p.search(string).
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
The re.search(pattern, string) method is a mere wrapper for compiling the pattern first and calling the p.search(string) function on the compiled regex object p.
Is It Worth Using Python’s re.compile()?
No, in the vast majority of cases, it’s not worth the extra line.
Consider the following example:
import re

# These two lines ...
regex = re.compile('Py...n')
match = regex.search('Python is great')

# ... are equivalent to ...
match = re.search('Py...n', 'Python is great')
Don’t get me wrong. Compiling a pattern once and using it many times throughout your code (e.g., in a loop) comes with a big performance benefit. In some anecdotal cases, compiling the pattern first led to a 10x to 50x speedup compared to compiling it again and again.
But the reason it is not worth the extra line is that Python’s re library ships with an internal cache. At the time of this writing, the cache holds up to 512 compiled regex objects. So as long as your program uses no more than 512 distinct patterns, you can be sure when calling re.search(pattern, string) that the cache already contains the compiled pattern.
# --------------------------------------------------------------------
# internals

_cache = {}  # ordered!

_MAXCACHE = 512

def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    if isinstance(pattern, Pattern):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest item
            try:
                del _cache[next(iter(_cache))]
            except (StopIteration, RuntimeError, KeyError):
                pass
        _cache[type(pattern), pattern, flags] = p
    return p
Can you find the spots where the cache is initialized and used?
While in most cases, you don’t need to compile a pattern, in some cases, you should. These follow directly from the previous implementation:
You’ve got more than _MAXCACHE patterns in your code.
You’ve got more than _MAXCACHE different patterns between two uses of the same pattern. Only in this case will you see “cache misses”: the cache has already flushed the seemingly stale pattern instances to make room for newer ones.
You reuse the pattern multiple times. If you don’t, it doesn’t make sense to spend scarce memory on storing the compiled objects.
(Even then, it may only be useful if the patterns are relatively complicated. Otherwise, you won’t see a lot of performance benefits in practice.)
To summarize, compiling the pattern first and storing the compiled pattern in a variable for later use is often nothing but “premature optimization”—one of the deadly sins of beginner and intermediate programmers.
What Does re.compile() Really Do?
It doesn’t seem like a lot, does it? My intuition was that the real work is in finding the pattern in the text, which happens after compilation. And, of course, matching the pattern is the hard part. But a sensible compilation helps a lot in preparing the pattern to be matched efficiently by the regex engine, work that would otherwise have to be done by the regex engine itself.
Regex’s compile() method does a lot of things such as:
Combine two subsequent characters in the regex if they together indicate a special symbol (for example, an escape sequence such as ‘\n’).
Prepare the regex to ignore uppercase and lowercase.
Check for certain (smaller) patterns in the regex.
Analyze matching groups in the regex enclosed in parentheses.
The implementation of the compile() method is not easy to read (trust me, I tried). It consists of many different steps.
Just note that all this work would have to be done by the regex engine at “matching runtime” if you didn’t compile the pattern first. Since it has to happen only once, it’s certainly a low-hanging fruit for performance optimization, especially for long regular expression patterns.
How to Use the Optional Flag Argument?
As you’ve seen in the specification, the compile() method comes with an optional second argument flags:
re.compile(pattern, flags=0)
Here’s how you’d use it in a practical example:
import re

text = 'Python is great (python really is)'

regex = re.compile('Py...n', flags=re.IGNORECASE)
matches = regex.findall(text)

print(matches)
# ['Python', 'python']
Although your regex ‘Py...n’ starts with an uppercase letter, we ignore the capitalization by using the flag re.IGNORECASE.
Summary
You’ve learned about the re.compile(pattern) method that prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code.
split()
Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!
This article is all about the re.split(pattern, string) method of Python’s re library.
Let’s answer the following question:
How Does re.split() Work in Python?
The re.split(pattern, string, maxsplit=0, flags=0) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those.
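The example the next paragraph describes appears to be missing; here is a minimal reconstruction (the exact example string is an assumption):

```python
import re

# Four words, separated by a mix of spaces and tab characters:
text = 'Python\tis really\t great'

print(re.split(r'\s+', text))  # ['Python', 'is', 'really', 'great']
```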
The string contains four words that are separated by whitespace characters (in particular, the empty space ‘ ’ and the tabular character ‘\t’). You use the regular expression ‘\s+’ to match all occurrences of one or more consecutive whitespace characters. The matched substrings serve as delimiters. The result is the string divided along those delimiters.
But that’s not all! Let’s have a look at the formal definition of the split method.
Specification
re.split(pattern, string, maxsplit=0, flags=0)
The method has four arguments—two of which are optional.
pattern: the regular expression pattern you want to use as a delimiter.
string: the text you want to break up into a list of strings.
maxsplit (optional argument): the maximum number of split operations (so the returned list has at most maxsplit + 1 elements). Per default, the maxsplit argument is 0, which means that the number of splits is unlimited.
flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function. Per default the regex module does not consider any flags. Want to know how to use those flags? Check out this detailed article on the Finxter blog.
The first and second arguments are required. The third and fourth arguments are optional.
You’ll learn about those arguments in more detail later.
Return Value:
The regex split method returns a list of substrings obtained by using the regex as a delimiter.
Regex Split Minimal Example
Let’s study some more examples—from simple to more complex.
The easiest use is with only two arguments: the delimiter regex and the string to be split.
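The snippet the next paragraph explains seems to be missing; here is a sketch with a hypothetical input string:

```python
import re

# Any run of 'f' and 'g' characters acts as a delimiter:
s = 'afffbgggcfgd'

print(re.split('[fg]+', s))  # ['a', 'b', 'c', 'd']
```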
You use an arbitrary number of ‘f’ or ‘g’ characters as regular expression delimiters. How do you accomplish this? By combining the character class regex [fg] and the one-or-more regex + into the following regex: [fg]+. The strings in between are added to the return list.
How to Use the maxsplit Argument?
What if you don’t want to split the whole string but only a limited number of times? Here’s an example:
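The original snippet appears to be missing; here is a reconstruction with a hypothetical string containing five ‘-’ delimiters:

```python
import re

s = 'a-b-c-d-e-f'  # hypothetical string with five '-' delimiters

print(re.split('-', s, maxsplit=5))  # ['a', 'b', 'c', 'd', 'e', 'f']
print(re.split('-', s, maxsplit=3))  # ['a', 'b', 'c', 'd-e-f']
```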
We use the simple delimiter regex ‘-’ to divide the string into substrings. In the first method call, we set maxsplit=5 to obtain six list elements. In the second method call, we set maxsplit=3 to obtain four list elements. Can you see the pattern?
You can also pass the arguments positionally to save some characters:
Although your regex is lowercase, we ignore the capitalization by using the flag re.I, which is short for re.IGNORECASE. If we didn’t, the result would be quite different:
As the character class [xy] only contains the lowercase characters ‘x’ and ‘y’, their uppercase variants appear in the returned list rather than being used as delimiters.
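Both calls can be sketched as follows (the example string is a made-up stand-in for the original):

```python
import re

# Lowercase and uppercase delimiter candidates mixed into the string:
s = 'oneXtwoxthreeYfour'

print(re.split('[xy]', s, flags=re.I))  # ['one', 'two', 'three', 'four']
print(re.split('[xy]', s))              # ['oneXtwo', 'threeYfour']
```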
What’s the Difference Between re.split() and string.split() Methods in Python?
The method re.split() is much more powerful. The re.split(pattern, string) method can split a string along all occurrences of a matched pattern. The pattern can be arbitrarily complicated. This is in contrast to the string.split(delimiter) method which also splits a string into substrings along the delimiter. However, the delimiter must be a normal string.
An example where the more powerful re.split() method is superior is in splitting a text along any whitespace characters:
import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely Frost
Upon the sweetest flower of all the field.
'''

print(re.split('\s+', text))
'''
['', 'Ha!', 'let', 'me', 'see', 'her:', 'out,', 'alas!', "he's", 'cold:', 'Her', 'blood', 'is', 'settled,', 'and', 'her', 'joints', 'are', 'stiff;', 'Life', 'and', 'these', 'lips', 'have', 'long', 'been', 'separated:', 'Death', 'lies', 'on', 'her', 'like', 'an', 'untimely', 'Frost', 'Upon', 'the', 'sweetest', 'flower', 'of', 'all', 'the', 'field.', '']
'''
The re.split() method divides the string along any positive number of whitespace characters. You couldn’t achieve such a result with string.split(delimiter) because the delimiter must be a constant-sized string.
Summary
You’ve learned about the re.split(pattern, string) method that divides the string along the matched pattern occurrences and returns a list of substrings.
sub()
Do you want to replace all occurrences of a pattern in a string? You’re in the right place! This article is all about the re.sub(pattern, repl, string) method of Python’s re library.
Let’s answer the following question:
How Does re.sub() Work in Python?
The re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl.
Here’s a minimal example:
>>> import re
>>> text = 'C++ is the best language. C++ rocks!'
>>> re.sub('C\+\+', 'Python', text)
'Python is the best language. Python rocks!'
>>>
The text contains two occurrences of the string ‘C++’. You use the re.sub() method to search all of those occurrences. Your goal is to replace all those with the new string ‘Python’ (Python is the best language after all).
Note that you must escape the ‘+’ symbol in ‘C++’ because otherwise it would mean the at-least-one regex (the ‘+’ quantifier applied to the preceding character).
You can also see that the sub() method replaces all matched patterns in the string—not only the first one.
But there’s more! Let’s have a look at the formal definition of the sub() method.
Specification
re.sub(pattern, repl, string, count=0, flags=0)
The method has up to five arguments, two of which are optional.
pattern: the regular expression pattern to search for strings you want to replace.
repl: the replacement string or function. If it’s a function, it needs to take one argument (the match object) which is passed for each occurrence of the pattern. The return value of the replacement function is a string that replaces the matching substring.
string: the text you want to replace.
count (optional argument): the maximum number of replacements you want to perform. Per default, you use count=0 which reads as replace all occurrences of the pattern.
flags (optional argument): a more advanced modifier that allows you to customize the behavior of the method. Per default, you don’t use any flags.
The initial three arguments are required. The remaining two arguments are optional.
You’ll learn about those arguments in more detail later.
Return Value:
A new string where up to count occurrences of the pattern—scanning from left to right—are replaced with the replacement defined in the repl argument.
Regex Sub Minimal Example
Let’s study some more examples—from simple to more complex.
The easiest use is with only three arguments: the pattern ‘sing’, the replacement string ‘program’, and the string you want to modify (text in our example).
>>> import re
>>> text = 'Learn to sing because singing is fun.'
>>> re.sub('sing', 'program', text)
'Learn to program because programing is fun.'
Just ignore the grammar mistake for now. You get the point: we don’t sing, we program.
But what if you want to actually fix this grammar mistake? After all, it’s programming, not programing. In this case, we need to substitute ‘sing’ with ‘program’ in some cases and ‘sing’ with ‘programm’ in other cases.
You see where this leads us: the repl argument must be a function! So let’s try this:
import re

def sub(matched):
    if matched.group(0) == 'singing':
        return 'programming'
    else:
        return 'program'

text = 'Learn to sing because singing is fun.'
print(re.sub('sing(ing)?', sub, text))
# Learn to program because programming is fun.
In this example, you first define a substitution function sub. The function takes the matched object as an input and returns a string. If it matches the longer form ‘singing’, it returns ‘programming’. Else it matches the shorter form ‘sing’, so it returns the shorter replacement string ‘program’ instead.
How to Use the count Argument of the Regex Sub Method?
What if you don’t want to substitute all occurrences of a pattern but only a limited number of them? Just use the count argument! Here’s an example:
>>> import re
>>> s = 'xxxxxxhelloxxxxxworld!xxxx'
>>> re.sub('x+', '', s, count=2)
'helloworld!xxxx'
>>> re.sub('x+', '', s, count=3)
'helloworld!'
In the first substitution operation, you replace only two occurrences of the pattern ‘x+’. In the second, you replace all three.
You can also use positional arguments to save some characters:
>>> re.sub('x+', '', s, 3)
'helloworld!'
But as many coders don’t know about the count argument, you probably should use the keyword argument for readability.
How to Use the Optional Flag Argument?
As you’ve seen in the specification, the re.sub() method comes with an optional fifth argument, flags:
Flags allow you to control the regular expression engine. Because regular expressions are so powerful, they are a useful way of switching on and off certain features (for example, whether to ignore capitalization when matching your regex).
Here’s how you’d use the flags argument in a minimal example:
>>> import re
>>> s = 'xxxiiixxXxxxiiixXXX'
>>> re.sub('x+', '', s)
'iiiXiiiXXX'
>>> re.sub('x+', '', s, flags=re.I)
'iiiiii'
In the second substitution operation, you ignore the capitalization by using the flag re.I which is short for re.IGNORECASE. That’s why it substitutes even the uppercase ‘X’ characters that now match the regex ‘x+’, too.
What’s the Difference Between Regex Sub and String Replace?
The re.sub() method is more powerful than string.replace() because you can replace all occurrences of a regex pattern rather than only all occurrences of a fixed string in another string.
So with re.sub() you can do everything you can do with string.replace()—and some things more!
Here’s an example:
>>> 'Python is python is PYTHON'.replace('python', 'fun')
'Python is fun is PYTHON'
>>> re.sub('(Python)|(python)|(PYTHON)', 'fun', 'Python is python is PYTHON')
'fun is fun is fun'
The string.replace() method only replaces the lowercase word ‘python’ while the re.sub() method replaces all occurrences of uppercase or lowercase variants.
Note, you can accomplish the same thing even easier with the flags argument.
>>> re.sub('python', 'fun', 'Python is python is PYTHON', flags=re.I)
'fun is fun is fun'
How to Remove Regex Pattern in Python?
Nothing simpler than that. Just use the empty string as a replacement string:
>>> re.sub('p', '', 'Python is python is PYTHON', flags=re.I)
'ython is ython is YTHON'
You replace all occurrences of the pattern ‘p’ with the empty string ''. In other words, you remove all occurrences of ‘p’. As you use the flags=re.I argument, you ignore capitalization.
Summary
You’ve learned the re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl.
The Dot Operator .
You’re about to learn one of the most frequently used regex operators: the dot regex . in Python’s re library.
What’s the Dot Regex in Python’s Re Library?
The dot regex . matches all characters except the newline character. For example, the regular expression ‘...’ matches the strings ‘hey’ and ‘tom’. But it does not match the string ‘yo\ntom’ which contains the newline character ‘\n’.
Let’s study some basic examples to help you gain a deeper understanding.
>>> import re
>>> text = '''But then I saw no harm, and then I heard
Each syllable that breath made up between them.'''
>>> re.findall('B..', text)
['But']
>>> re.findall('heard.Each', text)
[]
>>> re.findall('heard\nEach', text)
['heard\nEach']
>>>
You first import Python’s re library for regular expression handling. Then, you create a multi-line text using the triple string quotes.
Let’s dive into the first example:
>>> re.findall('B..', text)
['But']
You use the re.findall(pattern, string) method that finds all occurrences of the pattern in the string and returns a list of all matching substrings.
The first argument is the regular expression pattern ‘B..’. The second argument is the string to be searched for the pattern. You want to find all patterns starting with the ‘B’ character, followed by two arbitrary characters except the newline character.
The findall() method finds only one such occurrence: the string ‘But’.
The second example shows that the dot operator does not match the newline character:
>>> re.findall('heard.Each', text)
[]
In this example, you’re looking at the simple pattern ‘heard.Each’. You want to find all occurrences of the string ‘heard’, followed by an arbitrary character (anything but the newline), followed by the string ‘Each’.
But such a pattern does not exist! Many coders intuitively read the dot regex as an arbitrary character. You must be aware that the correct definition of the dot regex is an arbitrary character except the newline. This is a source of many bugs in regular expressions.
The third example above shows how to explicitly match the newline character ‘\n’ instead, by spelling it out in the pattern.
Naturally, the following relevant question arises:
How to Match an Arbitrary Character (Including Newline)?
The dot regex . matches a single arbitrary character—except the newline character. But what if you do want to match the newline character, too? There are two main ways to accomplish this.
You create a multi-line string. Then you try to find the regex pattern ‘o.p’ in the string. But there’s no match because the dot operator does not match the newline character per default. However, if you define the flag re.DOTALL, the newline character will also be a valid match.
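The original snippet isn’t shown here; a minimal sketch of this idea, with a hypothetical multi-line string and the ‘o.p’ pattern from the description, could look like this:

```python
import re

# Hypothetical multi-line string (assumed for illustration): the 'o'
# and 'p' are separated by a newline character.
text = 'hello\npython'

# Without a flag, the dot does not match the newline:
print(re.findall('o.p', text))                   # []

# With re.DOTALL, the dot matches the newline, too:
print(re.findall('o.p', text, flags=re.DOTALL))  # ['o\np']
```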
An alternative is to use a character class such as [\s\S]. The square brackets enclose a character class—a set of characters that are all a valid match. Think of a character class as an OR operation: exactly one of its characters must match. Because \s matches any whitespace character and \S matches any non-whitespace character, the class [\s\S] matches any character, including the newline. (Note that the dot loses its special meaning inside a character class, so [.\n] would match only a literal dot or a newline.)
What If You Actually Want to Match a Dot?
If you use the character ‘.’ in a regular expression, Python assumes that it’s the dot operator you’re talking about. But what if you actually want to match a dot—for example to match the period at the end of a sentence?
Nothing simpler than that: escape the dot regex by using the backslash: ‘\.’. The backslash nullifies the meaning of the special symbol ‘.’ in the regex. The regex engine now knows that you’re actually looking for the dot character, not an arbitrary character except newline.
Here’s an example:
>>> import re
>>> text = 'Python. Is. Great. Period.'
>>> re.findall('\.', text)
['.', '.', '.', '.']
The findall() method returns all four periods in the sentence as matching substrings for the regex ‘\.’.
The next example shows how you can combine it with other regular expressions:
>>> re.findall('\.\s', text)
['. ', '. ', '. ']
Now, you’re looking for a period character followed by an arbitrary whitespace. There are only three such matching substrings in the text.
In the next example, you learn how to combine this with a character class:
>>> re.findall('[st]\.', text)
['s.', 't.']
You want to find either character ‘s’ or character ‘t’ followed by the period character ‘.’. Two substrings match this regex.
Note that the backslash is required. If you skip it, it can lead to strange behavior:
>>> re.findall('[st].', text)
['th', 's.', 't.']
As an arbitrary character is allowed after the character class, the substring ‘th’ also matches the regex.
Summary
You’ve learned everything you need to know about the dot regex . in this tutorial.
Summary: The dot regex . matches all characters except the newline character. For example, the regular expression ‘...’ matches the strings ‘hey’ and ‘tom’. But it does not match the string ‘yo\ntom’ which contains the newline character ‘\n’.
The Asterisk Operator *
Every computer scientist knows the asterisk quantifier of regular expressions. But many non-techies know it, too. Each time you search for a text file *.txt on your computer, you use the asterisk operator.
This section is all about the asterisk * quantifier.
What’s the Python Re * Quantifier?
When applied to regular expression A, Python’s A* quantifier matches zero or more occurrences of A. The * quantifier is called asterisk operator and it always applies only to the preceding regular expression. For example, the regular expression ‘yes*’ matches strings ‘ye’, ‘yes’, and ‘yesssssss’. But it does not match the empty string because the asterisk quantifier * does not apply to the whole regex ‘yes’ but only to the preceding regex ‘s’.
Let’s study two basic examples to help you gain a deeper understanding. Do you get all of them?
>>> import re
>>> text = 'finxter for fast and fun python learning'
>>> re.findall('f.* ', text)
['finxter for fast and fun python ']
>>> re.findall('f.*? ', text)
['finxter ', 'for ', 'fast ', 'fun ']
>>> re.findall('f[a-z]*', text)
['finxter', 'for', 'fast', 'fun']
>>>
Don’t worry if you had problems understanding those examples. You’ll learn about them next. Here’s the first example:
Greedy Asterisk Example
>>> re.findall('f.* ', text)
['finxter for fast and fun python ']
The first argument of the re.findall() method is the regular expression pattern ‘f.* ‘. The second argument is the string to be searched for the pattern. In plain English, you want to find all patterns in the string that start with the character ‘f’, followed by an arbitrary number of optional characters, followed by an empty space.
The findall() method returns only one matching substring: ‘finxter for fast and fun python ‘. The asterisk quantifier * is greedy. This means that it tries to match as many occurrences of the preceding regex as possible. So in our case, it wants to match as many arbitrary characters as possible so that the pattern is still matched. Therefore, the regex engine “consumes” the whole sentence.
Non-Greedy Asterisk Example
But what if you want to find all words starting with an ‘f’? In other words: how to match the text with a non-greedy asterisk operator?
In this example, you’re looking at a similar pattern with only one difference: you use the non-greedy asterisk operator *?. You want to find all occurrences of character ‘f’ followed by an arbitrary number of characters (but as few as possible), followed by an empty space.
Therefore, the regex engine finds four matches: the strings ‘finxter ‘, ‘for ‘, ‘fast ‘, and ‘fun ‘.
This regex achieves almost the same thing: finding all words starting with f. But you use the asterisk quantifier in combination with a character class that defines specifically which characters are valid matches.
Within the character class, you can define character ranges. For example, the character range [a-z] matches one lowercase character in the alphabet while the character range [A-Z] matches one uppercase character in the alphabet.
But note that the empty space is not part of the character class, so it won’t be matched if it appears in the text. Thus, the result is the same list of words starting with the character f, this time without trailing spaces: ‘finxter’, ‘for’, ‘fast’, and ‘fun’.
What If You Want to Match the Asterisk Character Itself?
You know that the asterisk quantifier matches an arbitrary number of the preceding regular expression. But what if you search for the asterisk (or star) character itself? How can you search for it in a string?
The answer is simple: escape the asterisk character in your regular expression using the backslash. In particular, use ‘\*’ instead of ‘*’. Here’s an example:
You find all occurrences of the star symbol in the text by using the regex ‘\*’. Consequently, if you use the regex ‘\**’, you search for an arbitrary number of occurrences of the asterisk symbol (including zero occurrences). And if you would like to search for all maximal number of occurrences of subsequent asterisk symbols in a text, you’d use the regex ‘\*+’.
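The original snippet isn’t shown here; a quick sketch with a hypothetical test string illustrates all three regexes from the paragraph above:

```python
import re

# Hypothetical test string (assumed for illustration):
text = 'rate it 5* or even 5**'

# The escaped asterisk matches each single star symbol:
print(re.findall(r'\*', text))   # ['*', '*', '*']

# '\*+' matches maximal runs of one or more star symbols:
print(re.findall(r'\*+', text))  # ['*', '**']
```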
What’s the Difference Between Python Re * and ? Quantifiers?
You can read the Python Re A? quantifier as zero-or-one regex: the preceding regex A is matched either zero times or exactly once. But it’s not matched more often.
Analogously, you can read the Python Re A* operator as the zero-or-more regex (I know it sounds a bit clunky): the preceding regex A is matched an arbitrary number of times.
The regex ‘ab?’ matches the character ‘a’ in the string, optionally followed by a single character ‘b’ if it exists.
The regex ‘ab*’ matches the character ‘a’ in the string, followed by as many characters ‘b’ as possible.
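The original code snippet isn’t shown here; assuming a test string like ‘abbbb’, the difference looks like this:

```python
import re

# Hypothetical test string (assumed for illustration):
s = 'abbbb'

print(re.findall('ab?', s))  # ['ab']     -- at most one 'b' is matched
print(re.findall('ab*', s))  # ['abbbb']  -- as many 'b's as possible
```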
What’s the Difference Between Python Re * and + Quantifiers?
You can read the Python Re A* quantifier as zero-or-more regex: the preceding regex A is matched an arbitrary number of times.
Analogously, you can read the Python Re A+ operator as the at-least-once regex: the preceding regex A is matched an arbitrary number of times too—but at least once.
The regex ‘ab*’ matches the character ‘a’ in the string, followed by an arbitrary number of occurrences of the character ‘b’. The substring ‘a’ alone perfectly matches this formulation. Therefore, in a string of eight ‘a’ characters, the regex matches eight times.
The regex ‘ab+’ matches the character ‘a’, followed by as many characters ‘b’ as possible—but at least one. If no character ‘b’ exists in the string, there’s no match.
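The original snippet isn’t shown here; assuming a test string of eight ‘a’ characters, the difference looks like this:

```python
import re

# Hypothetical test string of eight 'a' characters (assumed for
# illustration):
s = 'aaaaaaaa'

# Zero 'b's is fine for 'ab*', so each 'a' is a match:
print(re.findall('ab*', s))  # ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

# 'ab+' requires at least one 'b', so nothing matches:
print(re.findall('ab+', s))  # []
```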
Summary: When applied to regular expression A, Python’s A* quantifier matches zero or more occurrences of A. The * quantifier is called asterisk operator and it always applies only to the preceding regular expression. For example, the regular expression ‘yes*’ matches strings ‘ye’, ‘yes’, and ‘yesssssss’. But it does not match the empty string because the asterisk quantifier * does not apply to the whole regex ‘yes’ but only to the preceding regex ‘s’.
The Zero-Or-One Operator: Question Mark (?)
Congratulations, you’re about to learn one of the most frequently used regex operators: the question mark quantifier A?.
What’s the Python Re ? Quantifier?
When applied to regular expression A, Python’s A? quantifier matches either zero or one occurrences of A. The ? quantifier always applies only to the preceding regular expression. For example, the regular expression ‘hey?’ matches both strings ‘he’ and ‘hey’. But it does not match the empty string because the ? quantifier does not apply to the whole regex ‘hey’ but only to the preceding regex ‘y’.
Let’s study two basic examples to help you gain a deeper understanding. Do you get all of them?
Don’t worry if you had problems understanding those examples. You’ll learn about them next. Here’s the first example:
>>> re.findall('aa[cde]?', 'aacde aa aadcde')
['aac', 'aa', 'aad']
You use the re.findall() method. Again, the re.findall(pattern, string) method finds all occurrences of the pattern in the string and returns a list of all matching substrings.
The first argument is the regular expression pattern ‘aa[cde]?’. The second argument is the string to be searched for the pattern. In plain English, you want to find all patterns that start with two ‘a’ characters, followed by one optional character—which can be either ‘c’, ‘d’, or ‘e’.
The findall() method returns three matching substrings:
First, string ‘aac’ matches the pattern. After Python consumes the matched substring, the remaining substring is ‘de aa aadcde’.
Second, string ‘aa’ matches the pattern. Python consumes it which leads to the remaining substring ‘ aadcde’.
Third, string ‘aad’ matches the pattern in the remaining substring. What remains is ‘cde’ which doesn’t contain a matching substring anymore.
In this example, you’re looking at the simple pattern ‘aa?’. You want to find all occurrences of character ‘a’ followed by an optional second ‘a’. But be aware that the optional second ‘a’ is not needed for the pattern to match.
Therefore, the regex engine finds three matches.
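The original snippet isn’t shown here; assuming the same test string as in the previous example, a quick shell check looks like this:

```python
import re

# Assuming the same test string as in the previous example:
# 'aa' matches three times (the optional second 'a' is present each time).
print(re.findall('aa?', 'aacde aa aadcde'))  # ['aa', 'aa', 'aa']
```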
This regex pattern looks complicated: ‘[cd]?[cde]?’. But is it really?
Let’s break it down step-by-step:
The first part of the regex [cd]? defines a character class [cd] which reads as “match either c or d”. The question mark quantifier indicates that you want to match either one or zero occurrences of this pattern.
The second part of the regex [cde]? defines a character class [cde] which reads as “match either c, d, or e”. Again, the question mark indicates the zero-or-one matching requirement.
As both parts are optional, even the empty string matches the regex pattern. However, the Python regex engine greedily matches as much as possible.
Thus, the regex engine performs the following steps:
The first match in the string ‘ccc dd ee’ is ‘cc’. The regex engine consumes the matched substring, so the string ‘c dd ee’ remains.
The second match in the remaining string is the character ‘c’. The empty space ‘ ‘ does not match the regex so the second part of the regex [cde] does not match. Because of the question mark quantifier, this is okay for the regex engine. The remaining string is ‘ dd ee’.
The third match is the empty string ”. Of course, Python does not attempt to match the same position twice. Thus, it moves on to process the remaining string ‘dd ee’.
The fourth match is the string ‘dd’. The remaining string is ‘ ee’.
The fifth match is the string ”. The remaining string is ‘ee’.
The sixth match is the string ‘e’. The remaining string is ‘e’.
The seventh match is the string ‘e’. The remaining string is ”.
The eighth match is the string ”. Nothing remains.
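The walkthrough above can be reproduced in code (assuming the test string ‘ccc dd ee’ from the description; note that this zero-width matching behavior holds for Python 3.7 and later):

```python
import re

# Eight matches, exactly as in the step-by-step walkthrough:
print(re.findall('[cd]?[cde]?', 'ccc dd ee'))
# ['cc', 'c', '', 'dd', '', 'e', 'e', '']
```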
This was the most complicated of our examples. Congratulations if you understood it completely!
What’s the Difference Between Python Re ? and * Quantifiers?
You can read the Python Re A? quantifier as zero-or-one regex: the preceding regex A is matched either zero times or exactly once. But it’s not matched more often.
Analogously, you can read the Python Re A* operator as the zero-or-multiple-times regex (I know it sounds a bit clunky): the preceding regex A is matched an arbitrary number of times.
The regex ‘ab?’ matches the character ‘a’ in the string, optionally followed by a single character ‘b’ if it exists.
The regex ‘ab*’ matches the character ‘a’ in the string, followed by as many characters ‘b’ as possible.
What’s the Difference Between Python Re ? and + Quantifiers?
You can read the Python Re A? quantifier as zero-or-one regex: the preceding regex A is matched either zero times or exactly once. But it’s not matched more often.
Analogously, you can read the Python Re A+ operator as the at-least-once regex: the preceding regex A is matched an arbitrary number of times but at least once.
The regex ‘ab?’ matches the character ‘a’ in the string, optionally followed by the character ‘b’, even if no ‘b’ exists.
The regex ‘ab+’ matches the character ‘a’ in the string, followed by as many characters ‘b’ as possible—but at least one. However, the character ‘b’ does not exist so there’s no match.
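The original snippet isn’t shown here; assuming a test string without any ‘b’ characters, the difference looks like this:

```python
import re

# Hypothetical test string without any 'b' (assumed for illustration):
s = 'aaaa'

# The 'b' is optional, so each 'a' matches on its own:
print(re.findall('ab?', s))  # ['a', 'a', 'a', 'a']

# 'ab+' requires at least one 'b', so nothing matches:
print(re.findall('ab+', s))  # []
```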
What are Python Re *?, +?, ?? Quantifiers?
You’ve learned about the three quantifiers:
The quantifier A* matches an arbitrary number of patterns A.
The quantifier A+ matches at least one pattern A.
The quantifier A? matches zero-or-one pattern A.
Those three are all greedy: they match as many occurrences of the pattern as possible. Here’s an example that shows their greediness:
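The original snippet isn’t shown here; a minimal sketch with a hypothetical test string demonstrates the greediness:

```python
import re

# Hypothetical test string (assumed for illustration):
s = 'aaaa'

# All three greedy quantifiers match as many 'a's as they can:
print(re.search('a*', s).group())  # 'aaaa'
print(re.search('a+', s).group())  # 'aaaa'
print(re.search('a?', s).group())  # 'a'
```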
The code shows that all three quantifiers *, +, and ? match as many ‘a’ characters as possible.
So, the logical question is: how to match as few as possible? We call this non-greedy matching. You can append the question mark after the respective quantifiers to tell the regex engine that you intend to match as few patterns as possible: *?, +?, and ??.
Here’s the same example but with the non-greedy quantifiers:
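The original snippet isn’t shown here; with the same hypothetical test string, the non-greedy variants stop as early as possible:

```python
import re

# Same hypothetical test string as before:
s = 'aaaa'

# The non-greedy quantifiers match as few 'a's as possible:
print(re.search('a*?', s).group())  # ''  (zero is enough)
print(re.search('a+?', s).group())  # 'a' (one is the minimum)
print(re.search('a??', s).group())  # ''  (zero is enough)
```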
In this case, the code shows that all three quantifiers *?, +?, and ?? match as few ‘a’ characters as possible.
Summary
You’ve learned everything you need to know about the question mark quantifier ? in this regex tutorial.
Summary: When applied to regular expression A, Python’s A? quantifier matches either zero or one occurrences of A. The ? quantifier always applies only to the preceding regular expression. For example, the regular expression ‘hey?’ matches both strings ‘he’ and ‘hey’. But it does not match the empty string because the ? quantifier does not apply to the whole regex ‘hey’ but only to the preceding regex ‘y’.
The At-Least-Once Operator +
Say you have any regular expression A. The regular expression (regex) A+ then matches one or more occurrences of A. We call the “+” symbol the at-least-once quantifier because it requires at least one occurrence of the preceding regex. For example, the regular expression ‘yes+’ matches the strings ‘yes’, ‘yess’, and ‘yesssssss’. But it matches neither the string ‘ye’ nor the empty string '' because the plus quantifier + does not apply to the whole regex ‘yes’ but only to the preceding regex ‘s’.
Let’s study some examples to help you gain a deeper understanding.
The first argument of the findall() method is the regular expression pattern ‘a+b’ and the second argument is the string to be searched. In plain English, you want to find all patterns in the string that start with at least one, but possibly many, characters ‘a’, followed by the character ‘b’.
The findall() method returns the matching substring: ‘aaaaaab’. The plus quantifier + is greedy. This means that it tries to match as many occurrences of the preceding regex as possible. So in our case, it matches as many ‘a’ characters as possible before the final ‘b’.
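The original snippet isn’t shown here; assuming the test string from the next example, the first example would look like this:

```python
import re

# Assuming the test string 'aaaaaabb':
# 'a+b' greedily matches all six 'a's, then the first 'b'.
print(re.findall('a+b', 'aaaaaabb'))  # ['aaaaaab']
```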
The second example is similar:
>>> re.findall('ab+', 'aaaaaabb')
['abb']
You search for the character ‘a’ followed by at least one character ‘b’. As the plus (+) quantifier is greedy, it matches as many ‘b’s as it can lay its hands on.
Examples 3 and 4: Non-Greedy Plus (+) Quantifiers
But what if you want to match at least one occurrence of a regex in a non-greedy manner? In other words, you don’t want the regex engine to consume as much as it can; you want it to return from the matching as quickly as possible.
Again, here’s the example of the greedy match:
>>> re.findall('ab+', 'aaaaaabbbbb')
['abbbbb']
The regex engine starts with the first character ‘a’ and finds that it’s a partial match. So, it moves on to match the second ‘a’—which violates the pattern ‘ab+’ that allows only for a single character ‘a’. So it moves on to the third character, and so on, until it reaches the last character ‘a’ in the string ‘aaaaaabbbbb’. It’s a partial match, so it moves on to the first occurrence of the character ‘b’. It realizes that the ‘b’ character can be matched by the part of the regex ‘b+’. Thus, the engine starts matching ‘b’s. And it greedily matches ‘b’s until it cannot match any further character. At this point it looks at the result and sees that it has found a matching substring which is the result of the operation.
However, it could have stopped far earlier to produce a non-greedy match after matching the first character ‘b’. Here’s an example of the non-greedy quantifier ‘+?’ (both symbols together form one regex expression).
>>> re.findall('ab+?', 'aaaaaabbbbb')
['ab']
Now, the regex engine does not greedily “consume” as many ‘b’ characters as possible. Instead, it stops as soon as the pattern is matched (non-greedy).
Examples 5 and 6
For the sake of your thorough understanding, let’s have a look at the other given example:
>>> re.findall('ab+', 'aaaaaa')
[]
You can see that the plus (+) quantifier requires at least one occurrence of the preceding regex to match. In the example, the character ‘b’ never appears. So, the result is the empty list indicating that no matching substring was found.
You use the plus (+) quantifier in combination with a character class that defines specifically which characters are valid matches.
Note Character Class: Within the character class, you can define character ranges. For example, the character range [a-z] matches one lowercase character in the alphabet while the character range [A-Z] matches one uppercase character in the alphabet.
The empty space is not part of the given character class [a-z], so it won’t be matched in the text. Thus, the result is the list of words consisting of at least one lowercase character: ‘hello’ and ‘world’.
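The original snippet isn’t shown here; a minimal sketch with a hypothetical test string matching the described result:

```python
import re

# Hypothetical test string (assumed for illustration): the plus
# quantifier applied to the character class [a-z] matches maximal
# runs of lowercase letters.
print(re.findall('[a-z]+', 'hello world'))  # ['hello', 'world']
```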
What If You Want to Match the Plus (+) Symbol Itself?
You know that the plus quantifier matches at least one of the preceding regular expression. But what if you search for the plus (+) symbol itself? How can you search for it in a string?
The answer is simple: escape the plus symbol in your regular expression using the backslash. In particular, use ‘\+’ instead of ‘+’. Here’s an example:
If you want to find the ‘+’ symbol in your string, you need to escape it by using the backslash. If you don’t do this, the Python regex engine will interpret it as a normal at-least-once quantifier. Of course, you can combine the escaped plus symbol ‘\+’ with the at-least-once quantifier: the regex ‘\++’ searches for one or more subsequent occurrences of the plus symbol.
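The original snippet isn’t shown here; a quick sketch with a hypothetical test string shows both regexes from the paragraph above:

```python
import re

# Hypothetical test string (assumed for illustration):
text = '2+2 = 4 and 2++2 is a syntax error'

# The escaped plus matches each single plus symbol:
print(re.findall(r'\+', text))   # ['+', '+', '+']

# '\++' matches maximal runs of one or more plus symbols:
print(re.findall(r'\++', text))  # ['+', '++']
```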
What’s the Difference Between Python Re + and ? Quantifiers?
You can read the Python Re A? quantifier as zero-or-one regex: the preceding regex A is matched either zero times or exactly once. But it’s not matched more often.
Analogously, you can read the Python Re A+ operator as the at-least-once regex: the preceding regex A is matched an arbitrary number of times but at least once (as the name suggests).
The regex ‘ab?’ matches the character ‘a’ in the string, optionally followed by a single character ‘b’ if it exists.
The regex ‘ab+’ matches the character ‘a’ in the string, followed by as many characters ‘b’ as possible (and at least one).
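The original snippet isn’t shown here; assuming a test string like ‘abbbb’, the difference looks like this:

```python
import re

# Hypothetical test string (assumed for illustration):
s = 'abbbb'

print(re.findall('ab?', s))  # ['ab']     -- at most one 'b'
print(re.findall('ab+', s))  # ['abbbb']  -- at least one, greedily all
```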
What’s the Difference Between Python Re * and + Quantifiers?
You can read the Python Re A* quantifier as zero-or-more regex: the preceding regex A is matched an arbitrary number of times.
Analogously, you can read the Python Re A+ operator as the at-least-once regex: the preceding regex A is matched an arbitrary number of times too—but at least once.
The regex ‘ab*’ matches the character ‘a’ in the string, followed by an arbitrary number of occurrences of the character ‘b’. The substring ‘a’ alone perfectly matches this formulation. Therefore, in a string of eight ‘a’ characters, the regex matches eight times.
The regex ‘ab+’ matches the character ‘a’, followed by as many characters ‘b’ as possible—but at least one. If no character ‘b’ exists in the string, there’s no match.
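The original snippet isn’t shown here; assuming a test string of eight ‘a’ characters, the difference looks like this:

```python
import re

# Hypothetical test string of eight 'a' characters (assumed for
# illustration):
s = 'aaaaaaaa'

# Zero 'b's is fine for 'ab*', so each 'a' is a match:
print(re.findall('ab*', s))  # ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

# 'ab+' requires at least one 'b', so nothing matches:
print(re.findall('ab+', s))  # []
```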
Summary
Regex A+ matches one or more occurrences of regex A. The “+” symbol is the at-least-once quantifier because it requires at least one occurrence of the preceding regex. The non-greedy version of the at-least-once quantifier is A+? with the trailing question mark.
These two regexes are fundamental to all regular expressions—even outside the Python world. So invest 5 minutes now and master them once and for all!
Python Re Start-of-String (^) Regex
You can use the caret operator ^ to match the beginning of the string. For example, this is useful if you want to ensure that a pattern appears at the beginning of a string. Here’s an example:
>>> import re
>>> re.findall('^PYTHON', 'PYTHON is fun.')
['PYTHON']
The findall(pattern, string) method finds all occurrences of the pattern in the string. The caret at the beginning of the pattern ‘^PYTHON’ ensures that you match the word Python only at the beginning of the string. In the previous example, this doesn’t make any difference. But in the next example, it does:
>>> re.findall('^PYTHON', 'PYTHON! PYTHON is fun')
['PYTHON']
Although there are two occurrences of the substring ‘PYTHON’, there’s only one matching substring—at the beginning of the string.
But what if you want to match not only at the beginning of the string but at the beginning of each line in a multi-line string? In other words:
Python Re Start-of-Line (^) Regex
The caret operator, per default, only applies to the start of a string. So if you’ve got a multi-line string—for example, when reading a text file—it will still only match once: at the beginning of the string.
However, you may want to match at the beginning of each line. For example, you may want to find all lines that start with ‘Python’ in a given string.
You can specify that the caret operator matches the beginning of each line via the re.MULTILINE flag. Here’s an example showing both usages—without and with setting the re.MULTILINE flag:
>>> import re
>>> text = '''
Python is great.
Python is the fastest growing
major programming language in
the world.
Pythonistas thrive.'''
>>> re.findall('^Python', text)
[]
>>> re.findall('^Python', text, re.MULTILINE)
['Python', 'Python', 'Python']
>>>
The first output is the empty list because the string ‘Python’ does not appear at the beginning of the string.
The second output is the list of three matching substrings because the string ‘Python’ appears three times at the beginning of a line.
Python re.sub()
The re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl.
You can use the caret operator to substitute wherever some pattern appears at the beginning of the string:
>>> import re
>>> re.sub('^Python', 'Code', 'Python is \nPython')
'Code is \nPython'
Only the beginning of the string matches the regex pattern so you’ve got only one substitution.
Again, you can use the re.MULTILINE flag to match the beginning of each line with the caret operator:
>>> re.sub('^Python', 'Code', 'Python is \nPython', flags=re.MULTILINE)
'Code is \nCode'
Now, you replace both appearances of the string ‘Python’.
Python re.match(), re.search(), re.findall(), and re.fullmatch()
Let’s quickly recap the most important regex methods in Python:
The re.findall(pattern, string, flags=0) method returns a list of string matches.
The re.search(pattern, string, flags=0) method returns a match object of the first match.
The re.match(pattern, string, flags=0) method returns a match object if the regex matches at the beginning of the string.
The re.fullmatch(pattern, string, flags=0) method returns a match object if the regex matches the whole string.
You can see that all four methods search for a pattern in a given string. You can use the caret operator ^ within each pattern to match the beginning of the string. Here’s one example per method:
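The original snippets aren’t shown here; a minimal sketch with a hypothetical test string, one call per method:

```python
import re

# Hypothetical test string (assumed for illustration):
s = 'Python is great'

print(re.findall('^Python', s))              # ['Python']
print(re.search('^Python', s).group())       # 'Python'
print(re.match('^Python', s).group())        # 'Python'
print(re.fullmatch('^Python.*', s).group())  # 'Python is great'
```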
So you can use the caret operator to match at the beginning of the string. However, you should note that it doesn’t make a lot of sense to use it for the match() and fullmatch() methods as they, by definition, start by trying to match the first character of the string.
You can also use the re.MULTILINE flag to match the beginning of each line (rather than only the beginning of the string):
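A short sketch of the flag in action (the two-line sample text is my own):

```python
import re

text = 'Python is great\nPython rocks'

# Without re.MULTILINE, ^ only matches at the very start of the string:
print(re.findall('^Python', text))                      # ['Python']

# With re.MULTILINE, ^ matches at the start of every line:
print(re.findall('^Python', text, flags=re.MULTILINE))  # ['Python', 'Python']
```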
Again, it’s questionable whether this makes sense for the re.match() and re.fullmatch() methods as they only look for a match at the beginning of the string.
Python Re End of String ($) Regex
Similarly, you can use the dollar-sign operator $ to match the end of the string. Here’s an example:
>>> import re
>>> re.findall('fun$', 'PYTHON is fun')
['fun']
The findall() method finds all occurrences of the pattern in the string—although the trailing dollar-sign $ ensures that the regex matches only at the end of the string.
This can significantly alter the meaning of your regex as you can see in the next example:
>>> re.findall('fun$', 'fun fun fun')
['fun']
Although, there are three occurrences of the substring ‘fun’, there’s only one matching substring—at the end of the string.
But what if you want to match not only at the end of the string but at the end of each line in a multi-line string?
Python Re End of Line ($)
The dollar-sign operator, per default, only applies to the end of a string. So if you’ve got a multi-line string—for example, when reading a text file—it will still only match once: at the end of the string.
However, you may want to match at the end of each line. For example, you may want to find all lines that end with ‘.py’.
To achieve this, you can specify that the dollar-sign operator matches the end of each line via the re.MULTILINE flag. Here’s an example showing both usages—without and with setting the re.MULTILINE flag:
>>> import re
>>> text = '''
Coding is fun
Python is fun
Games are fun
Agreed?'''
>>> re.findall('fun$', text)
[]
>>> re.findall('fun$', text, flags=re.MULTILINE)
['fun', 'fun', 'fun']
>>>
The first output is the empty list because the string ‘fun’ does not appear at the end of the string.
The second output is the list of three matching substrings because the string ‘fun’ appears three times at the end of a line.
Python re.sub()
The re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in the Finxter blog tutorial.
You can use the dollar-sign operator to substitute wherever some pattern appears at the end of the string:
>>> import re
>>> re.sub('Python$', 'Code', 'Is Python\nPython')
'Is Python\nCode'
Only the end of the string matches the regex pattern so there’s only one substitution.
Again, you can use the re.MULTILINE flag to match the end of each line with the dollar-sign operator:
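A sketch of the multiline substitution, reusing the string from the previous snippet:

```python
import re

# With re.MULTILINE, $ also matches right before each newline character:
result = re.sub('Python$', 'Code', 'Is Python\nPython', flags=re.MULTILINE)
print(result)  # 'Is Code\nCode'
```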
Now, you replace both appearances of the string ‘Python’.
Python re.match(), re.search(), re.findall(), and re.fullmatch()
All four methods—re.findall(), re.search(), re.match(), and re.fullmatch()—search for a pattern in a given string. You can use the dollar-sign operator $ within each pattern to match the end of the string. Here’s one example per method:
>>> import re
>>> text = 'Python is Python'
>>> re.findall('Python$', text)
['Python']
>>> re.search('Python$', text)
<re.Match object; span=(10, 16), match='Python'>
>>> re.match('Python$', text)
>>> re.fullmatch('Python$', text)
>>>
So you can use the dollar-sign operator to match at the end of the string. However, note that it doesn't make much sense to use it with the fullmatch() method, which, by definition, already requires that the last character of the string is part of the matching substring.
You can also use the re.MULTILINE flag to match the end of each line (rather than only the end of the whole string):
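A minimal sketch (the sample string is my own choice):

```python
import re

text = 'Python is Python\nPython'

# $ now matches at the end of the first line and at the end of the string:
print(re.findall('Python$', text, flags=re.MULTILINE))    # ['Python', 'Python']

# match() still anchors at position 0, where 'Python$' cannot match:
print(re.match('Python$', text, flags=re.MULTILINE))      # None
print(re.fullmatch('Python$', text, flags=re.MULTILINE))  # None
```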
As the pattern doesn't match at the beginning of the string, both re.match() and re.fullmatch() return None.
How to Match the Caret (^) or Dollar ($) Symbols in Your Regex?
You know that the caret and dollar symbols have a special meaning in Python’s regular expression module: they match the beginning or end of each string/line. But what if you search for the caret (^) or dollar ($) symbols themselves? How can you match them in a string?
The answer is simple: escape the caret or dollar symbols in your regular expression using the backslash. In particular, use r'\^' instead of '^' and r'\$' instead of '$'. (The raw-string prefix r avoids a warning about the unrecognized escape sequence in a normal string literal.) Here's an example:
>>> import re
>>> text = 'The product ^^^ costs $3 today.'
>>> re.findall(r'\^', text)
['^', '^', '^']
>>> re.findall(r'\$', text)
['$']
By escaping the special symbols ^ and $, you tell the regex engine to ignore their special meaning.
Summary
You’ve learned everything you need to know about the caret operator ^ and the dollar-sign operator $ in this regex tutorial.
Summary: The caret operator ^ matches at the beginning of a string. The dollar-sign operator $ matches at the end of a string. If you want to match at the beginning or end of each line in a multi-line string, you can set the re.MULTILINE flag in all the relevant re methods.
The first argument is the pattern (iPhone|iPad). It either matches the first part right in front of the or symbol |—which is iPhone—or the second part after it—which is iPad.
The second argument is the text ‘Buy now: iPhone only $399 with free iPad’ which you want to search for the pattern.
The result shows that there are two matching substrings in the text: ‘iPhone’ and ‘iPad’.
Python Regex Or: Examples
Let’s study some more examples to teach you all the possible uses and border cases—one after another.
You start with the previous example:
>>> import re
>>> text = 'Buy now: iPhone only $399 with free iPad'
>>> re.findall('(iPhone|iPad)', text)
['iPhone', 'iPad']
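The second example, dropping the grouping parentheses, presumably looked like this:

```python
import re

text = 'Buy now: iPhone only $399 with free iPad'
print(re.findall('iPhone|iPad', text))  # ['iPhone', 'iPad']
```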
In the second example, you just skipped the parentheses, using the regex pattern iPhone|iPad rather than (iPhone|iPad). But no problem: it still works and generates the exact same output!
But what happens if you leave one side of the or operation empty?
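A sketch of such a pattern with an empty right operand (the exact output depends on your Python version; from Python 3.7 on, non-empty matches may start right after an empty match):

```python
import re

text = 'Buy now: iPhone only $399 with free iPad'
matches = re.findall('iPhone|', text)
# One 'iPhone' match among an empty-string match at every other position
print(matches[:12])
```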
The output is not as strange as it seems. The or operator allows empty operands. The engine still tries to match the non-empty alternative first; only where that's not possible does it match the empty string (so every position in the text yields a match).
The previous example also shows that it still tries to match the non-empty string if possible. But what if the trivial empty match is on the left side of the or operand?
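A sketch with the empty operand on the left (shortened to a minimal input string of my choice):

```python
import re

matches = re.findall('|iPhone', 'iPhone')
# The engine first matches the empty string at position 0; in Python 3.7+
# it can then still match the non-empty 'iPhone' at the same position.
print(matches)
```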
This shows some subtleties of the regex engine. First of all, it still matches the non-empty string if possible! But more importantly, you can see that the regex engine matches from left to right. It first tries to match the left regex (which it does on every single position in the text). An empty string that’s already matched will not be considered anymore. Only then, it tries to match the regex on the right side of the or operator.
Think of it this way: the regex engine moves from the left to the right—one position at a time. It matches the empty string every single time. Then it moves over the empty string and in some cases, it can still match the non-empty string. Each match “consumes” a substring and cannot be matched anymore. But an empty string cannot be consumed. That’s why you see the first match is the empty string and the second match is the substring ‘iPhone’.
How to Nest the Python Regex Or Operator?
Okay, you’re not easily satisfied, are you? Let’s try nesting the Python regex or operator |.
>>> text = 'xxx iii zzz iii ii xxx'
>>> re.findall('xxx|iii|zzz', text)
['xxx', 'iii', 'zzz', 'iii', 'xxx']
So you can use multiple or operators in a row. Of course, you can also use the grouping (parentheses) operator to nest an arbitrary complicated construct of or operations:
But this seldom leads to clean and readable code. And it can usually be avoided easily by putting a bit of thought into your regex design.
Python Regex Or: Character Class
If you only want to match a single character out of a set of characters, the character class is a much better way of doing it:
>>> import re
>>> text = 'hello world'
>>> re.findall('[abcdefghijklmnopqrstuvwxyz]+', text)
['hello', 'world']
A shorter and more concise version would be to use the range operator within character classes:
>>> re.findall('[a-z]+', text)
['hello', 'world']
The character class is enclosed in the bracket notation [ ] and it literally means “match exactly one of the symbols in the class”. Thus, it carries the same semantics as the or operator: |. However, if you try to do something on those lines…
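Such an attempt might look like this (a hypothetical or-chain, shortened to a handful of letters to keep it readable):

```python
import re

text = 'hello world'
print(re.findall('(h|e|l|o|w|r|d)+', text))  # ['o', 'd']
```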
… you'll first write much less concise code and, second, risk getting confused by the output. The reason is that the parentheses form a capture group: they record the substring matched by the group. When the group is repeated (as with the + quantifier), findall() only returns the content of the group's last match. This turns out to be the last character of the word ‘hello’ and the last character of the word ‘world’.
How to Match the Or Character (Vertical Line ‘|’)?
So if the character ‘|’ stands for the or character in a given regex, the question arises how to match the vertical line symbol ‘|’ itself?
The answer is simple: escape the or character in your regular expression using the backslash. In particular, use r'A\|B' instead of 'A|B' to match the string ‘A|B’ itself. Here's an example:
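A minimal sketch of both variants:

```python
import re

text = 'A|B'
print(re.findall('A|B', text))    # ['A', 'B']   -- matches 'A' or 'B'
print(re.findall(r'A\|B', text))  # ['A|B']      -- matches the literal string
```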
Do you really understand the outputs of this code snippet? In the first example, you’re searching for either character ‘A’ or character ‘B’. In the second example, you’re searching for the string ‘A|B’ (which contains the ‘|’ character).
Python Regex Not
How can you search a string for substrings that do NOT match a given pattern? In other words, what’s the “negative pattern” in Python regular expressions?
The answer is two-fold:
If you want to match all characters except a set of specific characters, you can use the negative character class [^…].
If you want to match all substrings except the ones that match a regex pattern, you can use the feature of negative lookahead (?!…).
Here’s an example for the negative character class:
>>> import re
>>> re.findall('[^a-m]', 'aaabbbaababmmmnoopmmaa')
['n', 'o', 'o', 'p']
And here’s an example for the negative lookahead pattern to match all “words that are not followed by words”:
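One way to express this (my own pattern choice) is to match lowercase words and assert that no further lowercase letters follow:

```python
import re

print(re.findall('[a-z]+(?![a-z]+)', 'hello world'))  # ['hello', 'world']
```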
The negative lookahead (?![a-z]+) doesn’t consume (match) any character. It just checks whether the pattern [a-z]+ does NOT match at a given position. The only times this happens is just before the empty space and the end of the string.
Summary
You’ve learned everything you need to know about the Python Regex Or Operator.
Given a string. Say, your goal is to find all substrings that match either the string ‘iPhone’ or the string ‘iPad’. How can you achieve this?
The easiest way to achieve this is the Python or operator | using the regular expression pattern (iPhone|iPad).
Sure, there’s the OR operator (example: ‘iPhone|iPad’). But what’s the meaning of matching one regular expression AND another?
There are different interpretations for the AND operator in a regular expression (regex):
Ordered: Match one regex pattern after another. In other words, you first match pattern A AND then you match pattern B. Here the answer is simple: you use the pattern AB to match both.
Unordered: Match multiple patterns in a string but in no particular order (source). In this case, you’ll use a bag-of-words approach.
I’ll discuss both in the following.
Ordered Python Regex AND Operator
Given a string. Say, your goal is to find all substrings that match string ‘iPhone’, followed by string ‘iPad’. You can view this as the AND operator of two regular expressions. How can you achieve this?
The straightforward AND operation of both strings is the regular expression pattern iPhoneiPad.
In the following example, you want to match pattern ‘aaa’ and pattern ‘bbb’—in this order.
>>> import re
>>> text = 'aaabaaaabbb'
>>> A = 'aaa'
>>> B = 'bbb'
>>> re.findall(A+B, text)
['aaabbb']
>>>
You use the re.findall() method. The first argument is the pattern A+B which evaluates to ‘aaabbb’. There’s nothing fancy about this: each time you write a string consisting of more than one character, you essentially use the ordered AND operator.
The second argument is the text ‘aaabaaaabbb’ which you want to search for the pattern.
The result shows that there’s a matching substring in the text: ‘aaabbb’.
Unordered Python Regex AND Operator
But what if you want to search a given text for pattern A AND pattern B—but in no particular order? In other words: if both patterns appear anywhere in the string, the whole string should be returned as a match.
Now this is a bit more complicated because any regular expression pattern is ordered from left to right. A simple solution is to use the lookahead assertion (?=.*A) to check whether regex A appears anywhere in the string. (Note we assume a single-line string, as the .* pattern doesn't match the newline character by default.)
Let’s first have a look at the minimal solution to check for two patterns anywhere in the string (say, patterns ‘hi’ AND ‘you’).
>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are yo?')
[]
>>> re.findall(pattern, 'hi how are you?')
['']
In the first example, both words do not appear. In the second example, they do.
But how does the lookahead assertion work? You must know that any other regex pattern “consumes” the matched substring. The consumed substring cannot be matched by any other part of the regex.
Think of the lookahead assertion as a non-consuming pattern match. The regex engine goes from the left to the right—searching for the pattern. At each point, it has one “current” position to check if this position is the first position of the remaining match. In other words, the regex engine tries to “consume” the next character as a (partial) match of the pattern.
The advantage of the lookahead expression is that it doesn’t consume anything. It just “looks ahead” starting from the current position whether what follows would theoretically match the lookahead pattern. If it doesn’t, the regex engine cannot move on.
A simple example of lookahead. The regular expression engine matches (“consumes”) the string partially. Then it checks whether the remaining pattern could be matched without actually matching it.
Let’s go back to the expression (?=.*hi)(?=.*you) to match strings that contain both ‘hi’ and ‘you’. Why does it work?
The reason is that the lookahead expressions don’t consume anything. You first search for an arbitrary number of characters .*, followed by the word hi. But because the regex engine hasn’t consumed anything, it’s still at the same position at the beginning of the string. So, you can repeat the same for the word you.
Note that this method doesn’t care about the order of the two words:
>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are you?')
['']
>>> re.findall(pattern, 'you are how? hi!')
['']
No matter which word “hi” or “you” appears first in the text, the regex engine finds both.
You may ask: why’s the output the empty string? The reason is that the regex engine hasn’t consumed any character. It just checked the lookaheads. So the easy fix is to consume all characters as follows:
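A sketch of the fix, appending .* after the lookaheads:

```python
import re

pattern = '(?=.*hi)(?=.*you).*'
print(re.findall(pattern, 'hi how are you?'))  # ['hi how are you?']
```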
Now, the whole string is a match because after checking the lookahead with ‘(?=.*hi)(?=.*you)’, you also consume the whole string ‘.*’.
Summary:
There are different interpretations for the AND operator in a regular expression (regex):
Ordered: Match one regex pattern after another. In other words, you first match pattern A AND then you match pattern B. Here the answer is simple: you use the pattern AB to match both.
Unordered: Match multiple patterns in a string but in no particular order. In this case, you’ll use a bag-of-words approach.
Where to Go From Here
Wow. You’ve spent a lot of time learning everything you need to know about Python regular expressions. Thanks for your time!
At this point, I know you have skills. But do you actually leverage those skills in the most effective way? In other words: do you earn money with Python?
If the answer is no, let me show you a simple way to create your own home-based coding business online:
Your Python app is slow? It’s time for a speed booster! Learn how in this tutorial.
As you read through the article, feel free to watch the explainer video:
Performance Tuning Concepts 101
I could have started this tutorial with a list of tools you can use to speed up your app. But I feel that this would create more harm than good because you’d spend a lot of time setting up the tools and very little time optimizing your performance.
Instead, I’ll take a different approach addressing the critical concepts of performance tuning first.
So, what’s more important than any one tool for performance optimization?
You must understand the universal concepts of performance tuning first.
The good thing is that you’ll be able to apply those concepts in any language and in any application.
The bad thing is that you must change your expectations a bit: I won’t provide you with a magic tool that speeds up your program on the push of a button.
Let’s start with the following list of the most important things to consider when you think you need to optimize your app’s performance:
Premature Optimization Is The Root Of All Evil
Premature optimization is one of the main problems of badly written code. But what is it anyway?
Definition: Premature optimization is the act of spending valuable resources (time, effort, lines of code, simplicity) to optimize code that doesn’t need to get optimized.
There’s no problem with optimized code per se. The problem is just that there’s no such thing as a free lunch. When you optimize a code snippet, what you’re really doing is trading one variable (e.g. complexity) against another (e.g. performance). An example of such an optimization is to add a cache to avoid computing things repeatedly.
The problem is that if you’re doing it blindly, you may not even realize the harm you’re doing. For example, adding 50% more lines of code just to improve execution speed by 0.1% would be a trade-off that will screw up your whole software development process when done repeatedly.
But don’t take my word for it. This is what one of the most famous computer scientists of all times, Donald Knuth, says about premature optimization:
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
A good heuristic is to write the most readable code per default. If this leads to an interactive application that’s already fast enough, good. If users of your application start complaining about speed, then take a structured approach to performance optimization, as described in this tutorial.
Action steps:
Make your code as readable and concise as you can.
Use comments and follow the coding standards (e.g. PEP8 in Python).
Ship your application and do user testing.
Is your application too slow? Really? Okay, then do the following:
Jot down the current performance of your app in seconds if you want to optimize for speed or bytes if you want to optimize for memory.
Do not cross this line until you’ve checked off the previous point.
Measure First, Improve Second
What you measure gets improved. The contrary also holds: what you don’t measure, doesn’t get improved.
This principle is a direct consequence of the first principle: “premature optimization is the root of all evil”. Why? Because if you optimize prematurely, you optimize before you measure. But you should only optimize after you have measured. There’s no point in “improving” runtime if you don’t know the level you’re improving from. Maybe your optimization actually increased runtime? Maybe it had no effect at all? You cannot know unless you start every optimization attempt from a clear benchmark.
The consequence is to start with the most straightforward, naive (“dumb”) code that’s also easy to read. This is your benchmark. Any optimization or improvement idea must improve upon this benchmark. As soon as you’ve proven—by rigorous measurement—that your optimization improves your benchmark by X% in performance (memory footprint or speed), this becomes your new benchmark.
This way, you’re guaranteed to improve the performance of your code over time. And you can document, prove, and defend any optimization to your boss, your peer group, or even the scientific community.
Action steps:
You start with the naive solution, which is usually also the easiest to read.
You take the naive solution as benchmark by measuring its performance rigorously.
You document your measurements in a Google Spreadsheet (okay, you can also use Excel).
You come up with alternative code and measure its performance against the benchmark.
If the new code is better (faster, more memory efficient) than the old benchmark, the new code becomes the new benchmark. All subsequent improvements have to beat the new benchmark (otherwise, you throw them away).
Pareto Is King
I know it’s not big news: the 80/20 Pareto principle—named after Italian economist Vilfredo Pareto—is alive and well in performance optimization.
To exemplify this, have a look at my current CPU usage as I’m writing this:
If you plot this in Python, you see the following Pareto-like distribution:
20% of the code requires 80% of the CPU usage (okay, I haven’t really checked if the numbers match but you get the point).
If I wanted to reduce CPU usage on my computer, I just need to close Cortana and Search and—voilà—a significant portion of the CPU load would be gone:
The interesting observation is that even by removing the two most expensive tasks, the plot looks just the same. Now there are two most expensive tasks: Explorer and System.
This leads us to the basic loop of performance tuning:
Performance optimization is fractal. As soon as you’re done removing the bottleneck, there’s a new bottleneck lurking around. You “just” need to repeatedly remove the bottleneck to get maximal “bang for your buck”.
Action Steps:
Follow this simple algorithm:
Identify the bottleneck (= the function with highest negative impact on your performance).
Fix the bottleneck.
Repeat.
Algorithmic Optimization Wins
At this point, you’ve already figured out that you need to optimize your code. You have direct user feedback that your application is too slow. Or you have a strong signal (e.g. through Google Analytics) that your slow web app causes a higher than usual bounce rate etc.
You also know where you are now (in seconds or bytes) and where you want to go (in seconds or bytes).
You also know the bottleneck. (This is where the performance profiling tools discussed below come into play.)
Now, you need to figure out how to overcome the bottleneck. The best leverage point for you as a coder is to tune the algorithms and data structures.
Say, you’re working on a financial application. You know your bottleneck is the function calculate_ROI() that goes over all combinations of potential buying and selling points to calculate the maximum profit (the naive solution). As this is the bottleneck of the whole application, your first task is to find a better algorithm. Fortunately, you find one for the maximum profit problem: the computational complexity reduces from O(n**2) to O(n log n).
(If this particular topic interests you, start reading this SO article.)
Action steps:
Given your current bottleneck function.
Can you improve its data structures? Often, there’s a low hanging fruit by using sets instead of lists (e.g., checking membership is much faster for sets than lists), or dictionaries instead of collections of tuples.
Can you find better algorithms that are already proven? Can you tweak existing algorithms for your specific problem at hand?
Spend a lot of time researching these questions. It pays off. You’ll become a better computer scientist in the process. And it’s your bottleneck after all—so it’s a huge leverage point for your application.
All Hail to the Cache
Have you checked off all previous boxes? You know exactly where you are and where you want to go. You know what bottleneck to optimize. You know about alternative algorithms and data structures.
Here’s a quick and dirty trick that works surprisingly well for a large variety of applications. To improve your performance often means to remove unnecessary computations. One low-hanging fruit is to store the result of a subset of computations you have already performed in a cache.
How can you create a cache in practice? In Python, it’s as simple as creating a dictionary where you associate each function input (e.g. as an input string) with the function output.
You can then ask the cache to give you the computations you’ve already performed.
A simple example of an effective use of caching (sometimes called memoization) is the Fibonacci algorithm:
def fib2(n):
    if n < 2:
        return n
    return fib2(n-1) + fib2(n-2)
The problem is that the function calls fib2(n-1) and fib2(n-2) calculate largely the same things. For instance, both separately calculate the Fibonacci value fib2(n-3). This adds up!
But with caching, you can simply memorize the results of previous computations so that the result for fib2(n-3) is calculated only once. All other times, you can pull the result from the cache and get an instant result.
Here’s the caching variant of Python Fibonacci:
cache = {}

def fib(n):
    if n in cache:
        return cache[n]
    if n < 2:
        return n
    fib_n = fib(n-1) + fib(n-2)
    cache[n] = fib_n
    return fib_n
You store the result of the computation fib(n-1) + fib(n-2) in the cache. If you already have the result of the n-th Fibonacci number, you simply pull it from the cache rather than recalculating it again and again.
Here’s the surprising speed improvement—just by using a simple cache:
import time

t1 = time.time()
print(fib2(40))
t2 = time.time()
print(fib(40))
t3 = time.time()

print("Fibonacci without cache: " + str(t2-t1))
print("Fibonacci with cache: " + str(t3-t2))

'''
OUTPUT:
102334155
102334155
Fibonacci without cache: 31.577041387557983
Fibonacci with cache: 0.015461206436157227
'''
There are two basic strategies you can use:
Perform computations in advance (“offline”) and store their results in the cache. This is a great strategy for web applications where you can fill up a large cache once (or once a day) and then simply serve the result of your precomputations to the users. For them, your calculations “feel” blazingly fast. But in reality, you just serve them precalculated values. Google Maps heavily uses this trick to speed up shortest-path computations.
Perform computations as they appear (“online”) and store their results in the cache. This reactive form is the most basic and simplest form of caching where you don’t need to decide which computations to perform in advance.
In both cases, the more computations you store, the higher the likelihood of “cache hits” where the computation can be returned immediately. But as you usually have a memory limit (e.g. 100,000 cache entries), you need to decide about a sensible cache replacement policy.
Action steps:
Think: How can you reduce redundant computations? Would caching be a sensible approach?
What type of data / computations do you cache?
What’s the size of your cache?
Which entries to remove if the cache is full?
If you have a web application, can you reuse computations of previous users to compute the result of your current user?
Less is More
Your problem is too hard? Make it easier!
Yes, it’s obvious. But then again, so many coders are too perfectionistic about their code. They accept huge complexity and computational overhead—just for this small additional feature that often doesn’t even get recognized by users.
A powerful “trick” for performance optimization is to seek out easier problems. Instead of spending your effort on optimizing, it’s often much better to get rid of complexity: unnecessary features, computations, and data. Use heuristics rather than optimal algorithms wherever possible. You often pay for perfect results with a 10x slowdown in performance.
So ask yourself this: what is your current bottleneck function really doing? Is it really worth the effort? Can you remove the feature or offer a down-sized version? If the feature is used by 1% of your users but 100% perceive the increased latency, it may be time for some minimalism!
Action step:
Can you remove your current bottleneck altogether by just skipping the feature?
Can you simplify the problem?
Think 80/20: get rid of one expensive feature to add 10 non-expensive ones.
Think opportunity costs: omit one important feature so that you can pursue a very important feature.
Know When to Stop
It’s easy to do but it’s also easy not to do: stop!
Performance optimization can be one of the most time-intensive things to do as a coder. There’s always room for improvement. You can always tweak and improve. But the effort to improve your performance by X grows superlinearly, or even exponentially, with X. At some point, further optimizing is just a waste of your time.
Action step:
Ask yourself constantly: is it really worth the effort to keep optimizing?
Python Profilers
Python comes with different profilers. If you’re new to performance optimization, you may ask: what’s a profiler anyway?
A performance profiler allows you to monitor your application more closely. If you just run a Python script in your shell, you see nothing but the output produced by your program. But you don’t see how much bytes were consumed by your program. You don’t see how long each function runs. You don’t see the data structures that caused most memory overhead.
Without those things, you cannot know the bottleneck of your application. And, as you’ve already learned above, you cannot sensibly start optimizing your code. Why? Because otherwise you’d be guilty of “premature optimization”, one of the deadly sins in programming.
Instrumenting profilers insert special code at the beginning and end of each routine to record when the routine starts and when it exits. With this information, the profiler aims to measure the actual time taken by the routine on each call. This type of profiler may also record which other routines are called from a routine. It can then display the time for the entire routine and also break it down into time spent locally and time spent on each call to another routine.
Fortunately, there are a lot of profilers. In the remaining article, I’ll give you an overview of the most important profilers in Python and how to use them. Each comes with a reference for further reading.
Python cProfile
The most popular Python profiler is called cProfile. You can import it much like any other library by using the statement:
import cProfile
A simple statement but nonetheless a powerful tool in your toolbox.
Let’s write a Python script which you can profile. Say, you come up with this (very) raw Python script to find 100 random prime numbers between 2 and 1000 which you want to optimize:
import random

def guess():
    ''' Returns a random number '''
    return random.randint(2, 1000)

def is_prime(x):
    ''' Checks whether x is prime '''
    for i in range(x):
        for j in range(x):
            if i * j == x:
                return False
    return True

def find_primes(num):
    primes = []
    for i in range(num):
        p = guess()
        while not is_prime(p):
            p = guess()
        primes += [p]
    return primes

print(find_primes(100))

'''
[733, 379, 97, 557, 773, 257, 3, 443, 13, 547, 839, 881, 997,
431, 7, 397, 911, 911, 563, 443, 877, 269, 947, 347, 431, 673,
467, 853, 163, 443, 541, 137, 229, 941, 739, 709, 251, 673, 613,
23, 307, 61, 647, 191, 887, 827, 277, 389, 613, 877, 109, 227,
701, 647, 599, 787, 139, 937, 311, 617, 233, 71, 929, 857, 599,
2, 139, 761, 389, 2, 523, 199, 653, 577, 211, 601, 617, 419, 241,
179, 233, 443, 271, 193, 839, 401, 673, 389, 433, 607, 2, 389,
571, 593, 877, 967, 131, 47, 97, 443]
'''
The program is slow (and you sense that there are many optimizations). But where to start?
As you've already learned, you need to know the bottleneck of your script. Let's use the cProfile module to find it! The only thing you need to do is to add the following two lines to your script:

import cProfile
cProfile.run('print(find_primes(100))')
It's really that simple. First, you write your script. Second, you call the cProfile.run() method to analyze its performance. Of course, you need to replace the execution command with the specific code you want to analyze. For example, if you want to test the function f42(), you need to type in cProfile.run('f42()').
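cProfile also offers a programmatic interface via cProfile.Profile, which is handy when you want the report as a string rather than printed straight to the shell. A self-contained sketch (the profiled function slow_sum() is illustrative):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    ''' Deliberately naive summation, just to have something to profile. '''
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Render the statistics, sorted by cumulative time, into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats()
print(stream.getvalue())
```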
Here’s the output of the previous code snippet (don’t panic yet):
The program still prints its output to the shell; even though you didn't execute the code directly, the cProfile.run() function did. You can see the list of the 100 random prime numbers here.
The next part prints some statistics to the shell:
3908 function calls in 1.614 seconds
Okay, this is interesting: the whole program took 1.614 seconds to execute. In total, 3908 function calls have been executed. Can you figure out which?
The print() function once.
The find_primes(100) function once.
The find_primes() function executes the for loop 100 times.
In the for loop, the program executes the range(), guess(), and is_prime() functions. It calls guess() and is_prime() multiple times per loop iteration until it has correctly guessed the next prime number.
The guess() function executes the randint(2,1000) method once.
The next part of the output shows you the detailed stats of the function names ordered by the function name (not its performance):
Each line stands for one function. For example, the second line stands for the function is_prime(). You can see that is_prime() had 535 executions with a total time of 1.54 seconds.
Wow! You’ve just found the bottleneck of the whole program: is_prime(). Again, the total execution time was 1.614 seconds and this one function dominates 95% of the total execution time!
So, you need to ask yourself the following questions: Do you need to optimize the code at all? If you do, how can you mitigate the bottleneck?
There are two basic ideas:
call the function is_prime() less frequently, and
optimize performance of the function itself.
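The first idea can be implemented with memoization: guess() draws from only 999 distinct values, so repeated guesses can reuse a cached result instead of re-running the primality check. A sketch using functools.lru_cache (with a simpler trial-division loop than the original):

```python
import functools

@functools.lru_cache(maxsize=None)
def is_prime(x):
    ''' Naive primality check; the cache answers repeated guesses instantly. '''
    for i in range(2, x):
        if x % i == 0:
            return False
    return True
```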
You know that the best way to optimize code is to look for more efficient algorithms. A quick search reveals a much more efficient algorithm (see function is_prime2()).
import random

def guess():
    ''' Returns a random number '''
    return random.randint(2, 1000)

def is_prime(x):
    ''' Checks whether x is prime '''
    for i in range(x):
        for j in range(x):
            if i * j == x:
                return False
    return True

def is_prime2(x):
    ''' Checks whether x is prime '''
    for i in range(2, int(x**0.5) + 1):
        if x % i == 0:
            return False
    return True

def find_primes(num):
    primes = []
    for i in range(num):
        p = guess()
        while not is_prime2(p):
            p = guess()
        primes += [p]
    return primes

import cProfile
cProfile.run('print(find_primes(100))')
What do you think: is our new prime checker faster? Let’s study the output of our code snippet:
Crazy – what a performance improvement! With the old bottleneck, the code took 1.6 seconds. Now, it takes only 0.074 seconds: a 95% runtime performance improvement!
That’s the power of bottleneck analysis.
The cProfile module has many more functions and parameters, but this simple cProfile.run() method is already enough to resolve many performance bottlenecks.
How to Sort the Output of the cProfile.run() Method?
To sort the output with respect to the i-th column, you can pass the sort=i argument to the cProfile.run() method. Here’s the help output:
>>> import cProfile
>>> help(cProfile.run)
Help on function run in module cProfile:

run(statement, filename=None, sort=-1)
    Run statement under profiler optionally saving results in filename

    This function takes a single argument that can be passed to the
    "exec" statement, and an optional file name. In all cases this
    routine attempts to "exec" its first argument and gather profiling
    statistics from the execution. If no file name is present, then this
    function automatically prints a simple profiling report, sorted by the
    standard name string (file/line/function-name) that is presented in
    each line.
And here's a minimal example profiling the above find_primes() function:
If you're running a Flask application on a server, you often want to improve its performance. But remember: you must focus on the bottlenecks of your whole application, not only the performance of the Flask app running on your server. There are many other possible performance bottlenecks such as database access, heavy use of images, wrong file formats, videos, embedded scripts, etc.
Before you start optimizing the Flask app itself, you should first check out those speed analysis tools that analyze the end-to-end latency as perceived by the user.
These online tools are free and easy to use: you just have to copy and paste the URL of your website and press a button. They will then point you to the potential bottlenecks of your app. Just run all of them and collect the results, for example in an Excel file. Then spend some time thinking about the possible bottlenecks until you're pretty confident that you've found the main one.
Here’s an example of a Google Page Speed run for the wealth creation Flask app www.wealthdashboard.app:
It's clear that in this case, the performance bottleneck is the work performed by the application itself. This isn't surprising, as the app comes with a rich and interactive user interface:
So in this case, it absolutely makes sense to dive into the Python Flask app itself, which, in turn, uses the Dash framework as a user interface.
So let's start with a minimal example of a Dash app. Note that a Dash app internally runs a Flask server:
import dash
import dash_core_components as dcc
import dash_html_components as html

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

app.layout = html.Div(children=[
    html.H1(children='Hello Dash'),

    html.Div(children='''
        Dash: A web application framework for Python.
    '''),

    dcc.Graph(
        id='example-graph',
        figure={
            'data': [
                {'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
                {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': u'Montréal'},
            ],
            'layout': {
                'title': 'Dash Data Visualization'
            }
        }
    )
])

if __name__ == '__main__':
    # app.run_server(debug=True)
    import cProfile
    cProfile.run('app.run_server(debug=True)', sort=1)
Don't worry, you don't need to understand what's going on. Only one thing is important: rather than running app.run_server(debug=True) in the third-to-last line, you execute the cProfile.run(...) wrapper. You sort the output with respect to decreasing runtime (second column). The result of executing and terminating the Flask app looks as follows:
So there have been 6031 function calls—but runtime was dominated by the method WaitForSingleObject() as you can see in the first row of the output table. This makes sense as I only ran the server and shut it down—it didn’t really process any request.
But if you executed many requests while testing your server, you'd quickly identify the bottleneck methods.
There are some specific profilers for Flask applications. I’d recommend that you start looking here:
You can set up the profiler in just a few lines of code. However, this flask profiler focuses on the performance of multiple endpoints (“urls”). If you want to explore the function calls of a single endpoint/url, you should still use the cProfile module for fine-grained analysis.
An easy way of using the cProfile module in your flask application is the Werkzeug project. Using it is as simple as wrapping the flask app like this:
from werkzeug.middleware.profiler import ProfilerMiddleware
app.wsgi_app = ProfilerMiddleware(app.wsgi_app)

(Note: in Werkzeug versions before 1.0, ProfilerMiddleware lived in werkzeug.contrib.profiler; that module has since been removed.)
By default, the profiled data is printed to your shell or the standard output (depending on how you serve your Flask application).
Pandas Profiling Example
To profile your pandas application, you should divide your overall script into many functions and use Python’s cProfile module (see above). This will quickly point towards potential bottlenecks.
However, if you want to find out about a specific Pandas dataframe, you could use the following two methods:
You’ve learned how to approach the problem of performance optimization conceptually:
Premature Optimization Is The Root Of All Evil
Measure First, Improve Second
Pareto Is King
Algorithmic Optimization Wins
All Hail to the Cache
Less is More
Know When to Stop
These concepts are vital for your coding productivity—they can save you weeks, if not months of mindless optimization.
The most important principle is to always focus on resolving the next bottleneck.
You’ve also learned about Python’s powerful cProfile module that helps you spot performance bottlenecks quickly. For the vast majority of Python applications, including Flask and Pandas, this will help you figure out the most critical bottlenecks.
Most of the time, there’s no need to optimize, say, beyond the first three bottlenecks (exception: scientific computing).
You’ve learned how to use the cProfile module in Python to find the bottleneck of your application.