Just like me an hour ago, you’re probably sitting in front of your regular expression code, puzzled by a strange error message:
re.error: multiple repeat at position x
How does it arise? Where does it come from? And, most importantly, how can you get rid of it?
This article gives you answers to all of those questions.
How Does the Multiple Repeat Error Arise in Python Re?
Python’s regex library re throws the multiple repeat error when you try to stack two regex quantifiers on top of each other. For example, the regex 'a++' causes the multiple repeat error on Python versions before 3.11 (newer versions read '++' as a possessive quantifier, but other stacked quantifiers such as 'a**' still fail). You can get rid of this error by not stacking quantifiers on top of each other.
Here’s an example:
>>> import re
>>> re.findall('a++', 'aaaa')
Traceback (most recent call last):
  File "<pyshell#29>", line 1, in <module>
    re.findall('a++', 'aaaa')
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 223, in findall
    ...
re.error: multiple repeat at position 2
I have shortened the error message to focus on the relevant parts. In the code, you first import the regex library re. You then use the re.findall(pattern, string) function (see this blog tutorial) to find the pattern 'a++' in the string 'aaaa'.
However, this doesn’t make a lot of sense: what’s the meaning of the pattern a++ anyway?
[Tips] What’s the Source of the Multiple Repeat Error and How to Avoid It?
The error happens if you use the Python regex package re. There are many different reasons for it but all of them have the same source: you stack quantifiers on top of each other.
If you don’t know what a quantifier is, scroll down and read the following subsection where I show you exactly what it is.
Here’s a list of reasons for the error message. Maybe your reason is among them?
You use the regex pattern 'X++' for any regex expression X. To avoid this error, get rid of one quantifier.
You use the regex pattern 'X+*' for any regex expression X. To avoid this error, get rid of one quantifier.
You use the regex pattern 'X**' for any regex expression X. To avoid this error, get rid of one quantifier.
You use the regex pattern 'X{m,n}*' for any regex expression X and number of repetitions m and n. To avoid this error, get rid of one quantifier.
You try to match a number of literal '+' characters and put a second quantifier on top, such as '+*'. In this case, you should escape the first symbol: '\+*'.
You try to match a number of characters '*' and use a second quantifier on top of it such as '*+'. Avoid this error by escaping the first quantifier symbol '\*'.
Oftentimes, the error appears if you don’t properly escape the special quantifier metacharacters in your regex pattern.
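To see the pattern behind this list, here is a minimal sketch that checks whether a pattern compiles. The helper name compiles is made up for illustration. The patterns use '**', '+*', and '{1,2}*' because, as far as I know, these are rejected on all current Python 3 versions, whereas '++' and '*+' are valid possessive quantifiers from Python 3.11 on.

```python
import re

def compiles(pattern):
    """Return True if `pattern` is a valid regex, False if re rejects it.

    Hypothetical helper for illustration; not part of the re module.
    """
    try:
        re.compile(pattern)
        return True
    except re.error as err:
        # Prints e.g. "multiple repeat at position 2"
        print(f'{pattern!r}: {err}')
        return False

# Stacked quantifiers are rejected.
stacked = ['a**', 'a+*', 'a{1,2}*']
results = [compiles(p) for p in stacked]

# Escaping the first symbol turns it into a literal character,
# so only one real quantifier is left.
escaped_ok = compiles(r'a\**')
```

Each stacked pattern is rejected, while the escaped variant compiles because '\*' is now an ordinary character followed by a single quantifier.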
Here’s a StackOverflow post that shows some code where this happened:
...
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
p = re.compile(term, re.IGNORECASE)
...
I edited the given code snippet to show the important part. The code fails because of a multiple repeat error. Can you see why?
The reason is that the regex 'lg incite" OR author:"http++www.dealitem.com" OR "for sale' contains two plus quantifiers stacked on top of each other in the substring 'http++'. Get rid of those and the code will run again!
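If the search term is meant as plain text rather than as a regex, one possible fix is to escape the whole term with re.escape() before compiling it. This is an assumption about the intent of that code, not the fix from the original post:

```python
import re

term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'

# re.escape() puts a backslash in front of every regex metacharacter,
# so '+' and '.' are matched literally instead of acting as operators.
p = re.compile(re.escape(term), re.IGNORECASE)

# The escaped pattern finds the term as plain text, ignoring case.
match = p.search('query: ' + term.upper())
print(match is not None)
```

The trade-off is that no character in the term can act as a regex operator anymore, which is usually what you want for user-supplied search strings.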
Python Regex Quantifiers
The word “quantifier” originates from Latin: quantus means “how much” or “how often”.
This is precisely what a regular expression quantifier means: you tell the regex engine how often you want to match a given pattern.
If you think you don’t define any quantifier, you do it implicitly: no quantifier means to match the regular expression exactly once.
So what are the regex quantifiers in Python?
Quantifier  Meaning
A?          Match regular expression A zero or one times
A*          Match regular expression A zero or more times
A+          Match regular expression A one or more times
A{m}        Match regular expression A exactly m times
A{m,n}      Match regular expression A between m and n times (included)
Note that in this tutorial, I assume you have at least a remote idea of what regular expressions actually are. If you haven’t, no problem, check out my detailed regex tutorial on this blog.
You see in the table that the quantifiers ?, *, +, {m}, and {m,n} define how often you repeat the matching of regex A.
Let’s have a look at some examples—one for each quantifier:
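The original listing is not preserved in this copy. A minimal reconstruction, inferred from the descriptions that follow, might look like this:

```python
import re

text = 'aaaa'

print(re.findall('a?', text))      # ['a', 'a', 'a', 'a', '']
print(re.findall('a*', text))      # ['aaaa', '']
print(re.findall('a+', text))      # ['aaaa']
print(re.findall('a{3}', text))    # ['aaa']
print(re.findall('a{1,2}', text))  # ['aa', 'aa']
```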
In each line, you try a different quantifier on the same text 'aaaa'. And, interestingly, each line leads to a different output:
The zero-or-one regex 'a?' matches a single 'a' four times. Note that it doesn’t match zero characters if it can avoid doing so, so an empty match appears only at the end of the string.
The zero-or-more regex 'a*' matches once four 'a's and consumes them. At the end of the string, it can still match the empty string.
The one-or-more regex 'a+' matches once four 'a's. In contrast to the previous quantifier, it cannot match an empty string.
The repeating regex 'a{3}' matches exactly three 'a's in a single run. It can do so only once because only one 'a' remains afterwards.
The repeating regex 'a{1,2}' matches one or two 'a's. It tries to match as many as possible.
You’ve learned the basic quantifiers of Python regular expressions.
Where to Go From Here?
To summarize, you’ve learned that the multiple repeat error appears whenever you try to stack multiple quantifiers on top of each other. Avoid this and the error message will disappear.
I just wrote a regular expression in Python that matches multiple times in the text and wondered: how to count the number of matches?
Consider the example where you match an arbitrary number of word characters '[a-z]+' in a given sentence 'python is the best programming language in the world'.
How many matches are there in the string? To count the number of matches, you can use multiple methods:
1. Python re.findall()
Use the re.findall(pattern, string) method that returns a list of matching substrings. Then count the length of the returned list. Here’s an example:
>>> import re
>>> pattern = '[a-z]+'
>>> text = 'python is the best programming language in the world'
>>> len(re.findall(pattern, text))
9
Why is the result 9? Because there are nine matching substrings in the returned list of the re.findall() method:
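The listing that belongs here is missing from this copy. Printing the returned list yourself shows the nine substrings:

```python
import re

pattern = '[a-z]+'
text = 'python is the best programming language in the world'

matches = re.findall(pattern, text)
print(matches)
# ['python', 'is', 'the', 'best', 'programming', 'language', 'in', 'the', 'world']
```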
2. Python re.finditer()
Specification: re.finditer(pattern, string, flags=0)
Definition: returns an iterator that goes over all non-overlapping matches of the pattern in the text.
The flags argument allows you to customize some advanced properties of the regex engine such as whether capitalization of characters should be ignored. You can learn more about the flags argument in my detailed blog tutorial.
Example: You can use the iterator to count the number of matches. In contrast to the re.findall() method described above, this has the advantage that you can analyze the match objects themselves that carry much more information than just the matching substring.
import re
pattern = '[a-z]+'
text = 'python is the best programming language in the world'
for match in re.finditer(pattern, text):
    print(match)

'''
<re.Match object; span=(0, 6), match='python'>
<re.Match object; span=(7, 9), match='is'>
<re.Match object; span=(10, 13), match='the'>
<re.Match object; span=(14, 18), match='best'>
<re.Match object; span=(19, 30), match='programming'>
<re.Match object; span=(31, 39), match='language'>
<re.Match object; span=(40, 42), match='in'>
<re.Match object; span=(43, 46), match='the'>
<re.Match object; span=(47, 52), match='world'>
'''
If you want to count the number of matches, you can use a simple count variable:
import re
pattern = '[a-z]+'
text = 'python is the best programming language in the world'
count = 0
for match in re.finditer(pattern, text):
    count += 1
print(count)
# 9
Or a more Pythonic solution:
import re
pattern = '[a-z]+'
text = 'python is the best programming language in the world'
print(len([i for i in re.finditer(pattern, text)]))
# 9
This method works great if there are non-overlapping matches.
3. Overlapping Matches
The above two methods work great if there are no overlapping matches. If there are overlapping matches, the regex engine will just ignore them because it “consumes” the whole matching substrings and starts matching the next pattern only after the stop index of the previous match.
So if you need to find the number of overlapping matches, you need to use a different approach.
The idea is to keep track of the start position in the previous match and increment it by one after each match:
import re
pattern = '99'
text = '999 ways of writing 99 - 99999'
left = 0
count = 0
while True:
    match = re.search(pattern, text[left:])
    if not match:
        break
    count += 1
    left += match.start() + 1
print(count)
# 7
By keeping track of the start index of the previous match in the left variable, we can control where to look for the next match in the string. Note that we use Python’s slicing operation text[left:] to ignore all left characters that are already considered in previous matches. In each loop iteration, we match another pattern in the text. This works even if those matches overlap.
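As an alternative sketch (a different technique from the loop above, not part of the original method), a zero-width lookahead lets re.finditer() report every start position, because the lookahead itself consumes no characters:

```python
import re

pattern = '99'
text = '999 ways of writing 99 - 99999'

# (?=99) matches the empty string at every index where '99' begins,
# so overlapping occurrences are all reported.
starts = [m.start() for m in re.finditer(f'(?={pattern})', text)]
count = len(starts)

print(starts)  # [0, 1, 20, 25, 26, 27, 28]
print(count)   # 7
```

This avoids slicing the string in every iteration, at the cost of a slightly less obvious pattern.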
Where to Go From Here
You’ve learned three ways of finding the number of matches of a given pattern in a string.
If you struggle with regular expressions, check out our free 20,000 word regex tutorial on the Finxter blog! It’ll give you regex superpowers!
I don’t know how often I sat in front of my computer, writing regular expressions and wondering: how to escape this or that character? The problem is that some special characters have a special meaning in Python strings and regular expressions. If you want to remove the special meaning, you need to escape the characters with an additional backslash.
If you have this problem too, you’re in luck. This article is the ultimate guide to escape special characters in Python. Just click on the topic that interests you and learn how to escape the special character you’re currently struggling with!
If you’re the impatient guy, you’re in luck too. Just try to add the backslash to your special character you want to escape: \x to escape special character x.
Python Regex Escape Characters
If you use special characters in strings, they carry a special meaning. Sometimes you don’t need that. The general idea is to escape the special character x with an additional backslash \x to get rid of the special meaning.
In the following, I show how to escape all possible special characters for Python strings and regular expressions:
Python Regex Escape Parentheses ()
How to escape the parentheses ( and ) in Python regular expressions?
Parentheses have a special meaning in Python regular expressions: they open and close matching groups.
You can get rid of the special meaning of parentheses by using the backslash prefix: \( and \). This way, you can match the parentheses characters in a given string. Here’s an example:
>>> import re
>>> re.findall(r'\(.*\)', 'Python is (really) great')
['(really)']
The result shows a string that contains the “special” characters '(' and ')'.
Python Regex Escape Square Brackets []
How to escape the square brackets [ and ] in Python regular expressions?
Square brackets have a special meaning in Python regular expressions: they open and close character sets.
You can get rid of the special meaning of brackets by using the backslash prefix: \[ and \]. This way, you can match the brackets characters in a given string. Here’s an example:
>>> import re
>>> re.findall(r'\[.*\]', 'Is Python [really] easy?')
['[really]']
The result shows a string that contains the “special” characters '[' and ']'.
Python Regex Escape Curly Brace (Brackets)
How to escape the curly braces { and } in Python regular expressions?
Curly braces don’t have any special meaning in plain Python strings. In regular expressions, they are only special when they form a quantifier such as {3} or {1,2}. In the patterns below they don’t, so you don’t need to escape them with a leading backslash character \. However, you can do so if you wish as you see in the following example:
>>> import re
>>> re.findall(r'\{.*\}', 'if (2==2) { y = 3; }')
['{ y = 3; }']
>>> re.findall(r'{.*}', 'if (2==2) { y = 3; }')
['{ y = 3; }']
>>> re.findall('{.*}', 'if (2==2) { y = 3; }')
['{ y = 3; }']
All three cases match the same string enclosed in curly braces—even though we did not escape them and didn’t use the raw string r'' in the third example.
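The situation changes as soon as the braces form a valid quantifier. A minimal sketch with a made-up input string shows the difference between the quantifier {2} and escaped literal braces:

```python
import re

text = 'aa a{2}'

# Unescaped, {2} is a quantifier: exactly two times the preceding 'a'.
as_quantifier = re.findall('a{2}', text)

# Escaped, the braces are ordinary characters.
as_literal = re.findall(r'a\{2\}', text)

print(as_quantifier)  # ['aa']
print(as_literal)     # ['a{2}']
```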
Python Regex Escape Slash (Backslash and Forward-Slash)
How to escape the slash characters—backslash \ and forward-slash /—in Python regular expressions?
The backslash has a special meaning in Python regular expressions: it escapes special characters and, thus, removes the special meaning. (How meta.)
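The original example is not preserved in this copy. A minimal sketch of what it might have looked like, with the pattern chosen here for illustration:

```python
import re

path = r'C:\home\usr\dir\hello\world'

# In the raw pattern, \\ stands for one literal backslash.
parts = re.findall(r'\\[a-z]+', path)
print(parts)
# ['\\home', '\\usr', '\\dir', '\\hello', '\\world']
```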
You can see that the resulting matches have escaped backslashes themselves. This is because the backslash character has a special meaning in normal strings. Thus, the Python interpreter escapes it automatically by itself when printing it on the shell. Note that you didn’t need to escape the backslash characters when writing the raw string r'C:\home\usr\dir\hello\world' because the raw string already removes all the special meaning from the backslashed characters. But if you don’t want to use a raw string but a normal string, you need to escape the backslash character yourself:
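Again, the original listing is missing here. A sketch of the same search with normal strings (pattern and path chosen for illustration) could be:

```python
import re

# Normal strings: every literal backslash is written as \\ in the text
# and as \\\\ in the pattern (once for Python, once for the regex engine).
path = 'C:\\home\\usr\\dir\\hello\\world'
parts = re.findall('\\\\[a-z]+', path)
print(parts)
# ['\\home', '\\usr', '\\dir', '\\hello', '\\world']
```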
In contrast to the backslash, the forward-slash doesn’t need to be escaped. Why? Because it doesn’t have a special meaning in Python strings and regular expressions. You can see this in the following example:
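The original listing is missing here as well. A minimal sketch with a made-up path:

```python
import re

# No raw string and no escaping needed for the forward-slash.
parts = re.findall('/[a-z]+', 'C:/home/usr/dir/hello/world')
print(parts)
# ['/home', '/usr', '/dir', '/hello', '/world']
```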
The result shows that even in a non-raw string, you can use the forward-slash without leading escape character.
Python Regex Escape String Single Quotes
How to escape the single quotes ' in Python regular expressions?
Single quotes have a special meaning in Python strings: they open and close string literals. For the regex engine itself, they are ordinary characters.
You can get rid of the special meaning of single quotes by using the backslash prefix: \'. This way, you can match the string quote characters in a given string. Here’s an example:
>>> import re
>>> re.findall('\'.*\'', "hello 'world'")
["'world'"]
The result shows a string that contains the “special” single quote characters. The result also shows an alternative that removes the special meaning of the single quotes: enclose them in double quotes: "hello 'world'".
Python Regex Escape String Double Quotes
How to escape the double quotes " in Python regular expressions?
Double quotes have a special meaning in Python strings: they open and close string literals. For the regex engine itself, they are ordinary characters.
You can get rid of the special meaning of double quotes by using the backslash prefix: \". This way, you can match the string quote characters in a given string. Here’s an example:
>>> import re
>>> re.findall('\".*\"', 'hello "world"')
['"world"']
The result shows a string that contains the “special” double quote characters. The example also shows an alternative that removes the special meaning of the double quotes: enclose them in single quotes: 'hello "world"'.
Python Regex Escape Dot (Period)
How to escape the regex dot (or period) meta character . in Python regular expressions?
The dot character has a special meaning in Python regular expressions: it matches an arbitrary character (except newline).
You can get rid of the special meaning of the dot character by using the backslash prefix: \.. This way, you can match the dot character in a given string. Here’s an example:
>>> import re
>>> re.findall(r'..\.', 'my. name. is. python.')
['my.', 'me.', 'is.', 'on.']
The result shows four strings that contain the “special” characters '.'.
Python Regex Escape Plus
How to escape the plus symbol + in Python regular expressions?
The plus symbol has a special meaning in Python regular expressions: it’s the one-or-more quantifier of the preceding regex.
You can get rid of the special meaning of the regex plus symbol by using the backslash prefix: \+. This way, you can match the plus symbol characters in a given string. Here’s an example:
>>> import re
>>> re.findall(r'\++', '+++python+++rocks')
['+++', '+++']
The result shows both usages: the plus symbol with and without leading escape character. If it is escaped \+, it matches the raw plus character. If it isn’t escaped +, it quantifies the regex pattern just in front of it (in our case the plus symbol itself).
Python Regex Escape Asterisk
How to escape the asterisk symbol * in Python regular expressions?
The asterisk symbol has a special meaning in Python regular expressions: it’s the zero-or-more quantifier of the preceding regex.
You can get rid of the special meaning of the regex asterisk symbol by using the backslash prefix: \*. This way, you can match the asterisk symbol characters in a given string. Here’s an example:
>>> import re
>>> re.findall(r'\*+', '***python***rocks')
['***', '***']
The result shows the escaped asterisk \* matching the raw asterisk character, repeated by the plus quantifier. If the asterisk isn’t escaped, it quantifies the regex pattern just in front of it. Note that a pattern such as '\**' (an escaped asterisk followed by the zero-or-more quantifier) would also return empty matches at every position without an asterisk.
Python Regex Escape Question Mark
How to escape the question mark symbol ? in Python regular expressions?
The question mark symbol has a special meaning in Python regular expressions: it’s the zero-or-one quantifier of the preceding regex.
You can get rid of the special meaning of the question mark symbol by using the backslash prefix: \?. This way, you can match the question mark symbol characters in a given string. Here’s an example:
>>> import re
>>> re.findall(r'...\?', 'how are you?')
['you?']
The result shows that the question mark symbol was matched in the given string.
Python Regex Escape Underscore
How to escape the underscore character _ in Python regular expressions?
The underscore doesn’t have a special meaning in Python regular expressions or Python strings.
Therefore, you don’t need to escape the underscore character—just use it in your regular expression unescaped.
>>> import re
>>> re.findall('..._', 'i_use_underscore_not_whitespace')
['use_', 'ore_', 'not_']
Python finds the underscore characters in the string and includes them in the matches.
Python Regex Escape Pipe
How to escape the pipe symbol | (vertical line) in Python regular expressions?
The pipe symbol has a special meaning in Python regular expressions: the regex OR operator.
You can get rid of the special meaning of the pipe symbol by using the backslash prefix: \|. This way, you can match the pipe character in a given string. Here’s an example:
>>> import re
>>> re.findall(r'.\|.', 'a|b|c|d|e')
['a|b', 'c|d']
By escaping the pipe symbol, you get rid of the special meaning. The result is just the matched pipe symbol with leading and trailing arbitrary character.
If you don’t escape the pipe symbol, the result will be quite different:
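The listing for this case is missing from this copy. Running the unescaped pattern on the same string gives one match per character:

```python
import re

# Unescaped, the pipe is the OR operator: "any character OR any character".
chars = re.findall('.|.', 'a|b|c|d|e')
print(chars)
# ['a', '|', 'b', '|', 'c', '|', 'd', '|', 'e']
```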
In this case, the regex .|. matches “an arbitrary character or an arbitrary character”—quite meaningless!
Python Regex Escape Dollar
How to escape the dollar symbol $ in Python regular expressions?
The dollar symbol has a special meaning in Python regular expressions: it matches at the end of the string.
You can get rid of the special meaning by using the backslash prefix: \$. This way, you can match the dollar symbol in a given string. Here’s an example:
>>> import re
>>> re.findall(r'\$\d+', 'Your house is worth $1000000')
['$1000000']
Note that the \d+ regex matches one or more numerical digits between 0 and 9.
Python Regex Escape Greater Than and Smaller Than
How to escape the smaller than < and greater than > symbols in Python regular expressions?
Smaller and greater than symbols don’t have a special meaning in Python regular expressions. Therefore, you don’t need to escape them.
Here’s an example:
>>> import re
>>> re.findall('<.*>.*<.*>', '<div>hello world</div>')
['<div>hello world</div>']
The result shows that, even without escaping the HTML tag symbols, the regex matches the whole string.
Python Regex Escape Hyphen
How to escape the hyphen - in Python regular expressions?
Outside a character set, the hyphen doesn’t have a special meaning and you don’t need to escape it. Here’s an example:
>>> import re
>>> re.findall('..-', 'this is-me')
['is-']
The unescaped hyphen character in the regex matches the hyphen in the string.
However, inside a character set, the hyphen stands for the range symbol (e.g. [0-9]) so you need to escape it if you want to get rid of its special meaning and match the hyphen symbol itself. Here’s an example:
>>> re.findall(r'[a-z\-]+', 'hello-world is one word')
['hello-world', 'is', 'one', 'word']
Note that, in this case, if you don’t escape the hyphen in the character set, you get the same result:
>>> re.findall('[a-z-]+', 'hello-world is one word')
['hello-world', 'is', 'one', 'word']
The reason is that the hyphen appears at the end of the character set, where it can have only one meaning: the hyphen symbol itself (the same holds at the very start of the set). Between two other characters, however, the hyphen is read as the range operator, which can lead to surprising matches. A good practice is, thus, to escape the hyphen in the character class by default.
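A minimal sketch of such a surprise, using a made-up character set: in '[+-/]' the hyphen sits between two characters and therefore defines the range from '+' to '/', which also contains ',' and '.':

```python
import re

text = 'a,b.c'

# Unescaped in the middle: '+-/' is a range that includes ',' and '.'.
as_range = re.findall('[+-/]', text)

# Escaped: only the three characters '+', '-' and '/' are in the set.
as_literal = re.findall(r'[+\-/]', text)

print(as_range)    # [',', '.']
print(as_literal)  # []
```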
Python Regex Escape Newline
In a recent StackOverflow article, I read the following question:
I got a little confused about Python raw string. I know that if we use raw string, then it will treat '\' as a normal backslash (ex. r'\n' would be '\' and 'n'). However, I was wondering what if I want to match a new line character in raw string. I tried r'\n', but it didn’t work. Anybody has some good idea about this?
The coder asking the question has understood that the Python interpreter doesn’t assume that the two characters \ and n have any special meaning in raw strings (in contrast to normal strings).
However, those two symbols have a special meaning for the regex engine! So if you use them as a regular expression pattern, they will indeed match the newline character:
>>> import re
>>> text = '''This
is
a
multiline
string'''
>>> re.findall(r'[a-z]+\n', text)
['his\n', 'is\n', 'a\n', 'multiline\n']
Therefore, you don’t need to escape the newline character again to match it in a given string.
Python re.escape Method
If you know that your string has a lot of special characters, you can also use the convenience method re.escape(pattern) from Python’s re module.
Specification: re.escape(pattern)
Definition: escapes all special regex meta characters in the given pattern.
Example: you can escape all special symbols in one go:
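The original listing is missing from this copy. A minimal sketch, with the output as produced by Python 3.7 or later (older versions escape more characters):

```python
import re

escaped = re.escape('https://www.finxter.com/')
print(escaped)
# https://www\.finxter\.com/
```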
The dot symbol is the only character with a special regex meaning in the string 'https://www.finxter.com/'. Therefore, only the dots are escaped.
Note that “only characters that can have special meaning in a regular expression are escaped. As a result, '!', '"', '%', "'", ',', '/', ':', ';', '<', '=', '>', '@', and "`" are no longer escaped” (source).
Python Regex Bad Escape
There are some common errors in relation to escaping in Python regular expressions.
If you try to escape an ASCII letter that has no special meaning as an escape sequence, Python will throw a “bad escape” error:
>>> re.findall('\m', 'hello {world}')
Traceback (most recent call last):
  File "<pyshell#61>", line 1, in <module>
    re.findall('\m', 'hello {world}')
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 507, in _parse
    code = _escape(source, this, state)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 402, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \m at position 0
As the error message suggests, there’s no escape sequence \m so you need to get rid of it to avoid the error.
Where to Go From Here
Wow, you either have read about a lot of escaped character sequences or you did a lot of scrolling to reach this point.
In both cases, you have a great advantage over other coders: you’re a persistent guy or gal!
Do you want to increase your advantage over your peers? Then join my Python email academy! I’ll teach you the ins and outs of Python coding—all free!
Regular expressions are a strange animal. Many students find them difficult to understand – do you?
I realized that a major reason for this is simply that they don’t understand the special regex characters. To put it differently: understand the special characters and everything else in the regex space will come much easier to you.
Regular expressions are built from characters. There are two types of characters: literal characters and special characters.
Literal Characters
Let’s start with the absolute first thing you need to know with regular expressions: a regular expression (short: regex) searches for a given pattern in a given string.
What’s a pattern? In its most basic form, a pattern can be a literal character. So the literal characters 'a', 'b', and 'c' are all valid regex patterns.
For example, you can search for the regex pattern 'a' in the string 'hello world' but it won’t find a match. You can also search for the pattern 'a' in the string 'hello woman' and there is a match: the second last character in the string.
Based on the simple insight that a literal character is a valid regex pattern, you’ll find that a combination of literal characters is also a valid regex pattern. For example, the regex pattern 'an' matches the last two characters in the string 'hello woman'.
Summary: Regular expressions are built from characters. An important class of characters are the literal characters. In principle, you can use all Unicode literal characters in your regex pattern.
Special Characters
However, the power of regular expressions comes from their abstraction capability. Instead of writing the character set [abcdefghijklmnopqrstuvwxyz], you’d write [a-z] or even the broader \w. The latter is a special regex character, and pros know them by heart. In fact, regex experts seldom match literal characters. In most cases, they use more advanced constructs or special characters for various reasons such as brevity, expressiveness, or generality.
So what are the special characters you can use in your regex patterns?
Let’s have a look at the following table that contains all special characters in Python’s re package for regular expression processing.
Special Character  Meaning
\n                 The newline symbol is not a special symbol particular to regex only, it’s actually one of the most widely-used, standard characters. However, you’ll see the newline character so often that I just couldn’t write this list without including it. For example, the regex 'hello\nworld' matches a string where the string 'hello' is placed in one line and the string 'world' is placed into the second line.
\t                 The tabular character is, like the newline character, not a “regex-specific” symbol. It just encodes a tab, which is different from a sequence of spaces even if it can look the same on screen. For example, the regex 'hello\n\tworld' matches the string that consists of 'hello' in the first line and 'world' with a leading tab character in the second line.
\s                 The whitespace character is, in contrast to the newline character, a special symbol of the regex libraries. You’ll find it in many other programming languages, too. The problem is that you often don’t know which type of whitespace is used: tabular characters, simple whitespaces, or even newlines. The whitespace character '\s' simply matches any of them. For example, the regex '\s*hello\s+world' matches the string ' \t \n hello \n \n \t world', as well as 'hello world'.
\S                 The whitespace-negation character matches everything that does not match \s.
\w                 The word character regex simplifies text processing significantly. It represents the class of all characters used in typical words (A-Z, a-z, 0-9, and '_'). This simplifies the writing of complex regular expressions significantly. For example, the regex '\w+' matches the strings 'hello', 'bye', 'Python', and 'Python_is_great'.
\W                 The word-character-negation. It matches any character that is not a word character.
\b                 The word boundary is also a special symbol used in many regex tools. You can use it to match, as the name suggests, the boundary between a word character (\w) and a non-word (\W) character. But note that it matches only the empty string! You may ask: why does it exist if it doesn’t match any character? The reason is that it doesn’t “consume” the character right in front or right after a word. This way, you can search for whole words (or parts of words) and return only the word but not the delimiting characters that separate the word, e.g., from other words.
\d                 The digit character matches all numeric symbols between 0 and 9. You can use it to match integers with an arbitrary number of digits: the regex '\d+' matches integer numbers '10', '1000', '942', and '99999999999'.
\D                 Matches any non-digit character. This is the inverse of \d and it’s equivalent to [^0-9].
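The word boundary \b is the least intuitive entry in this table, so here is a minimal sketch with a made-up sentence that compares a plain search with a whole-word search:

```python
import re

text = 'cat concat cats cat.'

# Without boundaries, 'cat' is also found inside longer words.
anywhere = re.findall('cat', text)

# \b on both sides keeps only the stand-alone word. The boundary itself
# matches the empty string, so the result contains no delimiters.
whole_words = re.findall(r'\bcat\b', text)

print(len(anywhere))  # 4
print(whole_words)    # ['cat', 'cat']
```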
But these are not all characters you can use in a regular expression.
There are also meta characters for the regex engine that allow you to do much more powerful stuff.
A good example is the asterisk operator that matches “zero or more” occurrences of the preceding regex. For example, the pattern .*txt matches an arbitrary number of arbitrary characters followed by the suffix 'txt'. This pattern has two special regex meta characters: the dot . and the asterisk operator *. You’ll now learn about those meta characters:
Regex Meta Characters
Next, you’ll get a quick and dirty overview of the most important regex operations and how to use them in Python.
Here are the most important regex operators:
Meta Character  Meaning
.               The wild-card operator (dot) matches any character in a string except the newline character '\n'. For example, the regex '...' matches all words with three characters such as 'abc', 'cat', and 'dog'.
*               The zero-or-more asterisk operator matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex. For example, the regex 'cat*' matches the strings 'ca', 'cat', 'catt', 'cattt', and 'catttttttt'.
?               The zero-or-one operator matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. For example, the regex 'cat?' matches both strings 'ca' and 'cat', but not 'catt', 'cattt', and 'catttttttt'.
+               The at-least-one operator matches one or more occurrences of the immediately preceding regex. For example, the regex 'cat+' does not match the string 'ca' but matches all strings with at least one trailing character 't' such as 'cat', 'catt', and 'cattt'.
^               The start-of-string operator matches the beginning of a string. For example, the regex '^p' would match the strings 'python' and 'programming' but not 'lisp' and 'spying' where the character 'p' does not occur at the start of the string.
$               The end-of-string operator matches the end of a string. For example, the regex 'py$' would match the strings 'main.py' and 'pypy' but not the strings 'python' and 'pypi'.
A|B             The OR operator matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. For example, the regex '(hello)|(hi)' matches strings 'hello world' and 'hi python'. It wouldn’t make sense to try to match both of them at the same time.
AB              The AND operator matches first the regex A and second the regex B, in this sequence. We’ve already seen it trivially in the regex 'ca' that matches first regex 'c' and second regex 'a'.
Note that I gave the above operators some more meaningful names (in bold) so that you can immediately grasp the purpose of each regex. For example, the ‘^’ operator is usually denoted as the ‘caret’ operator. Those names are not descriptive so I came up with more kindergarten-like words such as the “start-of-string” operator.
Let’s dive into some examples!
Examples
import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''

# Finds all occurrences of an arbitrary character that is
# followed by the character sequence 'a!'.
print(re.findall('.a!', text))
# ['Ha!']

# Finds all occurrences of the word 'is',
# followed by an arbitrary number of characters
# and the word 'and'.
print(re.findall('is.*and', text))
# ['is settled, and']

# Finds all occurrences of the word 'her',
# followed by zero or one occurrences of the colon ':'.
print(re.findall('her:?', text))
# ['her:', 'her', 'her']

# Finds all occurrences of the word 'her',
# followed by one or more occurrences of the colon ':'.
print(re.findall('her:+', text))
# ['her:']

# Finds all occurrences where the string starts with
# the character sequence 'Ha', followed by an arbitrary
# number of characters except for the new-line character.
# Can you figure out why Python doesn't find any?
print(re.findall('^Ha.*', text))
# []

# Finds all occurrences where the new-line character '\n'
# occurs at the end of the string.
print(re.findall('\n$', text))
# ['\n']

# Finds all occurrences of either the word 'Life' or the
# word 'Death'.
print(re.findall('(Life|Death)', text))
# ['Life', 'Death']
In these examples, you’ve already seen the special symbol \n which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions.
Where to Go From Here
You’ve learned all special characters of regular expressions, as well as meta characters. This will give you a strong basis for improving your regex skills.
If you want to accelerate your skills, you need a good foundation. Check out my brand-new Python book “Python One-Liners (Amazon Link)” which boosts your skills from zero to hero—in a single line of Python code!
This tutorial explains everything you need to know about matching groups in Python’s re package for regular expressions. You may have also read the term “capture groups” which points to the same concept.
As you read through the tutorial, you can also watch the tutorial video where I explain everything in a simple way:
So let’s start with the basics:
Matching Group ()
What’s a matching group?
Just as you use parentheses to structure mathematical expressions, (2 + 2) * 2 versus 2 + (2 * 2), you use parentheses to structure regular expressions. An example regex that does this is 'a(b|c)'. The whole content enclosed in the opening and closing parentheses is called a matching group (or capture group). You can have multiple matching groups in a single regex. And you can even have hierarchical matching groups, for example 'a(b|(cd))'.
One big advantage of a matching group is that it captures the matched substring. You can retrieve it in other parts of the regular expression—or after analyzing the result of the whole regex matching.
Let’s have a short example for the most basic use of a matching group—to structure the regex.
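Say you create the regex b?(a.)* with the matching group (a.). It matches zero or one occurrence of the character 'b', followed by an arbitrary number of two-character sequences that each start with the character 'a'. Because every part of the pattern is optional, the regex finds a match in each of the strings 'bacacaca', 'aaaa', '' (the empty string), and 'Xababababab'. The match does not always cover the whole string, though: re.match() matches only 'bacacac' in the first string and only the empty string at the start of the last one.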
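To see what the pattern actually matches in each of these strings, here is a minimal sketch. It assumes you compare re.match() (match at the start of the string) with re.fullmatch() (match the entire string); the variable names are only for illustration.

```python
import re

pattern = 'b?(a.)*'
results = {}

for s in ['bacacaca', 'aaaa', '', 'Xababababab']:
    m = re.match(pattern, s)                     # match at the start of s
    covers_all = re.fullmatch(pattern, s) is not None
    results[s] = (m.group(0), covers_all)
    print(repr(s), '->', repr(m.group(0)), '| whole string matched:', covers_all)
```

Only 'aaaa' and the empty string are matched completely. In the other two cases the regex still succeeds, but only on a prefix of the string.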
The use of the parentheses for structuring the regular expression is intuitive and should come naturally to you because the same rules apply as for arithmetic operations. However, there’s a more advanced use of regex groups: retrieval.
You can retrieve the matched content of each matching group. So the next question naturally arises:
How to Get the First Matching Group?
There are two scenarios when you want to access the content of your matching groups:
Access the matching group in the regex pattern to reuse partially matched text from one group somewhere else.
Access the matching group after the whole match operation to analyze the matched text in your Python code.
In the first case, you simply get the first matching group with the \number special sequence. For example, to get the first matching group, you’d use the \1 special sequence. Here’s an example:
>>> import re
>>> re.search(r'(j.n) is \1','jon is jon')
<re.Match object; span=(0, 10), match='jon is jon'>
You’ll use this feature a lot because it gives you much more expressive power: for example, you can search for a name in a text based on a given pattern and then process specifically this name in the rest of the text (and not all other names that would also fit the pattern).
Note that the numbering of the groups starts with \1 and not with \0, a rare exception to the rule that in programming, all numbering starts with 0.
In the second case, you want to know the contents of the first group after the whole match. How do you do that?
The answer is also simple: use the m.group(1) method on the match object m. Here’s an example:
>>> import re
>>> m = re.search(r'(j.n)','jon is jon')
>>> m.group(1)
'jon'
The numbering works consistently with the previously introduced regex group numbering: start with identifier 1 to access the contents of the first group.
How to Get All Other Matching Groups?
Again, there are two different intentions when asking this question:
Access the matching group in the regex pattern to reuse partially matched text from one group somewhere else.
Access the matching group after the whole match operation to analyze the matched text in your Python code.
In the first case, you use the special sequence \2 to access the second matching group, \3 to access the third matching group, and \99 to access the ninety-ninth matching group.
Here’s an example:
>>> import re
>>> re.search(r'(j..) (j..)\s+\2', 'jon jim jim')
<re.Match object; span=(0, 11), match='jon jim jim'>
>>> re.search(r'(j..) (j..)\s+\2', 'jon jim jon')
>>>
As you can see, the special sequence \2 refers to the matching contents of the second group 'jim'.
In the second case, you can simply increase the identifier too to access the other matching groups in your Python code:
>>> import re
>>> m = re.search(r'(j..) (j..)\s+\2', 'jon jim jim')
>>> m.group(0)
'jon jim jim'
>>> m.group(1)
'jon'
>>> m.group(2)
'jim'
This code also shows an interesting feature: if you use the identifier 0 as an argument to the m.group(0) method, the regex module will give you the contents of the whole match. You can think of the whole match as group number 0.
Named Groups: (?P<name>…) and (?P=name)
Accessing the captured group using the notation \number is not always convenient and sometimes not even possible (for example if you have more than 99 groups in your regex). A major disadvantage of regular expressions is that they tend to be hard to read. It’s therefore important to know about the different tweaks to improve readability.
One such optimization is a named group. It’s really just that: a matching group that captures part of the match but with one twist: it has a name. Now, you can use this name to access the captured group at a later point in your regular expression pattern. This can improve readability of the regular expression.
import re
pattern = '(?P<quote>["\']).*(?P=quote)'
text = 'She said "hi"'
print(re.search(pattern, text))
# <re.Match object; span=(9, 13), match='"hi"'>
The code searches for substrings that are enclosed in either single or double quotes. You first match the opening quote by using the regex ["\']. You escape the single quote, \', so that Python’s string parser does not assume (wrongly) that the single quote indicates the end of the string literal. You then use the named backreference (?P=quote) to match the closing quote with the same character as the opening quote (either a single or a double quote).
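A named group can also be read back in your Python code after the match. The following minimal sketch reuses the pattern from above and relies on the match object methods group() and groupdict() from the re module:

```python
import re

pattern = '(?P<quote>["\']).*(?P=quote)'
m = re.search(pattern, 'She said "hi"')

quote_by_name = m.group('quote')   # access the group by its name
quote_by_number = m.group(1)       # named groups are still numbered
all_named = m.groupdict()          # dictionary of all named groups

print(quote_by_name, quote_by_number, all_named)
```

Both lookups return the double quote character that opened the quoted substring.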
Non-Capturing Groups (?:…)
In the previous examples, you’ve seen how to match and capture groups with the parentheses (...). You’ve learned that each match of this basic group operator is captured so that you can retrieve it later in the regex with the special commands \1, \2, …, \99 or after the match on the matched object m with the method m.group(1), m.group(2), and so on.
But what if you don’t need that? What if you just need to keep your regex pattern in order—but you don’t want to capture the contents of a matching group?
The simple solution is the non-capturing group operation (?: ... ). You can use it just like the capturing group operation ( ... ). Here’s an example:
>>> import re
>>> re.search('(?:python|java) is great', 'python is great. java is great.')
<re.Match object; span=(0, 15), match='python is great'>
The non-capturing group exists with the sole purpose to structure the regex. You cannot use its content later:
>>> m = re.search('(?:python|java) is great', 'python is great. java is great.')
>>> m.group(1)
Traceback (most recent call last): File "<pyshell#28>", line 1, in <module> m.group(1)
IndexError: no such group
>>>
If you try to access the contents of the non-capturing group, the regex engine will throw an IndexError: no such group.
Of course, there’s a straightforward alternative to non-capturing groups. You can simply use the normal (capturing) group but don’t access its contents. Only rarely will the performance penalty of capturing a group that’s not needed have any meaningful impact on your overall application.
Positive Lookahead (?=…)
The concept of lookahead is a very powerful one and any advanced coder should know it. A friend recently told me that he had written a complicated regex that ignores the order of occurrences of two words in a given text. It’s a challenging problem and without the concept of lookahead, the resulting code will be complicated and hard to understand. However, the concept of lookahead makes this problem simple to write and read.
But first things first: how does the lookahead assertion work?
In normal regular expression processing, the regex is matched from left to right. The regex engine “consumes” partially matching substrings. The consumed substring cannot be matched by any other part of the regex.
Figure: A simple example of lookahead. The regular expression engine matches (“consumes”) the string partially. Then it checks whether the remaining pattern could be matched without actually matching it.
Think of the lookahead assertion as a non-consuming pattern match. The regex engine goes from the left to the right—searching for the pattern. At each point, it has one “current” position to check if this position is the first position of the remaining match. In other words, the regex engine tries to “consume” the next character as a (partial) match of the pattern.
The advantage of the lookahead expression is that it doesn’t consume anything. It just “looks ahead” starting from the current position whether what follows would theoretically match the lookahead pattern. If it doesn’t, the regex engine cannot move on. Next, it “backtracks”—which is just a fancy way of saying: it goes back to a previous decision and tries to match something else.
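A small sketch may help to see the non-consuming behavior in action. The example text and variable names are made up for illustration: the first pattern requires an exclamation mark after each word but does not include it in the match, while the second pattern consumes it.

```python
import re

text = 'hi! you? wow!'

# \w+ consumes a word; (?=!) only checks that '!' follows, without consuming it.
with_lookahead = re.findall(r'\w+(?=!)', text)
print(with_lookahead)      # ['hi', 'wow']

# Without the lookahead, the '!' becomes part of the match.
without_lookahead = re.findall(r'\w+!', text)
print(without_lookahead)   # ['hi!', 'wow!']
```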
Positive Lookahead Example: How to Match Two Words in Arbitrary Order?
What if you want to search a given text for pattern A AND pattern B—but in no particular order? If both patterns appear anywhere in the string, the whole string should be returned as a match.
Now, this is a bit more complicated because any regular expression pattern is ordered from left to right. A simple solution is to use the lookahead assertion (?=.*A) to check whether regex A appears anywhere in the string. (Note we assume a single line string as the .* pattern doesn’t match the newline character by default.)
Let’s first have a look at the minimal solution to check for two patterns anywhere in the string (say, patterns ‘hi’ AND ‘you’).
>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are yo?')
[]
>>> re.findall(pattern, 'hi how are you?')
['']
In the first example, the word ‘you’ does not appear, so there is no match. In the second example, both words appear.
Let’s go back to the expression (?=.*hi)(?=.*you) to match strings that contain both ‘hi’ and ‘you’. Why does it work?
The reason is that the lookahead expressions don’t consume anything. You first search for an arbitrary number of characters .*, followed by the word hi. But because the regex engine hasn’t consumed anything, it’s still in the same position at the beginning of the string. So, you can repeat the same for the word you.
Note that this method doesn’t care about the order of the two words:
>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are you?')
['']
>>> re.findall(pattern, 'you are how? hi!')
['']
No matter which word “hi” or “you” appears first in the text, the regex engine finds both.
You may ask: why’s the output the empty string? The reason is that the regex engine hasn’t consumed any character. It just checked the lookaheads. So the easy fix is to consume all characters as follows:
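A minimal sketch of that fix, assuming it amounts to appending .* after the two lookaheads so that the pattern consumes the rest of the string:

```python
import re

pattern = '(?=.*hi)(?=.*you).*'

first = re.findall(pattern, 'hi how are you?')
second = re.findall(pattern, 'you are how? hi!')
third = re.findall(pattern, 'hi how are yo?')

print(first)    # ['hi how are you?']
print(second)   # ['you are how? hi!']
print(third)    # []
```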
Now, the whole string is a match because after checking the lookahead with ‘(?=.*hi)(?=.*you)’, you also consume the whole string ‘.*’.
Negative Lookahead (?!…)
The negative lookahead works just like the positive lookahead—only it checks that the given regex pattern does not occur going forward from a certain position.
Here’s an example:
>>> import re
>>> re.search('(?!.*hi.*)', 'hi say hi?')
<re.Match object; span=(8, 8), match=''>
The negative lookahead pattern (?!.*hi.*) ensures that, going forward in the string, there’s no occurrence of the substring 'hi'. The first position where this holds is position 8 (right after the second 'h'). Like the positive lookahead, the negative lookahead does not consume any character so the result is the empty string (which is a valid match of the pattern).
You can even combine multiple negative lookaheads like this:
>>> re.search(r'(?!.*hi.*)(?!\?).', 'hi say hi?')
<re.Match object; span=(8, 9), match='i'>
You search for a position where neither ‘hi’ is in the lookahead, nor does the question mark character follow immediately. This time, we consume an arbitrary character so the resulting match is the character 'i'.
Group Flags (?aiLmsux:…) and (?aiLmsux)
You can control the regex engine with the flags argument of the re.findall(), re.search(), or re.match() methods. For example, if you don’t care about capitalization of your matched substring, you can pass the re.IGNORECASE flag to the regex methods:
>>> re.findall('PYTHON', 'python is great', flags=re.IGNORECASE)
['python']
But using a global flag for the whole regex is not always optimal. What if you want to ignore the capitalization only for a certain subregex?
You can do this with the group flags: a, i, L, m, s, u, and x. Each group flag has its own meaning:
Syntax
Meaning
a
If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s and \S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters — as the name suggests.
i
If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z].
L
Don’t use this flag. Its use is discouraged: the idea was to perform case-insensitive matching depending on your current locale, but it isn’t reliable.
m
This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now matches also at the end of each line in a multi-line string.
s
Without using this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character.
x
To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: whitespace in the pattern is ignored, and so is everything from an unescaped ‘#’ character to the end of the line (except inside a character set).
For example, if you want to switch off the differentiation of capitalization, you’ll use the i flag as follows:
>>> re.findall('(?i:PYTHON)', 'python is great')
['python']
You can also switch off the capitalization for the whole regex with the “global group flag” (?i) as follows:
>>> re.findall('(?i)PYTHON', 'python is great')
['python']
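The difference between the two forms shows up as soon as the pattern contains more than the flagged part. In this minimal sketch (the example text is made up, and the (?i:...) syntax requires Python 3.6 or newer), only the group ignores capitalization in the first pattern, while the whole pattern does in the second:

```python
import re

text = 'python is great, Python IS great'

# Case-insensitive only inside the group; ' is' must match exactly.
scoped = re.findall('(?i:PYTHON) is', text)
print(scoped)    # ['python is']

# Global inline flag: the whole pattern ignores case.
global_flag = re.findall('(?i)PYTHON is', text)
print(global_flag)   # ['python is', 'Python IS']
```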
Where to Go From Here?
Summary: You’ve learned about matching groups to structure the regex and capture parts of the matching result. You can then retrieve the captured groups with the \number syntax within the regex pattern itself and with the m.group(i) syntax in the Python code at a later stage.
To learn the Python basics, check out my free Python email academy with many advanced courses—including a regex video tutorial in your INBOX.
Goal: You are given a string that is either Morse code or normal text. Write a function that translates the string into the other language: Morse code should be translated to normal text. Normal text should be translated to Morse code.
Output Example: Create a function morse(txt) that takes an input string argument txt and returns its translation:
Note that Morse code doesn’t differentiate lowercase or uppercase characters. So you just use uppercase characters as default translation output.
Algorithm Idea: A simple algorithm is enough to solve the problem:
Detect if a string is Morse code or normal text. The simple but not perfect solution is to check if the first character is either the dot symbol '.' or the minus symbol '-'. Note that you can easily extend this by checking if all characters are either the dot symbol or the minus symbol (a simple regular expression will be enough).
Prepare a dictionary that maps all “normal text” symbols to their respective Morse code translations. Use the inverse dictionary (or create it ad-hoc) to get the inverse mapping.
Iterate over all characters in the string and use the dictionary to translate each character separately.
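The first step mentions that a simple regular expression can check all characters instead of only the first one. A minimal sketch of that check, assuming a hypothetical helper named looks_like_morse and assuming that Morse input consists only of dots, hyphens, and spaces:

```python
import re

def looks_like_morse(txt):
    """Return True if txt consists only of dots, hyphens, and spaces."""
    # Hypothetical helper, not part of the implementation in this article.
    return re.fullmatch(r'[.\- ]+', txt) is not None

print(looks_like_morse('.--. -.--'))   # True
print(looks_like_morse('python'))      # False
print(looks_like_morse('hi.'))         # False
```

Using re.fullmatch() ensures that every character is checked, so normal text that merely starts or ends with a dot is not mistaken for Morse code.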
Implementation: Here’s the Python implementation of the above algorithm for Morse code translation:
def morse(txt):
    '''Morse code encryption and decryption'''

    d = {'A':'.-','B':'-...','C':'-.-.','D':'-..','E':'.',
         'F':'..-.','G':'--.','H':'....','I':'..','J':'.---',
         'K':'-.-','L':'.-..','M':'--','N':'-.','O':'---',
         'P':'.--.','Q':'--.-','R':'.-.','S':'...','T':'-',
         'U':'..-','V':'...-','W':'.--','X':'-..-','Y':'-.--',
         'Z':'--..', ' ':'.....'}
    translation = ''

    # Input is Morse code: translate it back to normal text
    if txt.startswith('.') or txt.startswith('-'):
        # Swap key/values in d:
        d_encrypt = dict([(v, k) for k, v in d.items()])
        # Morse code is separated by empty space chars
        txt = txt.split(' ')
        for x in txt:
            translation += d_encrypt.get(x)

    # Input is normal text: translate it to Morse code
    else:
        txt = txt.upper()
        for x in txt:
            translation += d.get(x) + ' '

    return translation.strip()


print(morse('python'))
# .--. -.-- - .... --- -.
print(morse('.--. -.-- - .... --- -.'))
# PYTHON
print(morse(morse('HEY')))
# HEY
Algorithmic complexity: The runtime complexity is linear in the length of the input string to be translated, with one translation operation per character. A dictionary lookup has constant runtime complexity. The memory overhead is also linear in the input text as all the characters have to be held in memory.
Alternative Implementation: Albrecht also proposed a much shorter alternative:
def morse(txt):
    encrypt = {'A':'.-', 'B':'-...', 'C':'-.-.', 'D':'-..', 'E':'.',
               'F':'..-.', 'G':'--.', 'H':'....', 'I':'..', 'J':'.---',
               'K':'-.-', 'L':'.-..', 'M':'--', 'N':'-.', 'O':'---',
               'P':'.--.', 'Q':'--.-', 'R':'.-.', 'S':'...', 'T':'-',
               'U':'..-', 'V':'...-', 'W':'.--', 'X':'-..-', 'Y':'-.--',
               'Z':'--..', ' ':'.....'}
    decrypt = {v: k for k, v in encrypt.items()}

    if '-' in txt:
        return ''.join(decrypt[i] for i in txt.split())
    return ' '.join(encrypt[i] for i in txt.upper())


print(morse('python'))
# .--. -.-- - .... --- -.
print(morse('.--. -.-- - .... --- -.'))
# PYTHON
print(morse(morse('HEY')))
# HEY
It uses dict comprehension and generator expressions to make it much more concise.
Ready to earn the black belt of your regex superpower? This tutorial shows you the subtle but important difference between greedy and non-greedy regex quantifiers.
But first things first: what are “quantifiers” anyway? Great question – I’m glad you asked! So let’s dive into Python’s three main regex quantifiers.
Python Regex Quantifiers
The word “quantifier” originates from Latin: it derives from quantus, meaning how much / how often.
This is precisely what a regular expression quantifier means: you tell the regex engine how often you want to match a given pattern.
If you think you don’t define any quantifier, you do it implicitly: no quantifier means to match the regular expression exactly once.
So what are the regex quantifiers in Python?
Quantifier
Meaning
A?
Match regular expression A zero or one times
A*
Match regular expression A zero or more times
A+
Match regular expression A one or more times
A{m}
Match regular expression A exactly m times
A{m,n}
Match regular expression A between m and n times (included)
Note that in this tutorial, I assume you have at least a remote idea of what regular expressions actually are. If you haven’t, no problem, check out my detailed regex tutorial on this blog.
You see in the table that the quantifiers ?, *, +, {m}, and {m,n} define how often you repeat the matching of regex A.
Let’s have a look at some examples—one for each quantifier:
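A minimal sketch of such examples, assuming the text 'aaaa' used in the explanation and the re.findall() function:

```python
import re

text = 'aaaa'

zero_or_one = re.findall('a?', text)
zero_or_more = re.findall('a*', text)
one_or_more = re.findall('a+', text)
exactly_three = re.findall('a{3}', text)
one_to_two = re.findall('a{1,2}', text)

print(zero_or_one)     # ['a', 'a', 'a', 'a', '']
print(zero_or_more)    # ['aaaa', '']
print(one_or_more)     # ['aaaa']
print(exactly_three)   # ['aaa']
print(one_to_two)      # ['aa', 'aa']
```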
In each line, you try a different quantifier on the same text 'aaaa'. And, interestingly, each line leads to a different output:
The zero-or-one regex 'a?' matches four times one 'a'. Note that it doesn’t match zero characters if it can avoid doing so.
The zero-or-more regex 'a*' matches once four 'a's and consumes them. At the end of the string, it can still match the empty string.
The one-or-more regex 'a+' matches once four 'a's. In contrast to the previous quantifier, it cannot match an empty string.
The repeating regex 'a{3}' matches exactly three 'a's in a single run. It can do so only once because only one 'a' remains afterwards.
The repeating regex 'a{1,2}' matches one or two 'a's. It tries to match as many as possible.
You’ve learned the basic quantifiers of Python regular expressions. Now, it’s time to explore the meaning of the term greedy. Shall we?
Python Regex Greedy Match
A greedy match means that the regex engine (the one which tries to find your pattern in the string) matches as many characters as possible.
For example, the regex 'a+' will match as many 'a's as possible in your string 'aaaa'. Although the substrings 'a', 'aa', 'aaa' all match the regex 'a+', it’s not enough for the regex engine. It’s always hungry and tries to match even more.
In other words, the greedy quantifiers give you the longest match from a given position in the string.
As it turns out, all default quantifiers ?, *, +, {m}, and {m,n} you’ve learned above are greedy: they “consume” or match as many characters as possible so that the regex pattern is still satisfied.
Here are the above examples again that all show how greedy the regex engine is:
In all cases, a shorter match would also be valid. But as the regex engine is greedy per default, those are not enough for the regex engine.
Okay, so how can we do a non-greedy match?
Python Regex Non-Greedy Match
A non-greedy match means that the regex engine matches as few characters as possible—so that it still can match the pattern in the given string.
For example, the regex 'a+?' will match as few 'a's as possible in your string 'aaaa'. Thus, it matches the first character 'a' and is done with it. Then, it moves on to the second character (which is also a match) and so on.
In other words, the non-greedy quantifiers give you the shortest possible match from a given position in the string.
You can make the default quantifiers ?, *, +, {m}, and {m,n} non-greedy by appending a question mark symbol '?' to them: ??, *?, +?, and {m,n}?. They “consume” or match as few characters as possible so that the regex pattern is still satisfied.
Here are some examples that show how non-greedy matching works:
Non-Greedy Question Mark Operator (??)
Let’s start with the question mark (zero-or-one operator):
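A minimal sketch of the two instances, assuming the text 'aaaa' and Python 3.7 or newer (older versions handle empty matches in re.findall() differently):

```python
import re

greedy = re.findall('a?', 'aaaa')
lazy = re.findall('a??', 'aaaa')

print(greedy)   # ['a', 'a', 'a', 'a', '']
print(lazy)     # ['', 'a', '', 'a', '', 'a', '', 'a', '']
```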
In the first instance, you use the zero-or-one regex 'a?'. It’s greedy so it matches one 'a' character if possible.
In the second instance, you use the non-greedy zero-or-one version 'a??'. It matches zero 'a's if possible. Note that it moves from left to right so it matches the empty string and “consumes” it. Only then, it cannot match the empty string anymore so it is forced to match the first 'a' character. But after that, it’s free to match the empty string again. This pattern of first matching the empty string and only then matching the 'a' if it is absolutely needed repeats. That’s why this strange pattern occurs.
Non-Greedy Asterisk Operator (*?)
Let’s continue with the asterisk (zero-or-more operator):
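A minimal sketch of both variants, again assuming the text 'aaaa' and Python 3.7 or newer:

```python
import re

greedy = re.findall('a*', 'aaaa')
lazy = re.findall('a*?', 'aaaa')

print(greedy)   # ['aaaa', '']
print(lazy)     # ['', 'a', '', 'a', '', 'a', '', 'a', '']
```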
First, you use the zero-or-more asterisk regex 'a*'. It’s greedy so it matches as many 'a' characters as it can.
Second, you use the non-greedy zero-or-more version 'a*?'. Again, it matches zero 'a's if possible. Only if it has already matched zero characters at a certain position, it matches one character at that position, “consumes” it, and moves on.
Non-Greedy Plus Operator (+?)
Finally, consider the plus (one-or-more operator).
First, you use the one-or-more plus regex 'a+'. It’s greedy so it matches as many 'a' characters as it can (but at least one).
Second, you use the non-greedy one-or-more version 'a+?'. In this case, the regex engine matches only one character 'a', consumes it, and moves on with the next match.
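The behavior of the plus operator described above can be checked with a minimal sketch, assuming the text 'aaaa':

```python
import re

greedy = re.findall('a+', 'aaaa')
lazy = re.findall('a+?', 'aaaa')

print(greedy)   # ['aaaa']
print(lazy)     # ['a', 'a', 'a', 'a']
```

Because 'a+?' must match at least one character, it never produces an empty match.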
Let’s summarize what you’ve learned so far:
Greedy vs Non-Greedy Match – What’s the Difference?
Consider a pattern with a quantifier (e.g. the asterisk operator) that allows the regex engine to match the pattern multiple times.
A given string may match the regex in multiple ways. For example, both substrings 'a' and 'aaa' are valid matches when matching the pattern 'a*' in the string 'aaaa'.
So the difference between the greedy and the non-greedy match is the following: The greedy match will try to match as many repetitions of the quantified pattern as possible. The non-greedy match will try to match as few repetitions of the quantified pattern as possible.
Examples Greedy vs Non-Greedy Match
Let’s consider a range of examples that help you understand the difference between greedy and non-greedy matches in Python:
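A few illustrative examples as a minimal sketch. The example strings are made up for this purpose:

```python
import re

# Bounded repetition: greedy takes three 'a's, non-greedy only two.
greedy_range = re.findall('a{2,3}', 'aaaa')
lazy_range = re.findall('a{2,3}?', 'aaaa')
print(greedy_range)   # ['aaa']
print(lazy_range)     # ['aa', 'aa']

# Matching tags: greedy runs to the last '>', non-greedy stops at the first.
greedy_tags = re.findall('<.*>', '<a> text <b>')
lazy_tags = re.findall('<.*?>', '<a> text <b>')
print(greedy_tags)    # ['<a> text <b>']
print(lazy_tags)      # ['<a>', '<b>']
```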
Make sure you completely understand those examples before you move on. If you don’t, please read the previous paragraphs again.
Which is Faster: Greedy vs Non-Greedy?
Considering that greedy quantifiers match a maximal and non-greedy a minimal number of patterns, is there any performance difference?
Great question!
Indeed, some benchmarks suggest that there’s a significant performance difference: the greedy quantifier is 100% slower in realistic experiments on benchmark data.
So if you optimize for speed and you don’t care about greedy or non-greedy matches—and you don’t know anything else—go for the non-greedy quantifier!
However, the truth is not as simple. For example, consider the following basic experiment that falsifies the previous hypothesis that the non-greedy version is faster:
I used the speed testing tool timeit, which lets you pass in some simple Python statements and check how long they run. By default, the passed statement is executed 1,000,000 times.
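A minimal sketch of such an experiment, assuming a text of twelve 'a' characters and a reduced number of repetitions so that it finishes quickly. The absolute timings depend on your machine and Python version, so no numbers are shown here:

```python
import re
import timeit

text = 'a' * 12  # assumed test string: 'aaaaaaaaaaaa'

# Time both variants of re.findall() with the same number of repetitions.
greedy_time = timeit.timeit(lambda: re.findall('a*', text), number=10_000)
lazy_time = timeit.timeit(lambda: re.findall('a*?', text), number=10_000)

print(f'greedy:     {greedy_time:.4f} s')
print(f'non-greedy: {lazy_time:.4f} s')
```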
You can see a notable performance difference of more than 300%! The non-greedy version is three times slower than the greedy version.
Why is that?
The reason is the re.findall() method that returns a list of matching substrings. Here’s the output both statements would produce:
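A minimal sketch of both calls, assuming a text of twelve 'a' characters and Python 3.7 or newer:

```python
import re

text = 'a' * 12

greedy_matches = re.findall('a*', text)
lazy_matches = re.findall('a*?', text)

print(greedy_matches)      # ['aaaaaaaaaaaa', '']
print(lazy_matches)        # alternating '' and 'a', ending with ''
print(len(lazy_matches))   # 25
```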
You can see that the greedy version finds one non-empty match (plus the empty match at the end of the string) and is done with it. The non-greedy version finds 25 matches, which leads to far more processing and memory overhead.
So what happens if you use the re.search() method that returns only the first match rather than the re.findall() method that returns all matches?
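A minimal sketch of the same comparison with re.search(), again assuming a text of twelve 'a' characters. Timings vary between machines, so only the matched text is stated in the comments:

```python
import re
import timeit

text = 'a' * 12

greedy_match = re.search('a*', text)
lazy_match = re.search('a*?', text)

print(repr(greedy_match.group()))   # 'aaaaaaaaaaaa'
print(repr(lazy_match.group()))     # ''

# Time both variants; the results depend on your machine.
greedy_time = timeit.timeit(lambda: re.search('a*', text), number=10_000)
lazy_time = timeit.timeit(lambda: re.search('a*?', text), number=10_000)
print(f'greedy: {greedy_time:.4f} s, non-greedy: {lazy_time:.4f} s')
```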
As expected, this changes things again. Both regex searches yield a single result, but the non-greedy match is much shorter: it matches the empty string '' rather than the whole string 'aaaaaaaaaaaa'. Of course, this is a bit faster.
However, the difference is negligible in this minimal example.
There’s More: Greedy, Docile, Lazy, Helpful, Possessive Match
In this article, I’ve classified the regex world into greedy and non-greedy quantifiers. But you can differentiate the “non-greedy” class even more!
Next, I’ll give you a short overview based on this great article of the most important terms in this regard:
Greedy: match as many instances of the quantified pattern as you can.
Docile: match as many instances of the quantified pattern as long as it still matches the overall pattern—if this is possible. Note that what I called “greedy” in this article is really “docile”.
Lazy: match as few instances of the quantified pattern as needed. This is what I called “non-greedy” in this article.
Possessive: never gives up a partial match. So the regex engine may not even find a match that actually exists, just because it’s so greedy. This is very unusual and you won’t see it a lot in practice.
If you want to learn more about those, I’d recommend that you read this excellent online tutorial.
Where to Go From Here
Summary: You’ve learned that the greedy quantifiers ?, *, and + match as many repetitions of the quantified pattern as possible. The non-greedy quantifiers ??, *?, and +? match as few repetitions of the quantified pattern as possible.
This tutorial makes you a master of character sets in Python. (I know, I know, it feels awesome to see your deepest desires finally come true.)
As I wrote this article, I saw a lot of different terms describing this same powerful concept such as “character class”, “character range”, or “character group”. However, the most precise term is “character set” as introduced in the official Python regex docs. So in this tutorial, I’ll use this term throughout.
Python Regex – Character Set
So, what is a character set in regular expressions?
The character set is (surprise) a set of characters: if you use a character set in a regular expression pattern, you tell the regex engine to choose one arbitrary character from the set. As you may know, a set is an unordered collection of unique elements. So each character in a character set is unique and the order doesn’t really matter (with a few minor exceptions).
Here’s an example of a character set as used in a regular expression:
>>> import re
>>> re.findall('[abcde]', 'hello world!')
['e', 'd']
You use the re.findall(pattern, string) method to match the pattern '[abcde]' in the string 'hello world!'. You can think of all characters a, b, c, d, and e as being in an OR relation: either of them would be a valid match.
The regex engine goes from the left to the right, scanning over the string ‘hello world!’ and simultaneously trying to match the (character set) pattern. Two characters from the text ‘hello world!’ are in the character set—they are valid matches and returned by the re.findall() method.
You can simplify many character sets by using the range symbol '-' that has a special meaning within square brackets: [a-z] reads “match any character from a to z”, while [0-9] reads “match any character from 0 to 9”.
You can even combine multiple character ranges in a single character set:
>>> re.findall('[a-eA-E0-4]', 'hello WORLD 42!')
['e', 'D', '4', '2']
Here, you match three ranges: lowercase characters from a to e, uppercase characters from A to E, and numbers from 0 to 4. Note that the ranges are inclusive so both start and stop symbols are included in the range.
Python Regex Negative Character Set
But what if you want to match all characters—except some? You can achieve this with a negative character set!
The negative character set works just like a character set, but with one difference: it matches all characters that are not in the character set.
Here’s an example where you match all sequences of characters that do not contain the characters a, b, c, d, or e:
>>> import re
>>> re.findall('[^a-e]+', 'hello world')
['h', 'llo worl']
We use the “at-least-once quantifier +” in the example that matches at least one occurrence of the preceding regex (if you’re unsure about how it works, check out my detailed Finxter tutorial about the plus operator).
There are only two such sequences: the one-character sequence ‘h’ and the eight-character sequence ‘llo worl’. You can see that even the empty space matches the negative character set.
Summary: the negative character set matches all characters that are not enclosed in the brackets.
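Here is a small sketch showing that a character set and its negation split the string between them. It also shows that the caret only negates when it is the first character inside the brackets. The string 'a^b' is a made-up sample for this illustration:

```python
import re

text = 'hello world'

# Characters inside the set versus characters outside of it.
inside = re.findall('[a-e]', text)
outside = re.findall('[^a-e]', text)

# Anywhere but the first position, the caret is an ordinary character.
literal_caret = re.findall('[a^]', 'a^b')

print(inside)         # ['e', 'd']
print(outside)        # ['h', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l']
print(literal_caret)  # ['a', '^']
```

Together, the two lists cover every character of the string exactly once.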
How to Fix “re.error: unterminated character set at position”?
Now that you know character sets, you can probably fix this error easily: it occurs if you use an opening bracket ‘[’ in your pattern without a matching closing bracket ‘]’. Maybe you want to match the character ‘[’ in your string?
But Python assumes that you’ve just opened a character set and forgot to close it.
Here’s an example:
>>> re.findall('[', 'hello [world]')
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    re.findall('[', 'hello [world]')
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 532, in _parse
    source.tell() - here)
re.error: unterminated character set at position 0
The error happens because you used the bracket character ‘[‘ as if it was a normal symbol.
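So, how to fix it? Just escape the special bracket character with a single backslash, ‘\[’, and write the pattern as a raw string so that Python passes the backslash on to the regex engine unchanged:
>>> re.findall(r'\[', 'hello [world]')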
['[']
This removes the “special” meaning of the bracket symbol.
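The following sketch collects a few ways to deal with literal brackets. It uses only the standard re module. The variable names are chosen for this illustration:

```python
import re

text = 'hello [world]'

# Backslash in a raw string: the bracket is matched literally.
escaped = re.findall(r'\[', text)

# A closing bracket on its own is not special and needs no escaping.
closing = re.findall(']', text)

# re.escape() escapes all special characters of a literal string for you.
literal_pattern = re.escape('[world]')
literal = re.findall(literal_pattern, text)

# The unescaped pattern fails as soon as it is compiled.
try:
    re.compile('[')
    message = None
except re.error as err:
    message = str(err)

print(escaped)  # ['[']
print(closing)  # [']']
print(literal)  # ['[world]']
print(message)  # unterminated character set at position 0
```

If the text you want to match comes from user input or a file, re.escape() is usually the safer choice because you don't have to remember which characters are special.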
Related Re Methods
There are seven important regular expression methods which you must master:
The re.findall(pattern, string) method returns a list of string matches. Read more in our blog tutorial.
The re.search(pattern, string) method returns a match object of the first match. Read more in our blog tutorial.
The re.match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Read more in our blog tutorial.
The re.fullmatch(pattern, string) method returns a match object if the regex matches the whole string. Read more in our blog tutorial.
The re.compile(pattern) method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in our blog tutorial.
The re.split(pattern, string) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in our blog tutorial.
The re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in our blog tutorial.
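To see the seven functions side by side, here is a minimal sketch that runs each of them once. The sample text and the pattern are assumptions made for this illustration:

```python
import re

text = 'cat hat bat'
pattern = '[chb]at'  # sample pattern: 'c', 'h', or 'b' followed by 'at'

all_matches = re.findall(pattern, text)           # list of matched strings
first_match = re.search('[hb]at', text)           # first match anywhere
start_match = re.match(pattern, text)             # match at the beginning only
full_match = re.fullmatch(pattern, text)          # whole string must match
compiled = re.compile(pattern)                    # reusable regex object
pieces = re.split(' ', text)                      # divide along the matches
replaced = re.sub(pattern, 'dog', text, count=2)  # replace the first two matches

print(all_matches)             # ['cat', 'hat', 'bat']
print(first_match.group())     # hat
print(start_match.group())     # cat
print(full_match)              # None
print(compiled.findall(text))  # ['cat', 'hat', 'bat']
print(pieces)                  # ['cat', 'hat', 'bat']
print(replaced)                # dog dog bat
```

Note that re.search(), re.match(), and re.fullmatch() return None when there is no match, so check the result before you call group() on it.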
You’ve learned everything you need to know about the Python Regex Character Set Operator.
Summary:
If you use a character set [XYZ] in a regular expression pattern, you tell the regex engine to choose one arbitrary character from the set: X, Y, or Z.
Regular expressions are a decades-old concept in computer science. They were invented in the 1950s by the famous mathematician Stephen Cole Kleene, and decades of evolution have brought a huge variety of operations. Collecting all operations and writing up a comprehensive list would result in a very thick and unreadable book by itself.
Fortunately, you don’t have to learn all regular expressions before you can start using them in your practical code projects. Next, you’ll get a quick and dirty overview of the most important regex operations and how to use them in Python. In follow-up chapters, you’ll then study them in detail — with many practical applications and code puzzles.
Here are the most important regex operators:
. The wild-card operator (‘dot’) matches any character in a string except the newline character ‘\n’. For example, the regex ‘...’ matches all words with three characters such as ‘abc’, ‘cat’, and ‘dog’.
* The zero-or-more asterisk operator matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex. For example, the regex ‘cat*’ matches the strings ‘ca’, ‘cat’, ‘catt’, ‘cattt’, and ‘catttttttt’.
? The zero-or-one operator matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. For example, the regex ‘cat?’ matches both strings ‘ca’ and ‘cat’ — but not ‘catt’, ‘cattt’, and ‘catttttttt’.
+ The at-least-one operator matches one or more occurrences of the immediately preceding regex. For example, the regex ‘cat+’ does not match the string ‘ca’ but matches all strings with at least one trailing character ‘t’ such as ‘cat’, ‘catt’, and ‘cattt’.
^ The start-of-string operator matches the beginning of a string. For example, the regex ‘^p’ would match the strings ‘python’ and ‘programming’ but not ‘lisp’ and ‘spying’ where the character ‘p’ does not occur at the start of the string.
$ The end-of-string operator matches the end of a string. For example, the regex ‘py$’ would match the strings ‘main.py’ and ‘pypy’ but not the strings ‘python’ and ‘pypi’.
A|B The OR operator matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. For example, the regex ‘(hello)|(hi)’ matches strings ‘hello world’ and ‘hi python’. It wouldn’t make sense to try to match both of them at the same time.
AB The AND operator matches first the regex A and second the regex B, in this sequence. We’ve already seen it trivially in the regex ‘ca’ that matches first regex ‘c’ and second regex ‘a’.
Note that I gave the above operators more meaningful names so that you can immediately grasp the purpose of each regex. For example, the ‘^’ operator is usually called the ‘caret’ operator. Such names are not descriptive, so I came up with more kindergarten-like words such as the “start-of-string” operator.
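As a quick sanity check of the word lists above, here is a small sketch. The two helper functions are hypothetical names introduced for this illustration: one tests whether a pattern matches a whole string, the other whether it matches somewhere in the string:

```python
import re

def matches_fully(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

def found_in(pattern, candidates):
    """Return the candidates that contain a match of the pattern."""
    return [s for s in candidates if re.search(pattern, s)]

words = ['ca', 'cat', 'catt', 'cattt']

star = matches_fully('cat*', words)
optional = matches_fully('cat?', words)
plus = matches_fully('cat+', words)
dots = matches_fully('...', ['abc', 'cat', 'dogs'])
start = found_in('^p', ['python', 'programming', 'lisp', 'spying'])
end = found_in('py$', ['main.py', 'pypy', 'python', 'pypi'])
either = found_in('(hello)|(hi)', ['hello world', 'hi python', 'bye'])

print(star)      # ['ca', 'cat', 'catt', 'cattt']
print(optional)  # ['ca', 'cat']
print(plus)      # ['cat', 'catt', 'cattt']
print(dots)      # ['abc', 'cat']
print(start)     # ['python', 'programming']
print(end)       # ['main.py', 'pypy']
print(either)    # ['hello world', 'hi python']
```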
Let’s dive into some examples!
Examples
import re

text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
Life and these lips have long been separated:
Death lies on her like an untimely frost
Upon the sweetest flower of all the field.
'''

print(re.findall('.a!', text))
'''
Finds all occurrences of an arbitrary character that is
followed by the character sequence 'a!'.
['Ha!']
'''

print(re.findall('is.*and', text))
'''
Finds all occurrences of the word 'is',
followed by an arbitrary number of characters
and the word 'and'.
['is settled, and']
'''

print(re.findall('her:?', text))
'''
Finds all occurrences of the word 'her',
followed by zero or one occurrences of the colon ':'.
['her:', 'her', 'her']
'''

print(re.findall('her:+', text))
'''
Finds all occurrences of the word 'her',
followed by one or more occurrences of the colon ':'.
['her:']
'''

print(re.findall('^Ha.*', text))
'''
Finds all occurrences where the string starts with
the character sequence 'Ha', followed by an arbitrary
number of characters except for the new-line character.
Can you figure out why Python doesn't find any?
[]
'''

print(re.findall('\n$', text))
'''
Finds all occurrences where the new-line character '\n'
occurs at the end of the string.
['\n']
'''

print(re.findall('(Life|Death)', text))
'''
Finds all occurrences of either the word 'Life' or the
word 'Death'.
['Life', 'Death']
'''
In these examples, you’ve already seen the special symbol \n which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions.
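If the empty result for '^Ha.*' puzzled you, the following sketch shows one likely explanation. It assumes that the triple-quoted text begins with a line break right after the opening quotes, so the string starts with '\n' rather than 'H'. The shortened two-line text is a stand-in for the full quote:

```python
import re

# Assumption: the text starts with a line break after the opening quotes.
text = '''
Ha! let me see her: out, alas! he's cold:
Her blood is settled, and her joints are stiff;
'''

# By default, '^' only matches at the very start of the string.
default_result = re.findall('^Ha.*', text)

# With re.MULTILINE, '^' also matches right after each new-line character.
multiline_result = re.findall('^Ha.*', text, flags=re.MULTILINE)

# Alternatively, strip the leading whitespace before matching.
stripped_result = re.findall('^Ha.*', text.lstrip())

print(default_result)    # []
print(multiline_result)  # ["Ha! let me see her: out, alas! he's cold:"]
print(stripped_result)   # ["Ha! let me see her: out, alas! he's cold:"]
```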
Where to Go From Here?
If you want to master regular expressions once and for all, I’d recommend that you read the massive regular expression tutorial on the Finxter blog — for free!
You may already know Python’s and operator when applied to two Booleans:
>>> True and False
False
>>> True and True
True
Simple enough. Yet, that’s not the whole story: you can use the and operator even on complex data types such as lists or custom objects. So you may ask (and rightly so):
What If You Apply the AND Operator To Two Objects?
To understand the output, you have to understand two things:
How does the and operator work?
What’s the truth value of any object – such as a list?
Python defines the operator as follows: x and y: if x is false, then x, else y. You must understand the deeper meaning of this definition: the operator is short-circuit, which means that as soon as the result is determined, it aborts further processing.
In the x and y operation, if x evaluates to True, Python simply returns the value of y. It doesn’t even check whether y itself is True or False. If x evaluates to False, Python returns x right away and never evaluates y. With Boolean operands, this is the expected behavior: if x is True, then y alone determines whether x and y is True.
This has a notable consequence: if x and y are objects, the result of the operation x and y will be an object, too! (And not a Boolean value.)
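You can watch the short-circuit behavior with a small experiment. The helper function operand() is a hypothetical name introduced for this sketch. It records every value that Python actually evaluates:

```python
evaluated = []  # records which operands Python actually evaluated

def operand(value):
    """Hypothetical helper: log the value, then hand it back unchanged."""
    evaluated.append(value)
    return value

# First operand is empty: the second one is never evaluated.
result_falsy_first = operand([]) and operand([1, 2])
log_falsy_first = list(evaluated)

evaluated.clear()

# First operand is non-empty: the second one is evaluated and returned as is.
result_truthy_first = operand([1]) and operand([])
log_truthy_first = list(evaluated)

print(result_falsy_first, log_falsy_first)    # [] [[]]
print(result_truthy_first, log_truthy_first)  # [] [[1], []]
```

Both results are empty lists, but they come from different operands: the first result is the left operand, the second result is the right operand.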
In combination with the next piece of Python knowledge, this leads to an interesting behavior:
What’s the truth value of any object – such as a list?
The Python convention is simple: if the object is “empty”, the truth value is False. Otherwise, it’s True. So an empty list, an empty string, or a 0 integer value are all False. Most other values will be True.
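Here is a minimal sketch that checks the convention with the built-in bool() function. The class EmptyBox is a hypothetical custom object for this illustration. Python treats it as empty because its __len__ method returns zero:

```python
class EmptyBox:
    """Hypothetical container that always reports a length of zero."""
    def __len__(self):
        return 0

samples = [[], '', 0, [0], ' ', 42, EmptyBox()]

# bool() returns the truth value Python uses in conditions and with 'and'.
truth_values = [bool(sample) for sample in samples]

print(truth_values)
# [False, False, False, True, True, True, False]
```

Note that [0] and ' ' are True: a list with one element and a string with one space are not empty, no matter what they contain.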
Now, you’re equipped with the basics to understand the answer to the following question:
What If You Apply the AND Operator To Two Objects?
Say, you’ve got two non-Boolean objects x and y. What’s the result of the operation x and y?
The answer is simple: the result is y if x is non-empty (and, thus, evaluates to True). Otherwise, the result is x.
What If You Apply the AND Operator To Two Lists?
Here’s an example for two list objects:
>>> [1, 2, 3] and [0, 0, 0, 0]
[0, 0, 0, 0]
The first argument of the and operation is non-empty and evaluates to True. Therefore, the result of the operation is the second list argument [0, 0, 0, 0].
But what if the first argument is empty?
>>> [] and [0, 0, 0, 0]
[]
The result is the first argument (and not a Boolean value False). If you’re in doubt why, consult the above definition again:
x and y: if x is false, then x, else y
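If you like, you can translate this definition into a tiny function and compare it with the real operator. The function name and_like() is made up for this sketch, and unlike the real operator it cannot short-circuit because both arguments are evaluated before the call:

```python
def and_like(x, y):
    """Mirror the rule 'if x is false, then x, else y'.

    Both arguments are already evaluated when the function is called,
    so this sketch only reproduces the result, not the short-circuiting.
    """
    return x if not x else y

pairs = [
    ([1, 2, 3], [0, 0, 0, 0]),
    ([], [0, 0, 0, 0]),
    ('', 'abc'),
    (0, 5),
    (7, None),
]

# Compare the sketch with the built-in operator for every pair.
same_results = all(and_like(x, y) is (x and y) for x, y in pairs)

print(same_results)                       # True
print(and_like([1, 2, 3], [0, 0, 0, 0]))  # [0, 0, 0, 0]
print(and_like([], [0, 0, 0, 0]))         # []
```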
Summary
You’ve learned that the and operator returns the first operand if it evaluates to False, otherwise the second operand.
You’ve also learned that you can use the and operator even for non-Boolean types in which case the result will be an object, not a Boolean value.
Finally, you’ve also learned that an empty object usually evaluates to False.