Posted on Leave a comment

Python Regex Flags

In many functions, you see a third argument flags. What are they and how do they work?

Flags allow you to control the regular expression engine. Because regular expressions are so powerful, they are a useful way of switching on and off certain features (e.g. whether to ignore capitalization when matching your regex).

For example, here’s how the third argument flags is used in the re.findall() method:

re.findall(pattern, string, flags=0)

So the flags argument seems to be an integer argument with the default value of 0. To control the default regex behavior, you simply use one of the predefined integer values. You can access these predefined values via the re library:

Syntax Meaning
re.ASCII If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s and \S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters — as the name suggests.
re.A Same as re.ASCII
re.DEBUG If you use this flag, Python will print some useful information to the shell that helps you debugging your regex.
re.IGNORECASE If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z].
re.I Same as re.IGNORECASE
re.LOCALE Don’t use this flag — ever. It’s depreciated—the idea was to perform case-insensitive matching depending on your current locale. But it isn’t reliable.
re.L Same as re.LOCALE
re.MULTILINE This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now matches also at the end of each line in a multi-line string.
re.M Same as re.MULTILINE
re.DOTALL Without using this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character.
re.S Same as re.DOTALL
re.VERBOSE To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: all whitespace characters and lines that start with the character ‘#’ are ignored in the regex.
re.X Same as re.VERBOSE

How to Use These Flags?

Simply include the flag as the optional flag argument as follows:

import re text = ''' Ha! let me see her: out, alas! he's cold: Her blood is settled, and her joints are stiff; Life and these lips have long been separated: Death lies on her like an untimely frost Upon the sweetest flower of all the field. ''' print(re.findall('HER', text, flags=re.IGNORECASE))
# ['her', 'Her', 'her', 'her']

As you see, the re.IGNORECASE flag ensures that all occurrences of the string ‘her’ are matched—no matter their capitalization.

How to Use Multiple Flags?

Yes, simply add them together (sum them up) as follows:

import re text = ''' Ha! let me see her: out, alas! he's cold: Her blood is settled, and her joints are stiff; Life and these lips have long been separated: Death lies on her like an untimely frost Upon the sweetest flower of all the field. ''' print(re.findall(' HER # Ignored', text, flags=re.IGNORECASE + re.VERBOSE))
# ['her', 'Her', 'her', 'her']

You use both flags re.IGNORECASE (all occurrences of lower- or uppercase string variants of ‘her’ are matched) and re.VERBOSE (ignore comments and whitespaces in the regex). You sum them together re.IGNORECASE + re.VERBOSE to indicate that you want to take both.

Posted on Leave a comment

Python re.findall() – Everything You Need to Know

When I first learned about regular expressions, I didn’t really appreciate their power. But there’s a reason regular expressions have survived seven decades of technological disruption: coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!

This article is all about the findall() method of Python’s re library. The findall() method is the most basic way of using regular expressions in Python: If you want to master them, start here!

So how does the re.findall() method work? Let’s study the specification.

How Does the findall() Method Work in Python?

The re.findall(pattern, string) method finds all occurrences of the pattern in the string and returns a list of all matching substrings.

Specification:

re.findall(pattern, string, flags=0)

The re.findall() method has up to three arguments.

  • pattern: the regular expression pattern that you want to match.
  • string: the string which you want to search for the pattern.
  • flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function. Want to know how to use those flags? Check out this detailed article on the Finxter blog.

We will have a look at each of them in more detail.

Return Value:

The re.findall() method returns a list of strings. Each string element is a matching substring of the string argument.

Let’s check out a few examples!

Examples re.findall()

First, you import the re module and create the text string to be searched for the regex patterns:

import re text = ''' Ha! let me see her: out, alas! he's cold: Her blood is settled, and her joints are stiff; Life and these lips have long been separated: Death lies on her like an untimely frost Upon the sweetest flower of all the field. '''

Let’s say, you want to search the text for the string ‘her’:

>>> re.findall('her', text)
['her', 'her', 'her']

The first argument is the pattern you look for. In our case, it’s the string ‘her’. The second argument is the text to be analyzed. You stored the multi-line string in the variable text—so you take this as the second argument. You don’t need to define the optional third argument flags of the findall() method because you’re fine with the default behavior in this case.

Also note that the findall() function returns a list of all matching substrings. In this case, this may not be too useful because we only searched for an exact string. But if we search for more complicated patterns, this may actually be very useful:

>>> re.findall('\\bf\w+\\b', text)
['frost', 'flower', 'field']

The regex ‘\\bf\w+\\b’ matches all words that start with the character ‘f’.

You may ask: why to enclose the regex with a leading and trailing ‘\\b’? This is the word boundary character that matches the empty string at the beginning or at the end of a word. You can define a word as a sequence of characters that are not whitespace characters or other delimiters such as ‘.:,?!’.

In the previous example, you need to escape the boundary character ‘\b’ again because in a Python string, the default meaning of the character sequence ‘\b’ is the backslash character.

Where to Go From Here?

This article has introduced the re.findall(pattern, string) method that attempts to match all occurrences of the regex pattern in a given string—and returns a list of all matches as strings.

Python is growing rapidly and the world is more and more divided into two classes: those who understand coding and those who don’t. The latter will have larger and larger difficulties participating in the era of massive adoption and penetration of digital content. Do you want to increase your Python skills on a daily basis without investing a lot of time?

Then join my “Coffee Break Python” email list of tens of thousands of ambitious coders!