This article is part of a series of articles on Python Regular Expressions.
It continues the topic and builds on what we’ve previously learned. In this article we’ll discuss:
- Working with Multi-line strings / matches
- Greedy vs. Non-Greedy matching
- Substitution using regular expressions
In the first article of this series, we learned the basics of working with regular expressions in Python.
1. Working with Multi-line Strings
There are a couple of scenarios that may arise when you are working with a multi-line string (separated by newline characters – ‘\n’). One case is that you may want to match something that spans more than one line. Consider this snippet of html:
>>> paragraph = \
... '''
... <p>
... This is a paragraph.
... It has multiple lines.
... </p>
... '''
>>>
We may want to grab the entire paragraph tag (contents and all). We might expect a simple search for ‘<p>.*</p>’ to find it. However, as we see below, this does not work.
>>> import re
>>> re.search(r'<p>.*</p>', paragraph)
>>>
The problem with this regular expression search is that, by default, the ‘.’ special character does not match newline characters.
There is an easy fix for this though. The ‘re’ module’s query methods can optionally accept some predefined flags which modify how special characters behave.
The re.DOTALL flag tells Python to make the ‘.’ special character match all characters, including newline characters. Let’s try it out:
>>> match = re.search(r'<p>.*</p>', paragraph, re.DOTALL)
>>> match.group(0)
'<p>\nThis is a paragraph.\nIt has multiple lines.\n</p>'
>>>
Perfect, using the re.DOTALL flag, we can match patterns that span multiple lines.
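As a small self-contained sketch of the point above, the script below runs the same search with and without the flag. The variable names are only for illustration; re.S is the short alias for re.DOTALL, and (?s) is the inline form of the same flag.

```python
import re

# The same paragraph as above, written as a single string literal.
paragraph = '\n<p>\nThis is a paragraph.\nIt has multiple lines.\n</p>\n'

# Without a flag, '.' stops at newlines, so nothing can span the lines.
no_flag = re.search(r'<p>.*</p>', paragraph)

# With re.DOTALL (alias re.S), '.' also matches '\n'.
with_flag = re.search(r'<p>.*</p>', paragraph, re.DOTALL)

# The inline flag (?s) at the start of the pattern has the same effect.
inline = re.search(r'(?s)<p>.*</p>', paragraph)

print(no_flag)                                # None
print(with_flag.group(0) == inline.group(0))  # True
```

The inline form can be handy when the pattern is stored somewhere (a config file, for example) and the calling code cannot pass flags.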
Another scenario that could arise when working with multi-line strings is that we may only want to pick out lines that start or end with a certain pattern. Using the same paragraph, we might expect a search anchored with ‘^’ to find the third line of text (the line ‘It has multiple lines.’). However, as shown below, this is not the case.
>>> re.search(r'^It has.*', paragraph)
>>>
By default in Python, the ‘^’ and ‘$’ special characters (which we might expect to match the start and end of a line, respectively) only apply to the start and end of the entire string (‘$’ also matches just before a newline at the very end of the string).
Thankfully, there is a flag to modify this behavior as well. The re.MULTILINE flag tells Python to make the ‘^’ and ‘$’ special characters match the start or end of any line within a string. Using this flag:
>>> match = re.search(r'^It has.*', paragraph, re.MULTILINE)
>>> match.group(0)
'It has multiple lines.'
>>>
We get the behavior we expect.
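The two flags can also be used together. The sketch below is one possible illustration, not the only way to do it: it uses re.MULTILINE with findall() to collect every line that looks like a sentence, then combines both flags with the ‘|’ operator to match the whole tag while anchoring on line boundaries. The patterns and variable names are made up for this example.

```python
import re

paragraph = '\n<p>\nThis is a paragraph.\nIt has multiple lines.\n</p>\n'

# With re.MULTILINE, '^' and '$' anchor at the start and end of every line.
# This picks out lines that start with a word character and end with a period.
sentence_lines = re.findall(r'^\w.*\.$', paragraph, re.MULTILINE)
print(sentence_lines)  # ['This is a paragraph.', 'It has multiple lines.']

# Flags are combined with the bitwise OR operator.
# Here '<p>' and '</p>' must each sit alone on a line (MULTILINE),
# while '.*' is allowed to run across the lines in between (DOTALL).
whole_tag = re.search(r'^<p>$.*^</p>$', paragraph, re.MULTILINE | re.DOTALL)
print(whole_tag.group(0))
```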
2. Greedy vs Non-Greedy Matches
Sometimes, if we are not careful with the use of special characters, our regular expressions find more than we expected them to.
This is because by default, regular expressions are greedy (i.e. they will match as much as possible). Consider this next example:
>>> htmlSnippet = '<h1>This is the Title</h1>'
>>>
If we were to write a regular expression query to pick out only the html tags from this snippet, we might first naively try the following:
>>> re.findall(r'<.*>', htmlSnippet)
['<h1>This is the Title</h1>']
However, we see that (perhaps unexpectedly) this matched the entire snippet.
This is a good example of how regular expressions are greedy by default: the ‘.*’ portion of the regular expression expanded as much as it possibly could while still satisfying the match. We can tell Python not to be greedy (i.e. to stop expanding as soon as the smallest matching substring is found) by placing a ‘?’ character after the expansion character (after the ‘*’, ‘+’, …):
>>> re.findall(r'<.*?>', htmlSnippet)
['<h1>', '</h1>']
By following the ‘*’ expansion character with a ‘?’, we are telling Python to expand only to the smallest possible match, and we get the behavior we were looking for.
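The short script below puts the greedy and non-greedy forms side by side. It also shows a common alternative that this article does not otherwise cover: a negated character class (‘[^>]*’), which avoids the issue by never allowing the match to run past a ‘>’. The variable names are illustrative only.

```python
import re

html_snippet = '<h1>This is the Title</h1>'

# Greedy: '.*' runs to the last '>' in the string.
greedy = re.findall(r'<.*>', html_snippet)

# Non-greedy: '.*?' stops at the first '>' that completes a match.
lazy = re.findall(r'<.*?>', html_snippet)

# Alternative: match anything except '>' between the angle brackets.
negated = re.findall(r'<[^>]*>', html_snippet)

# The '?' suffix works on '+' as well.
greedy_plus = re.findall(r'a+', 'aaa')
lazy_plus = re.findall(r'a+?', 'aaa')

print(greedy)       # ['<h1>This is the Title</h1>']
print(lazy)         # ['<h1>', '</h1>']
print(negated)      # ['<h1>', '</h1>']
print(greedy_plus)  # ['aaa']
print(lazy_plus)    # ['a', 'a', 'a']
```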
3. Substitution with Regular Expressions
Another task that the re module lets you perform with regular expressions is substitution within a string. The sub() function takes a regular expression and a phrase, just like the query methods we’ve seen so far, but we also hand in the string to replace each match with. You can do straightforward substitutions like this:
>>> re.sub(r'\w+', 'word', 'This phrase contains 5 words')
'word word word word word'
This replaces every found word with the literal string ‘word’. You can also reference the match in the replace string using grouping (we learned about grouping in the previous article):
>>> re.sub(r'(?P<firstLetter>\w)\w*', r'\g<firstLetter>', 'This phrase contains 5 words')
'T p c 5 w'
In this case, we capture the first letter of each word in the ‘firstLetter’ group, and then call upon it in the replace string using the ‘\g<name>’ syntax. Had we not been using named groups, we could have specified the group number instead of the group name:
>>> re.sub(r'(\w)\w*', r'\g<1>', 'This phrase contains 5 words')
'T p c 5 w'
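For numbered groups, the replacement string also accepts the shorter ‘\1’ form. The sketch below shows both forms, along with one case where ‘\g<1>’ is the safer choice: when the group reference is immediately followed by a digit. The example strings and variable names are invented for illustration.

```python
import re

phrase = 'This phrase contains 5 words'

# '\1' is the short form of '\g<1>'.
short_form = re.sub(r'(\w)\w*', r'\1', phrase)

# Group 1 followed by a literal '0'. Writing r'\10' instead would be
# read as a reference to group 10, so '\g<1>0' removes the ambiguity.
with_zero = re.sub(r'(\w)\w*', r'\g<1>0', phrase)

# Several groups can be referenced in any order, here to swap two words.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

print(short_form)  # 'T p c 5 w'
print(with_zero)   # 'T0 p0 c0 50 w0'
print(swapped)     # 'world hello'
```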
Sometimes, our replacement needs are more complex than what can be specified in a simple replacement string.
For this, the sub() method can also accept a replacement function instead of a replacement string literal.
The replacement function should accept a single argument, which will be a match object and return a string. The sub() method will call this function on each match found, and replace the matching content with the function’s return value.
To demonstrate this, let’s write a function that will allow us to make an arbitrary string more url-friendly (i.e. we will convert all characters to lowercase and replace each series of spaces with a single ‘_’ character).
>>> def slugify(matchObj):
...     matchString = matchObj.group(0)
...     if matchString.isalnum():
...         return matchString.lower()
...     else:
...         return '_'
...
>>>
Our function accepts a match object and returns a string, just as is required by the sub function.
Now we can use this function to ‘slugify’ an arbitrary string. We match either a series of word characters or a series of spaces. The ‘|’ special character is essentially the OR operator for regular expressions: to be a valid match, the content must match either the pattern to the left of the ‘|’ or the pattern to the right of it. The sub() method will pass each match object to our slugify() function:
>>> re.sub(r'\w+|\s+', slugify, 'This iS a NAME')
'this_is_a_name'
Notice that we pass a reference to the function object into the sub() method (i.e. we don’t invoke the function). Remember that the sub() method is going to invoke the slugify function on each match object for us.
Taking a minute to understand the flow of this last example will not only teach you how the sub() method works, but also about some fundamentals of Python. Python treats functions as first class citizens. They can be handed around just like any other object can be (in fact, functions are objects in Python).
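To make the "functions are objects" point a little more concrete, here is a minimal sketch in which replacement functions are stored in a dictionary and picked by name at call time. The names shout, whisper, styles and restyle are hypothetical and exist only for this example.

```python
import re

def shout(match):
    # Replace the matched word with its uppercase form.
    return match.group(0).upper()

def whisper(match):
    # Replace the matched word with its lowercase form.
    return match.group(0).lower()

# Functions are ordinary objects, so they can be stored in a dict
# and looked up later, just like strings or numbers.
styles = {'shout': shout, 'whisper': whisper}

def restyle(text, style):
    # The function object is passed to sub() without being called;
    # sub() calls it once for each match.
    return re.sub(r'\w+', styles[style], text)

print(restyle('Hello World', 'shout'))    # 'HELLO WORLD'
print(restyle('Hello World', 'whisper'))  # 'hello world'
```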
As always, to learn even more about regular expressions and the ‘re’ module, check out the official Python documentation for re (https://docs.python.org/2/library/re.html).
In this article we’ve dived a little deeper into working with regular expressions in Python. We learned about working with multi-line strings and how the re.DOTALL and re.MULTILINE flags can change the behavior of some special characters to better suit our needs. We also talked about how regular expressions are greedy by default (they expand to create the largest match), but we can alter this behavior using the ‘?’ character. Finally, we talked about how we can do substitutions within a string, specifying the replacement with either string literals or a replacement function.