
How to Use Python Regular Expressions to Parse a Text File (Practical Use Case Scenario with Python Reg-Ex Re Split Sub)

In the past few articles in the Python series, we’ve learned a lot about working with regular expressions in Python.

In this article, we’ll explain how to use Python regular expressions for a realistic task.

We’ll do a step-by-step walkthrough of how to build Python data structures from formatted flat text files.

If you are new to Python regular expressions, the following two articles will help:

  1. Getting started with python reg-ex using re.match search findall
  2. Advanced python reg-ex examples – Multi-line, substitution, greedy/non-greedy matching

In this article, our example starts with some formatted flat text. This data could have come from a text file containing profile information for a dating site:

>>> rawProfiles = '''
... Tim Fake, 1982/03/21, I like to
... eat, sleep and
... relax
... 
... Lisa Test, 1990/05/12, I like long
... walks of the beach, watching sun-sets,
... and listening to slow jazz
... '''
>>>

The format of this text is:

<name>, <birth-date>, <description>

However, the description can span multiple lines, and profiles are separated from each other by at least one blank line. We can use the split() function from the ‘re’ module to process this raw text. First, we separate the profiles:

>>> import re
>>> profilesList = re.split(r'\n{2,}', rawProfiles)
>>> profilesList
['\nTim Fake, 1982/03/21, I like to\neat, sleep and\nrelax', 'Lisa Test, 1990/05/12, I like long\nwalks of the beach, watching sun-sets,\nand listening to slow jazz\n']

The {} quantifier specifies how many repetitions to match. In our case, ‘\n{2,}’ matches a run of at least 2 newline characters; because we didn’t specify an upper limit, the run can be arbitrarily long. This corresponds to the format of the text: each profile is separated by at least one blank line (i.e. at least 2 consecutive newline characters).
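
As a quick check of the {2,} repetition, here is a minimal sketch on a made-up sample string (not the profile data) whose records are separated by one, two, and three newlines. The second pattern, which also tolerates "blank" lines containing spaces or tabs, is an assumption about messier input and is not something the example data requires.

```python
import re

# Hypothetical sample: "a" and "b" are separated by one newline,
# "b" and "c" by two newlines, "c" and "d" by three.
sample = "a\nb\n\nc\n\n\nd"

# \n{2,} only matches runs of two or more newlines, so the single
# newline between "a" and "b" is left alone.
parts = re.split(r'\n{2,}', sample)
print(parts)  # ['a\nb', 'c', 'd']

# A separator line that contains spaces is not two consecutive
# newlines, so \n{2,} does not split here.
messy = "a\n  \nb"
strict = re.split(r'\n{2,}', messy)
print(strict)  # ['a\n  \nb']

# Assumed variant: a newline followed by one or more lines that hold
# only spaces or tabs.
loose = re.split(r'\n(?:[ \t]*\n)+', messy)
print(loose)  # ['a', 'b']
```

Whether the looser pattern is appropriate depends on how the file was produced; for the sample text in this article the simpler pattern is enough.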

Now we have a list of raw profiles. Before we do anything else, let’s take care of the newline characters inside each profile. They are there because a profile can span multiple lines. For now, we’ll just replace each one with a ‘ ’ character using the sub() function:

>>> profilesList = [ re.sub(r'\n', ' ', profile) for profile in profilesList ]
>>> profilesList
[' Tim Fake, 1982/03/21, I like to eat, sleep and relax', 'Lisa Test, 1990/05/12, I like long walks of the beach, watching sun-sets, and listening to slow jazz ']
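
If lines in the source file end with trailing spaces, replacing only the newline can leave two spaces in a row. The sketch below shows one possible way to handle that, using a made-up profile string; the wider pattern is an assumption and not part of the original walkthrough.

```python
import re

# Hypothetical profile whose first line ends with a trailing space.
profile = 'Lisa Test, 1990/05/12, I like long \nwalks of the beach'

# Replacing only the newline keeps the trailing space as well.
simple = re.sub(r'\n', ' ', profile)
print(repr(simple))

# Treating the newline together with any spaces or tabs around it
# as one unit leaves a single space behind.
collapsed = re.sub(r'[ \t]*\n[ \t]*', ' ', profile)
print(repr(collapsed))
```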

The next step is to separate each profile into its individual fields. We could do this using matching and grouping (see the previous article on regex basics), but I’m going to use split() a second time. (The examples here rely on list comprehensions; for a more detailed look at them, see my previous article on that topic.)

>>> profilesList = [ re.split(r',', profile, maxsplit=2) for profile in profilesList ]
>>> for profile in profilesList:
...    print(profile)
... 
[' Tim Fake', ' 1982/03/21', ' I like to eat, sleep and relax']
['Lisa Test', ' 1990/05/12', ' I like long walks of the beach, watching sun-sets, and listening to slow jazz ']

Notice that because we specified the maxsplit keyword parameter, split() left the descriptions alone, even though they also contain ‘,’ characters. The maxsplit parameter tells split() to perform at most that many splits, no matter how many matches are found. In our case, we told split() to split only on the first 2 ‘,’ characters, creating 3 fields.
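
For comparison, the matching-and-grouping alternative mentioned earlier could look roughly like the sketch below. The pattern and the name PROFILE_RE are illustrative assumptions, not code from the original walkthrough. The pattern expects the date in YYYY/MM/DD form and trims the whitespace around each field as part of the match.

```python
import re

# Illustrative pattern (an assumption, not the article's own code):
#   group 1: name, everything up to the first comma
#   group 2: birth date in YYYY/MM/DD form
#   group 3: description, which may itself contain commas
PROFILE_RE = re.compile(
    r'\s*([^,]+?)\s*,\s*(\d{4}/\d{2}/\d{2})\s*,\s*(.*\S)\s*$',
    re.DOTALL,
)

profile = ' Tim Fake, 1982/03/21, I like to eat, sleep and relax'

match = PROFILE_RE.match(profile)
if match:
    fields = list(match.groups())
    print(fields)

# A string that does not follow the format gives no match at all,
# where split() would quietly return fewer than 3 fields.
bad = PROFILE_RE.match('no commas here')
print(bad)
```

The trade-off is a pattern that is harder to read than two split() calls, in exchange for catching malformed records early.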

Our example is progressing well. We now have a list of user profiles, with each profile broken up into its fields. However, the data is still messy: there is stray whitespace at the edges of some fields.

Let’s clean this up:

>>> profilesList = [ list(map(str.strip, profile)) for profile in profilesList ]
>>> for profile in profilesList:
...    print(profile)
... 
['Tim Fake', '1982/03/21', 'I like to eat, sleep and relax']
['Lisa Test', '1990/05/12', 'I like long walks of the beach, watching sun-sets, and listening to slow jazz']

The built-in map() function takes a function and an iterable, and applies the function to each element. Here it applies the string strip() method to each field of a profile. In Python 3, map() returns a lazy iterator rather than a list, which is why the call is wrapped in list(). For more information on the regular expression functions used in this article, see the official Python re documentation.

Our example has come to an end. We have successfully structured our user profile data.

We could easily take this list and use it to instantiate User Profile objects within our system, display user profiles on a web-page, or persist profile data in a database.
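
As one possible next step, the sketch below wraps the steps from this article in a function and returns simple profile records. It is written for Python 3. The names Profile and parse_profiles, the field names, and the conversion of the birth date to a date object are assumptions made for illustration. The sketch also strips the raw text before splitting and assumes every profile is well formed.

```python
import re
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; the field names are assumptions.
Profile = namedtuple('Profile', ['name', 'birth_date', 'description'])

def parse_profiles(raw_text):
    """Turn the raw flat text into a list of Profile records."""
    profiles = []
    # Strip first so leading/trailing newlines do not end up in a profile.
    for chunk in re.split(r'\n{2,}', raw_text.strip()):
        flat = re.sub(r'\n', ' ', chunk)
        # Assumes each chunk has at least two commas (three fields).
        name, born, description = [
            field.strip() for field in re.split(r',', flat, maxsplit=2)
        ]
        birth_date = datetime.strptime(born, '%Y/%m/%d').date()
        profiles.append(Profile(name, birth_date, description))
    return profiles

raw_profiles = '''
Tim Fake, 1982/03/21, I like to
eat, sleep and
relax

Lisa Test, 1990/05/12, I like long
walks of the beach, watching sun-sets,
and listening to slow jazz
'''

profiles = parse_profiles(raw_profiles)
for p in profiles:
    print(p.name, p.birth_date.isoformat())
```

A malformed profile would raise a ValueError here, either when unpacking the fields or when parsing the date; real input would likely need that case handled explicitly.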


Comments on this entry are closed.

  • Mike October 7, 2014, 11:03 am

    This is most helpful. Thanks.

  • armando November 26, 2014, 8:29 am

    how do you do it if the text is in a file, without having to read the whole file in memory? E.g. the description could be very long, the file very large etc. Is there a way not to read chunks of the file in memory?

  • Ariel April 30, 2017, 1:41 am

    How do you export the result into CSV?

    Thanks