
How to Parse XML and Strip Tags using XPATH Examples in Linux (How to Combine Multiple Commands Using PIPE in Linux)

This tutorial explains the process of building useful multi-part commands piece by piece.

To build complex commands in the terminal, we need to understand piping. Piping is basically taking the output of one command and sending it to another command as input. This is done with the | (pipe) symbol.
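For example, here is a minimal pipe chain (the input strings are made up for illustration):

```shell
# sort three lines, then keep only the first -- the output of
# sort becomes the input of head
printf 'banana\napple\ncherry\n' | sort | head -n 1
```

This prints `apple`: sort reorders the lines, and head trims the result.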

Last month, a small project required me to repeatedly read similar XML files to provide test data for another program. I would have to do it so frequently that it would be annoying to have to download, save, parse and repeat. The basic requirements were:

  1. Get XML from URL
  2. Parse the XML and select only two attributes of all elements
  3. Strip the tags so only the content remains
  4. Send to standard output

1. Prove the command line can parse XML

I had used the Ruby library REXML::XPath for a script last year, and I remembered there was a Perl version available on the command line. You can install it with CPAN:

$ cpan XML::XPath

Let’s use a sample employee file to play with the idea. Open this employees.xml file in a browser and save it as employees.xml.

Now we have our xpath command and a file to play with.

Test it with a simple path:

$ xpath employees.xml '/DIRECTORY/EMPLOYEE/FIRST_NAME'
-- NODE --
<FIRST_NAME>Steven</FIRST_NAME>-- NODE --
<FIRST_NAME>Susan</FIRST_NAME>-- NODE --
<FIRST_NAME>Marigold</FIRST_NAME>-- NODE --
...
<FIRST_NAME>Sunny</FIRST_NAME>-- NODE --
<FIRST_NAME>Flo</FIRST_NAME>

Excellent! It prints the FIRST_NAME element of each EMPLOYEE on the selected path. But how do we select multiple XPath elements? Looking at the XPath syntax, we see a way: combining two XPath expressions with the | character creates a union (OR) expression.

$ xpath employees.xml '/DIRECTORY/EMPLOYEE/FIRST_NAME | /DIRECTORY/EMPLOYEE/LAST_NAME'
-- NODE --
<FIRST_NAME>Steven</FIRST_NAME>-- NODE --
<LAST_NAME>Sanguini</LAST_NAME>-- NODE --
<FIRST_NAME>Susan</FIRST_NAME>-- NODE --
<LAST_NAME>Aquilegia</LAST_NAME>-- NODE --
...
<FIRST_NAME>Flo</FIRST_NAME>-- NODE --
<LAST_NAME>Lobalessia</LAST_NAME>

Notice that here | is interpreted as the XPath union (OR) operator, not as a shell pipe.
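That works because the whole expression is wrapped in single quotes, which stop the shell from interpreting the | itself. A toy demonstration:

```shell
# single quotes keep | literal; the shell never treats it as a pipe
echo 'a | b'
```

Without the quotes, the shell would split this into two commands at the |.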

Also, in this statement we are selecting FIRST_NAME as well as LAST_NAME. Why does OR select both? The union evaluates each node in the XML document separately; if a node matches either expression, it passes evaluation and gets passed to the output.
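grep’s alternation behaves the same way, which can make the logic easier to see (the sample lines here are made up):

```shell
# each input line is tested independently; a line passes if it
# matches either pattern in the alternation
printf 'FIRST_NAME\nLAST_NAME\nSTORE_NUMBER\n' | grep -E 'FIRST_NAME|LAST_NAME'
```

Both FIRST_NAME and LAST_NAME lines survive; STORE_NUMBER matches neither pattern and is dropped.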

2. Download XML and send to STDOUT

This next step actually comes earlier on the command line, but we will build it separately. I prefer to build the hardest, or “you can’t do that,” parts first as a proof of concept. It would be pointless to do the surrounding command-line work if Step One cannot work.

cURL is a powerful command for HTTP interactions. These curl examples will get you started in the right direction.

We specify a location, following redirects if needed. For this, use this option: -L 'http://www.thegeekstuff.com/scripts/employees.xml'

We turn off cURL’s progress output and specify a GET request. For this, use these options: -s -G

So let us test our command on the URL for the file we downloaded previously:

$ curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml'
<?xml version="1.0" encoding="UTF-8"?>
<DIRECTORY>
<EMPLOYEE>
<FIRST_NAME>Steven</FIRST_NAME>
<LAST_NAME>Sanguini</LAST_NAME>
<STORE_NUMBER>4</STORE_NUMBER>
<SHIFT>FIRST</SHIFT>
<AUM>$2.44</AUM>
<ID>031599</ID>
</EMPLOYEE>
...

curl writes to STDOUT by default, which is good, since we are now going to pipe its output to xpath, removing the file argument:

$ curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml' | xpath \
'/DIRECTORY/EMPLOYEE/LAST_NAME | /DIRECTORY/EMPLOYEE/ID'
-- NODE --
<LAST_NAME>Sanguini</LAST_NAME>-- NODE --
<ID>031599</ID>-- NODE --
<LAST_NAME>Aquilegia</LAST_NAME>-- NODE --
<ID>030699</ID>-- NODE --
...
<LAST_NAME>Lobalessia</LAST_NAME>-- NODE --
<ID>022299</ID>

This produces the expected output. Great! Not sure why, but xpath sends the ‘-- NODE --’ markers to standard error (STDERR). But we’ll see a possible reason later.
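You can see the two streams being separated with a quick sketch:

```shell
# run a command that writes to both streams, then discard STDERR;
# only the STDOUT line survives
sh -c 'echo "to stdout"; echo "to stderr" >&2' 2>/dev/null
```

Only `to stdout` is printed; the STDERR line was thrown away by `2>/dev/null`, just as xpath’s markers would be.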

3. Strip XML Tags

Now we need to strip those tags so only the content remains. Sed is the best tool for on-the-fly regular expression substitutions. Learning regular expressions is outside the scope of this article.

Please see our series of articles on Python Regular Expressions for more information.

When making complicated commands with multiple arguments and flags, I find it best to work with a simple example until I get it just right, then paste into context with the real arguments. We pipe a simple string to sed for a test substitution. Sed works on STDIN by default.

$ echo "This<strong> is </strong>a test." | sed -re 's/i//g'
Ths<strong> s </strong>a test.

Ok. That works. Now rewrite the search to replace a tag.

$ echo "This<strong> is </strong>a test." | sed -re 's/<\w+>//g'
This is </strong>a test.

Good. Now let’s remove the closing tag as well, by adding ‘/’, escaped with a ‘\’ prefix and made optional with a ‘?’ suffix:

$ echo "This<strong> is </strong>a test." | sed -re 's/<\/?\w+>//g'
This is a test.

Perfect. Exactly what we expected.
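One caveat: \w+ only matches bare tag names, so a tag carrying attributes, such as <strong class="x">, would slip through. A slightly more general pattern (a sketch, not a full HTML parser) is:

```shell
# [^>]+ matches everything up to the closing bracket, attributes included
echo 'This<strong class="x"> is </strong>a test.' | sed -re 's/<\/?[^>]+>//g'
```

For the plain tags in employees.xml the original \w+ version is enough, but this form is safer on messier markup.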

4. Putting it all together

Now that we have created the individual parts of our command, we paste them together in logical order, joined by | .

curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml' | \
xpath '/DIRECTORY/EMPLOYEE/LAST_NAME | /DIRECTORY/EMPLOYEE/ID' | \
sed -re 's/<\/?\w+>//g'

Output:

Found 72 nodes:
-- NODE --
-- NODE --
...
Sanguini031599Aquilegia030699...

Uh oh! Maybe this is why the ‘-- NODE --’ markers are there. If we redirect this to a file, the NODE text does not follow: it is sent to standard error (STDERR). But we can redirect STDERR to STDOUT with `2>&1` and strip the markers with another substitution, `sed -re 's/-- NODE --//g'`, in the same manner as the tags.

curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml' | \
xpath '/DIRECTORY/EMPLOYEE/LAST_NAME | /DIRECTORY/EMPLOYEE/ID' 2>&1 | \
sed -re 's/-- NODE --//g' | sed -re 's/<\/?\w+>//g'

Output:

Found 72 nodes:
Sanguini
031599
Aquilegia
030699
...
Lobalessia
022299

Perfect. Now, as I work on my project, I can quickly get sample data from XML files on the web to STDOUT without all the hassle of saving files or running some complicated software. We can even pipe this to `tail -n +3` to cut off those first two response lines.
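`tail -n +3` prints everything from line 3 onward. A quick test with simulated output (the lines are made up to mimic the response above):

```shell
# the first two lines stand in for the header and marker noise;
# tail -n +3 starts printing at the third line
printf 'Found 72 nodes:\n-- NODE --\nSanguini\n031599\n' | tail -n +3
```

Only the data lines remain.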

This article is just one example of various things that you can do if you learn how to combine multiple commands using pipe.



  • Kpinzou December 17, 2014, 10:15 am

    very helpful.

  • Philip January 3, 2017, 9:08 am

    hi,
    This is a shorter version using xpath that gets the data from any level and only gets the value, without the element name, so you don’t need the sed stripping:

    curl -s 'http://www.thegeekstuff.com/scripts/employees.xml' | xpath -e '//LAST_NAME/text()|//ID/text()' 2>/dev/null

    The 2>/dev/null removes the NODE stuff, the text() gets the data only, and the // makes xpath match at any level.
    Philip
