
How to Parse XML and Strip Tags using XPATH Examples in Linux (How to Combine Multiple Commands Using PIPE in Linux)

This tutorial explains the process of building useful multi-part commands piece by piece.

To build complex commands in the terminal, we need to understand piping. Piping is basically taking the output of one command and sending it to another command as input. This is done with the | (pipe) symbol.
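For example, here is a minimal pipe chain (the input strings are made up for illustration):

```shell
# sort three lines, then keep only the first -- the output of
# sort becomes the input of head
printf 'banana\napple\ncherry\n' | sort | head -n 1
```

This prints `apple`: sort reorders the lines, and head trims the result.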

Last month, a small project required me to repeatedly read similar XML files to provide test data for another program. I would have to do it so frequently that it would be annoying to have to download, save, parse and repeat. The basic requirements were:

  1. Get XML from URL
  2. Parse the XML and select only two attributes of all elements
  3. Strip the tags so only the content remains
  4. Send to standard output

1. Prove the command line can parse XML

I had used the Ruby library REXML::XPath for a script last year, and I remembered there was a Perl version available on the command line. You can install it with CPAN:

$ cpan XML::XPath

Let’s use a sample employee file to play with the idea. Open this employees.xml file in a browser and save it as employees.xml.

Now we have our xpath command and a file to play with.

Test it with a simple path:

$ xpath employees.xml '/DIRECTORY/EMPLOYEE/FIRST_NAME'
-- NODE --
<FIRST_NAME>Steven</FIRST_NAME>-- NODE --
<FIRST_NAME>Susan</FIRST_NAME>-- NODE --
<FIRST_NAME>Marigold</FIRST_NAME>-- NODE --
...
<FIRST_NAME>Sunny</FIRST_NAME>-- NODE --
<FIRST_NAME>Flo</FIRST_NAME>

Excellent! It prints the FIRST_NAME element of each EMPLOYEE on the selected path. But how do we select multiple XPath elements? Looking at the XPath syntax, we see a way: combining two XPath expressions with the | character creates a union (OR) expression.

$ xpath employees.xml '/DIRECTORY/EMPLOYEE/FIRST_NAME | /DIRECTORY/EMPLOYEE/LAST_NAME'
-- NODE --
<FIRST_NAME>Steven</FIRST_NAME>-- NODE --
<LAST_NAME>Sanguini</LAST_NAME>-- NODE --
<FIRST_NAME>Susan</FIRST_NAME>-- NODE --
<LAST_NAME>Aquilegia</LAST_NAME>-- NODE --
...
<FIRST_NAME>Flo</FIRST_NAME>-- NODE --
<LAST_NAME>Lobalessia</LAST_NAME>

Notice that here | is interpreted as the XPath union (OR) operator, not as a shell pipe.
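That works because the whole expression is wrapped in single quotes, which stop the shell from interpreting the | itself. A toy demonstration:

```shell
# single quotes keep | literal; the shell never treats it as a pipe
echo 'a | b'
```

Without the quotes, the shell would split this into two commands at the |.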

Also, in this statement we are selecting FIRST_NAME as well as LAST_NAME. Why does OR select both? The union evaluates each node in the XML document separately; if a node matches either expression, it passes evaluation and gets passed to the output.
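grep’s alternation behaves the same way, which can make the logic easier to see (the sample lines here are made up):

```shell
# each input line is tested independently; a line passes if it
# matches either pattern in the alternation
printf 'FIRST_NAME\nLAST_NAME\nSTORE_NUMBER\n' | grep -E 'FIRST_NAME|LAST_NAME'
```

Both FIRST_NAME and LAST_NAME lines survive; STORE_NUMBER matches neither pattern and is dropped.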

2. Download XML and send to STDOUT

This next step actually comes earlier on the command line, but we will build it separately. I prefer to build the hardest, or “you can’t do that,” parts first as a proof of concept. It would be pointless to do the surrounding command-line work if Step One cannot work.

cURL is a powerful command for HTTP interactions. These curl examples will get you started in the right direction.

We specify a location, following redirects if needed. For this, use this option: -L 'http://www.thegeekstuff.com/scripts/employees.xml'

We turn off cURL’s progress output and specify a GET request. For this, use these options: -s -G

So let us test our command on the URL for the file we downloaded previously:

$ curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml'
<?xml version="1.0" encoding="UTF-8"?>
<DIRECTORY>
<EMPLOYEE>
<FIRST_NAME>Steven</FIRST_NAME>
<LAST_NAME>Sanguini</LAST_NAME>
<STORE_NUMBER>4</STORE_NUMBER>
<SHIFT>FIRST</SHIFT>
<AUM>$2.44</AUM>
<ID>031599</ID>
</EMPLOYEE>
...

curl writes to STDOUT by default, which is good, since we are now going to pipe its output to xpath, removing the file argument:

$ curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml' | xpath \
'/DIRECTORY/EMPLOYEE/LAST_NAME | /DIRECTORY/EMPLOYEE/ID'
-- NODE --
<LAST_NAME>Sanguini</LAST_NAME>-- NODE --
<ID>031599</ID>-- NODE --
<LAST_NAME>Aquilegia</LAST_NAME>-- NODE --
<ID>030699</ID>-- NODE --
...
<LAST_NAME>Lobalessia</LAST_NAME>-- NODE --
<ID>022299</ID>

This produces the expected output. Great! Not sure why, but xpath sends the ‘-- NODE --’ markers to standard error (STDERR). But we’ll see a possible reason later.
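You can see the two streams being separated with a quick sketch:

```shell
# run a command that writes to both streams, then discard STDERR;
# only the STDOUT line survives
sh -c 'echo "to stdout"; echo "to stderr" >&2' 2>/dev/null
```

Only `to stdout` is printed; the STDERR line was thrown away by `2>/dev/null`, just as xpath’s markers would be.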

3. Strip XML Tags

Now we need to strip those tags so only the content remains. Sed is the best tool for on-the-fly regular expression substitutions. Learning regular expressions is outside the scope of this article.

Please see our series of articles on Python Regular Expressions for more information.

When making complicated commands with multiple arguments and flags, I find it best to work with a simple example until I get it just right, then paste into context with the real arguments. We pipe a simple string to sed for a test substitution. Sed works on STDIN by default.

$ echo "This<strong> is </strong>a test." | sed -re 's/i//g'
Ths<strong> s </strong>a test.

Ok. That works. Now rewrite the search to replace a tag.

$ echo "This<strong> is </strong>a test." | sed -re 's/<\w+>//g'
This is </strong>a test.

Good. Now let’s remove the closing tag as well, by adding ‘/’, escaped with a ‘\’ prefix and made optional with a ‘?’ suffix:

$ echo "This<strong> is </strong>a test." | sed -re 's/<\/?\w+>//g'
This is a test.

Perfect. Exactly what we expected.
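One caveat: \w+ only matches bare tag names, so a tag carrying attributes, such as <strong class="x">, would slip through. A slightly more general pattern (a sketch, not a full HTML parser) is:

```shell
# [^>]+ matches everything up to the closing bracket, attributes included
echo 'This<strong class="x"> is </strong>a test.' | sed -re 's/<\/?[^>]+>//g'
```

For the plain tags in employees.xml the original \w+ version is enough, but this form is safer on messier markup.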

4. Putting it all together

Now that we have created the individual parts of our command, we paste them together in logical order, joined by | .

curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml' | \
xpath '/DIRECTORY/EMPLOYEE/LAST_NAME | /DIRECTORY/EMPLOYEE/ID' | \
sed -re 's/<\/?\w+>//g'

Output:

Found 72 nodes:
-- NODE --
-- NODE --
...
Sanguini031599Aquilegia030699...

Uh oh! Maybe this is why the ‘-- NODE --’ markers are there. If we redirect this to a file, the NODE text does not follow: it is sent to standard error (STDERR). But we can redirect STDERR to STDOUT with `2>&1` and strip the markers with another substitution, `sed -re 's/-- NODE --//g'`, in the same manner as the tags.

curl -s -G -L 'http://www.thegeekstuff.com/scripts/employees.xml' | \
xpath '/DIRECTORY/EMPLOYEE/LAST_NAME | /DIRECTORY/EMPLOYEE/ID' 2>&1 | \
sed -re 's/-- NODE --//g' | sed -re 's/<\/?\w+>//g'

Output:

Found 72 nodes:
Sanguini
031599
Aquilegia
030699
...
Lobalessia
022299

Perfect. Now, as I work on my project, I can quickly get sample data from XML files on the web to STDOUT without all the hassle of saving files or running some complicated software. We can even pipe this to `tail -n +3` to cut off those first two response lines.
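`tail -n +3` prints everything from line 3 onward. A quick test with simulated output (the lines are made up to mimic the response above):

```shell
# the first two lines stand in for the header and marker noise;
# tail -n +3 starts printing at the third line
printf 'Found 72 nodes:\n-- NODE --\nSanguini\n031599\n' | tail -n +3
```

Only the data lines remain.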

This article is just one example of various things that you can do if you learn how to combine multiple commands using pipe.



  • Kpinzou December 17, 2014, 10:15 am

    very helpful.

  • Philip January 3, 2017, 9:08 am

    hi,
    This is a shorter version using xpath that gets the data from any level and only gets the value, without the element name, so you don’t need the sed stripping:

    curl -s 'http://www.thegeekstuff.com/scripts/employees.xml' | xpath -e '//LAST_NAME/text()|//ID/text()' 2>/dev/null

    The 2>/dev/null removes the NODE stuff, the text() gets the data only, and the // makes xpath match at any level.
    Philip
