Laird Breyer's Free software

Extracting strings from an XML file

The xml-coreutils(7) tools are intended to work well with the existing core utilities, which only understand free-form text. For that reason, a few of the commands extract the text data from an XML file.

The xml-strings(1) command simply removes all the markup (tags, comments, etc.) from an XML file:


% xml-strings food.xml | grep Milk
Milk (2 litres)
% cat food.xml | grep Milk
  <product price="1.09">Milk (2 litres)</product>
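
To make concrete what "removing the markup" means, here is a minimal Python sketch. It is not how xml-strings(1) is implemented. It assumes a cut-down food.xml that contains only the Milk line quoted above, and the names FOOD_XML and strip_markup are hypothetical. The sketch also discards whitespace-only text, which may differ from what xml-strings(1) does.

```python
import xml.etree.ElementTree as ET

# Assumed, cut-down stand-in for food.xml: only the Milk line is quoted
# in the text, so nothing else is reproduced here.
FOOD_XML = """<products>
  <product price="1.09">Milk (2 litres)</product>
</products>"""


def strip_markup(xml_text):
    """Return the non-blank text chunks of an XML document, markup removed."""
    root = ET.fromstring(xml_text)
    # itertext() yields the text between the tags, in document order.
    chunks = (chunk.strip() for chunk in root.itertext())
    return [chunk for chunk in chunks if chunk]


print("\n".join(strip_markup(FOOD_XML)))
```

Once the markup is gone, the remaining lines are ordinary text, which is why a line-oriented tool such as grep can work on them.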

If you have slightly more complex requirements, a good command to use is xml-printf(1). It belongs to a family of commands that accept an XPATH, a notation described in the xml-coreutils(7) manpage. An XPATH represents a collection of elements within an XML document, and xml-printf(1) prints the strings contained in those elements. Here are a few examples:


% xml-printf 'I like %s ~:>\n' food.xml :/products/product[1]
I like Chicken ~:>
% xml-printf 'The %s costs $%.2f\n' \
        food.xml :/products/product[3] \
        :/products/product@price[3]
The Apple costs $0.20
% xml-printf 'The products are:\n%30s\n' \
        food.xml :/*/product
The products are:
                       Chicken
                       Lobster
                         Apple
               Milk (2 litres)

The first argument of xml-printf(1) is a format string similar to the format string of printf(3). The remaining arguments are an XML file (food.xml) and one or more XPATHs, each of which starts with a colon ':' to distinguish it from a file name. In the first example, the XPATH selects the single string "Chicken"; in the second, the two XPATHs select the strings "Apple" and "0.20" respectively. In the last example, the XPATH represents all four tags named "product" in the food.xml document.
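
The way the format string combines with the selected strings can be sketched in Python, whose % operator follows the same printf(3) conventions. This is only an illustration, not the xml-printf(1) implementation. The food.xml content is an assumed reconstruction: the product names and the Apple and Milk prices come from the examples above, while the Chicken and Lobster prices are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of food.xml. The Chicken and Lobster prices are
# hypothetical placeholders; the other values appear in the examples.
FOOD_XML = """<products>
  <product price="4.50">Chicken</product>
  <product price="12.00">Lobster</product>
  <product price="0.20">Apple</product>
  <product price="1.09">Milk (2 litres)</product>
</products>"""

products = ET.fromstring(FOOD_XML).findall("product")

# Rough analogue of the second example: the third product and its price.
third = products[2]
line = "The %s costs $%.2f" % (third.text, float(third.get("price")))
print(line)

# Rough analogue of the third example: every product name, right-aligned
# in a 30 character field by the %30s conversion.
listing = ["%30s" % product.text for product in products]
print("The products are:")
print("\n".join(listing))
```

The price is stored as text in the XML file, so it has to be converted to a number before a conversion such as %.2f can format it; the sketch does this explicitly with float().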

If you don't know what the W3C XPath specification is, then a good way to think of an XPATH is as a directory path, where each tag in an XML file is thought of as a directory, containing text or other tags. If you look at the food.xml file, then the "Chicken" string is contained in the first "product" tag, which is itself contained in the "products" top level tag.
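
The directory analogy can be made concrete with a short Python sketch that follows a path one tag at a time, the way a shell descends through directories. It is a simplification under stated assumptions: the resolve function and STEP pattern are hypothetical names, only the plain /tag/tag[n] form is handled, a step without an index picks the first match instead of the whole collection, and the Chicken and Lobster prices in the reconstructed food.xml are invented placeholders.

```python
import re
import xml.etree.ElementTree as ET

# Assumed reconstruction of food.xml (Chicken and Lobster prices are
# hypothetical placeholders).
FOOD_XML = """<products>
  <product price="4.50">Chicken</product>
  <product price="12.00">Lobster</product>
  <product price="0.20">Apple</product>
  <product price="1.09">Milk (2 litres)</product>
</products>"""

# One path step: a tag name with an optional 1-based index, e.g. product[3].
STEP = re.compile(r"([^\[\]]+)(?:\[(\d+)\])?")


def resolve(root, path):
    """Follow a path such as /products/product[1]; return None if absent."""
    steps = [STEP.fullmatch(step) for step in path.strip("/").split("/")]
    if any(match is None for match in steps):
        return None
    # The first step names the top level tag, like the root directory.
    if steps[0].group(1) != root.tag:
        return None
    node = root
    for match in steps[1:]:
        name, index = match.group(1), int(match.group(2) or 1)
        children = [child for child in node if child.tag == name]
        if not 1 <= index <= len(children):
            return None
        node = children[index - 1]  # descend one level
    return node


ROOT = ET.fromstring(FOOD_XML)
print(resolve(ROOT, "/products/product[1]").text)
```

Each step plays the role of a directory name, and the bracketed number distinguishes between sibling tags that share a name, something ordinary directories never need.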

If you're familiar with the W3C XPath specification, then you should know that, while the XPATH notation is inspired by the W3C XPath 1.0 specification, it is not a complete implementation, and likely never will be (namespaces, axes and functions are not very shell tool friendly).

Besides printing text in between the tags, you can print a list of the tags themselves by using the xml-find(1) command:


% xml-find food.xml
/products
/products/product
/products/product
/products/product
/products/product
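
The listing above corresponds to a depth-first walk over the tags of the document. The Python sketch below reproduces it for an assumed reconstruction of food.xml; it is an illustration of the idea, not the xml-find(1) implementation. The tag_paths function is a hypothetical name, and the Chicken and Lobster prices are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of food.xml (Chicken and Lobster prices are
# hypothetical placeholders).
FOOD_XML = """<products>
  <product price="4.50">Chicken</product>
  <product price="12.00">Lobster</product>
  <product price="0.20">Apple</product>
  <product price="1.09">Milk (2 litres)</product>
</products>"""


def tag_paths(element, prefix=""):
    """Return the path of this tag followed by the paths of all nested tags."""
    path = prefix + "/" + element.tag
    paths = [path]
    for child in element:
        # Recurse into each child, extending the path like a subdirectory.
        paths.extend(tag_paths(child, path))
    return paths


print("\n".join(tag_paths(ET.fromstring(FOOD_XML))))
```

Because the four product tags share a name, their paths are identical, which is why the same line appears four times in the output.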