Extracting strings from an XML file
The xml-coreutils(7) tools are intended to work well with the existing
core utilities, which only understand freeform text. For that reason,
there are a few commands which extract the data from an XML file.
The xml-strings(1) command simply removes all the markup (tags,
comments, etc.) from an XML file:
% xml-strings food.xml | grep Milk
Milk (2 litres)
% cat food.xml | grep Milk
<product price="1.09">Milk (2 litres)</product>
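To make the idea of "removing all the markup" concrete, here is a minimal Python sketch that does something comparable with the standard library. It is only an illustration, not how xml-strings(1) is implemented. The FOOD_XML string is a hypothetical stand-in for food.xml: the product names, the Milk line and the Apple price come from the examples in this section, while the Chicken and Lobster prices are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for food.xml. The Chicken and Lobster prices
# are invented for illustration; the rest follows the examples.
FOOD_XML = """<products>
<product price="11.95">Chicken</product>
<product price="34.50">Lobster</product>
<product price="0.20">Apple</product>
<product price="1.09">Milk (2 litres)</product>
</products>"""


def strip_markup(xml_text):
    """Return the character data of the document, one string per chunk."""
    root = ET.fromstring(xml_text)
    # itertext() yields every piece of text between tags, including the
    # whitespace used for layout, so empty chunks are filtered out.
    chunks = (chunk.strip() for chunk in root.itertext())
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    for line in strip_markup(FOOD_XML):
        print(line)
```

Once the markup is gone, each string sits on its own line, which is the form that line-oriented tools such as grep expect.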
If you have slightly more complex requirements, a good command to use
is xml-printf(1). This is one of a family of commands which accept
an XPATH, which you can learn about on the xml-coreutils(7) manpage.
An XPATH represents a collection of elements within
an XML document, and xml-printf(1) just prints the strings from
those elements. Here are a few examples:
% xml-printf 'I like %s ~:>\n' food.xml :/products/product[1]
I like Chicken ~:>
% xml-printf 'The %s costs $%.2f\n' \
food.xml :/products/product[3] \
:/products/product@price[3]
The Apple costs $0.20
% xml-printf 'The products are:\n%30s\n' \
food.xml :/*/product
The products are:
Chicken
Lobster
Apple
Milk (2 litres)
The first argument of xml-printf(1) is a format string similar to the
format string of printf(3). The remaining arguments are an XML file
(food.xml) and one or more XPATHs, which start with a colon ':' to
distinguish them from file names. In the first example, the XPATH
selects the single string "Chicken"; in the second, the two XPATHs
select the strings "Apple" and "0.20" respectively. In the last
example, the XPATH represents all four tags named "product" in
the food.xml document.
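The interplay between the format string and the selected strings can be sketched in Python as follows. This is a rough analogue under stated assumptions, not the real xml-printf(1): the helper names select_strings and xml_printf_like are hypothetical, the paths are written without the leading colon, only element text is handled (no attributes such as the price), and the FOOD_XML data is a stand-in whose Chicken and Lobster prices are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for food.xml (Chicken and Lobster prices invented).
FOOD_XML = """<products>
<product price="11.95">Chicken</product>
<product price="34.50">Lobster</product>
<product price="0.20">Apple</product>
<product price="1.09">Milk (2 litres)</product>
</products>"""


def select_strings(root, path):
    """Return the text of every element matched by a path such as
    '/products/product[3]' or '/*/product'."""
    parts = path.strip("/").split("/", 1)
    # The first segment must name the root tag (or be the wildcard).
    if parts[0] not in ("*", root.tag):
        return []
    if len(parts) == 1:
        return [root.text or ""]
    # ElementTree understands simple steps and 1-based positions.
    return [(el.text or "") for el in root.findall("./" + parts[1])]


def xml_printf_like(fmt, xml_text, *paths):
    """Fill a printf-style format with the strings selected by the paths.
    In this sketch the number of selected strings must equal the number
    of conversions in the format string."""
    root = ET.fromstring(xml_text)
    values = []
    for path in paths:
        values.extend(select_strings(root, path))
    return fmt % tuple(values)


if __name__ == "__main__":
    print(xml_printf_like("I like %s ~:>\n", FOOD_XML,
                          "/products/product[1]"), end="")
```

Run as a script, this prints "I like Chicken ~:>", matching the first example above.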
If you don't know what the W3C XPath specification is, then a good way
to think of an XPATH is as a directory path, where each tag in an XML
file is thought of as a directory, containing text or other tags. If
you look at the food.xml file, then the "Chicken" string is contained
in the first "product" tag, which is itself contained in the
"products" top-level tag.
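The directory analogy can be made explicit with a short Python sketch that resolves a path one segment at a time, much as a shell would descend through directories. The function name descend is hypothetical, the sketch only understands segments of the form tag or tag[n], and it assumes that a bare tag means the first matching child; none of this is taken from the xml-coreutils sources. The sample data is again a stand-in for food.xml with invented Chicken and Lobster prices.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for food.xml (Chicken and Lobster prices invented).
FOOD_XML = """<products>
<product price="11.95">Chicken</product>
<product price="34.50">Lobster</product>
<product price="0.20">Apple</product>
<product price="1.09">Milk (2 litres)</product>
</products>"""


def descend(root, path):
    """Follow 'path' from the root tag downwards and return the element
    reached, or None if some segment does not exist."""
    segments = path.strip("/").split("/")
    if segments[0] != root.tag:
        return None
    node = root
    for segment in segments[1:]:
        match = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", segment)
        if match is None:
            return None
        tag = match.group(1)
        index = int(match.group(2) or 1)   # positions count from 1
        # "List the directory": keep only children with the wanted tag.
        children = [child for child in node if child.tag == tag]
        if index < 1 or index > len(children):
            return None
        node = children[index - 1]
    return node


if __name__ == "__main__":
    root = ET.fromstring(FOOD_XML)
    print(descend(root, "/products/product[1]").text)
```

Here "/products" plays the role of the top-level directory and "product[1]" the first entry inside it, so the script prints "Chicken".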
If you're familiar with the W3C XPath specification, then you should
know that, while the XPATH notation is inspired by the W3C XPath 1.0
specification, it is not a complete implementation, and likely never
will be (namespaces, axes and functions are not very shell tool
friendly).
Besides printing text in between the tags, you can print a list of the
tags themselves by using the xml-find(1) command:
% xml-find food.xml
/products
/products/product
/products/product
/products/product
/products/product
However, xml-find(1) is much more useful than that. It is in fact a
general-purpose selection tool, which can extract XML fragments from a
file using one or more XPATHs. We'll show this later.
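For readers who want to see where the list of tag paths shown above comes from, here is a small Python sketch that produces the same kind of listing by walking the document tree. It is an illustration only and says nothing about how xml-find(1) works internally; the function name tag_paths is hypothetical, and the sample data is a stand-in for food.xml with invented Chicken and Lobster prices.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for food.xml (Chicken and Lobster prices invented).
FOOD_XML = """<products>
<product price="11.95">Chicken</product>
<product price="34.50">Lobster</product>
<product price="0.20">Apple</product>
<product price="1.09">Milk (2 litres)</product>
</products>"""


def tag_paths(xml_text):
    """Return the path of every tag, in document order."""
    paths = []

    def walk(element, prefix):
        path = prefix + "/" + element.tag
        paths.append(path)
        # Visit the children, like descending into subdirectories.
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


if __name__ == "__main__":
    for path in tag_paths(FOOD_XML):
        print(path)
```

With the stand-in data, the script prints /products once, followed by /products/product four times, which is the same shape as the xml-find(1) listing.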