Extracting the structure of an XML file
The simplest structural information about an XML file is its type or
file format. If this is all you wish to know, use xml-file(1):
% xml-file food.xml xml_coreutils_tutorial.html
food.xml: XML text
xml_coreutils_tutorial.html: HTML text fragment
Just as file(1) uses heuristics to identify a file type from its
binary contents, xml-file(1) uses various pieces of data, such as the
DOCTYPE and the name of the root tag, to (attempt to) identify an XML
file. However, xml-file(1) is not a replacement for file(1): it will
output "unrecognized file" if the file is anything other than
XML. Moreover, it will not detect a broken (malformed) file if the
break occurs below the tags it looks at.
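To make the idea of a head-of-file heuristic concrete, here is a minimal Python sketch. It is not how xml-file(1) is implemented; the function name sniff_xml and both regular expressions are assumptions chosen for illustration. It inspects only the DOCTYPE and the first opening tag, which is exactly why a break further down goes unnoticed.

```python
import re

def sniff_xml(text):
    """Return (doctype_name, root_tag) read from the head of text, or None.

    Hypothetical helper: it inspects only the DOCTYPE and the first
    opening tag, and never checks the rest of the document.
    """
    doctype = re.search(r"<!DOCTYPE\s+([^\s>\[]+)", text)
    # The first '<' followed by a name character starts the root tag;
    # '<?xml ...?>' and '<!DOCTYPE ...>' do not match this pattern.
    root = re.search(r"<([A-Za-z_][\w.:-]*)", text)
    if root is None:
        return None  # nothing that looks like a tag
    return (doctype.group(1) if doctype else None, root.group(1))

# The closing </Person> tag is missing, but the head still looks fine.
broken = '<?xml version="1.0"?>\n<People><Person></People>'
print(sniff_xml(broken))
print(sniff_xml("plain text, no tags"))
```

A sketch this simple can be fooled (for instance by a tag inside a comment), which is the usual price of a heuristic.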
Every shell user knows how to navigate their home directory using
ls(1) and cd(1). In xml-coreutils(7), the command xml-ls(1) lets you
navigate and list the "directory" structure of an XML file using
XPATHs. Here's an example using the People.xml file we discussed earlier.
% xml-ls People.xml :/
<?xml version="1.0"?>
<root>
<People>
<Person/>
</People>
</root>
% xml-ls People.xml :/People/Person
<?xml version="1.0"?>
<root>
<Person>
<Address/>
<TelNo/>
</Person>
</root>
% xml-ls People.xml :/People/Person/Address
<?xml version="1.0"?>
<root>
<Address>
<LineOne/>
<LineTwo/>
<County/>
<Country/>
</Address>
</root>
% xml-ls People.xml :/People/Person/Address/Country
<?xml version="1.0"?>
<root>
<Country>
Ireland
</Country>
</root>
The output of xml-ls(1) is itself XML. This makes sense if you recall
that ls(1) prints both directory names and file names together. If we
think of a tag as analogous to a directory, then text (such as the
string "Ireland" in the last example) is analogous to an ordinary
file. For the output to be well-formed XML, some constraints apply,
such as wrapping the output in a generic root tag; the original
doctype is not directly relevant to a listing.
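The directory analogy can be sketched in a few lines of Python. This is only an illustration, not the xml-ls(1) algorithm: the helper name list_node, the trailing-slash convention for tags, and the sample document (an assumed excerpt of People.xml pieced together from the listings in this section) are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumed excerpt of People.xml; the real file may contain more entries.
PEOPLE = """<People>
  <Person Name="Fred Davis">
    <Address>
      <LineOne>4 Bushy Street</LineOne>
      <LineTwo>Green Road</LineTwo>
      <County>Mayo</County>
      <Country>Ireland</Country>
    </Address>
    <TelNo>+353 96 45232</TelNo>
  </Person>
</People>"""

def list_node(root, path):
    """List what sits directly under path: child tags ("directories",
    marked with a trailing slash) and non-blank text ("files")."""
    steps = [s for s in path.split("/") if s]
    if not steps or steps[0] != root.tag:
        return []
    node = root
    for step in steps[1:]:
        node = node.find(step)  # first direct child with that tag name
        if node is None:
            return []
    entries = [child.tag + "/" for child in node]
    text = (node.text or "").strip()
    if text:
        entries.append(text)
    return entries

root = ET.fromstring(PEOPLE)
print(list_node(root, "/People/Person"))
print(list_node(root, "/People/Person/Address/Country"))
```

Unlike xml-ls(1), the sketch returns a plain list rather than XML, so it sidesteps the question of a wrapping root tag.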
To extract a structure based upon the presence or absence of textual
contents, use xml-grep(1). The output will again be an XML file (so it
can be xml-grepped again!), but containing only the structure
necessary to access the text. The following examples give an idea of
how this works.
% xml-grep 'Green' People.xml
<?xml version="1.0"?>
<root>
<Person Name="Fred Davis">
<Address>
<LineTwo>Green Road</LineTwo>
</Address>
</Person>
</root>
% xml-grep -E '(Fred|Ire*)' People.xml
<?xml version="1.0"?>
<root>
<Person Name="Fred Davis">
<Address>
<Country>Ireland</Country>
</Address>
</Person>
</root>
% xml-grep -i --subtree 'fReD' People.xml
<?xml version="1.0"?>
<root>
<Person Name="Fred Davis">
<Address>
<LineOne>4 Bushy Street</LineOne>
<LineTwo>Green Road</LineTwo>
<County>Mayo</County>
<Country>Ireland</Country>
</Address>
<TelNo>+353 96 45232</TelNo>
</Person>
</root>
% xml-grep -v 'o' People.xml
<?xml version="1.0"?>
<root>
<Person Name="Fred Davis">
<Address>
<LineOne>4 Bushy Street</LineOne>
<Country>Ireland</Country>
</Address>
<TelNo>+353 96 45232</TelNo>
</Person>
</root>
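The idea of keeping "only the structure necessary to access the text" can be sketched in Python as a recursive pruning of the tree. This is an assumption-laden illustration rather than the xml-grep(1) implementation: the function name prune is hypothetical, it matches element text only (not attributes), and it does not model options such as -v or --subtree.

```python
import re
import xml.etree.ElementTree as ET

def prune(node, pattern):
    """Return a copy of node that keeps only the branches leading to
    text matching pattern, or None if nothing at or below node matches."""
    kept = [c for c in (prune(child, pattern) for child in node) if c is not None]
    own = bool(node.text and re.search(pattern, node.text))
    if not kept and not own:
        return None
    copy = ET.Element(node.tag, dict(node.attrib))
    if own:
        copy.text = node.text
    copy.extend(kept)
    return copy

# Assumed, shortened stand-in for People.xml.
doc = ET.fromstring(
    '<People><Person Name="Fred Davis"><Address>'
    "<LineOne>4 Bushy Street</LineOne><LineTwo>Green Road</LineTwo>"
    "</Address><TelNo>+353 96 45232</TelNo></Person></People>"
)
result = prune(doc, "Green")
print(ET.tostring(result, encoding="unicode"))
```

Because the result is again a tree, it can be pruned a second time with another pattern, which mirrors the remark that the output can be xml-grepped again.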
Last but not least, there is xml-find(1), which we mentioned
earlier. Just as its namesake find(1) traverses a directory tree,
looking for interesting files and executing actions on them,
xml-find(1) traverses an XML file one node at a time, looking for
(selecting) interesting tags and executing actions. This makes
xml-find(1) an iterator. Before we can illustrate this properly,
we'll build up with a series of rather boring examples.
The simplest action is to search for a tag name and print it:
% xml-find People.xml -name 'Tel*' -print
/People/Person/TelNo
The tag name can also be passed to a program (or a script), for
example echo:
% xml-find People.xml -name 'Tel*' \
-exec echo 'The tag is ' '{}' ';'
The tag is /People/Person/TelNo
If this were a tutorial on find(1), then the placeholder {} would be the
name of a file, which the -exec'd program could open and read. That is
not possible here, because {} is only the path of a tag. So xml-find(1)
has two more placeholders: {@}, which expands to a list of the
attributes of the selected tag (if any), and {-}, which expands to the
name of a temporary XML file containing everything that belongs to the
current node. Thus:
% xml-find People.xml -name 'Tel*' \
-exec cat '{-}' ';'
<?xml version="1.0"?>
<People>
<Person>
<TelNo>+353 96 45232</TelNo></Person>
</People>
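The iterator view can be sketched in Python: at each node the traversal hands over three things, loosely corresponding to {} (the path), {@} (the attributes) and {-} (the node's contents). The names walk and find_by_name are hypothetical, the sample document is an assumed stand-in for People.xml, and the subtree is passed as an in-memory element rather than a temporary file.

```python
import fnmatch
import xml.etree.ElementTree as ET

def walk(node, prefix=""):
    """Yield (path, attributes, element) for node and every descendant,
    in document order."""
    path = prefix + "/" + node.tag
    yield path, dict(node.attrib), node
    for child in node:
        yield from walk(child, path)

def find_by_name(root, glob):
    """Select the visited nodes whose tag name matches a shell-style glob."""
    return [item for item in walk(root) if fnmatch.fnmatchcase(item[2].tag, glob)]

# Assumed, shortened stand-in for People.xml.
doc = ET.fromstring(
    '<People><Person Name="Fred Davis">'
    "<TelNo>+353 96 45232</TelNo></Person></People>"
)
for path, attrs, node in find_by_name(doc, "Tel*"):
    print("The tag is", path)
```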
It's time to combine all these ideas into a final example. We'll
iterate through the food.xml file, using xml-find(1) to stop at each
product and xml-printf(1) to print the data we find.
% xml-find food.xml -name 'product' \
-exec xml-printf 'Price of %-20s: %5.2f\n' \
{-} ://product ://product@price ';'
Price of Chicken             :  3.00
Price of Lobster             : 11.50
Price of Apple               :  0.20
Price of Milk (2 litres)     :  1.09
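For comparison, the same "stop at each product and format two values" loop can be sketched in Python. The layout of food.xml is not shown in this section, so the sample below (a food root tag, with the product name as text and the price as an attribute) is an assumption, as is the function name price_lines. The format string is the one passed to xml-printf(1) above.

```python
import xml.etree.ElementTree as ET

# Assumed layout; the real food.xml may be structured differently.
FOOD = """<food>
  <product price="3.00">Chicken</product>
  <product price="11.50">Lobster</product>
</food>"""

def price_lines(xml_text):
    """Visit every product element and format its name and price."""
    root = ET.fromstring(xml_text)
    lines = []
    for product in root.iter("product"):
        name = (product.text or "").strip()
        price = float(product.get("price", "0"))
        # %-20s left-justifies the name in 20 columns;
        # %5.2f right-justifies the price with two decimals.
        lines.append("Price of %-20s: %5.2f" % (name, price))
    return lines

for line in price_lines(FOOD):
    print(line)
```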