Editing an XML stream
The last command to be discussed in this tutorial is xml-sed(1), which
can be viewed as the Swiss Army knife of command-line XML editing.
For search and replace operations, xml-sed(1) is invoked just like
sed(1):
% cat food.xml | xml-sed 's/Apple/Orange/'
<products>
<product price="3">Chicken</product>
<product price="11.50">Lobster</product>
<product price=".20">Orange</product>
<product price="1.09">Milk (2 litres)</product>
</products>
Although it cannot be seen here, the two commands xml-sed(1) and sed(1)
do differ. Whereas sed(1) replaces text anywhere within the XML file,
even if it occurs inside the markup itself (a tag or attribute name,
for example), xml-sed(1) as invoked above only replaces text that lies
outside the tags. Moreover, xml-sed(1) accepts an XPath that restricts
where the edit is applied. Compare:
% cat food.xml | sed 's/e/E/g'
<products>
<product pricE="3">ChickEn</product>
<product pricE="11.50">LobstEr</product>
<product pricE=".20">ApplE</product>
<product pricE="1.09">Milk (2 litrEs)</product>
</products>
% cat food.xml | xml-sed 's/e/E/' ://product[3]
<products>
<product price="3">Chicken</product>
<product price="11.50">Lobster</product>
<product price=".20">ApplE</product>
<product price="1.09">Milk (2 litres)</product>
</products>
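The transcript above shows sed(1) altering an attribute name. The same
thing happens with tag names, as the following minimal sketch suggests.
It uses only plain sed(1) on a single line copied from food.xml; the
replacement word "item" is a made-up example, not something the tools
prescribe.

```shell
# One line taken from the food.xml listing above.
line='<product price="3">Chicken</product>'

# Plain sed(1) has no notion of markup: the pattern matches inside the
# opening and closing tags just as it would inside ordinary text.
renamed=$(printf '%s\n' "$line" | sed 's/product/item/g')

printf '%s\n' "$renamed"
```

Here the element itself is renamed, which is rarely what a search and
replace on the text was meant to do.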
For 99% of editing tasks, the above is all you need to know about
xml-sed(1). For the remaining 1%, we have to make a digression.
Consider your favourite text file in Unix. It consists of a number of
lines, separated by the newline character '\n'. This character isn't
directly visible, but it has an important structural function. Without
it, all the lines would join and the text file would be one long
stream of words and symbols.
Whenever the text is shown on a terminal, this newline character
is interpreted, rather than merely displayed as an ordinary
character. This distinction between '\n' and, say, the letter 'a' is
what makes sed(1) useful as a way to alter the structure of a text
document.
Think about what happens if you search and replace all the occurrences
of the letter 'a' with the letter 'A'. You get a document with the same
structure, but with altered letters. Now suppose you replace each 'a'
with '\n'. You have a document with a completely different number of
text lines. The structural change is obtained simply by editing the
embedded meta information, here the character '\n', with ordinary
editing commands.
% echo -e "Carol's cat carries carrots in a cart."
Carol's cat carries carrots in a cart.
% echo -e "CArol's cAt cArries cArrots in A cArt."
CArol's cAt cArries cArrots in A cArt.
% echo -e "C\\nrol's c\\nt c\\nrries c\\nrrots in \\n c\\nrt."
C
rol's c
t c
rries c
rrots in
c
rt.
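The three echo commands above spell out the results by hand. As a
minimal sketch, the same two substitutions can be left to sed(1)
itself. The variable names are only illustrative, and the newline in
the replacement is written as a backslash followed by a real line
break, since a literal "\n" in the replacement part of s/// is not
understood by every sed(1) implementation.

```shell
# The sentence used in the transcript above.
text="Carol's cat carries carrots in a cart."

# Ordinary character for ordinary character: still a single line.
upper=$(printf '%s\n' "$text" | sed 's/a/A/g')

# Ordinary character replaced by a newline: the line structure changes.
split=$(printf '%s\n' "$text" | sed 's/a/\
/g')

# Count the lines of the second result.
line_count=$(printf '%s\n' "$split" | grep -c '')

printf '%s\n' "$upper"
printf '%s\n' "$split"
printf 'lines: %s\n' "$line_count"
```

Nothing but the s command was used, yet the second result has a
different number of lines than the input.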
What does all this mean for sed(1)? In principle, editing a text
document can be done without specialized (meta) commands for inserting
or deleting a line, i.e. all that is needed are commands for altering
strings of characters.
The same principle also applies to xml-sed(1). There is no need for
specialized commands that create or remove tags, attributes, subtrees,
etc., provided that the structural (meta) information which describes
an XML document is embedded directly in the text being edited. The
language used by xml-sed(1) to embed the structure of an XML document
is the same one used by xml-echo(1).
If you have an existing XML file, you can feed it to xml-unecho(1) to
recover the embedded structure:
% xml-unecho --xml-sed food.xml
[/products]\n\n
[/products/product@price=3]Chicken
[/products]\n
[/products/product@price=11.50]Lobster
[/products]\n
[/products/product@price=.20]Apple
[/products]\n
[/products/product@price=1.09]Milk (2 litres)
[/products]\n\n
The --xml-sed switch tells xml-unecho(1) to print exactly what
xml-sed(1) would see. Normally, xml-unecho(1) prints a slightly
altered form which, if interpreted by xml-echo(1), would recover the
original XML file. The --xml-sed form is preferable for stream editing,
because the absolute path of the current node is always available, and
this helps prevent side effects.
Now suppose we edit the above, using sed(1) (that's right, we're not
using xml-sed(1) yet):
% xml-unecho --xml-sed food.xml \
| sed 's/]Apple/@juicy=true]A [bold]big[..] orange/'
[/products]\n\n
[/products/product@price=3]Chicken
[/products]\n
[/products/product@price=11.50]Lobster
[/products]\n
[/products/product@price=.20@juicy=true]A [bold]big[..] orange
[/products]\n
[/products/product@price=1.09]Milk (2 litres)
[/products]\n\n
We've just inserted an extra attribute, and a new tag! But this isn't
XML until we interpret it. Let's do everything at once using
xml-sed(1) now:
% cat food.xml \
| xml-sed 's/]Apple/@juicy=true]A [bold]big[..] orange/z'
<products>
<product price="3">Chicken</product>
<product price="11.50">Lobster</product>
<product price=".20" juicy="true">A <bold>big</bold> orange</product>
<product price="1.09">Milk (2 litres)</product>
</products>
The important ingredient here is the z flag in the s///z command. The
lines generated by xml-unecho(1) are called echo-leaves, and this flag
tells xml-sed(1) to edit the full echo-leaf. If the z is missing, then
the path and attribute information (which is surrounded by square
brackets []) is not editable. This restriction exists solely to keep
casual users from shooting themselves in the foot.
The remaining aspects of xml-sed(1) are not very surprising if you
already know sed(1). There is a pattern space (which contains the
current echo-leaf) and a hold space, and each editing command can be
addressed individually. The available editing commands are the same as
for sed(1), with minor (and rather obvious) alterations to accommodate
the echo-leaf concept.
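As a reminder of what an addressed command looks like, here is a
minimal sketch using plain sed(1) on a hand-copied excerpt of the
xml-unecho(1) listing shown earlier. Whether xml-sed(1) accepts exactly
the same address syntax is an assumption that should be checked against
its manual page; the sketch only shows how the path in each echo-leaf
can serve to select the leaf to be edited.

```shell
# Three echo-leaves copied by hand from the --xml-sed listing.
leaves='[/products/product@price=11.50]Lobster
[/products]\n
[/products/product@price=.20]Apple'

# The address /.../ selects only the leaf whose bracketed part ends in
# price=.20; the s command is then applied to that leaf and no other.
edited=$(printf '%s\n' "$leaves" | sed '/@price=\.20]/s/Apple/Orange/')

printf '%s\n' "$edited"
```

Because every leaf carries its absolute path, the address can be made as
specific as needed, which is what keeps an edit from touching other
nodes.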
There is more to say, but this tutorial is at an end. Happy hacking.