Laird Breyer's Free software

Editing an XML stream

The last command to be discussed in this tutorial is xml-sed(1), which can be viewed as the Swiss Army knife of command-line XML editing.

For search and replace operations, xml-sed(1) is invoked just like sed(1):


% cat food.xml | xml-sed 's/Apple/Orange/'
<products>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20">Orange</product>
  <product price="1.09">Milk (2 litres)</product>

</products>

Although this cannot be seen here, the two commands xml-sed(1) and sed(1) do differ. Whereas sed(1) will replace text anywhere within the XML file, even if it occurs within a tag name, xml-sed(1) as invoked above only replaces text that lies outside of the tags. Moreover, xml-sed(1) understands editing constraints in the form of an XPath. Compare:


% cat food.xml | sed 's/e/E/g'
<products>

  <product pricE="3">ChickEn</product>
  <product pricE="11.50">LobstEr</product>
  <product pricE=".20">ApplE</product>
  <product pricE="1.09">Milk (2 litrEs)</product>

</products>
% cat food.xml | xml-sed 's/e/E/' ://product[3]
<products>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20">ApplE</product>
  <product price="1.09">Milk (2 litres)</product>

</products>
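
The first half of this comparison can be reproduced with sed(1) alone. The following is a minimal sketch on a made-up one-line document (it is not food.xml), in which the word "price" occurs both as an attribute name and as ordinary text. Because sed(1) sees only characters, it rewrites both occurrences:

```shell
# A hypothetical one-line XML document: "price" appears once as an
# attribute name (markup) and once as text content.
doc='<product price="3">price list</product>'

# Plain sed(1) knows nothing about XML, so the markup is rewritten too.
result=$(printf '%s\n' "$doc" | sed 's/price/cost/g')
printf '%s\n' "$result"
# prints: <product cost="3">cost list</product>
```

The attribute has been renamed along with the text, which is exactly the kind of side effect that xml-sed(1) avoids by default.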

For 99% of editing tasks, the above is all you need to know about xml-sed(1). For the remaining 1%, we have to make a digression.

Consider your favourite text file in Unix. It consists of a number of lines, separated by the newline character '\n'. This character isn't directly visible, but it has an important structural function. Without it, all the lines would join and the text file would be one long stream of words and symbols.

Whenever the text is shown on a terminal, this newline character is interpreted, rather than merely displayed as an ordinary character. This distinction between '\n' and, say, the letter 'a' is what makes sed(1) useful as a way to alter the structure of a text document.

Think about what happens if you search and replace all the occurrences of the letter 'a' with the letter 'A'. You get a document with the same structure, but with altered letters. Now suppose you replace each 'a' with '\n'. You have a document with a completely different number of text lines. A structural alteration is obtained simply by altering, with ordinary editing commands, the embedded meta information represented by the character '\n'.


% echo -e "Carol's cat carries carrots in a cart."
Carol's cat carries carrots in a cart.
% echo -e "CArol's cAt cArries cArrots in A cArt."
CArol's cAt cArries cArrots in A cArt.
% echo -e "C\\nrol's c\\nt c\\nrries c\\nrrots in \\n c\\nrt."
C
rol's c
t c
rries c
rrots in 
 c
rt.

What does all this mean for sed(1)? In principle, editing a text document can be done without specialized (meta) commands for inserting or deleting a line, i.e. all that is needed are commands for altering strings of characters.
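
As a rough illustration of this principle, here is a small sketch that uses tr(1) rather than sed(1), purely because tr(1) makes single-character substitutions easy to write. Lines are created and removed using nothing but character replacement; the variable names are of course arbitrary.

```shell
text="Carol's cat carries carrots in a cart."

# Replace each of the six 'a' characters with a newline, then count lines.
# The arithmetic expansion strips any padding that wc(1) may print.
lines=$(( $(printf '%s\n' "$text" | tr 'a' '\n' | wc -l) ))
printf '%s\n' "$lines"
# prints: 7

# The reverse operation: replacing each newline with a space joins lines.
joined=$(printf 'one\ntwo\nthree\n' | tr '\n' ' ')
printf '[%s]\n' "$joined"
# prints: [one two three ]
```

In both cases no "insert line" or "delete line" command was needed; the structure changed because the character carrying the structure was edited.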

The same principle also applies to xml-sed(1). There is no need for specialized commands that create or remove tags, attributes, subtrees, etc., provided that the structural (meta) information which describes an XML document is embedded directly in the text being edited. The language used by xml-sed(1) to embed the structure of an XML document is the same one used by xml-echo(1).

If you have an existing XML file, you can feed it to xml-unecho(1) to recover the embedded structure:


% xml-unecho --xml-sed food.xml 
[/products]\n\n  
[/products/product@price=3]Chicken
[/products]\n  
[/products/product@price=11.50]Lobster
[/products]\n  
[/products/product@price=.20]Apple
[/products]\n  
[/products/product@price=1.09]Milk (2 litres)
[/products]\n\n

The --xml-sed switch tells xml-unecho(1) to print exactly what xml-sed(1) would see. Normally, xml-unecho(1) prints a slightly altered form which, if interpreted by xml-echo(1), would recover the original XML file. The --xml-sed form is preferable for stream editing, because the absolute path of the current node is always available, and this helps prevent side effects.
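
To see why having the absolute path on every line helps, consider the following sketch. It uses only sed(1) on a few echo-leaves typed in by hand: two are copied from the listing above, while the third (an "Apple pie" product) is a made-up addition for the sake of the example. The path and attribute prefix serves as a sed(1) address, so the substitution touches a single leaf only.

```shell
# Hand-typed echo-leaves; the last line is hypothetical, not from food.xml.
leaves='[/products/product@price=11.50]Lobster
[/products/product@price=.20]Apple
[/products/product@price=2.50]Apple pie'

# Address the leaf by its full path and attribute, then edit it.
# The address uses # as its delimiter so the slashes need no escaping.
edited=$(printf '%s\n' "$leaves" \
    | sed '\#^\[/products/product@price=\.20]# s/]Apple/]Orange/')
printf '%s\n' "$edited"
```

Without the address, the same s/// command would also have rewritten the "Apple pie" leaf.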

Now suppose we edit the above, using sed(1) (that's right, we're not using xml-sed(1) yet):


% xml-unecho --xml-sed food.xml \
        | sed 's/]Apple/@juicy=true]A [bold]big[..] orange/'
[/products]\n\n  
[/products/product@price=3]Chicken
[/products]\n  
[/products/product@price=11.50]Lobster
[/products]\n  
[/products/product@price=.20@juicy=true]A [bold]big[..] orange
[/products]\n  
[/products/product@price=1.09]Milk (2 litres)
[/products]\n\n

We've just inserted an extra attribute, and a new tag! But this isn't XML until we interpret it. Let's do everything at once using xml-sed(1) now:


% cat food.xml \
        | xml-sed 's/]Apple/@juicy=true]A [bold]big[..] orange/z'
<products>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20" juicy="true">A <bold>big</bold> orange</product>
  <product price="1.09">Milk (2 litres)</product>

</products>

The important ingredient here is the z flag in the s///z command. This flag tells xml-sed(1) to edit the full echo-leaf (the lines generated by xml-unecho(1) are called echo-leaves). If the z is missing, then the path and attribute information (which are surrounded by square brackets []) are not editable. This restriction is solely for the benefit of casual users' feet.

The remaining aspects of xml-sed(1) are not very surprising if you already know sed(1). There is a pattern and a holding space (which contains the current echo-leaf), and each editing command can be addressed individually. The available editing commands are the same as for sed(1), with minor (and rather obvious) alterations to accommodate the echo-leaf concept.