XML-COREUTILS
NAME
xml-coreutils - shell commands for XML processing
DESCRIPTION
xml-coreutils(7)
is a collection of shell commands, similar to the
traditional core utilities ("coreutils") shipped
on many Unix systems, but intended to operate on XML files
rather than text files. An important design goal is to keep
the "look and feel" of the traditional core
utilities, and to minimize the learning curve for
experienced shell users.
While the
current version of xml-coreutils(7) is likely to have
evolved somewhat from its initial design, the fundamental
ideas are described in detail in the following essay and
tutorial:
@PKGDATADIR@/doc/unix_xml.html
@PKGDATADIR@/doc/xml_coreutils_tutorial.html
This manpage
lists the available COMMANDS, and describes the COMMON
UNIFIED COMMAND LINE CONVENTION as well as the concept of an
ECHO-LEAF.
COMMANDS
The following
list of commands is available:
xml-cat(1)
       concatenate XML files and print XML on the standard output.
xml-cp(1)
       copy nodes from XML files into an XML file.
xml-cut(1)
       print selected parts of an XML file as an XML file.
xml-echo(1)
       generate an XML file on the standard output.
xml-file(1)
       determine type of XML files.
xml-find(1)
       search for nodes in XML files and execute actions.
xml-fixtags(1)
       convert HTML into XML on the standard output.
xml-fmt(1)
       reformat an XML file, writing to the standard output.
xml-grep(1)
       print matching fragments as an XML file on the standard output.
xml-head(1)
       truncate the parts of an XML document.
xml-less(1)
       interactively display an XML file on a terminal.
xml-ls(1)
       list the contents of an XML file.
xml-mv(1)
       move nodes from XML files to an XML file or the standard output.
xml-printf(1)
       format and print data in an XML file to the standard output.
xml-rm(1)
       remove nodes from XML files.
xml-sed(1)
       stream editor for filtering and transforming an XML file.
xml-strings(1)
       print the strings of data in an XML file to the standard output.
xml-unecho(1)
       ungenerate an XML file into an xml-echo(1) expression.
xml-wc(1)
       print height, depth and number of tags for each XML file.
COMMON UNIFIED COMMAND LINE CONVENTION
Since it is
often desired to work with only a small part of an XML file,
most (but not all) commands in xml-coreutils(7)
accept both a filename and one or more special strings
called XPATHs. The latter type of string matches a small
subset of the W3C XPath 1.0 standard and represents a set of
nodes within an associated XML file.
A command which
uses the common unified command line convention typically
has the following synopsis:
xml-command [OPTIONS] [ [FILE]... [:XPATH]... ]...
This indicates
that after any OPTIONS, the remainder of the command line
consists of zero or more FILE(s), followed by zero or more
XPATH(s), followed again by zero or more FILE(s) and zero or
more XPATH(s), etc. Each XPATH is preceded by a colon (:),
which is not part of the XPATH but serves to distinguish the
argument from a generic operating system FILE. This method
is unambiguous, since a FILE whose name happens to start
with a colon can always be preceded by an absolute or
relative path which doesn’t.
The convention
is that every unbroken series of XPATH(s) is associated with
each FILE that forms part of the preceding unbroken series
of FILE(s). Stated another way, every FILE is unambiguously
associated with each of the XPATH(s) within the immediately
following unbroken series. It is possible that the first
unbroken series of XPATH(s) is not preceded by any FILE, in
which case the standard input is taken to be the missing
FILE. If the last unbroken series of FILE(s) is not followed by any
XPATH, then the special XPATH "/" is assumed for each of those FILE(s).
In the example
below, the command operates on the following associations:
(stdin,xp1), (file1,xp1,xp2,xp3), (file2,xp1,xp2,xp3) and
(file3,"/"). Each such association may also be
called a bundle.
xml-command :xp1 file1 file2 :xp1 :xp2 :xp3 file3
The generic
meaning of a bundle such as (file2,xp1,xp2,xp3) is that
xml-command is performed on (or using) the set of nodes in
file2 which match any one of xp1, xp2 or xp3.
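The grouping rule can be illustrated with a short Python sketch. It is not
taken from the xml-coreutils source: the function name bundles and the
literal string "stdin" standing for the standard input are assumptions made
for this illustration only. The sketch returns no bundles for an empty
argument list, a case the text above does not spell out.

```python
def bundles(args):
    """Group FILE and :XPATH arguments into (file, [xpaths]) bundles."""
    result = []
    files, xpaths = [], []

    def flush():
        # Close the current run: pair every pending FILE with every pending XPATH.
        if not files and not xpaths:
            return
        for f in (files or ["stdin"]):  # "stdin" is a placeholder name (assumption)
            result.append((f, list(xpaths) or ["/"]))
        files.clear()
        xpaths.clear()

    for arg in args:
        if arg.startswith(":"):
            # The leading colon marks an XPATH and is not part of it.
            xpaths.append(arg[1:])
        else:
            if xpaths:
                # A FILE following XPATHs starts a new series of FILEs.
                flush()
            files.append(arg)
    flush()
    return result


example = ":xp1 file1 file2 :xp1 :xp2 :xp3 file3".split()
for bundle in bundles(example):
    print(bundle)
```

Applied to the example command line above, the sketch yields the four
associations listed there, in the same order.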
XPATH SPECIFICATION
An XPATH is a
string which represents a subset of XML nodes in an XML
document, using syntax similar to W3C XPath 1.0. Only a
very small part of XPath semantics is actually supported:
axes, namespaces, functions and complex predicates are
all excluded.
The XPATH
matching algorithm operates on path prefixes. This ensures
that whenever a node is selected, all its descendants are
selected as well. The following examples are normative.
/
       selects the whole document.
/*/abc
       selects the nodes which are descendants of the tags named abc,
       which are children of the top level tag.
//abc
       selects a tag named abc (and all nodes which are descendants of
       it) which can occur anywhere in the whole document.
//abc/
       selects all nodes which are descendants of a tag named abc which
       can occur anywhere in the whole document, but does not select the
       tag abc itself.
/xhtml/body/p/
       selects each node which is a descendant of one of the top level
       <p> nodes within the body of an xhtml document.
/xhtml/body/p/*
       selects each node which is a descendant of a tag which is a child
       of a <p> tag which is a child of a <body> tag which is a child of
       the root tag <xhtml>.
/abc@def
       selects the attribute named def of the top level tag named abc.
//abc@*
       selects each of the attributes of any tag named abc within the
       document.
/*/abc[2]
       selects the descendants of the second tag named abc that is a
       child of the top level tag.
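The prefix rule can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the matcher used by
xml-coreutils: the function name selects is hypothetical, a node is
represented by the list of tag names leading to it from the root, and the
matched tag itself is counted as selected. Only absolute XPATHs made of tag
names and the wildcard * are handled; "//", a trailing "/", attributes and
[n] predicates are left out.

```python
def selects(xpath, node_path):
    """Return True if xpath is a prefix match for the node at node_path.

    xpath     -- absolute pattern such as "/" or "/*/abc" (names and * only)
    node_path -- tag names from the root down to the node, e.g. ["a", "abc"]
    """
    steps = [s for s in xpath.split("/") if s]
    if len(steps) > len(node_path):
        # The node is shallower than the pattern, so it cannot match.
        return False
    # Every step must match the corresponding leading component; anything
    # deeper than the pattern is selected because the prefix already matched.
    return all(s == "*" or s == tag for s, tag in zip(steps, node_path))


print(selects("/", ["doc", "abc"]))
print(selects("/*/abc", ["doc", "abc", "item"]))
print(selects("/*/abc", ["doc", "def"]))
```

Because only the leading components are compared, a node deep below a
matching tag is selected without any further test, which is the property
described above.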
WHITESPACE HANDLING
Every
xml-coreutils(7) command outputs either well-formed
XML or traditional Unix text. All whitespace text nodes in
input XML documents are preserved verbatim when they are
output directly as part of an XML document, but no
such guarantee is made if the command merely outputs
text.
If a command
generates its own XML fragments, then indenting is performed
using TAB characters, rather than spaces, since this is
simpler to process subsequently. The visual layout of XML
documents for human consumption is delegated to xml-fmt(1).
ECHO-LEAF
The name
echo-leaf refers to a character string of the special form
"[PATH]TEXT" that is used by several commands
including xml-echo(1) and xml-sed(1). Each
echo-leaf represents a minimal XML fragment consisting of a
text node and a hierarchical path, delimited by square
brackets, which leads to the XML tag surrounding this text
node.
A sequence of
echo-leaves of an XML file plays a similar role to a
sequence of lines of a text file. Whereas it is customary to
think of a line as ending with a line break specification
'\n', here we think of a text node as preceded
by a path specification '[PATH]'.
In an
echo-leaf, the PATH is optional, but if it is present it
must be enclosed in square brackets ([]). The TEXT is
optional too.
The PATH
contains an absolute or relative path of an XML tag, and can
optionally include attribute specifications. The TEXT
contains ordinary text and may also contain escaped
sequences representing special XML constructs, but not
another "[PATH]".
An attribute
specification is a string of the form
"@NAME=VALUE", where NAME is the name of the
attribute and VALUE is the associated string value. VALUE
should not be surrounded by quotation marks (neither "
nor '), but if it contains the special characters []@=
these must be preceded by a backslash.
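As an illustration of the "[PATH]TEXT" form, the following Python sketch
splits a single echo-leaf into its PATH and TEXT parts while skipping
backslash-escaped characters inside the brackets. It is a minimal sketch,
not the parser used by xml-echo(1): the function name split_echo_leaf is
hypothetical, and returning None for a missing PATH and raising ValueError
for an unterminated PATH are choices made here, not documented behaviour.

```python
def split_echo_leaf(leaf):
    """Split "[PATH]TEXT" into (path, text); path is None when absent."""
    if not leaf.startswith("["):
        # No leading bracket: the whole string is TEXT.
        return None, leaf
    i = 1
    while i < len(leaf):
        if leaf[i] == "\\":
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        if leaf[i] == "]":
            # First unescaped closing bracket ends the PATH.
            return leaf[1:i], leaf[i + 1:]
        i += 1
    raise ValueError("unterminated [PATH] in echo-leaf")


print(split_echo_leaf("[/abc]hello"))
print(split_echo_leaf("hello"))
print(split_echo_leaf(r"[abc@key=a\]b]text"))
```

The escaped bracket in the last call stays inside the PATH, so the
attribute value is not cut short.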
The following
PATH examples are normative.
[/abc]
       represents a root tag named "abc".
[abc]
       represents a tag named "abc" which is a child relative to the
       current context.
[.]
       represents the tag that is currently in context.
[..]
       represents the parent of the tag that is currently in context.
[/abc/def@importance=Earnest/../ghi]
       represents a tag named "ghi" which is a sibling of a tag named
       "def" whose attribute "importance" has the value "Earnest", and
       whose parent tag is the root tag named "abc".
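The way such a PATH moves the current context can be sketched in Python. The
function name resolve is hypothetical, and the sketch makes simplifying
assumptions: attribute specifications are dropped, attribute values are
assumed to contain no "/" and no escaped characters, and only the resulting
context is tracked, not the tags that a command such as xml-echo(1) would
generate along the way.

```python
def resolve(context, path):
    """Return the list of tag names reached by following path from context.

    context -- tag names from the root to the current tag, e.g. ["abc"]
    path    -- PATH without its square brackets, e.g. "/abc/def/../ghi"
    """
    # An absolute PATH starts again from the root.
    stack = [] if path.startswith("/") else list(context)
    for step in path.split("/"):
        name = step.split("@", 1)[0]  # drop any attribute specifications
        if name in ("", "."):
            continue  # empty step or current tag: context unchanged
        if name == "..":
            if stack:
                stack.pop()  # move to the parent
        else:
            stack.append(name)  # descend into the named child
    return stack


print(resolve([], "/abc/def@importance=Earnest/../ghi"))
print(resolve(["abc", "def"], ".."))
```

For the last normative example the context ends at "ghi" directly below
"abc", consistent with "ghi" being a sibling of "def".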
More details
about the typical contents of PATH and TEXT in an echo-leaf
can be found in the xml-echo(1) manpage.
AUTHORS
Laird
A. Breyer is the original author of this software. The
source code (GPLv3 or later) for the latest version is
available at the following locations:
http://www.lbreyer.com/gpl.html
http://xml-coreutils.sourceforge.net
BUGS
The
xml-coreutils collection is still incomplete, but already
usable for limited tasks. Command options and output formats
are subject to change without warning
prior to v1.0 of the software.
SEE ALSO
xml-cat(1), xml-cp(1), xml-cut(1), xml-echo(1), xml-find(1),
xml-fixtags(1), xml-fmt(1), xml-grep(1), xml-head(1), xml-less(1),
xml-ls(1), xml-mv(1), xml-printf(1), xml-rm(1), xml-sed(1),
xml-strings(1), xml-unecho(1), xml-wc(1)