Package 'xml2' reference manual

Title:	Parse XML
Description:	Work with XML files using a simple, consistent interface. Built on top of the 'libxml2' C library.
Authors:	Hadley Wickham [aut, cre], Jim Hester [aut], Jeroen Ooms [aut], Posit Software, PBC [cph, fnd], R Foundation [ctb] (Copy of R-project homepage cached as example)
Maintainer:	Hadley Wickham <[email protected]>
License:	MIT + file LICENSE
Version:	1.3.6.9000
Built:	2024-04-29 17:18:04 UTC
Source:	https://github.com/r-lib/xml2

as_list(read_xml("<foo> a <c><![CDATA[<d></d>]]></c></foo>"))
as_list(read_xml("<foo> <bar><baz /></bar> </foo>"))
as_list(read_xml("<foo id = 'a'></foo>"))
as_list(read_xml("<foo><bar id='a'/><bar id='b'/></foo>"))

as_xml_document(list(x = list()))

# Nesting multiple nodes
as_xml_document(list(foo = list(bar = list(baz = list()))))

# attributes are stored as R attributes
as_xml_document(list(foo = structure(list(), id = "a")))
as_xml_document(list(foo = list(
  bar = structure(list(), id = "a"),
  bar = structure(list(), id = "b")
)))

download_xml(
  url,
  file = basename(url),
  quiet = TRUE,
  mode = "wb",
  handle = curl::new_handle()
)

download_html(
  url,
  file = basename(url),
  quiet = TRUE,
  mode = "wb",
  handle = curl::new_handle()
)

Read HTML or XML.

Description

Read HTML or XML.

Usage

read_xml(x, encoding = "", ..., as_html = FALSE, options = "NOBLANKS")

read_html(x, encoding = "", ..., options = c("RECOVER", "NOERROR", "NOBLANKS"))

## S3 method for class 'character'
read_xml(x, encoding = "", ..., as_html = FALSE, options = "NOBLANKS")

## S3 method for class 'raw'
read_xml(
  x,
  encoding = "",
  base_url = "",
  ...,
  as_html = FALSE,
  options = "NOBLANKS"
)

## S3 method for class 'connection'
read_xml(
  x,
  encoding = "",
  n = 64 * 1024,
  verbose = FALSE,
  ...,
  base_url = "",
  as_html = FALSE,
  options = "NOBLANKS"
)

Arguments

`x`	A string, a connection, or a raw vector. A string can be either a path, a url or literal xml. Urls will be converted into connections either using `base::url` or, if installed, `curl::curl`. Local paths ending in `.gz`, `.bz2`, `.xz`, `.zip` will be automatically uncompressed. If a connection, the complete connection is read into a raw vector before being parsed.
`encoding`	Specify a default encoding for the document. Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default.
`...`	Additional arguments passed on to methods.
`as_html`	Optionally parse an xml file as if it's html.
`options`	Set parsing options for the libxml2 parser. Zero or more of:
	RECOVER	recover on errors
	NOENT	substitute entities
	DTDLOAD	load the external subset
	DTDATTR	default DTD attributes
	DTDVALID	validate with the DTD
	NOERROR	suppress error reports
	NOWARNING	suppress warning reports
	PEDANTIC	pedantic error reporting
	NOBLANKS	remove blank nodes
	SAX1	use the SAX1 interface internally
	XINCLUDE	implement XInclude substitution
	NONET	forbid network access
	NODICT	do not reuse the context dictionary
	NSCLEAN	remove redundant namespace declarations
	NOCDATA	merge CDATA as text nodes
	NOXINCNODE	do not generate XINCLUDE START/END nodes
	COMPACT	compact small text nodes; no modification of the tree allowed afterwards (will possibly crash if you try to modify the tree)
	OLD10	parse using XML-1.0 before update 5
	NOBASEFIX	do not fixup XINCLUDE xml:base uris
	HUGE	relax any hardcoded limit from the parser
	OLDSAX	parse using SAX2 interface before 2.7.0
	IGNORE_ENC	ignore internal document encoding hint
	BIG_LINES	store big line numbers in text PSVI field
`base_url`	When loading from a connection, raw vector or literal html/xml, this allows you to specify a base url for the document. Base urls are used to turn relative urls into absolute urls.
`n`	If `x` is a connection, the number of bytes to read per iteration. Defaults to 64kb.
`verbose`	When reading from a slow connection, this prints some output on every iteration so you know it's working.

Value

An XML document. HTML is normalised to valid XML - this may not be exactly the same transformation performed by the browser, but it's a reasonable approximation.
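
The effect of `options` and of the different input types can be seen on a small literal document. The following is a minimal sketch, assuming xml2 is installed; the sample string is made up for illustration, and `NOCDATA` is chosen only because it is an option that leaves `NOBLANKS` out and has no other effect on this input.

```r
library(xml2)

# Literal XML with whitespace-only text between the elements (made-up sample)
txt <- "<foo> <bar/> <baz/> </foo>"

# Default: options = "NOBLANKS", so whitespace-only text nodes are dropped
doc_default <- read_xml(txt)
length(xml_contents(doc_default))

# Supplying an option set without NOBLANKS keeps the blank text nodes
doc_blanks <- read_xml(txt, options = "NOCDATA")
length(xml_contents(doc_blanks))

# A raw vector is accepted as input as well
doc_raw <- read_xml(charToRaw(txt))
xml_name(xml_children(doc_raw))
```

Comparing the two lengths shows how many text nodes the default option removes from this input.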

Setting the "user agent" header

When performing web scraping tasks it is good practice, and often required, to set the user agent request header to a specific value. Sometimes this value is assigned to emulate a browser in order to have content render in a certain way (e.g. `Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0` to emulate more recent Windows browsers). Most often, this value should be set to tell the web resource owner who you are and what you intend to do, like this Google scraping bot user agent identifier: `Googlebot/2.1 (+http://www.google.com/bot.html)`.

You can set the HTTP user agent for URL-based requests using httr::set_config() and httr::user_agent():

httr::set_config(httr::user_agent("[email protected]; +https://example.com/info.html"))

httr::set_config() changes the configuration globally; httr::with_config() can be used to change the configuration temporarily.

Examples

# Literal xml/html is useful for small examples
read_xml("<foo><bar /></foo>")
read_html("<html><title>Hi<title></html>")
read_html("<html><title>Hi")

# From a local path
read_html(system.file("extdata", "r-project.html", package = "xml2"))

## Not run: 
# From a url
cd <- read_xml(xml2_example("cd_catalog.xml"))
me <- read_html("http://had.co.nz")

## End(Not run)

url_absolute(c(".", "..", "/", "/x"), "http://hadley.nz/a/b/c/d")

url_relative("http://hadley.nz/a/c", "http://hadley.nz")
url_relative("http://hadley.nz/a/c", "http://hadley.nz/")
url_relative("http://hadley.nz/a/c", "http://hadley.nz/a/b")
url_relative("http://hadley.nz/a/c", "http://hadley.nz/a/b/")

write_xml(x, file, ...)

## S3 method for class 'xml_document'
write_xml(x, file, ..., options = "format", encoding = "UTF-8")

write_html(x, file, ...)

## S3 method for class 'xml_document'
write_html(x, file, ..., options = "format", encoding = "UTF-8")

h <- read_html("<p>Hi!</p>")

tmp <- tempfile(fileext = ".xml")
write_xml(h, tmp, options = "format")
readLines(tmp)

# write formatted HTML output
write_html(h, tmp, options = "format")
readLines(tmp)

xml_attr(x, attr, ns = character(), default = NA_character_)

xml_has_attr(x, attr, ns = character())

xml_attrs(x, ns = character())

xml_attr(x, attr, ns = character()) <- value

xml_set_attr(x, attr, value, ns = character())

xml_attrs(x, ns = character()) <- value

xml_set_attrs(x, value, ns = character())

x <- read_xml("<root id='1'><child id ='a' /><child id='b' d='b'/></root>")
xml_attr(x, "id")
xml_attr(x, "apple")
xml_attrs(x)

kids <- xml_children(x)
kids
xml_attr(kids, "id")
xml_has_attr(kids, "id")
xml_attrs(kids)

# Missing attributes give missing values
xml_attr(xml_children(x), "d")
xml_has_attr(xml_children(x), "d")

# If the document has a namespace, use the ns argument and
# qualified attribute names
x <- read_xml('
 <root xmlns:b="http://bar.com" xmlns:f="http://foo.com">
   <doc b:id="b" f:id="f" id="" />
 </root>
')
doc <- xml_children(x)[[1]]
ns <- xml_ns(x)

xml_attrs(doc)
xml_attrs(doc, ns)

# If you don't supply a ns spec, you get the first matching attribute
xml_attr(doc, "id")
xml_attr(doc, "b:id", ns)
xml_attr(doc, "id", ns)

# Can set a single attribute with `xml_attr() <-` or `xml_set_attr()`
xml_attr(doc, "id") <- "one"
xml_set_attr(doc, "id", "two")

# Or set multiple attributes with `xml_attrs()` or `xml_set_attrs()`
xml_attrs(doc) <- c("b:id" = "one", "f:id" = "two", "id" = "three")
xml_set_attrs(doc, c("b:id" = "one", "f:id" = "two", "id" = "three"))

x <- read_xml("<foo> <bar><boo /></bar> <baz/> </foo>")
xml_children(x)
xml_children(xml_children(x))
xml_siblings(xml_children(x)[[1]])

# Note that each unique node only appears once in the output
xml_parent(xml_children(x))

# Mixed content
x <- read_xml("<foo> a c <d>e</d> f</foo>")
# Children gets the elements, contents gets all node types
xml_children(x)
xml_contents(x)

xml_length(x)
xml_length(x, only_elements = FALSE)

# xml_child makes it easier to select specific children
xml_child(x)
xml_child(x, 2)
xml_child(x, "baz")

r <- xml_new_root(
  xml_dtd(
    "html",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  )
)

# Use read_xml directly for more complicated DTD
d <- read_xml(
  '<!DOCTYPE doc [
<!ELEMENT doc (#PCDATA)>
<!ENTITY foo " test ">
]>
<doc>This is a valid document &foo; !</doc>'
)

xml_find_all(x, xpath, ns = xml_ns(x), ...)

## S3 method for class 'xml_nodeset'
xml_find_all(x, xpath, ns = xml_ns(x), flatten = TRUE, ...)

xml_find_first(x, xpath, ns = xml_ns(x))

xml_find_num(x, xpath, ns = xml_ns(x))

xml_find_int(x, xpath, ns = xml_ns(x))

xml_find_chr(x, xpath, ns = xml_ns(x))

xml_find_lgl(x, xpath, ns = xml_ns(x))

x <- read_xml("<foo><bar><baz/></bar><baz/></foo>")
xml_find_all(x, ".//baz")
xml_path(xml_find_all(x, ".//baz"))

# Note the difference between .// and //
# //  finds anywhere in the document (ignoring the current node)
# .// finds anywhere beneath the current node
(bar <- xml_find_all(x, ".//bar"))
xml_find_all(bar, ".//baz")
xml_find_all(bar, "//baz")

# Find all vs find one -----------------------------------------------------
x <- read_xml("<body>
  <p>Some <b>text</b>.</p>
  <p>Some <b>other</b> <b>text</b>.</p>
  <p>No bold here!</p>
</body>")
para <- xml_find_all(x, ".//p")

# By default, if you apply xml_find_all to a nodeset, it finds all matches,
# de-duplicates them, and returns as a single nodeset. This means you
# never know how many results you'll get
xml_find_all(para, ".//b")

# If you set flatten to FALSE, though, xml_find_all will return a list of
# nodesets, where each nodeset contains the matches for the corresponding
# node in the original nodeset.
xml_find_all(para, ".//b", flatten = FALSE)

# xml_find_first only returns the first match per input node. If there are 0
# matches it will return a missing node
xml_find_first(para, ".//b")
xml_text(xml_find_first(para, ".//b"))

# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need to use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
 <root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
   <f:doc><g:baz /></f:doc>
   <f:doc><g:baz /></f:doc>
 </root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))

Create a new document, possibly with a root node

Description

xml_new_document creates only a new document without a root node. In most cases you should instead use xml_new_root, which creates a new document and assigns the root node in one step.

Usage

xml_new_document(version = "1.0", encoding = "UTF-8")

xml_new_root(
  .value,
  ...,
  .copy = inherits(.value, "xml_node"),
  .version = "1.0",
  .encoding = "UTF-8"
)

Arguments

`version`	The version number of the document.
`encoding`	The character encoding to use in the document. The default encoding is ‘UTF-8’. Available encodings are specified at http://xmlsoft.org/html/libxml-encoding.html#xmlCharEncoding.
`.value`	node to insert.
`...`	If named, attributes or namespaces to set on the node; if unnamed, text to assign to the node.
`.copy`	whether to copy the `.value` before replacing. If this is `FALSE` then the node will be moved from its current location.
`.version`	The version number of the document, passed to `xml_new_document(version)`.
`.encoding`	The encoding of the document, passed to `xml_new_document(encoding)`.

Value

An xml_document object.
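
The difference between the two constructors can be illustrated with a short example. This is a sketch only, assuming xml2 is installed; the element names `catalog` and `item` and the attribute `id` are invented for illustration.

```r
library(xml2)

# xml_new_root() creates the document and its root in one step;
# named arguments in `...` become attributes on the root node
doc <- xml_new_root("catalog", id = "c1")

# Unnamed arguments in `...` become the text of the new node
xml_add_child(doc, "item", "first")

xml_name(xml_root(doc))
xml_attr(doc, "id")
xml_text(xml_child(doc, "item"))

# xml_new_document() starts without a root, so one is added explicitly
empty <- xml_new_document()
xml_add_child(empty, "catalog")
```

In most cases the one-step form is the more convenient of the two, as the description above suggests.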

x <- read_xml('
  <root>
    <doc1 xmlns = "http://foo.com"><baz /></doc1>
    <doc2 xmlns = "http://bar.com"><baz /></doc2>
  </root>
')
xml_ns(x)

# When there are default namespaces, it's a good idea to rename
# them to give informative names:
ns <- xml_ns_rename(xml_ns(x), d1 = "foo", d2 = "bar")
ns

# Now we can pass ns to other xml functions to use fully qualified names
baz <- xml_children(xml_children(x))
xml_name(baz)
xml_name(baz, ns)

xml_find_all(x, "//baz")
xml_find_all(x, "//foo:baz", ns)

str(as_list(x))
str(as_list(x, ns))

x <- read_xml(
  "<foo xmlns = 'http://foo.com'>
   <baz/>
   <bar xmlns = 'http://bar.com'>
     <baz/>
   </bar>
  </foo>"
)
# Need to specify the default namespaces to find the baz nodes
xml_find_all(x, "//d1:baz")
xml_find_all(x, "//d2:baz")

# After stripping the default namespaces you can find both baz nodes directly
xml_ns_strip(x)
xml_find_all(x, "//baz")

xml_replace(.x, .value, ..., .copy = TRUE)

xml_add_sibling(.x, .value, ..., .where = c("after", "before"), .copy = TRUE)

xml_add_child(.x, .value, ..., .where = length(xml_children(.x)), .copy = TRUE)

xml_add_parent(.x, .value, ...)

xml_remove(.x, free = FALSE)

xml_structure(read_xml("<a><c/><c/><d/></a>"))

rproj <- read_html(system.file("extdata", "r-project.html", package = "xml2"))
xml_structure(rproj)
xml_structure(xml_find_all(rproj, ".//p"))

h <- read_html("<body></body>")
html_structure(h)

x <- read_xml("<p>This is some text. This is <b>bold!</b></p>")
xml_text(x)
xml_text(xml_children(x))

x <- read_xml("<x>This is some text. <x>This is some nested text.</x></x>")
xml_text(x)
xml_text(xml_find_all(x, "//x"))

x <- read_xml("<p>   Some text   </p>")
xml_text(x, trim = TRUE)

# xml_double() and xml_integer() are useful for extracting numeric attributes
x <- read_xml("<plot><point x='1' y='2' /><point x='2' y='1' /></plot>")
xml_integer(xml_find_all(x, "//@x"))

# Example from https://msdn.microsoft.com/en-us/library/ms256129(v=vs.110).aspx
doc <- read_xml(system.file("extdata/order-doc.xml", package = "xml2"))
schema <- read_xml(system.file("extdata/order-schema.xml", package = "xml2"))
xml_validate(doc, schema)

`x`	A document, node, or node set.
`ns`	Optionally, a named vector giving prefix-url pairs, as produced by `xml_ns()`. If provided, all names will be explicitly qualified with the ns prefix, i.e. if the element `bar` is defined in namespace `foo`, it will be called `foo:bar`. (And similarly for attributes). Default namespaces must be given an explicit name. The ns is ignored when using `xml_name<-()` and `xml_set_name()`.
`...`	Needed for compatibility with generic. Unused.

`url`	A character string naming the URL of a resource to be downloaded.
`file`	A character string with the name where the downloaded file is saved.
`quiet`	If `TRUE`, suppress status messages (if any), and the progress bar.
`mode`	A character string specifying the mode with which to write the file. Useful values are `"w"`, `"wb"` (binary), `"a"` (append) and `"ab"`.
`handle`	a curl handle object

`x`	A character vector of urls relative to that base
`base`	A string giving a base url.

`x`	A character vector of urls.
`reserved`	A string containing additional characters to avoid escaping.
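
A short round trip shows what `reserved` changes. This is a minimal sketch assuming xml2's `url_escape()` and `url_unescape()`; the input strings are invented for illustration.

```r
library(xml2)

# Spaces are percent-encoded by default
escaped <- url_escape("a b c")
escaped

# url_unescape() reverses the encoding
url_unescape(escaped)

# Characters listed in `reserved` are left as they are
url_escape("a/b c", reserved = "/")
```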

`x`	A document, node, or node set.
`attr`	Name of attribute to extract.
`ns`	Optionally, a named vector giving prefix-url pairs, as produced by `xml_ns()`. If provided, all names will be explicitly qualified with the ns prefix, i.e. if the element `bar` is defined in namespace `foo`, it will be called `foo:bar`. (And similarly for attributes). Default namespaces must be given an explicit name. The ns is ignored when using `xml_name<-()` and `xml_set_name()`.
`default`	Default value to use when attribute is not present.
`value`	character vector of new value.

`x`	A document or node to write to disk. It's not possible to save nodesets containing more than one node.
`file`	Path to file or connection to write to.
`...`	additional arguments passed to methods.
`options`	default: ‘format’. Zero or more of:
	format	Format output
	no_declaration	Drop the XML declaration
	no_empty_tags	Remove empty tags
	no_xhtml	Disable XHTML1 rules
	require_xhtml	Force XHTML rules
	as_xml	Force XML output
	as_html	Force HTML output
	format_whitespace	Format with non-significant whitespace
`encoding`	The character encoding to use in the document. The default encoding is ‘UTF-8’. Available encodings are specified at http://xmlsoft.org/html/libxml-encoding.html#xmlCharEncoding.
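
Several output options can be combined in one character vector. The sketch below, assuming xml2 is installed and using a made-up document, writes the same document with and without the XML declaration.

```r
library(xml2)

doc <- read_xml("<root><child>text</child></root>")
tmp <- tempfile(fileext = ".xml")

# Default options = "format": indented output that starts with the declaration
write_xml(doc, tmp)
with_decl <- readLines(tmp)

# Keep the formatting but drop the declaration
write_xml(doc, tmp, options = c("format", "no_declaration"))
without_decl <- readLines(tmp)

with_decl[1]
without_decl[1]

unlink(tmp)
```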

`x`	A document, node, or node set.
`search`	For `xml_child`, either the child number to return (by position), or the name of the child node to return. If there are multiple child nodes with the same name, the first will be returned
`ns`	Optionally, a named vector giving prefix-url pairs, as produced by `xml_ns()`. If provided, all names will be explicitly qualified with the ns prefix, i.e. if the element `bar` is defined in namespace `foo`, it will be called `foo:bar`. (And similarly for attributes). Default namespaces must be given an explicit name. The ns is ignored when using `xml_name<-()` and `xml_set_name()`.
`only_elements`	For `xml_length`, should it count all children, or just children that are elements (the default)?

`name`	The name of the declaration
`external_id`	The external ID of the declaration
`system_id`	The system ID of the declaration

`x`	A document, node, or node set.
`xpath`	A string containing an xpath (1.0) expression.
`ns`	Optionally, a named vector giving prefix-url pairs, as produced by `xml_ns()`. If provided, all names will be explicitly qualified with the ns prefix, i.e. if the element `bar` is defined in namespace `foo`, it will be called `foo:bar`. (And similarly for attributes). Default namespaces must be given an explicit name. The ns is ignored when using `xml_name<-()` and `xml_set_name()`.
`...`	Further arguments passed to or from other methods.
`flatten`	A logical indicating whether to return a single, flattened nodeset or a list of nodesets.

`x`	A document, node, or node set.
`old`, `...`	An existing xml_namespace object followed by name-value (old prefix-new prefix) pairs to replace.

`.x`	a document, node or nodeset.
`.value`	node to insert.
`...`	If named, attributes or namespaces to set on the node; if unnamed, text to assign to the node.
`.copy`	whether to copy the `.value` before replacing. If this is `FALSE` then the node will be moved from its current location.
`.where`	Where to add the new node. For `xml_add_child`, the position after which to add; use `0` for the first child. For `xml_add_sibling`, either `"before"` or `"after"`, indicating whether the new node should be placed before or after `.x`.
`free`	When removing the node, also free the memory used for that node. Note that if you use this option you cannot use any existing objects pointing to the node or its children; doing so is likely to crash R or return garbage.
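
The way these arguments interact is easiest to follow on a small tree. The following is a sketch, assuming xml2 is installed; the node names `first`, `middle` and `replacement` are invented for illustration.

```r
library(xml2)

x <- read_xml("<parent><child>1</child><child>2</child></parent>")
kids <- xml_children(x)

# .where = 0 inserts the new node as the first child
xml_add_child(x, "first", "0", .where = 0)

# Add a sibling before the node that holds "2"
xml_add_sibling(kids[[2]], "middle", .where = "before")

# Replace the node that holds "1" with a new, empty node
xml_replace(kids[[1]], "replacement")

# Remove the node that holds "2"; free = FALSE (the default) keeps its memory
xml_remove(kids[[2]])

xml_name(xml_children(x))
```

The modification functions change the document in place, which is why `x` reflects every step without reassignment.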

`object`	R object to serialize.
`connection`	an open connection or (for `serialize`) `NULL` or (for `unserialize`) a raw vector (see ‘Details’).
`...`	Additional arguments passed to `read_xml()`.
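
These arguments correspond to xml2's `xml_serialize()` and `xml_unserialize()`. A minimal round trip, assuming xml2 is installed and using a made-up document, might look like this:

```r
library(xml2)

x <- read_xml("<a><b><c>123</c></b><b><c>456</c></b></a>")

# connection = NULL returns a raw vector instead of writing to a connection
raw_doc <- xml_serialize(x, NULL)

# The raw vector can be turned back into a document
restored <- xml_unserialize(raw_doc)
xml_name(restored)
xml_text(restored)
```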

`.x`	a node
`prefix`	The namespace prefix to use
`uri`	The namespace URI to use

`x`	HTML/XML document (or part thereof)
`indent`	Number of spaces to indent
`file`	A connection, or a character string naming the file to print to. If `""` (the default), `cat` prints to the standard output connection, the console unless redirected by `sink`. If it is `"|cmd"`, the output is piped to the command given by ‘cmd’, by opening a pipe connection.

`x`	A document, node, or node set.
`trim`	If `TRUE` will trim leading and trailing spaces.
`value`	character vector with replacement text.

`x`	A document, node, or node set.
`schema`	an XML document containing the schema

Package 'xml2'

Help Index

Coerce xml nodes to a list.

Coerce a R list to xml nodes.

Download a HTML or XML file

Read HTML or XML.

Convert between relative and absolute urls.

Escape and unescape urls.

Parse a url into its component pieces.

Write XML or HTML to disk.

Retrieve an attribute.

Construct a cdata node

Navigate around the family tree.

Construct a comment node

Construct a document type definition

Find nodes that match an xpath expression.

Description