Getting Started with Nokogiri and XML in Ruby

Submitted on Jun 13, 2012, 8:50 a.m.

Here's a short post on getting started with Nokogiri, a Ruby gem that wraps libxml2. I'm writing this because, well, the docs at http://nokogiri.org/ kind of suck. I wanted to read a simple XML document. My XPath-fu was a little rusty, although all I wanted to do was read some attributes from the root element, some element values directly beneath the root, and then a short collection of items (very similar to an Atom document). My main bone of contention with the docs was their use of the `@doc.xpath("//character")` search operator at the very beginning of their parsing tutorial. How about we start from the beginning?

Here is a sample XML document. Save this to your local disk, install the Nokogiri gem, and fire up IRB.

<Collection version="2.0" id="74j5hc4je3b9">
  <Name>A Funfair in Bangkok</Name>
  <PermaLink>Funfair in Bangkok</PermaLink>
  <PermaLinkIsName>True</PermaLinkIsName>
  <Description>A small funfair near On Nut in Bangkok.</Description>
  <Date>2009-08-03T00:00:00</Date>
  <IsHidden>False</IsHidden>
  <Items>
    <Item filename="AGC_1998.jpg">
      <Title>Funfair in Bangkok</Title>
      <Caption>A small funfair near On Nut in Bangkok.</Caption>
      <Authors>Anthony Bouch</Authors>
      <Copyright>Copyright © Anthony Bouch</Copyright>
      <CreatedDate>2009-08-07T19:22:08</CreatedDate>
      <Keywords>
        <Keyword>Funfair</Keyword>
        <Keyword>Bangkok</Keyword>
        <Keyword>Thailand</Keyword>
      </Keywords>
      <ThumbnailSize width="133" height="200" />
      <PreviewSize width="532" height="800" />
      <OriginalSize width="2279" height="3425" />
    </Item>
    <Item filename="AGC_1164.jpg" iscover="True">
      <Title>Bumper Cars at a Funfair in Bangkok</Title>
      <Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
      <Authors>Anthony Bouch</Authors>
      <Copyright>Copyright © Anthony Bouch</Copyright>
      <CreatedDate>2009-08-03T22:08:24</CreatedDate>
      <Keywords>
        <Keyword>Bumper Cars</Keyword>
        <Keyword>Funfair</Keyword>
        <Keyword>Bangkok</Keyword>
        <Keyword>Thailand</Keyword>
      </Keywords>
      <ThumbnailSize width="200" height="133" />
      <PreviewSize width="800" height="532" />
      <OriginalSize width="3725" height="2479" />
    </Item>
  </Items>
</Collection>

At the IRB prompt, the first thing we'll do is require nokogiri.

>> require 'nokogiri'
=> true

Now let's load our XML document.

>> f = File.open("/path/to/the/collection.xml")
=> #<File:/path/to/the/collection.xml>
>> doc = Nokogiri::XML(f)
=> # You'll see the XML document output to the console.

The first thing we'd like to do is select the id attribute from the root. There are two ways you can do this.

>> doc.at_xpath("/*/@id")
=> # You'll see a Nokogiri::XML::Attr object for the id attribute.

This returns a `Nokogiri::XML::Attr` object (which inherits from `Node`). You can call `.value`, `.text`, or `.inner_text` on the returned object to retrieve the actual value. Notice that we've used the `at_xpath` method to select the attribute: `at_xpath` returns only the first match, whereas `xpath` on its own returns a NodeSet (with just one node in this case). The second way to get a root attribute is to select the root element first:

>> root = doc.root
=> # Again, you'll see the complete XML document output to the console.

Now we can access the id attribute either with a convenient index notation, which returns the value immediately, or with an XPath expression for the attribute, which again returns an `XML::Attr` object from which we can retrieve the value.

>> root["id"]
=> "74j5hc4je3b9"

>> root.at_xpath("@id")
=> # Again, you'll see a Nokogiri::XML::Attr object for the id attribute.

Since we're already positioned at the root element of the document, selecting elements beneath the root is easy.

>> root.at_xpath("Name")
=> # You'll see the Name element, whose only child is the text node "A Funfair in Bangkok".

You can use `root.at_xpath("Name").text` to retrieve the text value, but only if you're absolutely sure the element is present. Otherwise `at_xpath` returns `nil` and you'll get an undefined method for nil:NilClass error (a `NoMethodError`). Now let's select the items in our document, returning a NodeSet of items that we can iterate over.

>> items = root.xpath("Items/Item")
=> # You'll see the XML for our two items output to the console.
>> items.count
=> 2

We can select an attribute of an item using the convenient index-style syntax, or a regular XPath select with the `@` sign.

>> items[0]["filename"]
=> "AGC_1998.jpg"

And of course we can rinse and repeat with all of our element selectors, as well as move further down the structure of the document and select the keywords.

>> items[0].at_xpath("Title")
=> # You'll see the Title element, whose only child is the text node "Funfair in Bangkok".

And lastly, although this is a very different use case (and, for some reason, the first one that the Nokogiri [parsing tutorial](http://nokogiri.org/tutorials/searching_a_xml_html_document.html) decided to present), there is the `//` XPath search operator, which searches the whole document and returns every element with a matching name, at any level.

>> doc.xpath("//Keywords")
=> # You'll see a NodeSet of every Keywords element in the document, whatever level it appears at (here, one inside each Item).

Finally, we'll close our file.

>> f.close

Of course, the better way to do this in code is to pass a block to `File.open` (`File.open(path) do |f| ... end`), which ensures that the file is closed as soon as the block exits. And there you have it. Hope this helps anyone else who is using Nokogiri for the first time and would like to get started with very basic XPath queries to select attributes and elements from a simple XML document.