December 09, 2006

Posted by John

Tagged hpricot and xml

Parsing XML With Hpricot

Note: I have a comprehensive write-up on parsing XML with Ruby, which provides how-tos for Hpricot, libxml-ruby, and REXML.

Parsing XML has always annoyed and confused me (well, at least in PHP it did). When I switched to Ruby, I first learned how to get the job done with REXML, but I was left wanting something more, something easier. Thanks to the gentle prodding of Chris on this err post, I decided to use Hpricot to parse XML on my last web service binge, a.k.a. the twitter gem I just released.

I found Hpricot easy to use and extremely quick. I was instantly sold. Heck, it made me want to go find some massive XML file and parse it just for fun. OK, maybe not that far. Check out the simple example below:

require 'rubygems'
require 'hpricot'

xml = %{
<status>
  <id>1</id>
  <created_at>a date</created_at>
  <text>some text</text>
</status> 
}

doc = Hpricot::XML(xml)
(doc/:status).each do |status|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end

Does it get any easier? Sometimes I need to hear something a few times before I actually try it, so I’m confirming what Chris posted: Hpricot is sweet with XML. Go ahead and try it out. If you don’t have any XML lying around, just hit an API somewhere to get some.
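
A quick note on the `(doc/:status)` syntax in the example above: `/` is an ordinary Ruby method, and in Hpricot it is an alias for `search`, so `doc.search(:status)` gives the same result. The snippet below is only a rough sketch of how that kind of alias works in plain Ruby. `TinyDoc` and its hash of elements are made up for illustration and are not how Hpricot is actually implemented.

```ruby
# TinyDoc is a hypothetical stand-in, not part of Hpricot.
# It only shows how an operator like / can alias a regular method.
class TinyDoc
  def initialize(elements)
    @elements = elements # e.g. { status: ['first', 'second'] }
  end

  # Look up the entries stored under a name; empty array if there are none.
  def search(name)
    @elements.fetch(name.to_sym, [])
  end

  # Make doc / :status behave exactly like doc.search(:status).
  alias_method :/, :search
end

doc = TinyDoc.new({ status: ['first', 'second'] })

via_slash  = doc / :status
via_search = doc.search(:status)

# Both calls go through the same method, so the results match.
puts via_slash == via_search
```

Because `/` is just a method name, the parentheses in `(doc/:status).each` are only there so that `.each` is called on the result of the search rather than on the symbol.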

7 Comments

  1. Could someone please explain what this syntax is doing?

    (doc/:status).each

  2. / is actually a method call. It is an alias for the Hpricot search method, so doc.search(:status).each would produce the same result. _why just added it because it is some sweet syntactic sugar.

  3. Benedikt

    Jul 05, 2007

    The example above does not work in irb with Ruby 184-20. The ‘text’ tag is for some reason not found; the lookup returns nil. Changing the tag, and the reference in the array below, to anything else (e.g. ‘fluff’) takes care of the problem. Any idea why?

    xml = %{
    <status>
      <id>1</id>
      <created_at>a date</created_at>
      <fluff>some text</fluff>
    </status>
    }

    doc = Hpricot(xml)
    (doc/:status).each do |status|
      ['id', 'created_at', 'fluff'].each do |el|
        puts "#{el}: #{status.at(el).innerHTML}"
      end
    end

  4. savitri

    Jul 27, 2007

    The script below does not work:

    xml = %{

    }

    doc = Hpricot(xml)
    (doc/:style).each do |style|
      ['id', 'created_at', 'text'].each do |el|
        puts "#{el}: #{status.at(el).innerHTML}"
      end
    end

  5. @savitri – yeah, that doesn’t work because the one element is <text>. There is an issue in Hpricot with elements named text.

    I use something like this in the twitter gem now:

    status.get_elements_by_tag_name('text').innerHTML

  6. You should use Hpricot::XML for processing XML; the default HTML handler doesn’t handle all XML tags properly.

  7. @Steven – Yep. I use the XML method, but I hadn’t updated the article. I will now.


About

Authored by John Nunemaker (Noo-neh-maker), a programmer who has fallen deeply in love with Ruby.
