Parsing XML With Hpricot

Note: I have a comprehensive write-up on parsing XML with Ruby which provides how-tos for Hpricot, libxml-ruby and REXML.

Parsing XML has always annoyed and confused me (well, at least in PHP it did). When I switched to Ruby, I first learned how to get the job done with REXML, but I was left wanting something more, something easier. Due to the gentle prodding of Chris on this err post, I decided to use Hpricot to parse XML on my last web service binge, a.k.a. the twitter gem I just released.

I found Hpricot easy to use and extremely quick. I was instantly sold. Heck, it made me want to go find some massive XML file and parse it just for fun. OK, maybe not that far. Check out the simple example below:

require 'rubygems'
require 'hpricot'

xml = %{
<status>
  <id>1</id>
  <created_at>a date</created_at>
  <text>some text</text>
</status> 
}

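# Parse the string as XML rather than HTML
doc = Hpricot::XML(xml)
# Find every <status> element, then print each child element's contents
(doc/:status).each do |status|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end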

Does it get any easier? Sometimes I need to hear something a few times before I actually try it, so I'm confirming what Chris posted about. Hpricot is sweet with XML. Go ahead and try it out. If you don't have any XML lying around, just hit an API somewhere to get some.
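
If the (doc/:status) syntax in the example looks odd, it may help to know that / is an ordinary Ruby method name, so a class can expose a search method under it. The sketch below uses a made-up ToyDoc class (not Hpricot's actual implementation, and the element data is invented for illustration) to show how such an alias behaves.

```ruby
# Toy stand-in for a parsed document. This is a hypothetical class for
# illustration only, not Hpricot's real implementation.
class ToyDoc
  def initialize(elements)
    @elements = elements # array of [name, text] pairs
  end

  # Return the text of every element whose name matches.
  def search(name)
    @elements.select { |el_name, _| el_name == name.to_s }.map { |_, text| text }
  end

  # Expose search under the operator name "/", so doc/:id calls search(:id).
  alias_method :/, :search
end

doc = ToyDoc.new([['id', '1'], ['text', 'some text']])

by_operator = doc/:id       # operator form
by_method = doc.search(:id) # plain method form

puts by_operator.inspect
puts by_method.inspect
```

Both calls go through the same method, which is why the operator form and the plain method form return the same thing.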

7 Comments

jason
Dec 10, 2006

Could someone pls explain what this syntax is doing?

(doc/:status).each
John Nunemaker
Dec 10, 2006

/ is actually a method call. It is an alias for Hpricot's search method, so doc.search(:status).each would produce the same result. Why added it simply because it is some sweet syntactic sugar.
Benedikt
Jul 05, 2007

The example above does not work in irb with Ruby 184-20. The ‘text’ tag is for some reason not found, parses into a nil class. Changing the tag and the reference in the array below to anything else e.g. ‘fluff’ takes care of the problem. Any idea why?

xml = %{
<status>
  <id>1</id>
  <created_at>a date</created_at>
  <fluff>some text</fluff>
</status>
}

doc = Hpricot(xml)
(doc/:status).each do |status|
  ['id', 'created_at', 'fluff'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
savitri
Jul 27, 2007

The script below does not work:

xml = %{

}

doc = Hpricot(xml)
(doc/:style).each do |style|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
John Nunemaker
Jul 27, 2007

@savitri – yeah, that doesn’t work because the one element is <text>. There is an issue in Hpricot with elements named text.

I use something like this in the twitter gem now:

status.get_elements_by_tag_name('text').innerHTML
Steven Soroka
Dec 13, 2007

You should use Hpricot::XML for processing XML; the default HTML handler doesn’t handle all XML tags properly.
John Nunemaker
Dec 23, 2007

@Steven – Yep. I use the xml method but I hadn’t updated the article. I will now.