Parsing XML With Hpricot

December 9th, 2006

Parsing XML has always annoyed and confused me (well, at least in PHP it did). When I switched to Ruby, I first learned how to get the job done with REXML, but I was left wanting something more, something easier. Thanks to the gentle prodding of Chris on this err post, I decided to use Hpricot to parse XML on my last web service binge, a.k.a. the twitter gem I just released.

I found Hpricot easy to use and extremely quick. I was instantly sold. Heck, it made me want to go find some massive XML file and parse it just for fun. OK, maybe not that far. Check out the simple example below:

require 'rubygems'
require 'hpricot'

xml = %{
<status>
  <id>1</id>
  <created_at>a date</created_at>
  <text>some text</text>
</status> 
}

doc = Hpricot::XML(xml)
(doc/:status).each do |status|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}" 
  end
end

Does it get any easier? Sometimes I need to hear something a few times before I actually try it, so I’m confirming what Chris posted about. Hpricot is sweet with XML. Go ahead and try it out. If you don’t have any XML lying around, just hit an API somewhere to get some.
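
For comparison, here is roughly what the same loop might look like with REXML, the library I started out with. Treat it as a sketch rather than a benchmark or a recommendation: the lines array is only something added here to collect the output, and it assumes REXML is available in your Ruby install.

```ruby
require 'rexml/document'

xml = %{
<status>
  <id>1</id>
  <created_at>a date</created_at>
  <text>some text</text>
</status>
}

doc = REXML::Document.new(xml)

# Collect "name: value" pairs for each <status> element.
lines = []
doc.elements.each('status') do |status|
  %w[id created_at text].each do |el|
    # elements[name] looks up the first child element with that name.
    lines << "#{el}: #{status.elements[el].text}"
  end
end

puts lines
```

It works, but the Hpricot version reads a little lighter to my eye.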

7 Responses to “Parsing XML With Hpricot”

  1. jason Says:

    Could someone pls explain what this syntax is doing?

    (doc/:status).each

  2. John Nunemaker Says:

    / is actually a method call. It is an alias for Hpricot’s search method, so doc.search(:status).each would produce the same result. Why added it simply because it is some sweet syntactic sugar.

  3. Benedikt Says:

    The example above does not work in irb with Ruby 184-20. For some reason the ‘text’ tag is not found, and the lookup returns nil. Changing the tag, and the reference to it in the array below, to anything else (e.g. ‘fluff’) takes care of the problem. Any idea why?

    xml = %{
    <status>
      <id>1</id>
      <created_at>a date</created_at>
      <fluff>some text</fluff>
    </status>
    }

    doc = Hpricot(xml)
    (doc/:status).each do |status|
      ['id', 'created_at', 'fluff'].each do |el|
        puts "#{el}: #{status.at(el).innerHTML}"
      end
    end

  4. savitri Says:

    The script below does not work:

    xml = %{
    <style>
      <id>1</id>
      <created_at>a date</created_at>
      <text>some text</text>
    </style>
    }

    doc = Hpricot(xml)
    (doc/:style).each do |style|
      ['id', 'created_at', 'text'].each do |el|
        puts "#{el}: #{style.at(el).innerHTML}"
      end
    end

  5. John Nunemaker Says:

    @savitri – yeah, that doesn’t work because one of the elements is <text>. There is an issue in Hpricot with elements named text.

    I use something like this in the twitter gem now:

    status.get_elements_by_tag_name('text').innerHTML

  6. Steven Soroka Says:

    You should use Hpricot::XML for processing XML; the default HTML handler doesn’t handle all XML tags properly.

  7. John Nunemaker Says:

    @Steven – Yep. I use the XML method but I hadn’t updated the article. I will now.
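
Since the / question came up in the comments, here is a minimal plain-Ruby sketch of the idea that an operator is just a method and can alias a search method. TinyDoc and its list of [tag, text] pairs are hypothetical stand-ins made up for illustration; this is not how Hpricot is actually implemented.

```ruby
# TinyDoc is a hypothetical stand-in for illustration, not Hpricot's real code.
class TinyDoc
  # nodes is an array of [tag_name, text] pairs.
  def initialize(nodes)
    @nodes = nodes
  end

  # Return the text of every node whose tag name matches the given tag.
  def search(tag)
    @nodes.select { |name, _| name == tag.to_s }.map { |_, text| text }
  end

  # Operators are ordinary methods in Ruby, so "/" can simply alias search.
  alias_method :/, :search
end

doc = TinyDoc.new([['id', '1'], ['text', 'some text']])

via_slash  = doc / :text        # operator form
via_search = doc.search(:text)  # explicit method call

puts via_slash.inspect
puts via_search.inspect
```

Both calls go through the same method, which is why (doc/:status) and doc.search(:status) are interchangeable in the article's example.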
