Parsing XML With Hpricot
December 9th, 2006
Parsing XML has always annoyed and confused me (well, at least in PHP it did). When I switched to Ruby, I first learned how to get the job done with REXML, but I was left wanting something more, something easier. Due to the gentle prodding of Chris on this err post, I decided to use Hpricot to parse XML on my last web service binge, a.k.a. the twitter gem I just released.
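For comparison, a minimal REXML sketch of this kind of parse might look like the following. It is an illustration only, not code from the twitter gem, and the sample XML is assumed to match the status document used in the Hpricot example in this post.

```ruby
require 'rexml/document'

# Sample document (assumed; same shape as the status XML in this post).
xml = <<~XML
  <status>
    <id>1</id>
    <created_at>a date</created_at>
    <text>some text</text>
  </status>
XML

doc = REXML::Document.new(xml)

# Collect "name: value" strings for each child element we care about.
lines = []
doc.elements.each('status') do |status|
  %w[id created_at text].each do |el|
    lines << "#{el}: #{status.elements[el].text}"
  end
end

puts lines
```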
I found Hpricot easy to use and extremely quick. I was instantly sold. Heck, it made me want to go find some massive XML file and parse it just for the fun. OK, maybe not that far. Check out the simple example below:
require 'rubygems'
require 'hpricot'

xml = %{
  <status>
    <id>1</id>
    <created_at>a date</created_at>
    <text>some text</text>
  </status>
}

doc = Hpricot::XML(xml)
(doc/:status).each do |status|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
Does it get any easier? Sometimes it takes me hearing something a few times to actually try it, thus I’m confirming what Chris posted about. Hpricot is sweet with XML. Go ahead and try it out. If you don’t have any XML lying around, just hit an API somewhere to get some.
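The (doc/:status) expression in the example works because / is an ordinary Ruby method, so a class can make it another name for a longer method such as search. Here is a minimal sketch of that idea. TinyDoc is a hypothetical stand-in written for illustration, not Hpricot's actual implementation, and it only matches stored tag names.

```ruby
# TinyDoc is a hypothetical stand-in for a parsed document (not Hpricot's real class).
class TinyDoc
  def initialize(names)
    @names = names
  end

  # Return every stored element name that matches the given tag.
  def search(tag)
    @names.select { |name| name == tag.to_s }
  end

  # Make `doc / :tag` another name for `doc.search(:tag)`.
  alias_method :/, :search
end

doc = TinyDoc.new(%w[status id status])
via_slash  = (doc/:status)
via_search = doc.search(:status)

# Both spellings call the same method, so the results are equal.
puts via_slash == via_search
```

Because the alias points at the same method, the choice between the two spellings is purely about readability.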

December 10th, 2006 at 06:07 PM
Could someone pls explain what this syntax is doing?
(doc/:status).each
December 10th, 2006 at 10:50 PM
/ is actually a method call. It is aliasing the hpricot search method. doc.search(:status).each would produce the same result. Why just added it because it is some sweet syntactical sugar.
July 5th, 2007 at 11:29 AM
The example above does not work in irb with Ruby 184-20. The ‘text’ tag is for some reason not found, parses into a nil class. Changing the tag and the reference in the array below to anything else e.g. ‘fluff’ takes care of the problem. Any idea why?
xml = %{
  <status>
    <id>1</id>
    <created_at>a date</created_at>
    <fluff>some text</fluff>
  </status>
}

doc = Hpricot(xml)
(doc/:status).each do |status|
  ['id', 'created_at', 'fluff'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
July 27th, 2007 at 11:20 AM
The script below does not work:
xml = %{
  <style>
    <id>1</id>
    <created_at>a date</created_at>
    <text>some text</text>
  </style>
}

doc = Hpricot(xml)
(doc/:style).each do |style|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
July 27th, 2007 at 11:34 AM
@savitri – yeah, that doesn’t work because the one element is <text>. There is an issue in hpricot with elements named text.
I use something like this in the twitter gem now:
status.get_elements_by_tag_name('text').innerHTML
December 13th, 2007 at 12:47 AM
You should use Hpricot::XML for processing xml, the default html handler doesn’t handle all xml tags properly.
December 23rd, 2007 at 12:00 AM
@Steven – Yep. I use the xml method but I hadn’t updated the article. I will now.