December 09, 2006
Parsing XML With Hpricot
Note: I have a comprehensive write-up on parsing XML with Ruby, which provides how-tos for Hpricot, libxml-ruby and REXML.
Parsing XML has always annoyed and confused me (well, at least in PHP it did). When I switched to Ruby, I first learned how to get the job done with REXML, but I was left wanting something more, something easier. Thanks to the gentle prodding of Chris on this err post, I decided to use Hpricot to parse XML on my last web service binge, a.k.a. the twitter gem I just released.
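For anyone who has not seen REXML, here is a rough sketch of what that first approach can look like, using the same small status document as the Hpricot example in this post. The constant SAMPLE_XML and the helper status_lines are names made up for illustration. Treat it as one possible way to write it, not the way the twitter gem ever did it.

```ruby
require 'rexml/document'

# Hypothetical sample data, matching the Hpricot example in this post.
SAMPLE_XML = %{
<status>
  <id>1</id>
  <created_at>a date</created_at>
  <text>some text</text>
</status>
}

# Hypothetical helper: returns one "name: value" string
# for every child element of each <status> element.
def status_lines(xml)
  doc = REXML::Document.new(xml)
  lines = []
  doc.elements.each('status') do |status|
    status.each_element do |child|
      lines << "#{child.name}: #{child.text}"
    end
  end
  lines
end

puts status_lines(SAMPLE_XML)
```

On recent Ruby versions REXML ships as a bundled gem rather than as part of the core standard library, so it may need to be installed separately.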
I found Hpricot easy to use and extremely quick. I was instantly sold. Heck, it made me want to go find some massive XML file and parse it just for fun. OK, maybe not that far. Check out the simple example below:
require 'rubygems'
require 'hpricot'

xml = %{
  <status>
    <id>1</id>
    <created_at>a date</created_at>
    <text>some text</text>
  </status>
}

doc = Hpricot::XML(xml)

(doc/:status).each do |status|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
Does it get any easier? Sometimes it takes me hearing something a few times to actually try it, so I’m confirming what Chris posted about. Hpricot is sweet with XML. Go ahead and try it out. If you don’t have any XML lying around, just hit an API somewhere to get some.
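One part of the example that can look odd at first is (doc/:status). In Ruby, / is an ordinary method, so a class can define it or alias it to another method. The sketch below shows the idea with a toy class. TinyDoc and its data are hypothetical and have nothing to do with Hpricot's real internals; it only illustrates how doc/:status and doc.search(:status) can end up running the same code.

```ruby
# Toy stand-in for a parsed document. NOT Hpricot's implementation.
class TinyDoc
  # elements is an array of [name, text] pairs.
  def initialize(elements)
    @elements = elements
  end

  # Return the text of every element whose name matches.
  def search(name)
    @elements.select { |el_name, _| el_name == name.to_s }
             .map { |_, text| text }
  end

  # Make `doc / :status` call the same code as `doc.search(:status)`.
  alias_method :/, :search
end

doc = TinyDoc.new([['status', 'first'], ['id', '1'], ['status', 'second']])

p(doc / :status)       # => ["first", "second"]
p doc.search(:status)  # => ["first", "second"]
```

Because / is just a method name here, both calls return the same array.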
7 Comments
Dec 10, 2006
Could someone pls explain what this syntax is doing?
(doc/:status).each
Dec 10, 2006
/ is actually a method call. It is an alias for the Hpricot search method, so
doc.search(:status).each
would produce the same result. Why (Hpricot’s author) added it simply because it is some sweet syntactic sugar.
Jul 05, 2007
The example above does not work in irb with Ruby 184-20. The ‘text’ tag is for some reason not found, and the lookup returns nil. Changing the tag, and the reference to it in the array below, to anything else (e.g. ‘fluff’) takes care of the problem. Any idea why?
xml = %{
  <status>
    <id>1</id>
    <created_at>a date</created_at>
    <fluff>some text</fluff>
  </status>
}
doc = Hpricot(xml)
(doc/:status).each do |status|
  ['id', 'created_at', 'fluff'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
Jul 27, 2007
The script below does not work:
xml = %{
}
doc = Hpricot(xml)
(doc/:style).each do |style|
  ['id', 'created_at', 'text'].each do |el|
    puts "#{el}: #{status.at(el).innerHTML}"
  end
end
Jul 27, 2007
@savitri – yeah, that doesn’t work because the one element is named text. There is an issue in Hpricot with elements named text.
I use something like this in the twitter gem now:
status.get_elements_by_tag_name('text').innerHTML
Dec 13, 2007
You should use Hpricot::XML for processing XML; the default HTML handler doesn’t handle all XML tags properly.
Dec 23, 2007
@Steven – Yep. I use the xml method but I hadn’t updated the article. I will now.