August 11, 2008
Parsing XML with Ruby
Just for kicks and giggles, I decided to parse XML with each of the main Ruby libraries (REXML, Hpricot, libxml-ruby) so I could see how they differ in both API (getting at elements and attributes) and speed. I parsed two different XML formats. The first, Delicious, uses an attribute-based approach, and the second, Twitter, uses an element-based one. The XML files listed below show the difference.
Note: This is not a scientific benchmark. The goal is to get a feel for each library and how you traverse XML nodes with it.
The XML
Here are the files I used for reference. You’ll have to view source once you click on one of these links to actually see the xml.
- posts.xml – Uses one XML element per object (post) and XML attributes for the object's attributes
- timeline.xml – Uses one XML element per object (status) and child XML elements for the object's attributes
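To make the difference between the two shapes concrete, here is a minimal sketch using REXML on two tiny made-up documents. The sample XML, the constants, and the helper names `post_attributes` and `status_fields` are assumptions for illustration only; the real posts.xml and timeline.xml are larger and carry more fields. It also assumes REXML is available (it ships with Ruby, as a bundled gem in recent versions).

```ruby
require 'pp'
require 'rexml/document'

# Hypothetical miniature samples standing in for posts.xml and timeline.xml.
ATTRIBUTE_STYLE = <<~XML
  <posts user="example">
    <post href="http://example.com/" description="Example" tag="ruby xml"/>
  </posts>
XML

ELEMENT_STYLE = <<~XML
  <statuses>
    <status>
      <id>1</id>
      <text>Parsing XML</text>
      <user><screen_name>example</screen_name></user>
    </status>
  </statuses>
XML

# Attribute style: the data lives on the <post> element itself,
# so each post becomes a hash of attribute name => value.
def post_attributes(xml)
  doc = REXML::Document.new(xml)
  doc.elements.to_a('posts/post').map do |post|
    hash = {}
    post.attributes.each { |name, value| hash[name] = value }
    hash
  end
end

# Element style: each value is the text of a child element,
# so every field needs its own lookup.
def status_fields(xml)
  doc = REXML::Document.new(xml)
  doc.elements.to_a('statuses/status').map do |status|
    {
      id: status.elements['id'].text,
      text: status.elements['text'].text,
      screen_name: status.elements['user'].elements['screen_name'].text
    }
  end
end

pp post_attributes(ATTRIBUTE_STYLE)
pp status_fields(ELEMENT_STYLE)
```

The attribute-style document needs one loop over the attributes, while the element-style one needs a lookup per field, which is why the Twitter scripts below are longer than the Delicious ones.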
REXML
Pros: In the standard library
Cons: Slow, I don’t like the name
%w[benchmark pp rexml/document].each { |x| require x }

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = REXML::Document.new(xml), []
  doc.elements.each('posts/post') do |p|
    posts << p.attributes
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = REXML::Document.new(xml), []
  doc.elements.each('statuses/status') do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.elements[a].text
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.elements['user'].elements[a].text
    end
    statuses << h
  end
  # pp statuses
}
Hpricot
Pros: Cool name, created by _why, faster than REXML, also does HTML, creative API
Cons: Not as fast as libxml-ruby, more of an HTML parser linguistically (ie: uses innerHTML instead of text or content, etc.)
%w[benchmark pp rubygems].each { |x| require x }
gem 'hpricot', '>= 0.6'
require 'hpricot'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = Hpricot::XML(xml), []
  (doc/:post).each do |p|
    posts << p.attributes
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = Hpricot::XML(xml), []
  (doc/:status).each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.at(a).innerHTML
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.at('user').at(a).innerHTML
    end
    statuses << h
  end
  # pp statuses
}
libxml-ruby
Pros: Blistering fast
Cons: Hpricot has cooler name, REXML and Hpricot both feel easier to use out of the box
%w[benchmark pp rubygems].each { |x| require x }
gem 'libxml-ruby', '>= 0.8.3'
require 'xml'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  # Create the parser before calling parser.string= so the setter always
  # has a receiver (a combined multiple assignment fails on Ruby 3.1+).
  parser = XML::Parser.new
  parser.string = xml
  doc, posts = parser.parse, []
  doc.find('//posts/post').each do |p|
    # Convert the attribute list into a plain name => value hash.
    posts << p.attributes.inject({}) { |h, a| h[a.name] = a.value; h }
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  parser = XML::Parser.new
  parser.string = xml
  doc, statuses = parser.parse, []
  doc.find('//statuses/status').each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.find(a).first.content
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.find('user').first.find(a).first.content
    end
    statuses << h
  end
  # pp statuses
}
Conclusion
I’ll probably start using libxml-ruby, but Hpricot is more fun (and I’ve used it a ton). If you are curious, this was the output from the scripts above on my machine (the columns are user, system, total, and real time in seconds).
=rexml
delicious   0.020000   0.000000   0.020000 (  0.021139)
twitter     0.940000   0.020000   0.960000 (  0.988666)
=hpricot
delicious   0.010000   0.000000   0.010000 (  0.005548)
twitter     0.250000   0.010000   0.260000 (  0.258320)
=libxml-ruby
delicious   0.000000   0.000000   0.000000 (  0.007829)
twitter     0.030000   0.010000   0.040000 (  0.034040)
The Twitter script is most likely slower because of the loops and hash building. I doubt it has much to do with the actual parsing, though the file is larger and would take a bit longer to parse.
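One way to check that hunch is to time the parse and the hash building separately. The sketch below does this with REXML and `Benchmark.realtime`. The helper names `sample_timeline` and `time_phases` are hypothetical, and the generated document only stands in for timeline.xml, so the numbers it prints will not match the ones above.

```ruby
require 'benchmark'
require 'rexml/document'

# Build a hypothetical element-style document with `count` statuses,
# standing in for timeline.xml.
def sample_timeline(count)
  statuses = (1..count).map do |i|
    "<status><id>#{i}</id><text>status #{i}</text>" \
      "<user><id>#{i}</id><screen_name>user#{i}</screen_name></user></status>"
  end
  "<statuses>#{statuses.join}</statuses>"
end

# Time parsing and hash building separately.
# Returns [parse_seconds, walk_seconds, statuses].
def time_phases(xml)
  doc = nil
  statuses = []

  # Phase 1: turn the XML string into a document tree.
  parse_seconds = Benchmark.realtime { doc = REXML::Document.new(xml) }

  # Phase 2: walk the tree and build one hash per status.
  walk_seconds = Benchmark.realtime do
    doc.elements.each('statuses/status') do |s|
      statuses << {
        id: s.elements['id'].text,
        text: s.elements['text'].text,
        user: { screen_name: s.elements['user'].elements['screen_name'].text }
      }
    end
  end

  [parse_seconds, walk_seconds, statuses]
end

parse_seconds, walk_seconds, built = time_phases(sample_timeline(200))
puts format('parse: %.6f  walk: %.6f  statuses: %d', parse_seconds, walk_seconds, built.size)
```

Running the same split against the real files, and against each library, would show whether the traversal or the parse accounts for most of the time.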
9 Comments
Aug 12, 2008
Hi, did you already check this post: http://thebogles.com/blog/an-hpricot-style-interface-to-libxml/? It uses libxml in an Hpricot way. It looks nice.
Aug 12, 2008
@jney – Sweet! No I hadn’t viewed that. Thanks for the link.
Aug 12, 2008
Have you looked @ SimpleXML?
Aug 12, 2008
Since many web services also provide JSON feeds, have you done any benchmarking of libxml vs. json (and json-pure)?
Aug 13, 2008
@Kunal – xml-simple uses rexml under the hood and I’m technically using it with HTTParty as I’m using Active Support which uses xml-simple. So yep, I’ve looked at it but it’s going to have the same speed issues as REXML.
Aug 14, 2008
Since you work so much with XML in ruby, was wondering if you have come across any ruby library that does SAX with Pull Parsing? Just like StAX in Java?
Aug 14, 2008
HTTParty might benefit from the work I did replacing xml-simple in ActiveSupport in favor of libxml-ruby here.
I found significant performance improvements for relatively little work, with these modifications.
Aug 15, 2008
I definitely think libxml-ruby with a nicer API (kinda like hpricot, but more xml oriented) is the way to go! Would be cool if we could standardize something like this.
StAX would also be cool I guess, at least to have something to show the suits-people :)
Aug 22, 2008
There is an innerText method in Hpricot you can use instead of innerHTML. I recently found out that innerText converts entities (e.g. &amp;amp; to &amp;) whereas innerHTML does not.