August 11, 2008

Posted by John

Tagged gems, hpricot, libxml-ruby, rexml, and xml

Parsing XML with Ruby

Just for kicks and giggles, I decided to parse XML with each of the main libraries in Ruby (REXML, Hpricot, libxml-ruby), so I could see the differences between them in both API (getting at elements and attributes) and speed. I tried two different XML formats. The first, Delicious, uses an attribute-based approach, and the second, Twitter, uses a more element-based one. If you look at the XML files linked below, the previous sentence might make more sense.

Note: This is not meant to be a scientific benchmark, but rather a way to get a feel for each of the libraries and how you traverse XML nodes with them.

The XML

Here are the files I used for reference. You’ll have to view source once you click on one of these links to actually see the xml.

  • posts.xml – Uses an XML element for each object (post) and XML attributes for the object's attributes
  • timeline.xml – Uses an XML element for each object (status) and child XML elements for the object's attributes
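
To make the two shapes concrete, here is a minimal sketch using REXML from the standard library. The XML snippets are made-up, trimmed-down stand-ins (not the real Delicious or Twitter responses), and the constant and method names are my own for illustration.

```ruby
require 'rexml/document'

# Hypothetical samples; the real API responses carry many more fields.
ATTRIBUTE_STYLE = <<~XML
  <posts user="example">
    <post href="http://example.com/" description="Example" tag="ruby xml"/>
  </posts>
XML

ELEMENT_STYLE = <<~XML
  <statuses>
    <status>
      <id>1</id>
      <text>Parsing XML</text>
      <user><screen_name>example</screen_name></user>
    </status>
  </statuses>
XML

# Attribute style: each object's data lives in the element's attributes.
def attribute_style_posts(xml)
  posts = []
  REXML::Document.new(xml).elements.each('posts/post') do |post|
    h = {}
    post.attributes.each { |name, value| h[name] = value }
    posts << h
  end
  posts
end

# Element style: each object's data lives in child elements.
def element_style_statuses(xml)
  statuses = []
  REXML::Document.new(xml).elements.each('statuses/status') do |status|
    statuses << {
      id: status.elements['id'].text,
      text: status.elements['text'].text,
      user: { screen_name: status.elements['user/screen_name'].text }
    }
  end
  statuses
end

p attribute_style_posts(ATTRIBUTE_STYLE)
p element_style_statuses(ELEMENT_STYLE)
```

With the attribute style you get a whole object from one node, while the element style means one lookup per field, which is why the Twitter scripts below are longer.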

REXML

Pros: In the standard library
Cons: Slow, I don’t like the name

%w[benchmark pp rexml/document].each { |x| require x }

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = REXML::Document.new(xml), []
  doc.elements.each('posts/post') do |p|
    posts << p.attributes
  end
  # pp posts
}


################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = REXML::Document.new(xml), []
  doc.elements.each('statuses/status') do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.elements[a].text
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.elements['user'].elements[a].text
    end
    statuses << h
  end
  # pp statuses
}

Hpricot

Pros: Cool name, created by _why, faster than REXML, also does HTML, creative API
Cons: Not as fast as libxml-ruby, more of an HTML parser linguistically (i.e., uses innerHTML instead of text or content, etc.)

%w[benchmark pp rubygems].each { |x| require x }
gem 'hpricot', '>= 0.6'
require 'hpricot'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = Hpricot::XML(xml), []
  (doc/:post).each do |p|
    posts << p.attributes
  end
  # pp posts
}


################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = Hpricot::XML(xml), []
  (doc/:status).each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.at(a).innerHTML
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.at('user').at(a).innerHTML
    end
    statuses << h
  end
  # pp statuses
}

libxml-ruby

Pros: Blisteringly fast
Cons: Hpricot has cooler name, REXML and Hpricot both feel easier to use out of the box

%w[benchmark pp rubygems].each { |x| require x }
gem 'libxml-ruby', '>= 0.8.3'
require 'xml'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
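  # Assign the parser before setting its string; doing both in one parallel
  # assignment relies on the order in which Ruby evaluates the left-hand side.
  parser = XML::Parser.new
  parser.string = xml
  doc, posts = parser.parse, []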
  doc.find('//posts/post').each do |p|
    posts << p.attributes.inject({}) { |h, a| h[a.name] = a.value; h }
  end
  # pp posts
}


################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
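  # Assign the parser before setting its string; doing both in one parallel
  # assignment relies on the order in which Ruby evaluates the left-hand side.
  parser = XML::Parser.new
  parser.string = xml
  doc, statuses = parser.parse, []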
  doc.find('//statuses/status').each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.find(a).first.content
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.find('user').first.find(a).first.content
    end
    statuses << h
  end
  # pp statuses
}

Conclusion

I’ll probably start using libxml-ruby but Hpricot is more fun (and I’ve used it a ton). Oh, if you are curious, this was the output from the scripts above on my machine.

=rexml
delicious     0.020000   0.000000   0.020000 (  0.021139)
twitter       0.940000   0.020000   0.960000 (  0.988666)

=hpricot
delicious     0.010000   0.000000   0.010000 (  0.005548)
twitter       0.250000   0.010000   0.260000 (  0.258320)

=libxml-ruby
delicious     0.000000   0.000000   0.000000 (  0.007829)
twitter       0.030000   0.010000   0.040000 (  0.034040)

The Twitter one is slower, most likely because of the loops and hashes. I doubt it has much to do with the actual parsing, though it is a larger file, so the parsing itself would be a bit slower too.
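
One way to check that hunch would be to time the parse and the traversal separately. Here is a rough sketch with REXML and Benchmark from the standard library. The generated document is a made-up stand-in for timeline.xml, the helper names are hypothetical, and the numbers will vary by machine, so treat it as a starting point rather than a measurement.

```ruby
require 'benchmark'
require 'rexml/document'

# Build a hypothetical element-style document with `count` statuses.
def sample_statuses_xml(count)
  statuses = (1..count).map do |i|
    "<status><id>#{i}</id><text>status #{i}</text></status>"
  end
  "<statuses>#{statuses.join}</statuses>"
end

# Walk an already-parsed document and build one hash per status.
def statuses_from(doc)
  result = []
  doc.elements.each('statuses/status') do |s|
    h = {}
    %w[id text].each { |a| h[a.intern] = s.elements[a].text }
    result << h
  end
  result
end

# Time parsing and traversal separately; returns Benchmark::Tms objects by label.
def time_parse_and_traverse(xml)
  doc = nil
  parse = Benchmark.measure { doc = REXML::Document.new(xml) }
  traverse = Benchmark.measure { statuses_from(doc) }
  { parse: parse, traverse: traverse }
end

time_parse_and_traverse(sample_statuses_xml(200)).each do |label, tms|
  puts "#{label.to_s.ljust(10)}#{tms}"
end
```

If the traverse line dominates, the loops and hash building are the cost; if the parse line dominates, it is the parser.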

9 Comments

  1. hi, did you already check that post: http://thebogles.com/blog/an-hpricot-style-interface-to-libxml/, that is using libxml in hpricot way. it looks nice.

  2. @jney – Sweet! No I hadn’t viewed that. Thanks for the link.

  3. Kunal Parikh

    Aug 12, 2008

    Have you looked @ SimpleXML?

  4. Since many web services also provide JSON feeds, have you done any benchmarking of libxml vs. json (and json-pure)?

  5. @Kunal – xml-simple uses rexml under the hood and I’m technically using it with HTTParty as I’m using Active Support which uses xml-simple. So yep, I’ve looked at it but it’s going to have the same speed issues as REXML.

  6. Rajmohan

    Aug 14, 2008

    Since you work so much with XML in ruby, was wondering if you have come across any ruby library that does SAX with Pull Parsing? Just like StAX in Java?

  7. HTTParty might benefit from the work I did replacing xml-simple in ActiveSupport in favor of libxml-ruby here.

    I found significant performance improvements for relatively little work, with these modifications.

  8. Soleone

    Aug 15, 2008

    I definitely think libxml-ruby with a nicer API (kinda like hpricot, but more xml oriented) is the way to go! Would be cool if we could standardize something like this.

    StAX would also be cool I guess, at least to have something to show the suits-people :)

  9. There is an innerText method in Hpricot you can use instead of innerHTML. Recently I even found out that innerText converts entities (e.g. &amp;amp; to &amp;) whereas innerHTML does not.

About

Authored by John Nunemaker (Noo-neh-maker), a programmer who has fallen deeply in love with Ruby.
