Crack, The Easiest Way to Parse XML and JSON

An astute reader will remember that a while back, HTTParty divorced both ActiveSupport and the JSON gem in order to make it easier to use all around. With the JSON gem went the last gem dependency, which was kind of cool.

A few days back, it occurred to me that the parsing of XML and JSON that HTTParty used might be handy outside of HTTParty. In the spirit of sharing, I whipped together a new gem, named crack, that contains the XML and JSON parsers that formerly were bundled in HTTParty.

Why Crack?

I figured the name was easy and memorable, which is a requirement for anything I’m going to release. When I thought about parsing XML and JSON, for some reason, *crack*ing the code came to mind and thus crack had a name.

Credits

First, I’d like to make it abundantly obvious that I did not author any of this code. I tweaked it a bit and made sure it had tests, but the XML parsing was extracted from Merb (extlib) and the JSON parsing from Rails (ActiveSupport). I merely packaged them together for all to enjoy. Ok, now that we have that out of the way, let’s move onward.

So I ripped the two parsers out of HTTParty, put them in their own gem, and set that gem as a dependency of HTTParty. HTTParty will still work exactly the same, but if all you need is a really simple way to parse JSON or XML, crack is his name and parsing is his game.

Details

As always these days, I used shoulda and matchy for testing and jeweler to make the gem maintenance easy. That is pretty much it for details on this project. It is focused and simple so there isn’t much behind the scenes.

Installation

I registered a rubyforge project, but I’m waiting for approval. For now, you can get the gem from Github.

sudo gem install jnunemaker-crack -s http://gems.github.com

Usage

It has always slightly annoyed me that the different XML and JSON parsing mechanisms available in Ruby (JSON.parse, ActiveSupport::JSON.decode) all have different APIs. I think parse is the easiest to remember and it is consistent with HappyMapper, another project of mine, so whether you are working with XML or JSON, all you have to remember is parse.

xml = '<posts><post><title>Foobar</title></post><post><title>Another</title></post></posts>'
Crack::XML.parse(xml)
# => {"posts"=>{"post"=>[{"title"=>"Foobar"}, {"title"=>"Another"}]}}

json = '{"posts":[{"title":"Foobar"}, {"title":"Another"}]}'
Crack::JSON.parse(json)
# => {"posts"=>[{"title"=>"Foobar"}, {"title"=>"Another"}]}

That is pretty much all there is to it. Given XML or JSON, you get back a hash. The keys here are simple and consistent. The repository has been up for a couple of days, but I thought I would mention it here as well. If you just want to get dirty and you aren’t worried about performance, crack is a perfect fit.
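Because the result is a plain hash, the usual Hash and Enumerable methods are all you need to work with it. Here is a small sketch that uses a literal copy of the XML result above, so it runs without the gem installed. The `parsed` and `titles` names are just for illustration.

```ruby
# Literal copy of the hash that Crack::XML.parse returned in the example above.
parsed = { "posts" => { "post" => [{ "title" => "Foobar" }, { "title" => "Another" }] } }

# Walk the nested hash with plain Hash and Enumerable methods.
titles = parsed["posts"]["post"].map { |post| post["title"] }

puts titles.inspect
# prints ["Foobar", "Another"]
```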

12 Comments

Guoliang Cao
Apr 01, 2009

This is really nice! In our current project we are writing code to parse our web service xml/json response. I can see that code being replaced by your gem.

One thing that would be nice, IMO, is for the API to sniff the format of the string and choose the right parser, or raise an error. It’s a minor thing though.
John Nunemaker
Apr 01, 2009

@Guoliang Not sure there is any reliable way to detect one or the other, but glad you find it useful.
Elad
Apr 01, 2009

The XML you used in your example is not quite a valid XML document… how will the parser treat the <?xml> header?
Max
Apr 01, 2009

Seems like the simplest (and hopefully most reliable) way to discern whether it’s xml or json is…look at the first character! json = “{”, xml = “<”.

John, thanks for putting this together — even if it is highly derivative. Looks like a good tool.
John Nunemaker
Apr 01, 2009

@Max True, the first character being “<” would denote XML and you could assume JSON otherwise. I can’t think of any time when it wouldn’t.
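
A rough sketch of that first-character idea, for anyone who wants it. `sniff_format` is a hypothetical helper, not something crack provides, and it is stricter than "assume JSON otherwise": it raises when the first non-whitespace character is neither “<” nor an opening “{” or “[”.

```ruby
# Hypothetical helper, not part of crack: guess the format from the first
# non-whitespace character of the string.
def sniff_format(string)
  case string.lstrip[0]
  when "<"      then :xml
  when "{", "[" then :json
  else raise ArgumentError, "cannot tell whether this is XML or JSON"
  end
end

puts sniff_format('<posts><post><title>Foobar</title></post></posts>')
puts sniff_format('{"posts":[{"title":"Foobar"}]}')
```

From there you could hand the string to Crack::XML.parse when the result is `:xml` and to Crack::JSON.parse when it is `:json`. The check only looks at one character, so it says nothing about whether the rest of the document is well formed.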
Randy J Parker
Apr 02, 2009

The caveat at the end of your post “If you just want to get dirty and you aren’t worried about performance…” is too modest. Running your previous benchmarks for other xml parsers makes crack hard to resist.

Defining the fastest parser, libxml = 1, crack is about 2 (half as fast), hpricot = 4 or 5, and rexml is by far the worst at 30.

Memory consumption using your posts.xml & timeline.xml sample data: libxml = 22 MB, crack = 18, hpricot = 23, rexml = 5. I don’t know how much of the memory each parser uses is shared with (already mapped by) a typical rails app. If it turned out that some app wasn’t already mapping most of this footprint, rexml may have an advantage when memory is constrained.

Considering that crack delivers a perfectly convenient ruby hash, while all the other parsers deliver data structures that require futzing around, I think crack is the way to go.
Max
Apr 02, 2009

Hmm…I guess the best/worst thing about this library is that it does simply return a hash. Which means your XML may come out slightly…incomplete.

irb(main):014:0> Crack::XML.parse(%q{the content
})
=> {"post"=>"the content"}

Oh well. Don’t use attributes then?
John Nunemaker
Apr 03, 2009

@Randy Thanks for the comment. Interesting to know that it does that well.

John Nunemaker
Apr 03, 2009

@Max It works with attributes or elements, just not combinations.

Crack::XML.parse('<post foo="123">the contents</post>')
=> {"post"=>"the contents"}

Crack::XML.parse('<post foo="123"></post>')
=> {"post"=>{"foo"=>"123"}}

Thanks for pointing that out though.

Chap
Apr 03, 2009
It looks like XmlSimple had a default option of parsing “posts” into
an array, even if there was only one item. It looks like XmlMini and
HTTParty do not do this.

I’m having problems because I am unsure how many “posts” will be
returned. Any tips on how I can enable this functionality, or a better
way to parse the data?

Here’s a quick example:
```
@xml['response']['post'].each do |p|
  posts << Post.new(p['author'], p['body'])
end
```
This works like a champ unless there is only 1 post and then it
explodes.

(I also posted this to the Google group, but it didn’t look like there was much activity there.)
John Nunemaker
Apr 03, 2009
If there are multiple post elements inside of posts it should turn that into an array. Also if there is a single post but the posts element has an attribute of type="Array" or something like that it will return an array.

If your xml is…
```
<posts><post><title>Foo</title></post></posts>
```
…it is kind of hard to assume that posts should return an array of posts.
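
If you can’t change the XML, one workaround is to normalize the value yourself before iterating. This is a minimal sketch: `ensure_array` is a hypothetical helper, not part of crack, and the hashes below are made-up stand-ins for a parsed response with one post and with two. Note that Ruby’s built-in `Array(value)` is not a substitute here, because it turns a hash into an array of key/value pairs.

```ruby
# Hypothetical helper, not part of crack: always hand back an array,
# whether the parser produced nil, a single hash, or an array of hashes.
def ensure_array(value)
  return [] if value.nil?
  value.is_a?(Array) ? value : [value]
end

# Made-up stand-ins for parsed responses containing one post and two posts.
one  = { "response" => { "post" => { "author" => "a", "body" => "x" } } }
many = { "response" => { "post" => [{ "author" => "a", "body" => "x" },
                                    { "author" => "b", "body" => "y" }] } }

# The same loop now works for both shapes.
ensure_array(one["response"]["post"]).each  { |p| puts p["author"] }
ensure_array(many["response"]["post"]).each { |p| puts p["author"] }
```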
aLLeRNiZo
Apr 15, 2009

Nice and easy parser but…
I think that for complex xmls the xml-mapping is one-way