lightweight xml parser

I have added a lightweight XML parser to the utils package. Yes, I realize writing a new XML parser is a little crazy, but I had a hunch I could come up with something small and fast, and it was a good exercise of my Ragel skills.

The Xml class parses using SAX-like events and builds a DOM by default. It supports a subset of XML features: elements, attributes, text, predefined entities, CDATA, mixed content. Namespaces are parsed as part of the element or attribute name. Prologs and doctypes are ignored. Only 8-bit character encodings are supported. Input is assumed to be well formed.

So, how does it perform? I compared it to javax.xml.parsers.DocumentBuilder on both the desktop and Android. On the desktop, it is slightly faster for small (1MB) XML documents, and slightly slower for very large (100MB) documents. On Android it is much faster on small documents. Android choked on the 100MB file, so I couldn’t test it. Here are the benchmark results, in seconds:

Android, 1MB:

Java: 1.010742206
Nate: 0.149536137

Desktop, 1MB:

Java: 0.057415785
Nate: 0.044828239

Desktop, 100MB:

Java: 2.189287045
Nate: 3.054689226

The entire XML parser was written in 251 lines of code using Ragel, a FSM compiler. Here is the core of the grammar:

attribute = ^(space | ‘/’ | ‘>’ | ‘=’)+ >buffer %attributeName space* ‘=’ space*
((‘\” ^’\”* >buffer %attribute ‘\”) | (‘”‘ ^'”‘* >buffer %attribute ‘”‘));
element = ‘<' space* ^(space | '/' | '>‘)+ >buffer %elementStart (space+ attribute)*
:>> (space* (‘/’ %elementEndSingle)? space* ‘>’ @element);
elementBody := space* <: ((^'<'+ >buffer %text) <: space*)? element? :>> (‘<' space* '/' ^'>‘+ ‘>’ @elementEnd);
main := space* element space*;

It’s similar to writing regex, except the names like “buffer” and “attribute” are named snippets of Java code that get executed during various FSM state changes. These snippets collect pieces of input, build the DOM, etc.

Ragel is really cool! Now that I’ve used it for TableLayout and this XML parser, I feel comfortable using it to quickly whip up DSLs.

-Nate

6 thoughts on “lightweight xml parser

  1. Good for you.

    But I’m afraid, that introducing a new unknown language or technology into already known framework, makes parts of it hard to maintain for regular java developer in the future.

  2. Hi Peter,

    I think that Regel is not a problem for the framework. It uses own syntax but it produces Java code in the end. If libgdx team refuses to use Regel in the future, they can edit generated source codes for their benefit. Futhemore it is insignificant wheter code is auto-generated or hard-coded for programmers using libgdx as black box API.

    The performace of Nate’s parser is stunning. Lightweight, target-specific logic implemenations makes difference in software products. On the other hand, I start to worry about SQA of the project. IMHO final release of libgdx 1.0 should contains at least basic set of unit tests (I’m not sure if CI from nice guys at l33tlabs.org supports automated testing of releases…). But I (personally) know, that test writing is so boooring. 🙂

  3. Unit tests are possible for utils classes. Unit testing graphical output (and thus 70% of the code of libgdx) is pretty much an impossibility i’m afraid. Testing on the CI server is no problem, TestNG and JUnit work just fine. The problem is that a lot of stuff is not headless and can thus not be tested automatically. I welcome all suggestions on that matter.

    Maintainability is a different story of course. I’m a “regular” Java developer in my dayjob, but i wouldn’t even begin to hope that any of my co-workers could pick up the project. Not because it’s to complex or because they are idiots, on the contrary. The reason is that the technology stack (and by this i don’t mean Regel, but things like OpenGL (ES)) is completely unknown to them. For game developers with some OpenGL experience it should be rather easy to traverse and make sense of the source, and that would be the target demographic to chose future maintainers from, e.g. people from http://www.java-gaming.org

  4. To give more meaningful performance numbers, you really should compare against streaming XML parsers — if this is for android, it comes bundled with xpp, but standard stax xml parsers (Woodstox, Aalto) can also be used. Cost of building a dom tree is pretty similar for all, and dominates any performance test.

  5. I’m going to use ragel to write the world’s fastest xml parser. You have probably already done 90% of the work.

    If I get it done and I can use your state machine, I’ll credit you.

    Neal

Leave a Reply

Your email address will not be published.