Wrapping libxml2 for Swift – The Red Queen Coder

I have spent the last month or so working on a project where I am wrapping a legacy C toolkit, libxml2, in Swift. I have created a sample project here where you can obtain wrapper classes written in Swift allowing you to utilize libxml2 into your code without having to touch C.

My job at SonoPlot is to write control software for our robotics systems. One of our included pieces of software is a CAD program for users to use that creates XML pattern documents that that are parsed by our control software so that it can tell the robotics how to draw a pattern.

Since some of our pattern files can be very large and complex, it was important to us to be able to use the fastest XML parsing we have available. I benchmarked three different ways of parsing XML:

Tree-Based Parsing using NSXMLDocument
Event-Based Parsing using NSXMLDocument
Tree-Based Parsing using libxml2

I had wanted to include event-based parsing with libxml2, but that utilizes C function pointers, which are not currently possible with Swift.

I have included the profiler in my sample code for this project, but in case you don’t feel like running it yourself, my benchmarking showed that using libxml2 was four times faster than tree-based NSXMLDocument parsing and three times faster than event-based NSXMLDocument parsing.

Clearly, there is a large difference between using libxml2 and any permutation of NSXMLDocument.

What is libxml2?

libXML2 is a toolkit for parsing XML that is written in C. Back when the first iPhones came out, libXML was highly utilized because it was incredibly fast and the first iterations of the iPhone were not particularly powerful.

As the iPhone has become more powerful and the number of frameworks has become more robust, libXML2 has fallen out of use. Since it is a C toolkit, you are taking responsibility for your memory management. It wasn’t written for a specific language, so it is rather sprawling and has a lot of obtuse documentation. It’s also rather old. When I was looking into it they had sample code written in Pascal.

Basically, as things like NSXMLParser and the use of JSON became more prevalent and easier to work with, there hasn’t really been a need for most people to bother with this difficult toolkit.

Hooking Up libXML and the Wrapper Classes

libXML is included on your machine, but it isn’t included by default in Xcode. It will need to be important and linked to your project for Xcode to be able to see it.

Go to your project and find your Build Settings in your Project. Search your Build Settings for “Header Search Paths”. In your project’s search paths, add the following:

$(SDKROOT)/usr/include/libxml2

While still in your Build Settings, search for “Other Linker Flags.” Add this line to your linker flags:

-lxml2

Now go to your Targets and find your Build Phases. Find the tab that says “Link Binaries with Library”. Click on the “+” sign to load a new library to link to the project. Search for “libxml2.” There should be two results. Add both of those results to your project.

Lastly, you will need to import an Objective-C bridging header into your project. The Objective-C bridging header is a file that exists in this project, so you just need to drag it from the sample project over to your project.

After the bridging header is added to your project, go back to your Build Settings and search for “Objective-C Bridging Header”. You will need to add a line that has your project name and the bridging header name in it. For example, in the sample project LibXMLWrapperExample, the line added to the bridging header was:

LibXMLWrapperExample/Bridging-Header.h

A really good README going over this process is here. If any of my steps aren’t clear enough or something isn’t working quite right, use this as a backup for the instructions to hook this up. It can be incredibly obtuse and easy to forget one little step and have the whole thing not work.

Also, the way that I verified this worked was to look at a project I got this working on and I searched both the Target and the Project for “xml” to see where I had added search paths and linker files. This is a good way to do a sanity check to make sure that you didn’t forget anything.

LibXMLDoc

The first wrapper class I created is the simplest one: LibXMLDoc. In XML, you have a tree that is made up of nodes. In libxml2 all of the nodes are contained in an xmldoc. You need two things from the LibXMLDoc class : a document and a root node. Once you are able to extract the root node, you are able to traverse the entire tree to find anything that you are looking for.

Since we are dealing with C, we are responsible for our own memory allocation and deallocation. I had to remember to deallocate the LibXMLDoc when we are done with it.

LibXMLNode

The second class, LibXMLNode, is where we do a lot of the heavy lifting. This class is responsible for finding and extracting values from our XML document.

The first property I needed to set up was nodeName. nodeName is going to be the primary way we are going to be accessing and dealing with each node, so this is a pretty important value to have easy access to.

Since libxml2 is a C toolkit, the type we are receiving is going to be a C string. The other thing complicating this is that libxml2 is set up to deal primarily with pointers to their node and document objects. We need to take a pointer to a C string and somehow do some alchemy on it to convert it to a Swift string. Swift strings have a method on them called “fromCString” that I was able to use. Also when dealing with C string types, you need to use unsafe mutable pointers. I also had to figure out how to navigate through a C pointer to access the actual values that they were referencing.

I was able to get this all down to one line of code that did all of these things:

return String.fromCString(UnsafePointer(self.xmlNode.memory.name))!

From there, I needed to think about what we need to do with a node. Every node we encounter is generally going to be one of two ways:

It will be a parent node that contains children nodes but no value.
It will be a child node that might contain a value but has no children.

I set up two lazy properties in the LibXMLNode class to deal with these two eventualities: nodeValue and nodeChildren.

nodeChildren takes a LibXMLNode as a parameter and iterates through that node’s children until it encounters “nil.” It then returns an array of LibXMLNode objects.

The first time I wrote this code, I was getting many more nodes than I was expecting to get. By printing out the array of node children I found that every other node was a node, which mean that if I was expecting five children nodes, I was actually receiving eleven, because the first and last nodes were nodes and every other node in between was also . This was rather inconvenient, so I found a libxml function that checks to see if the node is a node. If it is, the function returns 1. I do a check on each node and if the function returns 0, the node is added to our LibXMLNode array.

nodeValue covers the other contingency where you have a node that contains a value that needs to be extracted. Since it is possible to have empty XML tags that do not contain values, this property has to be optional. We need to extract the value and look inside to see if we have anything in there.

I used the libxml2 function “xmlNodeListGetString,” which takes a document, a node’s children, and their index as parameters and returns either a C string or nil. If there is a value, we use the code we used in the nodeName function to extract a Swift string, we free the C string, and we return the Swift string. If the function returns nil, we return nil.

Raiders of the Lost ARC

So I was all excited that I figured all of this stuff out and I was ready to start testing my classes. I wrote a bunch of tests that all failed immediately.

While looking over the tests, I realized that every time I tried to access any of the properties on the root node, they were failing because the root node didn’t exist.

I was becoming incredibly confused and frustrated when Brad had me add a println() in LibXMLDoc between when we initialize the document and when we initialize the root node. It turns out that ARC was deleting the LibXMLDoc immediately after it was being initialized because it wasn’t being held on to or referenced anywhere. D’oh!

In libxml, the document controls the memory for all of the nodes, not the nodes themselves. Basically the way I was writing the code was that I only referenced the document once, where it gets initialized. From there, since there were no other references to the document, ARC deleted it along with all of the nodes it contains.

That was a problem. Since the document contains all of our nodes, we really need it to stick around until we are finished with extracting all of the nodes and their values. We have to create a LibXMLNode instance in the LibXMLDoc class to hold the root node, we couldn’t just have the LibXMLNode class point to the LibXMLDoc class. We had to have them point to one another without creating a retain cycle, so the LibXMLNode class has a strong reference back to the LibXMLDoc class to prevent the instance from being deallocate before we are done with it. I then went back to LibXMLDoc and make the reference to the LibXMLNode a weak reference to avoid a retain cycle.

There is still some juggling that needs to be done in order to make sure we are able to prevent the document from being eliminated and that we are able to get the root node.

The solution utilized here was to replace the strongly referenced LibXMLNode root node with a private, internal weak root node and a computed property checking to see if this internal root node has been set yet. If it has, it is returned. If it hans’t, we extract the root node, set it to the internal root node, and return it. Since computed properties are basically methods that look like properties, for all intents and purposes we are replacing a strong reference with a weak reference and a method.

We are trying to resolve a few things by taking this path. First, we are trying to avoid having the root node and the document deallocate before we can use them. Second, since we are using a class and a class is a reference type, we want to make sure that we only create the root node once rather than having a bunch of instances all pointing back to the same memory location.

And this, kids, is why you still need to think about memory management and ARC even if you started coding after iOS 5, like I did.

…and knowing is half the battle!

Adding Bundled XML Files

The last part of this sample application that I want to cover is including XML files rather than accessing them from an internet URL.

I have included in my sample project a relatively simple pattern file generated by our CAD program to use as an example.

One thing you have to remember to do when you include an XML file with your program is that you have to include it in the application bundle. I have included a convenience function in the LibXMLDoc class called “bundleForResource” that takes the the resource name and returns an NSURL. This can then be passed into the parser where it asks for the URL of the resource.

You also have to make sure that your file resource shows up in “Copy Bundle Resources” in “Build Phases.” My original attempt at creating a sample project was trying to make this a command line application, but I wasn’t able to copy the bundle resources (because there was no bundle) and it generated an IO error.

The last convenience function I included in the LibXML wrapper classes is the “outputXMLTree” function. This function is recursive and it walks through the document tree checking each node to see if it has children or values. I am using this function in the App Delegate to demonstrate the the classes have in fact parsed the included document correctly.

Conclusions

Before attempting to work on this project, I decided I was going to avoid dealing with C in Swift as much as humanly possible. Considering the nature of what we do here, where we use serial communication in C with our control software, that was an incredibly stupid and wrong-headed way to approach things.

Yes, it is a little more complicated. It required some more work, some tenacity, and some help from my boss who has been around the block a few more times than I have. As much as part of me would like to not think about it, C isn’t going anywhere. I want to work with micro controllers and firmware in the future and deciding not to get C to play nicely with Swift is basically a non-starter for the projects I would like to do.

Even if you think that XML parsing is as boring as dry toast, hopefully the code will help you with figuring out how to integrate older C code into your projects or at least give you a clean way to add better XML parsing to your applications.

Again, the sample code associated with this post is here.

Hit me up on Twitter if you have any questions about it.