PHP, DOMDocument, XPath 1.0, Case-Insensitivity, and Performance • Flailing Wildly

TL;DR: How I improved the performance of case-insensitive XPath queries by 30–35%, reducing an 8× performance hit to only 4.5–5×.

Title card for “Hackerman”, a character from the short film “Kung Fury”. http://www.kungfury.com

Parse-at-all-costs

Most feeds are a mess. The old SimplePie “OG” took a parse-at-all-costs philosophy, and could handle many of the most broken feeds you could find — at a cost. While the early versions of SimplePie supported the letter of the RSS 2.0 specification, there were a surprising number of feeds which didn’t.

Once SimplePie started to get popular (2006–2008), we started getting bug reports from users who were working with RSS feeds containing elements such as <pubdate> (instead of <pubDate>) and <managingeditor> (instead of <managingEditor>). At first we told users that the feeds were broken — which they were. But then we started getting enough reports that we decided to do something about it.

Introducing XPath

Fast-forward to the summer of 2017 when I started work on SimplePie NG in earnest. There are a number of things I’m doing differently (read: better) this time around. The first is that the fastest approach is the default approach. A corollary to this principle is that if you want to do more things, you will pay for them with performance penalties.

During my time working at Amazon Web Services on the SDK for PHP, I discovered some substantial performance gains by moving a lot of the response-parsing code to XPath. As such, the core XML parsing in SimplePie NG is all built around DOMDocument and XPath queries.

To solve this case-insensitivity problem, searching Stack Overflow for “case insensitive xpath” tells you about the XPath 2.0 functions matches() and lower-case(). However, I was surprised to learn that PHP only supports XPath 1.0. After doing some digging, the reason appears to be that the underlying libxml2 library only supports XPath 1.0, with no updated support on the horizon.

The only alternative that Google and Stack Overflow had for me was the XPath 1.0 function, translate(). In PHP, the case-insensitive query for the <rss> element would be:

/*[translate(name(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = 'rss']

It’s simple enough to turn this into a pattern inside of a function. Case-insensitive XML parsing. Done. Boo-yah.

Performance-testing

A little while later, I started some early work on benchmarking SimplePie NG. I parsed a number of normal-sized feeds, and got back a bunch of perfectly reasonable results. But one thing that I wanted to test was memory usage to make sure there were no memory leaks.

I put together a quick and dirty test suite by starting with Tim Bray’s feed (one of the more nuanced and complex Atom 1.0 feeds), duplicating the entries to a total of 500 (increasing the size to around 3 MB), and then wrote a test that flexed everything about SimplePie NG that I could think of. I then started running the test over and over again, collecting data about the timing and memory usage, and when the cache kicks-in and the engine warms up.

Test machine and environment

I’m running this on a 2011 “Core i7” MacBook Pro, with 16 GB of RAM and an after-market SATA-III SSD. I have various background processes running, so it isn’t the same as running it on a fresh Linux web server. I also have XDebug enabled, and I’m testing on the CLI where Zend OpCache is disabled.

First pass; Case-insensitive with XPath translate()

The intial results for this 3 MB, 500-entry feed — with case-insensitivity enabled by way of the XPath translate() function — had an average runtime of 26 seconds. That was quite a bit slower than I was hoping for (especially on PHP 7.2), but then again it was a big file with a lot of entries.

Let’s compare to case-insensitivity turned off (i.e., case-sensitive XML parsing).

Second pass; Normal, case-sensitive

The next round of results on the same 3 MB, 500-entry feed—with standard case-sensitive XPath queries—had an average runtime of 3.5 seconds. That’s a lot better.

To do some quick math, the normal query took only 14% of the amount of time it took to do a case-insensitive query. Or, put another way, the case-insensitive query took around 7.5× longer than the normal query. That’s awful!

Experimentation

I had to find a way to improve the performance of the case-insensitive XPath query. Could I reduce the number of times I had to call translate()?

XSLT

I tried experimenting with XSLT for a few days. The goal was to transform the XML once with XSLT into a new XML document where all elements were lowercase, then I could just use regular XPath queries and avoid translate() all-together.

Overall, I still think this is a fantastic idea if you know where your XML data is coming from. Unfortunately for me, I don’t, and I was completely unable to craft an appropriate XSLT template that would allow me to convert all tag names to lowercase without breaking a bunch of other things (e.g., entities). I ended up having to abandon this path.

Enabling PHP functions in XPath

I only dabbled with this briefly, but there was no discernable performance improvement that I can recall. Also, the PHP documentation is lacking around this feature, so it was a lot of trial and error.

Simplify translate()

Finally, I wondered if I could reduce the amount of time that translate() takes if I simply gave it less work to do. Instead of giving it the entire alphabet, what if I only gave it the letters that were in the XML element name?

PHP has a function count_chars() that can return the unique characters in a string. From here, we can create upper and lower-case versions of the string, and just use those in the translate() function.

<?php

$word           = 'rss';
$elementLetters = \count_chars($word, 3);
$lettersLower   = \mb_strtolower($elementLetters);
$lettersUpper   = \mb_strtoupper($elementLetters);

$query = \sprintf(
    '/*[translate(name(), \'%s\', \'%s\') = \'%s\']',
    $lettersUpper,
    $lettersLower,
    $word
);

# /*[translate(name(), 'RS', 'rs') = 'rss'
$results = $domxpath->query($query);

Testing this approach on the same 3 MB, 500-entry feed — with case-insensitivity enabled by way of our smarter translate() function — had an average runtime of 17 seconds.

Running the same benchmarks against my other test feeds consistently showed a 30–35% improvement in performance when using only the required letters in the translate() function instead of the entire alphabet.

Wrapping-up

Even with this technique (on this particular set of data, with this particular testing approach), case-insensitive queries are still 4.5–5× slower than their case-sensitive counterparts. Using the translate() XPath 1.0 function in PHP has a substantial impact on performance, so don’t use it if you don’t have to.

I still think that there is some viability in leveraging XSLT in a first pass, which I expect would substantially reduce the case-insensitive processing time, but someone with more XSLT experience than me would need to contribute that code.

Lastly, SimplePie NG performs the faster case-sensitive queries by default. You are able to opt-in to case-insensitive mode on a per-feed basis. If you’re just processing a few average-sized feeds with this mode enabled, you probably won’t notice much of an impact.