❔ Large XML Documents Performance Issues
Running into performance issues trying to parse large XML documents into memory. By large, I mean 200+ MB XML files containing hundreds of thousands of lines of text. They contain snapshots of telemetry data for machines the client owns and operates. One of their purposes is to assist upper-tier support with resolving tickets by making this data available to them. Currently, that data is not available unless support is on premise. The problem is that most of the documents with the more useful information are so large that they can't even be opened on the machines the support teams use. Heck, some of them I can't even open on my developer machine.
Since we can't go back and redesign the XML schemas to split the data into smaller documents, my idea is to develop a service that parses the documents and lets the user navigate through the XML via a web page, letting cloud resources do the heavy lifting. The primary reason I've landed on this is that it falls under the umbrella of what we are authorized to do (gov client). Initially I was going to use Kubernetes for this, but again, due to authorization reasons, the client has asked that we stick to Azure Functions.
Currently, the data is stored in Azure Files; however, for other reasons, some of it has also been stored in SQL, with each file stored in an XML column (they refuse to split the data up relationally because, and I quote, "it would add too many tables to the db"). So I've been testing parsing the data into an XPathDocument using a reader obtained from the EF context's GetDbConnection. It is mostly able to parse files up to ~70 MB, and while a bit slow (8100 ms response time for the largest successfully parsed file), it gets the job done. This is using Azure Functions Premium on EP1. EP2 saw faster response times (8100 ms was the fastest), but still timed out on anything larger than 50-70 MB.
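For reference, the current lookup looks roughly like this. This is a simplified sketch; the table/column names and the context type are placeholders, not the real schema:

```csharp
// Simplified sketch of the current approach: stream the XML column out of SQL via the
// EF connection and build an XPathDocument from it. Table/column names are placeholders.
using System;
using System.Data;
using System.Data.Common;
using System.Threading.Tasks;
using System.Xml.XPath;
using Microsoft.EntityFrameworkCore;

public static class SnapshotLoader
{
    public static async Task<XPathDocument> LoadAsync(DbContext db, int fileId)
    {
        DbConnection conn = db.Database.GetDbConnection();
        if (conn.State != ConnectionState.Open)
            await conn.OpenAsync();

        using DbCommand cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT Payload FROM SnapshotFiles WHERE Id = @id";
        DbParameter p = cmd.CreateParameter();
        p.ParameterName = "@id";
        p.Value = fileId;
        cmd.Parameters.Add(p);

        // SequentialAccess streams the column rather than buffering the whole value first.
        using DbDataReader reader = await cmd.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
        if (!await reader.ReadAsync())
            throw new InvalidOperationException($"Snapshot {fileId} not found.");

        // XPathDocument still indexes the entire document in memory, which is where
        // the time/memory cost shows up on the 200+ MB files.
        using var xml = reader.GetTextReader(0);
        return new XPathDocument(xml);
    }
}
```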
The response time seems to grow exponentially with document size, with anything over ~70 MB timing out at the default 5-minute timeout (which already seems really long). As a proof of concept, the gov side of the team originally developed a similar solution using their front-end framework (Adobe ColdFusion) and let the web server handle loading the documents. They found major performance gains pulling the data directly from Azure Files, where the original files are hosted (although they still had to upgrade the web server from AP2 v2 to D8 v3).
Wondering if someone can lend me some insight on why our response times seem to grow exponentially, why pulling the file from Azure Files instead of Azure SQL seems to be faster (I'm testing this with my function today to confirm), and what other possible solutions there are for this using Azure Functions.
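The Azure Files comparison test I'm setting up looks roughly like this (a sketch assuming the Azure.Storage.Files.Shares SDK; the share name, file path, and connection string are placeholders):

```csharp
// Sketch of the Azure Files test: stream the file straight from the share instead of
// round-tripping through SQL. Share/file names are placeholders.
using System.Xml.XPath;
using Azure.Storage.Files.Shares;
using Azure.Storage.Files.Shares.Models;

public static class FileShareLoader
{
    public static XPathDocument LoadFromFileShare(string connectionString, string shareName, string filePath)
    {
        var fileClient = new ShareFileClient(connectionString, shareName, filePath);

        // Download the file content as a stream directly from the share.
        ShareFileDownloadInfo download = fileClient.Download().Value;
        using var content = download.Content;

        // Still builds the full in-memory XPath index, so this only tests the transfer path.
        return new XPathDocument(content);
    }
}
```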
Note: if this weren't timing out and were a bit more responsive, my next step would have been to return the requested XPath data to the user while also caching that data, plus a number of nodes at x depth from it, in Redis, so that when the user stepped up or down in the document the data was immediately available, with more nodes being cached as the user reached the bounds of the original cache.
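Something like this is what I had in mind for that cache layer (a hypothetical sketch with StackExchange.Redis; the key scheme, TTL, and where the prefetching hangs off are illustrative, nothing is built yet):

```csharp
// Hypothetical sketch of the Redis node cache (StackExchange.Redis). The key scheme and
// expiry are illustrative; prefetching of neighboring nodes would be layered on StoreAsync.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class NodeCache
{
    private readonly IDatabase _redis;

    public NodeCache(IConnectionMultiplexer mux) => _redis = mux.GetDatabase();

    // Cache the serialized fragment for a document + XPath so stepping back to a node the
    // user already visited (or one that was prefetched) never re-parses the source file.
    public Task StoreAsync(string docId, string xpath, string xmlFragment) =>
        _redis.StringSetAsync($"doc:{docId}:{xpath}", xmlFragment, TimeSpan.FromMinutes(30));

    public async Task<string?> TryGetAsync(string docId, string xpath)
    {
        RedisValue cached = await _redis.StringGetAsync($"doc:{docId}:{xpath}");
        return cached.HasValue ? (string?)cached : null;
    }
}
```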
4 Replies
How are you "parsing" the XML? XmlDocument.Load/LoadXml()?
Typically, with large XML files you will need to resort to using XmlReader, which is the lowest-level API for processing XML in .NET. It is actually quite fast and memory-efficient, but it can be difficult to work with.
My recommendation would be to use the XmlReader to read through the document and transform the necessary pieces into a format that is more efficient to access.
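Something along these lines, just to illustrate the shape of it (the element name is made up):

```csharp
// Forward-only pass over a large file with XmlReader; only the fragments you care
// about are ever materialized. "Telemetry" is a made-up element name.
using System.Xml;
using System.Xml.Linq;

using var reader = XmlReader.Create("snapshot.xml", new XmlReaderSettings { IgnoreWhitespace = true });
while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.Element && reader.Name == "Telemetry")
    {
        // ReadSubtree scopes a reader to just this element and its children.
        using XmlReader subtree = reader.ReadSubtree();
        XElement fragment = XElement.Load(subtree);
        // ...transform/store the fragment in a format that's cheaper to query later...
    }
}
```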
I'm going to second what @MarkPflug said. For large XML documents (and I used to work with XML files that were 1GiB+) you want to use a fast, forward-only streaming mechanism like that provided by XmlReader. Loading the entire document into an XmlDocument in memory really isn't an option when the files get large.
Yeah, I started switching to XmlReader yesterday afternoon when I realized that I probably wasn't going to be able to load it into memory. However, we use XPath for lookups, and I've been running into problems integrating that without essentially rebuilding XPath support on top of the reader.
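For context, this is roughly the hybrid I'm trying to get to: stream with XmlReader until the requested section, then load only that subtree into an XPathDocument so the existing XPath lookups still work (element/attribute names are placeholders):

```csharp
// Rough sketch of the hybrid approach: skip through the file with XmlReader, then hand
// only the matching subtree to XPathDocument. Element/attribute names are placeholders.
using System.Xml;
using System.Xml.XPath;

public static class SectionLoader
{
    public static XPathNavigator? LoadSection(string path, string element, string idAttribute, string id)
    {
        using XmlReader reader = XmlReader.Create(path);
        while (reader.ReadToFollowing(element))
        {
            if (reader.GetAttribute(idAttribute) == id)
            {
                // Only this fragment is indexed for XPath; the rest of the file is skipped.
                using XmlReader subtree = reader.ReadSubtree();
                return new XPathDocument(subtree).CreateNavigator();
            }
        }
        return null;
    }
}
```

The open question is how to map the XPath expressions we already run against the full document onto these per-section fragments without rewriting them all.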
Was this issue resolved? If so, run
/close
- otherwise I will mark this as stale and this post will be archived until there is new activity.