Reading a large XML file from an archive using XmlReader in parallel mode
Hello 👋. I am looking for a way to read data from an XML file inside an archive in parallel.
I have an archive, someFiles.zip, with the data I need, and it has a largeXmlFile.xml file inside. This file is 40 GB. It looks kind of like this (but with thousands of objects :Ok:):
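(a rough reconstruction, since the real snippet isn't shown here; the shape just mirrors the `<a b=c d=e />` example further down)

```xml
<root>
  <a b="..." d="..." />
  <a b="..." d="..." />
  <!-- ...and so on, thousands of flat elements like these... -->
</root>
```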
Now I am opening this file from the archive to get a Stream, putting that Stream into an XmlReader, and simply reading through it.
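Roughly like this (a minimal sketch rather than my exact code; the element name and attributes just follow the shape above):

```cs
using System.IO.Compression;
using System.Xml;

// Open the XML entry straight out of the zip: this gives a forward-only,
// non-seekable decompressed stream, which XmlReader can consume directly.
using var zip = ZipFile.OpenRead("someFiles.zip");
using var stream = zip.GetEntry("largeXmlFile.xml")!.Open();
using var reader = XmlReader.Create(stream);

while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
    {
        var b = reader.GetAttribute("b");
        var d = reader.GetAttribute("d");
        // build an object from the attributes and collect it
    }
}
```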
It takes ages to read this file, so my question is:
How can I change my code so that I read this XML in parallel?
holy crap 40 gb xml
couldn't you consider keeping a "cache" in an alternative format?
especially if it's that simple
Hah, yeah. It's painful :harold:
I have no other alternatives
why not, you could have a batch that translates xml to minimized json and use the json instead of the xml
or rather, it's just
<a b="c" d="e" />
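e.g. a one-off streaming converter, something like this (just a sketch, assuming flat attribute-only elements like that; Utf8JsonWriter writes minified JSON by default and nothing large gets buffered):

```cs
using System.IO;
using System.Text.Json;
using System.Xml;

using var input = XmlReader.Create("largeXmlFile.xml");   // or the stream from the zip entry
using var output = File.Create("largeXmlFile.json");
using var json = new Utf8JsonWriter(output);

json.WriteStartArray();
while (input.Read())
{
    if (input.NodeType == XmlNodeType.Element && input.Name == "a")
    {
        json.WriteStartObject();
        while (input.MoveToNextAttribute())
            json.WriteString(input.Name, input.Value);     // attribute name -> JSON property
        input.MoveToElement();
        json.WriteEndObject();
    }
}
json.WriteEndArray();
```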
then you could try rolling your own parser
or just benchmarking it
this archive I got from the government, and they only provide the XML format. So I would need to additionally parse this into JSON, which is actually another task to do
but it's a small one
yeah, I was thinking of creating my own and somehow splitting the Stream into multiple readers. But I have no idea how lol
having a single reader from disk will be faster than having multiple readers
to me it makes no sense to parallelize it, at least at that stage
hmm, so the best thing I can do is move this file onto a fast SSD?
it's not in an ssd already?!
really?
it is, but... I have 5 large files inside of this trojan horse ZIP bomb, hahha
how much would you want to improve the performance of this deserialization?
Have you actually profiled this to see what/where the bottlenecks are?
That's step 0 in any optimisation problem
As much as possible with safe C# (or unsafe if it is not painful). Also, I need to add these into a local database
I rather suspect it's one of:
1. Reading that much data from disk
2. Zip decompression
3. Creating a list with 40gb of elements in it
None of those are the actual XML parsing, and "Parallel mode" won't help with any of them
also do you have 40 GB of ram?
because if not... it's all swapping
like, how much ram this process takes?
XmlReader doesn't load the whole lot into ram at one time. That's the point.
But a List with 40gb of elements in it will
no but how is the zip managed?
Pretty sure that's streamed too?
Yeah, I realized that I provided you with the wrong code. Actually, the limit of this list in the real task is 1k elements, and then it goes into the local database. After the query completes, the list is cleared and I fill it again until I see EOF
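So the real loop is more like this (again just a sketch; InsertBatch and MyItem are placeholders for whatever actually writes to the local database and for the real object shape):

```cs
using System.Collections.Generic;
using System.IO.Compression;
using System.Xml;

const int BatchSize = 1_000;
var batch = new List<MyItem>(BatchSize);

using var zip = ZipFile.OpenRead("someFiles.zip");
using var stream = zip.GetEntry("largeXmlFile.xml")!.Open();
using var reader = XmlReader.Create(stream);

while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element || reader.Name != "a")
        continue;

    batch.Add(new MyItem(reader.GetAttribute("b"), reader.GetAttribute("d")));

    if (batch.Count >= BatchSize)
    {
        InsertBatch(batch);                 // one round-trip to the local DB per 1k items
        batch.Clear();
    }
}

if (batch.Count > 0)
    InsertBatch(batch);                     // flush the final partial batch

void InsertBatch(List<MyItem> items)
{
    // placeholder: whatever actually writes the batch to the local database
}

record MyItem(string? B, string? D);        // stand-in for the real object shape
```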
Still, you need to profile this before trying to optimise it
As a very crude first pass: if you open task manager, is your CPU maxed out, or your disk I/O?
Alright, I will bench it and reply later
But if you were me, what steps would you take? And is reading 1k objects and passing 'em into the DB a good idea or not?
I am looking for some good advice now :heartowo:
how big is a single object?
i would still benchmark this, maybe the optimum is 500 items, maybe 2000, who knows
Feels vaguely sensible, but you really need to have a profiler up. The no. 1 rule of optimization is that the slow-downs are never where you think they are
So you can spend an awful lot of time trying things which are never going to make any difference, while missing the real problem entirely
(and I mean actual profiling, not benchmarking. A profiler looks at your code as it's running and tells you where it's spending the most time)