Reading a large XML file from an archive using XmlReader in parallel mode
Hello 👋. I am looking for a way to read data from an XML file inside an archive in parallel.
I have an archive, someFiles.zip, with the data I need, and it has a largeXmlFile.xml file inside. This file is 40 GB. It looks kind of like this (but with thousands of objects :Ok:):
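(a rough reconstruction, since the real snippet isn't shown here; the shape just mirrors the `<a b=c d=e />` example further down)

```xml
<root>
  <a b="..." d="..." />
  <a b="..." d="..." />
  <!-- ...and so on, thousands of flat elements like these... -->
</root>
```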
Now I am opening this file from the archive to get a Stream, putting that Stream into an XmlReader, and simply reading through it.
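Roughly like this (a minimal sketch rather than my exact code; the element name and attributes just follow the shape above):

```cs
using System.IO.Compression;
using System.Xml;

// Open the XML entry straight out of the zip: this gives a forward-only,
// non-seekable decompressed stream, which XmlReader can consume directly.
using var zip = ZipFile.OpenRead("someFiles.zip");
using var stream = zip.GetEntry("largeXmlFile.xml")!.Open();
using var reader = XmlReader.Create(stream);

while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
    {
        var b = reader.GetAttribute("b");
        var d = reader.GetAttribute("d");
        // build an object from the attributes and collect it
    }
}
```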
It takes ages to read this file, so my question is:
How can I change my code so that I read this XML in parallel?
holy crap 40 gb xml
couldn't you consider keeping a "cache" in an alternative format?
especially if it's that simple
Hah, yeah. It's painful :harold:
I have no other alternatives
why not, you could have a batch that translates xml to minimized json and use the json instead of the xml
or rather, it's just
<a b="c" d="e" />
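e.g. a one-off streaming converter, something like this (just a sketch, assuming flat attribute-only elements like that; Utf8JsonWriter writes minified JSON by default and nothing large gets buffered):

```cs
using System.IO;
using System.Text.Json;
using System.Xml;

using var input = XmlReader.Create("largeXmlFile.xml");   // or the stream from the zip entry
using var output = File.Create("largeXmlFile.json");
using var json = new Utf8JsonWriter(output);

json.WriteStartArray();
while (input.Read())
{
    if (input.NodeType == XmlNodeType.Element && input.Name == "a")
    {
        json.WriteStartObject();
        while (input.MoveToNextAttribute())
            json.WriteString(input.Name, input.Value);     // attribute name -> JSON property
        input.MoveToElement();
        json.WriteEndObject();
    }
}
json.WriteEndArray();
```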
then you could try rolling your own parser
or just benchmarking it
this archive I got from the government, and they only provide the XML format. So I would need to additionally parse this into JSON, which is actually another task to do
but it's a small one
yeah, I was thinking of creating my own and somehow splitting the Stream into multiple readers. But I have no idea how lol
having a single reader from disk will be faster than having multiple readers
to me it makes no sense to parallelize it, at least at that stage
hmm, so the best thing I can do is move this file onto a fast SSD?
it's not in an ssd already?!
really?
it is, but... I have 5 large files inside of this trojan horse ZIP bomb, hahha
how much would you want to improve the performance of this deserialization?
Have you actually profiled this to see what/where the bottlenecks are?
That's step 0 in any optimisation problem
As much as possible with safe C# (or unsafe if it is not painful). Also, I need to add these into a local database
I rather suspect it's one of:
1. Reading that much data from disk
2. Zip decompression
3. Creating a list with 40gb of elements in it
None of those are the actual XML parsing, and "Parallel mode" won't help with any of them
also do you have 40 GB of ram?
because if not... it's all swapping
like, how much ram this process takes?
XmlReader doesn't load the whole lot into ram at one time. That's the point.
But a List with 40gb of elements in it will
no but how is the zip managed?
Pretty sure that's streamed too?
Yeah, I realized that I provided you with the wrong code. Actually, the limit of this list in the real task is 1k elements, and then it goes into the local database. After the query completes, the list is cleared and I fill it again until I see EOF
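So the real loop is more like this (again just a sketch; InsertBatch and MyItem are placeholders for whatever actually writes to the local database and for the real object shape):

```cs
using System.Collections.Generic;
using System.IO.Compression;
using System.Xml;

const int BatchSize = 1_000;
var batch = new List<MyItem>(BatchSize);

using var zip = ZipFile.OpenRead("someFiles.zip");
using var stream = zip.GetEntry("largeXmlFile.xml")!.Open();
using var reader = XmlReader.Create(stream);

while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element || reader.Name != "a")
        continue;

    batch.Add(new MyItem(reader.GetAttribute("b"), reader.GetAttribute("d")));

    if (batch.Count >= BatchSize)
    {
        InsertBatch(batch);                 // one round-trip to the local DB per 1k items
        batch.Clear();
    }
}

if (batch.Count > 0)
    InsertBatch(batch);                     // flush the final partial batch

void InsertBatch(List<MyItem> items)
{
    // placeholder: whatever actually writes the batch to the local database
}

record MyItem(string? B, string? D);        // stand-in for the real object shape
```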
Still, you need to profile this before trying to optimise it
As a very crude first pass: if you open task manager, is your CPU maxed out, or your disk I/O?
Alright, I will bench it and reply later
But if you were me, what steps would you take? And is reading 1k objects and passing 'em into the DB a good idea or not?
I am looking for some good advice now :heartowo:
how big is a single object?
i would still benchmark this, maybe the optimum is 500 items, maybe 2000, who knows
Feels vaguely sensible, but you really need to have a profiler up. The no. 1 rule of optimization is that the slow-downs are never where you think they are
So you can spend an awful lot of time trying things which are never going to make any difference, while missing the real problem entirely
(and I mean actual profiling, not benchmarking. A profiler looks at your code as it's running and tells you where it's spending the most time)