❔ Trying to speed up CSV import times
Hi, I was wondering if anyone could give me pointers on how I can speed up importing a CSV. Right now I am using CsvReader.
Sorry if it's a little vague; it's a research project so I can't share the code fully, I'm just looking for general tips on how to speed up the process. I've tried using a parallel foreach loop, but the issue I run into is that I need to keep track of line numbers, because any bad data gets written out to a separate file along with the line number where it was found. Bad data is extremely rare, and writing that file doesn't seem to be a bottleneck. Around 80% of the compute time is spent in this method, with the calculations and data output taking up the rest.
What's the bottleneck? Disk I/O? CPU? Something else? Step 0 is always to profile
Looks to be reading, I think? Sorry, pretty newbie with all this stuff - I'm coming from a Python scripting background, so I've never used a profiler before 😓
A quick and easy way to check is to open Task Manager, Performance tab (the one where it shows little graphs of disk + cpu + whatever), and see which one maxes out when it's loading the CSV
Task Manager showed a spike in CPU usage when running, particularly at the end, but that only lasted a relatively short while
That's the profiler that's built into VS, not Task Manager? What did Task Manager show?
Yeah, what I described is from windows task manager
I also attached the VS profiler just in case
So there's no significant disk or cpu usage while it's loading?
No
Then it's time to profile your code properly I suspect
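If attaching a full profiler feels like overkill, even a rough Stopwatch split will tell you roughly where the time goes. A minimal sketch, where the three phases are just stand-ins for your real read / calculate / write steps:
```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

var sw = Stopwatch.StartNew();
var lines = File.ReadLines("input.csv").ToList();           // "read" phase (placeholder path)
Console.WriteLine($"read:      {sw.Elapsed}");

sw.Restart();
var longestLine = lines.Max(l => l.Length);                  // "calculate" phase (placeholder work)
Console.WriteLine($"calculate: {sw.Elapsed}");

sw.Restart();
File.WriteAllText("summary.txt", longestLine.ToString());    // "write" phase (placeholder output)
Console.WriteLine($"write:     {sw.Elapsed}");
```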
How big are the CSV files you read?
They are generally between 200 and 500 MB
One idea might be (assuming your memory allows it and the I/O part is not the bottleneck right now) to read either the entire file or bigger chunks at once, and then do the CPU-bound check in parallel
You'd still be able to retain the line number by keeping a running index for the file and adding each row's offset within its chunk to it, incrementing the running total after each chunk (roughly what the sketch below does)
But just an idea of course
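Roughly what I mean, as an untested sketch: the chunk size, file paths and the Validate check are all placeholders for whatever you actually do per row.
```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

const string path = "input.csv";     // placeholder path
const int chunkSize = 100_000;       // tune to your memory budget
var badLines = new ConcurrentBag<long>();

long baseLineNumber = 0;             // running line count across chunks (header handling omitted)
foreach (var chunk in ReadChunks(path, chunkSize))
{
    long chunkStart = baseLineNumber;

    // CPU-bound check runs in parallel; the original line number is
    // recovered from the chunk's starting line plus the offset within it.
    Parallel.For(0, chunk.Count, i =>
    {
        if (!Validate(chunk[i]))
            badLines.Add(chunkStart + i + 1);   // 1-based line number
    });

    baseLineNumber += chunk.Count;
}

File.WriteAllLines("bad-lines.txt", badLines.OrderBy(n => n).Select(n => n.ToString()));

// Streams the file lazily and hands back fixed-size batches of raw lines.
static IEnumerable<List<string>> ReadChunks(string file, int size)
{
    var chunk = new List<string>(size);
    foreach (var line in File.ReadLines(file))
    {
        chunk.Add(line);
        if (chunk.Count == size)
        {
            yield return chunk;
            chunk = new List<string>(size);
        }
    }
    if (chunk.Count > 0)
        yield return chunk;
}

// Stand-in for whatever check flags a row as bad data.
static bool Validate(string line) => !string.IsNullOrWhiteSpace(line);
```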
I'm curious about the big CPU spike at the end though, since IIRC CsvHelper lazy loads the content, so it should be a constant load for each line
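For reference, the lazy pattern I mean looks roughly like this; MyRow and its column are placeholders rather than your actual types, and it needs the CsvHelper NuGet package:
```csharp
using System;
using System.Globalization;
using System.IO;
using CsvHelper;

using var reader = new StreamReader("input.csv");           // placeholder path
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

foreach (var record in csv.GetRecords<MyRow>())
{
    // Records are materialized one at a time as you enumerate,
    // so the whole file never has to sit in memory at once.
    Console.WriteLine(record.SomeColumn);
}

// Placeholder row class; map it to whatever columns your CSV actually has.
class MyRow
{
    public string SomeColumn { get; set; } = "";
}
```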
The big CPU spike is likely the calculation I'm running on the data, but the overall runtime of the calculation itself is quite short
Oh, so that's outside of the scope of the method itself, that makes sense then
Thanks for the suggestions, not familiar with them but I'll have a read
But in general I'm with canton7: unless you know the limiting factor, it's probably all premature optimization
Was this issue resolved? If so, run
/close
- otherwise I will mark this as stale and this post will be archived until there is new activity.