C# · 17mo ago
sjbs

❔ Trying to speed up CSV import times

Hi, I was wondering if anyone could give me pointers on how I can speed up importing a CSV. Right now I am using CsvHelper's CsvReader.
public IDictionary<string, IList<Array>> Import()
{
    using var streamReader = new StreamReader(Filepath);

    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Delimiter = ",",
        IgnoreBlankLines = false,
        TrimOptions = TrimOptions.Trim,
        HasHeaderRecord = checkHeader(streamReader), // bool that just checks whether the first line is a header
        Mode = CsvMode.NoEscape,
    };

    using var csvReader = new CsvReader(streamReader, config);
    var records = csvReader.GetRecords<DataObj>(); // all the data is kept as strings; DataObj just makes assigning/indexing the columns easier

    int lineCount = 0;
    foreach (var record in records)
    {
        lineCount++;

        // A bunch of if statements ensuring the data fits within parameters;
        // returns the good data, or null otherwise.
        var goodData = CheckData.Validate(record, lineCount);

        if (goodData != null)
        {
            // Adds the good data to a dictionary: each value is a list to which
            // the arrays returned from CheckData are appended.
            AddDataToDict(goodData.Factor1, goodData);
        }
    }

    return _dataDict;
}
Sorry if it's a little vague; it's a research project, so I can't share the code fully. I'm just looking for general tips on how to speed up the process. I've tried using a parallel foreach loop (roughly the shape sketched below), but the issue I run into is that I need to keep track of the line numbers, since any bad data gets written to a file along with the line it came from. Bad data is extremely rare, and writing that file doesn't seem to be a bottleneck. Around 80% of the compute time is spent in this method, with the calculations and data output taking up the rest.
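For reference, this is roughly the shape of the parallel attempt (a simplified sketch, not the real code): once the loop runs in parallel, the incremented counter no longer matches the record's actual position in the file, and the shared dictionary needs locking.

int lineCount = 0;
Parallel.ForEach(csvReader.GetRecords<DataObj>(), record =>
{
    // The counter reflects order of arrival, not file order, so the bad-data
    // file would point at the wrong lines.
    var current = Interlocked.Increment(ref lineCount);

    var goodData = CheckData.Validate(record, current);
    if (goodData != null)
    {
        lock (_dataDict) // the dictionary isn't thread-safe
        {
            AddDataToDict(goodData.Factor1, goodData);
        }
    }
});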
17 Replies
canton7 · 17mo ago
What's the bottleneck? Disk I/O? CPU? Something else? Step 0 is always to profile
sjbs (OP) · 17mo ago
Looks to be the reading, I think? Sorry, I'm pretty new to all this stuff - I'm coming from a Python scripting background, so I've never used a profiler before 😓
canton7 · 17mo ago
A quick and easy way to check is to open Task Manager, Performance tab (the one where it shows little graphs of disk + cpu + whatever), and see which one maxes out when it's loading the CSV
sjbs (OP) · 17mo ago
Task Manager showed a spike in CPU usage while it was running, particularly at the end, but that only lasted a relatively short while.
canton7 · 17mo ago
That's the profiler that's built into VS, not Task Manager? What did Task Manager show?
sjbs (OP) · 17mo ago
Yeah, what I described is from Windows Task Manager. I also attached the VS profiler results just in case.
canton7 · 17mo ago
So there's no significant disk or CPU usage while it's loading?
sjbs (OP) · 17mo ago
No
canton7 · 17mo ago
Then it's time to profile your code properly, I suspect.
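If a full profiler feels like overkill to start with, even a crude Stopwatch split inside Import() will tell you whether the time goes into CsvHelper's parsing or into your validation and dictionary work. A rough sketch (it reuses the names from your snippet, and the ToList() pulls the whole file into memory, so only try it on a file that comfortably fits in RAM):

var sw = System.Diagnostics.Stopwatch.StartNew();

// Forces CsvHelper to actually parse every record up front.
var records = csvReader.GetRecords<DataObj>().ToList();
Console.WriteLine($"Parse:            {sw.Elapsed}");

sw.Restart();
int lineCount = 0;
foreach (var record in records)
{
    lineCount++;
    var goodData = CheckData.Validate(record, lineCount);
    if (goodData != null)
        AddDataToDict(goodData.Factor1, goodData);
}
Console.WriteLine($"Validate + store: {sw.Elapsed}");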
Sossenbinder · 17mo ago
How big is the CSV file you read?
sjbs (OP) · 17mo ago
They're generally between 200 and 500 MB.
Sossenbinder · 17mo ago
One idea might be (granted your memory allows it and the I/O part is not the bottleneck right now) to read either the entire file or bigger chunks at once, and then do the CPU-bound check in parallel. You'd still be able to retain the line number by keeping a running line index for the file and adding the index within the loaded chunk to it, incrementing the total line number after each chunk. But just an idea of course, something along the lines of the sketch below.

I'm curious about the big CPU spike at the end though, since IIRC CsvHelper lazy-loads the content, so it should be a constant load for each line.
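Roughly what I have in mind, as a very loose sketch (ValidateRawLine is a made-up stand-in for splitting the line and running your CheckData.Validate, Chunk needs .NET 6+, and header handling plus the bad-line writing are left out):

const int ChunkSize = 10_000;
long lineOffset = 0;

foreach (var chunk in File.ReadLines(Filepath).Chunk(ChunkSize))
{
    var offset = lineOffset; // capture for the lambda

    var results = chunk
        .AsParallel()
        .AsOrdered() // keep file order so bad-line reporting stays correct
        .Select((line, i) => (LineNumber: offset + i + 1,
                              Good: ValidateRawLine(line, offset + i + 1)))
        .ToList();

    foreach (var (lineNumber, good) in results)
    {
        if (good != null)
            AddDataToDict(good.Factor1, good);
        // else: write lineNumber to the bad-data file
    }

    lineOffset += chunk.Length;
}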
sjbs (OP) · 17mo ago
The big CPU spike is likely the calculation I'm running on the data, but the overall runtime for that calculation is quite fast.
Sossenbinder · 17mo ago
Oh, so that's outside the scope of the method itself; that makes sense then.
sjbs (OP) · 17mo ago
Thanks for the suggestions, I'm not familiar with them but I'll have a read.
Sossenbinder · 17mo ago
But in general I'm with canton7: unless you know the limiting factor, it's probably all premature optimization.
Accord · 17mo ago
Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.