C# · 17mo ago
sjbs

❔ Trying to speed up CSV import times

Hi, I was wondering if anyone could give me pointers on how I can speed up importing a CSV. Right now I am using CsvHelper's CsvReader.
public IDictionary<string, IList<Array>> Import()
{
    using var streamReader = new StreamReader(Filepath);

    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        Delimiter = ",",
        IgnoreBlankLines = false,
        TrimOptions = TrimOptions.Trim,
        HasHeaderRecord = checkHeader(streamReader), // bool that just checks whether the first line is a header
        Mode = CsvMode.NoEscape,
    };

    using var csvReader = new CsvReader(streamReader, config);
    var records = csvReader.GetRecords<DataObj>(); // all the data is kept as strings; DataObj just makes assigning/indexing the columns easier

    int lineCount = 0;
    foreach (var record in records)
    {
        lineCount++;

        // A bunch of if statements ensuring the data fits within parameters;
        // returns the good data, or null otherwise.
        var goodData = CheckData.Validate(record, lineCount);

        if (goodData != null)
        {
            // Adds the good data to a dictionary: each value is a list to which
            // the arrays returned from CheckData are appended.
            AddDataToDict(goodData.Factor1, goodData);
        }
    }

    return _dataDict;
}
Sorry if it's a little vague; it's a research project, so I can't share the code fully. I'm just looking for general tips on how to speed up the process. I've tried using a parallel foreach loop (roughly the shape sketched below), but the issue I run into is that I need to keep track of the line numbers, since any bad data gets written to a file along with the line it came from. Bad data is extremely rare, and writing that file doesn't seem to be a bottleneck. Around 80% of the compute time is spent in this method, with the calculations and data output taking up the rest.
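For reference, this is roughly the shape of the parallel attempt (a simplified sketch, not the real code): once the loop runs in parallel, the incremented counter no longer matches the record's actual position in the file, and the shared dictionary needs locking.

int lineCount = 0;
Parallel.ForEach(csvReader.GetRecords<DataObj>(), record =>
{
    // The counter reflects order of arrival, not file order, so the bad-data
    // file would point at the wrong lines.
    var current = Interlocked.Increment(ref lineCount);

    var goodData = CheckData.Validate(record, current);
    if (goodData != null)
    {
        lock (_dataDict) // the dictionary isn't thread-safe
        {
            AddDataToDict(goodData.Factor1, goodData);
        }
    }
});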
17 Replies
canton7 · 17mo ago
What's the bottleneck? Disk I/O? CPU? Something else? Step 0 is always to profile
sjbs (OP) · 17mo ago
Looks to be the reading, I think? Sorry, I'm pretty new to all this stuff - I'm coming from a Python scripting background, so I've never used a profiler before 😓
canton7 · 17mo ago
A quick and easy way to check is to open Task Manager, Performance tab (the one where it shows little graphs of disk + cpu + whatever), and see which one maxes out when it's loading the CSV
sjbs (OP) · 17mo ago
Task Manager showed a spike in CPU usage while it was running, particularly at the end, but that only lasted a relatively short while.
canton7 · 17mo ago
That's the profiler that's built into VS, not Task Manager? What did Task Manager show?
sjbs (OP) · 17mo ago
Yeah, what I described is from Windows Task Manager. I also attached the VS profiler results just in case.
canton7 · 17mo ago
So there's no significant disk or CPU usage while it's loading?
sjbs (OP) · 17mo ago
No
canton7 · 17mo ago
Then it's time to profile your code properly, I suspect.
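If a full profiler feels like overkill to start with, even a crude Stopwatch split inside Import() will tell you whether the time goes into CsvHelper's parsing or into your validation and dictionary work. A rough sketch (it reuses the names from your snippet, and the ToList() pulls the whole file into memory, so only try it on a file that comfortably fits in RAM):

var sw = System.Diagnostics.Stopwatch.StartNew();

// Forces CsvHelper to actually parse every record up front.
var records = csvReader.GetRecords<DataObj>().ToList();
Console.WriteLine($"Parse:            {sw.Elapsed}");

sw.Restart();
int lineCount = 0;
foreach (var record in records)
{
    lineCount++;
    var goodData = CheckData.Validate(record, lineCount);
    if (goodData != null)
        AddDataToDict(goodData.Factor1, goodData);
}
Console.WriteLine($"Validate + store: {sw.Elapsed}");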
Sossenbinder · 17mo ago
How big is the CSV file you read?
sjbs (OP) · 17mo ago
They're generally between 200 and 500 MB.
Sossenbinder · 17mo ago
One idea might be (granted your memory allows it and the I/O part is not the bottleneck right now) to read either the entire file or bigger chunks at once, and then do the CPU-bound check in parallel. You'd still be able to retain the line number by keeping a running line index for the file and adding the index within the loaded chunk to it, incrementing the total line number after each chunk. But just an idea of course, something along the lines of the sketch below.

I'm curious about the big CPU spike at the end though, since IIRC CsvHelper lazy-loads the content, so it should be a constant load for each line.
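Roughly what I have in mind, as a very loose sketch (ValidateRawLine is a made-up stand-in for splitting the line and running your CheckData.Validate, Chunk needs .NET 6+, and header handling plus the bad-line writing are left out):

const int ChunkSize = 10_000;
long lineOffset = 0;

foreach (var chunk in File.ReadLines(Filepath).Chunk(ChunkSize))
{
    var offset = lineOffset; // capture for the lambda

    var results = chunk
        .AsParallel()
        .AsOrdered() // keep file order so bad-line reporting stays correct
        .Select((line, i) => (LineNumber: offset + i + 1,
                              Good: ValidateRawLine(line, offset + i + 1)))
        .ToList();

    foreach (var (lineNumber, good) in results)
    {
        if (good != null)
            AddDataToDict(good.Factor1, good);
        // else: write lineNumber to the bad-data file
    }

    lineOffset += chunk.Length;
}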
sjbs (OP) · 17mo ago
The big CPU spike is likely the calculation I'm running on the data, but the overall runtime for that calculation is quite fast.
Sossenbinder · 17mo ago
Oh, so that's outside the scope of the method itself; that makes sense then.
sjbs (OP) · 17mo ago
Thanks for the suggestions, I'm not familiar with them but I'll have a read.
Sossenbinder · 17mo ago
But in general I'm with canton7: unless you know the limiting factor, it's probably all premature optimization.
Accord · 17mo ago
Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.