BulkInsert to SQL, parallel processing
I'm trying to improve the processing speed of a 3rd party integration (via SDK), wherein the processed data is written to SQL.
I was able to reduce the total processing time of the 3rd party system from 40 hours to 3 by improving the methods used to process the data and by turning the sequential processing into parallel processing. The challenge I face now is writing the data to SQL.
What I currently have is the following:
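In sketch form (the record type and the flush step are simplified stand-ins; the buffer and threshold are the ones described below):

```cs
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class Processor
{
    // Shared buffer written to by all parallel iterations.
    private readonly ConcurrentBag<object> _recordBag = new ConcurrentBag<object>();
    private const int Threshold = 50000;

    public void ProcessAll(IEnumerable<object> source)
    {
        Parallel.ForEach(source, item =>
        {
            _recordBag.Add(Transform(item)); // per-record work (stand-in)

            // Check-then-flush: not atomic across parallel iterations,
            // so two threads can both see Count >= Threshold at once.
            if (_recordBag.Count >= Threshold)
                FlushToTempTable();
        });

        FlushToTempTable(); // flush whatever is left
    }

    object Transform(object item) => item;      // placeholder
    void FlushToTempTable() { /* drain the bag, bulk insert to temp table, MERGE (assumed) */ }
}
```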
With this method, I use a temp table with a merge/insert strategy to add or update the records. The problem is that the original `Parallel.ForEach` sometimes runs twice for the same record, when there is truly only 1 record available. There are 8 million records I need to process this way, and without using the Parallel method on the input records it takes far too long to process all the data. Any thoughts / help?
Couple things:
* Stop using `ConcurrentBag`. The concurrency mechanisms it uses are not as efficient as the other `ConcurrentXxx` data structures. Look at using Channels instead (sketch after this list).
* Stop using `DataTable`. It's an old data format that has little place in modern .NET development. Use a proper ORM such as EFC, Linq2DB, or even Dapper.
* For loading massive amounts of data into SQL Server, you want to use `SqlBulkCopy`. Here's a starting point for some research on it: https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.sqlclient.sqlbulkcopy?view=sqlclient-dotnet-standard-5.1
* Alternatively, you can use the `.BulkCopy()` method in linq2db, which will be simpler.
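In sketch form, a Channels pipeline might look like this (single consumer draining batches, so the threshold check can't race; names and batch size are illustrative, and `System.Threading.Channels` is a netstandard2.0 NuGet package, so it works on .NET Framework too):

```cs
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

class ChannelPipeline
{
    // Unbounded channel: parallel producers write, one consumer reads,
    // so batching and flushing never race.
    private readonly Channel<object> _channel = Channel.CreateUnbounded<object>();

    public void Produce(IEnumerable<object> source)
    {
        Parallel.ForEach(source, item =>
        {
            _channel.Writer.TryWrite(Transform(item)); // always succeeds on an unbounded channel
        });
        _channel.Writer.Complete(); // signal the consumer there's no more data
    }

    public async Task ConsumeAsync(int batchSize)
    {
        var batch = new List<object>(batchSize);
        while (await _channel.Reader.WaitToReadAsync())
        {
            while (_channel.Reader.TryRead(out var record))
            {
                batch.Add(record);
                if (batch.Count >= batchSize)
                {
                    FlushToTempTable(batch); // SqlBulkCopy + MERGE (assumed)
                    batch.Clear();
                }
            }
        }
        if (batch.Count > 0) FlushToTempTable(batch); // remainder
    }

    object Transform(object item) => item;                    // placeholder
    void FlushToTempTable(List<object> batch) { /* assumed */ }
}
```

Run `Produce` and `ConsumeAsync` concurrently (start the consumer task first, then produce); since only the consumer touches the batch, nothing flushes twice.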
Yeah, I was already using `SqlBulkCopy`. I've transitioned away from `DataTable` to using the native class along with FastMember (to generate the reader to pass to `bulk.WriteToServer(reader)`). Overall, it's working fairly well.
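In sketch form, that looks roughly like this (the table, columns, and record type are made up for illustration):

```cs
using System.Collections.Generic;
using FastMember;
using Microsoft.Data.SqlClient;

static class BulkLoader
{
    // Streams records straight into SQL Server without building a DataTable.
    public static void BulkInsert(string connectionString, IEnumerable<MyRecord> records)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            using (var reader = ObjectReader.Create(records, "Id", "Name")) // members to map (illustrative)
            {
                bulk.DestinationTableName = "#TempRecords"; // temp table created earlier on this connection (assumed)
                bulk.BatchSize = 10000;
                bulk.WriteToServer(reader);
            }
        }
    }
}

class MyRecord // illustrative record shape
{
    public int Id { get; set; }
    public string Name { get; set; }
}
```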
I'll have to look at the other Concurrent data structures, and look at Channels.
Aside from the 3rd party's IEnumerable being slow as crap, this has worked fairly well (about 4 hours to process everything, compared to my original 40 hours with single-threaded processing for everything). The only issue is that the line `_recordBag.Count >= threshold` somehow reports true when the actual values are, say, 532 >= 50000.
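My guess is the check and the flush race: two parallel iterations both pass the threshold check, and by the time one of them reads the count, the other has already drained the bag, so it looks like 532 >= 50000 fired. One way to make the check-and-drain atomic while still on `ConcurrentBag` (a sketch, names assumed):

```cs
using System.Collections.Concurrent;
using System.Collections.Generic;

class SafeBatcher
{
    private readonly ConcurrentBag<object> _recordBag = new ConcurrentBag<object>();
    private readonly object _flushLock = new object();
    private const int Threshold = 50000;

    public void Add(object record)
    {
        _recordBag.Add(record);
        if (_recordBag.Count < Threshold) return; // cheap pre-check outside the lock

        List<object> batch = null;
        lock (_flushLock)
        {
            // Re-check inside the lock so only one thread drains per threshold.
            if (_recordBag.Count >= Threshold)
            {
                batch = new List<object>(Threshold);
                while (batch.Count < Threshold && _recordBag.TryTake(out var r))
                    batch.Add(r);
            }
        }
        if (batch != null) FlushToTempTable(batch); // bulk insert + MERGE (assumed)
    }

    void FlushToTempTable(List<object> batch) { /* assumed */ }
}
```

With the Channels approach above and a single consumer, this locking disappears entirely.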
Also, I should have mentioned: I'm restricted to .NET Framework due to the 3rd party SDK.