When should I call SaveChanges() when creating thousands of Entity Framework objects? (e.g. during an import)

entity-framework import loops performance savechanges

Question

I'm running an import that will insert thousands of records on each run. I'm just looking for confirmation of my assumptions:

Which of the following is the most logical:

  1. Call SaveChanges() after every AddToClassName() call.
  2. Call SaveChanges() after every n AddToClassName() calls.
  3. Call SaveChanges() once, after all of the AddToClassName() calls.

The first option is probably slow, right? Since it will need to analyze the EF objects currently in memory, generate the SQL, and so on.

I assume the second option is the best of both worlds, since we can wrap a try/catch around that SaveChanges() call and only lose n records at a time if one of them fails. Maybe store each batch in a List<>. If the SaveChanges() call succeeds, get rid of the list. If it fails, log the items.

The last option would probably be very slow as well, since every EF object would have to stay in memory until SaveChanges() is called. And if the save failed, nothing would be committed, right?
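For option 2, I'm picturing something like this rough sketch (MyEntities, MyEntity, AddToMyEntity() and Log() are just placeholders, not real code from my project; the batch size is arbitrary):

private void ImportInBatches(IEnumerable<MyEntity> records, int batchSize)
{
    // Option 2 sketch: commit every batchSize rows and keep the current batch
    // in a List so it can be logged if the commit fails.
    var currentBatch = new List<MyEntity>();

    using (var context = new MyEntities())
    {
        foreach (var entity in records)
        {
            context.AddToMyEntity(entity);
            currentBatch.Add(entity);

            if (currentBatch.Count < batchSize)
                continue;

            try
            {
                context.SaveChanges();       // commit this batch of n rows
            }
            catch (Exception ex)
            {
                Log(currentBatch, ex);       // only this batch is lost
                // Detach the failed entities so the next SaveChanges()
                // doesn't try to insert them again.
                foreach (var failed in currentBatch)
                    context.Detach(failed);
            }
            currentBatch.Clear();
        }

        if (currentBatch.Count > 0)
            context.SaveChanges();           // commit the final partial batch
    }
}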


Accepted Answer

I would test it first to be sure. Performance doesn't have to be that bad.

If you need to insert all rows in one transaction, call SaveChanges() once, after all of the AddToClassName() calls. If the rows can be inserted independently, save changes after every row. Database consistency is important.

I don't like the second option. It would be confusing for me (from the end user's perspective) if an import into the system declined 10 rows out of 1000 just because 1 of them was bad. You can try importing 10 and, if that fails, try them one by one and then log the failure.

Test whether it actually takes a long time. Don't write "probably"; you don't know that yet. Only think about other solutions when it really turns out to be a problem (marc_s).

EDIT

I ran the following tests (times in milliseconds):

10000 rows:

SaveChanges() after 1 row: 18510.534
SaveChanges() after 100 rows: 4350.3075
SaveChanges() after 10000 rows: 5233.0635

50000 rows:

SaveChanges() after 1 row: 78496.929
SaveChanges() after 500 rows: 22302.2835
SaveChanges() after 50000 rows: 24022.8765

So it is actually faster to commit after n rows than to commit everything at the end.

My suggestion is to:

  • Call SaveChanges() after n rows.
  • If a commit fails, try the rows one by one to find the problematic one (see the sketch below).

Test suite:

TABLE:

CREATE TABLE [dbo].[TestTable](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [SomeInt] [int] NOT NULL,
    [SomeVarchar] [varchar](100) NOT NULL,
    [SomeOtherVarchar] [varchar](50) NOT NULL,
    [SomeOtherInt] [int] NULL,
 CONSTRAINT [PkTestTable] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

Class:

using System;
using System.Data.EntityClient;
using System.Web.Mvc;
// (plus the namespace of the generated CamelTrapEntities model)

public class TestController : Controller
{
    //
    // GET: /Test/
    private readonly Random _rng = new Random();
    private const string _chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    private string RandomString(int size)
    {
        var randomSize = _rng.Next(size); // random length between 0 and size - 1

        char[] buffer = new char[randomSize];

        for (int i = 0; i < randomSize; i++)
        {
            buffer[i] = _chars[_rng.Next(_chars.Length)];
        }
        return new string(buffer);
    }


    public ActionResult EFPerformance()
    {
        string result = "";

        TruncateTable();
        result = result + "SaveChanges() after 1 row:" + EFPerformanceTest(10000, 1).TotalMilliseconds + "<br/>";
        TruncateTable();
        result = result + "SaveChanges() after 100 rows:" + EFPerformanceTest(10000, 100).TotalMilliseconds + "<br/>";
        TruncateTable();
        result = result + "SaveChanges() after 10000 rows:" + EFPerformanceTest(10000, 10000).TotalMilliseconds + "<br/>";
        TruncateTable();
        result = result + "SaveChanges() after 1 row:" + EFPerformanceTest(50000, 1).TotalMilliseconds + "<br/>";
        TruncateTable();
        result = result + "SaveChanges() after 500 rows:" + EFPerformanceTest(50000, 500).TotalMilliseconds + "<br/>";
        TruncateTable();
        result = result + "SaveChanges() after 50000 rows:" + EFPerformanceTest(50000, 50000).TotalMilliseconds + "<br/>";
        TruncateTable();

        return Content(result);
    }

    private void TruncateTable()
    {
        using (var context = new CamelTrapEntities())
        {
            var connection = ((EntityConnection)context.Connection).StoreConnection;
            connection.Open();
            var command = connection.CreateCommand();
            command.CommandText = @"TRUNCATE TABLE TestTable";
            command.ExecuteNonQuery();
        }
    }

    private TimeSpan EFPerformanceTest(int noOfRows, int commitAfterRows)
    {
        var startDate = DateTime.Now;

        using (var context = new CamelTrapEntities())
        {
            for (int i = 1; i <= noOfRows; ++i)
            {
                var testItem = new TestTable();
                testItem.SomeVarchar = RandomString(100);
                testItem.SomeOtherVarchar = RandomString(50);
                testItem.SomeInt = _rng.Next(10000);
                testItem.SomeOtherInt = _rng.Next(200000);
                context.AddToTestTable(testItem);

                // Commit every commitAfterRows rows; the test row counts are exact
                // multiples, so no final SaveChanges() is needed after the loop.
                if (i % commitAfterRows == 0) context.SaveChanges();
            }
        }

        var endDate = DateTime.Now;

        return endDate.Subtract(startDate);
    }
}

Popular Answer

I just optimized a very similar problem in my own code, and I'd like to point out an optimization that worked for me.

I found that SaveChanges(), whether saving 100 or 1000 records at a time, takes a significant amount of CPU. So, by processing the contexts with a producer/consumer pattern (implemented with BlockingCollection), I was able to make much better use of the CPU cores and went from a total of 4000 changes per second to over 14,000 changes per second. My CPU utilization moved from about 13% (I have 8 cores) to about 60%. Even using multiple consumer threads, I barely taxed the (very fast) disk I/O system, and SQL Server's CPU utilization was no higher than 15%.

By offloading the saving to multiple threads, you can tune both the number of records per commit and the number of threads performing the commit operations.

I found that I needed to create 1 producer thread and (# of CPU cores) - 1 consumer threads, and to tune the number of records committed per batch so that the count of items in the BlockingCollection fluctuated between 0 and 1 (after a consumer thread took one item). That way the consuming threads had just enough work to perform optimally.

This scenario of course requires creating a new context for every batch, which I found is actually faster even in a single-threaded scenario for my use case.
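The answer describes the pattern without code, so purely as an illustration (not the author's actual implementation), a producer/consumer setup along these lines, reusing the CamelTrapEntities/TestTable model from the accepted answer, could look roughly like this:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelImporter
{
    public static void Import(IEnumerable<TestTable> entities, int batchSize)
    {
        // Small bounded capacity keeps the producer from running far ahead;
        // the answer tuned batchSize so the queue count fluctuated between 0 and 1.
        var batches = new BlockingCollection<List<TestTable>>(1);

        // 1 producer thread (this one) and (# of CPU cores) - 1 consumer threads.
        int consumerCount = Math.Max(1, Environment.ProcessorCount - 1);
        var consumers = new List<Task>();
        for (int i = 0; i < consumerCount; i++)
        {
            consumers.Add(Task.Factory.StartNew(() =>
            {
                foreach (var batch in batches.GetConsumingEnumerable())
                {
                    // Each batch gets its own, freshly created context.
                    using (var context = new CamelTrapEntities())
                    {
                        foreach (var item in batch)
                            context.AddToTestTable(item);
                        context.SaveChanges();
                    }
                }
            }, TaskCreationOptions.LongRunning));
        }

        // Producer: build batches and hand them over to the consumers.
        var current = new List<TestTable>(batchSize);
        foreach (var entity in entities)
        {
            current.Add(entity);
            if (current.Count == batchSize)
            {
                batches.Add(current);
                current = new List<TestTable>(batchSize);
            }
        }
        if (current.Count > 0)
            batches.Add(current);

        batches.CompleteAdding();          // no more batches are coming
        Task.WaitAll(consumers.ToArray());
    }
}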



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow