Trying to refactor some code that has gotten really slow recently and I came across a code block that is taking 5+ seconds to execute.
The code consists of 2 statements:
IEnumerable<int> StudentIds = _entities.Filters .Where(x => x.TeacherId == Profile.TeacherId.Value && x.StudentId != null) .Select(x => x.StudentId) .Distinct<int>();
_entities.StudentClassrooms .Include("ClassroomTerm.Classroom.School.District") .Include("ClassroomTerm.Teacher.Profile") .Include("Student") .Where(x => StudentIds.Contains(x.StudentId) && x.ClassroomTerm.IsActive && x.ClassroomTerm.Classroom.IsActive && x.ClassroomTerm.Classroom.School.IsActive && x.ClassroomTerm.Classroom.School.District.IsActive).AsQueryable<StudentClassroom>();
So it's a bit messy but first I get a Distinct list of Id's from one Table (Filters), then I query another Table using it.
These are relatively small tables, but it's still 5+ seconds of query time.
I put this in LINQPad and it showed that it was doing the bottom query first then running 1000 "distinct" queries afterwards.
On a whim I changed the "StudentIds" code by just adding .ToArray() at the end. This improved the speed 1000x ... it now takes like 100ms to complete the same query.
What's the deal? What am I doing wrong?
This is one of the pitfalls of deferred execution in Linq: In your first approach
StudentIds is really an
IQueryable, not an in-memory collection. That means using it in the second query will run the query again on the database - each and every time.
Forcing execution of the first query by using
StudentIds an in-memory collection and the
Contains part in your second query will run over this collection that contains a fixed sequence of items - This gets mapped to something equivalent to a SQL
where StudentId in (1,2,3,4) query.
This query will of course, be much much faster since you determined this sequence once up-front, and not every time the
Where clause is executed. Your second query without using
ToArray() (I would think) would be mapped to a SQL query with an
where exists (...) sub-query that gets evaluated for each row.
ToArray() Materializes the initial query to the server memory.
My guess would be the query provider is not able to parse the expression
StudentIds.Contains(x.StudentId). Hence it probably thinks that the
studentIds is an array already loaded to memory. So it's probably querying the database over and over again during the parsing phase. The only way to know for sure is to setup the profiler.
If you need to do this on the db server, use a join, instead of "contains". If you need to use contains to do what looks like a join problem, you are likely to be missing a surrogate primary key or a foreign key somewhere.
You could also declare
studentIds as IQueryable instead of IEnumerable. This might give the query provider the hint it needs to interpret the
studentIds as expression aka. data not already loaded to memory. I somehow doubt this but worth a try.
If all else fails, use
ToArray(). This will load the initial
studentIds to memory.