Comments on DataObjects.Net Team Blog: A good post about TPT ("table per type") inheritance mapping and its performance-related concerns in EF

I discovered your web site via Google while lookin...

2011-07-20T16:01:03.737+06:00

I discovered your web site via Google while looking for a related subject, lucky for me your web site came up, it seems to be great. I have bookmarked it in my google bookmarks.
Please Visit ===> Physician Assistant Training Class

2011-04-11T03:05:55.366+06:00

This comment has been removed by a blog administrator.

This is a great answer, thanks. Your points are ma...

2010-08-30T02:23:34.847+06:00

This is a great answer, thanks. Your points are mainly valid but I want to point out that a merge join does not needed to fully scan the orders table. If the orders are clustered by (customerid, order.id) then a single seek over a contiguous range is enough. Not a single scanned row is wasted.

Your reasoning is very clear and valid: The performance difference is small and customer complaints might come up.

"Batched prefetch a) forces a nested loops jo...

2010-08-29T22:30:52.543+06:00

"Batched prefetch a) forces a nested loops join and b) incurs unnecessary roundtrips."

Yes, that's true, although in case b) we do a lot to minimize them to just 1 or two.

In general, I agree that both ways must be available. IMHO, what we did so far is provided a better one of these two - at least, for majority of cases: I admit if we'd provide just a way with JOINs, we'd get requests like "why .Prefetch slows down my query by 100 times" almost immediately. So far we didn't get any of such requests, and that's good. We didn't get any performance related complains at all in this case.

So I think we must solve the problem when it will be clear it's really necessary for someone to get it working in another way.

Btw, there is a feature request for such prefetch API - also with a good discussion and some benchmarks. So we're ready to implement this, when it will be vital for any of our customers. But until that, it's better to focus on other features.

"If it was so easy to get best performance by...

2010-08-29T21:48:48.733+06:00

"If it was so easy to get best performance by just _not_ doing the join order processing, then why does sql server do it?"

That's because you consider it can optimize this, although practically, it can't. I described that all it can choose in this case is type of join algorithm to use, that actually isn't that important.

Btw, this also depends on order of prefetch joins relatively to other parts of the query, but I suspect if you put them closer to the accessed tables, this is even worse, since it can't re-order these LEFT OUTER JOINs with others (= must perform these JOINs first), and thus looses nearly any opportunity to optimize the query. So the idea to put prefetch joins closer to the leaf nodes of query is bad - they must be injected closer to the root.

2. I don't know what to add here - all I mean ...

2010-08-29T21:40:46.561+06:00

2. I don't know what to add here - all I mean is that actually SQL Server has nearly zero choice related to optimizations in typical cases. And when there is a choice, there are good solutions in case with DO as well.

Ah, forgot: may be the main benefit in our case is that cost of .Prefetch is quite predictable. If you use it, you can be sure you won't get completely different query plan. Yes, prefetching the items might be a bit more expensive in this case, but definitely not 50...100 times more expensive.

Now let's return back to the questions: "...

2010-08-29T21:34:51.843+06:00

Now let's return back to the questions:

"By taking the choice of join order away from SQL Server you are just completely eliminating any optimizations it may have done."

Let's think what kind of optimizations SQL Server may do to handle prefetch queries, if they'd be expressed as JOINs:

1. Such prefetches can be only expressed as LEFT OUTER JOIN - both in case with references and collections. So the only choice it has here (see the article) is the type of join algorithm to use.

But MERGE JOIN and HASH JOIN imply both sets must be fully scanned - i.e. in your example, all the Orders must be scanned. So these algorithms are attractive only when either joined table is relatively small, or you read nearly all the data. But if it's really small, there is no problem with .Prefetch as well, since SQL Server will use scans to handle prefetch queries as well. And if you read all the data (note this is relatively rare task), you can:
- do this directly - by fetching all the orders. Of course, if they'd fit in RAM.
- do this using .Prefetch. If indexes are build well for such scan (let's consider they are, otherwise the problem isn't related to DO), you'll get nearly the case with Orders and Customers described above. Bot what's really cool here is that DO will provide partial materialization and automatic GC for unused entities (its Session-level cache can work in "weak" mode), so you'll be able to process nearly arbitrary large result set in your application.

The only left type of join is NESTED LOOP JOIN -but that's exactly what we push it to do with .Prefetch.

I know this is arguable - we were carefully thinki...

2010-08-29T21:15:30.754+06:00

I know this is arguable - we were carefully thinking about the alternatives.

Let's start from your example with Orders and Customers - actually, it is pretty easy for DO.

Imagine, we must fetch all the Orders for, let's say, 30 Customers (1 page). As you've mentioned, ideal PK structure for Order here is pair of [Customer, Order.Id] (assuming Orders table has clustered PK). Customer.Orders in this case is EntitySet<Order^gt; paired to Order.Customer property.

DO will execute just a single query (probably - 2 in 1 batch - I don't remember the limits) to fetch all these EntitySets for 30 Customers. And, as you've mentioned, SQL Server will perform 30 PK range scans at max (index seeks fetching a range of records) to handle this. If these customers nearly sequential IDs (this depends on customers we're fetching the data for, but the same rule is applicable in your case as well), only few pages will fetched from disk to handle this, if they aren't cached.

If above case won't work well, nothing prevents us to e.g. create an index on Orders table including all the columns we need to display, and sorter in the way we need this (e.g. by Customer registration date). Of course, in this case we should use a different fetching strategy then .Prefetch (e.g. a query filtering by registration date range + local collection of Customer.Ids) to finally reach the goal of fetching all these orders in nearly single index range scan.

An Example would be customers and their orders. If...

2010-08-28T14:24:14.395+06:00

An Example would be customers and their orders. If you cluster the orders by (CustomerID, ID) instead of (ID), it takes a single range seek to fetch all orders for a customer. With fetching by ID it takes 10 bookmark lookups (on a nonclustered index containing ID) and 10 seeks into the CI.

By taking the choice of join order away from SQL S...

2010-08-28T03:28:42.543+06:00

By taking the choice of join order away from SQL Server you are just completely eliminating any optimizations it may have done. By fetching manually by ID you have forced an effective nested loop join for all joins on SQL Server (which now has no choice but to seek into the clustered index for every single entity. It cannot do range scans and merge joins anymore).
If it was so easy to get best performance by just _not_ doing the join order processing, then why does sql server do it?
This is unfortunately an anti-optimization, at least for flat hierarchies. DO is missing a join prefetch mechanism. Batched prefetch a) forces a nested loops join and b) incurs unnecessary roundtrips.

Nice post about SQL internals, good as usual Alex!...

2010-08-27T23:41:09.356+06:00

Nice post about SQL internals, good as usual Alex!