News, examples, tips, ideas and plans.
Thoughts around ORM, .NET and SQL databases.

Wednesday, July 15, 2009

Index-based query optimization - Part 2

In this post, I describe how the query execution engine selects the best index to use. When the engine finds an IndexProvider (for a primary index) in the source query, it searches for all the secondary indexes associated with the given primary index. Then, the engine performs the transformation of the filter predicate to the RangeSet (actually - to an expression returning RangeSet), for each found index. Also, the RangeSet for primary index is created. As I wrote earlier, we're converting the original predicate to Conjunctive Normal Form (CNF) to do this.

During the transformation of a predicate, only comparison operations having an access to key fields of an index can be used to restrict the set of index's entries which need to be loaded. Therefore, the usage of different indexes for the transformation produces different RangeSets.

At the next step, the engine calculates the cost of loading the data for each index. To do this, we compile the corrensponding RangeSet expression and evaluate it into the actual RangeSet. If the source predicate contains an instance of Parameter class, the engine uses the expected value of this parameter during evaluation of compiled RangeSet expression. So finally we get a RangeSet object identifying index ranges that must be extracted from a particular index to evaluate the query using this index.

The cost calculation is based on index statistics, which exists for each of our indexes. Approximately, statistics is a function returning approximate amount of data laying in particular index range. In fact, it's a histogram of data distribution, where the amount of data is bound to Y axes, and the index key value is bound to X axes.

After the completion of costs' calculation for all indexes, the engine selects the index and corresponding RangeSet associated with the minimal cost. Currently, we use pretty simple selection algorithm which selects the cheapest index for each part of the source predicate independently. In future we plan to implement the more complex and effective algoritm here, but for now it's ok.

When the index is selected, we perform actual query transformation:
  • If primary index is selected, the engine does not modify the source query.
  • If one of secondary indexes is selected, the engine inserts IndexProviders corresponding to selected indexes into the query, adds RangeProviders after them (extracting RangeSets associated with them) and finally joins primary index. The original filtering criteria (FilterProvider) follows all this chain.
  • We don't eliminate unused columns on this stage - this is done by additional column-based optimization step running further.
P.S. The article was actually written by Alexander Nickolaev. I just re-posted if after fixing some mistakes.

No comments:

Post a Comment