DataObjects.Net Team Blog

Friday, July 31, 2009

Validation & property constraints in DO4

There is an interesting topic related to validation, property constraints and compatibility with third-party validation frameworks in our support forum. I'm explaining some background concepts and internals of our validation framework (Xtensive.Integrity.Validation) there.

I'll be glad to answer any other questions related to validation, if they'll appear ;)

Wednesday, July 29, 2009

Preliminary ORM performance comparison: DataObjects.Net 4 vs NHibernate

We've just adopted our CrudTest for NHibernate. First results:

DO4 (LINQ):
Insert: 28,617 K/s.
Update: 34,111 K/s.
Fetch & GetField: 8,682 K/s.
Query: 1,486 K/s.
CachedQuery: 8,176 K/s.
Materialize: 358,671 K/s.
Remove: 41,108 K/s.

NHibernate (LINQ):
Insert: 12,936 K/s.
Update: 12,939 K/s.
Fetch & GetField: 7,152 K/s.
Query: 95,7/s.
CachedQuery: Cached queries are not supported in NH yet.
Materialize: 37,892 K/s.
Remove: 13,012 K/s

The highlighted numbers show ~ 10 times difference, although DO wins in all other cases as well. Results of this test for several other ORMs are upcoming, the project will be shared @ Google Code.

In addition, we're developing general LINQ test as well. For now LINQ for NHibernate passes ~ 25 tests out of 100. This mean only very basic LINQ stuff really works on this release. DO4 passes ~ 98 tests there (it still misses passing arrays/collections as query parameters).

P.S. The optimization we've made during last 2 weeks is already quite successful (although we still work on materialization performance). Pre-optimization results can be found here.

Tuesday, July 28, 2009

What models do we maintain

Since there are at least 3 visible models, it's necessary to explain why we maintain so many of them ;)

Our models stack is combined from the following models:

Xtensive.Storage.Model (+ Xtensive.Storage.Building.Definitions)

That's the top-level model used by storage. In fact, it consists of two parts:
- Definitions model: XxxDef types, e.g. TypeDef
- Runtime model: everything else. E.g. TypeInfo. Exposed via Domain.Model property.
- Serializable version of runtime model: see Xtensive.Storage.Model.Stored namespace.

Definitions are used on the first step of Domain build process. We reflect all the registered typed and build "model definition". The definitions can be freely added, removed or modified by your own IModules - you should just implement OnDefinitionsBuilt method there. This allow modules to dynamically build or change anything they want - e.g. they can add a property to any of registered types, or register a companion (associated) typefor each one of them. Note that nothing special must be done to make this happen, you should just ensure that necessary module is added to the set of registered types.

So definitions describe crude model - there is just minimal information needed to build a runtime model.

And as you might assume, runtime model is built by definitions gathered gathered on the previous step. It is much more complex - e.g. it fully describes any association and mapping.

In general, it is build to immediately answer on any question appearing during the Domain runtime.

Finally, there is an XML-serializable version of model. It is loaded & serialized to database during each schema upgrade. Our schema upgrade layer uses it to propertly translate type-level hints related to old model to its schema-level hints and uses it in upgrade process to make it more intelligent. You can find it serialized into one of rows in Metadata.Extension table.

Xtensive.Storage.Indexing.Model

Let's call it schema model. This is a low-level model of storage we use during schema comparison and upgrade process. It differs from Storage.Model, because:
- Storage model maintains two-way relationships betweem type-level and storage-level objects. E.g. betweem types and tables, properties and columns. But here we need only a part of this model related to storage-level objects.
- Storage model is built to quickly answer on common questions. This model is designed to schema change and comparison well.
- Storage model is more crude. E.g. DO isn't much interested of foreign keys - it should just know there must be a foreign key. Schema model knows all the details about it.

Schema model is used to:
- Compare extracted and required schema. This process is actually more complex than you might expect - to generate the upgarde actions well, we split the comparison process into a set of steps, and comparing the models related to them. In fact, we do something like: ExtractedModel -> Step1Model -> Step2Model -> ... -> RequiredModel. That's why we must be able to clone and change it nearly as any SQL Server does this. Step1Model here may refer to model with dropped foreign key constraints, Step2Model can be e.g. model containing temporarily renamed schema objects and so on. Yes, we can safely handle rename loops like A->B', B->C', C->A' - we detect & break such loops by renaming one of objects in it to temporary named one on intermediate step ;)
- Index engines use this schema as their native schema format. Actually it is quite fast as well, if locked ;)

Schema models are available via two Domain properties:
- ExtractedSchema
- Schema.

Xtensive.Sql.Model

This is the SQL schema model - a schema model of SQL database in its native form. It differs from the above one - it describes all the SQL schema terms instead of a part of them we need. For example, you can find View and Partition objects there, althogh for now we don't have their analogues in schema model.

This model us used to:
- Produce extracted schema model. SQL DOM provides Extractor allowing to extract it for any supported database; the result of its work is sent to SqlModelConverter (a part of any SQL storage provider) to produce the schema model from it. So this converter is responsible for such decisions as ignoring non-supported SQL schema objects and so on.
- Produce SQL statements (commands). SQL DOM refers to its objects by its statement model objects, such as SqlAlterTable.

You can't access this model in runtime, but it is available to any SQL storage provider via its DomainHandler.Mappings member (see Xtensive.Storage.Providers.Sql namespace - it looks like I expluded this part from the brief version of API reference). These mappings are used to produce generally any SQL command sent by provider.

Xtensive.Sql

This is our model of SQL language - so-called SQL DOM. You can think objects from this model (except SQL schema model objects) normally have rather short lifetime, since they represent parts of particular SQL commands. But actually this isn't true:
- We cache almost any SQL request model we build. Cached request models are bound to particular CRUD operations, LINQ and RSE queries. So in general we almost never build a request model twice.
- We cache even pre-translated versions of request parts. Relatively long strings from which we combine the final version of requests. This allows us to produce a version of request with differently named parameters almost instantly.
- Moreover, we support branching in SQL DOM request model. It is used to produce different versions of request containing external boolean parameters. Earlier I wrote this can be quite important: let's imagine we compiled a request with all the branches. One of such branches there may require table scan in query plan, and thus query plan the whole SQL request will rely on table scan. This "slow" branch could be a rarely used one (i.e. condition turning its logic "on" is quite rarely evaluated to true). But the plan will always use table scan, since RDBMS produces the most generic query plan version. An example of such query is "Select * from A where @All==1 or @Id==A.Id". Check out its plan on SQL Server. Then imagine, if normally @All is 0. Branching & pre-translated query parts allow us to handle such cases perfectly.

Monday, July 27, 2009

What we're busy with?

Since there is a vacations season, last weeks we've been working mainly on performance (I promised we'll spend some time on this some day). And I'm glad to announce we've reached almost invincible level:
- We beat plain SqlClient on insertion test by about 15%. Seems almost impossible, yes? Well, but this is the effect of our batching implementation. Later I'll uncover all the details.
- Update test is also quite close to SqlClient mark.
- Materialization is one more area where we've got really good progress. No any exact numbers here, since we're working on it now.

The new, ultra high-speed DO bolide will be shown by the end of this week.

Wednesday, July 22, 2009

Query transformation pipeline inside out: APPLY rewriter

DO4 is designed to support various RDBMS having different capabilities. One of such capabilities is possibility to use reference values from the left side of join operation in its right side. Microsoft SQL Server supports this kind of join via APPLY statement. But, for example, PostgreSQL does not provide any similar feature. However, an operator like APPLY is required to translate of many LINQ queries - such backreferences are very natural to LINQ.

As you might know, our LINQ translation layer translates LINQ queries to RSE - in fact, query plans. Further these plans are sent to our transformation & optimization pipeline, which is different for different RDBMS, although it is combined from the common set of optimizers (transformers). So if some RDBMS does not support certain feature, we add a transform rewriting the query to make it compatible with this RDBMS. In the end of all we translate the final query plan (RSE query) to native query for the current RDBMS. Note that if we meet something that can't be translated to native query on this step, an exception is thrown.

Let's return back to the subject of this artcile. There is ApplyProvider in RSE, which does the same job as APPLY statement. When query is compiled from LINQ to RSE, we freely use ApplyProvider everywhere where it is necessary. But as I've mentioned, there are some RDBMS that does not support it, and we must take care about this. That's the story behind APPLY rewriter.

To archive the same behavior for different RDBMS, we try to rewrite queries containing ApplyProvider (i.e. requiring APPLY statement to be translated "as is") to get rid of references to the left side from the right side. We perform the following modifications in the source query during rewriting process:

Transformation of an ApplyProvider to a PredicateJoinProvider. To create a predicate for a join operation, we traverse the right side searching for FilterProviders containing references to ApplyParameter.
Transferring CalculateProviders containing references to ApplyParameter above created PredicateJoinProvider.

Obviously, such rewriting is not always possible. Below there is the list of cases when we can rewrite the source query:

Right part returns a single row: in such a case rewriting is not required, since we can translate such ApplyProvider to SQL as "SELECT [left columns], [right columns] FROM ...", which is supported by any RDBMS
Columns used in expressions of considered FilterProviders or CalculateProviders are not removed by other providers (e.g. by AggregateProvider).
Reference to ApplyParameter is contained only in one of sources of a BinaryProvider, if such a provider exists in the right part of the original ApplyProvider.
Right part of ApplyProvider contains references only to its own ApplyParameter.

That's it. This is enough to get rid of APPLY from the most part of queries initially looking as if they'd involve it. Obviously, we can't eliminate APPLY from any queries.

An example LINQ query leading to CROSS APPLY on LINQ to SQL:

from category in Categories
from productGroup in (
  from product in Products
  where product.Category==category
  group product by product.UnitPrice
)
select new {category, productGroup}

Its SQL:

SELECT [t0].[CategoryID], [t0].[CategoryName], [t0].[Description], [t0].[Picture], [t2].[UnitPrice] AS [Key]
FROM [Categories] AS [t0]
CROSS APPLY (
  SELECT [t1].[UnitPrice]
  FROM [Products] AS [t1]
  WHERE [t1].[CategoryID] = [t0].[CategoryID]
  GROUP BY [t1].[UnitPrice]
  ) AS [t2]

We can't get rid of CROSS APPLY in such a query as well. On the other hand, the good thing is that we can translate it properly, if APPLY is supported - in contrast to almost any other ORM we've looked up (except EF and LINQ to SQL). The same is correct for APPLY-related query transformations - as we know, only EF and LINQ to SQL are aware about them.

Monday, July 20, 2009

DO4: continuous integration and testing

I planned to write this for a long time ago. We use TeamCity to continuously build & test DataObjects.Net 4 assemblies. Currently there are:

Post-commit tests. Run after every commit for all dependent projects.

Pre-commit project for Xtensive.Storage. There are 6 tests configurations:
- Memory,
- PostgreSQL 8.2, 8.3, 8.4
- SQL Server 2005, 2008.

The tests there are running when pre-tested commit TeamCity feature is used.

Nightly tests. There are 6 projects (one per each RDBMS version we support: Memory, PostgreSQL 8.2, 8.3, 8.4, SQL Server 2005, 2008), and each of them is tested in 6 different configurations to check everything works in all mapping strategies we support. To achieve this, we use two special IModule implementations in our tests:
- InheritanceSchemaModifier: sets InheritanceSchema to the specified one for all the hierarchies of Domain it is used in. Usage of this module multiplies possible test configurations by 3 (for ClassTable, SingleTable and ConcreteTable).
- TypeIdModified: if specified, injects TypeId column in to any primary key. This is important, because injection of TypeId may significantly affect on fetch performance for hierarchies with deep inheritance; moreover, since TypeId is handled specially in many cases, this allows to check all this logic properly works if TypeId is injected into key. Usage of this module multiplies possible test configurations by 2 (with and without TypeId in keys).

As you see, this gives 6 test configurations per each project, so totally we have 36 nightly test configurations.

All these tests are running on 3 primary test agents, althought there are few additional ones - e.g. we have a special agent running on virtual machine dedicated to build DataObjects.Net v3.9, since it requires outdated version of Sandcastle Help File Builder and some other tools.

Few screenshots:

Thursday, July 16, 2009

Huge June discounts are back in July!

Hi everyone! We decided to return huge June discounts back now. They'll be intact till the end of July, but if it will be desirable, we'll consider doing the same in August.

So it's still the perfect time to join DO4 camp ;)

Wednesday, July 15, 2009

Index-based query optimization - Part 2

In this post, I describe how the query execution engine selects the best index to use. When the engine finds an IndexProvider (for a primary index) in the source query, it searches for all the secondary indexes associated with the given primary index. Then, the engine performs the transformation of the filter predicate to the RangeSet (actually - to an expression returning RangeSet), for each found index. Also, the RangeSet for primary index is created. As I wrote earlier, we're converting the original predicate to Conjunctive Normal Form (CNF) to do this.

During the transformation of a predicate, only comparison operations having an access to key fields of an index can be used to restrict the set of index's entries which need to be loaded. Therefore, the usage of different indexes for the transformation produces different RangeSets.

At the next step, the engine calculates the cost of loading the data for each index. To do this, we compile the corrensponding RangeSet expression and evaluate it into the actual RangeSet. If the source predicate contains an instance of Parameter class, the engine uses the expected value of this parameter during evaluation of compiled RangeSet expression. So finally we get a RangeSet object identifying index ranges that must be extracted from a particular index to evaluate the query using this index.

The cost calculation is based on index statistics, which exists for each of our indexes. Approximately, statistics is a function returning approximate amount of data laying in particular index range. In fact, it's a histogram of data distribution, where the amount of data is bound to Y axes, and the index key value is bound to X axes.

After the completion of costs' calculation for all indexes, the engine selects the index and corresponding RangeSet associated with the minimal cost. Currently, we use pretty simple selection algorithm which selects the cheapest index for each part of the source predicate independently. In future we plan to implement the more complex and effective algoritm here, but for now it's ok.

When the index is selected, we perform actual query transformation:

If primary index is selected, the engine does not modify the source query.
If one of secondary indexes is selected, the engine inserts IndexProviders corresponding to selected indexes into the query, adds RangeProviders after them (extracting RangeSets associated with them) and finally joins primary index. The original filtering criteria (FilterProvider) follows all this chain.
We don't eliminate unused columns on this stage - this is done by additional column-based optimization step running further.

P.S. The article was actually written by Alexander Nickolaev. I just re-posted if after fixing some mistakes.

New benchmark results

Check out this article.

Wednesday, July 08, 2009

Index-based query optimization - Part 1

Probably you know, that DO4 includes our own implementation of RDBMS. Currently, we support in-memory DB only, but the development of file-based DB is scheduled. As well as other RDBMS we try to optimize a query execution to achieve the better performance. There are several ways to perform such optimization. In this post, I describe the optimization based on indexes.

The aim of this optimization is to reduce the amount of data to be retrieved from an index. This reduction can be achieved by loading only those index entries with keys belonging to specified ranges. The query execution engine tries to create these ranges by transforming predicates of FilterProviders found in the query. Currently, the engine can process only those filters which are placed immediately after IndexProvider, but this part of the algorithm will be improved.

The engine tries to transform predicates to Conjunctive Normal Form (CNF) before the extraction of index's keys ranges. If this transformation was successful then the engine analyzes terms of this CNF. There are two kinds of CNF's terms:

Comparison operation;
Stand-alone boolean expression.

Some of comparison operations can be transformed to ranges of index's keys. It is possible when only the one part of comparison operation contains an expression with Tuple.

The list of comparison operation which are recognized by the query execution engine:

>
<
==
!=
>=
<=
Compare methods
CompareTo methods
StartsWith methods

Examples:
person.Age > 10


person.Name.StartsWith("A")

If a term is stand-alone boolean expression (e.g. a is SomeType), then the engine creates range which will represents all index's keys or none.

We also support multi-column indexes. For example, expression person.FirstName == "Alex" && person.Age > 20 can be transformed to range of keys of an index built for FirstName and Age columns.

If a predicate can not be normalized, then the engine extracts index's keys ranges from it by walking recursively through its expression tree.

The following expressions are always transformed to the range representing all index's keys:

Comparison operation containing access to a Tuple in both of its parts;
Expressions which are not recognized as comparison operation.

In the next posts, I will describe further details of our index-based query optimization algorithm - e.g. how the engine selects the best index to use based on statistics.