In this post, I describe how the query execution engine selects the best index to use.

When the engine finds an IndexProvider (for a primary index) in the source query, it searches for all the secondary indexes associated with the given primary index. Then the engine transforms the filter predicate into a RangeSet (actually, into an expression returning a RangeSet) for each found index; a RangeSet for the primary index is created as well. As I wrote earlier, we convert the original predicate to Conjunctive Normal Form (CNF) to do this. During the transformation of a predicate, only comparison operations that access key fields of an index can be used to restrict the set of index entries that need to be loaded. Therefore, using different indexes for the transformation produces different RangeSets.

At the next step, the engine calculates the cost of loading the data for each index. To do this, we compile the corresponding RangeSet expression and evaluate it into the actual RangeSet. If the source predicate contains an instance of the Parameter class, the engine uses the expected value of this parameter while evaluating the compiled RangeSet expression. So finally we get a RangeSet object identifying the index ranges that must be extracted from a particular index to evaluate the query using that index.

The cost calculation is based on index statistics, which exist for each of our indexes. Roughly, statistics is a function returning the approximate amount of data lying in a particular index range. In fact, it's a histogram of data distribution, where the amount of data is bound to the Y axis and the index key value is bound to the X axis.

After the costs are calculated for all indexes, the engine selects the index and the corresponding RangeSet associated with the minimal cost (a rough sketch of this loop follows the list below). Currently, we use a pretty simple selection algorithm which selects the cheapest index for each part of the source predicate independently. In the future we plan to implement a more complex and effective algorithm here, but for now it's OK.

When the index is selected, we perform the actual query transformation:
- If the primary index is selected, the engine does not modify the source query.
- If one of the secondary indexes is selected, the engine inserts IndexProviders corresponding to the selected indexes into the query, adds RangeProviders after them (extracting the RangeSets associated with them) and finally joins the primary index. The original filtering criteria (FilterProvider) follows this whole chain.
- We don't eliminate unused columns at this stage - this is done by an additional column-based optimization step that runs later.
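To make the selection step concrete, here is a minimal sketch of the cost-based loop described above. The IndexCandidate and RangeSet types and the statistics function are simplified placeholders I introduce for illustration - not the actual DO4 classes:

    using System;
    using System.Collections.Generic;

    // Placeholder: identifies the index ranges to load (the real RangeSet is richer).
    class RangeSet { }

    class IndexCandidate
    {
      public string Name;
      public Func<RangeSet> CompiledRangeSet;    // the compiled RangeSet expression
      public Func<RangeSet, double> Statistics;  // histogram-based estimate of data in a range set
    }

    static class IndexSelector
    {
      // Evaluates the RangeSet of each candidate index, estimates the amount of
      // data it would load using the index statistics and picks the cheapest one.
      public static IndexCandidate SelectCheapest(
        IEnumerable<IndexCandidate> candidates, out RangeSet bestRanges)
      {
        IndexCandidate best = null;
        bestRanges = null;
        var bestCost = double.PositiveInfinity;
        foreach (var candidate in candidates) {
          var ranges = candidate.CompiledRangeSet();  // evaluate into the actual RangeSet
          var cost = candidate.Statistics(ranges);    // approximate amount of data to load
          if (cost < bestCost) {
            best = candidate;
            bestRanges = ranges;
            bestCost = cost;
          }
        }
        return best;
      }
    }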
P.S. The article was actually written by Alexander Nickolaev. I just re-posted it after fixing some mistakes.
Probably you know that DO4 includes our own implementation of an RDBMS. Currently, we support an in-memory DB only, but the development of a file-based DB is scheduled. Like any other RDBMS, we try to optimize query execution to achieve better performance. There are several ways to perform such optimization; in this post, I describe the optimization based on indexes.

The aim of this optimization is to reduce the amount of data to be retrieved from an index. This reduction can be achieved by loading only those index entries whose keys belong to specified ranges. The query execution engine tries to create these ranges by transforming the predicates of FilterProviders found in the query. Currently, the engine can process only those filters which are placed immediately after an IndexProvider, but this part of the algorithm will be improved.

The engine tries to transform predicates to Conjunctive Normal Form (CNF) before extracting index key ranges. If this transformation is successful, the engine analyzes the terms of the CNF. There are two kinds of CNF terms (illustrated in the sketch after this list):
- Comparison operation;
- Stand-alone boolean expression.
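To make the two kinds of terms concrete, here is a hypothetical predicate (an illustration only, not engine code) that is already in CNF with one term of each kind:

    using System;
    using System.Linq.Expressions;

    class Person { public int Age; public string Name; }
    class Employee : Person { }

    static class CnfExample
    {
      static void Main()
      {
        // Two CNF terms joined by &&:
        //   p.Age > 20    - a comparison operation (usable for key range extraction)
        //   p is Employee - a stand-alone boolean expression
        Expression<Func<Person, bool>> predicate = p => p.Age > 20 && p is Employee;
        Console.WriteLine(predicate);
      }
    }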
Some comparison operations can be transformed to ranges of index keys. This is possible when only one side of the comparison contains an expression accessing a Tuple. The comparison operations recognized by the query execution engine are:
- >
- <
- ==
- !=
- >=
- <=
- Compare methods
- CompareTo methods
- StartsWith methods
Examples:
    person.Age > 10
    person.Name.StartsWith("A")
If a term is a stand-alone boolean expression (e.g. a is SomeType), then the engine creates a range representing either all index keys or none. We also support multi-column indexes. For example, the expression person.FirstName == "Alex" && person.Age > 20 can be transformed to a range of keys of an index built over the FirstName and Age columns.

If a predicate can not be normalized, then the engine extracts index key ranges from it by walking recursively through its expression tree (a sketch of this walk follows the list below). The following expressions are always transformed to the range representing all index keys:
- A comparison operation containing access to a Tuple on both of its sides;
- Expressions which are not recognized as comparison operations.
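Here is a minimal sketch of that recursive walk, following the rules above. The RangeSet class below is a toy placeholder describing ranges symbolically, and the "accesses a Tuple" test is crudely approximated - the actual DO4 implementation differs:

    using System;
    using System.Linq.Expressions;

    // Toy placeholder: describes key ranges symbolically.
    sealed class RangeSet
    {
      public static readonly RangeSet Full = new RangeSet("all keys");
      private readonly string description;
      public RangeSet(string description) { this.description = description; }
      public RangeSet Intersect(RangeSet other) { return new RangeSet("(" + this + ") and (" + other + ")"); }
      public RangeSet Union(RangeSet other) { return new RangeSet("(" + this + ") or (" + other + ")"); }
      public override string ToString() { return description; }
    }

    static class RangeExtractor
    {
      public static RangeSet Extract(Expression e)
      {
        var b = e as BinaryExpression;
        if (b != null)
          switch (b.NodeType) {
            case ExpressionType.AndAlso:
              return Extract(b.Left).Intersect(Extract(b.Right));
            case ExpressionType.OrElse:
              return Extract(b.Left).Union(Extract(b.Right));
            case ExpressionType.GreaterThan:
              // Usable only when exactly one side accesses the index key (a Tuple);
              // key access on both sides degrades to the full range.
              if (IsKeyAccess(b.Left) && !IsKeyAccess(b.Right))
                return new RangeSet("key > " + b.Right);
              if (!IsKeyAccess(b.Left) && IsKeyAccess(b.Right))
                return new RangeSet("key < " + b.Left);  // mirrored comparison
              return RangeSet.Full;
            // <, ==, !=, >=, <= and the method-based comparisons are handled analogously.
          }
        // Unrecognized expressions keep the full range: correctness over precision.
        return RangeSet.Full;
      }

      // Crude stand-in for "the expression accesses a Tuple (an index key field)".
      private static bool IsKeyAccess(Expression e) { return e is MemberExpression; }
    }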
In the next posts, I will describe further details of our index-based query optimization algorithm - e.g. how the engine selects the best index to use based on statistics.
Let us suppose that a Person class has an Age property of type int whose value can not be negative. We can certainly implement this check in two different ways: check the value in the property setter or in the OnValidate method. In the first case we have to expand the auto-property and write code like this:

    [Field]
    public int Age {
      get { return GetFieldValue<int>("Age"); }
      set {
        if (value < 0)
          throw new Exception(string.Format(
            "Incorrect age ({0}), age can't be less than {1}.", value, 0));
        SetFieldValue<int>("Age", value);
      }
    }
The second way is:

    [Field]
    public int Age { get; set; }
    public override void OnValidate()
    {
      if (Age < 0)
        throw new Exception(string.Format(
          "Incorrect age ({0}), age can't be less than {1}.", Age, 0));
    }
The validation behavior of these two approaches is not the same: the exception will be thrown at different stages - on setting the property value or on validating the object. There is no single point of view on which way is preferable.

Property constraints are property-level aspects integrated with the validation system that simplify the implementation of value checks of this kind. A property constraint aspect is automatically applied to properties marked by the appropriate attributes. In our example the constraint declaration will look like this:

    [Field]
    [RangeConstraint(Min = 0,
      Message = "Incorrect age ({value}), age can not be less than {Min}.",
      Mode = ValidationMode.Immediate)]
    public int Age { get; set; }
or

    [Field]
    [RangeConstraint(Min = 0,
      Message = "Incorrect age ({value}), age can not be less than {Min}.")]
    public int Age { get; set; }
Each constraint attribute exposes two general properties: Message and Mode. The Message property value is used as the exception message when the check fails. We also plan to add the ability to take messages from string resources - this feature will be useful for application localization. The Mode property value determines whether the immediate (check in the setter) or delayed (check on object validation) mode should be used (see the usage sketch after the list below). All property constraints on a particular instance can also be checked with the CheckConstraints() extension method. Constraints are designed to work not only with our entities, but with any classes that implement the IValidationAware interface. The following property constraints are available by now:
- [EmailConstraint]: ensures that an email address is in correct format
- [FutureConstraint]: ensures that a date value is in the future
- [LengthConstraint]: ensures that string or collection length fits in the specified range
- [NotEmptyConstraint]: ensures that a string value is not empty
- [NotNullConstraint]: ensures that the property value is not null
- [NotNullOrEmptyConstraint]: ensures that the property value is not null or empty
- [PastConstraint]: ensures that a date value is in the past
- [RangeConstraint]: ensures that a numeric value fits in the specified range
- [RegexConstraint]: ensures that the property value matches the specified regular expression
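For instance, here is how the two modes differ for the Age property declared above (a purely hypothetical usage sketch - which branch applies depends on the Mode declared on the constraint attribute, and the exact exception type is an assumption):

    var person = new Person();

    // Mode = ValidationMode.Immediate: the check runs in the property setter,
    // so an invalid assignment throws right away.
    person.Age = -5;

    // Default (delayed) mode: the assignment succeeds, and the violation is
    // reported when the constraints are checked explicitly.
    person.Age = -5;
    person.CheckConstraints();  // throws here, reporting the Age violation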
Other constraints can easily be implemented as PropertyConstraintAspect descendants (a hypothetical sketch follows the example below). The following example illustrates the variety of property constraints on a Person class:

    [NotNullOrEmptyConstraint]
    [LengthConstraint(Max = 20, Mode = ValidationMode.Immediate)]
    public string Name { get; set; }

    [RangeConstraint(Min = 0,
      Message = "Incorrect age ({value}), age can not be less than {Min}.")]
    public int Age { get; set; }

    [PastConstraint]
    public DateTime RegistrationDate { get; set; }

    [RegexConstraint(Pattern = @"^(\(\d+\))?[-\d ]+$",
      Message = "Incorrect phone format '{value}'")]
    public string Phone { get; set; }

    [EmailConstraint]
    public string Email { get; set; }

    [RangeConstraint(Min = 1, Max = 2.13)]
    public double Height { get; set; }
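As for custom constraints, here is a purely hypothetical sketch of a PropertyConstraintAspect descendant - the names of the override points below are my assumptions for illustration, not the documented API:

    // Hypothetical: the method names and the base class contract are assumptions.
    [Serializable]
    public class EvenNumberConstraint : PropertyConstraintAspect
    {
      // Assumed check hook: returns true when the value satisfies the constraint.
      public override bool CheckValue(object value)
      {
        return value is int && ((int) value) % 2 == 0;
      }

      // Assumed applicability hook: this constraint supports int properties only.
      public override bool IsSupported(Type valueType)
      {
        return valueType == typeof(int);
      }
    }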
Here is the link. So I feel it was a good idea to create our Tips blog ;) Who's Brad Wilson? "I'm a software developer at Microsoft, working on the ASP.NET team. I've previously worked on the CodePlex and patterns & practices teams," he says. Here is an interview with him.
I just read a nice post from the DevExpress CTO about this. We've been among the first with DO1.X, 2.X and 3.X - there were tons of unique features at that moment. Some examples:
- We've been using runtime proxies from the start - since v1.0 in 2003. NHibernate started to use them only in 2005. Now we're using PostSharp-based aspects, and I suspect many others have already started to look into this ;)
- We were the first who integrated full-text search & indexing into an ORM. The initial version supporting the Microsoft Search service appeared in 2003; Lucene.Net support was added in 2005. Now this feature is supported by many other ORM frameworks - e.g. NHibernate and Lightspeed.
- The same is true for translatable properties - we've been supporting them since 2003. Now they're implemented e.g. in Genom-e.
- We implemented a versioning extension in 2005. The Genom-e team added a similar historization feature just recently.
- And finally, schema upgrade - it has been a part of DO starting from v1.0. Now it exists in many frameworks, but it is always implemented as a design-time feature. Btw, I understand this brings some benefits (and we're going to provide a similar feature shortly), but what about runtime? As application users, we are used to installing new versions of applications without caring about running upgrade scripts. Moreover, as a developer, I'd prefer getting all the necessary upgrade logic executed automatically. So why are all these frameworks capable only of generating such scripts at design time? Who'll combine them? Who will execute them? Who will produce their versions for other RDBMS? Who will check if the existing database version isn't too old? I think this approach leaves quite many TODOs for very common tasks to the developers.

And, AFAIK, the following features are still unique:
- Paired (inverse) properties - at least as this is done in DO, i.e. in a fully commutative way.
- Persistent interfaces - again, I'm speaking about our own version of this feature.
- Access control \ security system.
- Action-level instead of state-level change logging (used for disconnected state change replication).

So why did DO3.X die? Because of the architectural flaws that initially allowed us to develop it fast. E.g. our mapping support was quite limited. Really, being first usually != being best. Do you know that:
- I was 23 years old when the first version of DO was released. I'm 10 years younger than e.g. Frans Bouma ;)
- I had good experience with RDBMS at that moment, but as you may find, my experience was mainly limited to the scope of SMB web applications, and this finally led to some architectural flaws in v1.X.
- I had a really good programming background - my CV was very good at that point. But I hadn't developed a framework comparable in scale to DO before. And framework development guidelines are quite different from application development guidelines. Initial architectural flaws are much more painful here: in some cases you simply can't overcome them by adding an N-th module.

On the other hand, we gained huge ORM and database experience with v1.X-3.X. We became real experts. We studied & fixed lots of cases and issues - from very frequent ones to those that seem quite hard to face in a particular application (but this doesn't mean you shouldn't keep them in mind). We know much more about what is important, and how it must work.
And what's more important, we've been ultimately the first ones exploring many new paths and features - take a look at the lists above ;) Most of the features there were later adopted by others - this proves they were good enough, and, what's more important, it shows we can generate and implement such ideas earlier than others do. Maybe that's because I don't like to merely repeat others. Take NHibernate as an example: isn't it really boring to follow the path passed by others a few years ago (I mean Hibernate)? OK, it is already successful, and such a path offers an attractive way to start. But is it a good enough reason to plainly repeat it, instead of making something better?

OK, it was really a kind of rush for us, until we reached our own limits - being the first. But as you know, we didn't stop at this! We just started a new rush - I hope this shows well what kind of characters are standing behind our team ;) Moreover, all these years we've been growing up using all the opportunities we had, and this allowed us to accomplish our almost unimaginably complex task at all. Yes, now we're the only ones not just supporting third-party databases, but having our own, shiny-new, real RDBMS integrated with the ORM. And we're almost ready to show the full power of this competitive advantage (wait for sync, and further - Mono Silverlight support). We're going to eliminate the necessity to study and use the whole spectrum of technologies you need in most cases, including Entity Framework, Sync Framework, ADO.NET Data Services, .NET RIA Services, SQL Server Compact \ SQLite - and this isn't a full list ;)

Btw, probably we're the only ORM vendors that won't suffer much from EF's appearance: first of all, because our paths are probably the most distant ones in terms of the approach to the problem. And secondly, because we have already suffered enough from a 2-year pause in releases ;) I feel it will be a hard time for many, many others - especially the ones offering similar features. LLBLGen Pro, Genom-e, Subsonic, Lightspeed, etc., and even NHibernate (still no LINQ!) - guys, are you well prepared for this fight? At least we are - you know, Russians are used to winning wars during long and cold winters ;)

So what about being first vs. being best? It's simple. We've been carefully taking our time for 2 years. If DO1.X-3.X were mainly the first, we're making DO4.0 mainly the best - although from many points of view it is the first one as well. I'm quite happy that the most complex part of our development path has already been passed. The wheel is rotating now, and its speed is growing. We delivered v4.0 + 2 updates just in June. July promises to be even more attractive in terms of the features we're going to deliver. So I'm repeating myself once more: join the DO4 camp ;)
Please refer to this post.
1. Improved installer

I hope we've fixed the last "big bugs" there. The most annoying ones were:
- 228: "Add\remove programs" issue on installing both DO4 and Xtensive.MSBuildTasks
- 232: DO 4.0.1 installer doesn't update assemblies located in the PostSharp directory
- 233: Projects created by the project template are bound to a specific installation path of DO4

Because of 228 & 232, the recommended upgrade path to v4.0.2 is:
- Uninstall DO4. If the item is absent in "Add\Remove programs", just remove its folder C:\Program Files\X-tensive.com (or the installation path you've chosen).
- Uninstall Xtensive.MSBuildTasks, if you have installed it. If the item is absent in "Add\Remove programs", just remove its folders from C:\Program Files\MSBuild and C:\Program Files\X-tensive.com (or the installation path you've chosen).
- Uninstall all other components previously required by DO4, including Unity, Parallel Extensions, MSBuild Community Tasks and PostSharp.
- Install the new DO4. It will suggest installing just PostSharp. Everything else is optional now; all Unity and Parallel Extensions assemblies are installed into the GAC automatically.

Other changes include the following:
- The installer automatically detects & requires uninstalling an old version of DO4.
- All required assemblies are now installed into the GAC. If you're worried about this, there are .bat files allowing you to get rid of them with ease.
- There are new project templates (Console, Model, UnitTests, WebApplication, WPF). But they're only for C# for now.
- New Build.bat files build the new DO, automatically performing all "before first build" steps. So it's really easy now to make a custom build of it.

Useful links:
- Full list of installer-related issues
- New installation instruction
- New "Building DataObjects.Net" instruction

2. LINQ

As you might remember, two weeks ago we didn't support 2 LINQ features:
- Group joins
- First\Single(OrDefault) in subqueries (btw, as far as I remember, Single in subqueries isn't supported in EF at all)

Both features are supported now. So now we're fully ready to compare our LINQ implementation with others - a set of articles about this will appear here soon.

Useful links:
- LINQ-related issues.

3. Breaking changes in attributes

We've refactored our mapping attributes once more. Now there are:
- A separate [Association] attribute for associations.
- A separate [Mapping] attribute allowing you to specify mapping names.
- No more [Entity] attribute - it was necessary just to specify the mapping name, but now this is handled by a separate attribute.

Earlier these functions were distributed over [Field] and the old abstract MappingAttribute. We think the new version is better: specific (and, actually, more rarely needed) features require specific attributes.

4. Schema upgrade

We've added ChangeFieldTypeHint. So the schema upgrade hint set is ideal now ;)

5. Documentation

We're slowly updating it. As you may find, we restructured our wiki. The manual is organized in a step-by-step studying fashion now. Among other new articles, there is a new Schema upgrade article - check it out.

6. ADO.NET Data Services (Astoria) sample

We've implemented an ADO.NET Data Services (Astoria) sample on DO4. We decided to publish it separately:
- It isn't really polished yet.
- It depends on Silverlight Tools, so we must decide if this additional dependency is acceptable.

It shows an Astoria service sharing entities via a RESTful API, as well as WindowsForms and Silverlight clients consuming this service, showing and allowing to change the entities it gets. What does this mean?
You can share DO4 entities using ADO.NET Data Services, query the service from the client using LINQ, update the entities on the client and send the changes back. Since the Astoria client operates on Silverlight as well, you can implement a Silverlight client utilizing DO4 on the server.

Btw... We're disappointed in the Astoria client features. You have to do lots of tasks manually there, including registering new entities, changed associations and so on. From the point of usability it's much worse than what is offered by DO4. So in general, the upcoming sync will be a much more attractive option for DO4 users. But on the other hand, Astoria allows implementing a really simple RESTful integration API with almost zero coding. The sample will be available @ our downloads section today.

7. Bugfixes

We've got really good results here. Earlier I wrote that there were just a few failing tests out of about 1000 tests for Storage. Imagine:
- About 600 tests are related to our LINQ implementation, and indirectly - to the RSE implementation.
- All the tests produce the same results - even on the Memory storage. This means our RSE execution & optimization engine works as expected.

So the version we have now seems really stable. Good luck trying it ;)
What's new? Check it out. You can download it right now. All the details will follow shortly.
I'm working on its publication right now. What's done? Check it out.

We've implemented 41 issues during the last 2 weeks. And we finally got nearly perfect test results:
- Auto Memory: tests failed: 3, passed: 954, ignored: 43
- Auto PostgreSql: tests failed: 3, passed: 956, ignored: 40
- Auto SqlServer: tests failed: 3, passed: 955, ignored: 41

AFAIK, 1 of the 3 failing tests actually fails because of specific restrictions on the build agents. The others are related to rounding issues on different servers, and it seems there is no ideal way to resolve them. So likely we'll just document this. So we can almost fully honestly say all our tests are passing now.

There is a difference in ignored tests as well - that's because some of them are provider-specific. E.g. schema upgrade tests don't run on the Memory storage. The complete test sequence includes about 30 different configurations (with various domain configuration options, etc.): ~10 for each provider type. They're running now, and there may be a few more failures. The above results are for the Auto configurations - they combine the most common options.

P.S. I'll briefly highlight the most important changes in the next post.