News, examples, tips, ideas and plans.
Thoughts around ORM, .NET and SQL databases.

Thursday, August 27, 2009

Upcoming update: disconnected state, über-batching, future queries, object versions, prefetch API, locking, Oracle support

I wrote that after June's release of v4.0 final, DataObjects.Net development must go much faster. Earlier we were busy with LINQ (simply hell!), and recently we proved our implementation is more than mature enough - in many respects it is better than LINQ in EF.

Anyway, now our hands are free from this. And here is a brief description of what we are developing right now:

1. Disconnected state

Later I'm going to dedicate a separate post to describing this feature. Moreover, in several days I'll be ready to publish an example of its usage, since a significant part of its logic is already implemented.

Disconnected state, in fact, turns the Session into an NHibernate- or EF-like Session/DataContext:
  • Anything it fetches will be truly stored there, rather than cached, while DisconnectedState is attached to the Session. "Truly stored" means the Session won't try to update the state of each cached Entity in each subsequent transaction. So as you see, an attached DisconnectedState "blocks" the ability of the Session to automatically reflect the most current state from the storage.
  • An attached DisconnectedState "blocks" the ability of the Session to automatically flush changes to the storage - they will be flushed by explicit request only. Although there will be two options allowing you to automate this: AcceptChangesOnQuery and AcceptChangesOnCommit.
  • So an attached DisconnectedState affects queries. If changes aren't accepted before a particular query, its result might differ from the expected one by default.
  • Since DisconnectedState stores data acquired in the past and allows changing it, optimistic version checks are performed when these changes are persisted. See the description of the "object versions" feature for further details.
But there are some differences:
  • An attached DisconnectedState affects transactions. When a transaction starts in a Session with an attached DisconnectedState, it is, in fact, a logical transaction. But it might lead to an actual database transaction if it hits the database (so everything is nearly the same as when SessionOptions.AutoShortenTransactions is on). Each logical transaction (along with the physical one, if it is attached to one) can be either committed or rolled back.
As you see, an attached DisconnectedState is a transactional cache allowing you to implement long-running transactions by caching the changes they make in memory. This feature distinguishes DataObjects.Net from most ORM tools: for them such behavior is the default. On the other hand, they do not implement the default behavior of DataObjects.Net at all, although I think it is required more frequently nowadays:
  • Web applications almost never need DisconnectedState at all.
  • The same is true for web services - they are stateless.
The default behavior of DataObjects.Net implies you always "see" the same thing you would see without it. No unexpected caching. We allow you to get any desirable caching logic in a Session, but only by explicit request.

DisconnectedState can be attached to / detached from a Session (any Session!) at any moment between its logical transactions; moreover, it is fully serializable.
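
To make the intended usage more concrete, here is a minimal sketch. The member names used below (Attach, ApplyChanges), the entity and the session variable are assumptions based on the description above, not necessarily the final API:

var state = new DisconnectedState();
using (state.Attach(session))           // the Session stops auto-refreshing and auto-flushing
using (var tx = Transaction.Open()) {   // logical transaction; hits the database only if needed
  var customer = Query.Single<Customer>(customerKey);  // fetched data is kept in the state
  customer.Name = "New name";                          // the change is cached in memory
  tx.Complete();
}
// ... possibly serialize the state, ship it to a client and back ...
state.ApplyChanges();                   // changes are flushed with optimistic version checks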

When can DisconnectedState be used?
  • As a local cache in WPF client applications. After getting this implemented, we can honestly say we fully support WPF.
  • As a long-running transaction state cache in ASP.NET applications. So it can be useful e.g. on wizard pages.
How is it implemented internally?
  • It consists of DisconnectedState itself, as well as its own descendant of ChainingSessionHandler. When DisconnectedState is attached to a Session, it replaces the Session's SessionHandler with its own. So in fact, all the dirty work is done at the SessionHandler level.
  • This implies we've added an API allowing you to temporarily replace the SessionHandler of any Session. This API is available via CoreServicesAncestor - there is a ChangeSessionHandler method there now. This API is secure: it can be utilized only by SessionBound types from assemblies registered in the Domain the current Session belongs to.
  • Finally, we've extended SessionHandler with a set of methods necessary to implement DisconnectedState. In fact, we gathered all the interception points we need there.
This means DataObjects.Net now has a public API allowing you to dramatically change its persistence behavior at runtime. In particular, this API will help us provide global cache support (e.g. Velocity) in the near future.

2. Über-batching

It's actually quite simple: we have already implemented almost everything I described in my previous post. The only part left unimplemented is parallel batch execution, although you may notice we have already implemented AsyncProcessor in Xtensive.Core.Threading.

So the age of über-über-batching is near :)

New batching is already integrated into the latest nightly build. Shortly we'll publish v4.0.6, which will deliver it along with a few other features described here.

3. Future queries

This is one of the features of generalized batching, so it is also already implemented. Working code from our unit tests:

Future Queries in DataObjects.Net 4
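
The original code screenshot isn't reproduced here, so below is only a rough sketch of the idea; the ExecuteFutureScalar/ExecuteFuture method names and result types are assumptions, not necessarily the actual API:

// Both queries are only scheduled here - no roundtrip happens yet.
var customerCount = Query.ExecuteFutureScalar(() => Query<Customer>.All.Count());
var topCustomers  = Query.ExecuteFuture(() => Query<Customer>.All.Take(10));

// Touching the first result sends a single batch containing both queries.
Console.WriteLine(customerCount.Value);
foreach (var customer in topCustomers)
  Console.WriteLine(customer.Name);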

So this will be available in v4.0.6. Btw, have you noticed all future queries are compiled ones? :)

4. Object versions

As many of you know, DataObjects.Net v1.X-3.X exposed "unified" Id and VersionId properties. The first one played the role of a "unified" key (Int64), and the second one - of a "unified" version (Int32). And this approach worked - of course, as long as a custom key or version was not necessary.

DataObjects.Net v4.0 provides a new Key concept:
  • Key structure is defined on the root type of each hierarchy. Or, better: a hierarchy is defined by its root type, and such a root type must explicitly define the key structure for the whole hierarchy. The HierarchyInfo, TypeInfo and KeyInfo types from Xtensive.Storage.Model describe these concepts at runtime.
  • Any Entity exposes a Key property of Key type. The persistent properties composing the key are exposed as well; we require them to be read-only.
  • A Key is assigned to each Entity when it is created, and it cannot be changed afterwards. By default we provide two protected constructors for each Entity, allowing you to assign either a manually specified key or an automatically generated one. Both ways can be used together - you simply provide two or more constructors. Moreover, we provide a secure API allowing you to create an entity of any type with a manually specified key (PersistentAccessor).
  • You can extract the key value (Tuple) from each Key via its Value property.
  • A Key can be created from its value (Tuple) and TypeInfo using the Key.Create methods.
  • Keys can be compared for equality. And this is really fast: they cache their hash codes; moreover, if a hierarchy includes more than one type, we maintain an identity map for its keys (actually, there is a global LRU cache) to ensure we track the known entity types for each key. Thus key comparison for such hierarchies is actually handled as reference comparison.
  • Keys can be resolved via the Query.Single* methods.
  • Keys can be used in LINQ queries (compared, passed as parameters and so on).
  • Keys can be converted to and from string representation - the Key.Format and Key.Parse methods handle this.
  • You can declare persistent properties of Key type. Such fields are stored as strings in the storage; on the other hand, they allow referencing any persistent type.
So as you see, we provide absolutely the same features as before, but for keys of arbitrary types! This is achieved by providing our own unifying adapter (Key) for any such arbitrary key. I think it is one of the important advantages provided by our framework: you can deal with keys without explicitly binding your code to their actual types (queries, comparison and so on). And as far as we know, currently only a few frameworks implement the same concept (a short usage sketch follows the comparison below):
  • NHibernate does not provide any similar concept.
  • ADO.NET Entity Framework provides similar EntityKey type.
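
As mentioned above, here is a short hedged sketch of working with Key; the members come from the list above, but the exact signatures and the entities are assumptions:

var key = customer.Key;                    // every Entity exposes its Key
string text = key.Format();                // to string representation...
Key parsed = Key.Parse(text);              // ...and back
bool same = parsed.Equals(key);            // fast: hash codes are cached

var sameCustomer = Query.Single<Customer>(parsed);   // resolve a Key back to an Entity

// Keys can participate in LINQ queries as well:
var orders = Query<Order>.All.Where(o => o.Customer.Key == key);
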
Ok, that was the story about Key. But how is this related to object versions? Actually... we have implemented versions in an almost identical fashion! Take a look at this beauty:
  • The Entity.GetVersion method, returning the VersionInfo type, provides access to the unified version (current, actual or original).
  • A version is declared the same way as a Key: just mark some persistent property (or properties) with the [Version] attribute. If you mark nothing, we consider the version to be formed from all fields except lazy-load ones. If a hierarchy contains multiple types, it's possible that we won't be able to build the version because some non-lazy-load fields declared in descendants aren't loaded. An exception is thrown on an attempt to get the version of such an instance (i.e. you must ensure every field participating in the version is loaded before it is accessed).
  • VersionInfo objects (actually, structs) can be compared for equality. If you do this repeatedly, it is fast: they cache their hash codes.
  • You can build a VersionInfo from a Tuple, and vice versa.
  • The Query.GetVersions/CheckVersions methods allow you to get/check the stored versions of specified entities. They rely on future queries and batching, so their involvement makes updates with optimistic version checks ~2 times slower than regular ones. Taking into account the fact that our regular updates are ~1.5 times faster than almost anywhere else, this is a very good result ;)
  • EntityState now has an additional flag (IsStale) indicating whether it is stale or not. A stale EntityState implies an optimistic version check must be performed when it is persisted to the storage.
That's it. Obviously, we're planning to use this everywhere: this part of the framework will be used by global caches and sync as well.
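
For illustration, a hedged sketch of how this might look in code; the [Version] attribute and GetVersion come from the description above, while the entity, the other attributes and their exact placement are assumptions:

[HierarchyRoot]
public class Customer : Entity
{
  [Field, Key]
  public int Id { get; private set; }

  [Field, Version]                  // the marked field(s) form the version
  public int RowVersion { get; private set; }

  [Field]
  public string Name { get; set; }
}

// Unified access to the version, independently of its actual structure:
VersionInfo original = customer.GetVersion();
// ... later, possibly with a re-fetched instance ...
bool unchanged = customer.GetVersion().Equals(original);   // fast: hash codes are cached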

5. Prefetch (eager loading) API

This part is in "deep development" now, so I can provide only rough information. We expect it will be more or less complete in the middle of September. The whole API is based on several overloads of the IQueryable.Prefetch extension method.

The Prefetch API is fully based on expressions, so there are no text-based prefetch paths.

Finally, it will be possible to choose an eager load mode for a particular prefetch path (a usage sketch follows the list):
  • Joins (no additional queries, but likely, huge client-server traffic)
  • Joins + future queries (smallest traffic, higher load on RDBMS)
  • Future queries (small traffic, but more additional queries)
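
Since the API is still in development, the following is only a sketch of how expression-based prefetch might look; all names, overloads and entities here are assumptions:

var customers = Query<Customer>.All
  .Where(c => c.IsActive)
  .Prefetch(c => c.Orders)      // prefetch a collection along the chosen path
  .Prefetch(c => c.Manager);    // prefetch a reference

foreach (var customer in customers)
  Console.WriteLine(customer.Orders.Count);   // no additional query per customer here
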
6. Locking

Locking closes one of the gaps we had: there were no Lock-like methods.

Now this part is implemented ideally: there is a group of IQueryable.Lock extension methods allowing you to specify (a usage sketch follows the list):
  • LockMode: Shared, Update or Exclusive
  • LockBehavior: Wait, ThrowIfLocked, Skip.
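
A short usage sketch (the Invoice entity is hypothetical; the overload shape is an assumption based on the list above):

var invoices = Query<Invoice>.All
  .Where(i => !i.IsPaid)
  .Lock(LockMode.Update, LockBehavior.Wait)   // e.g. translates to SELECT ... FOR UPDATE on most servers
  .ToList();
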
This part will be fully working in today's nightly build, so v4.0.6 will include this as well.

7. Oracle support

This looks a bit funny, but... Its provider made us implement many more provider-specific adjustments in comparison to e.g. PostgreSQL. The worst thing is that it does not allow performing a correlated subquery where a nested subquery references a column from a table referred to by a parent statement any number of levels above the subquery; instead, it allows referencing a column only from the table referred to by the immediate parent statement.

A more precise description of this "feature" can be found here. Funny, but they disabled it deliberately - such queries worked well in 10g Release 1 (10.1), but later Oracle developers disabled this. Frankly speaking, I don't remember anyone else doing something similar - the presence of this feature led to no disadvantages at all!

I'm curious, do they really think all the queries sent to Oracle are written & optimized by humans?

Anyway, right now about 150 out of ~1000 tests fail on Oracle, and this happens mainly because of the mentioned issue. Everything else seems to be working. If we are able to work around this issue in the coming days, you will see the Oracle provider in v4.0.7.

That concludes my brief description of our short-term plans. I hope you like the direction and the speed we maintain.

Sunday, August 23, 2009

Generalized batching

In this post I'll describe the statement batching techniques used in DataObjects.Net. I'll describe a bit more than we already have - to give you a full picture of what you will find in the upcoming update.

So what is batching? Batching is a way of reducing chattiness between an RDBMS and its client. Normally the client sends each command individually - the DbCommand.Execute* methods are designed to return a single result immediately; moreover, DbConnection is not a thread-safe object.

So there are some issues. Let's think about what can be done to address them.

1. ADO.NET batching support

Note: we don't use this approach in DO4. I describe it here to provide a full overview of possible approaches.
ADO.NET has built-in batching support, but currently this feature is exposed only as a feature of SqlDataAdapter and OracleDataAdapter. Its underlying implementation is actually generic, but as frequently happens with Microsoft, it is not exposed as a public part of the API. On the other hand, there are workarounds allowing you to utilize it.

ADO.NET batching is implemented at the communication protocol level: you don't have to modify command text (e.g. set unique parameter names) to utilize it. Another important advantage is the ability to get DML execution results with ease.

Its main disadvantages are:
- You must use either an internal API (the SqlCommandSet class), or rely on SqlDataAdapter (which must be slower; moreover, you lose all genericity with it)
- Currently this works only with SQL Server. The same should work for Oracle as well, but it appears no one has even tried to test the similar part of its internal API.

Finally, this approach ensures the sequence of commands executed on SQL Server will be exactly the same as without batching, since ADO.NET batching is implemented at the protocol level, and this implies a few other consequences:
- SQL Server won't try to reuse a cached plan for the whole batch, because there simply is no "whole batch". So it appears this approach is better if we constantly send different batches: SQL Server is simply unable to reuse the plan in that case. Likely, this was very important until the appearance of SQL Server 2005: there, plans are cached both for batches and for individual statements.
- On the other hand, it is worse when we constantly send very similar batches: although SQL Server could reuse their plans, it can't, because there are no real batches.
- There is a bit of extra per-RPC execution cost on the server in this case.
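
For completeness, here is how the public route to ADO.NET batching - SqlDataAdapter.UpdateBatchSize - can be used; DataObjects.Net does not use this approach, and the table and columns below are made up:

// namespaces: System.Data, System.Data.SqlClient
static void BulkInsert(string connectionString, DataTable customers)
{
  using (var connection = new SqlConnection(connectionString))
  using (var adapter = new SqlDataAdapter()) {
    var insert = new SqlCommand("INSERT INTO Customer (Name) VALUES (@Name)", connection);
    insert.Parameters.Add("@Name", SqlDbType.NVarChar, 100, "Name");
    insert.UpdatedRowSource = UpdateRowSource.None;  // required for batched execution
    adapter.InsertCommand = insert;
    adapter.UpdateBatchSize = 25;                    // 0 means "a single batch of unlimited size"
    adapter.Update(customers);                       // added rows are sent to the server in batches
  }
}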

If you're interested in further details, refer to:
- Good description of ADO.NET batching from ADO.NET team (rather old - May 2005)
- Plan caching in SQL Server 2008

2. Manual batching support

That's actually the most obvious solution: combine multiple SQL statements into a single command. DO4 currently relies on this approach.

Its main advantage is compatibility with most databases. Probably there are other advantages (e.g. faster command processing because of plan caching for the whole batch), but they're very database dependent.

Our current tests for SQL Server show significantly better results in comparison to ADO.NET batching; on the other hand, these tests are ideal for batch plan caching. So we'll work further on more realistic tests for this part.

Here are typical DataObjects.Net batch requests (SQL was beautified with a tool):
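
The screenshots aren't reproduced here; as a rough illustration only (not actual DataObjects.Net output - the table, columns and parameter naming are made up), a manually composed batch is just one command carrying several statements with unique parameter names:

// connection is assumed to be an open SqlConnection
using (var command = connection.CreateCommand()) {
  command.CommandText =
    "INSERT INTO Customer (Id, Name) VALUES (@p0_0, @p0_1); " +
    "INSERT INTO Customer (Id, Name) VALUES (@p1_0, @p1_1); " +
    "UPDATE Customer SET Name = @p2_0 WHERE Id = @p2_1;";
  command.Parameters.AddWithValue("@p0_0", 1);
  command.Parameters.AddWithValue("@p0_1", "Alice");
  command.Parameters.AddWithValue("@p1_0", 2);
  command.Parameters.AddWithValue("@p1_1", "Bob");
  command.Parameters.AddWithValue("@p2_0", "Carol");
  command.Parameters.AddWithValue("@p2_1", 3);
  command.ExecuteNonQuery();   // a single roundtrip for all three statements
}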


This way is a bit more complex to implement:
- You must be able to build each individual statement in the batch quickly. Taking into account that parameter names must be unique there, this becomes a bit tricky. In our case this part is handled solely by SQL DOM: it caches generated SQL for DOM branches, so in fact only the first compilation of each SQL DOM request model is complex (actually it is pretty cheap anyway), but subsequent compilations of a request with minor modifications of its model are extremely fast.
- You must decide how to deal with errors. I'll describe how we deal with them in further posts.
- Some databases (e.g. Oracle) require special constructs to be used for batches ("BEGIN" ... "END").
- If you batch not just CUD statements, everything becomes even more complex: although you can access each subsequent query result with the DbDataReader.NextResult method, you must decide how to pass it to the code requiring it. In our case this makes everything more complex, since we initially relied on MARS, but with batching you can use MARS for the last batch query result only (the others must be processed before it).

Generalized batching

Until now we were discussing just CUD (create, update, delete) sequence batching. Some ORM tools on the market implement it, relying either on the first (NHibernate) or on the second (DataObjects.Net) approach.

But is this all batching can be used for? No. Let's start with a very simple example:

var customer1 = new Customer() {...}
var customer2 = new Customer() {...}
foreach (var entity in Query.All.Where(...)) ...

As you see, there are two CUD operations and a single query. And all these operations can be executed as a single batch. But AFAIK, none of the ORM tools implements even this simple case. Why? Actually, because their API makes this not so obvious for ORM developers:
- Normally there is a method flushing all cached changes. In the case of DataObjects.Net this method is Session.Persist().
- This method guarantees all the changes are flushed on its successful completion.
- The query API relies on such a method to ensure all the changes are flushed before query execution.

As you see, such logic makes this really less obvious: the persist and query APIs are completely different, although in this case they must be partially integrated.

Let's think about how this could work with a better API:
- Session.Persist(PersistOption.QueueOnly) ensures all the current changes are queued in the low-level persist/query pipeline.
- This method is called before execution of any query.
- When a query is executed, it is actually appended as the last statement of the last queued batch, if this is possible.

So completion of Session.Persist(PersistOption.QueueOnly) means the ORM framework has promised to persist all the changes before the next query (or transaction commit), instead of actually guaranteeing all the changes are already persisted.
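
A hedged sketch of how this flow might look; PersistOption.QueueOnly is taken from the text above, while the entities and the Query<T>.All usage are illustrative:

var customer1 = new Customer { Name = "Alice" };   // CUD operations are only registered
var customer2 = new Customer { Name = "Bob" };

// In the proposed design the framework itself calls
// Session.Persist(PersistOption.QueueOnly) before running the next query,
// so the two INSERTs and the SELECT below travel in a single batch:
var activeCount = Query<Customer>.All.Count(c => c.IsActive);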

That's what generalized batching is: batch everything that is possible to batch, not just CUD operations.

It's interesting to calculate the effect of generalized batching from the point of view of chattiness (number of network roundtrips):
- No batching: 1 (open transaction) + CudOperationCount + QueryCount + 1 (commit).
- CUD batching: 1 (open transaction) + CudBatchCount (~= CudOperationCount/3...25) + QueryCount + 1 (commit).
- Generalized batching: 1 (open transaction) + SeparateCudBatchCount (~= 0!) + QueryCount + 1 (commit).

Huge effect.

* SeparateCudBatchCount is the count of CUD batch parts without queries. Since batch size is limited, they'll appear when the number of CUD operations between queries is large enough (~ >25). Likely, it will be close to zero in most real-life cases.
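
For example, with 50 CUD operations interleaved with 10 queries and a batch size of 25, this gives roughly 1 + 50 + 10 + 1 = 62 roundtrips without batching, 1 + 2 + 10 + 1 = 14 with CUD batching, and about 1 + 0 + 10 + 1 = 12 with generalized batching.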

Generalized batching and transaction boundaries

As you see, it's possible to eliminate at least one more database roundtrip: opening a transaction. In fact, this command could be sent as part of the first batch. Initially this looks attractive...

On the other hand, this makes it impossible to use DbTransaction objects, as well as to integrate with System.Transactions. Moreover, the duration of a single transaction is normally at least ~1/100...1/200 of a second, while a single network roundtrip is ~1/10000 of a second.

So it looks like it's better to forget about this optimization.

Generalized batching and future queries

Some ORM frameworks implement future queries (e.g. NHibernate):
- When a future query is scheduled for execution, the ORM guarantees it will be executed later, and will ensure its result is available on an attempt to get it (enumerate it or get its scalar value).
- This allows the ORM to execute such a query as part of a single batch containing other similar queries. So there will be a single network roundtrip instead of multiple ones.

Future queries can be integrated with generalized batching very well. Let's look at the content of a typical batch in this case. There will be:
- Optionally, CUD operations
- Optionally, future queries
- The query that led to batch execution (either a future query or a regular one).

And the total count of roundtrips is:
- Generalized batching w. future queries: 1 (open transaction) + SeparateCudBatchCount (~= 0!) + NonFutureQueryCount + 1 (commit).

But which queries can normally be executed as future ones? In general, all the queries that aren't "separated" by CUD operations can be executed as future queries!

So if you're mainly reading the data, future queries are right for you. Generalized batching makes them quite useful in data modification scenarios as well.

Parallel batch execution

Earlier you've seen I proposed using Session.Persist(PersistOption.QueueOnly) - it ensures all the current changes are queued in the low-level persist/query pipeline. Obviously, it may not just prepare the batches for later execution, but also start a background thread executing "completed" batches from this queue.

E.g. if we created 110 objects and the batch size is 25, invocation of this method must lead to the creation of 5 batches. The first 4 of them are complete, but the last one, containing 10 statements, is not. If parallel batch execution is implemented, we start executing the first batch from our queue in the background immediately, and later - the next ones.
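
A rough sketch of the idea (not DataObjects.Net internals; Batch and connection are assumed abstractions): completed batches go into a queue drained by a single background worker, while the user thread keeps building the next batch.

private readonly object sync = new object();
private readonly Queue<Batch> completedBatches = new Queue<Batch>();
private bool workerRunning;
private Exception backgroundError;   // surfaced to the user thread on the next persist/query

public void EnqueueCompleted(Batch batch)
{
  lock (sync) {
    completedBatches.Enqueue(batch);
    if (workerRunning)
      return;
    workerRunning = true;
  }
  ThreadPool.QueueUserWorkItem(_ => ExecutePending());
}

private void ExecutePending()
{
  while (true) {
    Batch batch;
    lock (sync) {
      if (completedBatches.Count == 0) {
        workerRunning = false;
        return;
      }
      batch = completedBatches.Dequeue();
    }
    try {
      batch.Execute(connection);     // only this worker touches the DbConnection
    }
    catch (Exception e) {
      lock (sync) {
        backgroundError = e;         // re-thrown on the user thread later
        completedBatches.Clear();
        workerRunning = false;
      }
      return;
    }
  }
}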

Obviously, this implies lots of complexities. Some of them are:
- Synchronization on the queue and the underlying DbConnection
- Careful error reporting: if an error occurs during background batch processing, we must ensure it is passed to the user thread.

On the other hand, all of them can be resolved.

The main benefit of this approach is that the user thread won't wait while its changes are persisted. It can work on producing the next batch. And I think this is important: it may significantly reduce the duration of transactions in bulk operations. As you know, the presence of an ORM makes them noticeably longer (e.g. without this feature the time we spend on building the batches increases it; but if the batches are executed in parallel, we simply "flood" the database with work). Parallel batch execution in combination with generalized batching should bring nearly the performance of SqlBulkCopy to ORM-based solutions.

Hopefully, I'll be able to prove this with tests later. I'm pretty optimistic here, since v4.0.5 with its CUD-only batching seems to be just 1.5 times slower than SqlBulkCopy.

Automatic background flushes

Currently DataObjects.Net automatically calls Session.Persist() when the number of modified, but not yet persisted objects becomes greater than 500. We'll replace this with Session.Persist(PersistOption.QueueOnly), and will likely decrease the flush limit to ~100-200.

So in general it won't be necessary to manually call Session.Persist at all. Automatic background flushes will be a default feature.

Batching in DataObjects.Net

As I've mentioned, the publicly available v4.0.5 implements just CUD sequence batching by manually composing commands of multiple statements. This works pretty well - tests confirm this (I'm showing the results here, since we decided not to publish DataObjects.Net scores @ ORMBattle.NET until it is accepted by the community):

DataObjects.Net (batch size is 25):
Create Multiple: 20105 op/s
Update Multiple: 35018 op/s
Remove Multiple: 42542 op/s

NHibernate (ADO.NET batching is on, batch size is 25):
Create Multiple: 12617 op/s
Update Multiple: 17148 op/s
Remove Multiple: 18510 op/s

Now we're working on everything else described in this post. I hope to show a complete solution by the middle of September. Such a solution is even more attractive for DataObjects.Net: as you know, it is designed to completely abstract the database. I'd like to mention just one example where such bulk operation performance is necessary: our generic schema upgrade API. Renaming and copying of fields and classes are among its very basic features; beyond that, you can implement custom upgrades. And we did everything to let you use the same ORM API even here: we expose old structures as "recycled" persistent types accessible during a special Domain.Build phase along with the new ones, so you can copy/transform their data the same way you would at runtime, keeping your upgrade routines fully database independent. And as we know from practice, performance here is of huge importance.

ORM performance: what's next?

Next time I'll discuss materialization in DataObjects.Net - we really did a lot to compete with the materialization of simple entities in other frameworks (e.g. EF). Finally, I'll describe what you can expect from us next: parallelized materialization.

Wednesday, August 19, 2009

Issues with DataObjects.Net v4.0.5 on Windows 7

We discovered the v4.0.5 installer suffers from the following issues on Windows 7:
- Shows a "publisher unknown" warning (just ignore it)
- Can't install assemblies into the GAC if you aren't logged in as a local Administrator. I.e. if you are logged in as a user from the "Domain Admins" group, it won't work anyway. So it's necessary to log in as a user from the local "Administrators" group to install it.

Saturday, August 15, 2009

DataObjects.Net v4.0.5 is out

Full release announcement is published in our primary blog.

As usual, you can download it here.

Wednesday, August 12, 2009

ORMBattle.NET is unofficially launched

Check it out: http://ormbattle.net

ORMBattle.NET is devoted to direct ORM comparison. We compare the quality of essential features of well-known ORM products for the .NET framework, so this web site might help you to:
  • Compare the performance of your own solution (based on a particular ORM listed there) with the peak performance that can be reached on this ORM, and thus, likely, improve it.
  • Choose the ORM for your next project taking its performance and LINQ implementation quality into account.
The story behind ORMBattle.NET is published here.

P.S. DataObjects.Net v4.0.5, which is tested there, is the most current version we have. It will be released tomorrow or the day after - along with the official launch of ORMBattle.NET.

Saturday, August 01, 2009

DataContext vs Query.All

There is an interesting architectural question: why don't we use the well-known DataContext concept, and why don't we provide any replacement for it?

1. DataContext brings serious disadvantages for extensibility / component-based frameworks

DO4 is designed for extensibility. We assume the whole application consists of a set of relatively loosely coupled components (modules, assemblies) containing their own persistent types.

Imagine you build a security module, and you want to run a query returning all the Users that belong to a particular Role (the role variable below). DO4 allows you to write it as:

from u in Query<User>.All
where u.Roles.Contains(role)
select u;

Now let's imagine how this might look with DataContext:

from u in dataContext.Users
where u.Roles.Contains(role)
select u;

Seems good as well, yes? But there are two problems:
1) Your security module knows nothing about the application it is used in, and thus about the particular type of its DataContext.
2) You must pass a particular instance of DataContext (~= pass a Session) to the place where this query is executed. Well, let's agree this issue is easy to solve - we simply pass it directly to any place where queries might run.

Ok, let's try to solve the problem. Imagine we declared ISecurityDataContext in our security module, exposing Users (IQueryable<User>) and Roles, and implemented this interface in the application's DataContext.

So now our query might look like this:

from u in securityDataContext.Users
where u.Roles.Contains(role)
select u;

It looks like it should work, yes? It will work for exactly this case. But it won't work for a slightly more complex one, like this:

from u in securityDataContext.Users
where u.Roles.Contains(securityDataContext.Roles.Single(role))
select u;

What's changed? Well, I just referred to our securityDataContext inside the expression in the Where clause. It won't work, because almost any LINQ translator recognizes only the IQueryable expressions it "knows", so here it will get stuck on the securityDataContext.Roles call. From the translator's point of view, it is an unknown Roles member of an unknown ISecurityDataContext type!

AFAIK, no one except DO4 can handle this now. But we do: see issue #333. Btw, we implemented this mainly to be fully compatible with the DataContext (repository) pattern, although our own approach is more generic.

Summary: developing loosely coupled components becomes much more complex when you have a DataContext. Of course you can work around all the mentioned issues, but it will be painful. In our case you simply do what you want.

2. A particular DataContext is bound to a particular session

Thus you can't, for example, put an IQueryable you want to execute or modify later into a static variable, since it is bound to the original DataContext instance. Its subsequent execution will eventually fail.

And, as mentioned, you must take care of passing the DataContext instance everywhere queries might be executed.

Ok, how does DO4 resolve these issues?

1. It provides the Query<T>.All static member instead of DataContext.SomeType members.

So:
1) We can reference any persistent type this way
2) Our queries aren't bound to a particular session. Instead, we resolve it on each execution using the Session.Demand() method (or the Current property), relying on our context-scope pattern.

2. We use the context-scope pattern and PostSharp aspects to maintain the "correct" Session.Current everywhere.

You don't have to pass the current Session everywhere, because Session.Current is automatically set for the duration of any public method call on a SessionBound descendant - this is done by our aspects, and it is really cheap.

So how do you implement a DataContext-like repository for DO4?

Just add properties like this one to your own Repository-like type:

public IQueryable<Customer> Customers {
  get { return Query<Customer>.All; }
}

Such properties, as well as the whole type, can be static. That's it ;)