News, examples, tips, ideas and plans.
Thoughts around ORM, .NET and SQL databases.

Friday, June 05, 2009

DO4 vs its key competitors (EF, NHibernate, LLBLGen, Genom-e) - the first post in an upcoming long series

I'm going to publish a set of posts here comparing various EF and NH concepts to the corresponding parts of DO. Obviously, this isn't a short-term task, so for now I'm publishing just a starting article. I'll try to be fully objective, showing the pros and cons of each approach.

I hope to hear some criticism in the comments ;)

Let's go:

Difference #1: Entity and its state: are they all the same object, or different ones?

From the point of view of many pure ORM frameworks, including EF and NHibernate, they're the same object. Let's call this scenario Case 1.

From our point there is a set of different objects (Case 2):
- Entity. It acts like an adapter providing a high-level (i.e. materialized, object-oriented) representation of its own EntityState. It's a lightweight object containing mainly just its State.
- EntityState. A lightweight object binding together the actual state data (a DifferentialTuple exposed via its State property), the Entity it belongs to, and the Transaction from which this state originates.
It's a descendant of TransactionalStateContainer(of DifferentialTuple), which actually implements all the state invalidation logic.
- DifferentialTuple - the low-level state representation. Again, a lightweight object aggregating two others: Difference and Origin. Both are Tuples. Origin describes the original (fetched) state. Difference describes all the changes made to it. DifferentialTuple exposes the field updates stored in Difference as if they were applied to Origin - i.e. a DifferentialTuple is a Tuple as well.
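To make the Difference/Origin layering concrete, here is a minimal sketch in Python (DO4 itself is .NET; the class and method names here are illustrative, not the real API):

```python
class DifferentialTuple:
    """Exposes Origin with Difference overlaid: reads fall through to
    Origin, writes go only to Difference, so the fetched state stays
    intact and the change set is always known."""

    def __init__(self, origin):
        self.origin = origin      # fetched field values, never modified
        self.difference = {}      # field index -> updated value

    def get_value(self, field):
        if field in self.difference:
            return self.difference[field]
        return self.origin[field]

    def set_value(self, field, value):
        self.difference[field] = value

    def changed_fields(self):
        # Change tracking comes for free: Difference *is* the change set.
        return sorted(self.difference)
```

For example, after `dt = DifferentialTuple(["Alice", 30]); dt.set_value(1, 31)`, reading field 1 returns 31, while `dt.origin` still holds the fetched value 30.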

It looks like I should provide a short description of our Tuples framework:

Tuples are lightweight objects providing access to their typed fields by index. Conceptually they're very similar to the Tuples in the .NET 4.0 BCL, but there are quite important differences:
- Our tuples aren't structs - they're classes, and they're dynamically typed. E.g. there can be many different tuple types exposing the same structure as a .NET 4.0 Tuple, not just one. Moreover, we can write code working with generally any type of Tuple, not just a specified one - the methods providing access to their internals are virtual. So in this respect our Tuples are closer to a List with a specified type for each item than to .NET 4.0 tuples; conversely, .NET 4.0 Tuples are quite similar to our Pair and Triplet (we haven't implemented Quintet just because we didn't need it ;) ).
- There are generally 2 kinds of tuples: RegularTuples and TransformedTuples (although we don't put any restrictions here). Tuples of the first kind are actual data containers. Tuples of the second kind are lightweight data transformers: they don't store the data themselves, but transform them from their sources on demand. We use them in RSE (e.g. to join two record sets, cut out something and so on - certainly, limiting the depth of such transformation chain) and during the materialization.
- Our Tuples always have a TupleDescriptor and maintain nullability and availability flags for each field. So they're designed for RDBMS-specific calculations.
- Actual types for RegularTuples are generated on demand at runtime. We don't use generics, which lets us "compress" fields such as booleans: any boolean value, including the availability and nullability flags, takes exactly one bit.
- Tuples are fast - nearly as fast as List(of T).
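The per-field availability and nullability flags can be sketched as bit masks over a plain list (a conceptual model only; as noted above, DO4 actually generates concrete tuple types at runtime rather than using a generic container):

```python
class FlaggedTuple:
    """A dynamically typed tuple: fields are accessed by index, and each
    field carries an availability flag (loaded or not) and a nullability
    flag, both packed into integer bit masks."""

    def __init__(self, field_count):
        self._values = [None] * field_count
        self._available = 0   # bit i set => field i has been loaded
        self._null = 0        # bit i set => field i is NULL

    def set_value(self, i, value):
        self._values[i] = value
        self._available |= 1 << i
        if value is None:
            self._null |= 1 << i
        else:
            self._null &= ~(1 << i)

    def is_available(self, i):
        return bool(self._available & (1 << i))

    def get_value(self, i):
        if not self.is_available(i):
            raise ValueError(f"Field {i} is not loaded yet")
        return self._values[i]
```

The availability bit is what makes lazy loading of simple fields (P1.3 below) natural: an unloaded field is distinguishable from a loaded NULL.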

Ok, so now we can return to the entity and its state. Let's take a closer look at the pros and cons of our approach.

From here on I'll use the term "persistent field", although normally these are exposed as properties. A "persistent field" is a property whose value must be persisted. I use "field" here mainly because it's closer to the nature of such properties.


Cons (by importance):

C1. Slower materialization. As you see, we must instantiate several lightweight objects (Entity, EntityState and DifferentialTuple) compared to a single Entity in e.g. EF. But in fact the difference isn't so dramatic: in addition to the Entity, we usually must materialize its Key as well. This implies at least one dictionary lookup (roughly equal in cost to 5 memory allocations) and the creation of 1-2 objects. So in general, this may decrease materialization speed by about a factor of two.

On the other hand, Case 2 allows re-using sessions without post-transaction cleanups (see P3), in which case many of these "additional" lightweight objects (Entity, EntityState and DifferentialTuple) will simply be re-used by subsequent transactions.

C2. Slower property reads - for the same reasons. This is much less important, though, since read-intensive code spends most of its time fetching data from readers and materializing it.

C3. Slower property writes - again, for the same reasons. This is even less important, since persist speed is limited to roughly 10-30K entities/sec. on current hardware. Moreover, this point is quite arguable: any ORM with a change tracking service does nearly the same job in this case.

C4. Additional levels of abstraction - obviously, this isn't good if it turns out they aren't really necessary.


Pros:

P1. The most frequently needed features are bundled into the framework. In particular:

P1.1. Change tracking - DifferentialTuple handles this perfectly. The original state is always available, and there is no need to capture it separately, as is required in Case 1 (note: when this is implemented in Case 1, it affects either materialization or update speed). The same goes for differences - we know exactly which fields are updated. Moreover, you don't have to implement any change tracking logic yourself: when you call a protected SetField-like method, changes are tracked automatically.

P1.2. Lazy loading of reference fields: if we have a field of an Entity type in Case 1, we have two options when materializing an entity with such a field: either materialize the referenced Entity as well, or set it to null / do nothing. The first option isn't good because we'd most likely have to fetch that entity (which is quite slow, since it requires a DB roundtrip). The second implies we'll either show null instead of the actual value, which obviously isn't good, or we must put something like Lazy behind this field and capture its key into some other field(s) on materialization. But such a Lazy costs 8 bytes (4 bytes for a boolean flag and 4 for the reference) in addition to the key field(s). Moreover, we must then ensure they're kept in sync. Finally, we could maintain just the key field(s) and use something like this.DataContext.Resolve(customerID). But what have we just done? Exactly: we've converted the low-level data representation into the high-level one.
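The "key field plus on-demand resolve" pattern this paragraph ends with can be sketched as follows (all names here are hypothetical; `resolve` stands in for whatever identity-map lookup a session provides, and a real implementation would hit the database on a cache miss):

```python
class Session:
    """A toy identity map standing in for a real session/DataContext."""

    def __init__(self):
        self._cache = {}

    def register(self, key, entity):
        self._cache[key] = entity

    def resolve(self, key):
        # A real session would fetch from the database on a miss;
        # this sketch only serves what was registered.
        return self._cache[key]


class Order:
    """Materialization stores only the referenced entity's key;
    the reference itself is resolved lazily, on first access."""

    def __init__(self, session, customer_key):
        self._session = session
        self._customer_key = customer_key   # no Customer fetched here

    @property
    def customer(self):
        return self._session.resolve(self._customer_key)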

P1.3. Lazy loading of simple fields: availability flags make it easy to load non-reference fields on demand. In Case 1, though, this is a really ugly and difficult problem.

P1.4. Distinguishing between "assign field value" and "materialize field value" operations: in many cases it's important to know whether a property setter is invoked by the materializer. E.g. if there is value validation logic, it must be skipped during materialization. In Case 1 you must check this explicitly (btw, is this possible in EF? If so, how?); in Case 2 you don't have this problem at all, because we "materialize" the state itself. An Entity is materialized by invoking a special protected constructor taking its state as its only argument (such constructors are automatically provided by one of our aspects).
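The distinction can be sketched as two entry points, only one of which runs validation (illustrative only; in DO4 the materializing path is a protected constructor provided by an aspect, not a Python idiom):

```python
class Person:
    def __init__(self, state):
        # "Materialize" path: the state is taken as-is, validation is
        # skipped because the data came from the database.
        self._state = state

    @property
    def age(self):
        return self._state["age"]

    @age.setter
    def age(self, value):
        # "Assign" path: user code always goes through validation.
        if value < 0:
            raise ValueError("age must be non-negative")
        self._state["age"] = value
```

Materializing `Person({"age": -5})` succeeds even though the value would fail validation, while assigning `p.age = -5` from user code raises.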

So in general, all the P1.X "pros" simplify the developer's life. They affect performance as well, but from my point of view that is much less important.

P2. Absence of strong references between Entities. This allows us to use such Session caching policies as weak-reference-based caching. If there are relatively long transactions processing lots of data, but with a relatively small working set, they can easily survive in Case 2, and will die with an OutOfMemoryException in Case 1. E.g. reading ~ 1M two-field objects is enough to make EF die with this exception on a PC with 2GB RAM!

Btw, this doesn't mean we can't use a Dictionary-based caching policy. Just specify that you need InfiniteCache in SessionConfiguration. Currently we use a chain of LruCache + WeakCache by default (but this is still subject to change).
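A weak-reference-based session cache can be sketched with Python's `weakref` module (conceptual only; DO4's actual default, as mentioned, is an LruCache + WeakCache chain):

```python
import weakref


class WeakSessionCache:
    """Entities stay cached only while something else still references
    them, so a long transaction's cold entities can be collected
    instead of exhausting memory."""

    def __init__(self):
        self._entries = weakref.WeakValueDictionary()

    def add(self, key, entity):
        self._entries[key] = entity

    def try_get(self, key):
        # Returns None if the entity was never cached or has been
        # garbage-collected since.
        return self._entries.get(key)
```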

P3. Zero cost of invalidating the state on crossing a transaction boundary. In our case such state invalidation happens on the fly, when the EntityState notices that this.Transaction != this.Session.Transaction. Case 1 implies a real field-by-field cleanup with ~ linear cost. This is important if you run lots of BLL on an application server and reuse Sessions and Entities.
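The on-the-fly check can be sketched as a single comparison performed on state access (names mirror the `this.Transaction != this.Session.Transaction` condition above, but this is a Python sketch, not DO4's C# code):

```python
class Session:
    def __init__(self):
        self.transaction = object()   # a fresh token per transaction

    def begin_new_transaction(self):
        self.transaction = object()


class EntityState:
    def __init__(self, session, data):
        self._session = session
        self._data = data
        self._transaction = session.transaction  # where this state originated

    def get_data(self):
        # O(1) invalidation: a staleness check replaces any
        # field-by-field post-transaction cleanup.
        if self._transaction is not self._session.transaction:
            self._data = None                    # stale; must be re-fetched
            self._transaction = self._session.transaction
        return self._data
```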

P4. Perfect for global caching. Imagine we have a global (i.e. Domain-level) cache storing the Origins of the mentioned DifferentialTuples, i.e. fetched Entity states. We never modify the Origin - did I mention this? And the fact that Tuples support concurrent reads? Ok, this means:
- We need almost nothing to materialize the Entity in this case: we'll create EntityState and Entity without any field-by-field copying.
- If there are a cached EntityState & Entity in the Session (the chances of this are also high enough), we'll just set the Origin!
- We can easily add an "originates from global cache" mark to our EntityState, whose presence will imply a later optimistic version check on update. Case 1 requires this mark to be stored externally.
- Entities from different Sessions will share the same state objects. So this approach will probably allow us to make the amount of RAM needed in Case 2 lower than in Case 1.
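The key point - that materializing from the global cache copies nothing - can be sketched like this (illustrative names; the materialized state is shown as a plain dict with `origin`/`difference` slots):

```python
class GlobalCache:
    """A Domain-level cache of fetched, immutable origin tuples."""

    def __init__(self):
        self._origins = {}   # key -> origin tuple (never modified)

    def put(self, key, origin):
        self._origins[key] = tuple(origin)   # freeze the fetched state

    def materialize(self, key):
        origin = self._origins.get(key)
        if origin is None:
            return None
        # No field-by-field copy: every session's state shares the very
        # same origin object; each session's writes go to its own
        # difference dict.
        return {"origin": origin, "difference": {}}
```

Because the origin is immutable and shared, two sessions materializing the same key hold the same origin object in memory, which is what can make Case 2 cheaper in RAM than Case 1.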

Why did I write about caching? Because it is really important (even a distributed cache hit is much cheaper than a database roundtrip - not to mention a local in-memory cache), and thus planned. This time we're going to provide a global cache API allowing you to decide what to cache and how to cache it:
- We'll allow caching anything (including query results), as well as fetching anything directly from the global cache without any database-level version checks (but this will lead to an optimistic version check on update of such an entity).
- The global cache API will be ready for Velocity.


Ok, that's enough for today. Obviously, from my point of view Case 2 is better. I think this is a distinguishing element of the Rich Domain Model - as you might know, that's what we like. On the other hand, Case 1 with its simple Entity design pushes you toward an Anemic Domain Model (an anti-pattern per Martin Fowler) - everything is clear there, but you must do even very simple things on your own. Try to implement a somewhat more complex application over such a DAL (50-100 persistent types), and you're almost in our camp. You need all of P1 everywhere, while hating to code it the same way each time - Ctrl-C, Ctrl-V, Ctrl-C, Ctrl-V, Ctrl-C, Ctrl-V, Ctrl-C, Ctrl-V...

Code duplication is probably what I hate the most. Especially when something I use pushes me to do it many, many times.

A few nice links to end with:
- Anemic vs Rich Domain Models: did you know that people from the Java camp prefer Rich Models? In any case, this is a short post you can start your investigation from.
- Entity Framework as an OR/M: a good article from Genom-e's authors. A bit dated given the upcoming EF 4.0 (note how fast they reached 4.0 ;) ), but still very nice.

P.S. Shortly we'll publish some results of our LINQ implementation comparison. You'll laugh: full LINQ support is a myth. Even EF fails on really simple cases - not to mention the others. Except DO4, obviously ;)

1 comment:

  1. Alex,
    I'm trying to use NHibernate after several years of working with Sooda ORM and totally agree with you - NH forces me to move any higher-order logic out of my data model, basically the NH entities are good only for transferring data to & from database. Apart from that, accessing the database with NH is painful in many places (querying for example). Hope EF is not that anemic...

    ReplyDelete