Performance – effect of various features

2 years ago I made a post about performance/benchmarking, and the fact that some groups like a simple black-and-white “X is better than Y” verdict (and believe there is only one measure of performance, so whatever object graphs are used the result will always be the same). The evidence is that they are wrong. Needless to say there will always be groups that don’t share our philosophy, or don’t have time to do a complete analysis (though they publish their results knowing that they are incomplete and likely invalid; after all, it’s not their software that they may be presenting in an unfair light). Recently we had another performance exercise. This came to the conclusion “Hibernate is better than DataNucleus, and you should really just get ObjectDB”. So we’re back in the territory of black and white. Yes, an OODBMS ought to be way faster than an RDBMS, particularly when the RDBMS has a persistence layer in front of it (and you have to pay for the OODBMS besides), but that is not the subject of this post. We’ll concentrate on the first part of that conclusion.

There is nothing to add to the previous blog post in terms of correctness; we stand by all of it and nothing has been demonstrated to the contrary. This blog post simply takes the recent exercise's sample and demonstrates how enabling/disabling certain features has a major impact on (DataNucleus) performance. The author of that exercise presented results showing that JDO and JPA with DataNucleus were on a par with each other in terms of performance, but slower than Hibernate for INSERTs (by anything between 1.5 and 2 times) and on a par for SELECTs (some faster, some slower, but more or less the same). Since JDO and JPA are shown to be equivalent, we’ll just run the exercise with JDO here, but the same is easily demonstrable using JPA (because in DataNucleus you have full control over all persistence properties and features regardless of API).

The sample data used in this exercise consists of 3 classes. Student has a (1-N unidirectional) List of Credit and a (1-1 unidirectional) Thesis. We persist 100000 Students, each with 1 Credit and 1 Thesis. So that’s 300000 objects to be inserted, and then 100000 Students queried.

The INSERT is as follows

try
{
    pm.currentTransaction().begin();
    for (int x = 0; x < 100000; x++)
    {
        Student student = new Student();
        Thesis thesis = new Thesis();
        thesis.setComplete(true);
        student.setThesis(thesis);
        List credits = new ArrayList();
        Credit credit = new Credit();
        credits.add(credit);
        student.setCredits(credits);
        pm.makePersistent(student);
    }
    pm.currentTransaction().commit();
}
finally
{
    pm.close();
}
and the SELECT is as follows
try
{
    Query q = pm.newQuery(
        "select from " + Student.class.getName() +
        " where thesis.complete == true && credits.size()==1");
    Collection result = (Collection) q.execute();
    // loop through results, so we know they're loaded
}
finally
{
    pm.close();
}

So we’ll run this (against an H2 database, on a Core i5 64-bit PC running Linux with 4GB RAM) and vary our persistence properties to see the effect.

Original persistence properties (from original author)

optimistic=true, L2 cache=true, persistenceByReachabilityAtCommit=false, detachAllOnCommit=false, detachOnClose=false, manageRelationships=false, connectionPooling=builtin
INSERT = 120s, SELECT = 6.5s
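
For reference, those settings map onto DataNucleus persistence properties along these lines (this is just a sketch; the exact property names and values should be checked against the DataNucleus persistence properties documentation):

datanucleus.Optimistic=true
datanucleus.cache.level2.type=soft
datanucleus.persistenceByReachabilityAtCommit=false
datanucleus.DetachAllOnCommit=false
datanucleus.DetachOnClose=false
datanucleus.manageRelationships=false
datanucleus.connectionPoolingType=DBCP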

Disabled L2 cache

Since we’re persisting huge numbers of objects and it takes time to cache them, and in the original author's case Hibernate had no L2 cache enabled, let's turn the L2 cache off. So we now have
optimistic=true, L2 cache=false, persistenceByReachabilityAtCommit=false, detachAllOnCommit=false, detachOnClose=false, manageRelationships=false, connectionPooling=builtin
INSERT = 106s, SELECT = 4.0s
Why the improvement? Because objects didn’t need caching, so DataNucleus didn’t need to generate the cacheable form of those 300000 objects on INSERT, and 100000 objects on SELECT.

Disabled Optimistic Locking

Now instead of using optimistic locking (queueing all operations until commit/flush), we allow all persists to be auto-flushed. As our exercise is a bulk insert we don’t care about optimistic locking, since we’re creating the objects. So we now have
optimistic=false, L2 cache=false, persistenceByReachabilityAtCommit=false, detachAllOnCommit=false, detachOnClose=false, manageRelationships=false, connectionPooling=builtin
INSERT = 42s, SELECT = 4.0s
Why the improvement? Because objects are flushed as they are encountered, so we don’t have to hang on to a large number of changes, and the memory impact is smaller. Note that we could also have observed a noticeable speed-up if we had instead called “pm.flush()” in the loop after every 1000 or 10000 objects. See the performance tuning guide for that.
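
As a rough sketch of that alternative (keeping the rest of the setup the same but flushing periodically), the insert loop could look something like this, where the batch size of 10000 is just an illustrative value:

for (int x = 0; x < 100000; x++)
{
    Student student = new Student();
    // ... populate the Thesis and Credits exactly as in the loop above ...
    pm.makePersistent(student);
    if (x > 0 && x % 10000 == 0)
    {
        // push the queued changes to the datastore, releasing memory held for unflushed objects
        pm.flush();
    }
}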

Use BoneCP connection-pooling

Use BoneCP instead of built-in DBCP, so we have
optimistic=false, L2 cache=false, persistenceByReachabilityAtCommit=false, detachAllOnCommit=false, detachOnClose=false, manageRelationships=false, connectionPooling=bonecp
INSERT = 42s, SELECT = 3.8s
Why the (slight) improvement? Because BoneCP's own benchmarks show that it has less overhead than DBCP.
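
For reference, switching pool implementation is just a persistence property change, along the lines of the following (the property name/value should be checked against the connection pooling docs):

datanucleus.connectionPoolingType=BoneCP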

Conclusion

As you can see, with very minimal tweaking we’ve reduced the INSERT time by a factor of 3, and the SELECT time by a factor of 1.7! That would equate to being noticeably faster than Hibernate in the author's original timings (for both INSERT and SELECT). Note that we already had the detach flags set to not detach anything, so they didn’t need tuning (but they should be included if you haven't already looked at them in your performance tests, as should all of the other features listed in the Performance Tuning Guide referenced above).

Does the above mean that “DataNucleus is faster than Hibernate”? Not as such; it is in some situations and not in others. We can turn many things on/off and get different results, just as Hibernate likely can (though I’d say DataNucleus is more configurable than the majority, if not all, of the other persistence solutions, so at least you have significant flexibility to do this with DataNucleus). In the same way we could persist other object graphs and get different results, because some parts of the persistence process are more optimised than others. One thing you can definitely say is that DataNucleus has very good performance (300000 objects persisted in 42secs on a PC, and 100000 objects queried in less than 4secs) and that performance can be significantly tuned.

The other thing that we said in the original blog post, and repeat here, is that if you are serious about performance analysis you have to dig into the details to understand why and, as a consequence, have an idea what to tune. You also need to assess what your application really needs to perform and what is considered acceptable performance; if you’re not going to make a proper attempt at tuning a persistence solution (whether that is DataNucleus, Hibernate, or any other), it's best not to bother at all and just use what you were going to use anyway, since you don’t have the time to give a fair representation (which is why we don’t present any Hibernate results here, so nothing hypocritical in that).

One important thing to note is that it is extremely useful to have the ability to set many of these properties on a per-PersistenceManager (or EntityManager) basis (so you could have a PM just for bulk inserts and disable L2 caching, or set the transaction to not be “optimistic”). JDO 3.1 adds the ability to set persistence properties on the PersistenceManager, though DataNucleus currently only supports a minimal set there – SVN trunk now has the ability to turn off the L2 cache in a PM while having it enabled for the PMF as a whole.
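
As a sketch of what that per-PM control could look like with JDO 3.1 (assuming the L2 cache property is among those supported on the PM in your version):

PersistenceManager pm = pmf.getPersistenceManager();
// disable L2 caching just for this PM's bulk-insert work, leaving the PMF-wide cache enabled
pm.setProperty("datanucleus.cache.level2.type", "none");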


Enhancing in v3.2

Whilst a “final release” of version 3.2 of DataNucleus is still some way off, some important changes have been made to the enhancement process that people need to be aware of, and can benefit from.

JDO : Ability to enhance all classes as “detachable” without updating metadata

When you enhance classes for the JPA API they are all made detachable, without any need to specify anything in the metadata (since JPA doesn’t have a concept of not being detachable). With JDO the default is not detachable (for backwards compatibility with JDO1, which didn’t have the detachment concept). In v3.2 of DataNucleus you can set the alwaysDetachable option (see the enhancer docs) and all classes will be enhanced as detachable without the need to touch the metadata; much easier than opening up every class or metadata file and adding detachable="true"!
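
As a sketch, invoking the enhancer from the command line with this option might look something like the following (the exact flag name and arguments are assumptions here; check the enhancer docs linked above for the definitive syntax):

java -cp {your_classpath} org.datanucleus.enhancer.DataNucleusEnhancer -api JDO -alwaysDetachable target/classes/mydomain/model/*.class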

JPA : Throwing of exceptions due to the bytecode enhancement contract

The bytecode enhancement contract requires that classes throw exceptions in some specific situations where information is either not present or not valid. Previously these were always JDO-based exceptions, matching the JDO bytecode enhancement contract exactly. They are now changed to better suit the JPA API, removing the need to understand JDO when using JPA.

  • if a non-detached field was accessed then a JDODetachedFieldAccessException was thrown; this is now changed to a java.lang.IllegalAccessException.
  • in some cases where an internal error occurred a JDOFatalInternalException would be thrown; this is now changed to a java.lang.IllegalStateException.

No “datanucleus-enhancer.jar”, and no need of external “asm.jar”

The DataNucleus enhancer was always maintained as a separate project, but is now merged into datanucleus-core.jar and so will be available directly whenever you have DataNucleus in your CLASSPATH. Taking this further, the enhancer makes use of the excellent ASM library and in v3.2 datanucleus-core.jar includes a repackaged version of the ASM v4.1 classes internally. This means that you have one less dependency also and can do enhancement with less thinking.
PS Remember, bytecode enhancement is “evil”, developers of some other persistence solution told you that back in 2003, and you should never forget it! 😉

Persistence to Neo4j graph datastores

Whilst DataNucleus JDO/JPA already supported persistence and querying of objects to/from RDBMS (all variants), ODBMS (NeoDatis), documents (XML, Excel, ODF), web (JSON), document-based stores (MongoDB), and map-based stores (HBase, AppEngine, Cassandra), as well as others like LDAP and VMForce, it was clear that we didn’t yet have a plugin for any of the nice new graph datastores like Neo4j. To this end, we now provide a new store plugin, supporting persistence to Neo4j.


Usage

Just like all of the other store plugins, we aim to make its usage as seamless and transparent as possible so that you, the user, have a high level of portability for your application. In simple terms you just mark your model classes with JDO or JPA metadata (annotations or XML) just as you would do for RDBMS (or any other datastore), and write your JDO or JPA persistence code in the normal way. The only difference is that the data is persisted into Neo4j transparently. I’ve not had time to write up a tutorial yet, but the model and persistence code would be identical to persisting to any other datastore, just that the datastore “URL” would be defined as something like
datanucleus.ConnectionURL=neo4j:{my_datastore_location}
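
To illustrate, a minimal persistence snippet (a sketch, using the standard javax.jdo API; the persistence-unit name and the Person class are hypothetical) looks exactly as it would for any other datastore, with only the connection URL above being Neo4j-specific:

PersistenceManagerFactory pmf = JDOHelper.getPersistenceManagerFactory("MyNeo4jUnit");
PersistenceManager pm = pmf.getPersistenceManager();
Transaction tx = pm.currentTransaction();
try
{
    tx.begin();
    pm.makePersistent(new Person());   // the object becomes a Neo4j Node
    tx.commit();
}
finally
{
    if (tx.isActive())
    {
        tx.rollback();
    }
    pm.close();
}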

Refer to the DataNucleus docs for more details. Note that the plugin is not yet released, but is available as a nightly build for anyone wishing to give it a try.


Currently supported

  • Each object of a class becomes a Neo4j Node.
  • Supports datastore identity, application identity, and nondurable identity
  • Supports versioned objects
  • Fields of all primitive and primitive wrappers can be persisted
  • Fields of many other standard Java types can be persisted (Date, URL, URI, Locale, Currency, JodaTime, javax.time, plus many more)
  • 1-1, 1-N, M-N, N-1 relations are persisted as Neo4j Relationships (Map fields are not currently supported)
  • JDOQL/JPQL queries can be performed, and the operators &&, ||, ==, !=, >, >=, <, <= are processed using Cypher, with any remaining syntax handled in-memory currently.
  • Support for using Neo4j-assigned “node id” for “identity” value strategy.
  • Checks for duplicate object identity
  • Embedded (and nested embedded) 1-1 fields, and querying of these fields


Likely supported soon

  • Processing of more JDOQL/JPQL syntax in Cypher, to minimise any in-memory processing
  • Support for backed SCO collection wrappers allowing more efficient Relationship management.


Feedback is welcome (over on the DataNucleus Forum, or below in the comments). Additionally, if anyone with more experience of Neo4j would like this plugin's capabilities to be enhanced, why not get involved? You could contribute a few patches, for example – the source code is available here, and the issue tracker is a good place to start.
Enjoy!


DataNucleus AccessPlatform v3.1 coming soon …

Almost a year on from the release of version 3.0, we move close to the release of version 3.1 (due late in July 2012). So what has changed in that time?

Consolidation

While DataNucleus’ plugin architecture is very flexible, it can lead to a large number of plugins being available. This in itself is not a bad thing but, if your application is using many features, you do have to keep track of more plugins and their versions. Version 3.1 merges the following plugins into other plugins

  • datanucleus-management was a plugin providing JMX capabilities to DataNucleus usage. It is now merged into datanucleus-core and is now part of a new statistics monitoring API.
  • datanucleus-javaxtime was a plugin providing support for the new javax.time classes that will provide a real Date/Time API for Java. This will be part of JDK 1.8 IIRC, so we have moved support for these Java types into datanucleus-core. More and more people will be using them and expecting their persistence to be seamless.
  • datanucleus-cache had support for an early version of the forthcoming javax.cache standardised Caching API, but the API has since changed and is reaching a level of maturity. As a result we now provide support for the latest javax.cache API in datanucleus-core, so the typical user (when javax.cache is widely implemented) will not need the datanucleus-cache plugin
  • datanucleus-xmltypeoracle was a plugin providing support for persisting String fields to XMLType JDBC columns for Oracle. It is now merged into the datanucleus-rdbms plugin.
As a result of these changes the typical application will only need datanucleus-core, datanucleus-api-jdo or datanucleus-api-jpa, and the datanucleus-{datastore} plugin of your choice in the CLASSPATH at runtime. In addition, some persistence properties have more sensible defaults, meaning that more applications won’t need to set particular values to work optimally.

JPA 2.1

The latest revision of the JPA spec (JSR0338) is under way, and has some new features already fleshed out. In Version 3.1 of DataNucleus we provide early access support for
  • Stored Procedure API : This allows users of JPA to invoke stored procedures in their RDBMS and get back output parameters and/or result sets. Obviously not applicable when using JPA with a non-RDBMS datastore.
  • Type Converter API : This defines a way in which a user can have a field in their Entity and convert its value before it gets to the datastore (and back on retrieval). For example, if you have some Java type of your own and want to persist it as a String you could define an attribute converter; a sketch follows below.
Obviously as JPA2.1 continues we will continue adding features to match their spec.
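
As a rough sketch of the sort of thing the Type Converter API enables (the annotation and interface names here follow the JPA 2.1 draft and may change, so treat them as indicative only):

import java.util.Currency;
import javax.persistence.AttributeConverter;
import javax.persistence.Converter;

// Persists a java.util.Currency field as its ISO currency code String, and converts it back on retrieval
@Converter
public class CurrencyConverter implements AttributeConverter<Currency, String>
{
    public String convertToDatabaseColumn(Currency attribute)
    {
        return (attribute != null) ? attribute.getCurrencyCode() : null;
    }

    public Currency convertToEntityAttribute(String dbData)
    {
        return (dbData != null) ? Currency.getInstance(dbData) : null;
    }
}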

Other New Features

Whilst some of these could be argued to deserve their own section in this blog, I list here other prominent changes in version 3.1 
  • The REST API has had significant work, and now provides much-enhanced support for JDOQL/JPQL, including order clauses etc. It also now supports use of datastore identity, bulk delete, and much more.
  • The enhancer will now work with JDK1.7 (and higher), using the latest version of ASM.
  • JTA handling with JPA is now complete
  • Support for nondurable identity is now provided for RDBMS, MongoDB, HBase, Excel and ODF.
  • You can now have any nontransactional updates persisted atomically. Previously only nontransactional persists and deletes were able to be performed atomically. This means we now have a real “auto-commit” mode of operation
  • The HBase plugin adds support for multitenancy, as well as obeying JDO/JPA naming strategies.
  • The MongoDB plugin adds support for embedded objects with inheritance, obeys JDO/JPA naming strategies, and adds support for several new query features being evaluated in the datastore.
  • The Excel and ODF plugins add support for JDO/JPA naming strategies.
  • The plugin for the Google AppEngine datastore has had a long-needed upgrade, and now works with DataNucleus v3.x. So users of that platform can get access to all of the work that has happened since 2009, finally!

DB4O dropped!

Whilst it is generally our policy to add capabilities with every release, it occasionally makes sense to remove functionality that is not considered worthwhile. Support for persisting to db4o datastores now falls under this category. As versions of db4o have been released, its public APIs have changed, making it hard to follow their development. Additionally Versant, the parent company of db4o, have recently released their primary object datastore with a JPA API (to add to its existing JDO API). Since Versant have done absolutely nothing to assist in the process of us providing a standards-based API for their software, and they are commercial and perfectly capable of committing resource to their own projects, our support is now withdrawn. It remains in DataNucleus SVN for anyone who needs it, but no resource from this project will be directed at their (commercial) datastore.
And that’s it. Maintenance of version 3.0 is now at an end (except commercially), and maintenance of version 3.1 will start once we release 3.1, as will, at some point, development of version 3.2.

GAE/J and DataNucleus v3 – Part 2

In the previous post we saw some initial changes to make the GAE/J DataNucleus plugin work with the latest version of the DataNucleus plugins. In this post we describe some further features of interest to GAE users that were not available before.

Storage Version
With v2 of the plugin it will, by default, persist using a new “storage version”. In v1 of the plugin it persisted no explicit information about relations, and instead relied on doing queries for parent key to find related objects; obviously when all relations were owned then this was valid. In v2 of the plugin it persists a property in the Entity for each relation (containing the Key(s) of the related object(s)), at the owner side always. In the case of unowned relations (see below) it also will persist a property in the Entity at the non-owner side of a bidirectional relation. Obviously all existing data uses v1 of the storage version, but don’t let that concern you since the plugin will check for presence of this property, and if not present then fall back to v1 behaviour to get the related objects. As entities are updated the data will be migrated to v2 storage version (a migration tool to do the job in one pass is in the works also).

 

Unowned Relations

By default in GAE/J all relations are owned, meaning that any child objects have the parent object Key as part of their Key, and are persisted as part of the same entity-group. This is obviously useful in optimising retrieval of data, but there are times when you simply want your model persisted without the imposition of ownership. In v2 of the plugin you can have unowned relations, where each object is in its own entity-group. To define a relation like this, see the following example

@PersistenceCapable
public class A
{
    @Persistent(primaryKey="true", valueStrategy=IdGeneratorStrategy.IDENTITY)
    long id;

    @Unowned
    B b;
}

@PersistenceCapable
public class B
{
    @Persistent(primaryKey="true", valueStrategy=IdGeneratorStrategy.IDENTITY)
    long id;

    @Unowned
    @Persistent(mappedBy="b")
    A a;

    String name;
}

So when we persist an object of type A with related B it will do the following

  1. PUT the A, generating its Key, but without property for B
  2. PUT the B, generating its Key, and with a property referring to the key of A
  3. PUT the A with the property referring to the key of B.

It should be noted that the @Unowned annotation is simply a shortcut for @Extension(vendorName="datanucleus", key="gae.unowned", value="true")

Be aware that if you persist unowned relations in a transaction then you will need to have multi-entity-group transactions enabled, since each object is in its own entity group.

 

Datastore Identity

With JDO the user has the choice of having their own primary key field (application-identity), or having the identity of the object defined for them (datastore-identity). GAE v1 only allowed application-identity. In v2 of the GAE DN plugin it also allows datastore-identity. To give an example

@PersistenceCapable
@DatastoreIdentity(strategy=IdGeneratorStrategy.IDENTITY)
public class MyClass
{
    ...
}

 

So with this class it will persist an Entity and its Key will use IDENTITY strategy.

Interface Fields
With v2 of the plugin you can now have fields of interface type (representing a persistable type) and persist them as you would normally do. Refer to the DataNucleus docs for how to do it (paying particular attention to the type of the field in metadata)

That’s a brief summary of some of the more noteworthy improvements. Hopefully GAE/J using JDO (or JPA) is now a much more pleasant place to be, and you can refer almost directly to the DataNucleus docs for many more features. In addition to the above changes, and fixes for various other minor bugs, the code structure has been changed quite a bit, so future enhancements ought to be much more rapidly achievable.


GAE/J and DataNucleus v3 – Part 1

Some time ago I wrote a post about GAE/J and how it provides JDO/JPA. It had many limitations and shortcomings. Recently we have had the chance to update their DataNucleus plugin to work with version 3.0. Here are the major changes that users of that plugin will see if they build and use the GAE/J DataNucleus plugin from SVN trunk.
JDOQL/JPQL : Support for methods/operators
If a user sets the query extension/hint “datanucleus.query.evaluateInMemory” then the query will be evaluated in-memory. This has an obvious drawback in terms of memory utilisation (if the number of results is large), but the big plus is that it will evaluate almost all JDOQL/JPQL syntax.
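
As a sketch, setting that extension on a JDOQL query looks like this (the Person class here is hypothetical; JPQL has an equivalent query hint):

Query q = pm.newQuery(
    "SELECT FROM " + Person.class.getName() + " WHERE name.startsWith('Fre')");
// evaluate the filter in-memory rather than in the datastore, so (almost) full JDOQL syntax is usable
q.addExtension("datanucleus.query.evaluateInMemory", "true");
Collection results = (Collection) q.execute();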

JDOQL : Support for input candidate collection
You can now specify the instances that you want to query over using query.setCandidates(…). This means that you can take a collection of instances and query which of them match a particular filter; a sketch follows.
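
A minimal sketch of that, assuming you already hold the candidate instances (the Person class and the "people" collection are hypothetical):

// query only over the supplied instances, rather than the whole datastore
Query q = pm.newQuery(Person.class, "age > 21");
q.setCandidates(people);
Collection results = (Collection) q.execute();
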
Primary Key Types
Previously you could only have Long, String or Key. You can now also have long.
Plugin package naming
Now uses com.google.appengine.datanucleus as its package root, hence not using the DataNucleus-owned domain.

JDOQL/JPQL setResultClass
This is now supported for the standard types of result classes, so you no longer need to manually convert the result into your required type.

Value Generation
GAE/J users can now make use of other DataNucleus value generators, such as “uuid” and “uuid-hex”.

PersistenceManagerFactory
The PersistenceManagerFactory used is now the standard DataNucleus PMF, not any custom GAE variant. To be specific the PersistenceManagerFactoryClass is now org.datanucleus.api.jdo.JDOPersistenceManagerFactory. If you want to have a singleton PMF, simply set the persistence property datanucleus.singletonPMFForName to true. This will then return any existing PMF if present for the requested persistence-unit, or create it if not present.
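
So, as a sketch, your jdoconfig.xml/persistence properties might contain something like the following (property names as described above; check the docs for your version):

javax.jdo.PersistenceManagerFactoryClass=org.datanucleus.api.jdo.JDOPersistenceManagerFactory
datanucleus.singletonPMFForName=true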

EntityManagerFactory
The EntityManagerFactory used is now the standard DataNucleus EMF, not any custom GAE variant. To be specific the PersistenceProvider is org.datanucleus.api.jpa.PersistenceProviderImpl. If you want to have a singleton EMF, simply set the persistence property datanucleus.singletonEMFForName to true. This will then return any existing EMF if present for the requested persistence-unit, or create it if not present.

JPA2
By using DataNucleus v3 you now have available all of the changes made in JPA2, so things like Criteria queries, metamodel, etc.

JDO3
By using DataNucleus v3 you now have available all of the changes made in JDO3.0/JDO3.1. This means query timeouts, metadata API, enhancer API, as well as the DataNucleus proposal for Typesafe JDO queries.

Level2 Caching
Level2 Caching is enabled by default, using an internal map-based cache. You can improve this further by setting the persistence property datanucleus.cache.level2.type to “javax.cache” and include datanucleus-cache.jar in your CLASSPATH. This will then cache using GAE Memcache

Non-transactional Persistence
DataNucleus non-transactional behaviour is different now, with any call to pm.makePersistent, pm.deletePersistent, em.persist, em.merge, em.remove being atomic, sent to the datastore immediately. Any updates to fields via setters are still queued.

JPA RetainValues
JPA usage, by default, has datanucleus.RetainValues set to true now. This means that when you commit a transaction the object will retain the values of its fields (previously it migrated to hollow state).

Persistence of other java types
In GAE/J v1 you could only persist fields of the following types: primitive, primitive wrapper, String, Date, Enum, BigDecimal, some com.google.appengine types, as well as Collection types. With v2 you can now persist fields of types Currency, Locale, TimeZone, BigInteger, Color, Point, StringBuffer, JodaTime, javax.time, and many more.

Be aware though … there is more to come


Performance, benchmarking

Every so often some individual or group decides they’re going to “invent” a new benchmark. They ignore all that has been done before (like PolePos), maybe due to NotInventedHereSyndrome, or maybe due to “oh, someone who helped write that was related to some datastore, therefore it must be dodgy”. While it is highly likely that existing benchmarks don’t include their particular case, they never make any reference to what was there before and what people are familiar with. They make minimal effort at configuring persistence solutions other than their favourite or the one they have experience with, and then publish their results with a flourish to websites.

Such benchmarks have been known to draw the conclusion that there is a standard performance difference across all types of tests. Anyone who has ever looked at the persistence process would know that this conclusion is flawed: good persistence solutions all provide particular features, and by turning on these features you gain benefits at the expense of some performance, so certain operations will be better with one tool than with another. Hibernate has some good features, DataNucleus has some good features. If you enable particular features you get poorer performance in other areas.

The other aspect of their conclusion is that they simply want a headline-grabbing, black-and-white “this is better than that”. They seemingly aren’t interested in thinking about the different methods employed by the software under test and the particular options. If I were exploring the topic of performance (and it's an interesting topic, that can be useful in influencing priorities) I’d want to think about what I was asking the software to do and, bearing in mind that these benchmarks use open source software, I’d have information available as to how a particular implementation attempts to do it. This could then be termed constructive benchmarking.

Recently we had one which took a flat class, no inheritance or relationships, and persisted it … many times … and then called itself a “Detailed Comparison”. While this may be an operation that an application needs to do, the response time for the persist of an object of this form would not differ by any amount perceptible to the end user of that software. If a user wants to persist a large number of objects (as in bulk loading), then this would typically be performed in a background task anyway. DataNucleus has never been optimised for such a task (it would logically be better at complex object graphs, due to the process it uses for detection of changes), but even so it is actually configurable to give pretty good performance. As above, it’s good at some things and less good at others. Anyone claiming to make a “detailed comparison” would have bothered to try it on a range of cases, to look at the full capabilities, to look at outstanding issues, etc.

Here, to give you something to talk about, I just ran a flat class

@Entity
public class A
{
    @Id
    Long id;

    String name;

    // no-arg constructor, as required of a JPA entity
    public A() {}

    public A(long id, String name)
    {
        this.id = id;
        this.name = name;
    }
}

I ran it through the following persistence code

EntityManager em = emf.createEntityManager();
EntityTransaction txn = em.getTransaction();
txn.begin();
for (int i = 0; i < 10000; i++)
{
    A a = new A(i, "First");
    em.persist(a);
}
txn.commit();

I used the config of the aforementioned (Hibernate) users for running Hibernate, and tuned DataNucleus myself. And the runtimes? Hibernate (3.6.1.Final) took 8164ms, and DataNucleus (SVN trunk on 3/Mar/2011) took 7227ms (using HSQLDB 1.8.0.4 embedded). So in that case DataNucleus was faster. Is this significant? Well no, since as I already explained it’s a particular case, but it demonstrates the principle clearly enough. We can all turn particular options on/off and get some results. Besides which, persisting 10000 objects in just 7-8 seconds (on this PC!) is pretty impressive in anyone’s book and never a bottleneck in a normal application.

If you are performing something like a flat-object bulk persist you would not use transactions, to avoid the overhead, and you would turn off various features that are not of interest – managed relationships, persistence-by-reachability-at-commit, even L2 caching. Then if using an identity generator you would allocate large blocks of values in your metadata. That said, if you are so serious about performance and persisting flat objects to RDBMS then anyone sensible would use either JDBC directly, or a JDBC wrapper like MyBatis or SpringJDBC. This is “right tool for the job”. Since the benchmark was provided by a group of Hibernate users (they provide various Hibernate tutorials and nothing for any other persistence implementation) we can only assume that this must be what they think is the “best tool” – if so, why not reproduce your benchmark with well-written JDBC and let us know what you find? I even posed this question about the applicability of a flat class on their blog and my comment was deleted. Whether it was deleted by them, or by the blogger system, I’ve no idea, but it was there for an hour and at that point was deleted. Our blog has never lost comments, and it’s on the same host system.

One of the (undeleted) comments on their benchmark was from the author of an ODBMS who seems to like to decide publicly what I ought to spend my spare time on. Is he a commercial client? No. Is his software open source? No. Or free? Not if I want to use it on anything serious. The response to him is simple: let me decide what I spend my time on, and you concentrate on your own software; I don’t tell you what to do. It’s an ancient custom called “respect” – its absence is sadly symptomatic of the attitudes in the IT profession.

A benchmark to use as the basis for choosing what software to use in your own application needs to cover the different persistence operations that you will perform. If you have a web application that continually creates a few objects, deletes a few, updates a few, etc, then the likelihood is that the performance will not impact you or your end users one iota. What will impact you is whether the persistence solution allows you to do what you want to do, or whether it has a large number of unfixed bugs that force you to continually implement workarounds or compromises on your design. Why not have a look through the issue tracker of the software and see what types of problems people are having, and how long the issues have existed?

Edit : to give an example of another benchmark for JPA providers, here is one that was presented in 2012 comparing the 4 most well-known JPA providers on some more complicated models. DataNucleus comes out very well. Note that in this example the author actually bothered to investigate what was happening under the covers.

DataNucleus 3.0 development is initially focussing on architecture, since we believe in getting the architecture right first in order to take the software to the level we think it needs to be at. This means that in early milestones (the benchmark referenced above decided to use 3.0M1) we spend time on refactoring etc, and not on performance. This doesn’t mean that performance isn’t important, just that we feel our users want to be able to perform their tasks first and foremost, and then speed things up later. This is the same methodology employed, to much success, by PostgreSQL, who for years had to listen to “MySQL is faster” comments. Even with that general philosophy, anyone using current SVN code would already see a very noticeable speed-up in non-transactional persistence performance, and anyone using the MongoDB plugin would also see much more optimised insert performance. These benefits are due to extending the architecture to do some things that we’ve wanted to do for some time, but didn’t because we wanted to maintain backwards compatibility and due to resourcing.

Next time you look at some “performance benchmark” we suggest that you bear this in mind. We won’t be spending our time analysing their results, or responding to their claims, because we’d rather spend it developing this software than on “mine is better than yours” negativity.


DataNucleus v3 and MongoDB

There has obviously been a recent shift towards looking at highly scalable datastores for use in “the cloud”. Google wrote a plugin for their own BigTable datastore back in 2009, providing access to some of the features of JDO/JPA. Unfortunately they didn’t have the intention of providing a full and fair reflection of those persistence specifications, and so reaction to it was mixed. Some people attempted to argue that APIs like JDO did not match these types of datastores (I see nothing in the API or query language of JDO that leads to this conclusion, but anyway) and that using standard APIs on them was inappropriate; they were asked to provide concrete examples of features of these datastores that could not be handled adequately by JDO, but unfortunately didn’t come up with anything.

With DataNucleus v3 we have the opportunity to spend some time on providing good support for these types of datastores, adding support for missing features. A previous blog post documented efforts to upgrade the support for HBase. In this blog post we describe the features in the new plugin for the MongoDB document-based “NoSQL” store.

Features that this plugin currently supports include

  • Support for single MongoDB instances, and for MongoDB replica sets
  • Support for application identity (defined by the user), and datastore identity (surrogate field in the JSON document)
  • Basic persistence/update/delete of objects
  • Support for persistence of (unembedded) Collections/Maps/arrays by way of storing the identity of any related object(s) in the field.
  • Persistence of related objects (1-1/N-1) as “flat” embedded, where all fields of the related object are fields of the owner JSON document. This also supports nested related objects (unlimited depth).
  • Persistence of related objects (1-1/N-1) as nested embedded, where the related object is stored as a nested JSON document within the owner JSON document. This also supports nested related objects (unlimited depth).
  • Persistence of related collections/arrays (1-N) as nested embedded, where the related objects are stored as an array of nested JSON documents. This also supports nested relations.
  • Persistence of related maps (1-N) as nested embedded, where the related map is a nested array of map entries with fields “key”,”value”. Supports nested relations.
  • Persistence of fields as serialised.
  • Polymorphic queries. If the query requests a class or subclasses then that is what is returned. This implies the execution of any query against the MongoDB collection for each of the possible candidate classes.
  • Access to the native MongoDB “DB” object via the standard JDO datastore connection accessor
  • Support for persistence of object version, stored in a separate field in the JSON document and support for optimistic version checking
  • Support for “identity” value generation using the MongoDB “_id” field value. The only restriction on this is that a field/property using “identity” value generation has to be of type String
  • Support for “increment” value generation (numeric fields).
  • Support for SchemaTool creation/deletion of schemas. This supports the document collection for the classes, as well as any indices required (including unique).
  • JDOQL/JPQL querying, including support for fetch groups, so you can restrict how much data is returned by the query.
  • Basic JDOQL/JPQL filter clauses (comparison operations) are evaluated in the datastore where possible.
  • Support for running queries on MongoDB slave instances.
  • Support for persistence of discriminators, so a user can store multiple classes into the same MongoDB collection, and we use the discriminator to determine the object type being returned.

As you can see, we already provide a very good JDO (and JPA) capability for MongoDB, and the feature list is shown as a matrix here.
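
For example, the native access mentioned in the list above (via the standard JDO datastore connection accessor) would look something like this sketch, where JDOConnection is javax.jdo.datastore.JDOConnection and DB is the MongoDB Java driver's com.mongodb.DB:

JDOConnection jdoConn = pm.getDataStoreConnection();
try
{
    // the native MongoDB connection object, usable for any driver-level operations
    DB db = (DB) jdoConn.getNativeConnection();
    // ... use db ...
}
finally
{
    // always hand the connection back to DataNucleus when finished
    jdoConn.close();
}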

Input to this plugin is obviously desired, particularly from people with more intimate knowledge of MongoDB. Source code can be found at SourceForge, and issues are tracked via the NUCMONGODB project in JIRA.


JPA : TCK request and JPA2.1

It is now almost exactly a year since we submitted a request for access to the JPA2 TCK (JSR0317). We provided everything requested of us by Oracle, and we still haven’t received the JPA2 TCK. Just to say Happy Anniversary Oracle on this evidently controversial request. The only possible reasons that I can think of for lack of provision of this are either incompetence, or deliberate prevention of access. Needless to say, we no longer wish to have access to this TCK.

On a related note I was asked whether I’d be participating in the JPA2.1 (JSR0338) expert group. I did think about it, no really! 🙂 But then I remembered that I spent many hours sending emails chasing Oracle employees to fulfil the request for the JPA2.0 TCK, all to no avail. We also have to remember that JPA was born out of politics, nothing more. Consequently the answer had to be a NO.

I have every willingness to participate in truly open standards on Java persistence, based on technical reasons. Any such participation would be with open mailing-lists/forums, and an open testing capability so that anyone can validate the compliance of any implementation. JPA, and the JCP process it operates under, is not that; JPA standardises things like “Criteria” which are widely recognised as inelegant (many call them ugly actually) and hard to use. A standard should take something that has become accepted as a good way of doing something (QueryDSL, for example) and convert it into something coherent with a clean API (as we did with JDO Typesafe); it should not invent something and impose it on people. The JPA philosophy is simply wrong in this respect. I wish them good luck with JPA2.1, and DataNucleus will, at some point (typically as soon as the spec is made public), implement what they come up with, but I will not be party to its “design”.


DataNucleus v3 and HBase

DataNucleus AccessPlatform v3 provides an opportunity to bring some of our other datastore plugins closer to the standard of the more mature long-supported datastores (e.g RDBMS). In the case of HBase, the plugin provided with v2.0 offered basic persistence and querying, at best. In v3 this is already much improved.

  1. You can now run SchemaTool against HBase. This operates in either “create” or “delete” modes, and allows you to manage the schema required by your persistable classes.
  2. When a relationship was persisted with v2.0 it simply serialised the related object. This broke JDO/JPA cascade rules. In v3 the column for the relation in the owner stores the identity (or identities when persisting multi-value relations). This also provides correct cascading of persist and update.
  3. When a String field was persisted in v2.0 it was Java serialised, meaning that it was not readable with something like “hbase shell”. In v3.0 String/char fields are persisted as the bytes of the field value, hence readable.
  4. With v2.0 we provided value generation using “uuid”, “uuid-hex”, etc simple generators, but not accessible by default. In v3.0 the default (JDO “native”, JPA “auto”) is “uuid-hex”, and we also provide an “increment” generator (contrib from Peter Rainer).
  5. In v2.0 we only supported “application identity”. In v3.0 we also support “datastore identity” (surrogate identity column).
  6. In v2.0 we didn’t support storing a version against the object. In v3.0 we do allow this.
  7. You can now embed persistable fields into the table of the owning object, and also nested embedded fields.
  8. Now supports HBase 0.90

Obviously there is still much more that can be done – see the datastore features comparison table. One change that will give more performance is to handle more of any query filter in the datastore, rather than just processing it all in-memory. Contributions for this and other things are very welcome. But then we aren’t even at 3.0 Milestone 1 yet …
