Tuesday 1 February 2011

Monday 31 January 2011

Storing Java objects with db4o

Recently I was looking for a database that could hold native Java objects in a very easy-to-use way. While reading up on NoSQL databases I had come across a few words about the Versant db4o database, so I thought this was the time to learn more about it.

The idea behind this kind of database is to store the objects exactly as they exist in your application. There is no need for an additional relational layer, which often maps the fields of one Java object to many columns in many tables of a relational database. This approach gives us some immediate advantages:

- no need to install (or use) an RDBMS
- access to the database is faster, because we bypass the relational mapping layer

Speed is also a strong point because, under the hood, the objects are stored in the form of graphs, which allow very efficient algorithms for reading, writing and searching.

The database is available for two platforms: Java and .NET. I started by downloading the Java API and the binaries of the database, version 7.12. The archive is around 48 MB and, after extracting it, we get directories containing the sources, the Object Manager Enterprise (a kind of administrative tool), and documentation and tutorials on how to install and use the database.

The installation is very easy: all you have to do is add to the classpath of your project (I used Eclipse to create a new project) the library built for your installed Java version. Personally I use Java 6, so I chose the db4o-7.12.156.14667-all-java5.jar file.

Storing objects is simple and intuitive. First we create an ObjectContainer, then use its methods, which resemble the ones used with relational databases (I assume many of us are familiar with the JDBC methods). For example, if we have an Athlete object and want to store it in the database, we write (in an over-simplified way):

// Open (or create) the database file
ObjectContainer db = Db4oEmbedded.openFile("db.dbo");
try {
    Athlete athlete = new Athlete("Haile Gebrselassie", "Berlin Marathon", "2:03:59");
    // Persist the object and make the change durable
    db.store(athlete);
    db.commit();
} catch (Exception e) {
    e.printStackTrace();
    // Undo any uncommitted changes
    db.rollback();
} finally {
    db.close();
}


To search for information in the database, we need queries. Here, a query works over the instances of one kind of object. There are Queries By Example, which take an input template and return all the objects that match every non-default field of the template. A more general type of query is the Native Query, which is in fact the recommended way to search. The lowest-level type is the so-called SODA Query (Simple Object Data Access), which works directly with the nodes of the database graph.

For example, to retrieve all Athlete objects from the database using a Query By Example, we first define a prototype, then pass it to the query and iterate over the result:

// A prototype with only default (null) fields matches every stored Athlete
Athlete proto = new Athlete(null, null, null);
List<Athlete> res = db.queryByExample(proto);

System.out.println("Name \t\t Race \t\t Time");
for (Athlete crtAthlete : res) {
    System.out.println(crtAthlete.getName() + "\t\t" + crtAthlete.getRace() + "\t\t" + crtAthlete.getTime());
}
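The matching rule behind Query By Example can be illustrated without the database at all. The following is only a sketch in plain Java, not db4o code: the Athlete class is a hypothetical definition (the post never shows it), QbeSketch is an invented helper, and the second athlete is placeholder data. It shows the idea that a template field left at its default value (null for a String) matches anything, while a field that is set must match exactly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Athlete class; the real one used in the post is not shown.
class Athlete {
    private final String name;
    private final String race;
    private final String time;

    Athlete(String name, String race, String time) {
        this.name = name;
        this.race = race;
        this.time = time;
    }

    String getName() { return name; }
    String getRace() { return race; }
    String getTime() { return time; }
}

// In-memory imitation of the Query By Example matching rule (not the db4o API).
class QbeSketch {
    // A template field left at null (the String default) matches any value.
    static boolean fieldMatches(String template, String actual) {
        return template == null || template.equals(actual);
    }

    // Returns every stored athlete that agrees with all non-null template fields.
    static List<Athlete> queryByExample(List<Athlete> stored, Athlete template) {
        List<Athlete> result = new ArrayList<Athlete>();
        for (Athlete a : stored) {
            if (fieldMatches(template.getName(), a.getName())
                    && fieldMatches(template.getRace(), a.getRace())
                    && fieldMatches(template.getTime(), a.getTime())) {
                result.add(a);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Athlete> stored = new ArrayList<Athlete>();
        stored.add(new Athlete("Haile Gebrselassie", "Berlin Marathon", "2:03:59"));
        stored.add(new Athlete("Runner B", "Some Other Race", "2:10:00")); // placeholder data

        // Empty template: every athlete matches.
        System.out.println(queryByExample(stored, new Athlete(null, null, null)).size());
        // Only the race is set: just the Berlin entry matches.
        System.out.println(queryByExample(stored, new Athlete(null, "Berlin Marathon", null)).size());
    }
}
```

With numeric fields the default would be 0 rather than null, so a template of this kind cannot ask for "value equals 0"; that is one reason the more general Native Queries are usually preferred.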


I wanted to compare the speed of db4o with a true RDBMS (I have MySQL 5 installed on my machine). For this, I defined a table called ATHLETS in MySQL with the following schema:

create table athlets(
id int unsigned not null auto_increment primary key,
name varchar(50),
race varchar(50),
besttime varchar(50)
) engine=InnoDB;


Then, for 1000 records with the Name, Race and BestTime fields filled in, the insertion time was 2109 milliseconds.

For 1000 objects of type Athlete, with the same fields filled in, the total insertion time was 106 milliseconds, so in this simple test db4o was around 20 times faster than MySQL.
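For readers who want to repeat the measurement, here is a minimal sketch of how such timings could be taken and compared. It is not the code used for the numbers above: InsertTiming, timeMillis and speedup are names invented for this illustration, and the actual insert loops (JDBC or db4o) would go inside the Runnable passed to timeMillis.

```java
// Hypothetical helper for timing a batch of inserts and comparing two results.
class InsertTiming {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1000000L;
    }

    // Ratio of the slower time to the faster time.
    static double speedup(long slowerMillis, long fasterMillis) {
        return (double) slowerMillis / fasterMillis;
    }

    public static void main(String[] args) {
        // Figures reported in the post: 2109 ms for MySQL, 106 ms for db4o.
        System.out.println("speedup: " + speedup(2109, 106));
    }
}
```

With the figures from the post the ratio is 2109 / 106, or about 19.9. A single run like this is only a rough indication; repeating the test several times and batching the MySQL inserts in one transaction would give a fairer comparison.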

To manage the records in the database, Versant provides an Eclipse plugin called Object Manager Enterprise (OME), which is very easy to install and lets you browse and query the records in the database.

I found this database very interesting and useful when you need to store native Java objects. It is very fast and also very easy to use. Queries are written in Java and checked at compile time, parameter types included. The database is essentially schemaless: objects can change their structure and are still persisted as they are, with no changes needed in the database layer (unlike an RDBMS, where a change in the model implies a change in the database structure and, more than that, a change in the SQL queries). It is embedded directly in the application, so no extra RDBMS has to be installed locally or on the network, and it is loaded in the same process as the application. On top of this, Versant db4o supports ACID transactions.

I also found some cons. One of them, from my point of view, is that for commercial applications, or if you want support from the database provider, you have to pay. Licensing is based on the number of processor cores, so for a powerful server the price can escalate quickly.

Saturday 29 January 2011

Who's afraid of the NoSQL databases

NoSQL databases have become a buzzword and, since I am somewhat involved in working with them, a couple of days ago someone naturally asked me what they are good for and what the deal with them is. There is much to say about this subject, but below I will try to explain the essentials in a few words.

The need for this kind of storage came from web applications that dealt with ever larger quantities of data, which were sometimes distributed over more than one computer or lived on servers that at some point could no longer manage them because of their physical limitations. In other words, they couldn't scale vertically, so they had to scale horizontally. For those not familiar with these terms: vertical scaling improves the performance of a database by buying additional memory or more processors for the server that hosts it. Horizontal scaling is the opposite: it improves performance by adding more nodes to a network, each node hosting an instance of the database.
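One simple way to picture horizontal scaling is hash partitioning: each record key is mapped to one of the available nodes, so adding nodes spreads the data over more machines. The sketch below is only an illustration under that assumption; ShardRouter and the key format are invented here and do not describe how any particular NoSQL database routes its data.

```java
import java.util.Arrays;

// Hypothetical router that assigns each key to one of nodeCount nodes.
class ShardRouter {
    // floorMod keeps the index in [0, nodeCount) even for negative hash codes.
    static int nodeFor(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    // Counts how many of the given keys land on each node.
    static int[] distribute(String[] keys, int nodeCount) {
        int[] perNode = new int[nodeCount];
        for (String key : keys) {
            perNode[nodeFor(key, nodeCount)]++;
        }
        return perNode;
    }

    public static void main(String[] args) {
        String[] keys = new String[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "athlete:" + i; // made-up key format
        }
        // The same 1000 keys spread over 3 nodes, then over 4 nodes.
        System.out.println(Arrays.toString(distribute(keys, 3)));
        System.out.println(Arrays.toString(distribute(keys, 4)));
    }
}
```

A plain modulo like this moves most keys to a different node whenever the node count changes, which is why real distributed stores tend to use schemes such as consistent hashing instead.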

Since 2009, a lot of different kinds of NoSQL databases have appeared, each with its advantages and disadvantages, but each covering specific needs. Some of them are columnar databases (they store data in so-called “columns”), others are key-value based, and others store entire documents and are called document store databases, etc.

Key-Value Databases
Columnar Databases
Document Store Databases
Object Databases
Graph Databases


Other distinctive characteristics of NoSQL databases: they are schemaless, most of them are open source with convenient licensing, and they are distributed and accessible through simple APIs, which makes them attractive for applications running in the cloud, where they can spread over many network nodes. On the other hand, being usually distributed, NoSQL databases are subject to Eric Brewer’s CAP theorem, which says that a distributed system cannot guarantee all three of the following properties at the same time:

  • Consistency (all nodes see the same data at the same time)
  • Availability (the system should have a response for every request in a specified amount of time)
  • Partition Tolerance (the system should work despite the failures of some of its nodes)

So, being distributed means that partitions have to be tolerated, and from Brewer’s theorem we know that we must trade off between the two remaining properties. That is why we have databases that are more available than consistent, or more consistent than available. For example, Cassandra is a highly available database but only eventually consistent, which means that, in the absence of new updates for a limited period of time, all the nodes will eventually hold the same state.
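Eventual consistency is easier to see in a toy model. The sketch below is a simplified simulation, not how Cassandra is implemented: Replica and EventualConsistencyDemo are invented names, the timestamps are plain logical counters, and conflicts are resolved with an assumed last-write-wins rule. Two replicas accept different writes for the same key, so reads disagree for a while; once the replicas exchange their state and no new updates arrive, they all hold the same value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical replica that resolves conflicts with a last-write-wins rule.
class Replica {
    private final Map<String, String> values = new HashMap<String, String>();
    private final Map<String, Long> stamps = new HashMap<String, Long>();

    // Accepts the write only if it is newer than what this replica already holds.
    void write(String key, String value, long timestamp) {
        Long current = stamps.get(key);
        if (current == null || timestamp > current) {
            values.put(key, value);
            stamps.put(key, timestamp);
        }
    }

    // Returns the locally known value, or null if this replica has none yet.
    String read(String key) {
        return values.get(key);
    }

    // Sends every entry known here to the other replica.
    void pushTo(Replica other) {
        for (Map.Entry<String, String> e : values.entrySet()) {
            other.write(e.getKey(), e.getValue(), stamps.get(e.getKey()));
        }
    }
}

class EventualConsistencyDemo {
    // Every replica pushes its state to every other replica, in order.
    static void syncAll(Replica[] replicas) {
        for (Replica a : replicas) {
            for (Replica b : replicas) {
                if (a != b) {
                    a.pushTo(b);
                }
            }
        }
    }

    public static void main(String[] args) {
        Replica[] replicas = { new Replica(), new Replica(), new Replica() };
        replicas[0].write("best-time", "2:04:26", 1L); // older write on node 0
        replicas[2].write("best-time", "2:03:59", 2L); // newer write on node 2

        // Before synchronisation the nodes disagree (node 1 has no value at all).
        for (Replica r : replicas) System.out.println(r.read("best-time"));

        syncAll(replicas);

        // With no further updates, every node now holds the newest value.
        for (Replica r : replicas) System.out.println(r.read("best-time"));
    }
}
```

The system stays available the whole time, since every replica answers reads and writes immediately; what it gives up is the guarantee that all replicas answer with the same value at every moment.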

In a future post I will talk a little about the data model in NoSQL databases.