Saturday 22 August 2015

Saving Google Forms data in Google Spreadsheet

Recently I worked on a Google Forms application that did many things beyond the standard use cases, and I thought it would be useful, for those who want to squeeze more out of its functionality, to share a few points from developing it. From the very beginning, Google Forms is a great application if you want to design a form in just a few clicks and collect the submitted data. It is very useful when you want to run a survey or, if you are a teacher, to quickly prepare an examination grid and collect the answers from your students.

Google Forms helps you design a user interface by offering a collection of predefined form controls and some field validations, but it is a little rigid regarding look and user experience, by which I mean the positioning of controls on the page and their customization. In fact, that is not its purpose, so these limitations can be ignored. Once the design of the form is done, the application is ready to collect the submitted data into a spreadsheet without your writing a single line of code.

If you want to do more with your form, the Apps Script programming language is your ally. Every application in the Google Apps suite has its own programming API, exposing different objects and methods. The framework supports stand-alone functions and event handlers. You can write modular code spread across many distinct files (with the .gs extension), but in the end it behaves the same as if it were all in a single big file. To debug the code, Google offers an integrated debugger that can execute step by step, and a Logger object which helps you log useful data to an execution console. Besides this, there is an Execution Transcript which records every executed line of code.
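As a quick illustration of the Logger (a minimal sketch; the function name and the logged value are just placeholders), logging from any .gs file looks like this, and the output can then be read from the script editor's log view:

function logSomething() {
  var value = 42; // hypothetical value to inspect
  Logger.log("The computed value is %s", value);
}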

Suppose you collect all the data from a form in a spreadsheet, but in addition you want one or more extra columns in the spreadsheet holding data processed after the form submission. This is not one of the usual use cases, so a little coding is necessary. Two applications are involved: Google Forms and the Google Spreadsheet that collects the responses. The flow could be:

  1. Collect the data in Google Forms
  2. Submit the form
  3. In the form submit handler of Google Forms, open the background spreadsheet and write the processed data

Or the flow could be:

  1. Collect the data in Google Forms
  2. Submit the form
  3. In the form submit handler of the Google Spreadsheet, get the collected data from the form and write the processed data

Both flows look pretty much the same, but let's see what happens.

In the first case, the form submission handler is attached to the Google Form app:

function onFormSubmit(e) {
  var processedData = "Data processed from some of the form fields";

  // open the destination spreadsheet and the sheet that collects the responses
  var sheet = SpreadsheetApp.openById("1gIqbtpsttN5Z8EKuHBrA1YI0eqPh6AGQo6tYtsFm7YY")
                            .getSheetByName("Form Responses 1");

  // write the processed value in the column right after the submitted answers
  // (note: a form-bound trigger actually receives e.response rather than e.range)
  var activeRange = e.range;
  var rangeToWrite = sheet.getRange(activeRange.getRow(), activeRange.getLastColumn() + 1);
  rangeToWrite.setValue(processedData);
}

Running this code fails with:

Execution failed: You do not have permission to call openById (line XX, file "Code") [0.002 seconds total runtime]

because the Google Forms script tried to open the spreadsheet through SpreadsheetApp, which is not permitted from this context, and I did not find any solution to make it work.

A workaround for this situation is to place the code in the onFormSubmit handler of the Google Spreadsheet.

function onFormSubmit(e) {
  var processedData = "Data processed from some of the form fields";

  // the script is bound to the spreadsheet, so no openById is needed
  var sheet = SpreadsheetApp.getActive().getSheetByName("Form Responses 1");

  // e.range is the row that was just written by the form submission
  var activeRange = e.range;
  var rangeToWrite = sheet.getRange(activeRange.getRow(), activeRange.getLastColumn() + 1);
  rangeToWrite.setValue(processedData);
}

In this case the submitted values are received through the event object e, so there is no need to open the Forms app (and run into the same error), and the write operation works fine.
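Keep in mind that onFormSubmit is not fired by itself: it has to be registered as an installable "On form submit" trigger for the spreadsheet-bound script, either from the script editor or programmatically. A minimal sketch of the programmatic variant (run it once, manually, from the same script project):

function createFormSubmitTrigger() {
  // register onFormSubmit to run whenever a form response lands in this spreadsheet
  ScriptApp.newTrigger("onFormSubmit")
           .forSpreadsheet(SpreadsheetApp.getActive())
           .onFormSubmit()
           .create();
}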

Another useful thing to consider is sending emails from inside Google Spreadsheets. MailApp is usually used for this, but you cannot change the "From" field of the email: the From address will always be the email address of the account that created the form submission trigger. If GmailApp is used instead of MailApp, the From field can be changed to an alias of the account that created the trigger, but not to a completely different email address.
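As an illustration of the alias behaviour, here is a minimal sketch; the recipient address, subject and the chosen alias index are hypothetical:

function sendNotificationFromAlias() {
  var aliases = GmailApp.getAliases();   // aliases configured for the account that owns the trigger
  GmailApp.sendEmail("someone@example.com",
                     "Form processed",
                     "The processed data was written to the spreadsheet.",
                     {from: aliases[0]}); // must be one of the account's aliases
}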


Saturday 5 January 2013

Setting riak to store data in an EBS volume in Amazon EC2

These days I've been performing some tests on a riak cluster configuration, involving about a million write operations from many virtual machines hosted in EC2. Due to the database replication and the number and size of the records to be written, the default storage space that comes with an Ubuntu Server 12.04 LTS instance was insufficient, so the solution was to add extra storage space through an Elastic Block Store (EBS) volume.

The procedure is quite simple and is described below, to save time for those who need to do this storage extension.

First of all, a new volume must be created from the AWS console and attached to the desired instance.

See how the volume is attached in Ubuntu by issuing the command:

df

The result of the command is:

Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/xvda1       8256952 793960   7043564  11% /
udev              838328      8    838320   1% /dev
tmpfs             338520    172    338348   1% /run
none                5120      0      5120   0% /run/lock
none              846300      0    846300   0% /run/shm
/dev/xvdb      153899044 192068 145889352   1% /mnt

The new volume is /dev/xvdb and, before using it, it must be formatted appropriately. To do this:

Unmount the /dev/xvdb volume:

sudo umount /dev/xvdb

Format the partition using ext4

sudo mkfs.ext4 -j /dev/xvdb

Mount the volume back on its mount point (/mnt in this case):

sudo mount /dev/xvdb /mnt

Find out where the EBS volume was mounted:

df

The result of the command is:

Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/xvda1       8256952 860640   6976884  11% /
udev              838328      8    838320   1% /dev
tmpfs             338520    172    338348   1% /run
none                5120      0      5120   0% /run/lock
none              846300      0    846300   0% /run/shm
/dev/xvdb      153899044 191936 145889484   1% /mnt

Modify the riak configuration file /etc/riak/app.config:

In the riak_core section, change

{platform_data_dir, "/var/lib/riak"} to:
{platform_data_dir, "/mnt/var/lib/riak"}

and in the bitcask section change

{data_root, "/var/lib/riak/bitcask"} to:
{data_root, "/mnt/var/lib/riak/bitcask"}

On the newly formatted volume mounted under /mnt, prepare the directories that will hold the data:

/mnt/var/lib/riak/bitcask
/mnt/var/lib/riak/kv_node

and make them writable by the riak user, as sketched below.
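A minimal sketch of these two steps, assuming riak runs under the default riak user and group created by the package:

sudo mkdir -p /mnt/var/lib/riak/bitcask /mnt/var/lib/riak/kv_node
sudo chown -R riak:riak /mnt/var/lib/riak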

Then, start the riak service by issuing the command:

sudo service riak start 

After these steps, riak will use the new EBS volume to store data instead of the default storage space offered by the Ubuntu Server instance.


Wednesday 2 January 2013

Delete all records from a riak database

Note: Applies to riak 1.2.1 installed on Ubuntu Server 12.04 LTS (hosted in Amazon EC2)

The riak API allows you to delete a bucket only if it is empty. Sometimes (for example, if you are running various performance tests over the database) the stored data is not important, and a script or program that iterates through all the buckets and deletes their entire content is very time consuming.

A more convenient solution is to delete the directories holding the data on the riak server. To do this, follow the steps below:

1. Stop the riak server

sudo service riak stop

2. Change directory to the riak data

cd /var/lib/riak/bitcask
sudo rm -r *

cd /var/lib/riak/kv_nodes
sudo rm -r *

3. Restart riak service

sudo service riak start

After these steps, the database is empty and riak is up and running.

Setting a riak cluster in Amazon EC2

Just a few words about a riak 1.2.1 cluster installation inside the Amazon AWS network. The official information is poorly documented and some tips were discovered by trial and error, so maybe this post will help you if you are in the same position, building a riak cluster and fighting with poor documentation.

I used Ubuntu Server 12.04 LTS micro instances, 64-bit, for cost reasons.

I use the following terminology to refer to the nodes inside the cluster: a Master node and Slave nodes. The Slave nodes join the Master as they are needed.

The first step is to install riak on every node as it is described here.

First of all, do not use public IPs in configuring the individual nodes. Use the private IPs obtained inside the Amazon AWS network.

Even though I followed the steps exactly, when I tried to start the riak service I ran into the following error:

Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.

After examining the riak console output, obtained by issuing the

riak console

command, the following information turns out to be useful:

** Found 1 name clashes in code paths
08:06:51.374 [info] Application lager started on node 'riak@XX.XXX.XX.XXX'
08:06:51.418 [warning] No ring file available.
08:06:51.457 [error] CRASH REPORT Process <0.165.0> with 0 neighbours exited with reason: eaddrnotavail in gen_server:init_it/6 line 320

This means that another riak node is already running in memory and there is a clash over the port number.

The solution is to issue a

ps aux | grep riak

command and kill all the running riak processes, then restart the service.
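A minimal sketch of this cleanup (riak runs inside the Erlang VM, which shows up as beam.smp; the pkill pattern assumes no other Erlang VM is running on the machine):

ps aux | grep riak          # spot the leftover riak / beam.smp processes
sudo pkill -f beam.smp      # stop them
sudo service riak start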

If the virtual machine hosting the riak service is stopped for any reason, restarting it will change its IP, so the configuration steps must be repeated. Moreover, the ring information is no longer valid. In my situation, this being a test installation, I didn't need the stored data, so I deleted the ring info by issuing the following commands:

cd /var/lib/riak/ring
sudo rm -rf *.*

If the data is valuable, the ring must be rebuilt by performing:

riak-admin reip <old_nodename> <new_nodename>

where  <old_nodename> and <new_nodename> are the names given in /etc/riak/vm.args

Then restart the riak service. It should work now.

If the settings were done on the Master node, leave it as it is and go to the Slave node(s). For a Slave node, the next step is to join it to the Master node. This is done by issuing the command:

riak-admin cluster join <master_nodename>

This is a staged command: the join is only planned, and nothing is applied until the dedicated commit command is issued:

riak-admin cluster commit

Perform these steps for every Slave node in the cluster.
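Put together, the staged sequence for a Slave node looks like this (a minimal sketch; riak-admin cluster plan only prints the planned changes so you can review them before committing, and <master_nodename> is the name from the Master's /etc/riak/vm.args):

riak-admin cluster join <master_nodename>
riak-admin cluster plan
riak-admin cluster commit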

Hope this post helps you set up a riak cluster in Amazon much faster and more smoothly than I did.

Monday 31 January 2011

Storing Java objects with db4o

These days I was looking for a database that could hold native Java objects in a very easy-to-use way. Because, while reading about NoSQL databases, I had come across a few words about the Versant db4o database, I thought this was the time to learn more about it.

The idea behind this kind of database is to store the objects exactly as they exist in your application. There is no need for an additional relational layer, which sometimes (often) maps the fields of a Java object to many fields in many tables of a relational database. This approach brings some immediate advantages:

- no need to install (or use) an RDBMS
- the database access speed increases because we can bypass the relational layer

Speed is also a strong point because, under the hood, the objects are stored as graphs, which allow very efficient algorithms for reading, writing and searching.

The database is available for two programming platforms: Java and .NET. So I started by downloading the Java API and the binaries of the database, version 7.12. The archive is around 48 MB and, after extracting it, we get some directories containing the sources, the Object Manager Enterprise (a kind of administrative tool), and a directory containing documentation and tutorials about how to install and use the database.

The installation is very easy: all you have to do is add to the classpath of your project (I used Eclipse to create a new project) the library built for your installed Java version. Personally I use Java 6, so I chose the db4o-7.12.156.14667-all-java5.jar file.

Storing objects is very simple and intuitive. First we have to create an ObjectContainer, then use its methods, which resemble those used with relational databases (I assume many of us are familiar with the JDBC methods). So, for example, if we have an Athlete object and want to store it in the database, we have to write (in an over-simplified way):

// Db4oEmbedded and ObjectContainer come from the com.db4o package on the classpath
ObjectContainer db = Db4oEmbedded.openFile("db.dbo");
try {
    Athlete athlete = new Athlete("Haile Gebrselasie", "Berlin Marathon", "2:03:59");
    db.store(athlete);   // persist the object graph
    db.commit();
} catch (Exception e) {
    e.printStackTrace();
    db.rollback();
} finally {
    db.close();
}


To search for information in the database, we need queries. Here the queries work over instances of the same kind of object. So we have Query By Example, which works from an input template and returns all the objects whose fields match all the non-default fields of the template. A more general type of query is the Native Query, which is in fact the recommended way to search for information. The lowest-level type is the so-called SODA query (Simple Object Data Access), which works directly with the nodes of the database graph.

For example, to retrieve all Athlete objects from the database using a Query By Example, we first define a prototype, then pass it to the query and read the results:

// prototype: only non-default (non-null) fields would restrict the result
Athlete proto = new Athlete(null, null, null);
List<Athlete> res = db.queryByExample(proto);

System.out.println("Name \t\t Race \t\t Time");
for (Athlete crtAthlete : res) {
    System.out.println(crtAthlete.getName() + "\t\t" + crtAthlete.getRace() + "\t\t" + crtAthlete.getTime());
}
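Just to show how the recommended Native Query flavour looks, here is a minimal sketch, assuming the same Athlete class and the open ObjectContainer db from above (Predicate lives in com.db4o.query; the race filter is only an illustrative condition):

List<Athlete> berlinRunners = db.query(new Predicate<Athlete>() {
    public boolean match(Athlete candidate) {
        // plain Java decides whether an object belongs to the result
        return "Berlin Marathon".equals(candidate.getRace());
    }
});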


I wanted to compare the speed of the db4o database against a true RDBMS (in fact I have MySQL 5 installed on my machine). For this, I defined a table called ATHLETS in my RDBMS, following the schema:

create table athlets(
id int unsigned not null auto_increment primary key,
name varchar(50),
race varchar(50),
besttime varchar(50)
) engine=InnoDB;


Then, for 1000 records having the Name, Race and BestTime fields filled, the insertion time was 2109 milliseconds.

For 1000 objects of type Athlete, with the same fields filled, the total insertion time was 106 milliseconds, so in this test db4o was around 20 times faster than the relational database.
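The db4o side of this micro-benchmark can be sketched roughly as below; this is a minimal sketch with made-up field values, not the exact code that produced the numbers above:

long start = System.currentTimeMillis();
for (int i = 0; i < 1000; i++) {
    db.store(new Athlete("Name " + i, "Race " + i, "2:03:59"));
}
db.commit();
System.out.println("Insertion took " + (System.currentTimeMillis() - start) + " ms");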

To manage the records in the database, Versant provides an Eclipse plugin called Object Manager Enterprise (OME), which is very easy to install and lets you browse and query the records in the database.

I found this database very interesting and useful when you need to store native Java objects. It is amazingly fast and also very easy to use. The queries are checked at compilation time, and so are the parameter types. The database is somewhat schemaless: objects can change their structure and they are persisted as they are, with no need for changes in a database layer (unlike an RDBMS, where a change in the model implies a change in the database structure and, more than that, in the SQL queries). It is embedded directly in the application, so there is no need for an extra RDBMS installed locally or on the network; being integrated in the application means it is loaded in the same process. On top of this, Versant db4o supports ACID transactions.

I also found some cons. One of them, from my point of view, is that for commercial applications, or if you want support from the database provider, you have to pay. The licensing scheme is based on the number of processor cores, so for a powerful server the price can escalate easily.

Saturday 29 January 2011

Who's afraid of the NoSQL databases

NoSQL databases have become a buzzword nowadays and, since I am somewhat involved in working with them, a couple of days ago someone naturally asked me what they are good for and what the deal is with them. Well, there is much to say on this subject, but below I will try to explain the essential things in a few words.

The need for this kind of storage came from web applications dealing with ever larger quantities of data, which were sometimes distributed over more than one computer, or which ran on servers that at some point could no longer handle them because of physical limitations. In other words, they couldn't scale vertically, so they had to scale horizontally. For those not familiar with these terms: with vertical scaling we improve the performance of a database by buying additional memory or increasing the number of processors of the server hosting it, while horizontal scaling improves performance by adding supplementary nodes to a network, nodes that host instances of the database.

Since 2009, a lot of different NoSQL database concepts have appeared, all of them coming with advantages and disadvantages, but each covering specific needs. Some of them are columnar databases (they store data in so-called "columns"), others are key-value based, and others store entire documents and are called document store databases. The main categories are:

Key-Value Databases
Columnar Databases
Document Store Databases
Object Databases
Graph Databases


Other distinct characteristics of NoSQL databases are: they are schemaless, most of them are open source with convenient licensing, and they are distributed and accessible through simple APIs, which makes them attractive for applications running in the cloud, where they can spread over many network nodes. On the other hand, NoSQL databases being usually distributed, they obey Eric Brewer's CAP theorem, which says that for any distributed system the following properties cannot all be provided at the same time:

  • Consistency (all nodes see the same data at the same time)
  • Availability (the system should have a response for every request in a specified amount of time )
  • Partition Tolerance (the system keeps working despite network partitions, i.e. loss of communication between some of its nodes)

So, being distributed automatically means that partitions can occur, and from Brewer's theorem we know we must trade off between the two remaining properties. That is why we have databases that are more available than consistent, or more consistent than available. For example, Cassandra is a highly available database but only eventually consistent, which means that, in the absence of update operations for a period of time, all the nodes will eventually hold the same state.

In a future post I will talk a little about the data model of NoSQL databases.