We are Cambridge Energy Data Lab, a smart energy startup based in Cambridge, UK.
This blog, named "Cambridge Energy Data Analysis", aims to incrementally unveil our big data analysis and technologies to the world. We are a group of young geeks: computer scientists, data scientists, and serial entrepreneurs, having a passion for smart energy and sustainable world.

Friday 31 October 2014

Big Data Crunching

BigData 2267x1146 white

There has been much talk about Big Data in the last years and the word cloud shows terms commonly related to the definition of Big Data. First and foremost, the most important attribute of Big Data comes as no surprise: its volume! Big Data is, as the name suggests, BIG. What big actually means in regards to bytes or number of records is circumstantial. It becomes big when your traditional way of data processing hits a wall and becomes unfeasible.

The first symptom will be that your data does not fit into memory. In the beginning you might simply beef up your computer with some extra memory. This is commonly called to scale up. A more sophisticated solution would be to load only partial data into memory as disk size is much less of a bottleneck. This is how a database operates. A join operation on two massive tables, e.g. in Postgres, will load and write many chunks of intermediate data but will eventually succeed even though all the data never fit into memory at once. Relational Databases and scaling up to more powerful computers was the gold standard for tackling growing data volumes. Things changed in particular after 2004 with Google's publication of "MapReduce: Simplified Data Processing on Large Clusters"[1].

Instead of running huge databases on expensive supercomputers the trend went to massive parallelisation on clusters of cheap hardware. With this came new challenges which MapReduce successfully addresses:

  • parallelisation must be easy
  • automatic distribution of data between the workers of the cluster
  • fault tolerance
If you process data on a big cluster of cheap hardware the chances are quite high that one of the computers breaks down. In the ACID world of RDBs (all or nothing transactions) this would mean we never get any results.

So what exactly is MapReduce doing differently?

The MapReduce Paradigm

Let's discuss a simple example inspired by a common task in processing genetic data: imagine you have vast amount of strings and you want to trim the last 5 letters of the strings.

In such a task we have to process each single record. This means the task is of linear order:
\[ O(n) \]
where the computational effort grows linearly with the number of records.

However, each record can be processed independently from the other records which allows to scale out the task over multiple processes, cores, or computers.

Let \(k\) be the number of processes, cores, or computers available, the order of our task becomes
\[ O \left ( \frac{n}{k} \right ). \]
This is much better for the case of Big Data when \(n\) is very large as we can control the computational effort easily by increasing \(k\). Additionally, if one subtask fails we only have to rerun that specific subtask. A complete rollback of the transaction is not required as it would be the case in ACID conform RDBs.

Let’s consider a slightly more complex task: the "hello world" program in the world of  MapReduce is the counting of word frequencies in a very big number of documents. As it was the case in the previous task, the word count of a single document is independent from the other documents, this makes the task perfectly suitable for scaling out:

A common pattern is emerging here: we use a function which maps each document to a list of independent word counts. The result of this map is a distributed list of word counts. So far our MapReduce programme comprises the following steps:
  1. distribute the documents over multiple computers in a cluster
  2. apply a word count function on each computer
  3. generate a distributed list of word counts
The next step is to aggregate the distributed list of word counts. However, we want to scale out the aggregation again over multiple computers:

This aggregation step is also called reduce. The steps involved are as follows
  1. send the same words to the same computer for aggregation (called shuffle)
  2. apply a sum function to generate the word count over the complete set of documents (reduce)
And there we have the complete MapReduce paradigm:

Screenshot from 2014-10-31 12:29:09.png

Some tasks might be more difficult to translate into map and reduce steps and can require multiple rounds of mapreduce. However, the mapreduce ecosystem is growing steadily with new libraries implementing now even complex machine learning algorithms in mapreduce [3,4,5].

Last but not least, comparing mapreduce to RDB we see that mapreduce is using schema at read, which is ideal for messy and inconsistent data, and RDB is traditionally using schemas at write. In the world of Big Data the schema at read approach has the following advantages:
  • the flexibility to store data of any kind including unstructured or semi-structured data
  • it allows flexible data consumption
  • it allows the storage of raw data for future processing and changing objectives
  • it removes the cost of data formatting at the moment of data creation which results in faster data availability
  • it allows you to experiment with the data at low risk as the raw data can be kept to correct mistakes

There is always the elephant in the room when speaking about MapReduce: Hadoop!

Most importantly, Hadoop is not MapReduce it is just one implementation of the mapreduce framework! Hadoop is quite a beast and targets the really BIG Big Data. An alternative implementation we are using here at Cambridge Energy Data Lab is Disco.

The main reason we use Disco over Hadoop: Disco jobs are written in Python and Hadoop jobs are mainly written in Java. (Strictly speaking you can also use other languages with Hadoop). Also Disco is much lighter and easier to administrate. [2]

The word count example in Disco is as simple as the underlying problem itself:

  from disco.core import Job, result_iterator

   def map(line, params):
       for word in line.split():
           yield word, 1

   def reduce(iter, params):
       from disco.util import kvgroup
       for word, counts in kvgroup(sorted(iter)):
           yield word, sum(counts)

   if __name__ == '__main__':
       input = [""]
       job = Job().run(input=input, map=map, reduce=reduce)
       for word, count in result_iterator(job.wait()):
           print word, count

I leave it to you to compare this with the Java version for Hadoop: