Cambridge Energy Data Analysis: Processing multi-dimensional data visually

In an earlier post I discussed the challenges we currently face at CEDL when we look at Big Data (http://blog.camenergydatalab.com/2014/10/big-data-crunching.html). That was complex enough already, but we love new challenges here at CEDL. So let’s talk about multi-dimensionality.

If we presume that our data is arranged in a table like this:

ID	location	date	temperature	humidity
1	London	2015/02/12	4	68
2	Cambridge	2015/02/12	2.5	55

then “big data” refers loosely to the number of rows, and the multi-dimensionality of the data refers to the number of columns. In other words, we not only have a lot of data (rows), but the data is also complex because of its high number of features (columns).

Understandably, it is very challenging to extract information from such complex data, in particular when we do not know what we are looking for. As part of data exploration, a data scientist will look for patterns or clusters that might tell us more about the processes that shape the data.

Of course, a data scientist wants to use the best tools available to find patterns and clusters in the data, and as it turns out the most powerful machine for pattern detection is the visual cortex! The brain is your very personal supercomputer. Thankfully, utilising the brain to detect patterns in multidimensional data sets does not come down to brain surgery. Nevertheless, a problem remains: how do we interface the visual cortex with the data set? The only, and best, working interface is of course the eyes. All that is required is to transform the data set into a representation suitable for the eyes -> visual cortex interface.

You might wonder why this sounds more like an engineering problem than the typical task of a data scientist. Unfortunately, the role of a data scientist is commonly misunderstood. In fact, with today’s challenges the task is not so much about calculating statistics as about engineering a way to access and consume data.

Usually, this happens in the form of charts and plots and it is up to the data scientist to find a suitable data representation for the problem at hand: to explore data, find answers, and to communicate them.

For example, if you want to utilise the pattern recognition abilities of the brain’s visual cortex, Parallel Coordinates [1] are a fantastic way to represent multidimensional data.

In a chart with parallel coordinates, each column of the table becomes a vertical axis and each row becomes a line that crosses every axis at the row's value for that column. Here follows an example:

This type of chart works with both discrete and continuous data. In addition, colour and line types can be used to add further context to the data. Obviously, this chart is very simple, but it can help us understand how parallel coordinates work. For example, looking at the location axis we can see that we have three records with location London and three with location Cambridge, while looking at the date axis we can see that we have two records for each day.
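The mapping from table to chart can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the chart above: the first two records come from the table at the top of this post, the other four are invented so that the set has three records per location and two per day, and the helper names `axis_scale` and `to_polylines` are hypothetical.

```python
# The first two records come from the table in this post; the remaining
# four are invented purely for illustration.
records = [
    {"location": "London",    "date": "2015/02/12", "temperature": 4.0, "humidity": 68},
    {"location": "Cambridge", "date": "2015/02/12", "temperature": 2.5, "humidity": 55},
    {"location": "London",    "date": "2015/02/13", "temperature": 6.0, "humidity": 72},
    {"location": "Cambridge", "date": "2015/02/13", "temperature": 5.0, "humidity": 60},
    {"location": "London",    "date": "2015/02/14", "temperature": 7.5, "humidity": 80},
    {"location": "Cambridge", "date": "2015/02/14", "temperature": 6.5, "humidity": 65},
]


def axis_scale(values):
    """Return a function that maps a value to a position in [0, 1] on its axis."""
    if all(isinstance(v, (int, float)) for v in values):
        # Continuous data: linear scaling between the column minimum and maximum.
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        return lambda v: (v - lo) / span
    # Discrete data: spread the distinct levels evenly along the axis.
    levels = sorted(set(values))
    step = 1.0 / (len(levels) - 1) if len(levels) > 1 else 0.0
    return lambda v: levels.index(v) * step


def to_polylines(rows, columns):
    """Turn each row into a list of (axis index, position on axis) points."""
    scales = {c: axis_scale([r[c] for r in rows]) for c in columns}
    return [[(x, scales[c](r[c])) for x, c in enumerate(columns)] for r in rows]


columns = ["location", "date", "temperature", "humidity"]
polylines = to_polylines(records, columns)

# One polyline per row, one point per column.
print(len(polylines), len(polylines[0]))
```

Each polyline can then be handed to any plotting library as a sequence of x/y points. For purely numeric columns, pandas also provides `pandas.plotting.parallel_coordinates`, which draws the chart directly from a DataFrame.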

This chart shows you parallel coordinates in full action:

The chart shows the visualization of the mtcars dataset [2]. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of car design and performance for 32 cars (1973–74 models).

One way to explore data in parallel coordinates is called “brushing”: simply select a range on one or more axes and explore how the data segregates.

For example, compare the models with better fuel economy with the models that get fewer miles per gallon (mpg):

Whereas the cars with low fuel economy don’t seem to show any specific segregation, the cars with good fuel economy are light cars with 4 cylinders and small displacement.
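Outside an interactive chart, brushing amounts to filtering the rows by a range on each selected axis. The sketch below assumes a handful of rows copied from mtcars [2] (mpg, cyl, disp in cubic inches, wt in 1000 lbs); please check the values against the dataset before relying on them. The function name `brush` and the chosen mpg range are illustrative assumptions, not the selection used for the charts above.

```python
# A few rows from mtcars [2]; values should be verified against the dataset.
cars = [
    {"model": "Mazda RX4",          "mpg": 21.0, "cyl": 6, "disp": 160.0, "wt": 2.620},
    {"model": "Datsun 710",         "mpg": 22.8, "cyl": 4, "disp": 108.0, "wt": 2.320},
    {"model": "Hornet Sportabout",  "mpg": 18.7, "cyl": 8, "disp": 360.0, "wt": 3.440},
    {"model": "Cadillac Fleetwood", "mpg": 10.4, "cyl": 8, "disp": 472.0, "wt": 5.250},
    {"model": "Fiat 128",           "mpg": 32.4, "cyl": 4, "disp": 78.7,  "wt": 2.200},
    {"model": "Honda Civic",        "mpg": 30.4, "cyl": 4, "disp": 75.7,  "wt": 1.615},
    {"model": "Toyota Corolla",     "mpg": 33.9, "cyl": 4, "disp": 71.1,  "wt": 1.835},
]


def brush(rows, ranges):
    """Keep the rows whose value lies inside the selected range on every brushed axis."""
    return [
        r for r in rows
        if all(lo <= r[axis] <= hi for axis, (lo, hi) in ranges.items())
    ]


# Brush the mpg axis for cars with good fuel economy (hypothetical range).
selected = brush(cars, {"mpg": (25.0, 35.0)})

for car in selected:
    print(car["model"], car["cyl"], car["disp"], car["wt"])
```

Brushing several axes at once only requires more entries in `ranges`, for example `{"mpg": (25.0, 35.0), "wt": (0.0, 2.0)}`.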

[1] http://web.cs.ucdavis.edu/~ma/ECS289H/papers/Inselberg1997.pdf

[2] http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html


Friday, 13 February 2015
