We are Cambridge Energy Data Lab, a smart energy startup based in Cambridge, UK.
This blog, named "Cambridge Energy Data Analysis", aims to incrementally unveil our big data analysis and technologies to the world. We are a group of young geeks: computer scientists, data scientists, and serial entrepreneurs with a passion for smart energy and a sustainable world.

Wednesday, 18 February 2015

The smart meter rollout: current status

We could write a very long blog post answering questions like "What are smart meters?", "Why do we need smart meters?", "Who is installing smart meters?", or even "Can smart meters read my mind?"... But instead, we'll stick to the data available out there, plot it, and try to analyse it! If you are interested in knowing more about smart meters and how you can benefit from them, we advise you to read the nice and simple article about smart meters on, or our own blog post.

A smart meter looks like this:

Figure 1: A smart meter!

The smart meter rollout timeline

In 2007, the UK government started to investigate the possibility of a smart-meters rollout. In 2009, it was agreed to proceed with the rollout with a target of replacing every single traditional meter with its smart version by 2020. An intermediate target is to have 20 million meters fitted between 2016 and 2018. The peak of smart meter installation should happen in 2019... a year before the target. Let's check where we are now.

Current status of the rollout

Thanks to the great data portal of the UK government, we can access some numbers about the smart meter rollout. The number of domestic meters by type and quarter is represented in figure 2.

Figure 2: Number of domestic gas and electricity meters by meter type and quarter. Click on "Traditional Meters" to realise how far we are from the 2020 target. No reason to panic though... 5 years to go.

First of all, some jargon clarification. "Smart Meters" is the official term designating licensed meters as defined by the regulator Ofgem. "Smart-Type meters" refers to meters installed by utility companies which have some similarities with smart meters (they can store real-time consumption data, be accessed remotely...) but don't fully comply with the current regulation. Therefore, they will have to be replaced by official smart meters by the end of 2020. So "smart meter" is a very precise term, and being able to display electricity consumption does not necessarily qualify a device for the "smart meters" category.

By the end of 2014, 500 thousand smart meters had been installed, which corresponds to a tiny percentage of all gas and electricity meters. The massive rollout should, however, begin in 2015, which should make it an exciting year for smart meters and therefore for electricity data analysis.


Friday, 13 February 2015

Processing multi-dimensional data visually

In an earlier post I discussed the challenges we currently face at CEDL when we look at Big Data. This has been complex already, but we love new challenges here at CEDL. So let’s talk about multi-dimensionality.

If we presume that our data is arranged in a table like this:


then the "big" in big data refers loosely to the number of rows, and the multi-dimensionality of the data refers to the number of columns. Basically, we not only have a lot of data (rows), but the data is also complex due to the high number of features (columns).

Understandably, it is very challenging to extract information from such complex data in particular when we do not know what we are looking for. As part of the data exploration a data scientist will look for patterns or clusters that might tell us more about the processes which shape the data.

Of course, a data scientist wants to use the best tools available to find patterns and clusters in the data, and as it turns out the most powerful machine for pattern detection is the visual cortex! The brain is your very personal supercomputer. The challenge in utilising the brain for detecting patterns in multidimensional data sets does not, thankfully, come down to brain surgery. Nevertheless, a problem remains: how do we interface the visual cortex with the data set? The only, and best, working interface is of course the eyes. All that is required is to transform the data set into a representation suitable for the eyes -> visual cortex interface. You might wonder why this sounds more like an engineering problem than the typical task of a data scientist. Unfortunately, the role of a data scientist is commonly misunderstood. In fact, with today’s challenges the task is not so much about calculating statistics as about engineering a way to access and consume data.
Usually, this happens in the form of charts and plots, and it is up to the data scientist to find a suitable data representation for the problem at hand: to explore data, find answers, and communicate them.

For example, if you want to utilise the pattern recognition abilities of the brain’s visual cortex, Parallel Coordinates [1] are a fantastic way to represent multidimensional data.

In a chart with parallel coordinates each column of the table is a vertical axis and each row becomes a line in the chart. Here follows an example:

This type of chart works with both discrete and continuous data. Additionally, colour and line types can be used to add further context to the data. Obviously, this chart is very simple, but it can help us understand how parallel coordinates work. For example, looking at the location axis we can note that we have three records with location London and three with location Cambridge, while looking at the date axis we note that we have two records for each day.
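To make the mapping from table to chart concrete, here is a minimal sketch in plain Python (no plotting library) of how each row could be turned into one polyline. The function name `axis_positions`, the scaling rules and the sample records are all assumptions made for illustration. They are not taken from the chart above.

```python
def axis_positions(rows, columns):
    """Map every row to one polyline: a list of (axis_index, height) points.

    Numeric columns are scaled linearly to [0, 1]; discrete columns are
    spread evenly over their sorted distinct values.
    """
    scalers = []
    for col in columns:
        values = [row[col] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            lo, hi = min(values), max(values)
            span = (hi - lo) or 1.0  # avoid division by zero for constant columns
            scalers.append(lambda v, lo=lo, span=span: (v - lo) / span)
        else:
            levels = sorted(set(values))
            step = max(len(levels) - 1, 1)
            scalers.append(
                lambda v, levels=levels, step=step: levels.index(v) / step
            )
    return [
        [(i, scalers[i](row[col])) for i, col in enumerate(columns)]
        for row in rows
    ]


# Hypothetical records, invented for illustration only.
records = [
    {"location": "London", "date": "2015-02-01", "kwh": 10.0},
    {"location": "Cambridge", "date": "2015-02-01", "kwh": 6.0},
    {"location": "London", "date": "2015-02-02", "kwh": 14.0},
]
polylines = axis_positions(records, ["location", "date", "kwh"])
for line in polylines:
    print(line)
```

Each inner list holds the vertices of one line. A plotting library would then simply connect them from the leftmost axis to the rightmost one.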

This chart shows you parallel coordinates in full action:

The chart shows a visualization of the mtcars dataset [2]. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of car design and performance for 32 cars (1973–74 models).
One way to explore data in parallel coordinates is called “brushing”: simply select a range on one or more axes and explore how the data segregates.
For example, compare the models with better fuel economy with the models that achieve fewer miles per gallon (mpg):
Whereas the cars with low fuel economy don’t seem to show any specific segregation, the cars with good fuel economy are light cars with 4 cylinders and small displacement.
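Behind the interactive chart, brushing amounts to a range filter over one or more axes. The following sketch shows the idea in plain Python. The `brush` helper is hypothetical, and the four records only borrow mtcars-style column names (`mpg`, `cyl`, `wt`); the values are made up and are not rows of the real dataset.

```python
def brush(rows, ranges):
    """Keep the rows whose value lies inside every brushed (low, high) range."""
    return [
        row for row in rows
        if all(low <= row[axis] <= high for axis, (low, high) in ranges.items())
    ]


# Made-up records with mtcars-style column names (not the real dataset).
cars = [
    {"model": "A", "mpg": 33.0, "cyl": 4, "wt": 1.8},
    {"model": "B", "mpg": 30.5, "cyl": 4, "wt": 1.6},
    {"model": "C", "mpg": 15.0, "cyl": 8, "wt": 3.6},
    {"model": "D", "mpg": 21.0, "cyl": 6, "wt": 2.6},
]

# Brush the mpg axis only: which cars have good fuel economy?
economical = brush(cars, {"mpg": (25.0, 40.0)})
print([car["model"] for car in economical])
```

Brushing several axes at once is just a matter of passing more entries in `ranges`, which narrows the selection further.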

Tuesday, 10 February 2015

Renewable energy in Europe: how far are we from the targets?

In 2009, the European Union set mandatory targets for renewable energy use that every member state has to reach by the year 2020. In this post we will analyse the progress of each member state using the latest estimates released by Eurostat.

Shares of renewable energy in 2012

In the figure below we have the shares of gross final renewable energy consumption for each member state and how far the states are from their targets: Here we note that Sweden, Estonia and Bulgaria have already reached their targets, while Malta, Luxembourg and the UK have the lowest shares of renewable energy in gross final energy consumption. Also, Norway is the country with the highest share of renewable energy. The Netherlands, France and the UK are the countries furthest from their targets.

Increase since 2006

In the following chart we compare the increase of shares from 2006 to 2012 of each country: From this chart we note that all the member states increased their share of renewable energy since 2006. Another interesting fact we note here is that the three states with the highest increases are, in order, Malta, the UK and Belgium, which are also some of the countries furthest from the achievement of their targets.

Evolution of the shares from 2004 to 2012

In this figure we compare the trend of the shares of renewable energy among the biggest European countries, excluding the Scandinavian ones: We can observe that Italy and the UK had the fastest growth of renewable energy shares, but while the UK share has never been comparable to those of the other countries, Italy was able to overtake France and Germany in 2011. We can also see that the German share had the slowest growth and that Spain has been the country with the highest share since 2009.

Wednesday, 5 November 2014

Feed in tariffs: Small scale solar PV cost

In our previous posts we focused on the excess energy that can be generated using photovoltaic (PV) panels and on its trends. In this post we will focus on the price of solar PV installations through two interactive visualizations based on the latest data provided by the UK government about the cost per kW of PV deployments by month.

Cost per kW installed by size band

In the following chart we can compare the cost per kW of solar deployments on a monthly basis for different size bands: Each point of this chart represents the median of the costs in a particular month, while the error bars represent the 95% confidence interval of the cost in a given month. This means that we can be 95% certain that the mean cost per kW lies within the two values reported by the error bar, assuming that the cost data entered is an unbiased sample. We can note that the cost per kW installed has remained almost static over the 12 months, with the median of the cost per kW ranging from £1830 in June 2013 to £2010 in October 2013. The chart also shows that in the higher size bands the cost per kW is lower but, since the confidence intervals are wider, it is also more variable.
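As a rough illustration of how a point and its error bar could be derived for one month and one size band, the sketch below computes the median and a 95% confidence interval for the mean. It assumes a normal approximation (mean ± 1.96 standard errors), which is our assumption and not necessarily the method behind the published data. The cost figures are hypothetical.

```python
import math
import statistics


def median_and_ci95(costs):
    """Return the median and a normal-approximation 95% CI for the mean."""
    n = len(costs)
    mean = statistics.mean(costs)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(costs) / math.sqrt(n)
    half_width = 1.96 * sem
    return statistics.median(costs), (mean - half_width, mean + half_width)


# Hypothetical costs per kW (in GBP) for one month and one size band.
costs = [1750.0, 1820.0, 1900.0, 1980.0, 2050.0]
median, (low, high) = median_and_ci95(costs)
print(median, round(low), round(high))
```

With fewer installations in a band, the standard error grows and the interval widens, which is consistent with the wider error bars seen for the higher size bands. For very small samples a t-distribution would give a wider, more honest interval than the factor 1.96 used here.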

Number of PV installations

In this figure we compare the number of installations per month. As in the previous chart, we differentiated deployments into three different size bands: Here we note that the number of installations increased substantially in the months in which the deployment costs decreased, June 2013 and March 2014. The month with the lowest number of installations is July 2013, while the month with the highest is March 2014. We also see an increasing trend in installations from June 2013 to November 2013.

Friday, 31 October 2014

Big Data Crunching

Figure: word cloud of terms commonly related to Big Data.

There has been much talk about Big Data in recent years, and the word cloud shows terms commonly related to the definition of Big Data. The most important attribute of Big Data comes as no surprise: its volume! Big Data is, as the name suggests, BIG. What big actually means in terms of bytes or number of records depends on the circumstances. Data becomes big when your traditional way of processing it hits a wall and becomes unfeasible.

The first symptom will be that your data does not fit into memory. In the beginning you might simply beef up your computer with some extra memory. This is commonly called scaling up. A more sophisticated solution would be to load only part of the data into memory at a time, as disk size is much less of a bottleneck. This is how a database operates. A join operation on two massive tables, e.g. in Postgres, will load and write many chunks of intermediate data but will eventually succeed even though the data never fits into memory all at once. Relational databases and scaling up to more powerful computers were the gold standard for tackling growing data volumes. Things changed in particular after 2004 with Google's publication of "MapReduce: Simplified Data Processing on Large Clusters"[1].
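The idea of holding only part of the data in memory at any time can be sketched in a few lines of Python. This is an illustration of the principle only, not of how Postgres implements joins. The function name `sum_in_chunks` and the tiny in-memory "file" are assumptions for the example.

```python
import io


def sum_in_chunks(lines, chunk_size=2):
    """Aggregate a stream while holding at most chunk_size records in memory."""
    total = 0.0
    chunk = []
    for line in lines:
        chunk.append(float(line))
        if len(chunk) == chunk_size:
            total += sum(chunk)
            chunk = []  # release the processed chunk
    return total + sum(chunk)  # add the leftover partial chunk


# io.StringIO stands in for a file that is too large to load at once.
readings = io.StringIO("1.5\n2.5\n3.0\n4.0\n5.0\n")
print(sum_in_chunks(readings))
```

Because the file object is consumed line by line, memory use is bounded by `chunk_size` rather than by the size of the input.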

Instead of running huge databases on expensive supercomputers the trend went to massive parallelisation on clusters of cheap hardware. With this came new challenges which MapReduce successfully addresses:

  • parallelisation must be easy
  • automatic distribution of data between the workers of the cluster
  • fault tolerance
If you process data on a big cluster of cheap hardware the chances are quite high that one of the computers breaks down. In the ACID world of RDBs (all or nothing transactions) this would mean we never get any results.

So what exactly is MapReduce doing differently?

The MapReduce Paradigm

Let's discuss a simple example inspired by a common task in processing genetic data: imagine you have a vast number of strings and you want to trim the last 5 letters off each string.

In such a task we have to process each single record. This means the task is of linear order:
\[ O(n) \]
where the computational effort grows linearly with the number of records.

However, each record can be processed independently of the other records, which allows us to scale out the task over multiple processes, cores, or computers.

If \(k\) is the number of processes, cores, or computers available, the order of our task becomes
\[ O \left ( \frac{n}{k} \right ). \]
This is much better for the case of Big Data, when \(n\) is very large, as we can control the computational effort easily by increasing \(k\). Additionally, if one subtask fails we only have to rerun that specific subtask. A complete rollback of the transaction is not required, as would be the case in ACID-compliant RDBs.
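The trimming task can be sketched in Python as follows. The records are split into \(k\) chunks that are processed independently and then concatenated. Threads are used here only to keep the example short and self-contained; in practice the chunks would go to separate processes or machines. The helper names (`trim`, `split`, `trim_all`) and the sample strings are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def trim(record):
    """Drop the last 5 letters of one string."""
    return record[:-5]


def split(records, k):
    """Cut the records into at most k contiguous chunks of nearly equal size."""
    size = max(1, -(-len(records) // k))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def trim_chunk(chunk):
    """Process one chunk; it needs nothing from the other chunks."""
    return [trim(record) for record in chunk]


def trim_all(records, k):
    """Process the chunks independently, then concatenate the results."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = list(pool.map(trim_chunk, split(records, k)))
    return [trimmed for chunk in results for trimmed in chunk]


# Made-up sequences, invented for illustration only.
reads = ["ACGTACGTAC", "TTGACCATGA", "GGCATCGATT", "CATGCATGCA", "AACCGGTTAA"]
print(trim_all(reads, k=2))
```

Since no chunk depends on another, a failed chunk could simply be submitted again without touching the rest, which is the property the paragraph above relies on.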

Let’s consider a slightly more complex task: the "hello world" program in the world of MapReduce is the counting of word frequencies in a very large number of documents. As in the previous task, the word count of a single document is independent of the other documents, which makes the task perfectly suitable for scaling out:

A common pattern is emerging here: we use a function which maps each document to a list of independent word counts. The result of this map is a distributed list of word counts. So far our MapReduce programme comprises the following steps:
  1. distribute the documents over multiple computers in a cluster
  2. apply a word count function on each computer
  3. generate a distributed list of word counts
The next step is to aggregate the distributed list of word counts. However, we want to scale out the aggregation again over multiple computers:

This aggregation step is also called reduce. The steps involved are as follows:
  1. send the same words to the same computer for aggregation (called shuffle)
  2. apply a sum function to generate the word count over the complete set of documents (reduce)
And there we have the complete MapReduce paradigm:

Figure: the complete MapReduce paradigm (map, shuffle, reduce).
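The map, shuffle and reduce steps described above can be imitated on a single machine in plain Python. This is only a sketch of the data flow: the function names are ours and do not belong to any MapReduce framework, the "reducers" are just buckets in a list, and the two documents are invented.

```python
from collections import defaultdict


def map_document(document):
    """Map: emit a (word, 1) pair for every word of one document."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs, n_reducers):
    """Shuffle: route every occurrence of a word to the same reducer."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for word, count in pairs:
        buckets[hash(word) % n_reducers][word].append(count)
    return buckets


def reduce_bucket(bucket):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in bucket.items()}


# Two made-up documents; on a cluster each would be mapped on a different node.
documents = ["the sun is the source", "the grid is smart"]
pairs = [pair for doc in documents for pair in map_document(doc)]

word_counts = {}
for bucket in shuffle(pairs, n_reducers=2):
    word_counts.update(reduce_bucket(bucket))
print(sorted(word_counts.items()))
```

Hashing the word decides which reducer receives it, so all counts for one word end up in the same bucket and each reducer can sum its words without talking to the others.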

Some tasks might be more difficult to translate into map and reduce steps and can require multiple rounds of MapReduce. However, the MapReduce ecosystem is growing steadily, with new libraries now implementing even complex machine learning algorithms in MapReduce [3,4,5].

Last but not least, comparing MapReduce to RDBs we see that MapReduce applies the schema on read, which is ideal for messy and inconsistent data, whereas RDBs traditionally apply the schema on write. In the world of Big Data the schema-on-read approach has the following advantages:
  • the flexibility to store data of any kind including unstructured or semi-structured data
  • it allows flexible data consumption
  • it allows the storage of raw data for future processing and changing objectives
  • it removes the cost of data formatting at the moment of data creation which results in faster data availability
  • it allows you to experiment with the data at low risk as the raw data can be kept to correct mistakes
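A small sketch may help to show what applying a schema at read time looks like. The raw lines are stored exactly as they arrived, and a schema (here: a meter name and a numeric `kwh` field) is applied only when the data is consumed. The field names, the sample lines and the `read_kwh` helper are hypothetical.

```python
import json

# Raw records kept exactly as they arrived, including messy ones.
RAW_LOG = [
    '{"meter": "m1", "kwh": 1.2}',
    '{"meter": "m2", "kwh": "0.8", "note": "estimated"}',
    'corrupted line',
    '{"meter": "m3"}',
]


def read_kwh(raw_lines):
    """Apply a schema only now, at read time: (meter, kwh as float)."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield record["meter"], float(record["kwh"])
        except (ValueError, KeyError, TypeError):
            continue  # skip records that do not fit today's schema


print(list(read_kwh(RAW_LOG)))
```

Because the raw lines are untouched, a different reader with a different schema (for example one that also uses the `note` field) can be written later without re-ingesting anything.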

There is always the elephant in the room when speaking about MapReduce: Hadoop!

Most importantly, Hadoop is not MapReduce; it is just one implementation of the MapReduce framework! Hadoop is quite a beast and targets the really BIG Big Data. An alternative implementation we are using here at Cambridge Energy Data Lab is Disco.

The main reason we use Disco over Hadoop: Disco jobs are written in Python, whereas Hadoop jobs are mainly written in Java. (Strictly speaking, you can also use other languages with Hadoop.) Disco is also much lighter and easier to administer. [2]

The word count example in Disco is as simple as the underlying problem itself:

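   from disco.core import Job, result_iterator

   def map(line, params):
       # Map: emit a (word, 1) pair for every word in a line of input.
       for word in line.split():
           yield word, 1

   def reduce(iter, params):
       # Reduce: sort the pairs so equal words are adjacent, then sum per word.
       from disco.util import kvgroup
       for word, counts in kvgroup(sorted(iter)):
           yield word, sum(counts)

   if __name__ == '__main__':
       input = [""]
       job = Job().run(input=input, map=map, reduce=reduce)
       # Wait for the job to finish, then iterate over the results.
       for word, count in result_iterator(job.wait()):
           print(word, count)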

I leave it to you to compare this with the Java version for Hadoop:

Friday, 19 September 2014

6 screens for monitoring with Raspberry Pi cluster

Do you monitor?

Key performance indicators (KPIs), project progress, server status, user logs (and many more) are constantly changing, and the amount of data collected is growing bigger every day. However, it is hard to monitor these values and to share them with all the members of the team.
So instead, let's display them constantly.

We had too many things to display on a single monitor. Our solution was to buy 6 monitors!

6 screens for monitoring.

In order to handle 6 monitors, we would normally have to buy a massive desktop PC and a top-end graphics card. The solution we found is to use Raspberry Pis! The Raspberry Pi is one of the smallest computers in the world, and it was born in Cambridge, UK. A single Pi is not very powerful, but several of them assembled into a cluster can be powerful enough.

We immediately bought 6 Raspberry Pis. The main issue was to find a rack to store the 6 Pis. It's common to build a case from Lego; however, we didn't have any Lego lying around. The solution we found was to use a shoebox!

Raspberry Pis in a Nike shoe case.
Just Do It!

It looks cool. Cambridge style monitoring environment. Try it like us!

Monday, 15 September 2014

Challenge for Excess Generation

Are you considering buying a photovoltaic (PV) installation for your house? Or do you already have some solar panels on your roof and are looking for ways to maximise the return on this investment? Read on, as we have some important advice for you:

So far the PV industry's growth has mainly been driven by government subsidies. Unfortunately, solar panel installations have not yet achieved a competitive advantage due to their still limited efficiency and high cost. This is bound to change in the future. However, if you are considering investing in a domestic PV installation today, the government-backed financial incentives are key to your investment's return! Here is the secret.

What is Excess Generation?

Excess generation is sometimes also referred to as surplus generation, excess electricity, or exported energy, among other terms. It is defined as the amount of electricity generated by your rooftop panels (1. Total generation) minus your daytime electricity consumption (2. Electricity used). It is this excess generation which is available for export to the grid (3. Export energy).
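The definition translates directly into a small calculation. The sketch below computes it per time interval and assumes, as a simplification of ours, that nothing is exported in intervals where consumption exceeds generation (the result is floored at zero). The hourly values are hypothetical.

```python
def excess_generation(generated_kwh, used_kwh):
    """Excess per interval: generation minus consumption, floored at zero."""
    return [max(g - u, 0.0) for g, u in zip(generated_kwh, used_kwh)]


# Hypothetical hourly values in kWh for a sunny afternoon.
generated = [0.5, 1.5, 2.0, 1.0]
used = [1.0, 0.5, 0.5, 1.0]

excess = excess_generation(generated, used)
print(excess, sum(excess))
```

Working per interval matters: comparing only daily totals would hide the hours in which the household draws from the grid while exporting at other times of the day.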

Feed-in Tariff Incentive Scheme

Feed-in tariffs (FITs) are the most widely used policy in the world for accelerating renewable energy (RE) deployment, accounting for a greater share of RE development than either tax incentives or renewable portfolio standard (RPS) policies. In the European Union (EU), FIT policies have led to the deployment of more than 15,000 MW of solar photovoltaic (PV) power and more than 55,000 MW of wind power between 2000 and the end of 2009. In total, FITs are responsible for approximately 75% of the global PV deployment.

In a grid-connected rooftop photovoltaic power station, the generated electricity can be sold to the grid at a higher price than the grid charges consumers. This arrangement provides a secure return on the installer’s investment. Many consumers across the world are switching to this mechanism because of the revenue it yields. However, the details of the financial mechanism vary between countries, as illustrated by the following two examples:

Case Study 1: Disincentive for Excess Generation (UK)

In the UK, consumers have a stronger incentive to minimise Excess Generation by using the majority of their generated electricity themselves on sunny days. UK customers receive a guaranteed feed-in tariff for all electricity generated (10-14 p/kWh), plus an 'Export tariff' (4.77 p/kWh) for their excess generation. The export tariff, however, is much lower than the average electricity price (12-15 p/kWh). Therefore, customers should consume their generated electricity rather than export it to the grid.

So goes the theory.

However, in reality the share of excess generation is fixed at 50% of PV generation due to a lack of smart meters. Thus, the importance of 'Excess Generation' will definitely emerge with the rollout of smart meters in the near future.
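A small worked example may make the UK incentive clearer. The sketch below adds up feed-in payments and avoided electricity purchases for one year. The annual generation of 3000 kWh and the specific rates (12 p/kWh generation tariff, 13 p/kWh electricity price) are assumptions picked from within the ranges quoted above; the 4.77 p/kWh export tariff is the figure from the text. The function itself is a hypothetical helper, not an official calculation.

```python
def uk_benefit(generated_kwh, exported_kwh, generation_tariff=12.0,
               export_tariff=4.77, retail_price=13.0, deemed_share=None):
    """Yearly benefit in pence: FiT payments plus avoided electricity purchases.

    If deemed_share is set, the export payment is based on that fixed share
    of generation instead of the electricity that was really exported.
    """
    paid_export = (generated_kwh * deemed_share
                   if deemed_share is not None else exported_kwh)
    self_consumed = generated_kwh - exported_kwh
    return (generated_kwh * generation_tariff
            + paid_export * export_tariff
            + self_consumed * retail_price)


GENERATED = 3000.0  # hypothetical annual generation in kWh

# With metered export, exporting less (self-consuming more) pays off.
metered_half = uk_benefit(GENERATED, exported_kwh=1500.0)
metered_low = uk_benefit(GENERATED, exported_kwh=600.0)

# With export deemed at 50%, the export payment no longer depends on behaviour.
deemed_low = uk_benefit(GENERATED, exported_kwh=600.0, deemed_share=0.5)

print(round(metered_half), round(metered_low), round(deemed_low))
```

Under these assumed rates every kWh consumed at home is worth 13p in avoided purchases against 4.77p when exported, which is the disincentive described in this case study. With the deemed 50% share the export payment is a constant, so the meter cannot yet reward or penalise the household's real export behaviour.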

Case Study 2: Incentive for Excess Generation (JAPAN)

In Japan, FiTs are only paid for 'Excess Generation', not 'Total Generation' as is the case in the UK. The FiT price is currently much higher (38-42 JPY per kWh) than the average electricity price (20-25 JPY per kWh), so customers have a strong financial incentive to maximise their amount of 'Excess Generation'. Therefore, customers are willing to change their consumption behaviour by shifting the usage of electricity-heavy appliances, such as dishwashers and washing machines, to the nighttime, when the electricity tariff is cheaper. This individual behavioural change is expected to contribute to a nationwide peak reduction in the future.
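The Japanese incentive can be illustrated with the same kind of back-of-the-envelope calculation. The rates of 40 JPY/kWh for the FiT and 22 JPY/kWh for electricity are assumptions taken from within the ranges quoted above, and the 3000 kWh of annual generation is hypothetical. The sketch deliberately ignores the cost of the consumption that is shifted to the night tariff.

```python
def japan_benefit(generated_kwh, exported_kwh, fit_price=40.0, retail_price=22.0):
    """Yearly benefit in JPY: FiT on exported energy plus avoided purchases."""
    self_consumed = generated_kwh - exported_kwh
    return exported_kwh * fit_price + self_consumed * retail_price


GENERATED = 3000.0  # hypothetical annual generation in kWh

export_half = japan_benefit(GENERATED, exported_kwh=1500.0)
export_most = japan_benefit(GENERATED, exported_kwh=2400.0)
print(export_half, export_most)
```

Here each exported kWh earns more than a self-consumed kWh saves, so the household gains by exporting more. This is the opposite of the UK case.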

Our Challenge for Excess Generation

We are developing Eneberg, a Domestic PV Generation Forecasting and Trading Software. Eneberg mainly deals with aggregated "Excess Generation". Whilst there is a vast body of research and models dealing with PV Generation and Energy Demand, "Excess Generation" is still an open frontier. It is our aim to pioneer this new field of Excess Generation.

In-depth excess generation analysis is already covered by this post: Energy Surplus Trends from Domestic UK Solar Panels in October 2013 to January 2014