Exploratory Visualizations part-2 - Decoding ggplot2

In the first post of this series we looked at the out-of-the-box visualizations possible with matplotlib (here) and the base R graphics package and drew a comparison. In this post, let's take a look at the ever-so-awesome ggplot2.

ggplot2 is invaluable for its sophistication and the way it enables you to write complex plots in just a few lines of code.

Let's get right to it by importing the required packages.


In [1]:
#Import R magic so we can execute R code within this notebook.
%load_ext rmagic

To start with, let's use a dataset that contains meteorological and other data about forest fires in the northeast region of Portugal. The dataset is provided by the UCI Machine Learning Repository.

In [252]:
#Load ggplot2 and read the data (assumes the UCI file forestfires.csv is in the working directory).
library(ggplot2)
fires <- read.csv('forestfires.csv')
head(fires)
  X Y month day FFMC  DMC    DC  ISI temp RH wind rain area
1 7 5   mar fri 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0    0
2 7 4   oct tue 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0    0
3 7 4   oct sat 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0    0
4 8 6   mar fri 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2    0
5 8 6   mar sun 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0    0
6 8 6   aug sun 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0    0

Let's begin by creating a simple scatter plot with ggplot that we'll continue to improve on with subsequent plots.

In [365]:
ggplot(data=fires, aes(x=month, y=area))+geom_point()+
       ylab('burned area')+ggtitle('Understanding forest fires')

Let's look at what we did there. With the ggplot() call, we created a ggplot2 plot object. In it we defined where the data comes from and what goes on the x and y axes. Then we configured the plot object by adding more components to it with +.

geom_point lets us simply plot the data points, as in scatter plots. We then finally added better labels and a title to the plot.

This plot so far tells us that the burned area is at its worst in the summer months of Jul-Sep.

Let's kick this up a notch.

In [237]:
fire_scatter<-ggplot(data=fires, aes(x=temp, y=DC, size=area))
fire_scatter<-fire_scatter+geom_point()+ylab('drought code')+
ggtitle('Understanding forest fires')
#Print the saved object to draw the plot.
fire_scatter

Here we plotted temperature against the drought code (DC), which is an indicator of seasonal drought effects on forest fuels, with the point size governed by the burned area of the forest. This shows that there were more fires when DC was high, and the largest points identify the worst fires.

What if we could identify the months as well in this case?

Also note that we saved the ggplot object in fire_scatter above. This is quite handy for reusing or reconfiguring plots.

In [367]:
ggplot(data=fires, aes(x=temp, y=DC, size=area, col=month))+
geom_point()+
ylab('drought code')+
ggtitle('Understanding forest fires')

There it is. The top two worst fires were in Aug and Sep. We also see again that there were far more fires in the summer. We could of course group the months into seasons so there are fewer colors and it's easier to understand.
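The season grouping mentioned above could be prepared in plain Python before plotting. This is only a sketch: the month-to-season table assumes meteorological seasons for the northern hemisphere (so Sep counts as autumn here), and `month_to_season` is a hypothetical helper, not part of ggplot.

```python
# Assumption: meteorological seasons for the northern hemisphere (Portugal).
SEASONS = {
    'winter': ('dec', 'jan', 'feb'),
    'spring': ('mar', 'apr', 'may'),
    'summer': ('jun', 'jul', 'aug'),
    'autumn': ('sep', 'oct', 'nov'),
}

# Invert the table so each month abbreviation points at its season.
MONTH_TO_SEASON = {m: s for s, months in SEASONS.items() for m in months}


def month_to_season(month):
    """Hypothetical helper: map a month abbreviation to a season name."""
    return MONTH_TO_SEASON[month.lower()]


# The months that appear in the first rows of the fires data.
print([month_to_season(m) for m in ['mar', 'oct', 'aug']])
```

With a pandas DataFrame, something like `fires['season'] = fires['month'].map(MONTH_TO_SEASON)` would add the new column, which could then be used for the color aesthetic instead of month.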

But you can already see how powerful ggplot2 is. It makes data analysis exceedingly simple and also has the added advantage of being aesthetically appealing out of the box.

There is another interesting way to visualize the same thing.

In [368]:
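ggplot(data=fires, aes(x=temp, y=DC, size=area, col=wind))+
geom_point()+
ylab('drought code')+
ggtitle('Understanding forest fires')+
#One panel per month; the orientation of the grid is an assumption.
facet_grid(. ~ month)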

We've done two things here. The first is to use a facet grid to show all twelve months, and the second is to color the points by wind speed. ggplot2 was smart enough to realise wind is a continuous value and gave us a nice color bar.
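Conceptually, a facet splits the rows by one variable and draws the same plot once per group. Here is a minimal pure-Python sketch of that split, using only the month, temp and DC columns of the six rows printed earlier. `facet_by` is a hypothetical helper and says nothing about how ggplot2 implements faceting.

```python
from collections import defaultdict

# (month, temp, DC) taken from the first six rows of the fires table.
rows = [
    ('mar', 8.2, 94.3),
    ('oct', 18.0, 669.1),
    ('oct', 14.6, 686.9),
    ('mar', 8.3, 77.5),
    ('mar', 11.4, 102.2),
    ('aug', 22.2, 488.0),
]


def facet_by(records, key_index):
    """Hypothetical helper: group records into one 'panel' per key value."""
    panels = defaultdict(list)
    for record in records:
        panels[record[key_index]].append(record)
    return dict(panels)


panels = facet_by(rows, 0)
for month, panel in panels.items():
    print(month, len(panel), 'points')
```

Each value in `panels` holds the points that would land in that month's panel.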

The chart is so simple yet so meaningful. There are many more geoms possible with ggplot. Here are a few examples.

In [369]:
ggplot(data=fires, aes(x=RH, y=area, col=wind))+
xlab('Relative Humidity')+
ylab('burned area')+
ggtitle('Understanding forest fires')

In [393]:
ggtitle('Weight, Time and Chicks :-)')

We used the sample R dataset ChickWeight for this example, which shows how chicks (as in chickens) gain weight with time. We can observe that some of them end up at several times their initial weight.

ggplot2 can also display, for instance, the smoothed conditional mean in plots. In the example below we use the diamonds sample dataset and plot carats vs price.

We make use of the jitter option to reduce overplotting and use the smooth parameter to plot a custom (quadratic) smoothed conditional mean over our existing scatter plot.
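To see why jitter helps with overplotting, here is a small pure-Python sketch of the idea: every value is nudged by a little random noise so that identical points no longer sit exactly on top of each other. `jitter` is a hypothetical helper, the carat values are made up for illustration, and the noise amount is an arbitrary choice rather than ggplot2's default.

```python
import random


def jitter(values, amount, seed=0):
    """Hypothetical helper: shift each value by uniform noise in [-amount, amount]."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [v + rng.uniform(-amount, amount) for v in values]


# Made-up carat values: several diamonds share the same nominal weight.
carats = [0.3, 0.3, 0.3, 0.5, 0.5]
print(jitter(carats, amount=0.02))
```

The jittered values stay close to the originals, but points that were identical now get slightly different positions.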

In [466]:
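ggplot(data=diamonds, aes(x=carat, y=price, color=color))+
geom_point(position='jitter')+facet_wrap(~cut)+
geom_smooth(colour='black',method="glm", formula=y~poly(x,2))+
ggtitle('Diamond Carats vs Price')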
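The formula y~poly(x,2) asks for a quadratic curve, and with glm's default Gaussian family that amounts to an ordinary least-squares fit. The sketch below shows what such a fit computes, in plain Python, using raw powers of x (which gives the same fitted curve). `fit_quadratic` is a hypothetical helper and the data are generated from a known curve, not taken from diamonds.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    # Power sums give the 3x3 system (X^T X) b = X^T y.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix [X^T X | X^T y].
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    n = 3
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    b = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * b[j] for j in range(i + 1, n))
        b[i] = (m[i][n] - rest) / m[i][i]
    return b


# Toy data from a known quadratic, so the fit should recover its coefficients.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
b0, b1, b2 = fit_quadratic(xs, ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

The recovered coefficients should be close to 1, 2 and 3, the values used to generate the toy data.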

The above plot revealed an important aspect of ggplot2: it works in layers. We plotted the data points and then, on top of them, the smoothed conditional mean line. Both of these are layers on top of the template x and y mapping we defined using carat and price.

Layers make ggplot2 extremely customizable and powerful. The same plot can be written more verbosely as follows (thanks to sape for this example):

In [468]:
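ggplot() + 
coord_cartesian() +
scale_x_continuous() +
scale_y_continuous() +
scale_color_hue() +
facet_wrap(~cut) +
ggtitle('Diamond Carats vs Price')+
layer(
  data=diamonds, 
  mapping=aes(x=carat, y=price, color=color), 
  stat="identity", geom="point", position=position_jitter()
) +
layer(
  data=diamonds, mapping=aes(x=carat, y=price), 
  stat="smooth", 
  stat_params=list(method="glm", formula=y~poly(x,2)),
  geom="smooth", geom_params=list(color="black"),
  position=position_identity()
)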

As we can see, the initial data points form the first layer, and then we add a second layer for the smoothed quadratic line. We can continue to build as many layers as we want to make really complex plots. Therein lies the awesomeness of ggplot2.
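One way to read the + syntax is "append a layer to the plot specification". The toy Python model below illustrates that reading with operator overloading. The `Plot` and `Layer` classes are hypothetical and are not how ggplot2 or any Python port is implemented.

```python
class Layer:
    """Hypothetical layer: a geom name plus its parameters."""

    def __init__(self, geom, **params):
        self.geom = geom
        self.params = params


class Plot:
    """Hypothetical plot spec: default aesthetics plus an ordered list of layers."""

    def __init__(self, **aes):
        self.aes = aes
        self.layers = []

    def __add__(self, layer):
        # Return a new Plot so the original stays reusable, like a saved plot object.
        new = Plot(**self.aes)
        new.layers = self.layers + [layer]
        return new


base = Plot(x='carat', y='price')
full = base + Layer('point', position='jitter') + Layer('smooth', colour='black')
print([layer.geom for layer in full.layers])
print(len(base.layers))
```

Because `+` returns a new object, `base` stays untouched and can be combined with a different set of layers later.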

We also see that there are multiple ways of writing the same code with ggplot2. You could go from highly verbose, as in the previous example, to extremely light with qplot, another useful option for quick plotting, as shown below.

qplot tries to simplify ggplot2 syntax so even novice users can start using it. However you also lose some functionality and power to customize by lightening up on the syntax.

In [470]:
qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

ggplot for Python

Yes, you heard it right. The team at yhat has created a port of ggplot2 for Python. Let's do some quick plots to see how it stacks up against the real ggplot2.

We'll start with an example from yhat.

In [485]:
from ggplot import *
%matplotlib inline 

a=ggplot(meat, aes('date','beef')) + \
    geom_line(color='black') + \
    scale_x_date(breaks=date_breaks('7 years'), labels='%b %Y')
a
<ggplot: (8765269121165)>

That's not too bad at all. How about one of our earlier scatter plots?

In [518]:
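import pandas as pd

#Load the same data into pandas (assumes forestfires.csv is in the working directory).
fires = pd.read_csv('forestfires.csv')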

ggplot(fires, aes('temp', 'DC', size='area', color='month'))+ \
geom_point()+ \
ylab('drought code')+ \
ggtitle('Understanding forest fires')
<ggplot: (8765266248489)>

Looks pretty neat. With some minimal syntax changes I was able to get it to work. The size parameter doesn't seem to be reflected, though, so all points are drawn at the same size.

Let's try the diamond facet wrap plot with yhat's ggplot.

In [498]:
ggplot(diamonds, aes('carat','price',color='color')) + \
geom_point(position="jitter") + \
geom_smooth(colour='black') + \
facet_wrap('cut') + \
ggtitle('Diamond Carats vs Price')
<ggplot: (8765269233493)>

It works! There are obviously some aesthetic differences here, and I liked the R version a bit better, but there's definitely nothing to complain about. We could even do a facet grid with multiple dimensions. Apologies for the messy chart though!

In [513]:
ggplot(diamonds, aes('carat','price',color='color')) + \
geom_point(position="jitter") + \
geom_smooth(colour='black') + \
facet_grid('cut','color') + \
ggtitle('Diamond Carats vs Price')
<ggplot: (8765265132061)>

So that brings this post to a close. Hopefully you saw the value of ggplot2 and how it can make complicated visualizations a cakewalk while keeping them visually appealing. I am really looking forward to watching yhat's ggplot improve with time.

We'll pick up this series next time with matplotlib and try to improve on its out-of-the-box config.

Thanks for reading.

Share more, Learn more!

