Dive-in to Pandas (Basic)
Pandas is a Python library for doing data analysis. It's really fast and lets you do exploratory work incredibly quickly.
It provides an R-like DataFrame, produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays. In this post, I'll go through the basics of pandas. Check out the (very readable) pandas docs or the pandas cookbook if you want to learn more.
OK, let's get started.
#Importing required Python packages
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint
import pandas as pd
import zipfile
import requests
import StringIO
import datetime
import seaborn
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
Reading from CSV
It's easy to read CSV files directly in Pandas. Let's read a file from the UCI Machine Learning Repo.
Dataset name - Citi Bike Trip Histories
Source - NYC Bike Share/311 calls from NYC Open Data
The dataset has info about where Citi Bikers ride, When they ride, How far they go and so on. We'll be looking at the Bike Trip dataset for Jan 2014.
#The Dataset is a zip file. So let's first read in the dataset through requests then pass it on to Pandas through the
#read_csv command
url=requests.get('https://s3.amazonaws.com/tripdata/201402-citibike-tripdata.zip')
z=zipfile.ZipFile(StringIO.StringIO(url.content))
bike=pd.read_csv(z.open('2014-02 - Citi Bike trip data.csv'))
#Display the Top 5 rows in the dataset
print bike.head()
#Let's specify some optional parameters to make the dataset more digestable
bike=pd.read_csv(z.open('2014-02 - Citi Bike trip data.csv'), encoding='utf8', parse_dates=['starttime','stoptime'],
index_col=None)
bike.set_index('bikeid',drop='False')
#Note above, we could've simply said index_cols = bikeid but it would drop the actual column. To retain bikeid in the
#dataset, I set index to None initially then replaced it with bikeid, this time setting the option not to drop the column.
#If this is complicated, skip and continue, I will illustrate in detail later.
print bike.head()
#The object we've imported bike is called a DataFrame, and is made up of rows and columns.
#Think of this like excel on wheels.
#Let's get some info on the columns our dataframe.
bike.info()
#Some more detail
bike.describe()
#Accessing Columns
print bike['gender'].head() #or just bike.gender
#Plenty of functions/transformations available - this below lists the unique values.
print bike.gender.unique()
#Calling a histogram on a column by using hist()
bike['start station id'].hist()
#Try hitting tab after bike.colname to see options
#Plotting is easy, we could just do a table.colname.plot() but to get something pretty to display, let's do
#a scatter plot
plt.scatter(bike.index[:25], bike['tripduration'][:25], cmap=plt.cm.RdBu)
plt.show()
#Slicing - first 5 rows in the DF
bike[:5]
#Slicing - first 5 rows of specific column(s)
bike[['gender','tripduration']][:5]
##Note the 2 sets of [] - list of lists
#Find the startstation id with the maximum trips
bike['start station id'].value_counts()
#It's station 293 with 2920 trips
#Find the start station with the longest trip
#First find the longest trip
print "Longest Trip: %.6f" %bike.tripduration.max()
#Now retrieve row from DF where duration is the longest
bike[bike.tripduration==bike.tripduration.max()]
#bike[bike.tripduration.idxmax()]
#Find the start station with the longest trip by a Female Biker
#First find the longest trip by a female biker
print "Longest Trip by a Female Biker: %.6f" %bike[bike.gender==1].tripduration.max()
#Now retrieve row from DF where duration is the longest
bike[bike.tripduration==bike[bike.gender==1].tripduration.max()]
#bike[bike.tripduration.idxmax()]
#Plot the Top 10 longest trips by Male Subscribers as a Bar Plot
#Let's first get all Male Subscribers
bike[(bike.usertype=="Subscriber") & (bike.gender==2)]
#Now, valuecounts of tripduration for these subscribers - only the Top 10
print "Top 10 Male Subscribers with the longest trips"
print bike[(bike.usertype=="Subscriber") & (bike.gender==2)].tripduration.value_counts()[:10]
#Now we'll just plot this as a bar
bike[(bike.usertype=="Subscriber") & (bike.gender==2)].tripduration.value_counts()[:10].plot(kind='bar')
plt.suptitle('Top 10 Longest Trips by Male Subscribers', fontsize=14)
plt.show()
#The columns of a Data Frame are of type called Series. These are internally just numpy arrays, which we can find out
#by calling the values() method
bike['start station id'].values
#OK let's dig into some Time Series functionality. Pandas again is excellent in this area.
#Say we want to get the day from the timestamp
temp=pd.DatetimeIndex(bike.starttime)
temp.day
#NOTE: the .day, .month.. functions work only if the Date field is indexed. Since in our case it's not we had to
#temporarily index it. Otherwise we could've just done a bike.starttime.day
#Let's calc and store the weekday. Note we're adding this as a new column to bike
bike['weekday']=temp.weekday
#GROUP BY - now let's say we want to group all bikers by weekday we could do
#Let's add weekday as an index to the DF. We're choosing to append rather than replace the original index.
bike.set_index('weekday',drop='false', append='true')
test=bike.groupby('weekday').aggregate('count')['bikeid']
print test
#Say we wanted to plot this, but we want to change the indexes to something meaningful
test.index=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
test.plot(kind='bar',color='orange')
plt.show()
#Let's now group all Customers on working days
female_customers=bike[(bike.usertype=='Customer')]
grouped=female_customers.groupby('weekday').aggregate('count')
grouped=grouped[(grouped.index.values>=1) & (grouped.index.values<=5)]
grouped.index=['Monday','Tuesday','Wednesday','Thursday','Friday']
grouped['bikeid'].plot(kind='bar')
plt.suptitle('Customers biking on Working Days', fontsize=14)
plt.show()
#OK great let's now grab some weather data from San Francisco from Weather Underground. Firstly we'll define a custom URL
#to grab specific months of data.
#The URL template will have placeholders for dates so we can fill them up as we like.
url_template='http://www.wunderground.com/history/airport/KSFO/{start_year}/{start_month}/{start_day}' \
'/CustomHistory.html?dayend={end_day}&monthend={end_month}&yearend={end_year}' \
'&req_city=NA&req_state=NA&req_statename=NA&MR=1&format=1'
#Let's generate the URL to retrieve data from Sep 1, 2014 to Dec 10,2014
url=url_template.format(start_year=2014, start_month=9, start_day=1, end_year=2014, end_month=12, end_day=10)
#Great, let's now open the URL and read it in as a CSV.
sfo=pd.read_csv(url, header=1, parse_dates=True, index_col='PST')
print sfo.head()
#Let's remove spaces in column names to make things easier down the line.
sfo = sfo.rename(columns=lambda x: x.replace(' ', ''))
sfo.columns
#Great, now let's plot the mean temp
with seaborn.axes_style('darkgrid'):
sfo.MeanTemperatureF.plot(figsize=(15,5), color='brown')
plt.suptitle('San Francisco - Mean Temperatures in Fahrenheit',fontsize=15)
plt.show()
#That was as easy as it could get.
sfo.describe()
#It's easy to concactenate dataframes
#Let's generate the URL to retrieve data from Aug 1, 2014 to Aug 31,2014
url=url_template.format(start_year=2014, start_month=8, start_day=1, end_year=2014, end_month=8, end_day=31)
#Great, let's now open the URL and read it in as a CSV.
sfo_aug=pd.read_csv(url, header=1, parse_dates=True, index_col='PDT')
#Renaming columns
sfo_aug = sfo_aug.rename(columns=lambda x: x.replace(' ', ''))
#Concatenate Aug and Sep months
sfo = pd.concat([sfo_aug,sfo])
sfo.index.get_values
#Saving back to CSV
sfo.to_csv('SFO_Weather_Aug_to_Dec_14.csv')
Share more, Learn more!
Comments
Comments powered by Disqus