Data Munging with Pandas (Advanced)
Pandas is a Python library for data analysis. It's fast and lets you do exploratory work incredibly quickly.
In this post I will demonstrate some advanced data munging/wrangling concepts with the awesome Pandas. I intend to code more and write less, but will add explanatory comments wherever possible. Check out the pandas documentation to learn more.
OK, let's get started.
For basics of Pandas, refer to my earlier post Dive-in to Pandas.
In [11]:
#Importing required Python packages
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint
import pandas as pd
import zipfile
import requests
import StringIO
import seaborn
from datetime import datetime
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
In [8]:
#Let's read in a sample crime database from Sacramento for our analysis.
crime=pd.read_csv('SacramentocrimeJanuary2006.csv', index_col='cdatetime')
In [10]:
crime.head()
Out[10]:
In [15]:
#Quick Histogram of crimes in different districts.
crime.district.hist(bins=20)
plt.show()
In [29]:
#Let's recapture the datetime as a different field so we could play with it (it's currently set as the index)
crime['date']=crime.index
#Check the data type - note that it's stored as a generic object, not as a datetime
crime.date.dtype
#We can convert this field to datetime easily
pd.to_datetime(crime.date)
Out[29]:
In [31]:
crime[crime.date.isnull()]
Out[31]:
In [48]:
#Merge and Concat
#Let's create two simple dataframes and try to merge them
df1=pd.DataFrame(dict(id=range(10), score=np.random.randint(1,11, size=10)))
df2=pd.DataFrame(dict(id=range(15), precision=np.random.randint(6,20,size=15)))
print df1.describe(), df2.describe()
#So df1 and df2 are just a bunch of random integers. df2 is slightly bigger with some overlap to df1.
#Merge performs an inner join by default, so it returns only rows with matching keys (the shared 'id' column here), which is exactly 10 rows in our case.
df3 = pd.merge(df1,df2) #we can make this outer join with how='outer'
print df3
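As noted in the comment above, passing how='outer' keeps the non-matching rows as well; a quick sketch (the unmatched scores come back as NaN):
In []:
#Outer join keeps all 15 ids from df2; score is NaN for ids 10-14
print pd.merge(df1, df2, how='outer')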
In [49]:
#While concat just appends rows (or cols if axis=1)
df4=pd.concat([df1,df2])
print df4
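If we don't want the duplicated row labels from the append, ignore_index=True builds a fresh 0..n-1 index instead (a small sketch):
In []:
#Append rows but throw away the original indexes and renumber
print pd.concat([df1, df2], ignore_index=True)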
In [56]:
#Column wise concat
print pd.concat([df1,df2[:10]['precision']],axis=1)
#We can convert this to an inner join with a join = 'inner' parameter
In [75]:
#We can create hierarchical (column) indexes by specifying keys
df2['recall']=np.random.random(15)
df5=pd.concat([df1,df2[:10]], join='inner', keys=['id1','id2'],axis=1)
df5 #Hierarchical indexes can be referenced as df5.id2.precision and so on.
Out[75]:
In [152]:
#Now, let's find burglaries in Sacramento using the str.contains function (keeping just the first 10 matches)
burglaries=crime[crime.crimedescr.str.contains('burglary', case=False)][:10]
print burglaries.head()
#Let's find out how many crimes occurred by weekday
burglaries.index=pd.to_datetime(burglaries.index)
burglaries['weekday']=burglaries.index.weekday #capture weekday
burglaries.groupby('weekday')['district'].hist(bins=20)
plt.suptitle('Burglaries occurring by Weekday in Sacramento', fontsize=14)
plt.show()
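A non-graphical way to see the same breakdown is a simple value_counts on the new weekday column (a quick sketch; 0 is Monday, 6 is Sunday):
In []:
#Count burglaries per weekday number
print burglaries.weekday.value_counts().sort_index()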
In [182]:
#Let's now read in a different CSV file containing residential/commercial insurance details from Florida
ins=pd.read_csv('FL_insurance_sample.csv', index_col=None)
ins.set_index(['county','policyID'], drop=False, inplace=True)
#setting the index separately and retaining the columns so we can play with them
ins.head()
Out[182]:
In [180]:
#Stack pivots the columns into the row index, producing one long Series of values
ins.stack()
Out[180]:
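unstack is the inverse: it moves the innermost index level back out into columns (a quick sketch; it reproduces the original frame assuming the (county, policyID) pairs are unique):
In []:
#Reverse the stack: push the column labels back out of the index
ins.stack().unstack().head()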
In [194]:
#It's easy to create pivot tables. ins.pivot also reshapes data, but unlike pivot_table it can't take hierarchical keys
#or apply aggregation functions, as we do below.
ins.pivot_table(rows='county',columns='line',values='eq_site_limit', aggfunc=sum)
Out[194]:
In [195]:
#Crosstab gives the frequency of each combination of the two columns
pd.crosstab(ins.line, ins.construction)
Out[195]:
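crosstab can also append row and column totals with margins=True (a small extra sketch):
In []:
#Same frequency table, with an 'All' row and column of totals
pd.crosstab(ins.line, ins.construction, margins=True)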
In [199]:
#Mapping columns to different values/labels is done through the map function
ins.line.unique()
ins['lineno']=ins.line.map({'Residential':'1', 'Commercial':'2'})
ins.lineno.unique()
Out[199]:
In [200]:
#We can also simply replace
ins.lineno.replace({'1':'Res', '2':'Comm'})
Out[200]:
In [202]:
#Creating a design matrix is very simple (a design matrix encodes each distinct value of a column as its own 0/1
#indicator column). For instance:
pd.get_dummies(ins.construction).head()
Out[202]:
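get_dummies can also prefix the new column names, which helps keep them apart if we later join them back onto the original frame (a small sketch; the 'constr' prefix is just illustrative):
In []:
#Same design matrix, with prefixed column names
pd.get_dummies(ins.construction, prefix='constr').head()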
In [254]:
#We can also cut a column into discrete buckets based on bin edges, similar to grouping
ins['site_buckets']=pd.cut(ins.eq_site_limit, [-1.0, 100000.0, 500000.0, 10000000.0], labels=['B1','B2','B3'])
ins[['eq_site_limit', 'site_buckets']].head()
#pd.qcut cuts by quantiles (percentages)
Out[254]:
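As a companion to cut, here is a minimal qcut sketch on a throwaway series: qcut picks the bin edges itself so each bucket holds roughly the same number of rows (the Q1-Q4 labels are made up for illustration).
In []:
#Quantile-based buckets: roughly 25% of the values land in each one
s = pd.Series(np.random.randn(1000))
pd.value_counts(pd.qcut(s, 4, labels=['Q1','Q2','Q3','Q4']))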
In [259]:
#Sorting by county (asc) and policyID (desc)
ins.sort(['county','policyID'],ascending=[1,0]).head()
Out[259]:
In [266]:
#Random Sort - permutations
sort=np.random.permutation(len(ins))
ins.take(sort).head()
Out[266]:
In [298]:
#Group By and its relatives - Split, Apply, Combine
#Let's convert Policy ID to string first so it's not included in any numerical operations
ins.policyID=ins.policyID.astype('str')
#Group By Construction type and sum quantities (aggregate)
ins.groupby(ins.construction).agg(sum)
#Convenience function available for common agg type
ins.groupby(ins.construction).mean()
Out[298]:
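agg also accepts a list of functions, so we can get several statistics per group in one shot (a quick sketch on the eq_site_limit column):
In []:
#Mean, sum and count of eq_site_limit for each construction type
ins.groupby(ins.construction)['eq_site_limit'].agg(['mean', 'sum', 'count'])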
In [299]:
#Change labels as necessary to indicate aggregation
ins.groupby(ins.construction).mean().add_prefix('avg_')
Out[299]:
In [301]:
#We can also apply a custom per-group transformation; here we standardize each numeric column within its group
ins.groupby(ins.construction).transform(lambda x: (x-x.mean())/x.std()).add_suffix('_norm').head()
Out[301]:
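If what we actually want is a custom aggregation (a single value per group) rather than a per-row transform, we can hand our own function to agg; a small sketch computing the spread of eq_site_limit within each construction type:
In []:
#Custom aggregation: the range (max minus min) of eq_site_limit per group
ins.groupby(ins.construction)['eq_site_limit'].agg(lambda x: x.max() - x.min())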
In [317]:
#Filter rows where a column's value is in a list (an 'in list' check) using isin
ins[ins.construction.isin(['Wood','Masonry'])].head()
Out[317]:
In [319]:
#Simple column renaming (the new name below is just illustrative; rename returns a new frame)
ins.rename(columns={'eq_site_deductible':'eq_deductible'}).head()
Out[319]:
In [322]:
#Rename with a function: strip spaces out of every column name
ins.rename(columns=lambda x: x.replace(' ',''), inplace=True)
ins.head()
Out[322]:
In [323]:
#lower-case all DataFrame column names
ins.columns = map(str.lower, ins.columns)
In [332]:
#Lower-case everything in a DataFrame column
ins.construction = ins.construction.str.lower()
ins.construction.head()
Out[332]:
In [336]:
#Sort dataframe by multiple columns
ins = ins.sort(['construction','line'],ascending=[1,1])
ins.head(3)
Out[336]:
In [341]:
#select from DataFrame using multiple keys of a hierarchical index
ins.xs(('CLAY COUNTY', 119736), level=('county', 'policyID'))
Out[341]:
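xs also works with a single level, which is handy for pulling out everything for one county (quick sketch):
In []:
#All policies for one county, selecting on just the first index level
ins.xs('CLAY COUNTY', level='county').head()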
In []:
#Change all NaNs to None - one of several ways of handling nulls
ins = ins.where((pd.notnull(ins)), None)
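Other common approaches are counting the nulls per column or dropping the rows that contain them (a brief sketch):
In []:
#Count nulls per column, then drop any row that still has one
print ins.isnull().sum()
ins.dropna(how='any').head(3)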
In [343]:
#Slice values in a DataFrame column (aka Series)
ins.line.str[0:2].head(3)
Out[343]:
In [344]:
#Get quick count of rows in a DataFrame
len(ins.index)
Out[344]:
In [347]:
#change data type of DataFrame column
ins.construction = ins.construction.astype(str)
In [348]:
# Get rid of non-numeric characters throughout a DataFrame (keep digits, dots and minus signs):
for col in ins.columns.values:
    ins[col] = ins[col].replace('[^0-9.-]+', '', regex=True)
In [350]:
#Clean up missing values in multiple DataFrame columns
ins = ins.fillna({
'line': 'missing',
'construction': 'missing',
'eq_site_limit': 0.0,
})
In [353]:
#Doing calculations with DataFrame columns that have missing values
#In the example below, swap in 0 for eq_site_limit cells that are null before adding fr_site_limit
ins['eq_new_limit'] = np.where(pd.isnull(ins['eq_site_limit']),0,ins['eq_site_limit']) + ins['fr_site_limit']
ins.head(3)
Out[353]:
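The same calculation can be written with fillna, which reads a little more naturally (a sketch; eq_new_limit2 is just an illustrative name):
In []:
#Equivalent approach: treat missing eq_site_limit values as 0 before adding
ins['eq_new_limit2'] = ins['eq_site_limit'].fillna(0) + ins['fr_site_limit']
ins[['eq_site_limit', 'fr_site_limit', 'eq_new_limit', 'eq_new_limit2']].head(3)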
In [400]:
#Delete a column, Copy a DF
ins1=ins.copy()
del ins1['policyid'], ins1['county']
#Reset the index, i.e. bring the hierarchical index levels back as columns
ins1.reset_index(inplace=True)
ins1.head(3)
Out[400]:
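drop is the non-destructive sibling of del: it returns a new frame with the columns removed and leaves the original alone (small sketch):
In []:
#Same effect as the del above, but without modifying the frame in place
ins.drop(['policyid'], axis=1).head(3)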
Share more, Learn more!