Data Munging with Pandas (Advanced)

Pandas is a Python library for doing data analysis. It's really fast and lets you do exploratory work incredibly quickly.

In this post I will demonstrate some advanced data munging/wrangling concepts with the awesome Pandas. I intend to code more and write less but will add help text as much as possible. Check out the pandas help documentation to learn more.

OK, let's get started.

For basics of Pandas, refer to my earlier post Dive-in to Pandas.

In [11]:
#Importing required Python packages
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint
import pandas as pd
import zipfile
import requests
import StringIO
import datetime
import seaborn
from datetime import datetime

np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5) 

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
In [8]:
#Let's read in a sample crime database from Sacramento for our analysis.
crime=pd.read_csv('SacramentocrimeJanuary2006.csv', parse_dates='True', index_col='cdatetime')
In [10]:
crime.head()
Out[10]:
                         address  district        beat  grid  \
cdatetime                                                      
1/1/06 0:00   3108 OCCIDENTAL DR         3  3C          1115   
1/1/06 0:00  2082 EXPEDITION WAY         5  5A          1512   
1/1/06 0:00           4 PALEN CT         2  2A           212   
1/1/06 0:00       22 BECKFORD CT         6  6C          1443   
1/1/06 0:00     3421 AUBURN BLVD         2  2A           508   

                                crimedescr  ucr_ncic_code   latitude  \
cdatetime                                                              
1/1/06 0:00  10851(A)VC TAKE VEH W/O OWNER           2404  38.550420   
1/1/06 0:00     459 PC  BURGLARY RESIDENCE           2204  38.473501   
1/1/06 0:00  10851(A)VC TAKE VEH W/O OWNER           2404  38.657846   
1/1/06 0:00   476 PC PASS FICTICIOUS CHECK           2501  38.506774   
1/1/06 0:00   459 PC  BURGLARY-UNSPECIFIED           2299  38.637448   

              longitude  
cdatetime                
1/1/06 0:00 -121.391416  
1/1/06 0:00 -121.490186  
1/1/06 0:00 -121.462101  
1/1/06 0:00 -121.426951  
1/1/06 0:00 -121.384613  
In [15]:
#Quick Histogram of crimes in different districts.
crime.district.hist(bins=20)
plt.show()
In [29]:
#Let's recapture the datetime as a different field so we could play with it (it's currently set as the index)
crime['date']=crime.index

#Check data type - note that its not in any particular format
crime.date.dtype

#We can convert this field to datetime easily
pd.to_datetime(crime.date)
Out[29]:
cdatetime
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
1/1/06 0:00   2006-01-01
...
1/31/06 23:00   2006-01-31 23:00:00
1/31/06 23:00   2006-01-31 23:00:00
1/31/06 23:00   2006-01-31 23:00:00
1/31/06 23:11   2006-01-31 23:11:00
1/31/06 23:27   2006-01-31 23:27:00
1/31/06 23:31   2006-01-31 23:31:00
1/31/06 23:36   2006-01-31 23:36:00
1/31/06 23:40   2006-01-31 23:40:00
1/31/06 23:41   2006-01-31 23:41:00
1/31/06 23:45   2006-01-31 23:45:00
1/31/06 23:50   2006-01-31 23:50:00
Name: date, Length: 7584, dtype: datetime64[ns]
In [31]:
crime[crime.date.isnull()]
Out[31]:
Empty DataFrame
Columns: [address, district, beat, grid, crimedescr, ucr_ncic_code, latitude, longitude, date]
Index: []
In [48]:
#Merge and Concat

#Let's create two simple dataframes and try to merge them
df1=pd.DataFrame(dict(id=range(10), score=np.random.randint(1,11, size=10)))
df2=pd.DataFrame(dict(id=range(15), precision=np.random.randint(6,20,size=15)))

print df1.describe(), df2.describe()

#So df1 and df2 are just a bunch of random integers. df2 is slightly bigger with some overlap to df1.

#Merge performs an inner join so returns only matching indexes, exactly 10 in our case.
df3 = pd.merge(df1,df2) #we can make this outer join with how='outer'

print df3
             id     score
count  10.00000  10.00000
mean    4.50000   4.80000
std     3.02765   2.20101
min     0.00000   1.00000
25%     2.25000   3.25000
50%     4.50000   5.00000
75%     6.75000   6.00000
max     9.00000   9.00000               id  precision
count  15.000000  15.000000
mean    7.000000  13.133333
std     4.472136   3.758166
min     0.000000   7.000000
25%     3.500000  11.500000
50%     7.000000  13.000000
75%    10.500000  15.500000
max    14.000000  19.000000
   id  score  precision
0   0      3         13
1   1      6         15
2   2      4         11
3   3      6         13
4   4      5         18
5   5      6          8
6   6      5         12
7   7      1          7
8   8      3         17
9   9      9         12

In [49]:
#While concat just appends rows (or cols if axis=1)
df4=pd.concat([df1,df2])

print df4
    id  precision  score
0    0        NaN      3
1    1        NaN      6
2    2        NaN      4
3    3        NaN      6
4    4        NaN      5
5    5        NaN      6
6    6        NaN      5
7    7        NaN      1
8    8        NaN      3
9    9        NaN      9
0    0         13    NaN
1    1         15    NaN
2    2         11    NaN
3    3         13    NaN
4    4         18    NaN
5    5          8    NaN
6    6         12    NaN
7    7          7    NaN
8    8         17    NaN
9    9         12    NaN
10  10         19    NaN
11  11         15    NaN
12  12         16    NaN
13  13         14    NaN
14  14          7    NaN

In [56]:
#Column wise concat
print pd.concat([df1,df2[:10]['precision']],axis=1)

#We can convert this to an inner join with a join = 'inner' parameter
   id  score  precision
0   0      3         13
1   1      6         15
2   2      4         11
3   3      6         13
4   4      5         18
5   5      6          8
6   6      5         12
7   7      1          7
8   8      3         17
9   9      9         12

In [75]:
#We can create hierarchical indexes by specifying keys
df2['recall']=np.random.random(15)
df5=pd.concat([df1,df2[:10]], join='inner', keys=['id1','id2'],axis=1)

df5 #Hierarchical indexes can be referenced as df5.id2.precision and so on.
Out[75]:
   id1         id2                     
    id  score   id  precision    recall
0    0      3    0         13  0.911068
1    1      6    1         15  0.571435
2    2      4    2         11  0.492971
3    3      6    3         13  0.112970
4    4      5    4         18  0.515867
5    5      6    5          8  0.592116
6    6      5    6         12  0.219727
7    7      1    7          7  0.434970
8    8      3    8         17  0.787740
9    9      9    9         12  0.299421
In [152]:
#Now, let's try to find burglaries in sacramento using the str.contains functions
burglaries=crime[crime.crimedescr.str.contains('BURGLARY', 'burglary')][:10]
print burglaries.head()

#Let's find out how many crimes occurred by weekday
burglaries.index=pd.to_datetime(burglaries.index)
burglaries['weekday']=burglaries.index.weekday #capture weekday
burglaries.groupby('weekday')['district'].hist(bins=20)
plt.suptitle('Burglaries occuring by Weekday in Sacramento', fontsize=14)
plt.show()
                         address  district        beat  grid  \
cdatetime                                                      
1/1/06 0:00  2082 EXPEDITION WAY         5  5A          1512   
1/1/06 0:00     3421 AUBURN BLVD         2  2A           508   
1/1/06 0:00        2217 16TH AVE         4  4A           957   
1/1/06 0:00     3421 AUBURN BLVD         2  2A           508   
1/1/06 0:01     5551 REXLEIGH CT         5  5C          1661   

                               crimedescr  ucr_ncic_code   latitude  \
cdatetime                                                             
1/1/06 0:00    459 PC  BURGLARY RESIDENCE           2204  38.473501   
1/1/06 0:00  459 PC  BURGLARY-UNSPECIFIED           2299  38.637448   
1/1/06 0:00      459 PC  BURGLARY VEHICLE           2299  38.537173   
1/1/06 0:00     459 PC  BURGLARY BUSINESS           2203  38.637448   
1/1/06 0:01      459 PC  BURGLARY VEHICLE           2299  38.446592   

              longitude         date  
cdatetime                             
1/1/06 0:00 -121.490186  1/1/06 0:00  
1/1/06 0:00 -121.384613  1/1/06 0:00  
1/1/06 0:00 -121.487577  1/1/06 0:00  
1/1/06 0:00 -121.384613  1/1/06 0:00  
1/1/06 0:01 -121.442378  1/1/06 0:01  

In [182]:
#Let's now read in a different CSV file containing residential/commerical Insurance details from Florida
ins=pd.read_csv('FL_insurance_sample.csv',parse_dates='True', index_col=None)
ins.set_index(['county','policyID'], drop=False, inplace=True) 
#setting index seperately and retaining the col so we can play with it
ins.head()
Out[182]:
                      policyID statecode       county  eq_site_limit  \
county      policyID                                                   
CLAY COUNTY 119736      119736        FL  CLAY COUNTY       498960.0   
            448094      448094        FL  CLAY COUNTY      1322376.3   
            206893      206893        FL  CLAY COUNTY       190724.4   
            333743      333743        FL  CLAY COUNTY            0.0   
            172534      172534        FL  CLAY COUNTY            0.0   

                      hu_site_limit  fl_site_limit  fr_site_limit    tiv_2011  \
county      policyID                                                            
CLAY COUNTY 119736        498960.00       498960.0       498960.0   498960.00   
            448094       1322376.30      1322376.3      1322376.3  1322376.30   
            206893        190724.40       190724.4       190724.4   190724.40   
            333743         79520.76            0.0            0.0    79520.76   
            172534        254281.50            0.0       254281.5   254281.50   

                        tiv_2012  eq_site_deductible  hu_site_deductible  \
county      policyID                                                       
CLAY COUNTY 119736     792148.90                   0              9979.2   
            448094    1438163.57                   0                 0.0   
            206893     192476.78                   0                 0.0   
            333743      86854.48                   0                 0.0   
            172534     246144.49                   0                 0.0   

                      fl_site_deductible  fr_site_deductible  point_latitude  \
county      policyID                                                           
CLAY COUNTY 119736                     0                   0       30.102261   
            448094                     0                   0       30.063936   
            206893                     0                   0       30.089579   
            333743                     0                   0       30.063236   
            172534                     0                   0       30.060614   

                      point_longitude         line construction  \
county      policyID                                              
CLAY COUNTY 119736         -81.711777  Residential      Masonry   
            448094         -81.707664  Residential      Masonry   
            206893         -81.700455  Residential         Wood   
            333743         -81.707703  Residential         Wood   
            172534         -81.702675  Residential         Wood   

                      point_granularity  
county      policyID                     
CLAY COUNTY 119736                    1  
            448094                    3  
            206893                    1  
            333743                    3  
            172534                    1  
In [180]:
#Stack displays by cols instead of rows
ins.stack()
Out[180]:
county       policyID                    
CLAY COUNTY  119736    statecode                   FL
                       eq_site_limit           498960
                       hu_site_limit           498960
                       fl_site_limit           498960
                       fr_site_limit           498960
                       tiv_2011                498960
                       tiv_2012              792148.9
                       eq_site_deductible           0
                       hu_site_deductible      9979.2
                       fl_site_deductible           0
...
PINELLAS COUNTY  398149    tiv_2011                 373488.3
                           tiv_2012                 596003.7
                           eq_site_deductible              0
                           hu_site_deductible              0
                           fl_site_deductible              0
                           fr_site_deductible              0
                           point_latitude           28.06444
                           point_longitude         -82.77459
                           line                  Residential
                           construction              Masonry
                           point_granularity               1
Length: 586144, dtype: object
In [194]:
#It's easy to create pivot tables. We can also use ins.pivot but cannot specify hierarchical indexes and use
#aggregation functions such as below.
ins.pivot_table(rows='county',columns='line',values='eq_site_limit', aggfunc=sum)
Out[194]:
line                 Commercial  Residential
county                                      
ALACHUA COUNTY     3.249922e+08   99142554.6
BAKER COUNTY       2.476577e+06     378068.4
BAY COUNTY         5.988847e+08   41356479.6
BRADFORD COUNTY    3.483000e+07    2445573.6
BREVARD COUNTY     2.399631e+08   37744344.9
BROWARD COUNTY     1.639098e+09  125713440.0
CALHOUN COUNTY     0.000000e+00     827662.5
CHARLOTTE COUNTY   1.556845e+08   31873591.8
CITRUS COUNTY      2.686367e+08   27245415.6
CLAY COUNTY        3.324845e+06   13818491.1
COLLIER COUNTY     3.773078e+08   31509111.6
COLUMBIA COUNTY    1.849905e+06    4706864.1
...                         ...          ...
SANTA ROSA COUNTY  1.915291e+08   52968218.4
SARASOTA COUNTY    4.544602e+08   55764873.9
SEMINOLE COUNTY    2.382608e+08   66281730.3
ST  JOHNS COUNTY   2.631974e+08   37636201.8
SUMTER COUNTY      0.000000e+00    6679728.0
SUWANNEE COUNTY    4.354253e+07   13221386.1
TAYLOR COUNTY      2.279104e+07    9770446.8
UNION COUNTY                NaN     810488.7
VOLUSIA COUNTY     1.674162e+08   60949216.8
WAKULLA COUNTY     3.833100e+06    7965052.2
WALTON COUNTY      3.784400e+07   16741737.9
WASHINGTON COUNTY  0.000000e+00    1067338.8

[67 rows x 2 columns]
In [195]:
#Crosstab simply gives the group frequencies
pd.crosstab(ins.line, ins.construction)
Out[195]:
construction  Masonry  Reinforced Concrete  Reinforced Masonry  Steel Frame  \
line                                                                          
Commercial          0                 1299                4225          272   
Residential      9257                    0                   0            0   

construction   Wood  
line                 
Commercial        0  
Residential   21581  
In [199]:
#Mapping columns to different values/labels is done through the map function
ins.line.unique()
ins['lineno']=ins.line.map({'Residential':'1', 'Commercial':'2'})
ins.lineno.unique()
Out[199]:
array(['1', '2'], dtype=object)
In [200]:
#We can also simply replace
ins.lineno.replace({'1':'Res', '2':'Comm'})
Out[200]:
county       policyID
CLAY COUNTY  119736       Res
             448094       Res
             206893       Res
             333743       Res
             172534       Res
             785275       Res
             995932      Comm
             223488       Res
             433512       Res
             142071       Res
...
PINELLAS COUNTY  592170      Res
                 827844      Res
                 876385      Res
                 167630      Res
                 241496      Res
                 780814      Res
                 404309      Res
                 465875      Res
                 791209      Res
                 322627      Res
                 398149      Res
Name: lineno, Length: 36634, dtype: object
In [202]:
 #Creating a design matrix is very simple (DM is basically 1s and 0s for different values of a col represented as multiple
#columns). For instance
pd.get_dummies(ins.construction).head()
Out[202]:
                      Masonry  Reinforced Concrete  Reinforced Masonry  \
county      policyID                                                     
CLAY COUNTY 119736          1                    0                   0   
            448094          1                    0                   0   
            206893          0                    0                   0   
            333743          0                    0                   0   
            172534          0                    0                   0   

                      Steel Frame  Wood  
county      policyID                     
CLAY COUNTY 119736              0     0  
            448094              0     0  
            206893              0     1  
            333743              0     1  
            172534              0     1  
In [254]:
#We can also cut a column into multiple buckets..similar to grouping
ins['site_buckets']=pd.cut(ins.eq_site_limit, [-1.0, 100000.0, 500000.0, 10000000.0], 3, labels=['B1','B2','B3'])
ins[['eq_site_limit', 'site_buckets']].head()
#pd.qcut cuts by quantiles (percentages)
Out[254]:
                      eq_site_limit site_buckets
county      policyID                            
CLAY COUNTY 119736         498960.0           B2
            448094        1322376.3           B3
            206893         190724.4           B2
            333743              0.0           B1
            172534              0.0           B1
In [259]:
#Sorting by county (asc) and policyID (desc)
ins.sort(['county','policyID'],ascending=[1,0]).head()
Out[259]:
                         policyID statecode          county  eq_site_limit  \
county         policyID                                                      
ALACHUA COUNTY 999597      999597        FL  ALACHUA COUNTY              0   
               999293      999293        FL  ALACHUA COUNTY           2700   
               998177      998177        FL  ALACHUA COUNTY            450   
               997366      997366        FL  ALACHUA COUNTY              0   
               996781      996781        FL  ALACHUA COUNTY            810   

                         hu_site_limit  fl_site_limit  fr_site_limit  \
county         policyID                                                
ALACHUA COUNTY 999597       3130306.73              0              0   
               999293          2700.00           2700           2700   
               998177           450.00            450            450   
               997366        107966.70              0              0   
               996781           810.00            810            810   

                           tiv_2011    tiv_2012  eq_site_deductible  \
county         policyID                                               
ALACHUA COUNTY 999597    3130306.73  3891146.56                   0   
               999293       2700.00     3691.70                   0   
               998177        450.00      542.40                   0   
               997366     107966.70   137551.08                   0   
               996781        810.00      735.35                   0   

                         hu_site_deductible  fl_site_deductible  \
county         policyID                                           
ALACHUA COUNTY 999597                   0.0                   0   
               999293                  54.0                   0   
               998177                   9.0                   0   
               997366                   0.0                   0   
               996781                  16.2                   0   

                         fr_site_deductible  point_latitude  point_longitude  \
county         policyID                                                        
ALACHUA COUNTY 999597                     0       29.621651       -82.378632   
               999293                     0       29.669180       -82.308040   
               998177                     0       29.672756       -82.439331   
               997366                     0       29.682240       -82.387540   
               996781                     0       29.669120       -82.341230   

                                line        construction  point_granularity  \
county         policyID                                                       
ALACHUA COUNTY 999597     Commercial  Reinforced Masonry                  1   
               999293    Residential                Wood                  1   
               998177    Residential                Wood                  1   
               997366    Residential                Wood                  1   
               996781    Residential                Wood                  3   

                        lineno site_buckets  
county         policyID                      
ALACHUA COUNTY 999597        2           B1  
               999293        1           B1  
               998177        1           B1  
               997366        1           B1  
               996781        1           B1  
In [266]:
#Random Sort - permutations
sort=np.random.permutation(len(ins))
ins.take(sort).head()
Out[266]:
                            policyID statecode             county  \
county            policyID                                          
ORANGE COUNTY     622266      622266        FL      ORANGE COUNTY   
DUVAL COUNTY      158604      158604        FL       DUVAL COUNTY   
MIAMI DADE COUNTY 154173      154173        FL  MIAMI DADE COUNTY   
ST  JOHNS COUNTY  312502      312502        FL   ST  JOHNS COUNTY   
DUVAL COUNTY      394483      394483        FL       DUVAL COUNTY   

                            eq_site_limit  hu_site_limit  fl_site_limit  \
county            policyID                                                
ORANGE COUNTY     622266                0   4.194000e+08              0   
DUVAL COUNTY      158604                0   1.411931e+05              0   
MIAMI DADE COUNTY 154173                0   5.553381e+05              0   
ST  JOHNS COUNTY  312502            18315   1.831500e+04          18315   
DUVAL COUNTY      394483                0   5.742139e+05              0   

                            fr_site_limit      tiv_2011      tiv_2012  \
county            policyID                                              
ORANGE COUNTY     622266                0  4.194000e+08  3.960000e+08   
DUVAL COUNTY      158604                0  1.411931e+05  2.208260e+05   
MIAMI DADE COUNTY 154173                0  5.553381e+05  8.225912e+05   
ST  JOHNS COUNTY  312502            18315  1.831500e+04  2.250855e+04   
DUVAL COUNTY      394483                0  5.742139e+05  8.161187e+05   

                            eq_site_deductible  hu_site_deductible  \
county            policyID                                           
ORANGE COUNTY     622266                     0                 0.0   
DUVAL COUNTY      158604                     0                 0.0   
MIAMI DADE COUNTY 154173                     0                 0.0   
ST  JOHNS COUNTY  312502                     0               366.3   
DUVAL COUNTY      394483                     0                 0.0   

                            fl_site_deductible  fr_site_deductible  \
county            policyID                                           
ORANGE COUNTY     622266                     0                   0   
DUVAL COUNTY      158604                     0                   0   
MIAMI DADE COUNTY 154173                     0                   0   
ST  JOHNS COUNTY  312502                     0                   0   
DUVAL COUNTY      394483                     0                   0   

                            point_latitude  point_longitude         line  \
county            policyID                                                 
ORANGE COUNTY     622266         28.377000       -81.558700   Commercial   
DUVAL COUNTY      158604         30.246492       -81.764511  Residential   
MIAMI DADE COUNTY 154173         25.754580       -80.296220  Residential   
ST  JOHNS COUNTY  312502         30.117205       -81.604431  Residential   
DUVAL COUNTY      394483         30.158780       -81.598960  Residential   

                           construction  point_granularity lineno site_buckets  
county            policyID                                                      
ORANGE COUNTY     622266    Steel Frame                  4      2           B1  
DUVAL COUNTY      158604           Wood                  1      1           B1  
MIAMI DADE COUNTY 154173        Masonry                  3      1           B1  
ST  JOHNS COUNTY  312502           Wood                  1      1           B1  
DUVAL COUNTY      394483        Masonry                  1      1           B1  
In [298]:
#Group By and it's relatives - Split Apply Combine

#Let's convert Policy ID to string first so its not included in any numerical operations
ins.policyID=ins.policyID.astype('str')

#Group By Construction type and sum quantities (aggregate)
ins.groupby(ins.construction).agg(sum)

#Convenience function available for common agg type
ins.groupby(ins.construction).mean()
Out[298]:
                       eq_site_limit  hu_site_limit    fl_site_limit  \
construction                                                           
Masonry                193878.508772   8.496695e+05    145441.406633   
Reinforced Concrete   4015452.193995   1.601223e+07   3313604.618938   
Reinforced Masonry     712295.683030   3.709712e+06    652968.647361   
Steel Frame          59778529.411765   1.090228e+08  57188051.470588   
Wood                    23953.954233   9.261205e+04     17715.800510   

                       fr_site_limit      tiv_2011      tiv_2012  \
construction                                                       
Masonry                270188.807011  8.608683e+05  1.041986e+06   
Reinforced Concrete   6748500.692841  1.665118e+07  2.021243e+07   
Reinforced Masonry    1526479.014249  3.824857e+06  4.621373e+06   
Steel Frame          65920036.764706  1.169841e+08  1.334925e+08   
Wood                    30746.678231  9.371507e+04  1.134930e+05   

                     eq_site_deductible  hu_site_deductible  \
construction                                                  
Masonry                      102.921130         3607.542811   
Reinforced Concrete         4659.850115        55441.015012   
Reinforced Masonry          1011.101538        17593.810296   
Steel Frame                62130.272426       252545.330515   
Wood                          16.356888          435.125583   

                     fl_site_deductible  fr_site_deductible  point_latitude  \
construction                                                                  
Masonry                      218.946851            3.402830       27.912242   
Reinforced Concrete          346.420323            0.000000       27.724533   
Reinforced Masonry           762.449964          213.017751       27.691648   
Steel Frame                 3308.833456            0.000000       27.509396   
Wood                          20.952973            1.793244       28.269267   

                     point_longitude  point_granularity  
construction                                             
Masonry                   -81.718078           1.516474  
Reinforced Concrete       -81.567918           1.501925  
Reinforced Masonry        -81.503285           1.479527  
Steel Frame               -81.437564           1.628676  
Wood                      -82.087569           1.734396  
In [299]:
#Change labels as necessary to indicate aggregation
ins.groupby(ins.construction).mean().add_prefix('avg_')
Out[299]:
                     avg_eq_site_limit  avg_hu_site_limit  avg_fl_site_limit  \
construction                                                                   
Masonry                  193878.508772       8.496695e+05      145441.406633   
Reinforced Concrete     4015452.193995       1.601223e+07     3313604.618938   
Reinforced Masonry       712295.683030       3.709712e+06      652968.647361   
Steel Frame            59778529.411765       1.090228e+08    57188051.470588   
Wood                      23953.954233       9.261205e+04       17715.800510   

                     avg_fr_site_limit  avg_tiv_2011  avg_tiv_2012  \
construction                                                         
Masonry                  270188.807011  8.608683e+05  1.041986e+06   
Reinforced Concrete     6748500.692841  1.665118e+07  2.021243e+07   
Reinforced Masonry      1526479.014249  3.824857e+06  4.621373e+06   
Steel Frame            65920036.764706  1.169841e+08  1.334925e+08   
Wood                      30746.678231  9.371507e+04  1.134930e+05   

                     avg_eq_site_deductible  avg_hu_site_deductible  \
construction                                                          
Masonry                          102.921130             3607.542811   
Reinforced Concrete             4659.850115            55441.015012   
Reinforced Masonry              1011.101538            17593.810296   
Steel Frame                    62130.272426           252545.330515   
Wood                              16.356888              435.125583   

                     avg_fl_site_deductible  avg_fr_site_deductible  \
construction                                                          
Masonry                          218.946851                3.402830   
Reinforced Concrete              346.420323                0.000000   
Reinforced Masonry               762.449964              213.017751   
Steel Frame                     3308.833456                0.000000   
Wood                              20.952973                1.793244   

                     avg_point_latitude  avg_point_longitude  \
construction                                                   
Masonry                       27.912242           -81.718078   
Reinforced Concrete           27.724533           -81.567918   
Reinforced Masonry            27.691648           -81.503285   
Steel Frame                   27.509396           -81.437564   
Wood                          28.269267           -82.087569   

                     avg_point_granularity  
construction                                
Masonry                           1.516474  
Reinforced Concrete               1.501925  
Reinforced Masonry                1.479527  
Steel Frame                       1.628676  
Wood                              1.734396  
In [301]:
#We can also define custom aggregation
ins.groupby(ins.construction).transform(lambda x: x-x.mean()/x.std()).add_suffix('_norm').head()
Out[301]:
                      eq_site_limit_norm  hu_site_limit_norm  \
county      policyID                                           
CLAY COUNTY 119736         498959.506785       498957.883794   
            448094        1322375.806785      1322374.183794   
            206893         190724.031858       190723.410099   
            333743             -0.368142        79519.770099   
            172534             -0.368142       254280.510099   

                      fl_site_limit_norm  fr_site_limit_norm   tiv_2011_norm  \
county      policyID                                                           
CLAY COUNTY 119736         498959.597407        498959.41114   498957.797895   
            448094        1322375.897407       1322375.71114  1322374.097895   
            206893         190724.075007        190723.94993   190723.398041   
            333743             -0.324993            -0.45007    79519.758041   
            172534             -0.324993        254281.04993   254280.498041   

                       tiv_2012_norm  eq_site_deductible_norm  \
county      policyID                                            
CLAY COUNTY 119736     792146.947494                -0.060878   
            448094    1438161.617494                -0.060878   
            206893     192475.825423                -0.047723   
            333743      86853.525423                -0.047723   
            172534     246143.535423                -0.047723   

                      hu_site_deductible_norm  fl_site_deductible_norm  \
county      policyID                                                     
CLAY COUNTY 119736                9978.918554                -0.026253   
            448094                  -0.281446                -0.026253   
            206893                  -0.193938                -0.032017   
            333743                  -0.193938                -0.032017   
            172534                  -0.193938                -0.032017   

                      fr_site_deductible_norm  point_latitude_norm  \
county      policyID                                                 
CLAY COUNTY 119736                  -0.028769            12.545728   
            448094                  -0.028769            12.507403   
            206893                  -0.033160            13.092729   
            333743                  -0.033160            13.066386   
            172534                  -0.033160            13.063764   

                      point_longitude_norm  point_granularity_norm  
county      policyID                                                
CLAY COUNTY 119736              -29.815643               -0.536889  
            448094              -29.811530                1.463111  
            206893              -39.692517               -0.544931  
            333743              -39.699765                1.455069  
            172534              -39.694737               -0.544931  
In [317]:
#InList
ins[ins.construction.isin(['Wood','Masonry'])].head()
Out[317]:
                     policyID statecode       county  eq_site_limit  \
county      policyID                                                  
CLAY COUNTY 119736     119736        FL  CLAY COUNTY       498960.0   
            448094     448094        FL  CLAY COUNTY      1322376.3   
            206893     206893        FL  CLAY COUNTY       190724.4   
            333743     333743        FL  CLAY COUNTY            0.0   
            172534     172534        FL  CLAY COUNTY            0.0   

                      hu_site_limit  fl_site_limit  fr_site_limit    tiv_2011  \
county      policyID                                                            
CLAY COUNTY 119736        498960.00       498960.0       498960.0   498960.00   
            448094       1322376.30      1322376.3      1322376.3  1322376.30   
            206893        190724.40       190724.4       190724.4   190724.40   
            333743         79520.76            0.0            0.0    79520.76   
            172534        254281.50            0.0       254281.5   254281.50   

                        tiv_2012  eq_site_deductible  hu_site_deductible  \
county      policyID                                                       
CLAY COUNTY 119736     792148.90                   0              9979.2   
            448094    1438163.57                   0                 0.0   
            206893     192476.78                   0                 0.0   
            333743      86854.48                   0                 0.0   
            172534     246144.49                   0                 0.0   

                      fl_site_deductible  fr_site_deductible  point_latitude  \
county      policyID                                                           
CLAY COUNTY 119736                     0                   0       30.102261   
            448094                     0                   0       30.063936   
            206893                     0                   0       30.089579   
            333743                     0                   0       30.063236   
            172534                     0                   0       30.060614   

                      point_longitude         line construction  \
county      policyID                                              
CLAY COUNTY 119736         -81.711777  Residential      Masonry   
            448094         -81.707664  Residential      Masonry   
            206893         -81.700455  Residential         Wood   
            333743         -81.707703  Residential         Wood   
            172534         -81.702675  Residential         Wood   

                      point_granularity lineno site_buckets  
county      policyID                                         
CLAY COUNTY 119736                    1      1           B2  
            448094                    3      1           B3  
            206893                    1      1           B2  
            333743                    3      1           B1  
            172534                    1      1           B1  
In [319]:
#Simple column renaming
ins.rename(columns = {'eq_site_deductible':'eq_site_deductible'}).head()
Out[319]:
                     policyID statecode       county  eq_site_limit  \
county      policyID                                                  
CLAY COUNTY 119736     119736        FL  CLAY COUNTY       498960.0   
            448094     448094        FL  CLAY COUNTY      1322376.3   
            206893     206893        FL  CLAY COUNTY       190724.4   
            333743     333743        FL  CLAY COUNTY            0.0   
            172534     172534        FL  CLAY COUNTY            0.0   

                      hu_site_limit  fl_site_limit  fr_site_limit    tiv_2011  \
county      policyID                                                            
CLAY COUNTY 119736        498960.00       498960.0       498960.0   498960.00   
            448094       1322376.30      1322376.3      1322376.3  1322376.30   
            206893        190724.40       190724.4       190724.4   190724.40   
            333743         79520.76            0.0            0.0    79520.76   
            172534        254281.50            0.0       254281.5   254281.50   

                        tiv_2012  eq_site_deductible  hu_site_deductible  \
county      policyID                                                       
CLAY COUNTY 119736     792148.90                   0              9979.2   
            448094    1438163.57                   0                 0.0   
            206893     192476.78                   0                 0.0   
            333743      86854.48                   0                 0.0   
            172534     246144.49                   0                 0.0   

                      fl_site_deductible  fr_site_deductible  point_latitude  \
county      policyID                                                           
CLAY COUNTY 119736                     0                   0       30.102261   
            448094                     0                   0       30.063936   
            206893                     0                   0       30.089579   
            333743                     0                   0       30.063236   
            172534                     0                   0       30.060614   

                      point_longitude         line construction  \
county      policyID                                              
CLAY COUNTY 119736         -81.711777  Residential      Masonry   
            448094         -81.707664  Residential      Masonry   
            206893         -81.700455  Residential         Wood   
            333743         -81.707703  Residential         Wood   
            172534         -81.702675  Residential         Wood   

                      point_granularity lineno site_buckets  
county      policyID                                         
CLAY COUNTY 119736                    1      1           B2  
            448094                    3      1           B3  
            206893                    1      1           B2  
            333743                    3      1           B1  
            172534                    1      1           B1  
In [322]:
#Formula renaming
ins.rename(columns=lambda x: x.replace(' ',''), inplace=True)
ins.head()
Out[322]:
                     policyID statecode       county  eq_site_limit  \
county      policyID                                                  
CLAY COUNTY 119736     119736        FL  CLAY COUNTY       498960.0   
            448094     448094        FL  CLAY COUNTY      1322376.3   
            206893     206893        FL  CLAY COUNTY       190724.4   
            333743     333743        FL  CLAY COUNTY            0.0   
            172534     172534        FL  CLAY COUNTY            0.0   

                      hu_site_limit  fl_site_limit  fr_site_limit    tiv_2011  \
county      policyID                                                            
CLAY COUNTY 119736        498960.00       498960.0       498960.0   498960.00   
            448094       1322376.30      1322376.3      1322376.3  1322376.30   
            206893        190724.40       190724.4       190724.4   190724.40   
            333743         79520.76            0.0            0.0    79520.76   
            172534        254281.50            0.0       254281.5   254281.50   

                        tiv_2012  eq_site_deductible  hu_site_deductible  \
county      policyID                                                       
CLAY COUNTY 119736     792148.90                   0              9979.2   
            448094    1438163.57                   0                 0.0   
            206893     192476.78                   0                 0.0   
            333743      86854.48                   0                 0.0   
            172534     246144.49                   0                 0.0   

                      fl_site_deductible  fr_site_deductible  point_latitude  \
county      policyID                                                           
CLAY COUNTY 119736                     0                   0       30.102261   
            448094                     0                   0       30.063936   
            206893                     0                   0       30.089579   
            333743                     0                   0       30.063236   
            172534                     0                   0       30.060614   

                      point_longitude         line construction  \
county      policyID                                              
CLAY COUNTY 119736         -81.711777  Residential      Masonry   
            448094         -81.707664  Residential      Masonry   
            206893         -81.700455  Residential         Wood   
            333743         -81.707703  Residential         Wood   
            172534         -81.702675  Residential         Wood   

                      point_granularity lineno site_buckets  
county      policyID                                         
CLAY COUNTY 119736                    1      1           B2  
            448094                    3      1           B3  
            206893                    1      1           B2  
            333743                    3      1           B1  
            172534                    1      1           B1  
In [323]:
#lower-case all DataFrame column names
ins.columns = map(str.lower, ins.columns)
In [332]:
#Lower-case everything in a DataFrame column
ins.construction = ins.construction.str.lower()
ins.construction.head()
Out[332]:
county       policyID
CLAY COUNTY  119736      masonry
             448094      masonry
             206893         wood
             333743         wood
             172534         wood
Name: construction, dtype: object
In [336]:
#Sort dataframe by multiple columns
ins = ins.sort(['construction','line'],ascending=[1,1])
ins.head(3)
Out[336]:
                     policyid statecode       county  eq_site_limit  \
county      policyID                                                  
CLAY COUNTY 119736     119736        FL  CLAY COUNTY       498960.0   
            448094     448094        FL  CLAY COUNTY      1322376.3   
            785275     785275        FL  CLAY COUNTY            0.0   

                      hu_site_limit  fl_site_limit  fr_site_limit    tiv_2011  \
county      policyID                                                            
CLAY COUNTY 119736        498960.00       498960.0       498960.0   498960.00   
            448094       1322376.30      1322376.3      1322376.3  1322376.30   
            785275        515035.62            0.0            0.0   515035.62   

                        tiv_2012  eq_site_deductible  hu_site_deductible  \
county      policyID                                                       
CLAY COUNTY 119736     792148.90                   0              9979.2   
            448094    1438163.57                   0                 0.0   
            785275     884419.17                   0                 0.0   

                      fl_site_deductible  fr_site_deductible  point_latitude  \
county      policyID                                                           
CLAY COUNTY 119736                     0                   0       30.102261   
            448094                     0                   0       30.063936   
            785275                     0                   0       30.063236   

                      point_longitude         line construction  \
county      policyID                                              
CLAY COUNTY 119736         -81.711777  Residential      masonry   
            448094         -81.707664  Residential      masonry   
            785275         -81.707703  Residential      masonry   

                      point_granularity lineno site_buckets  
county      policyID                                         
CLAY COUNTY 119736                    1      1           B2  
            448094                    3      1           B3  
            785275                    3      1           B1  
In [341]:
#select from DataFrame using multiple keys of a hierarchical index
ins.xs(('CLAY COUNTY', 119736), level=('county', 'policyID'))
Out[341]:
                     policyid statecode       county  eq_site_limit  \
county      policyID                                                  
CLAY COUNTY 119736     119736        FL  CLAY COUNTY         498960   

                      hu_site_limit  fl_site_limit  fr_site_limit  tiv_2011  \
county      policyID                                                          
CLAY COUNTY 119736           498960         498960         498960    498960   

                      tiv_2012  eq_site_deductible  hu_site_deductible  \
county      policyID                                                     
CLAY COUNTY 119736    792148.9                   0              9979.2   

                      fl_site_deductible  fr_site_deductible  point_latitude  \
county      policyID                                                           
CLAY COUNTY 119736                     0                   0       30.102261   

                      point_longitude         line construction  \
county      policyID                                              
CLAY COUNTY 119736         -81.711777  Residential      masonry   

                      point_granularity lineno site_buckets  
county      policyID                                         
CLAY COUNTY 119736                    1      1           B2  
In []:
#Change all NaNs to None - one of the methods of handling NULL, there are several
ins = ins.where((pd.notnull(ins)), None)
In [343]:
#Slice values in a DataFrame column (aka Series)
ins.line.str[0:2].head(3)
Out[343]:
county       policyID
CLAY COUNTY  119736      Re
             448094      Re
             785275      Re
Name: line, dtype: object
In [344]:
#Get quick count of rows in a DataFrame
len(ins.index)
Out[344]:
36634
In [347]:
#change data type of DataFrame column
ins.construction = ins.construction.astype(str)
In [348]:
# Get rid of non-numeric values throughout a DataFrame:
for col in ins.columns.values:
  ins[col] = ins[col].replace('[^0-9]+.-', '', regex=True)
In [350]:
#Clean up missing values in multiple DataFrame columns
ins = ins.fillna({
    'line': 'missing',
    'construction': 'missing',
    'eq_site_limit': '0.0',
})
In [353]:
#Doing calculations with DataFrame columns that have missing values
#In example below, swap in 0 for df['col1'] cells that contain null
ins['eq_new_limit'] = np.where(pd.isnull(ins['eq_site_limit']),0,ins['eq_site_limit']) + ins['fr_site_limit']
ins.head(3)
Out[353]:
                     policyid statecode       county  eq_site_limit  \
county      policyID                                                  
CLAY COUNTY 119736     119736        FL  CLAY COUNTY       498960.0   
            448094     448094        FL  CLAY COUNTY      1322376.3   
            785275     785275        FL  CLAY COUNTY            0.0   

                      hu_site_limit  fl_site_limit  fr_site_limit    tiv_2011  \
county      policyID                                                            
CLAY COUNTY 119736        498960.00       498960.0       498960.0   498960.00   
            448094       1322376.30      1322376.3      1322376.3  1322376.30   
            785275        515035.62            0.0            0.0   515035.62   

                        tiv_2012  eq_site_deductible         ...          \
county      policyID                                         ...           
CLAY COUNTY 119736     792148.90                   0         ...           
            448094    1438163.57                   0         ...           
            785275     884419.17                   0         ...           

                      fl_site_deductible  fr_site_deductible  point_latitude  \
county      policyID                                                           
CLAY COUNTY 119736                     0                   0       30.102261   
            448094                     0                   0       30.063936   
            785275                     0                   0       30.063236   

                      point_longitude         line construction  \
county      policyID                                              
CLAY COUNTY 119736         -81.711777  Residential      masonry   
            448094         -81.707664  Residential      masonry   
            785275         -81.707703  Residential      masonry   

                     point_granularity  lineno site_buckets eq_new_limit  
county      policyID                                                      
CLAY COUNTY 119736                   1       1           B2     997920.0  
            448094                   3       1           B3    2644752.6  
            785275                   3       1           B1          0.0  

[3 rows x 21 columns]
In [400]:
#Delete a column, Copy a DF
ins1=ins.copy()
del ins1['policyid'], ins1['county']

#Reset indexes, that is get back hierarchical indexes as columns
ins1.reset_index(inplace=True)
ins1.head(3)
Out[400]:
        county  policyID statecode  eq_site_limit  hu_site_limit  \
0  CLAY COUNTY    119736        FL       498960.0      498960.00   
1  CLAY COUNTY    448094        FL      1322376.3     1322376.30   
2  CLAY COUNTY    785275        FL            0.0      515035.62   

   fl_site_limit  fr_site_limit    tiv_2011    tiv_2012  eq_site_deductible  \
0       498960.0       498960.0   498960.00   792148.90                   0   
1      1322376.3      1322376.3  1322376.30  1438163.57                   0   
2            0.0            0.0   515035.62   884419.17                   0   

          ...          fl_site_deductible  fr_site_deductible  point_latitude  \
0         ...                           0                   0       30.102261   
1         ...                           0                   0       30.063936   
2         ...                           0                   0       30.063236   

   point_longitude         line construction point_granularity  lineno  \
0       -81.711777  Residential      masonry                 1       1   
1       -81.707664  Residential      masonry                 3       1   
2       -81.707703  Residential      masonry                 3       1   

  site_buckets eq_new_limit  
0           B2     997920.0  
1           B3    2644752.6  
2           B1          0.0  

[3 rows x 21 columns]

Share more, Learn more!





Comments

Comments powered by Disqus