Below is a collection of all my Data Science work. It consists mostly of IPython notebooks, with code in several languages and packages, and covers topics such as:
- Data Munging
- Data Analysis & Exploration
- Machine Learning
- Data Visualizations
Feel free to use any of this for your own reference but expect more code than text!
Rendezvous with the real-world
These are some of my attempts at creating (and solving) real-world technical challenges, primarily focused on Data Science. I will publish the full source code and other artifacts here once the individual projects are complete.
Predicting Restaurant Health Scores using Yelp User Reviews and SF Open Data
City health inspectors assign a health score to each restaurant they visit. They also record any complaints and ask the owners to correct the issues; once the issues have been corrected, the restaurant is eligible to receive a score. Given Yelp's popularity, and being a Yelper myself, I use this project to see whether user reviews or Yelp restaurant details such as location, neighborhood and price can indicate the likely health score. Perhaps some users mention how unclean a restaurant was, or perhaps in some neighborhoods things are always spic and span.
I call these mini-projects because they are based on common scenarios and applications of Machine Learning. They still cover the full learning process, from data procurement to model comparison and visualization, and are based on real (open) data sets. I intend to host solutions in multiple toolsets here, starting with scikit-learn (plus other Python packages) and R.
Facial Image Compression and Dimensionality Reduction using Principal Component Analysis (PCA)
Understanding k-Nearest Neighbours with the PIMA Indians Diabetes dataset (UCI)
Recognizing Handwritten Digits (UCI ML Repo) with Support Vector Machines (SVM)
Color Compression using the K-Means algorithm
Predicting LendingClub interest rates with Linear Regression
Evaluating the Statlog (German Credit Data) Data Set with Random Forests
California house price predictions with Gradient Boosted Regression Trees
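To give a flavour of what one of these mini-projects covers, here is a minimal from-scratch sketch of the k-nearest-neighbours idea. It is not the notebook's code: the function name `knn_predict` and the toy feature values are made up for illustration, and it uses only the Python standard library rather than scikit-learn so that it runs anywhere.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep only the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbours wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data (two numeric features per sample), NOT taken from the PIMA dataset
train_X = [(85, 22.0), (90, 24.5), (88, 23.1), (160, 33.0), (170, 35.5), (155, 31.2)]
train_y = [0, 0, 0, 1, 1, 1]

high = knn_predict(train_X, train_y, (165, 34.0))
low = knn_predict(train_X, train_y, (87, 23.0))
print(high, low)
```

On real data the features would normally be scaled first, since raw Euclidean distance lets the feature with the largest range dominate the vote.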
The following are some of the Kaggle competitions I've participated in. Where possible, I've also documented my entire approach as IPython notebooks.
Currently participating in (will upload solutions once done):
- Predicting Click-through-Rate for Avazu
- National Data Science Bowl
- Predict probabilistic distribution of hourly rain given polarimetric radar measurements
Saving the Titanic with R & IPython
In this session I work on the Titanic Survival prediction challenge. The solution is entirely in R but is hosted in an IPython notebook. It should also serve as good starter code since it explains multiple models and compares results.
Predicting Forest Cover Types with Ensemble Learning
In this session I work on the Forest Cover Type prediction challenge. The solution is entirely in Python. I try to demonstrate the ensemble approach to machine learning with this competition (with reasonable success).
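As a rough illustration of the ensemble idea, the sketch below combines the predictions of three models by simple majority (hard) voting. It is only one of several ensembling strategies and is not necessarily the one used in the notebook; the model names, predictions and labels are all made up for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions (one list per model) by hard voting."""
    combined = []
    # zip(*predictions) yields, for each sample, the tuple of all models' votes
    for votes in zip(*predictions):
        # On a tie, Counter keeps first-seen order, so the earliest model wins
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(predicted, truth):
    """Fraction of samples where the prediction matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical class predictions from three models on five samples
model_a = [1, 2, 2, 3, 1]
model_b = [1, 2, 3, 3, 2]
model_c = [2, 2, 3, 1, 1]
truth = [1, 2, 3, 3, 1]

ensemble = majority_vote([model_a, model_b, model_c])
for name, pred in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, accuracy(pred, truth))
```

In this toy case each model makes mistakes on different samples, so the vote corrects all of them. Voting helps far less when the models tend to make the same errors.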
The focus here is to understand the concepts given the tool at hand. I intend to compare and contrast multiple tools for the same purpose.
Visualization Tools - matplotlib, ggplot2, lattice etc.