Predicting House Prices in a Kaggle Machine Learning Competition
Our team, composed of Ansel Santos, Sal Lascano, Yicong Xu, and Moon Kang, joined the House Prices: Advanced Regression Techniques machine learning competition on Kaggle. Participants compete to build the most accurate model for predicting house prices from the data the site provides. Our model scored 0.11599, which made us the champions within our cohort (12th Cohort) and placed our group in the top 9% of Kaggle's public leaderboard.
We used the following references for our work: Stacked Regressions Top 4 on Leaderboard by serigne for the modeling and stacking code, Comprehensive Data Exploration with Python by pmarcelino for our Exploratory Data Analysis (EDA), A Study on Regression Applied to the Ames Dataset by juliencs for our Features Engineering, and Regularized Linear Models by apapiu for setting up our first pipeline.
II. Exploratory Data Analysis
We did an initial analysis of the data using Python's Pandas and Plotly. The plots of the training data showed outliers that needed to be removed, and the distribution of the SalePrice variable revealed that it is skewed and required a log transformation. We dropped the outliers whose GrLivArea was above 4,000 but whose SalePrice was below 300,000, then log-transformed SalePrice to reduce its skewness.
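The two steps above can be sketched as follows. This is a minimal illustration on a toy frame, not our actual pipeline, and the numbers are made up to stand in for the Kaggle training set:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle train set (values are illustrative)
train = pd.DataFrame({
    "GrLivArea": [1500, 2200, 4500, 5000, 1800],
    "SalePrice": [200000, 350000, 180000, 160000, 250000],
})

# Drop outliers: very large houses that sold for suspiciously little
mask = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)
train = train[~mask].reset_index(drop=True)

# Log-transform the target to reduce right skew (log1p is numerically safe near 0)
train["LogSalePrice"] = np.log1p(train["SalePrice"])
```

The same mask would be applied before fitting any model, since extreme points like these pull a linear fit toward them.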
The correlation heat map gave us an overview of which numerical features are important to the target and which variables are highly correlated with each other and could be combined. Variables shaded closer to yellow have a higher correlation with the target variable, while those shaded closer to green have a lower (or negative) correlation.
III. Features Engineering
Features engineering had three parts: filling the missingness, transforming variables, and applying the Box-Cox transformation to numerical variables while dummifying categorical variables.
Filling the Missingness
We counted the missing values in each column to see which features had the most missingness. We handled missingness in two ways. The first was by using the description.txt provided by Kaggle, which explains what empty data points mean for some of the columns and guided our imputation.
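A short sketch of both ideas, on a toy frame mimicking a few Ames columns (the values are illustrative, not from the real data). Per the competition's description file, NA in a column like PoolQC means "no pool", so it is filled with a label rather than a statistic:

```python
import pandas as pd

# Toy frame mimicking a few Ames columns (values are illustrative)
df = pd.DataFrame({
    "PoolQC":      [None, None, "Gd", None],
    "GarageType":  ["Attchd", None, "Detchd", "Attchd"],
    "LotFrontage": [65.0, None, 80.0, 70.0],
})

# Count missing values per column, most-missing first
missing = df.isnull().sum().sort_values(ascending=False)

# NA in PoolQC means "no pool" per the description file, so fill with a label
df["PoolQC"] = df["PoolQC"].fillna("None")

# For a numeric column like LotFrontage, a median imputation is one option
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
```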
The second way we addressed missingness was by determining what kind of missingness occurred and then deciding how to impute. There are three kinds of missingness: missing at random, missing not at random, and missing completely at random. Based on this classification, we chose the imputation method to use.
Plots for Analyzing Transformations
Before moving to the transformations, we plotted the features against SalePrice using scatterplots, boxplots, and distribution plots. These visualizations helped us identify which transformations could potentially increase the accuracy of our predictions.
This section is divided into two parts, specifically, numeric variables that needed to be transformed into categories and categorical variables that needed to be transformed into numeric values.
The MSSubClass, MoSold, and YrSold features are numeric, but on closer analysis they should be categorical. Assuming a linear model, a house with a subclass of 180 is not nine times more valuable than a house of class 20, so this variable should be treated as a category. The same reasoning applies to MoSold and YrSold, because housing market prices do not move in only one direction.
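One simple way to do this conversion, sketched on an illustrative frame, is to cast the numeric codes to strings so that later dummification treats each code as its own category:

```python
import pandas as pd

# Toy frame; in the real data these columns are read in as integers
df = pd.DataFrame({
    "MSSubClass": [20, 60, 180],
    "MoSold": [1, 6, 12],
    "YrSold": [2007, 2008, 2009],
})

# Cast numeric codes to strings so downstream dummification
# treats them as unordered categories, not magnitudes
for col in ["MSSubClass", "MoSold", "YrSold"]:
    df[col] = df[col].astype(str)
```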
The team did the label encoding manually, looking for categorical features that could be simplified by converting them into integers. An example is the basement condition variable, whose categories were mapped as follows: no basement to 0, poor to 1, fair to 2, typical to 3, good to 4, and excellent to 5.
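The basement condition mapping can be expressed as a dictionary passed to pandas' `map`. This sketch assumes missing values were already filled with a "None" label, as described earlier; the category codes (Po, Fa, TA, Gd, Ex) follow the dataset's description file:

```python
import pandas as pd

# Ordinal mapping for basement condition ("None" = no basement, filled earlier)
bsmt_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

df = pd.DataFrame({"BsmtCond": ["None", "TA", "Gd", "Ex"]})
df["BsmtCond"] = df["BsmtCond"].map(bsmt_map)
```

This preserves the natural ordering of the categories, which dummification would discard.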
We created new variables that simplify our ordinal numeric variables by grouping values within a range together. For the variable describing the overall quality of the house, we grouped 1 to 3, 4 to 6, and 7 to 10 together.
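A minimal sketch of this grouping, collapsing the 1-10 overall quality scale into three bands (the new column name `SimplOverallQual` is our illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({"OverallQual": [2, 5, 9, 1, 7]})

# Collapse the 1-10 quality scale into three bands: 1-3 -> 1, 4-6 -> 2, 7-10 -> 3
df["SimplOverallQual"] = df["OverallQual"].replace(
    {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 3, 9: 3, 10: 3}
)
```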
With most of the data set cleaned and transformed, we noticed that some variables could be combined. This was the case for overall quality and condition: since these features are similar to each other, multiplying them together lets our models interpret them as one, which can increase the accuracy of our predictions.
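The combination amounts to a single interaction term; the name `OverallGrade` below is an illustrative choice for the combined column:

```python
import pandas as pd

df = pd.DataFrame({"OverallQual": [7, 5], "OverallCond": [5, 6]})

# Multiply the two related ordinal features into a single interaction term
df["OverallGrade"] = df["OverallQual"] * df["OverallCond"]
```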
Box-Cox Transformation and Dummifying Variables
After doing all of the transformations, numeric variables whose distributions had high skewness were transformed using a Box-Cox transformation, while categorical variables that were not label encoded were dummified.
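These two steps can be sketched as below. The skewness threshold (0.75) and the fixed Box-Cox lambda (0.15) are common choices in public kernels for this dataset, not necessarily the exact values we used, and the frame is a toy stand-in:

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

# Toy frame: one right-skewed numeric column and one categorical column
df = pd.DataFrame({
    "LotArea": [8000, 9500, 12000, 150000],
    "Neighborhood": ["NAmes", "OldTown", "NAmes", "Edwards"],
})

# Box-Cox transform (with a +1 shift) on numeric columns whose skew is high
num_cols = df.select_dtypes(include=[np.number]).columns
skewed = [c for c in num_cols if abs(skew(df[c])) > 0.75]
for c in skewed:
    df[c] = boxcox1p(df[c], 0.15)  # lam=0.15 is a common fixed choice

# One-hot encode (dummify) the remaining categorical columns
df = pd.get_dummies(df)
```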
IV. Modeling
Now that the data is ready, we can start creating a model to predict the LogSalePrice! The group tested various models but ended up stacking two of them: Lasso Regression and Gradient Boosting, which together gave the best prediction of the target variable.
Cross-validation scores helped us decide which models to use. The team ran Lasso Regression, Ridge Regression, Elastic Net, Extreme Gradient Boosting, Gradient Boosting, Light Gradient Boosting, and Random Forest. The lowest cross-validation error came from Lasso Regression, a linear model. This did not surprise us: our exploratory data analysis showed that many features had a noticeable linear relationship with the target variable.
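The comparison looks roughly like this: compute a cross-validated RMSE per model and keep the lowest. The data here is synthetic and the hyperparameters are illustrative, not the values from our actual runs:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training matrix
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

models = {
    "lasso": Lasso(alpha=0.0005, max_iter=50000),
    "ridge": Ridge(alpha=10.0),
    "gboost": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

# 5-fold cross-validated RMSE per model; lower is better
rmse = {}
for name, model in models.items():
    scores = np.sqrt(-cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=5))
    rmse[name] = scores.mean()
```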
The Gradient Boosting model was selected not because it provided the best cross-validation score, but because it improved our model when it was stacked with Lasso. Gradient Boosting, being a tree-based model, complemented Lasso regression on features which did not have a clear linear relationship with the target. We believe that this is the reason why stacking it with Lasso increased prediction accuracy.
We further improved the models by tuning their parameters with the GridSearchCV and RandomizedSearchCV functions in Python's sklearn package. The grid search gave 0.0001 for Lasso's alpha and 0.11 and 13 for Gradient Boosting's learning_rate and min_samples_leaf. Because the values the grid search returned overfit the training sample, we manually adjusted them to 0.0005, 0.05, and 13 respectively for the test data.
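A grid search over Lasso's alpha can be sketched as follows; the candidate grid and the synthetic data are illustrative, and as noted above, the winning value still deserves a manual sanity check against held-out data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training matrix
X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)

# Search a few alpha candidates with 5-fold cross-validation
grid = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": [0.0001, 0.0005, 0.001, 0.01]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
```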
With the models chosen and their parameters set, we ran a stacking routine with Lasso and Gradient Boosting as the base models and Lasso as the meta model. Stacking is a type of ensembling that improves accuracy by combining a list of base models through a meta model. Since our meta model is Lasso, each base model's predictions are multiplied by a beta coefficient, and these betas are learned by running a Lasso regression on the base models' predictions.
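The mechanism can be sketched as follows: each base model produces out-of-fold predictions on the training data, and those predictions become the feature columns the Lasso meta model fits its betas on. This is a simplified sketch on synthetic data, not the stacking code we actually ran:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic stand-in for the prepared training matrix
X, y = make_regression(n_samples=200, n_features=15, noise=10.0, random_state=0)
base_models = [
    Lasso(alpha=0.0005, max_iter=10000),
    GradientBoostingRegressor(n_estimators=100, random_state=0),
]

# Out-of-fold predictions: each row is predicted by a model
# that never saw it during fitting
kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models)))
for j, model in enumerate(base_models):
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict(X[val_idx])

# The meta model learns a weight (beta) for each base model's predictions
meta = Lasso(alpha=0.0005, max_iter=10000)
meta.fit(oof, y)
stacked_pred = meta.predict(oof)
```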
Stacking combines models in a way that can improve the score further. In our case, the cross-validation score of 0.1119 from a plain Lasso model improved to 0.1069 when we used stacking.
The team got a score of 0.11599 when the test set predictions were uploaded to Kaggle, the best within our cohort and within the top 9% of the public leaderboard.
This exercise gave us experience working in a data science team environment. We realized how important it is not to let any member's ego get the better of the team. We also learned to communicate with each other constantly, as this helps the team identify and resolve potential problems before they arise.