Predicting House Prices with Machine Learning

This is a brief presentation of my analysis. Click here for my full report, or click here for my full Kaggle notebook.

Introduction

A house's value is more than simply location and square footage. Like the features that make up a person, an educated party would want to know every aspect that gives a house its value. A house buyer, or even a seller, may focus on a few key drivers of value such as location or square footage. However, features affect one another: the effect of one feature depends on the presence of others. For example, a house buyer may want a large house with a backyard in a nice location. However, if that house were in a different state, the value of certain features, such as square footage, might change. Machine learning can handle this problem far more easily than regular or even seasoned real estate enthusiasts.

Data

We will be using the Ames data set, which covers houses in Ames, Iowa. This data is part of a Kaggle competition if you want to take a shot at it! Important note: location is one of the most important features when predicting house prices, and in this data set all houses are from Ames, Iowa.

Understanding the Client and their Problem

Client Housebuyer: This client wants to find their next dream home with a reasonable price tag. They have their locations of interest ready. Now, they want to know whether the house price matches the house value. With this study, they can understand which features (e.g., number of bathrooms, location, etc.) influence the final price of the house. If everything matches, they can be confident that they are getting a fair price.

Client Houseseller: Think of the average house-flipper. This client wants to take advantage of the features that influence a house price the most. They typically want to buy a house at a low price and invest in the features that will give the highest return. For example, they might buy a house in a good location but with small square footage, then add rooms at a small cost to get a large return.

Test Variable

As with many other scientific studies, it is best to have a single target variable to predict. In this case, it is the sale price of a house. We are going to see whether different features have different effects on the sale price. This way, our client housebuyer can make sure they are getting the best value out of their home purchase, and our client houseseller can learn which key areas to focus on to maximize profits.

Here is a histogram of the house prices we will be working with. Note: This data is from Ames, Iowa, and location is extremely correlated with Sale Price. (I had to do a double-take at one point, since I consider myself a house-browsing enthusiast.)

Notice how the house prices are more heavily distributed below the mean of $180,921. This makes sense because there should be fewer extravagant houses, given how wealth is distributed in the U.S.!
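To make the "more houses below the mean" idea concrete, here is a minimal sketch using only Python's standard library. The prices are hypothetical values made up for illustration, not rows from the Ames data. In a right-skewed distribution, a few expensive houses pull the mean above the median, so most houses end up below the mean.

```python
import statistics

# Hypothetical sale prices for illustration only (not taken from the Ames data).
prices = [95_000, 120_000, 135_000, 150_000, 160_000,
          175_000, 190_000, 230_000, 310_000, 520_000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# Fraction of houses priced below the mean.
share_below_mean = sum(p < mean_price for p in prices) / len(prices)

print(f"mean:   {mean_price:,.0f}")
print(f"median: {median_price:,.0f}")
print(f"share of houses below the mean: {share_below_mean:.0%}")
```

A mean that sits well above the median is a quick sign of this kind of skew.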

Multivariable Analysis

Something extremely important to know: there are two different types of data, or features, in housing data: categorical and numerical.

Categorical data is just what it sounds like: data that falls into categories. It isn't necessarily linear, but it follows some kind of pattern. For example, take a feature called "Downtown". The response could be "Near", "Far", "Yes", or "No". In the past, living downtown usually meant that you couldn't afford to live uptown. Thus, it could be implied that downtown establishments cost less to live in. However, today, that is not the case. (Thank you, hipsters!) So we can't really establish any particular order in which one response is "better" or "worse" than another.

Numerical data is data in number form. (Who could have thought!) These values sit on a linear scale, so they can be compared directly. For example, a 2,000 square foot place is 2 times "bigger" than a 1,000 square foot place. Plain and simple. Simple and clean.

Numerical data is easier to build models from but more difficult to specify when buying a house. For example, we can't say the exact price of the house we are looking for. However, we could give a range. Categorical data is easy to ask for but difficult to build models from. It's easy to say that we want to live in a certain location, but difficult to model whether certain locations should be more expensive than others. The relationships among all of the data, categorical and numerical, are what give a house its final price. Luckily for us, we have 80 features to play with in this data set: 43 categorical and 37 numerical.
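One common way to feed an unordered categorical feature to a model is one-hot encoding, where each category becomes its own 0/1 column so that no category is treated as "bigger" than another. The sketch below is only an illustration of that idea, not necessarily how the full notebook handles it. The one_hot helper is hypothetical, and the "Downtown" values reuse the made-up example from above.

```python
def one_hot(values):
    """Encode unordered categories as 0/1 indicator columns (hypothetical helper)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Made-up responses for the illustrative "Downtown" feature.
downtown = ["Near", "Far", "Near", "Far"]
categories, encoded = one_hot(downtown)

print(categories)
for original, row in zip(downtown, encoded):
    print(original, row)
```

Each row has exactly one 1, so the model sees which category applies without any implied ranking between "Near" and "Far".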

Let's take a look at the top 10 features most correlated with Sale Price.

The top 3 numerical features most correlated with Sale Price are: Overall Quality, Living Area Square Feet, and Size of Garage.
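A ranking like this can be produced by computing the Pearson correlation between each numerical feature and Sale Price, then sorting by absolute value. Here is a minimal standard-library sketch of that idea. The rows are hypothetical toy values, not the Ames data, and OverallQual and YrSold are column names assumed from the Kaggle data dictionary, so the ordering it prints says nothing about the real data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values for illustration only.
features = {
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [1000, 1500, 1400, 2000, 2600],
    "YrSold": [2008, 2006, 2009, 2007, 2008],
}
sale_price = [120_000, 160_000, 185_000, 250_000, 340_000]

# Rank features by the strength (absolute value) of their correlation.
ranked = sorted(features,
                key=lambda name: abs(pearson(features[name], sale_price)),
                reverse=True)

for name in ranked:
    print(f"{name}: {pearson(features[name], sale_price):.3f}")
```

Correlation only captures linear, one-feature-at-a-time relationships, which is why the model later in this post goes further.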

Let's see a visualization of how overall quality affects sale price.

Let's see a visualization of how living area size affects sale price.

As you can see, there are two outliers in our data set. The two points with GrLivArea above 4,500 will be removed before we build our prediction model.
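Dropping those points is a simple filter on GrLivArea. The sketch below assumes a cutoff of 4,500 based on the description above. The rows are hypothetical stand-ins rather than real Ames records, and the SalePrice column name is assumed from the Kaggle data dictionary.

```python
# Hypothetical rows for illustration only (not real Ames records).
houses = [
    {"GrLivArea": 1500, "SalePrice": 170_000},
    {"GrLivArea": 2400, "SalePrice": 290_000},
    {"GrLivArea": 4700, "SalePrice": 180_000},  # very large but cheap: outlier
    {"GrLivArea": 5600, "SalePrice": 160_000},  # very large but cheap: outlier
]

# Assumed cutoff, taken from the plot description above.
GRLIVAREA_CUTOFF = 4500

# Keep only the houses at or below the cutoff.
cleaned = [house for house in houses if house["GrLivArea"] <= GRLIVAREA_CUTOFF]

print(f"kept {len(cleaned)} of {len(houses)} houses")
```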

Finally, let's see how garage size affects sale price. This time, we will use the number of cars the garage holds instead of its overall square footage to get a better representation.

As you can see, we have an outlier again. The price increases with garage capacity but suddenly drops after 3 cars. We'll have to remove this data before we make our prediction model.

Machine Learning Model to Predict House Prices

After we clean our data, we use several machine learning models: Lasso, Elastic Net, Kernel Ridge, Gradient Boosting, XGBoost, and LightGBM. We then average the scores and use an ensemble prediction to get the best out of every machine learning model. Eventually, we were able to get a root mean squared logarithmic error (RMSLE) of 6.2% on our training set, which is in the top 4% of all submissions on Kaggle. Although this is great for competitive predictions, let's discuss whether it is useful for our clients.
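For readers who want to see the two ideas in this section in code, here is a minimal standard-library sketch of a simple averaging ensemble and the RMSLE metric. It is not the notebook's actual pipeline. The model names and all numbers are hypothetical, and the metric uses the common log(1 + x) convention, which is an assumption here.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared_log_errors = [(math.log1p(p) - math.log1p(a)) ** 2
                          for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def average_ensemble(predictions_by_model):
    """Average the per-house predictions of several models (hypothetical helper)."""
    n_models = len(predictions_by_model)
    return [sum(house_preds) / n_models
            for house_preds in zip(*predictions_by_model.values())]


# Hypothetical prices and predictions for illustration only.
actual = [200_000, 150_000, 310_000]
predictions_by_model = {
    "model_a": [190_000, 160_000, 300_000],
    "model_b": [210_000, 150_000, 330_000],
}

blended = average_ensemble(predictions_by_model)
score = rmsle(actual, blended)

print(blended)
print(f"RMSLE of the blend: {score:.4f}")
```

Because the error is measured on a log scale, being off by 10% on a cheap house counts about as much as being off by 10% on an expensive one.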

Conclusion

Our general clients simply want to know a few features they can focus on to maximize what they are seeking, whether that is house value or profit. This could be done simply by making a correlation chart. However, the machine learning model pushes this study even further. Being able to see how certain features interact with each other can really help websites such as Zillow or Redfin make accurate predictions of house prices. It can also help a tech-savvy user take advantage of datasets to make a huge profit when diving into real estate.

Improvements to Study

A major limitation of this study is that it is focused on one place: Ames, Iowa. Something simple we could do is expand it to cover the entire nation. That way, we could analyze how prices differ by location. We could also record prices each season, every year, to capture price fluctuations. That way, we could do a time-series analysis to predict the best time to buy a house. Finally, we could move on to building a recurrent neural network, given the increased amount of data available and the fact that we would have time-series data to work with.

This challenge is available on Kaggle as “Zillow’s Home Value Prediction (Zestimate)”. The grand prize is $1,000,000 for the best submission.

#Portfolio #HousePrices #DataScience #MachineLearning #FeatureExtraction #Predict #Kaggle #Zillow #MultivariableAnalysis