Machine Learning to Predict Future Housing Prices

Friday, March 3, 2023

For the final senior thesis and project of my computer science degree, I chose to write an artificial intelligence based model, built on custom machine learning code, that I trained to predict future housing prices from macroeconomic data. I wanted to dive deeper into machine learning and see what sorts of insights can be found with properly trained data models.

Thesis: A Future Looking Random Forest Regression Model

Many individuals eventually look to purchase a second home as a vacation home. Unfortunately, the qualities of the house are usually the driving factor in the selection, and timing is rarely considered. Once buyers are ready to purchase, they typically buy within a few months of starting their search.

However, given proper macroeconomic indicators, these individuals might delay or speed up their decision to purchase if they knew the housing market would be dropping or rising, respectively, potentially saving tens of thousands of dollars.

Stock market traders have already adopted time-series forecasting AI to gain an advantage in stock trading, but these machine learning models have yet to be trained at scale on the real estate market for the purpose of informing home purchase decisions, whether strictly as investments or as a vacation home (or both).

The current housing market is making quite a bit of use of machine learning technologies. However, the use appears to be highly targeted within a very narrow use case. In fact, this specific, narrow use case is so popular that many beginner level machine learning blogs utilize a house prediction problem to teach linear regression. One can find several examples of a simple linear regression house prediction model online, even using the same data sets.

However, these tutorials are very narrow in scope (mostly because they have to fit the format of a short blog post). As such, the data is normally already cleaned and features chosen prior to the start of the tutorial.

Additionally, while using a linear regression may make training the model simpler, it is not without challenges. For example, house pricing is full of unique cases. Houses can be pricey if they have been owned by a celebrity, or if they have an incredible view. Houses can also lose value if, for example, there has been mold in the basement. As such, the likelihood of outliers is extremely high, and linear regression can have a hard time working accurately around such outliers.

Therefore, there is a need for a more robust machine learning model. In our case, we will be using a Random Forest Regressor in order to deal more effectively with outliers, as well as non-linear relationships. Additionally, our model will be trained on macroeconomic data that will help forecast housing prices well into the future, allowing for more competitive insight into the market.

Cleaning and Exploring the Data

The raw data was ingested from CSV files located in the project folder. We used Pandas’ built-in “read_csv” function to read and parse the data into a Pandas DataFrame. This was performed for each CSV file, and then each one was cleaned and its features renamed for easier manipulation and analysis.

Below is a sample of the code to ingest the data, along with the data. One dataset covers the physical features of the houses, and the other covers the historic economic data used in the model training. The raw code is available for examination in the “data” folder within the project folder.
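As a rough illustration of the ingestion step, here is a minimal sketch of reading one CSV with read_csv, renaming its columns, and removing NaN rows. The raw column names, the sample values, the load_and_clean helper, and the choice of dropna are all assumptions made for this example; an in-memory string stands in for a file from the project folder.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the project's CSV files; the raw column
# names and values are invented for illustration only.
RAW_CSV = io.StringIO(
    "DATE,CPIAUCSL,FEDFUNDS\n"
    "2019-01-01,252.6,2.40\n"
    "2019-02-01,,2.40\n"
    "2019-03-01,254.1,2.41\n"
)


def load_and_clean(source, renames):
    """Read a CSV into a DataFrame, rename its columns, and drop NaN rows."""
    df = pd.read_csv(source, parse_dates=["DATE"])
    df = df.rename(columns=renames)
    # Assumption: rows with missing values are simply dropped here.
    return df.dropna().reset_index(drop=True)


df_econ = load_and_clean(
    RAW_CSV,
    {"DATE": "Date", "CPIAUCSL": "CPI", "FEDFUNDS": "FedFundsRate"},
)
print(df_econ)
```

With a real file, the same helper would take a path such as a CSV in the project folder in place of the in-memory string.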

The data was first processed and cleaned for two reasons. The first was to eliminate any NaN values and the second was to properly set up certain datasets. For example, when the lumber data was graphed against the housing price data, there was clearly a strong correlation. However, lumber was a leading indicator and would typically run about two years ahead of the housing market. To help the machine learning model see the correlation, I offset the lumber data to line up with the prices, so that correlated movements appeared at the same time rather than separated by such a large delay.
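The offset described above can be sketched with Pandas' shift method. This is only an illustration on synthetic data: it assumes one row per month, a raw column named Lumber (hypothetical), and a lead of exactly 24 rows.

```python
import pandas as pd

# Synthetic monthly series, invented for illustration only.
dates = pd.date_range("2010-01-01", periods=48, freq="MS")
df_e = pd.DataFrame({
    "Date": dates,
    "Lumber": list(range(100, 148)),   # hypothetical raw lumber prices
    "Prices": list(range(200, 248)),   # hypothetical housing prices
})

LEAD_MONTHS = 24  # assumed lead of about two years, on monthly rows

# shift(24) moves every lumber value 24 rows later, so the lumber price
# from two years earlier sits on the same row as the current house price.
df_e["ShiftedLumber"] = df_e["Lumber"].shift(LEAD_MONTHS)

print(df_e.iloc[22:27])
```

The first 24 rows of ShiftedLumber are NaN after the shift, which is one reason a fill or drop step is needed afterwards.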

Additionally, I used Pandas’ built-in rolling-window feature to compute moving averages that smooth out weekly and daily fluctuations in the lumber and housing prices, as seen here in the two rolling(...).mean() calls.

df_l = pd.DataFrame()
df_l['Date'] = df_e['Date']
df_l['Date'] = pd.to_datetime(df_l['Date'], format='%Y-%m-%d')
df_l['ShiftedLumber'] = df_e['ShiftedLumber']
df_l['LumberMA'] = df_e['ShiftedLumber'].rolling(24).mean()
df_l['PricesMA'] = df_e['Prices'].rolling(12).mean()
df_l = df_l.ffill().bfill() # forward-fill, then backfill any remaining NaN
df_l['Prices'] = df_e['Prices']

By creating a 2-year and 1-year rolling average, I was able to more clearly see the trends in the data after the noise had been smoothed out. To keep the original data, this moving average data was added as a new column of data to the original, as can be seen in the code above.

Below is a screen capture of the plotted data, showing the housing price (red) and the lumber’s moving average (green). For comparison, the housing price’s moving average is in black.

The Random Forest Regressor was applied to the physical properties and another model applied to the historical time-price data. In this way, I could run a model against the physical or time features of the property.

In the accompanying example, the physical model was run and the error was about $28,000 on a $200k+ house. While I was not thrilled with such a high error, I believe that narrowing the dataset down to geographic regions would result in a much lower error. As it stands, the data covers houses across the entire United States, so there are houses with matching physical features whose prices still differ drastically because of their geographic location.

Since this explanation was reasonable, I proceeded with the rest of the project.

As mentioned previously, price was predicted using the Random Forest Regressor, as can be seen in the accompanying code. In addition, I performed the standard split of the data into two groups (70/30) to train and test the model, respectively.
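For readers who want to try the same workflow, here is a minimal sketch of a 70/30 split followed by training and scoring a random forest. It assumes scikit-learn's RandomForestRegressor and train_test_split, and it uses synthetic house features (square footage, bedrooms, age) invented for the example, so the printed error says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic physical features and prices, invented for illustration only.
rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(800, 3500, n)
bedrooms = rng.integers(1, 6, n)
age = rng.uniform(0, 80, n)
X = np.column_stack([sqft, bedrooms, age])
y = 50_000 + 90 * sqft + 8_000 * bedrooms - 500 * age + rng.normal(0, 15_000, n)

# 70% of the rows train the model; the remaining 30% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Mean absolute error: the average dollar gap between prediction and truth.
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
print(f"MAE on held-out houses: ${mae:,.0f}")
```

Mean absolute error is used here because it reads directly in dollars, which is one plausible way to arrive at a figure like the one quoted for the physical model.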

This was repeated for the economic data, which was run against the CPI, FedFundsRate, Lumber Prices, House Prices, and, for fun, the price of Eggs. The thought behind the latter was to capture the direct impact of one element that is representative of inflation. While higher egg prices did correlate with higher house prices, including the egg prices in the Random Forest Regressor raised the error too much and, in the end, the price of eggs was not included in the actual model training.
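One way to make this kind of include-or-exclude decision is to train the same model twice, once with and once without the candidate feature, and compare the held-out error. The sketch below assumes scikit-learn and uses synthetic data in which the Eggs column is pure noise; the holdout_mae helper and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic economic data, invented for illustration only.
rng = np.random.default_rng(7)
n = 300
df_e = pd.DataFrame({
    "CPI": rng.uniform(150, 260, n),
    "FedFundsRate": rng.uniform(0, 6, n),
    "ShiftedLumber": rng.uniform(200, 600, n),
    "Eggs": rng.uniform(1, 3, n),  # unrelated to Prices in this toy data
})
df_e["Prices"] = (
    1_000 * df_e["CPI"]
    - 5_000 * df_e["FedFundsRate"]
    + 100 * df_e["ShiftedLumber"]
    + rng.normal(0, 5_000, n)
)


def holdout_mae(frame, feature_cols, target="Prices"):
    """Train on 70% of the rows and return the MAE on the other 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[feature_cols], frame[target], test_size=0.30, random_state=42
    )
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


base_features = ["CPI", "FedFundsRate", "ShiftedLumber"]
mae_without = holdout_mae(df_e, base_features)
mae_with = holdout_mae(df_e, base_features + ["Eggs"])

print(f"MAE without Eggs: {mae_without:,.0f}")
print(f"MAE with Eggs:    {mae_with:,.0f}")
```

Using the same random_state for both runs keeps the split identical, so any difference in error comes from the added column rather than from a different set of test rows.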

Data Model Testing and Accuracy

This application was tested by dividing the known dataset into two parts, one for training and one for testing (in a 70/30 split, respectively). After training the model with the larger set, the known test set was run through the model, which gave a very high accuracy score and produced a visibly tight scatterplot.

Both the accuracy score and tight scatterplot were strong indicators that the macroeconomic Regressor model was very accurate and the testing illustrated this. The images with this data are included earlier in this paper where the details of the accuracy and the scatterplot data are discussed.

The unfortunate part of this dataset, however, is that it ended just prior to 2020 and did not reflect the extreme volatility seen in the 2020-2023 housing market. Once that data becomes available, it will be interesting to retrain and retest this model to verify whether it maintains its accuracy.

That said, I did “spot check” to test the accuracy of the model with additional known data. Instead of forecasting into the future, I added data for known houses and known prices at various times and “forecasted” to a date that was a future date for the model but in the past for me, allowing me to test real world house prices with forecasted prices from the model. During this examination period, I found the model to be reasonably accurate, despite having a limited number of sample houses I could manually enter.
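A spot check of this kind can be framed as a backtest: train only on rows before a cutoff date, forecast the later rows, and compare against the known prices. The following is a sketch under stated assumptions, not the procedure used in the project; the data is synthetic, the cutoff date is arbitrary, and scikit-learn's RandomForestRegressor is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic monthly economic data, invented for illustration only.
rng = np.random.default_rng(1)
months = 240
df = pd.DataFrame({
    "Date": pd.date_range("2000-01-01", periods=months, freq="MS"),
    "ShiftedLumber": 300 + 50 * np.sin(np.arange(months) / 12)
    + rng.normal(0, 5, months),
    "CPI": 170 + 0.3 * np.arange(months),
})
df["Prices"] = (
    100_000 + 300 * df["ShiftedLumber"] + 500 * df["CPI"]
    + rng.normal(0, 2_000, months)
)

CUTOFF = pd.Timestamp("2017-01-01")  # hypothetical "present" for the model
features = ["ShiftedLumber", "CPI"]

# The model only ever sees rows dated before the cutoff.
train = df[df["Date"] < CUTOFF]
holdout = df[df["Date"] >= CUTOFF]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[features], train["Prices"])

# Forecast dates that are "future" for the model but already known to us.
holdout = holdout.assign(Forecast=model.predict(holdout[features]))
holdout["PctError"] = (
    (holdout["Forecast"] - holdout["Prices"]).abs() / holdout["Prices"] * 100
)

print(holdout[["Date", "Prices", "Forecast", "PctError"]].head())
```

Splitting by date rather than at random keeps later information out of training, which matches the idea of forecasting to a date that is in the future for the model but in the past for the person checking it.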

Ciao! I'm Scott Sullivan, a software developer and machine learning nerd. I divide my time between the tranquil countryside of Lancaster, Pennsylvania, and northern Italy, where I visit family close to Cinque Terre and La Spezia. Professionally, I'm using my Master's in Data Analytics and my Bachelor's degree in Computer Science to create compelling software products that use AI and run lighting, robots, and automation effects for large Christian theatrical productions to spread the message of Christ's salvation.