Exploring the Relationship between Income and Real Estate in the US

11 min readMay 12, 2021

Introduction

Affordable housing is important for the physical and mental health of those who live in them. Across America, there are stark differences in what the affordability of real estate looks like in comparison to income, and it is vital that we identify the regions that need the most work towards building affordable housing. While intuitively we know that bigger cities and states that have higher incomes will have higher real estate prices (such as NY and CA), it is important that we establish a proper framework to uncover these patterns. That is part of what the project was intended to do — to provide a base that can be used to build on and carry out more complex analyses for policymaking. To that end, this project explored the relationship between income and real estate prices across zip codes and states, and includes the following analyses:

Most expensive zip codes and states to live in, both by the absolute and relative differences in income and average real estate price
Wealthiest and poorest zip codes in California
Zip codes in California that have the highest income disparity

The overall flow of the project is shown in the following diagram:

As can be seen from the diagram above, the project had multiple phases:

Data Sources: The datasets were obtained from the sources
Ingestion onto HDFS: The datasets were then uploaded to Peel using the “scp” command, and then put onto HDFS by using the “put” command.
Cleaning using MapReduce: Each dataset was cleaned individually by a MapReduce program written specifically for it.
Analysis: Various kinds of descriptive analyses were performed in Impala and Hive, and a regression analysis was done in Spark.

Motivation

As mentioned briefly before, the purpose of this study was to uncover ways of understanding the relationship between income and real estate better, so that it could assist in identifying regions that need the most attention with regards to affordability, and also inform decision making. Specifically, the analytic was developed with the goal that researchers can:

Use it to build affordability indexes and identify regions of disparity
Use it to help identify the effect big tech offices have on real estate prices in their proximity.

Another motivation for me personally was to see if this study could help me in identifying the best places to live in, especially in the remote COVID-19 era. Why live in New York when you can get everything done from a much cheaper and scenic location? As a whole, I think that as housing is something that is vital to the economy (it has only been just over a decade since a housing crisis caused a global recession), this study is interesting to the average person as it helps them gather some insight into what is going on in the world of real estate.

Related Work

Some relevant work has been done by the National Association of Realtors. The NAR publishes a housing affordability index, which in their own words “measures whether or not a typical family earns enough income to qualify for a mortgage loan on a typical home”. Their available data however is presented in a very technical way, and takes into account mortgage rates in order to determine affordability. The Center for Neighborhood Technology (CNT) also provides an online Housing + Transportation index, which contains in depth information about specific locations. However, it is quite difficult to do broad comparisons quickly between zip codes or states because of the nature of the online tool. The Council for Community and Economic Research also offers an index of the cost of living by State which is used by news outlets such as CNN, however this index is a paid product that can not be used by the average consumer. US News offers a free online index of the most affordable states to live in, but does not provide the source of their data or the values that went into determining the order of the index. Howmuch.net and ArcGIS also provide interactive reports on key affordability insights, but their full data is not publicly available. As such, the purpose of this project was to create a free and dynamic resource that can be replicated easily to properly explore affordability in the US.

Description of Datasets

The project utilizes the following two datasets:

IRS Income Tax Dataset — which contains information about the number of tax residents by zip code and the average state and local tax every resident had to pay. It has a total of 153 columns, which include the following relevant columns:

Zip code: Number that identifies the locality.
Number of returns: Number of returns filed in that zip code.
Total income amount: Number that sums the total income in that zip code
State and local income taxes amount: Number that states how much tax is owed
Taxable income amount: Number that states how much of the income is considered taxable

Realtor.com Real Estate Data — which contains information about the listing prices of real estate by zip code. It has a total of 40 columns, which include the following relevant columns:

Zip code: Number that identifies the locality
Month: Date that specifies when the prices are from
Median Listing Price: Number that gives the median price real estate was listed at in that zip code
Active Listing Count: Number of active listings in that zip code
Average Listing Price: Number

Analytic Stages and Process

Cleaning:

During the MapReduce phase, both datasets were cleaned using separate programs. Specifically, the following work was done:

IRS Dataset: Rows that had the zip codes 0 or 99999 were dropped. Out of the 153 total columns, the ones kept were: zip code, state, agi_stub, n1, total_income, state_and_local (taxes), real_estate (taxes), num_paid, paid_amount, taxable amount.

Realtor.com dataset: Rows that had dates outside of 201801 and 201812 (basically data that was not from the year 2018) were dropped. Out of the total 40 columns, the ones kept were date, zip code, median listing price, active listing count, median listing price per square foot, median square feet, average listing price and total listing count

Screenshot of the cleaned Realtor.com Dataset

Descriptive Analysis — SQL Tools

The first analytic attempted was in Hive, the commands for which can be found in test/exploration. However, I quickly realized that Hive was not suited for the job, as it is primarily a tool for batch processing, and was simply too slow for the purpose of this study. As such, I made the switch to Impala.

In Impala, the idea was to incrementally build SQL queries that would help explain the patterns in the data. This process was important so that we would understand how to properly normalize the data — for example the values in the IRS income set were in thousands of dollars, while those in the realtor dataset were not. An important thing to note is that most of the analysis was done by taking into account the average post tax income, which was defined as the result of subtracting state and local as well as real estate taxes from the income column.

A variety of SQL queries were written here — for example, I ran queries that looked into:

The most expensive zip codes in the US overall. As it turned out, 7/10 of them happen to be in California.
What are the most expensive states to live in, defined by the absolute difference between post tax income and the median listing price in that state?
What are the most expensive states to live in by the proportion of post tax income and median listing price? How many years would it take you to buy a median real estate, assuming you saved all of your post tax income?
Since California has 7/10 of the most expensive zip codes, what are the most expensive zip codes in CA? Which zip codes in CA have the highest income disparity, and what do their real estate prices look like?

Regression in Spark

In order to see if there was an actual correlation between post tax income and median real estate piece, a regression was done in PySpark. In order to achieve this, the following steps were taken:

The ability to create tables using SQL and access them as Spark DataFrames was tested out
A query was written to aggregate the the two tables separately and then join them
The results were stored in a PySpark dataframe
The features were selected, and the training and test data frames were split.
A linear regression model was trained, and then tried out on the test set.

Overall, an r² of 0.5 was achieved. This means that 50% of the data can not be explained by the linear model used, and as such the results can not be taken at face value.

There are a number of reasons why the r² is low in this case, and they generally have to do with the way the regression was carried out:

The income data is segmented by income bracket, and in this project, I aggregated their values. It could be that the price of real estate in an area is determined only by incomes of those in a certain selection of income brackets, and that by using an aggregate, we are clouding this data.
We are not taking into account potentially confounding variables such as average utility bills and average transportation costs because we do not have access to them at the moment.
It could be that the data does not follow a simple linear regression, but rather observes a logistic, exponential or power function. In that case, we would need to change our choice of model.

Graphs

There were a variety of descriptive analyses done in Impala. Let us first look at the most expensive zip codes to live in the US, in terms of the absolute difference between post-tax income and median real estate price:

In terms of the absolute difference between the average post tax income in a state and the average median price of real estate in that state, the top 10 states that had the biggest differences are shown here:

The same values can be plotted as follows:

As we can see, California and Hawaii seem to be overpriced. If we now remove them from the graph and add a trend line, we see the following:

This is a linear line of best fit with an r-square of 0.713, and so there seems to be a correlation here.

Let us now look at the case where instead of the absolute difference between income and real estate prices, we look at the relative difference. More specifically, we can determine the number of years it would take a person completely saving their post tax income (assuming they earn the state average) to afford the median real estate price.

As we can see, Hawaii and California are the least affordable states even when we measure by using the number of years as a metric. It would take an average person over 13 years to buy median real estate in Hawaii, even with all of their post tax income! This number is just over 8 for California, although most of the remaining states hover around 5.5.

California consistently features as an expensive place to live. This combined with the fact that California is a tech hub prompted my interest to dive more deeply into the data relating to California, and so the study looks more deeply into that as well. One thing we can do is to look at the zip codes in California that have the worst income disparity:

Now if we plot this visually on a map, we discover something interesting:

6 out of the 10 zip codes in CA with the worst income disparity happen to be in close proximity to Apple’s headquarters! It is important to note at this point that this is a very simple observation, and so we can not use it to claim anything major. It could be that the income disparity observed is so high because of the influx of new (and well paid) employees into the region due to the moving of the headquarters, and that it is only a temporary phenomenon. It is however a very interesting thing to think about, which can be used as a starting point to do a more in depth study on the effects of big corporations on the income and real estate prices of the regions they move into.

Conclusion

This study looks at the relationship between income and real estate prices throughout the US, and uses simple metrics to determine the order of affordability by state and by zip code. It also takes a closer look at California, and determines zip codes that have the highest income disparity. It finds interesting patterns, such as the finding that 6 of the 10 zip codes in California with the highest income disparity happen to be in close proximity with Apple’s headquarters. In the process, it raises some interesting questions, such as:

What role do big company offices play in raising prices of real estate and levels of disparity around them?
Are there certain income brackets that determine what real estate prices end up being?
How important are factors like transportation and utility bills in determining the affordability of a region?

In doing all of this, this study provides a straightforward and scalable framework for doing analyses involving real estate and income, which can be used by ordinary people as well as researchers to uncover more insights. While this study was limited in its scope due to the availability of data, team size and time constraints, it sets a baseline for future research into this topic. The work done here specifically for California can be replicated for other states as well, and general trends across the US can be uncovered, thereby having implications for policymaking.

References

Arcgis.com, www.arcgis.com/apps/Cascade/index.html?appid=1419fe7ee70c4267a7258eb 59a9a824c.

C2ER Cost of Living Index, 2 June 2016, www.coli.org/products/.
“H+T® Index.” Welcome to The H+T Affordability Index, htaindex.cnt.org/.

“Housing Affordability Index.” Www.nar.realtor, www.nar.realtor/research-and-statistics/housing-statistics/housing-affordabilit y-index#:~:text=The%20Monthly%20Housing%20Affordability%20Index,month ly%20price%20and%20income%20data.

Irena Editor, et al. “Median U.S. Home Prices and Housing Affordability by State.” HowMuch, howmuch.net/articles/home-affordability-in-the-US.

“Realtor.com Real Estate Data and Market Trends for Download.” Realtor.com Economic Research, www.realtor.com/research/data/.

“SOI Tax Stats — Individual Income Tax Statistics — ZIP Code Data (SOI).” Internal Revenue Service, www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-da ta-soi.

“These Are the Most Affordable States in the U.S.” U.S. News & World Report, U.S. News & World Report, www.usnews.com/news/best-states/rankings/opportunity/affordability.