Reviewing a Case Study About Real Estate Market Prediction

The paper reviewed here attempts to predict real estate market outcomes with binary classification. Though the paper’s research design and results were lacking, it gave our group a chance to discuss good practices for experimental design.

Author: Athula Pudhiyidath

Published: February 23, 2023

Why this paper

In my work, I deal with real estate data. The data consists of various attributes of real estate properties, some of which follow the ebbs and flows of the economy. I came across this paper while perusing arXiv, and I was compelled by its abstract: using real estate data spanning 10 years from a single county in Florida, the research compares four ML models that predict whether or not a property’s appraised price matches its eventual listing price. Since the data span such a long time, the researchers accounted for fluctuations in socio-economic factors over time in the model; this facet was of particular interest for conceptualizing my own work, and I wanted to learn how the authors incorporated these factors into machine learning algorithms to make predictions about the real estate market.

While the abstract seemed like a promising window into the application of ML to real estate valuation, reading the paper more closely left me questioning many of the approaches the authors took to answer their central question. This blog post therefore summarizes the authors’ work as outlined in the paper and offers alternative considerations from our group’s discussion.

The data

The authors wanted to build and compare models for predicting home prices. They posit that previous literature addressed home pricing with hedonic regression models, i.e., linear regression models that relate the price of a good (here, a property) to its attributes, but that this work did not necessarily account for broader socio-economic factors. In this paper, the authors attempted to model home pricing while accounting for such factors. However, instead of predicting home price as a continuous variable, as with regression, the authors chose to frame the prediction as a binary classification problem. The data used in this research consist of ~94,000 rows of publicly available real estate sale data from a single county in Florida, Volusia County.

Outcome variable

The way the authors defined the binary outcome variable for this research was interesting: they took the final sale price of the home and compared it to the government’s appraised price, assigning a 1 if the property’s selling price is above the appraised price (which they call high price) and a 0 if not (low price); why this designation was worth classifying was left unsaid. More importantly, there was no information on how or when the properties were initially appraised by the government, which makes the comparison to the final sale price unclear. Government appraisals are technically known as a home’s assessed values and differ from appraised values. Assessed valuations are made to determine yearly property taxes for a home in a given area and are generally based on the broad characteristics of the home and the taxed values of homes in the region. An appraised valuation, on the other hand, is done by an appraiser who inspects the features of a home more thoroughly, like its style and appliances, and estimates its current market price. Thus, it is possible the authors were actually using assessed prices rather than appraisal prices.

The authors don’t provide a tally of how many properties are classified as high price, though they do show a plot (see paper Fig. 2) suggesting that the count of high price properties has steadily increased over the 10 years of data. Despite this trend, neither appraisal year nor sale year was among the predictor variables the authors considered.
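
To make the construction concrete, here is a minimal sketch (not the authors’ code) of how such a label might be built with pandas, assuming hypothetical column names sale_price and assessed_value:

```python
import pandas as pd

# Hypothetical column names and toy values; the paper does not give its schema.
df = pd.DataFrame({
    "sale_price":     [250_000, 310_000, 180_000],
    "assessed_value": [240_000, 330_000, 180_000],
})

# 1 ("high price") if the sale price exceeds the government valuation, else 0.
df["high_price"] = (df["sale_price"] > df["assessed_value"]).astype(int)

# Note how binarization discards magnitude: a $100k gap and a $5k gap both become 1.
df["price_gap"] = df["sale_price"] - df["assessed_value"]
print(df)
```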

Another way the authors could have approached the outcome variable was to compare the initial listing price to the final sale price. In fact, in the abstract the authors tout the listing price as the baseline against which the final sale price would be compared, but the final models differ from this initial framing. Regardless of which baseline is used, a binary outcome variable falls short because the magnitude of the difference between any two prices is lost; a price differential of $100k is treated the same as one of $5k.

Predictor variables

For the predictor variables, the authors chose 21 features, or columns, of the data for the models. However, some of these variables were not handled in the right format: parid (property identifier), nbhd (neighborhood code), yrblt (year built), zip21 (ZIP code), sale_date (sale date), and luc (property class) were all encoded as continuous variables but should have been treated as categorical (see paper Table 1). Furthermore, sale_date was not separated into its component month or year values and was instead used as a single datetime value. In our group’s discussion, it was also suggested that the authors could have considered time variables in terms of seasons or quarters, since home sales typically vary on such cadences.
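
As a sketch of what our group had in mind (assuming a pandas DataFrame df with the column names listed in the paper), the identifier-like columns could be cast to categorical and the sale date decomposed into year, month, and quarter:

```python
import pandas as pd

# Assuming `df` holds the Volusia County data with the paper's column names.
categorical_cols = ["parid", "nbhd", "yrblt", "zip21", "luc"]
df[categorical_cols] = df[categorical_cols].astype("category")

# Decompose the sale date rather than feeding a raw datetime to the model.
df["sale_date"] = pd.to_datetime(df["sale_date"])
df["sale_year"] = df["sale_date"].dt.year
df["sale_month"] = df["sale_date"].dt.month
df["sale_quarter"] = df["sale_date"].dt.quarter  # captures the seasonal cadence of sales
```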

Because this data spanned approximately 10 years, the authors attempted to account for market fluctuations in those years by incorporating economic factors: gross domestic product (GDP), consumer price index (CPI), producer price index (PPI), housing price index (HPI), and the effective federal funds rate (EFFR). However, it was unclear where the authors obtained these values and how exactly they matched them to the real estate records. There was little discussion of why these variables were chosen out of the many available indices, or of why they were good markers for the central question. Furthermore, it was unclear whether all of these economic factors would actually be available at the time a prediction is made.
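
One leakage-aware way to attach such indices (again a sketch, not the authors’ method) is to keep a dated table of indicator values and join each sale to the most recent value available on or before its sale date, for example with pandas’ merge_asof; the macro table and its values below are purely illustrative:

```python
import pandas as pd

# Illustrative monthly macro table; the paper does not say where its values came from.
macro = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"]),
    "cpi":  [234.0, 234.7, 236.1],
    "hpi":  [171.0, 172.1, 173.4],
})

# `sales` is assumed to be the property-level DataFrame with a datetime sale_date column.
sales = sales.sort_values("sale_date")
macro = macro.sort_values("date")

# Attach the most recent indicator dated on or before each sale date, so the model
# only sees information that would exist at prediction time (a further lag could be
# added for indices that are published with a delay).
sales = pd.merge_asof(sales, macro, left_on="sale_date", right_on="date",
                      direction="backward")
```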

Data culling

The authors took their initial dataset and used a myriad of approaches to reduce and refactor the data. First, they correlated the predictor variables with the outcome variable and excluded the variables that were not correlated; however, they did not account for predictors that are highly correlated with one another (see paper Fig. 5), which introduces multicollinearity and muddies their later interpretation of feature importances in the models. This omission is all the more notable because, when the authors later introduce the voting classifier, they state that “the method performs best when the predictors are as independent from each another as possible.”
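
A simple check our group suggested (a sketch, assuming a numeric predictor DataFrame X) is to list the highly correlated predictor pairs before doing any correlation-based selection, so that redundant variables can be dropped or combined deliberately:

```python
import numpy as np
import pandas as pd

# Assuming `X` is the DataFrame of numeric predictors.
corr = X.corr().abs()

# Keep only the strict upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs correlated above 0.8 are candidates for dropping or combining, since they
# make any later reading of feature importances hard to interpret.
high_pairs = (upper.stack()
                   .loc[lambda s: s > 0.8]
                   .sort_values(ascending=False))
print(high_pairs)
```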

Next, the authors performed a round of preliminary modeling, running the dataset through random forest and XGBoost models and examining the resulting feature importances to decide which variables would enter the final models. Every step of this interim modeling process was unclear, which makes it hard to give much credit to the authors’ interpretations of the variable importance charts (paper Figs. 6-8). Regardless, the authors used these outcomes to further entangle the predictor variables, applying a technique called ‘mean encoding’ to merge highly ranked variables into two new features, which they term F1 and F2. Mean encoding (also called target encoding) replaces each category of a variable with the mean of the outcome variable for that category, so the encoded feature is built directly from the labels. This kind of encoding, layered on top of selecting features by interim modeling of the very data that later feeds the final models, is an approach that is ripe for data leakage.
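
The leakage concern is easiest to see in code. In the sketch below (our illustration, not the authors’ procedure, with hypothetical column names nbhd and high_price), the target encoding is computed only on the training fold of each split, so the rows being predicted never contribute to their own encoding; computing the encoding once on the full dataset, as the paper appears to do, lets the outcome leak into the features:

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(train, valid, col, target, smoothing=10):
    """Replace a categorical column with a smoothed mean of the target,
    computed on the training fold only so the labels of `valid` never leak in."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) \
             / (stats["count"] + smoothing)
    return (train[col].map(smooth).fillna(global_mean),
            valid[col].map(smooth).fillna(global_mean))

# `df` is assumed to hold the predictors plus the binary high_price label.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(df):
    tr, va = df.iloc[train_idx].copy(), df.iloc[valid_idx].copy()
    tr["nbhd_enc"], va["nbhd_enc"] = target_encode(tr, va, "nbhd", "high_price")
    # ...fit the model on `tr` and evaluate on `va` here
```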

Algorithm predictions and results

Finally, the authors get to their model comparison stage. At this point in the paper it was unclear why exactly these models were being compared against one another. The question the authors originally laid out had fallen out of focus, and it was unclear what comparing models on a dataset built with leaked information would show.

The authors went on to compare random forest, XGBoost, voting classifier, and logistic regression models on the dataset. They state that they used 10-fold cross-validation (separate from the 5-fold cross-validation used in the preliminary feature selection step). While hyperparameter tuning is briefly mentioned, the authors don’t say which models were tuned or over which parameters. They evaluate differences between the classifiers mainly with accuracy, precision, and recall. However, it is unclear why these metrics were chosen for a high/low classifier or how they serve the authors’ goals: is it more important for the classifier to correctly catch low price properties, or high price ones? And if it is better at one than the other, what impact would that have on the market?
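
For comparison, a leakage-safe version of this evaluation is short to write with scikit-learn; the sketch below (assuming a numeric feature matrix X and binary labels y, and showing only two of the four models) reports accuracy, precision, and recall from a single 10-fold cross-validation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assuming a numeric feature matrix `X` and binary labels `y` built without leakage.
models = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=10,
                            scoring=["accuracy", "precision", "recall"])
    print(name,
          scores["test_accuracy"].mean(),
          scores["test_precision"].mean(),
          scores["test_recall"].mean())
```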

By this point, the research question was not sufficiently motivated, and the data had been ‘double-dipped’ and over-engineered. In fact, in presenting the comparisons of algorithm performance, the authors showcase how the engineered features classify better, which is no surprise, because those features were chosen in the first place for being better at classifying the task (see paper Fig. 20, 21).

The conclusion the authors seem to draw from this section is that XGBoost was the superior classifier, but considering all of the issues above, that conclusion does not follow. Perhaps a more defensible result could have been reached had the authors employed a linear regression model instead; that way they could have predicted the final sale price as a continuous variable rather than as the binary one they attempt here.
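
A minimal sketch of that alternative framing (again assuming the feature matrix X and a continuous sale_price column from earlier) would score a regression model with an error metric that preserves the size of the miss:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# The target is now the continuous sale price (or the sale-price-minus-assessment
# gap), so the magnitude of the difference is no longer thrown away.
y_price = df["sale_price"]

mae = -cross_val_score(LinearRegression(), X, y_price, cv=10,
                       scoring="neg_mean_absolute_error")
print("Mean absolute error per fold:", mae)
```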

Final thoughts

This paper’s central question is unclear and drifts from beginning to end. The paper switches from comparing listing prices with final sale prices, to comparing appraised prices (which were likely assessed prices) with final sale prices, to simply asking which model performs best. If the authors had told a more structured story about the questions they wanted to answer, the baseline against which they wanted to compare outcomes, and the methods for making those comparisons, this would have been a more compelling read. As it stands, however, it is not possible to draw any conclusions from this paper.