Using Regression Analysis to Predict HDB Resale Prices in Singapore

Valerie Lim
Feb 19, 2020

Background

You and your significant other wish to live together, but you were unsuccessful in securing a BTO (Build-To-Order) flat, or the BTO launches are not in your desired locations. What would you do? Turn to the resale housing market? However, you’re a first-time home buyer. Unless you’re well-read or have done extensive research on the property market, the various resale property websites such as SRX and PropertyGuru, with their many customisable search features, would probably leave you slightly overwhelmed and facing a paradox of choice. More importantly, a house is likely to be one of your biggest investments. How would you know whether a flat is priced at its “true value”? And which features are truly important in determining that value?

A kickstarter guide for first-time home buyers

Without sufficient prior knowledge, first-time home buyers are probably either clueless at guesstimating housing prices or letting their property agents or sellers lead the negotiation. A guide that tells buyers which key features affect housing prices, and whether a certain flat is overvalued compared to the “norm”, would be useful, and would hopefully help to even out the information asymmetry between agents/sellers and buyers. In this article, I will walk you through how I used Lasso Regression to find out what these key features are, and how the estimated housing price changes with variations in those key features.

The Process

For this initial concept, I focused on HDB listings that are 3-room and above, as couples like you are commonly looking at this range.

Data Collection

HDB listings were scraped from the SRX website using BeautifulSoup.

A screenshot of a listing on the SRX website

Each listing carries a range of information, from main features such as housing type and location to peripheral features such as whether the flat is renovated or a corner unit. In total, 1,980 listings were collected.
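As a rough illustration of the scraping step, the sketch below parses a single listing card with BeautifulSoup. The HTML snippet, class names and field names are hypothetical stand-ins rather than SRX's actual markup, and the original scraper may have been structured quite differently.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; the real SRX page structure will differ.
SAMPLE_HTML = """
<div class="listing">
  <span class="title">HDB 4 Rooms</span>
  <span class="price">$450,000</span>
  <span class="psf"> $420 psf </span>
  <ul class="tags"><li>Renovated</li><li>Corner Unit</li></ul>
</div>
"""


def parse_listing(card):
    """Extract a dict of fields from one listing card (hypothetical layout)."""
    price_text = card.find("span", class_="price").get_text(strip=True)
    return {
        "title": card.find("span", class_="title").get_text(strip=True),
        "price": int(price_text.replace("$", "").replace(",", "")),
        # Left untrimmed on purpose: whitespace is handled during cleaning.
        "psf": card.find("span", class_="psf").get_text(),
        "tags": [li.get_text(strip=True) for li in card.find_all("li")],
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = [parse_listing(card) for card in soup.find_all("div", class_="listing")]
print(listings)
```

In practice the HTML would come from an HTTP request for each results page, and the selectors would need to match whatever the site serves at the time.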

Data Cleaning

1. Dropped duplicated listings: Some agents posted the same listing on different pages.

2. Removed leading and trailing whitespace characters in PSF (price per square foot).

3. Different variations of a model type were reduced to their common denominator, e.g. Model A Simplified is treated as Model A.

4. Some listings omitted features that other listings provided. These missing values were imputed as 0.

5. Categorical variables (e.g. property type, model type) were converted to dummy variables.

6. Prices were log-transformed as they were slightly right-skewed; a log transform compresses the long upper tail that is typical of price data.
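The cleaning steps above could look roughly like this in pandas. The column names and toy rows are assumptions made for illustration; they are not the project's actual schema.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names; the real dataset has more fields.
raw = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "psf": [" 420 ", " 420 ", "510", " 380"],
    "model": ["Model A Simplified", "Model A Simplified", "Improved", "Model A"],
    "property_type": ["4-room", "4-room", "5-room", "3-room"],
    "renovated": [1.0, 1.0, np.nan, 1.0],
    "price": [450000, 450000, 620000, 330000],
})

# 1. Drop listings that were posted more than once.
df = raw.drop_duplicates().copy()
# 2. Trim whitespace in PSF and convert it to a number.
df["psf"] = df["psf"].str.strip().astype(float)
# 3. Collapse model-type variants into their base type.
df["model"] = df["model"].replace({"Model A Simplified": "Model A"})
# 4. Treat a missing peripheral feature as absent (0).
df["renovated"] = df["renovated"].fillna(0)
# 5. One dummy column per category, dropping the first level as the baseline.
df = pd.get_dummies(df, columns=["property_type", "model"], drop_first=True)
# 6. Log-transform the target.
df["log_price"] = np.log(df["price"])

print(df)
```

With `drop_first=True`, the alphabetically first category (here the 3-room flat) becomes the baseline that the other dummy coefficients are compared against.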

Feature Engineering

1. For ease of interpretation, the age of the flat was derived as the number of years since it was built.

2. HDB towns were aggregated into regions to investigate whether different areas (e.g. Central, North, South, East, West) in Singapore have an impact on prices.
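A minimal sketch of both engineered features is shown below. The column names, the reference year and the town-to-region lookup are assumptions; the grouping actually used in the project is not given, so the mapping here is only an illustrative subset.

```python
import pandas as pd

# Illustrative subset of a town-to-region lookup (an assumption, not the
# project's actual grouping).
TOWN_TO_REGION = {
    "Bukit Merah": "Central",
    "Woodlands": "North",
    "Tampines": "East",
    "Jurong West": "West",
}


def add_engineered_features(df, reference_year=2020):
    """Add flat age and region columns (column names are hypothetical)."""
    out = df.copy()
    out["age"] = reference_year - out["year_built"]
    out["region"] = out["town"].map(TOWN_TO_REGION)
    return out


listings = pd.DataFrame({
    "town": ["Tampines", "Woodlands", "Bukit Merah"],
    "year_built": [1995, 2010, 1978],
})
features = add_engineered_features(listings)
print(features)
```

Towns missing from the lookup would come out as NaN in `region`, which makes gaps in the mapping easy to spot.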

Feature Selection

Feature selection was conducted at various stages of the model building process.

1. Before building the models: To avoid multicollinearity, where two features were highly correlated with each other, one of them was removed. Features that had no correlation with the target were dropped too.

2. After selecting an appropriate linear model: Features were selected using the backward stepwise method, removing features with p-values above 0.05.

3. After cross-validation established that a linear model was likely to be suitable for the data: Lasso regression was used for feature selection. An alpha value was obtained using LassoCV and applied to the model. Features with zero-valued coefficients were removed.
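The Lasso-based step can be sketched as follows, using synthetic data in place of the scraped listings. The feature names, coefficients and noise level are made up for illustration, and standardising the features first is my assumption rather than something the project states.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
feature_names = ["psf", "bedrooms", "age", "sea_view", "corner_unit"]
X = rng.normal(size=(n, len(feature_names)))
# Synthetic log-price: only the first three features carry signal.
y = (
    13.0
    + 0.12 * X[:, 0]
    + 0.10 * X[:, 1]
    - 0.05 * X[:, 2]
    + rng.normal(scale=0.05, size=n)
)

# Put features on a common scale so the L1 penalty treats them evenly.
X_scaled = StandardScaler().fit_transform(X)

# LassoCV picks alpha by cross-validation over a grid of candidate values.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

kept = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0]
dropped = [name for name, coef in zip(feature_names, lasso.coef_) if coef == 0]
print(f"alpha chosen by LassoCV: {lasso.alpha_:.4f}")
print("kept:", kept)
print("dropped:", dropped)
```

Whether the two pure-noise features end up with exactly zero coefficients depends on the alpha that cross-validation selects, so the `dropped` list can be empty on some datasets.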

Here’s a summary of the variables and why they were removed:

Table of variables, and their status (removed/included in model)

Model Building

The final dataset was split into 20% test data and 80% training-validation data (for k-fold cross-validation). 5-fold cross-validation was used due to the small number of observations. The backward stepwise method was applied using Lasso regression to determine whether a simpler model (one with fewer features) could explain a similar amount of variance as a model with more features. The final model had a Mean Absolute Error of 0.1 on the log-price scale. After reversing the log transformation, this error is equivalent to about $52k, which means buyers using this tool can expect roughly that much wiggle room when estimating housing prices based on the features mentioned earlier.
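For readers who want to reproduce the evaluation setup, here is a minimal sketch of an 80/20 split, 5-fold cross-validation and an MAE reported on both the log and dollar scales. The data is synthetic, the alpha value is a placeholder, and a natural-log target is assumed; none of these come from the project itself.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))  # synthetic features
true_coefs = np.array([0.12, 0.10, -0.05, 0.0])
y_log = 13.0 + X @ true_coefs + rng.normal(scale=0.1, size=500)

# 80% training-validation data, 20% held-out test data.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42
)

model = Lasso(alpha=0.001)  # placeholder alpha, not the article's value
cv = KFold(n_splits=5, shuffle=True, random_state=42)
# scikit-learn reports negated MAE so that higher is better; flip the sign.
cv_mae = -cross_val_score(
    model, X_trainval, y_trainval, cv=cv, scoring="neg_mean_absolute_error"
)
print("5-fold MAE (log scale):", cv_mae.round(3))

model.fit(X_trainval, y_trainval)
pred_log = model.predict(X_test)
mae_log = mean_absolute_error(y_test, pred_log)
# Reverse the log transform to express the error in dollars.
mae_dollars = mean_absolute_error(np.exp(y_test), np.exp(pred_log))
print(f"test MAE: {mae_log:.3f} (log scale), ${mae_dollars:,.0f}")
```

A log-scale MAE of 0.1 corresponds to predictions being off by roughly 10% in either direction, which is why the dollar figure depends on the typical price level in the data.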

To best understand how the model works, let’s dive into the features. The plot below shows the relative importance of each feature.

The strongest positive predictor is PSF (price per square foot). With a model intercept of 13.1, a 1-unit increase in PSF is estimated to increase price by $61k. Property type is the next most important feature. Compared to 3-room flats, executive flats and 5-room flats are estimated to be $55k more expensive, while jumbo flats are $51k pricier. Lastly, each additional bedroom is estimated to increase price by around $54k.
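Because the target is log price, each coefficient acts multiplicatively on the predicted price, and a dollar figure only makes sense relative to a baseline. The sketch below shows one way such a conversion could be done, assuming a natural-log target and taking the intercept as the baseline; the coefficient value is hypothetical, since the fitted coefficients are not listed here.

```python
import math


def dollar_effect(intercept, coef, delta=1.0):
    """Dollar change in predicted price when one feature rises by `delta`,
    holding all other features at their baseline (zero) values."""
    baseline = math.exp(intercept)
    return baseline * (math.exp(coef * delta) - 1.0)


INTERCEPT = 13.1          # intercept reported in the article
HYPOTHETICAL_COEF = 0.12  # illustrative only, not a fitted value

baseline_price = math.exp(INTERCEPT)
effect = dollar_effect(INTERCEPT, HYPOTHETICAL_COEF)
print(f"baseline price: ${baseline_price:,.0f}")
print(f"effect of a 1-unit increase: ${effect:,.0f}")
```

An intercept of 13.1 corresponds to a baseline of about $489k, so a coefficient in the region of 0.1 translates into a change of several tens of thousands of dollars.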

Sklearn’s Lasso regression was selected over statsmodels’ Ordinary Least Squares model because the former is fitted iteratively using coordinate descent, without having to compute an exact closed-form solution. Hence, Lasso regression would be more appropriate as I scale up this model to include more observations or to include another target variable.

Checking Assumptions

Left to Right: Scatterplot of y-residuals and y-predicted; Q-Q plot of y-residuals; Scatterplot of actual price (log) and predicted price (log).

There are a few outliers for which the predicted price is lower than the actual price. Upon inspection, they have features that are outside the ‘norm’, e.g. a corner unit in the Central area. Hence, the current model may not be suitable for predicting these listings. Collecting more data could improve the model’s predictive performance. Other than that, there were no discernible patterns in the scatter plots, and the y-residuals appear to be normally distributed.
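The same checks can be approximated numerically without plotting. The sketch below uses only the standard library and synthetic stand-ins for the actual and predicted log prices; it is not the project's diagnostic code.

```python
import math
import random
from statistics import NormalDist, mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def qq_points(residuals):
    """Pair sorted residuals with theoretical standard-normal quantiles."""
    n = len(residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(residuals)


random.seed(0)
# Synthetic stand-ins for predicted and actual log prices.
y_pred = [13.0 + random.gauss(0, 0.3) for _ in range(200)]
y_true = [p + random.gauss(0, 0.1) for p in y_pred]

residuals = [t - p for t, p in zip(y_true, y_pred)]
theoretical, observed = qq_points(residuals)

qq_r = pearson(theoretical, observed)    # near 1 if residuals look normal
resid_pred_r = pearson(y_pred, residuals)  # near 0 if there is no linear trend
print(f"mean residual: {mean(residuals):.4f}")
print(f"Q-Q correlation: {qq_r:.4f}")
print(f"residual vs predicted correlation: {resid_pred_r:.4f}")
```

A Q-Q correlation close to 1 and a residual-versus-prediction correlation close to 0 are consistent with the assumptions, though the plots remain the better tool for spotting outliers and curved or fanning patterns.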

Conclusion

There are different types of HDB flats you can call home. The intuitive finding is that you could consider a 3-room flat with a lower PSF if you have cost constraints, or opt for a bigger living space such as a 5-room, jumbo or executive flat if you are planning to start a family soon or have a larger budget. Beyond that, the current model also suggests that aesthetic factors (e.g. sea view, renovated, corner unit) are not as significant in determining prices as core features (e.g. property or model type), although they could still matter depending on individual preferences.

Therefore, first-time home buyers like yourself could use these nuggets of insight as a rough indicator of the market price of your desired home, and focus on the aforementioned features if you feel overwhelmed by the wide range of customisable features on property market websites.

Future Work

The current model still has a lot of room for improvement. To increase prediction accuracy, I can collect more data, as this model is based on only 1000 data points. Having more listings would likely lower the MAE. At the same time, the current source of data is limited to SRX. For a more comprehensive analysis, I could scrape multiple property market websites such as PropertyGuru and 99.co. However, data from property portals may be biased, as agents may hike up listing prices to earn a larger share of the pie. For an even more accurate analysis, I could gather the actual purchase prices from ERA. I could also create additional features, e.g. distance from business districts (Central Business District, Mapletree Business District, Jurong Lake District), to determine whether the distance between homes and workplaces affects housing prices. Lastly, using the statsmodels Ordinary Least Squares model to perform backward stepwise feature selection may be too rash. I could apply clustering algorithm(s) on the features before performing feature selection.
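As a starting point for the distance feature, a great-circle (haversine) distance between a flat and a business district could be computed as below. The coordinates are rough approximations that I have chosen for illustration, and straight-line distance is itself a simplification of actual travel time.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate, illustrative coordinates (assumptions, not surveyed points).
CBD = (1.2840, 103.8510)            # around Raffles Place
FLAT_IN_TAMPINES = (1.3530, 103.9450)

distance_to_cbd = haversine_km(*FLAT_IN_TAMPINES, *CBD)
print(f"distance to CBD: {distance_to_cbd:.1f} km")
```

The same function could be applied once per business district to produce one distance column each.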

[This project was done as part of an immersive data science program called Metis. Linear regression and web scraping were project requirements. You can find the files for this project at my GitHub and the slides here.]

Feel free to reach out with any questions.

Valerie Lim

A fast learner and self-starter, Valerie is results driven and possesses strong analytical skills | Data Scientist @ Dell | linkedin.com/in/valerie-lim-yan-hui/