Modelling of Singapore HDB resale price

1. Problem

There are multiple factors affecting the selling price of HDB resale flats in Singapore. My objective is to construct a regression model to estimate how the selling price of an HDB resale flat changes based on its distance to the CBD and to the nearest MRT station, as well as its flat size, floor level and the years of lease remaining on the flat.

1.1 Objective

To predict the selling price of an HDB resale flat with a regression model, based on the given explanatory variables: (1) distance from the CBD, (2) distance from the nearest MRT station, (3) flat size, (4) floor level, (5) years of lease remaining

There is only one target output variable, the resale price of the HDB flat, which is what we would like to predict. Previously, I built a Python tool to extract the longitude and latitude of each MRT station and the CBD region from the onemap.sg API.

2. Data Mining (Collection of the raw data needed for the problem) and Feature engineering

In [13]:
#import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [1]:
#extract longitude/latitude information based on street name

import json
import requests
import openpyxl
import time

# Open Excel sheet
wb = openpyxl.load_workbook('out.xlsx')
In [2]:
#sheet = wb['out']
sheet = wb.active
In [3]:
sheet['B1'] = 'LONGITUDE'
sheet['C1'] = 'LATITUDE'

for irow in range(2, sheet.max_row+1):
    street_name = sheet['A' + str(irow)].value
    query_string = 'https://developers.onemap.sg/commonapi/search?searchVal=' + str(
         street_name) + '&returnGeom=Y&getAddrDetails=N&pageNum=1'
    result = requests.get(query_string)

    # Convert JSON into Python Object
    data = json.loads(result.content)
    if data['results'] == []: ## check if the longitude and latitude can be found, if not, update to NA
        sheet['B' + str(irow)].value = 'NA'    
        sheet['C' + str(irow)].value = 'NA'    
    else:
        # Extract data from JSON
        sheet['B' + str(irow)].value = data['results'][0]['LONGITUDE']    
        sheet['C' + str(irow)].value = data['results'][0]['LATITUDE']
    
wb.save('Street_Name_List.xlsx')
print('Job Done.')
Job Done.
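The `time` module is imported above but never used; public geocoding endpoints such as onemap.sg typically rate-limit clients, so pausing between calls and retrying failed requests is prudent when looping over hundreds of rows. A minimal sketch (the retry count, delay and `fetch` callable are illustrative assumptions, not part of the OneMap API):

```python
import time

def get_with_retry(fetch, retries=3, delay=0.5):
    # Call fetch() up to `retries` times, sleeping `delay` seconds between failed attempts
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Usage inside the loop above (sketch):
# result = get_with_retry(lambda: requests.get(query_string, timeout=10))
```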
In [5]:
sheet2 = wb.active
In [6]:
print(sheet2.max_row)
544
In [7]:
sheet2['B1'] = 'LONGITUDE'
sheet2['C1'] = 'LATITUDE'

for irow in range(2, 100):
    mrt_station_english = sheet2['A' + str(irow)].value
    query_string = 'https://developers.onemap.sg/commonapi/search?searchVal=' + str(
         mrt_station_english) + '&returnGeom=Y&getAddrDetails=N&pageNum=1'
    
    result = requests.get(query_string)

    # Convert JSON into Python Object
    data = json.loads(result.content)

    # Extract data from JSON
    sheet2['B' + str(irow)].value = data['results'][0]['LONGITUDE']    
    sheet2['C' + str(irow)].value = data['results'][0]['LATITUDE']

wb.save('mrt_List_withlong.xlsx')
print('Job Done.')
Job Done.
In [8]:
#Trying to retrieve the data of one station (Clementi)
query_string_clementiMRTstation='https://developers.onemap.sg/commonapi/search?searchVal=' + "CLEMENTI " + '&returnGeom=Y&getAddrDetails=Y&pageNum=1'
In [9]:
result_clementi= requests.get(query_string_clementiMRTstation)
In [10]:
# Convert JSON into Python Object
data2 = json.loads(result_clementi.content)
In [11]:
print(data2)
{'found': 688, 'totalNumPages': 69, 'pageNum': 1, 'results': [{'SEARCHVAL': 'CLEMENTI GREEN', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTI GREEN', 'ADDRESS': 'CLEMENTI GREEN', 'POSTAL': 'NIL', 'X': '21371.60329', 'Y': '34043.70148', 'LATITUDE': '1.3241529440000002', 'LONGITUDE': '103.7737589', 'LONGTITUDE': '103.7737589'}, {'SEARCHVAL': 'CLEMENTIWOODS CONDOMINIUM', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTIWOODS CONDOMINIUM', 'ADDRESS': 'CLEMENTIWOODS CONDOMINIUM', 'POSTAL': 'NIL', 'X': '20686.3451', 'Y': '30967.179439999996', 'LATITUDE': '1.296329825', 'LONGITUDE': '103.7676022', 'LONGTITUDE': '103.7676022'}, {'SEARCHVAL': 'CLEMENTI GARDENS', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTI GARDENS', 'ADDRESS': 'CLEMENTI GARDENS', 'POSTAL': 'NIL', 'X': '21400.35712', 'Y': '33957.86196', 'LATITUDE': '1.323376648', 'LONGITUDE': '103.7740173', 'LONGTITUDE': '103.7740173'}, {'SEARCHVAL': 'CLEMENTI PARK', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTI PARK', 'ADDRESS': 'CLEMENTI PARK', 'POSTAL': 'NIL', 'X': '20505.1575', 'Y': '34640.21757', 'LATITUDE': '1.32954742', 'LONGITUDE': '103.7659733', 'LONGTITUDE': '103.7659733'}, {'SEARCHVAL': 'CLEMENTI PEAKS', 'BLK_NO': '463', 'ROAD_NAME': 'CLEMENTI AVENUE 1', 'BUILDING': 'CLEMENTI PEAKS', 'ADDRESS': '463 CLEMENTI AVENUE 1 CLEMENTI PEAKS SINGAPORE 120463', 'POSTAL': '120463', 'X': '20715.45633', 'Y': '32585.55105', 'LATITUDE': '1.310965805', 'LONGITUDE': '103.76786340000001', 'LONGTITUDE': '103.76786340000001'}, {'SEARCHVAL': 'CLEMENTI GATEWAY', 'BLK_NO': '208', 'ROAD_NAME': 'CLEMENTI AVENUE 6', 'BUILDING': 'CLEMENTI GATEWAY', 'ADDRESS': '208 CLEMENTI AVENUE 6 CLEMENTI GATEWAY SINGAPORE 120208', 'POSTAL': '120208', 'X': '20076.31474', 'Y': '33849.4405', 'LATITUDE': '1.322395808', 'LONGITUDE': '103.76212009999999', 'LONGTITUDE': '103.76212009999999'}, {'SEARCHVAL': 'CLEMENTI MEADOWS', 'BLK_NO': '318A', 'ROAD_NAME': 'CLEMENTI AVENUE 4', 'BUILDING': 'CLEMENTI MEADOWS', 'ADDRESS': '318A 
CLEMENTI AVENUE 4 CLEMENTI MEADOWS SINGAPORE 121318', 'POSTAL': '121318', 'X': '20417.167709999998', 'Y': '33302.05542', 'LATITUDE': '1.3174455409999999', 'LONGITUDE': '103.765183', 'LONGTITUDE': '103.765183'}, {'SEARCHVAL': 'CLEMENTI NORTHARC', 'BLK_NO': '210', 'ROAD_NAME': 'CLEMENTI AVENUE 6', 'BUILDING': 'CLEMENTI NORTHARC', 'ADDRESS': '210 CLEMENTI AVENUE 6 CLEMENTI NORTHARC SINGAPORE 120210', 'POSTAL': '120210', 'X': '20125.72735', 'Y': '33948.67057', 'LATITUDE': '1.323293222', 'LONGITUDE': '103.76256409999999', 'LONGTITUDE': '103.76256409999999'}, {'SEARCHVAL': 'CLEMENTI CREST', 'BLK_NO': '445', 'ROAD_NAME': 'CLEMENTI AVENUE 3', 'BUILDING': 'CLEMENTI CREST', 'ADDRESS': '445 CLEMENTI AVENUE 3 CLEMENTI CREST SINGAPORE 120445', 'POSTAL': '120445', 'X': '20274.65972', 'Y': '32850.22638', 'LATITUDE': '1.313359326', 'LONGITUDE': '103.7639026', 'LONGTITUDE': '103.7639026'}, {'SEARCHVAL': 'CLEMENTI RIDGES', 'BLK_NO': '312', 'ROAD_NAME': 'CLEMENTI AVENUE 4', 'BUILDING': 'CLEMENTI RIDGES', 'ADDRESS': '312 CLEMENTI AVENUE 4 CLEMENTI RIDGES SINGAPORE 120312', 'POSTAL': '120312', 'X': '20281.04805', 'Y': '33481.06633', 'LATITUDE': '1.319064417', 'LONGITUDE': '103.7639598', 'LONGTITUDE': '103.7639598'}]}
In [14]:
#I have generated a list of street names from the street_name column of the HDB resales dataset
#Using the street names as inputs to the onemap.sg geocoding API, I retrieved the longitude (long) and latitude (lat) of each street.
#The Street_Name_List (5).csv dataset already contains all the retrieved longitude and latitude coordinates of the streets, as shown below.
coordinate_street=pd.read_csv("Street_Name_List (5).csv")
In [15]:
coordinate_street.head()
Out[15]:
street_name LONGITUDE LATITUDE
0 ANG MO KIO AVE 10 103.855026 1.360136
1 ANG MO KIO AVE 4 103.838826 1.378136
2 ANG MO KIO AVE 5 103.857187 1.375021
3 ANG MO KIO AVE 1 103.844819 1.365271
4 ANG MO KIO AVE 3 103.847809 1.367542
In [16]:
#drop the rows with missing information on the longitude and latitude
coordinate_street = coordinate_street.dropna(axis=0, how='any')
In [17]:
#Now that we have both the lat and long for every street name in our dataset, we build a list of coordinates to compute the distance to the nearest MRT station
list_of_lat = coordinate_street['LATITUDE']
list_of_long = coordinate_street['LONGITUDE']
In [18]:
# List of coordinates of all unique addresses in our dataset
list_of_coordinates = []
for lat, long in zip(list_of_lat, list_of_long):
    list_of_coordinates.append((lat,long))
In [19]:
len(list_of_coordinates)
Out[19]:
542
In [20]:
#I have made a list of all the MRT station names from the webpage: https://data.gov.sg/dataset/train-station-chinese-names
#Using the MRT station names as inputs to the onemap.sg geocoding API, I retrieved the longitude (long) and latitude (lat) of each MRT station.
#The MRT_Log_lat_List.csv dataset already contains all the retrieved longitude and latitude coordinates of the stations, as shown below.
mrt_location=pd.read_csv("MRT_Log_lat_List.csv")

#drop the rows with missing information on the longitude and latitude
mrt_location=mrt_location.dropna(axis = 0, how ='any') 
In [21]:
mrt_location.head()
Out[21]:
mrt_station_english LONGITUDE LATITUDE
0 Jurong East MRT Station 103.741735 1.334009
1 Bukit Batok MRT Station 103.749769 1.349512
2 Bukit Gombak MRT Station 103.751727 1.359065
3 Choa Chu Kang MRT Station 103.744556 1.385426
4 Yew Tee MRT Station 103.747405 1.397535
In [22]:
# List of MRT Coordinates in Singapore
mrt_lat = mrt_location['LATITUDE']
mrt_long = mrt_location['LONGITUDE']
In [23]:
list_of_mrt_coordinates = []

for lat, long in zip(mrt_lat, mrt_long):
    list_of_mrt_coordinates.append((lat, long))
    
len(list_of_mrt_coordinates)
Out[23]:
183
In [24]:
# Calculating the distance to the nearest MRT station
from geopy.distance import geodesic
In [25]:
list_of_dist_mrt = []
min_dist_mrt = []
for origin in list_of_coordinates:
    for destination in range(0, len(list_of_mrt_coordinates)):        
        list_of_dist_mrt.append(geodesic(origin,list_of_mrt_coordinates[destination]).meters)

    shortest = (min(list_of_dist_mrt))
    min_dist_mrt.append(shortest)
    list_of_dist_mrt.clear()
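The double loop above can also be collapsed into a single `min()` per street, with no scratch list to clear. A self-contained sketch using the equirectangular approximation in place of `geodesic` (the coordinates below are made-up illustrations, not real streets or stations):

```python
import math

def approx_dist_m(a, b):
    # Equirectangular approximation: a, b are (lat, long) tuples in degrees
    dlat_km = (a[0] - b[0]) * 110.574
    dlong_km = (a[1] - b[1]) * 111.32
    return math.hypot(dlat_km, dlong_km) * 1000  # metres

streets = [(1.3653, 103.8448), (1.3781, 103.8388)]   # hypothetical street coordinates
stations = [(1.3699, 103.8496), (1.3830, 103.8448)]  # hypothetical MRT coordinates

# One min() per street replaces the inner loop, shortest, and clear()
min_dist = [min(approx_dist_m(origin, s) for s in stations) for origin in streets]
```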
In [26]:
coordinate_street['min_dist_mrt'] = min_dist_mrt
In [27]:
print(coordinate_street.head())
         street_name   LONGITUDE  LATITUDE  min_dist_mrt
0  ANG MO KIO AVE 10  103.855026  1.360136   1149.779402
1   ANG MO KIO AVE 4  103.838826  1.378136    774.027817
2   ANG MO KIO AVE 5  103.857187  1.375021   1059.623897
3   ANG MO KIO AVE 1  103.844819  1.365271    691.062328
4   ANG MO KIO AVE 3  103.847809  1.367542    277.657001
In [28]:
#Retrieving the dataset on HDB resale prices, pooled from the original data files containing resale flat prices from 1990 to 2020
resaleHDB_df=pd.read_csv("pooledHDB.csv")
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [29]:
print(resaleHDB_df.head())
     month        town flat_type block       street_name storey_range  \
0  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
1  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     04 TO 06   
2  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
3  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     07 TO 09   
4  1990-01  ANG MO KIO    3 ROOM   216  ANG MO KIO AVE 1     04 TO 06   

   floor_area_sqm      flat_model  lease_commence_date  resale_price  \
0            31.0        IMPROVED                 1977        9000.0   
1            31.0        IMPROVED                 1977        6000.0   
2            31.0        IMPROVED                 1977        8000.0   
3            31.0        IMPROVED                 1977        6000.0   
4            73.0  NEW GENERATION                 1976       47200.0   

  remaining_lease  year  recalculate_remaining_lease  storey_to  storey_from  \
0             NaN  1990                           86         10           12   
1             NaN  1990                           86          4            6   
2             NaN  1990                           86         10           12   
3             NaN  1990                           86          7            9   
4             NaN  1990                           85          4            6   

   average_storey  
0             NaN  
1             NaN  
2             NaN  
3             NaN  
4             NaN  
In [30]:
#Merging the pooled HDB resale, longitude and latitude datasets of street address and  min_dist_mrt together
combined_df=resaleHDB_df.merge(coordinate_street, how='inner', left_on='street_name', right_on='street_name')
In [31]:
#Generate a new column for resale price per unit area
combined_df['price_per_sqm'] = combined_df['resale_price'].div(combined_df['floor_area_sqm'])
In [32]:
#Generate a new column for the distance from the CBD, using the coordinates of Raffles Place MRT Station: 103.8514631, 1.283933262
#Formula to calculate the distance from the CBD:
#Difference in latitude (km) = (latitude1 - latitude2) * 110.574
#Difference in longitude (km) = (longitude1 - longitude2) * 111.32
#Distance between 2 locations = (Difference in latitude^2 + Difference in longitude^2)^0.5
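As a sanity check, applying this formula to the ANG MO KIO AVE 1 coordinates retrieved earlier should give roughly 9.02 km from the CBD, matching the `dist_from_cbd` value that appears in the merged dataframe below:

```python
# ANG MO KIO AVE 1 (from coordinate_street) and Raffles Place MRT (CBD proxy)
lat1, long1 = 1.365271, 103.844819
lat2, long2 = 1.283933262, 103.8514631

dlat_km = (lat1 - lat2) * 110.574
dlong_km = (long1 - long2) * 111.32
dist_km = (dlat_km**2 + dlong_km**2) ** 0.5  # ~9.02 km
```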
In [33]:
dist_from_cbd = [ ((((row.LONGITUDE-103.8514631)*111.32)**2) + ((row.LATITUDE-1.283933262)*110.574)**2)**0.5 for index, row in combined_df.iterrows() ]
In [34]:
combined_df['dist_from_cbd']=dist_from_cbd
In [35]:
combined_df.head(10)
Out[35]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price ... year recalculate_remaining_lease storey_to storey_from average_storey LONGITUDE LATITUDE min_dist_mrt price_per_sqm dist_from_cbd
0 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 9000.0 ... 1990 86 10 12 NaN 103.844819 1.365271 691.062328 290.322581 9.024223
1 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 04 TO 06 31.0 IMPROVED 1977 6000.0 ... 1990 86 4 6 NaN 103.844819 1.365271 691.062328 193.548387 9.024223
2 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 8000.0 ... 1990 86 10 12 NaN 103.844819 1.365271 691.062328 258.064516 9.024223
3 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 07 TO 09 31.0 IMPROVED 1977 6000.0 ... 1990 86 7 9 NaN 103.844819 1.365271 691.062328 193.548387 9.024223
4 1990-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 04 TO 06 73.0 NEW GENERATION 1976 47200.0 ... 1990 85 4 6 NaN 103.844819 1.365271 691.062328 646.575342 9.024223
5 1990-01 ANG MO KIO 3 ROOM 308 ANG MO KIO AVE 1 07 TO 09 82.0 NEW GENERATION 1976 59000.0 ... 1990 85 7 9 NaN 103.844819 1.365271 691.062328 719.512195 9.024223
6 1990-01 ANG MO KIO 3 ROOM 308 ANG MO KIO AVE 1 10 TO 12 67.0 NEW GENERATION 1976 47200.0 ... 1990 85 10 12 NaN 103.844819 1.365271 691.062328 704.477612 9.024223
7 1990-01 ANG MO KIO 3 ROOM 220 ANG MO KIO AVE 1 07 TO 09 67.0 NEW GENERATION 1977 35000.0 ... 1990 86 7 9 NaN 103.844819 1.365271 691.062328 522.388060 9.024223
8 1990-01 ANG MO KIO 3 ROOM 219 ANG MO KIO AVE 1 07 TO 09 82.0 NEW GENERATION 1977 52000.0 ... 1990 86 7 9 NaN 103.844819 1.365271 691.062328 634.146341 9.024223
9 1990-01 ANG MO KIO 3 ROOM 320 ANG MO KIO AVE 1 07 TO 09 73.0 NEW GENERATION 1977 40000.0 ... 1990 86 7 9 NaN 103.844819 1.365271 691.062328 547.945205 9.024223

10 rows × 21 columns

In [36]:
#drop columns with missing values
clean_df=combined_df.drop(['remaining_lease','average_storey'],axis=1)
clean_df
Out[36]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price year recalculate_remaining_lease storey_to storey_from LONGITUDE LATITUDE min_dist_mrt price_per_sqm dist_from_cbd
0 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 9000.0 1990 86 10 12 103.844819 1.365271 691.062328 290.322581 9.024223
1 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 04 TO 06 31.0 IMPROVED 1977 6000.0 1990 86 4 6 103.844819 1.365271 691.062328 193.548387 9.024223
2 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 8000.0 1990 86 10 12 103.844819 1.365271 691.062328 258.064516 9.024223
3 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 07 TO 09 31.0 IMPROVED 1977 6000.0 1990 86 7 9 103.844819 1.365271 691.062328 193.548387 9.024223
4 1990-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 04 TO 06 73.0 NEW GENERATION 1976 47200.0 1990 85 4 6 103.844819 1.365271 691.062328 646.575342 9.024223
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
812310 2020-03 ANG MO KIO 3 ROOM 650 ANG MO KIO ST 61 16 TO 18 67.0 Model A 2015 435000.0 2020 94 16 18 103.842745 1.377691 517.528694 6492.537313 10.412512
812311 2020-02 QUEENSTOWN 4 ROOM 92 DAWSON RD 10 TO 12 95.0 Premium Apartment Loft 2016 843888.0 2020 95 10 12 103.810134 1.294248 440.079176 8883.031579 4.740046
812312 2020-02 QUEENSTOWN 5 ROOM 89 DAWSON RD 34 TO 36 99.0 Premium Apartment 2016 1038000.0 2020 95 34 36 103.810134 1.294248 440.079176 10484.848485 4.740046
812313 2020-03 QUEENSTOWN 4 ROOM 89 DAWSON RD 07 TO 09 83.0 Premium Apartment 2016 760000.0 2020 95 7 9 103.810134 1.294248 440.079176 9156.626506 4.740046
812314 2020-02 SENGKANG 5 ROOM 472C FERNVALE ST 16 TO 18 113.0 Improved 2016 572000.0 2020 95 16 18 103.881709 1.396902 153.395793 5061.946903 12.937255

812315 rows × 19 columns

In [37]:
#Since the exact floor of the resale flat is not provided in the dataset, use the midpoint of the storey range
clean_df['floorlevel']=clean_df[['storey_to','storey_from']].mean(axis=1)
In [38]:
clean_df.to_csv('finalsecondcleanHDBdatadraft2.csv',index=False)

3. Exploratory Data Analysis by visualisation

In [39]:
# Plotting Scatterplot to examine the relationship between HDB price and distance from mrt
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['min_dist_mrt'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set_xlabel('Distance_from_MRT(m)',fontsize=20)
g.set_ylabel('Price_per_sqm(SGD$)',fontsize=20)
g.set(xlim=(0,2000))
plt.show()
In [40]:
# Plotting Scatterplot to examine the relationship between HDB price and floorlevel
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['floorlevel'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set_xlabel('Floor_level',fontsize=20)
g.set_ylabel('Price_per_sqm(SGD$)',fontsize=20)
plt.show()
In [41]:
# Plotting Scatterplot to examine the relationship between HDB price and remaining_lease
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['recalculate_remaining_lease'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set_xlabel('Years of Remaining Lease',fontsize=20)
g.set_ylabel('Price_per_sqm(SGD$)',fontsize=20)
plt.show()
In [42]:
# Plotting Scatterplot to examine the relationship between flatsize and resale_price
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['floor_area_sqm'],y=clean_df['resale_price'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.40, box.height*0.40]) # resize position

g.set_xlabel('Flat size(sqm)',fontsize=20)
g.set_ylabel('Resale price(SGD$)',fontsize=20)

plt.show()
In [43]:
# Plotting Scatterplot to examine the relationship between HDB price and distance from CBD
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['dist_from_cbd'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set(xlim=(0,22))

g.set_xlabel('Distance from CBD(km)',fontsize=20)
g.set_ylabel('Price_per_sqm(SGD$)',fontsize=20)
plt.show()

4. Hedonic Regression modelling of resale prices

In [44]:
# Import the libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split,cross_val_score 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
In [45]:
#Build the dataframe to be used for modelling. Features used for HDB resale price estimation: remaining lease, flat size, distance from the MRT, distance from the CBD and floor level
In [46]:
hedonic_model_data=clean_df[['resale_price','recalculate_remaining_lease','floor_area_sqm','min_dist_mrt','dist_from_cbd','floorlevel']]
In [47]:
hedonic_model_data
Out[47]:
resale_price recalculate_remaining_lease floor_area_sqm min_dist_mrt dist_from_cbd floorlevel
0 9000.0 86 31.0 691.062328 9.024223 11.0
1 6000.0 86 31.0 691.062328 9.024223 5.0
2 8000.0 86 31.0 691.062328 9.024223 11.0
3 6000.0 86 31.0 691.062328 9.024223 8.0
4 47200.0 85 73.0 691.062328 9.024223 5.0
... ... ... ... ... ... ...
812310 435000.0 94 67.0 517.528694 10.412512 17.0
812311 843888.0 95 95.0 440.079176 4.740046 11.0
812312 1038000.0 95 99.0 440.079176 4.740046 35.0
812313 760000.0 95 83.0 440.079176 4.740046 8.0
812314 572000.0 95 113.0 153.395793 12.937255 17.0

812315 rows × 6 columns

4.1 Data Transformation (log-transform the resale price and all other variables) to reduce data skewness

In [48]:
#resale_price before transformation
#histogram
sns.distplot(hedonic_model_data['resale_price'], bins=50, kde=False)
plt.show()
In [49]:
#perform data transformation of resale_price
from scipy.stats import norm
from scipy import stats
#histogram and normal probability plot
sns.distplot(hedonic_model_data['resale_price'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['resale_price'], plot=plt)
In [50]:
hedonic_model_data['resale_price']=np.log(hedonic_model_data['resale_price'])
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [51]:
#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['resale_price'], fit=norm,  bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['resale_price'], plot=plt)
In [52]:
sns.distplot(hedonic_model_data['recalculate_remaining_lease'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['recalculate_remaining_lease'], plot=plt)
In [53]:
#data transformation (log transformation of remaining_lease)
hedonic_model_data['recalculate_remaining_lease'] = np.log(hedonic_model_data['recalculate_remaining_lease'])
In [54]:
#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['recalculate_remaining_lease'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['recalculate_remaining_lease'], plot=plt)
In [55]:
#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['floor_area_sqm'], fit=norm,  bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['floor_area_sqm'], plot=plt)
In [56]:
#data transformation (log transformation of floor_area_sqm)
hedonic_model_data['floor_area_sqm']= np.log(hedonic_model_data['floor_area_sqm'])
In [57]:
#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['floor_area_sqm'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['floor_area_sqm'], plot=plt)
In [58]:
#data transformation (log transformation of min_dist_mrt)
hedonic_model_data['min_dist_mrt']= np.log(hedonic_model_data['min_dist_mrt'])
In [59]:
#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['min_dist_mrt'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['min_dist_mrt'], plot=plt)
In [60]:
#data transformation (log transformation of dist_from_cbd)
hedonic_model_data['dist_from_cbd']= np.log(hedonic_model_data['dist_from_cbd'])
In [61]:
#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['dist_from_cbd'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['dist_from_cbd'], plot=plt)

4.2 Data Preparation to fit into Regression Model

In [62]:
#Our target variable is resale price
#The rest are our explanatory (independent) variables
y = hedonic_model_data.iloc[:,0].values
X = hedonic_model_data.iloc[:,1:6].values
In [63]:
print(y)
[ 9.10497986  8.69951475  8.98719682 ... 13.85280634 13.54107371
 13.25689427]
In [64]:
y=y.reshape((-1, 1))
In [65]:
print(y)
[[ 9.10497986]
 [ 8.69951475]
 [ 8.98719682]
 ...
 [13.85280634]
 [13.54107371]
 [13.25689427]]
In [66]:
print(X)
[[ 4.4543473   3.4339872   6.53823002  2.19991242 11.        ]
 [ 4.4543473   3.4339872   6.53823002  2.19991242  5.        ]
 [ 4.4543473   3.4339872   6.53823002  2.19991242 11.        ]
 ...
 [ 4.55387689  4.59511985  6.08695466  1.5560469  35.        ]
 [ 4.55387689  4.41884061  6.08695466  1.5560469   8.        ]
 [ 4.55387689  4.72738782  5.03302147  2.56011115 17.        ]]

4.3 Training the Regression Model

In [67]:
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
In [68]:
# Fitting Multiple Linear Regression to the Training set
regression = LinearRegression()
regression.fit(X_train, y_train)
Out[68]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [69]:
X_test
Out[69]:
array([[ 4.41884061,  4.29045944,  5.82349365,  2.68551385,  5.        ],
       [ 4.41884061,  4.17438727,  6.37147795,  1.71488222, 11.        ],
       [ 4.41884061,  4.79579055,  5.32283293,  2.71604536,  5.        ],
       ...,
       [ 4.55387689,  4.4308168 ,  6.83244097,  2.93284319, 11.        ],
       [ 4.39444915,  4.21950771,  5.55425558,  2.7918474 ,  2.        ],
       [ 4.52178858,  4.29045944,  5.14609487,  2.61849239,  5.        ]])
In [70]:
# Predicting the Test set results
y_pred = regression.predict(X_test)
In [71]:
df = pd.DataFrame(data=y_test, columns=['y_test'])
In [72]:
df['y_pred'] = y_pred
In [73]:
df
Out[73]:
y_test y_pred
0 11.995352 11.977201
1 10.714418 11.891205
2 12.337101 12.843526
3 12.881565 13.059468
4 13.028053 12.988736
... ... ...
162458 12.013701 12.149067
162459 12.899220 13.177646
162460 12.001505 12.026516
162461 11.794338 11.859944
162462 12.072541 11.829116

162463 rows × 2 columns

In [74]:
pred_resaleprice=np.exp(y_pred)
In [75]:
original_resaleprice=np.exp(y_test)
In [76]:
df2 = pd.DataFrame(data=original_resaleprice, columns=['original_resaleprice'])
In [77]:
df2['pred_resaleprice(LinearRegression)']=pred_resaleprice
In [78]:
df2.head(10)
Out[78]:
original_resaleprice pred_resaleprice(LinearRegression)
0 162000.0 159086.091431
1 45000.0 145977.145520
2 228000.0 378331.384911
3 393000.0 469520.953648
4 455000.0 437457.850855
5 156000.0 188416.218103
6 320000.0 312732.222292
7 505000.0 421527.193041
8 310000.0 127777.436617
9 275000.0 297963.199907
In [79]:
# Since all values are numeric, compute correlations and sort to determine the most important features relative to HDB resale prices
corr = hedonic_model_data.corr()
corr.sort_values(["resale_price"], ascending = False, inplace = True)
print(corr["resale_price"])
resale_price                   1.000000
floor_area_sqm                 0.632594
floorlevel                     0.153240
dist_from_cbd                  0.084885
min_dist_mrt                  -0.017710
recalculate_remaining_lease   -0.032795
Name: resale_price, dtype: float64
In [80]:
from sklearn.metrics import r2_score

yhat_train = regression.predict(X_train)
yhat_test = regression.predict(X_test)
print("R2 Score for train_dataset(LinearRegression)={}".format(r2_score(y_train,yhat_train))) # R2 score, 1.0 is best
print("R2 Score for test_dataset(LinearRegression)={}".format(r2_score(y_test,yhat_test)))
R2 Score for train_dataset(LinearRegression)=0.5283939441894563
R2 Score for test_dataset(LinearRegression)=0.5263996650021419
In [81]:
accuracy = regression.score(X_train, y_train)
"R2 score for the linear regression on the train dataset is: {}%".format(int(round(accuracy * 100)))
Out[81]:
'R2 score for the linear regression on the train dataset is: 53%'
In [82]:
accuracy = regression.score(X_test, y_test)
"R2 score for the linear regression on the test dataset is: {}%".format(int(round(accuracy * 100)))
Out[82]:
'R2 score for the linear regression on the test dataset is: 53%'
In [83]:
# Plot predictions between train and test data set
plt.scatter(yhat_train ,y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(yhat_test,y_test, c = "red", marker = "s", label = "Validation data")
plt.title("Linear regression")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc = "upper left")
plt.plot([11.5, 15.5], [11.5, 15.5], c = "red")
plt.show()
In [84]:
from sklearn.metrics import mean_squared_error
print("RMSE for train_set(LinearRegression) is: {}".format(mean_squared_error(y_train, yhat_train)))
RMSE for train_set(LinearRegression) is: 0.16479033312033647
In [85]:
from sklearn.metrics import mean_squared_error
print("RMSE for test_set(LinearRegression) is: {}".format(mean_squared_error(y_test, yhat_test)))
RMSE for test_set(LinearRegression) is: 0.16357068439514563
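Note that sklearn's `mean_squared_error` returns the mean squared error; the RMSE is its square root, so a train MSE of ~0.165 corresponds to an RMSE of ~0.41 in log-price units. A small helper (sketch):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root-mean-squared error: square root of the mean squared residual
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

e.g. `rmse(y_train, yhat_train)` on the model above gives the error on the same log scale as the target.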

4.4 Boosting the accuracy of pricing models with the Gradient Boosting Regressor

In [86]:
from sklearn import ensemble

# np.ravel() flattens y to 1-D, avoiding sklearn's column-vector DataConversionWarning
GB_default = ensemble.GradientBoostingRegressor().fit(X_train, np.ravel(y_train))
In [87]:
from sklearn.metrics import r2_score

y_GB_train = GB_default.predict(X_train)
y_GB_test = GB_default.predict(X_test)
print("R2 Score for train_dataset(GBregression)={}".format(r2_score(y_train,y_GB_train))) # R2 score, 1.0 is best
print("R2 Score for test_dataset(GBregression)={}".format(r2_score(y_test,y_GB_test)))
R2 Score for train_dataset(GBregression)=0.66418457210734
R2 Score for test_dataset(GBregression)=0.66326078272227
In [88]:
# Average R2 score and standard deviation of 5-fold cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(GB_default, X_test, np.ravel(y_test), cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.66 (+/- 0.01)
In [89]:
# Average R2 score and standard deviation of 5-fold cross-validation
scores = cross_val_score(GB_default, X_train, np.ravel(y_train), cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.66 (+/- 0.00)
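The "mean (+/- 2*std)" format above reports the average fold score with a spread of two standard deviations, roughly a 95% interval over the folds. A sketch of the same calculation on hypothetical fold scores:

```python
import numpy as np

# Hypothetical 5-fold R2 scores, similar in shape to the output above
scores = np.array([0.66, 0.65, 0.67, 0.66, 0.66])
mean, spread = scores.mean(), 2 * scores.std()
print("Accuracy: %0.2f (+/- %0.2f)" % (mean, spread))
```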
In [90]:
accuracy = GB_default.score(X_train, y_train)
"Accuracy for train_set(GBregression) is: {}%".format(int(round(accuracy * 100)))
Out[90]:
'Accuracy for train_set(GBregression) is: 66%'
In [91]:
accuracy = GB_default.score(X_test, y_test)
"Accuracy for test_set(GBregression) is: {}%".format(int(round(accuracy * 100)))
Out[91]:
'Accuracy for test_set(GBregression) is: 66%'
In [92]:
from sklearn.metrics import mean_squared_error
print("MSE for train_set(GBregression) is: {}".format(mean_squared_error(y_train, y_GB_train)))
MSE for train_set(GBregression) is: 0.11734186944285323
In [93]:
from sklearn.metrics import mean_squared_error
print("MSE for test_set(GBregression) is: {}".format(mean_squared_error(y_test, y_GB_test)))
MSE for test_set(GBregression) is: 0.11630199592880998
In [94]:
# Plot predictions between train and test data set
plt.scatter(y_GB_train , y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_GB_test, y_test, c = "red", marker = "s", label = "Validation data")
plt.title("Gradient Boosting Regressor")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc = "upper left")
plt.plot([11.5, 15.5], [11.5, 15.5], c = "red")
plt.show()
In [95]:
pred_resaleprice_GB = np.exp(y_GB_test)  # invert the log transform to recover prices in dollars
In [96]:
original_resaleprice=np.exp(y_test)
In [97]:
GB_df = pd.DataFrame(data=original_resaleprice, columns=['original_resaleprice'])
In [98]:
GB_df['pred_resaleprice(GB)']=pred_resaleprice_GB
In [99]:
GB_df.head(10)
Out[99]:
original_resaleprice pred_resaleprice(GB)
0 162000.0 157313.904895
1 45000.0 95920.476177
2 228000.0 352681.001643
3 393000.0 423058.559235
4 455000.0 393523.911626
5 156000.0 164427.684788
6 320000.0 337801.573486
7 505000.0 424158.909266
8 310000.0 92818.295564
9 275000.0 280507.717972
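A quick way to quantify the comparison above is the absolute percentage error per row. A sketch using a small illustrative subset of the values from the table:

```python
import pandas as pd

# Illustrative subset of the original-vs-predicted comparison above
df = pd.DataFrame({
    "original_resaleprice": [162000.0, 455000.0, 310000.0],
    "pred_resaleprice(GB)": [157313.90, 393523.91, 92818.30],
})
# Absolute percentage error of each prediction
df["abs_pct_error"] = (
    (df["pred_resaleprice(GB)"] - df["original_resaleprice"]).abs()
    / df["original_resaleprice"] * 100
)
print(df.round(1))
```

Rows like index 8 in the table above (predicted ~93k vs actual 310k) stand out immediately under this view.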
In [100]:
#Examining the feature importance being analysed by the GB model
import pandas as pd

pd.concat((pd.DataFrame(hedonic_model_data.iloc[:, 1:].columns, columns = ['variable']), 
           pd.DataFrame(GB_default.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:20]
Out[100]:
variable importance
1 floor_area_sqm 0.696131
0 recalculate_remaining_lease 0.210749
3 dist_from_cbd 0.057097
4 floorlevel 0.022349
2 min_dist_mrt 0.013674
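The importances from a `GradientBoostingRegressor` are normalized to sum to 1, so each value can be read as that feature's share of the model's total split gain. A minimal sketch on synthetic data (not the HDB data) where one feature dominates the target:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: feature 0 drives the target, feature 2 is pure noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

gb = GradientBoostingRegressor(random_state=0).fit(X, y)
importances = gb.feature_importances_
print(importances)  # dominant feature gets the largest share
```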

5. Conclusions

The Gradient Boosting Regressor is the best model for this data set because it has:
- the highest accuracy scores
- the highest R2 scores
- the lowest mean squared error

floor_area_sqm and the number of years of lease remaining are the two most important features in predicting the HDB resale flat price.

In [ ]: