Modelling of Singapore HDB resale price¶

1. Problem¶

There are multiple factors affecting the selling price of HDB resale flats in Singapore. My objective is to construct a regression model to estimate how the selling price of a HDB resale flat changes based on its distance to the CBD and the nearest MRT station, as well as its flat size, floor level and the years of lease remaining on the flat.

1.1 Objective¶

To predict selling price of a HDB resale flat with a regression model, based on the explanatory variables given: (1)distance from CBD (2)distance from nearest MRT station (3)flat size (4)floor level (5)years of lease remaining

There is only 1 target output variable, which is the resale price of the HDB which is what we would like to predict. Previously, I have built a Python tool to extract the Longitute and Latitude of the MRT station and CDB region from the onemap.sg API.

2. Data Mining (Collection of the raw data needed for the problem) and Feature engineering¶

#import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#extract longitude/latitude information based on street name

import json
import requests
import openpyxl
import time

# Open Excel sheet
wb = openpyxl.load_workbook('out.xlsx')

#sheet = wb['out']
sheet = wb.active

sheet['B1'] = 'LONGITUDE'
sheet['C1'] = 'LATITUDE'

for irow in range(2, sheet.max_row+1):
    street_name = sheet['A' + str(irow)].value
    query_string = 'https://developers.onemap.sg/commonapi/search?searchVal=' + str(
         street_name) + '&returnGeom=Y&getAddrDetails=N&pageNum=1'
    result = requests.get(query_string)

    # Convert JSON into Python Object
    data = json.loads(result.content)
    if data['results'] == []: ## check if the longitude and latitude can be found, if not, update to NA
        sheet['B' + str(irow)].value = 'NA'    
        sheet['C' + str(irow)].value = 'NA'    
    else:
        # Extract data from JSON
        sheet['B' + str(irow)].value = data['results'][0]['LONGITUDE']    
        sheet['C' + str(irow)].value = data['results'][0]['LATITUDE']
    
wb.save('Street_Name_List.xlsx')
print('Job Done.')

Job Done.

sheet['B1'] = 'LONGITUDE'
sheet['C1'] = 'LATITUDE'

for irow in range(2, sheet.max_row+1):
    street_name = sheet['A' + str(irow)].value
    query_string = 'https://developers.onemap.sg/commonapi/search?searchVal=' + str(
         street_name) + '&returnGeom=Y&getAddrDetails=N&pageNum=1'
    result = requests.get(query_string)

    # Convert JSON into Python Object
    data = json.loads(result.content)
    if data['results'] == []: ## check if the longitude and latitude can be found, if not, update to NA
        sheet['B' + str(irow)].value = 'NA'    
        sheet['C' + str(irow)].value = 'NA'    
    else:
        # Extract data from JSON
        sheet['B' + str(irow)].value = data['results'][0]['LONGITUDE']    
        sheet['C' + str(irow)].value = data['results'][0]['LATITUDE']
    
wb.save('Street_Name_List.xlsx')
print('Job Done.')

Job Done.

sheet2 = wb.active

print(sheet2.max_row)

544

sheet2['B1'] = 'LONGITUDE'
sheet2['C1'] = 'LATITUDE'

for irow in range(2, 100):
    mrt_station_english = sheet2['A' + str(irow)].value
    query_string = 'https://developers.onemap.sg/commonapi/search?searchVal=' + str(
         mrt_station_english) + '&returnGeom=Y&getAddrDetails=N&pageNum=1'
    
    result = requests.get(query_string)

    # Convert JSON into Python Object
    data = json.loads(result.content)

    # Extract data from JSON
    sheet2['B' + str(irow)].value = data['results'][0]['LONGITUDE']    
    sheet2['C' + str(irow)] = data['results'][0]['LATITUDE']

wb.save('mrt_List_withlong.xlsx')
print('Job Done.')

Job Done.

#Trying to retrieve the data of one station (Clementi)
query_string_clementiMRTstation='https://developers.onemap.sg/commonapi/search?searchVal=' + "CLEMENTI " + '&returnGeom=Y&getAddrDetails=Y&pageNum=1'

result_clementi= requests.get(query_string_clementiMRTstation)

# Convert JSON into Python Object
data2 = json.loads(result_clementi.content)

print(data2)

{'found': 688, 'totalNumPages': 69, 'pageNum': 1, 'results': [{'SEARCHVAL': 'CLEMENTI GREEN', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTI GREEN', 'ADDRESS': 'CLEMENTI GREEN', 'POSTAL': 'NIL', 'X': '21371.60329', 'Y': '34043.70148', 'LATITUDE': '1.3241529440000002', 'LONGITUDE': '103.7737589', 'LONGTITUDE': '103.7737589'}, {'SEARCHVAL': 'CLEMENTIWOODS CONDOMINIUM', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTIWOODS CONDOMINIUM', 'ADDRESS': 'CLEMENTIWOODS CONDOMINIUM', 'POSTAL': 'NIL', 'X': '20686.3451', 'Y': '30967.179439999996', 'LATITUDE': '1.296329825', 'LONGITUDE': '103.7676022', 'LONGTITUDE': '103.7676022'}, {'SEARCHVAL': 'CLEMENTI GARDENS', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTI GARDENS', 'ADDRESS': 'CLEMENTI GARDENS', 'POSTAL': 'NIL', 'X': '21400.35712', 'Y': '33957.86196', 'LATITUDE': '1.323376648', 'LONGITUDE': '103.7740173', 'LONGTITUDE': '103.7740173'}, {'SEARCHVAL': 'CLEMENTI PARK', 'BLK_NO': '', 'ROAD_NAME': 'NIL', 'BUILDING': 'CLEMENTI PARK', 'ADDRESS': 'CLEMENTI PARK', 'POSTAL': 'NIL', 'X': '20505.1575', 'Y': '34640.21757', 'LATITUDE': '1.32954742', 'LONGITUDE': '103.7659733', 'LONGTITUDE': '103.7659733'}, {'SEARCHVAL': 'CLEMENTI PEAKS', 'BLK_NO': '463', 'ROAD_NAME': 'CLEMENTI AVENUE 1', 'BUILDING': 'CLEMENTI PEAKS', 'ADDRESS': '463 CLEMENTI AVENUE 1 CLEMENTI PEAKS SINGAPORE 120463', 'POSTAL': '120463', 'X': '20715.45633', 'Y': '32585.55105', 'LATITUDE': '1.310965805', 'LONGITUDE': '103.76786340000001', 'LONGTITUDE': '103.76786340000001'}, {'SEARCHVAL': 'CLEMENTI GATEWAY', 'BLK_NO': '208', 'ROAD_NAME': 'CLEMENTI AVENUE 6', 'BUILDING': 'CLEMENTI GATEWAY', 'ADDRESS': '208 CLEMENTI AVENUE 6 CLEMENTI GATEWAY SINGAPORE 120208', 'POSTAL': '120208', 'X': '20076.31474', 'Y': '33849.4405', 'LATITUDE': '1.322395808', 'LONGITUDE': '103.76212009999999', 'LONGTITUDE': '103.76212009999999'}, {'SEARCHVAL': 'CLEMENTI MEADOWS', 'BLK_NO': '318A', 'ROAD_NAME': 'CLEMENTI AVENUE 4', 'BUILDING': 'CLEMENTI MEADOWS', 'ADDRESS': '318A CLEMENTI AVENUE 4 CLEMENTI MEADOWS SINGAPORE 121318', 'POSTAL': '121318', 'X': '20417.167709999998', 'Y': '33302.05542', 'LATITUDE': '1.3174455409999999', 'LONGITUDE': '103.765183', 'LONGTITUDE': '103.765183'}, {'SEARCHVAL': 'CLEMENTI NORTHARC', 'BLK_NO': '210', 'ROAD_NAME': 'CLEMENTI AVENUE 6', 'BUILDING': 'CLEMENTI NORTHARC', 'ADDRESS': '210 CLEMENTI AVENUE 6 CLEMENTI NORTHARC SINGAPORE 120210', 'POSTAL': '120210', 'X': '20125.72735', 'Y': '33948.67057', 'LATITUDE': '1.323293222', 'LONGITUDE': '103.76256409999999', 'LONGTITUDE': '103.76256409999999'}, {'SEARCHVAL': 'CLEMENTI CREST', 'BLK_NO': '445', 'ROAD_NAME': 'CLEMENTI AVENUE 3', 'BUILDING': 'CLEMENTI CREST', 'ADDRESS': '445 CLEMENTI AVENUE 3 CLEMENTI CREST SINGAPORE 120445', 'POSTAL': '120445', 'X': '20274.65972', 'Y': '32850.22638', 'LATITUDE': '1.313359326', 'LONGITUDE': '103.7639026', 'LONGTITUDE': '103.7639026'}, {'SEARCHVAL': 'CLEMENTI RIDGES', 'BLK_NO': '312', 'ROAD_NAME': 'CLEMENTI AVENUE 4', 'BUILDING': 'CLEMENTI RIDGES', 'ADDRESS': '312 CLEMENTI AVENUE 4 CLEMENTI RIDGES SINGAPORE 120312', 'POSTAL': '120312', 'X': '20281.04805', 'Y': '33481.06633', 'LATITUDE': '1.319064417', 'LONGITUDE': '103.7639598', 'LONGTITUDE': '103.7639598'}]}

#I have generated a list of street names from the street_name column of the HDBresales dataset
#By using the street names as inputs for the onemap.sg Geocode API, I could retrieve the information on the longitude(long) and latitude(lat) of the street.
#Street_Name_List(5).csv dataset has already all the retrieved information of the longitude and Latitude coordinates of the street as indicated below.
coordinate_street=pd.read_csv("Street_Name_List (5).csv")

coordinate_street.head()

#drop the rows with missing information on the longitude and latitude
coordinate_street.head()
coordinate_street=coordinate_street.dropna(axis = 0, how ='any')

#Now we have both the lat and long of our dataset of streetname, we shall make a list to generate the distance from the nearest mrt  
list_of_lat = coordinate_street['LATITUDE']
list_of_long = coordinate_street['LONGITUDE']

# List if coordinates of all unique addresses in our dataset
list_of_coordinates = []
for lat, long in zip(list_of_lat, list_of_long):
    list_of_coordinates.append((lat,long))

len(list_of_coordinates)

542

#I have made a list of all the MRT station names from the webpage:https://data.gov.sg/dataset/train-station-chinese-names
#By using the MRT station names as inputs for the onemap.sg Geocode API, I could retrieve the information on the longitude(long) and latitude(lat) of the MRT station.
#Street_Name_List(5).csv dataset has already all the retrieved information of the longitude and Latitude coordinates of the street as indicated below.
mrt_location=pd.read_csv("MRT_Log_lat_List.csv")

#drop the rows with missing information on the longitude and latitude
mrt_location=mrt_location.dropna(axis = 0, how ='any')

mrt_location.head()

# List of MRT Coordinates in Singapore
mrt_lat = mrt_location['LATITUDE']
mrt_long = mrt_location['LONGITUDE']

list_of_mrt_coordinates = []

for lat, long in zip(mrt_lat, mrt_long):
    list_of_mrt_coordinates.append((lat, long))
    
len(list_of_mrt_coordinates)

183

# Calculting the distance to nearest MRT
from geopy.distance import geodesic

list_of_dist_mrt = []
min_dist_mrt = []
for origin in list_of_coordinates:
    for destination in range(0, len(list_of_mrt_coordinates)):        
        list_of_dist_mrt.append(geodesic(origin,list_of_mrt_coordinates[destination]).meters)

    shortest = (min(list_of_dist_mrt))
    min_dist_mrt.append(shortest)
    list_of_dist_mrt.clear()

coordinate_street['min_dist_mrt'] = min_dist_mrt

print(coordinate_street.head())

         street_name   LONGITUDE  LATITUDE  min_dist_mrt
0  ANG MO KIO AVE 10  103.855026  1.360136   1149.779402
1   ANG MO KIO AVE 4  103.838826  1.378136    774.027817
2   ANG MO KIO AVE 5  103.857187  1.375021   1059.623897
3   ANG MO KIO AVE 1  103.844819  1.365271    691.062328
4   ANG MO KIO AVE 3  103.847809  1.367542    277.657001

#Retrieving the dataset on HDB resales price pooled from the original data files which contains the resale-flat-prices dated from 1990 to 2020 
resaleHDB_df=pd.read_csv("pooledHDB.csv")

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

print(resaleHDB_df.head())

     month        town flat_type block       street_name storey_range  \
0  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
1  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     04 TO 06   
2  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     10 TO 12   
3  1990-01  ANG MO KIO    1 ROOM   309  ANG MO KIO AVE 1     07 TO 09   
4  1990-01  ANG MO KIO    3 ROOM   216  ANG MO KIO AVE 1     04 TO 06   

   floor_area_sqm      flat_model  lease_commence_date  resale_price  \
0            31.0        IMPROVED                 1977        9000.0   
1            31.0        IMPROVED                 1977        6000.0   
2            31.0        IMPROVED                 1977        8000.0   
3            31.0        IMPROVED                 1977        6000.0   
4            73.0  NEW GENERATION                 1976       47200.0   

  remaining_lease  year  recalculate_remaining_lease  storey_to  storey_from  \
0             NaN  1990                           86         10           12   
1             NaN  1990                           86          4            6   
2             NaN  1990                           86         10           12   
3             NaN  1990                           86          7            9   
4             NaN  1990                           85          4            6   

   average_storey  
0             NaN  
1             NaN  
2             NaN  
3             NaN  
4             NaN

#Merging the pooled HDB resale, longitude and latitude datasets of street address and  min_dist_mrt together
combined_df=resaleHDB_df.merge(coordinate_street, how='inner', left_on='street_name', right_on='street_name')

#Generate a new column for resale price per unit area
combined_df['price_per_sqm'] = combined_df['resale_price'].div(combined_df['floor_area_sqm'])

#Generate a new column to calculate the distance from CBD using the coordinates of Raffles Place MRT Station: 103.8514631, 1.283933262
#Formulate to calculate distance from CBD
#Difference in Latitude=(latitude1-latitude2)*110.574
#Difference in Longitude=(Longitude1-longitude2)*111.32
#Distance between 2 locations=(Distance in Latitude^2+Distance in Longitude^)^0.5

dist_from_cbd = [ ((((row.LONGITUDE-103.8514631)*111.32)**2) + ((row.LATITUDE-1.283933262)*110.574)**2)**0.5 for index, row in combined_df.iterrows() ]

combined_df['dist_from_cbd']=dist_from_cbd

combined_df.head(10)

#drop columns with missing values
clean_df=combined_df.drop(['remaining_lease','average_storey'],axis=1)
clean_df

#Since the detail of the exact floor of the resale house is not provided in the dataset, utilise the average of the range
clean_df['floorlevel']=clean_df[['storey_to','storey_from']].mean(axis=1)

clean_df.to_csv('finalsecondcleanHDBdatadraft2.csv',index=False)

3. Exploratory Data Analysis by visualisation¶

# Plotting Scatterplot to examine the relationship between HDB price and distance from mrt
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['min_dist_mrt'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set_xlabel('Distance_from_MRT(m)',fontsize=20)
g.set_ylabel('Price_per_sqm(SGD$)',fontsize=20)
g.set(xlim=(0,2000))
plt.show()

# Plotting Scatterplot to examine the relationship between HDB price and floorlevel
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['floorlevel'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set_xlabel('Floor_level',fontsize=20)
g.set_ylabel('Price_per_sqm(SGD$)',fontsize=20)
plt.show()

# Plotting Scatterplot to examine the relationship between HDB price and remaining_lease
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['recalculate_remaining_lease'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set_xlabel('Years of Remaining Lease',fontsize=20)
g.set_ylabel('Price_per_sqm(SGD$)',fontsize=20)
plt.show()

# Plotting Scatterplot to examine the relationship between flatsize and resale_price
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['floor_area_sqm'],y=clean_df['resale_price'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.40, box.height*0.40]) # resize position

g.set_xlabel('Flat size(sqm)',fontsize=20)
g.set_ylabel('Resale price(SGD$)',fontsize=20)

plt.show()

# Plotting Scatterplot to examine the relationship between HDB price and distance from CBD
fig, ax1 = plt.subplots(1,1, figsize=(20,12))
g = sns.scatterplot(x=clean_df['dist_from_cbd'],y=clean_df['price_per_sqm'], hue=clean_df['flat_type'], ax=ax1) 

box = g.get_position()
g.set_position([box.x0, box.y0, box.width * 0.85, box.height]) # resize position

g.legend(loc='center right', bbox_to_anchor=(1.2, 0.5), ncol=1)
g.set(xlim=(0,22))

g.set_xlabel('Distance from CBD(km)',fontsize=20)
g.set_ylabel('Resale price(SGD$)',fontsize=20)
plt.show()

4. Hedonic Regression modelling of resales prices¶

# Import the libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split,cross_val_score 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

#We want to build the dataframe to be used for the modelling, features to be used for HDB resale price estimations: distance from mrt,flat size,floorlevel,distance from CBD

hedonic_model_data=clean_df[['resale_price','recalculate_remaining_lease','floor_area_sqm','min_dist_mrt','dist_from_cbd','floorlevel']]

hedonic_model_data

4.1 Data Transformation¶(Normalise the resale price and all other variables) to reduce data skewness¶

#resale_price before transformation
#histogram
sns.distplot(hedonic_model_data['resale_price'], bins=50, kde=False)
plt.show()

#perform data transformation of resale_price
from scipy.stats import norm
from scipy import stats
#histogram and normal probability plot
sns.distplot(hedonic_model_data['resale_price'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['resale_price'], plot=plt)

hedonic_model_data['resale_price']=np.log(hedonic_model_data['resale_price'])

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['resale_price'], fit=norm,  bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['resale_price'], plot=plt)

sns.distplot(hedonic_model_data['recalculate_remaining_lease'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['recalculate_remaining_lease'], plot=plt)

#data transformation (normalisaton of remaining_lease)
hedonic_model_data['recalculate_remaining_lease'] = np.log(hedonic_model_data['recalculate_remaining_lease'])

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['recalculate_remaining_lease'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['recalculate_remaining_lease'], plot=plt)

#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['floor_area_sqm'], fit=norm,  bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['floor_area_sqm'], plot=plt)

#data transformation (normalisaton of floor_area_sqm)
hedonic_model_data['floor_area_sqm']= np.log(hedonic_model_data['floor_area_sqm'])

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['floor_area_sqm'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['floor_area_sqm'], plot=plt)

#data transformation (normalisaton of min_dist_mrt )
hedonic_model_data['min_dist_mrt']= np.log(hedonic_model_data['min_dist_mrt'])

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['min_dist_mrt'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['min_dist_mrt'], plot=plt)

#data transformation ()
hedonic_model_data['dist_from_cbd']= np.log(hedonic_model_data['dist_from_cbd'])

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

#transformed histogram and normal probability plot
sns.distplot(hedonic_model_data['dist_from_cbd'], fit=norm, bins=50, kde=False);
fig = plt.figure()
res = stats.probplot(hedonic_model_data['dist_from_cbd'], plot=plt)

4.2 Data Preparation to fit into Regression Model¶

#Our target variable is resale price
#The rest are our non-dependent variable
y = hedonic_model_data.iloc[:,0].values
X = hedonic_model_data.iloc[:,1:6].values

print(y)

[ 9.10497986  8.69951475  8.98719682 ... 13.85280634 13.54107371
 13.25689427]

y=y.reshape((-1, 1))

print(y)

[[ 9.10497986]
 [ 8.69951475]
 [ 8.98719682]
 ...
 [13.85280634]
 [13.54107371]
 [13.25689427]]

print(X)

[[ 4.4543473   3.4339872   6.53823002  2.19991242 11.        ]
 [ 4.4543473   3.4339872   6.53823002  2.19991242  5.        ]
 [ 4.4543473   3.4339872   6.53823002  2.19991242 11.        ]
 ...
 [ 4.55387689  4.59511985  6.08695466  1.5560469  35.        ]
 [ 4.55387689  4.41884061  6.08695466  1.5560469   8.        ]
 [ 4.55387689  4.72738782  5.03302147  2.56011115 17.        ]]

4.3 Training of data into Regression Model¶

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Fitting Multiple Linear Regression to the Training set
regression = LinearRegression()
regression.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

X_test

array([[ 4.41884061,  4.29045944,  5.82349365,  2.68551385,  5.        ],
       [ 4.41884061,  4.17438727,  6.37147795,  1.71488222, 11.        ],
       [ 4.41884061,  4.79579055,  5.32283293,  2.71604536,  5.        ],
       ...,
       [ 4.55387689,  4.4308168 ,  6.83244097,  2.93284319, 11.        ],
       [ 4.39444915,  4.21950771,  5.55425558,  2.7918474 ,  2.        ],
       [ 4.52178858,  4.29045944,  5.14609487,  2.61849239,  5.        ]])

# Predicting the Test set results
y_pred = regression.predict(X_test)

df = pd.DataFrame(data=y_test, columns=['y_test'])

df['y_pred'] = y_pred

df

pred_resaleprice=np.exp(y_pred)

original_resaleprice=np.exp(y_test)

df2 = pd.DataFrame(data=original_resaleprice, columns=['original_resaleprice'])

df2['pred_resaleprice(LinearRegression)']=pred_resaleprice

df2.head(10)

# Since all values are numeric, do a correction and sort to determine the most important features relative to HDB resale prices
corr = hedonic_model_data.corr()
corr.sort_values(["resale_price"], ascending = False, inplace = True)
print(corr["resale_price"])

resale_price                   1.000000
floor_area_sqm                 0.632594
floorlevel                     0.153240
dist_from_cbd                  0.084885
min_dist_mrt                  -0.017710
recalculate_remaining_lease   -0.032795
Name: resale_price, dtype: float64

from sklearn.metrics import r2_score

yhat_train = regression.predict(X_train)
yhat_test = regression.predict(X_test)
print("R2 Score for train_dataset(LinearRegression)={}".format(r2_score(y_train,yhat_train))) # R2 score, 1.0 is best
print("R2 Score for test_dataset(LinearRegression)={}".format(r2_score(y_test,yhat_test)))

R2 Score for train_dataset(LinearRegression)=0.5283939441894563
R2 Score for test_dataset(LinearRegression)=0.5263996650021419

accuracy = regression.score(X_train, y_train)
"Accuracy for linear regression algorithm for train dataset is: {}%".format(int(round(accuracy * 100)))

'Accuracy for linear regression algorithm for train dataset is: 53%'

accuracy = regression.score(X_test, y_test)
"Accuracy for linear regression algorithm for test dataset is: {}%".format(int(round(accuracy * 100)))

'Accuracy for linear regression algorithm for test dataset is: 53%'

# Plot predictions between train and test data set
plt.scatter(yhat_train ,y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(yhat_test,y_test, c = "red", marker = "s", label = "Validation data")
plt.title("Linear regression")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc = "upper left")
plt.plot([11.5, 15.5], [11.5, 15.5], c = "red")
plt.show()

from sklearn.metrics import mean_squared_error
print("RMSE for train_set(LinearRegression) is: {}".format(mean_squared_error(y_train, yhat_train)))

RMSE for train_set(LinearRegression) is: 0.16479033312033647

from sklearn.metrics import mean_squared_error
print("RMSE for test_set(LinearRegression) is: {}".format(mean_squared_error(y_test, yhat_test)))

RMSE for test_set(LinearRegression) is: 0.16357068439514563

4.4 Boosting the accuracy of pricing models with Gradient Boosting Regressor Method¶

from sklearn import ensemble

GB_default = ensemble.GradientBoostingRegressor().fit(X_train, y_train)

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

from sklearn.metrics import r2_score

y_GB_train = GB_default.predict(X_train)
y_GB_test = GB_default.predict(X_test)
print("R2 Score for train_dataset(GBregression)={}".format(r2_score(y_train,y_GB_train))) # R2 score, 1.0 is best
print("R2 Score for test_dataset(GBregression)={}".format(r2_score(y_test,y_GB_test)))

R2 Score for train_dataset(GBregression)=0.66418457210734
R2 Score for test_dataset(GBregression)=0.66326078272227

# Average R2 score and standard deviation of 5-fold cross-validation
scores = cross_val_score(GB_default, X_test, y_test, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Accuracy: 0.66 (+/- 0.01)

# Average R2 score and standard deviation of 5-fold cross-validation
scores = cross_val_score(GB_default, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Hwahwa\Anaconda3\envs\PythonCPU\lib\site-packages\sklearn\ensemble\_gb.py:1454: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Accuracy: 0.66 (+/- 0.00)

accuracy = GB_default.score(X_train, y_train)
"Accuracy for train_set(GBregression) is: {}%".format(int(round(accuracy * 100)))

'Accuracy for train_set(GBregression) is: 66%'

accuracy = GB_default.score(X_test, y_test)
"Accuracy for test_set(GBregression) is: {}%".format(int(round(accuracy * 100)))

'Accuracy for test_set(GBregression) is: 66%'

from sklearn.metrics import mean_squared_error
print("RMSE for train_set(GBregression) is: {}".format(mean_squared_error(y_train, y_GB_train)))

RMSE for train_set(GBregression) is: 0.11734186944285323

from sklearn.metrics import mean_squared_error
print("RMSE for test_set(GBregression) is: {}".format(mean_squared_error(y_test, y_GB_test)))

RMSE for test_set(GBregression) is: 0.11630199592880998

# Plot predictions between train and test data set
plt.scatter(y_GB_train , y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_GB_test, y_test, c = "red", marker = "s", label = "Validation data")
plt.title("Gradient Boosting Regressor")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc = "upper left")
plt.plot([11.5, 15.5], [11.5, 15.5], c = "red")
plt.show()

pred_resaleprice_GB=np.exp(y_GB_test)

original_resaleprice=np.exp(y_test)

GB_df = pd.DataFrame(data=original_resaleprice, columns=['original_resaleprice'])

GB_df['pred_resaleprice(GB)']=pred_resaleprice_GB

GB_df.head(10)

#Examining the feature importance being analysed by the GB model
import pandas as pd

pd.concat((pd.DataFrame(hedonic_model_data.iloc[:, 1:].columns, columns = ['variable']), 
           pd.DataFrame(GB_default.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:20]

5. Conclusions¶

Gradient Boosting Regression Model is the best model for this data set because it has: Highest accuracy scores Highest R2 scores Lowest RMSE scores

Floor_area_sqm and number of years of remaining lease are important features in the prediction of the resale HDB flat price.

	mrt_station_english	LONGITUDE	LATITUDE
0	Jurong East MRT Station	103.741735	1.334009
1	Bukit Batok MRT Station	103.749769	1.349512
2	Bukit Gombak MRT Station	103.751727	1.359065
3	Choa Chu Kang MRT Station	103.744556	1.385426
4	Yew Tee MRT Station	103.747405	1.397535

	y_test	y_pred
0	11.995352	11.977201
1	10.714418	11.891205
2	12.337101	12.843526
3	12.881565	13.059468
4	13.028053	12.988736
...	...	...
162458	12.013701	12.149067
162459	12.899220	13.177646
162460	12.001505	12.026516
162461	11.794338	11.859944
162462	12.072541	11.829116

	original_resaleprice	pred_resaleprice(LinearRegression)
0	162000.0	159086.091431
1	45000.0	145977.145520
2	228000.0	378331.384911
3	393000.0	469520.953648
4	455000.0	437457.850855
5	156000.0	188416.218103
6	320000.0	312732.222292
7	505000.0	421527.193041
8	310000.0	127777.436617
9	275000.0	297963.199907

	original_resaleprice	pred_resaleprice(GB)
0	162000.0	157313.904895
1	45000.0	95920.476177
2	228000.0	352681.001643
3	393000.0	423058.559235
4	455000.0	393523.911626
5	156000.0	164427.684788
6	320000.0	337801.573486
7	505000.0	424158.909266
8	310000.0	92818.295564
9	275000.0	280507.717972

	street_name	LONGITUDE	LATITUDE
0	ANG MO KIO AVE 10	103.855026	1.360136
1	ANG MO KIO AVE 4	103.838826	1.378136
2	ANG MO KIO AVE 5	103.857187	1.375021
3	ANG MO KIO AVE 1	103.844819	1.365271
4	ANG MO KIO AVE 3	103.847809	1.367542

	month	town	flat_type	block	street_name	storey_range	floor_area_sqm	flat_model	lease_commence_date	resale_price	...	year	recalculate_remaining_lease	storey_to	storey_from	average_storey	LONGITUDE	LATITUDE	min_dist_mrt	price_per_sqm	dist_from_cbd
0	1990-01	ANG MO KIO	1 ROOM	309	ANG MO KIO AVE 1	10 TO 12	31.0	IMPROVED	1977	9000.0	...	1990	86	10	12	NaN	103.844819	1.365271	691.062328	290.322581	9.024223
1	1990-01	ANG MO KIO	1 ROOM	309	ANG MO KIO AVE 1	04 TO 06	31.0	IMPROVED	1977	6000.0	...	1990	86	4	6	NaN	103.844819	1.365271	691.062328	193.548387	9.024223
2	1990-01	ANG MO KIO	1 ROOM	309	ANG MO KIO AVE 1	10 TO 12	31.0	IMPROVED	1977	8000.0	...	1990	86	10	12	NaN	103.844819	1.365271	691.062328	258.064516	9.024223
3	1990-01	ANG MO KIO	1 ROOM	309	ANG MO KIO AVE 1	07 TO 09	31.0	IMPROVED	1977	6000.0	...	1990	86	7	9	NaN	103.844819	1.365271	691.062328	193.548387	9.024223
4	1990-01	ANG MO KIO	3 ROOM	216	ANG MO KIO AVE 1	04 TO 06	73.0	NEW GENERATION	1976	47200.0	...	1990	85	4	6	NaN	103.844819	1.365271	691.062328	646.575342	9.024223
5	1990-01	ANG MO KIO	3 ROOM	308	ANG MO KIO AVE 1	07 TO 09	82.0	NEW GENERATION	1976	59000.0	...	1990	85	7	9	NaN	103.844819	1.365271	691.062328	719.512195	9.024223
6	1990-01	ANG MO KIO	3 ROOM	308	ANG MO KIO AVE 1	10 TO 12	67.0	NEW GENERATION	1976	47200.0	...	1990	85	10	12	NaN	103.844819	1.365271	691.062328	704.477612	9.024223
7	1990-01	ANG MO KIO	3 ROOM	220	ANG MO KIO AVE 1	07 TO 09	67.0	NEW GENERATION	1977	35000.0	...	1990	86	7	9	NaN	103.844819	1.365271	691.062328	522.388060	9.024223
8	1990-01	ANG MO KIO	3 ROOM	219	ANG MO KIO AVE 1	07 TO 09	82.0	NEW GENERATION	1977	52000.0	...	1990	86	7	9	NaN	103.844819	1.365271	691.062328	634.146341	9.024223
9	1990-01	ANG MO KIO	3 ROOM	320	ANG MO KIO AVE 1	07 TO 09	73.0	NEW GENERATION	1977	40000.0	...	1990	86	7	9	NaN	103.844819	1.365271	691.062328	547.945205	9.024223

	variable	importance
1	floor_area_sqm	0.696131
0	recalculate_remaining_lease	0.210749
3	dist_from_cbd	0.057097
4	floorlevel	0.022349
2	min_dist_mrt	0.013674

	resale_price	recalculate_remaining_lease	floor_area_sqm	min_dist_mrt	dist_from_cbd	floorlevel
0	9000.0	86	31.0	691.062328	9.024223	11.0
1	6000.0	86	31.0	691.062328	9.024223	5.0
2	8000.0	86	31.0	691.062328	9.024223	11.0
3	6000.0	86	31.0	691.062328	9.024223	8.0
4	47200.0	85	73.0	691.062328	9.024223	5.0
...	...	...	...	...	...	...
812310	435000.0	94	67.0	517.528694	10.412512	17.0
812311	843888.0	95	95.0	440.079176	4.740046	11.0
812312	1038000.0	95	99.0	440.079176	4.740046	35.0
812313	760000.0	95	83.0	440.079176	4.740046	8.0
812314	572000.0	95	113.0	153.395793	12.937255	17.0