Kaggle House Price Prediction with Multiple Regression Methods

Team roles:

  • 郝家辉, 韩亚辉: data analysis and visualization, data preprocessing, feature engineering, report writing
  • 许达: house-price prediction with random forests
  • 李生椰: house-price prediction with XGBoost
  • 李思远: house-price prediction with Bayesian regression

Project repository:

https://github.com/HousePricesPredictionTeam/BIT_HousePricesPredict

1. Problem Background and Analysis

Kaggle house price prediction: House Prices: Advanced Regression Techniques

House prices touch nearly everyone's life, and their fluctuations weigh on countless prospective buyers. Being able to forecast price movements helps buyers make sound decisions. In this project we use the Ames, Iowa housing dataset from the Kaggle competition, whose 79 explanatory variables describe almost every aspect of residential homes in Ames. We model the dataset and predict the sale prices.

Many factors influence house prices. We visualize the data to analyze how each factor affects price, use feature engineering to select the most relevant ones, build regression models with several machine learning algorithms (such as random forest regression and XGBoost), and compare their predictions.

Import the required packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Load the dataset

In [2]:
data=pd.read_csv('house-prices-advanced-regression-techniques/train.csv')
data
Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

In [3]:
data.shape
Out[3]:
(1460, 81)

Missing-data handling

In [4]:
for i in data.columns[1:]:
    if data[i].isnull().sum()>0:
        print(i,data[i].isnull().sum())
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406

Alley, PoolQC, Fence, and MiscFeature have large numbers of missing entries and can simply be dropped.

Drop the Id column, along with any column that has fewer than 30% of its values present (i.e., more than 70% missing)

In [5]:
Deldata = data[[column for column in data if data[column].count()/len(data) >= 0.3]]
del Deldata['Id']
print("Dropped columns:", end=" ")
for c in data.columns:
    if c not in Deldata.columns:
        print(c, end=", ")
print('\n')
data = Deldata
Dropped columns: Id, Alley, PoolQC, Fence, MiscFeature, 

In [6]:
data.shape
Out[6]:
(1460, 76)

Check the remaining columns with missing values (as a fraction of rows)

In [7]:
isnull = data.isnull().sum(axis=0)
isnull[isnull > 0]/data.shape[0]
Out[7]:
LotFrontage     0.177397
MasVnrType      0.005479
MasVnrArea      0.005479
BsmtQual        0.025342
BsmtCond        0.025342
BsmtExposure    0.026027
BsmtFinType1    0.025342
BsmtFinType2    0.026027
Electrical      0.000685
FireplaceQu     0.472603
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
dtype: float64

Impute the remaining missing values

In [8]:
data.fillna(method='pad', inplace=True)  # forward-fill: propagate the previous row's value
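Forward fill borrows values from whatever row happens to precede a gap, which can mix unrelated houses. A type-aware alternative (a sketch, not what this notebook uses) imputes numeric columns with the median and categorical columns with the mode:

```python
import pandas as pd

# Type-aware imputation sketch: median for numeric columns, mode for the rest.
def impute(df):
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

toy = pd.DataFrame({'LotFrontage': [60.0, None, 80.0],
                    'Electrical': ['SBrkr', None, 'SBrkr']})
filled = impute(toy)  # LotFrontage gap -> median 70.0, Electrical gap -> 'SBrkr'
```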

2. Exploratory Data Visualization

Dataset attributes

  • SalePrice: the sale price of the property; this is the target variable to predict
  • MSSubClass: the building class
  • MSZoning: the general zoning classification
  • LotFrontage: linear feet of street connected to the property
  • LotArea: lot size in square feet
  • Street: type of road access
  • Alley: type of alley access
  • LotShape: general shape of the property
  • LandContour: flatness of the property
  • Utilities: type of utilities available
  • LotConfig: lot configuration
  • LandSlope: slope of the property
  • Neighborhood: physical location within Ames city limits
  • Condition1: proximity to a main road or railroad
  • Condition2: proximity to a second main road or railroad (if present)
  • BldgType: type of dwelling
  • HouseStyle: style of dwelling
  • OverallQual: overall material and finish quality
  • OverallCond: overall condition rating
  • YearBuilt: original construction date
  • YearRemodAdd: remodel date
  • RoofStyle: type of roof
  • RoofMatl: roof material
  • Exterior1st: exterior covering on the house
  • Exterior2nd: exterior covering on the house (if more than one material)
  • MasVnrType: masonry veneer type
  • MasVnrArea: masonry veneer area in square feet
  • ExterQual: exterior material quality
  • ExterCond: present condition of the exterior material
  • Foundation: type of foundation
  • BsmtQual: height of the basement
  • BsmtCond: general condition of the basement
  • BsmtExposure: walkout or garden-level basement walls
  • BsmtFinType1: quality of the basement finished area
  • BsmtFinSF1: type 1 finished square feet
  • BsmtFinType2: quality of the second finished area (if present)
  • BsmtFinSF2: type 2 finished square feet
  • BsmtUnfSF: unfinished basement area in square feet
  • TotalBsmtSF: total basement area in square feet
  • Heating: type of heating
  • HeatingQC: heating quality and condition
  • CentralAir: central air conditioning
  • Electrical: electrical system
  • 1stFlrSF: first-floor square feet
  • 2ndFlrSF: second-floor square feet
  • LowQualFinSF: low-quality finished square feet (all floors)
  • GrLivArea: above-grade (ground) living area in square feet
  • BsmtFullBath: basement full bathrooms
  • BsmtHalfBath: basement half bathrooms
  • FullBath: full bathrooms above grade
  • HalfBath: half bathrooms above grade
  • BedroomAbvGr: number of bedrooms above grade (excluding basement bedrooms)
  • KitchenAbvGr: number of kitchens above grade
  • KitchenQual: kitchen quality
  • TotRmsAbvGrd: total rooms above grade (excluding bathrooms)
  • Functional: home functionality rating
  • Fireplaces: number of fireplaces
  • FireplaceQu: fireplace quality
  • GarageType: garage location
  • GarageYrBlt: year the garage was built
  • GarageFinish: interior finish of the garage
  • GarageCars: garage size in car capacity
  • GarageArea: garage size in square feet
  • GarageQual: garage quality
  • GarageCond: garage condition
  • PavedDrive: paved driveway
  • WoodDeckSF: wood deck area in square feet
  • OpenPorchSF: open porch area in square feet
  • EnclosedPorch: enclosed porch area in square feet
  • 3SsnPorch: three-season porch area in square feet
  • ScreenPorch: screen porch area in square feet
  • PoolArea: pool area in square feet
  • PoolQC: pool quality
  • Fence: fence quality
  • MiscFeature: miscellaneous feature not covered in other categories
  • MiscVal: value of the miscellaneous feature
  • MoSold: month sold
  • YrSold: year sold
  • SaleType: type of sale
  • SaleCondition: condition of sale
In [9]:
data
Out[9]:
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 RL 65.0 8450 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 2 2008 WD Normal 208500
1 20 RL 80.0 9600 Pave Reg Lvl AllPub FR2 Gtl ... 0 0 0 0 0 5 2007 WD Normal 181500
2 60 RL 68.0 11250 Pave IR1 Lvl AllPub Inside Gtl ... 0 0 0 0 0 9 2008 WD Normal 223500
3 70 RL 60.0 9550 Pave IR1 Lvl AllPub Corner Gtl ... 272 0 0 0 0 2 2006 WD Abnorml 140000
4 60 RL 84.0 14260 Pave IR1 Lvl AllPub FR2 Gtl ... 0 0 0 0 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 60 RL 62.0 7917 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 8 2007 WD Normal 175000
1456 20 RL 85.0 13175 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 2 2010 WD Normal 210000
1457 70 RL 66.0 9042 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 2500 5 2010 WD Normal 266500
1458 20 RL 68.0 9717 Pave Reg Lvl AllPub Inside Gtl ... 112 0 0 0 0 4 2010 WD Normal 142125
1459 20 RL 75.0 9937 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 6 2008 WD Normal 147500

1460 rows × 76 columns

2.1 Numeric Data Analysis

Five-number summaries

In [10]:
numCol='LotFrontage/LotArea/MasVnrArea/BsmtFinSF1/BsmtFinSF2/BsmtUnfSF/TotalBsmtSF/1stFlrSF/2ndFlrSF/LowQualFinSF/GrLivArea/BsmtFullBath/BsmtHalfBath/FullBath/HalfBath/BedroomAbvGr/KitchenAbvGr/TotRmsAbvGrd/Fireplaces/GarageCars/GarageArea/WoodDeckSF/OpenPorchSF/EnclosedPorch/3SsnPorch/ScreenPorch/PoolArea/MiscVal/MoSold/YrSold/YearBuilt/YearRemodAdd/GarageYrBlt/SalePrice'
num=numCol.split('/')
numFeature=data[num]
numFeature.shape
Out[10]:
(1460, 34)
In [11]:
data[numFeature.columns].describe().loc['min':'max'].T
Out[11]:
min 25% 50% 75% max
LotFrontage 21.0 59.00 70.0 80.00 313.0
LotArea 1300.0 7553.50 9478.5 11601.50 215245.0
MasVnrArea 0.0 0.00 0.0 165.25 1600.0
BsmtFinSF1 0.0 0.00 383.5 712.25 5644.0
BsmtFinSF2 0.0 0.00 0.0 0.00 1474.0
BsmtUnfSF 0.0 223.00 477.5 808.00 2336.0
TotalBsmtSF 0.0 795.75 991.5 1298.25 6110.0
1stFlrSF 334.0 882.00 1087.0 1391.25 4692.0
2ndFlrSF 0.0 0.00 0.0 728.00 2065.0
LowQualFinSF 0.0 0.00 0.0 0.00 572.0
GrLivArea 334.0 1129.50 1464.0 1776.75 5642.0
BsmtFullBath 0.0 0.00 0.0 1.00 3.0
BsmtHalfBath 0.0 0.00 0.0 0.00 2.0
FullBath 0.0 1.00 2.0 2.00 3.0
HalfBath 0.0 0.00 0.0 1.00 2.0
BedroomAbvGr 0.0 2.00 3.0 3.00 8.0
KitchenAbvGr 0.0 1.00 1.0 1.00 3.0
TotRmsAbvGrd 2.0 5.00 6.0 7.00 14.0
Fireplaces 0.0 0.00 1.0 1.00 3.0
GarageCars 0.0 1.00 2.0 2.00 4.0
GarageArea 0.0 334.50 480.0 576.00 1418.0
WoodDeckSF 0.0 0.00 0.0 168.00 857.0
OpenPorchSF 0.0 0.00 25.0 68.00 547.0
EnclosedPorch 0.0 0.00 0.0 0.00 552.0
3SsnPorch 0.0 0.00 0.0 0.00 508.0
ScreenPorch 0.0 0.00 0.0 0.00 480.0
PoolArea 0.0 0.00 0.0 0.00 738.0
MiscVal 0.0 0.00 0.0 0.00 15500.0
MoSold 1.0 5.00 6.0 8.00 12.0
YrSold 2006.0 2007.00 2008.0 2009.00 2010.0
YearBuilt 1872.0 1954.00 1973.0 2000.00 2010.0
YearRemodAdd 1950.0 1967.00 1994.0 2004.00 2010.0
GarageYrBlt 1900.0 1961.00 1980.0 2002.00 2010.0
SalePrice 34900.0 129975.00 163000.0 214000.00 755000.0

Box plots of the numeric attributes

In [12]:
fig, axes = plt.subplots(7, 5, figsize=(10, 30))
#, figsize=(15, 20)
for i,col in enumerate(numFeature.columns):
    axes[i//5, i%5].boxplot(data[col].values)
    axes[i//5, i%5].set_title(col)
fig.tight_layout()

Plot histograms of the numeric attributes

In [13]:
numFeature.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)
Out[13]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x...>, ...]],
      dtype=object)

2.2 Nominal Data

Plot frequency bar charts of the nominal attributes

In [14]:
Nominal='MSSubClass/MSZoning/Street/LotShape/LandContour/Utilities/LotConfig/LandSlope/Neighborhood/Condition1/Condition2/BldgType/HouseStyle/OverallQual/OverallCond/RoofStyle/RoofMatl/Exterior1st/Exterior2nd/MasVnrType/ExterQual/ExterCond/Foundation/BsmtQual/BsmtCond/BsmtExposure/BsmtFinType1/BsmtFinType2/Heating/HeatingQC/CentralAir/Electrical/KitchenQual/Functional/FireplaceQu/GarageType/GarageFinish/GarageQual/GarageCond/PavedDrive/SaleType/SaleCondition'
NomFeature=data[Nominal.split('/')]
NomFeature.shape
Out[14]:
(1460, 42)
In [15]:
fig, axes = plt.subplots(14, 3, figsize=(15, 20))
#, figsize=(15, 20)
for i,col in enumerate(NomFeature.columns):
    c = data[col].value_counts()[:10]
    axes[i//3, i%3].bar(c.index,c.values)
    axes[i//3, i%3].set_title(col)
fig.tight_layout()

3. Feature Selection

3.1 Heatmap of the correlation between the numeric attributes and SalePrice

In [16]:
correlation1 = numFeature.corr()
In [17]:
f , ax = plt.subplots(figsize = (14,12))
plt.title('Correlation of Numeric Features with Sale Price',y=1,size=16)
sns.heatmap(correlation1,square = True,  vmax=0.8)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad19c82808>

'TotalBsmtSF', '1stFlrSF', 'GarageCars', 'GarageArea', and 'GrLivArea' correlate strongly with SalePrice

Dig further into the ten numeric attributes most strongly correlated with SalePrice

In [18]:
k = 10
cols1 = correlation1.nlargest(k, 'SalePrice')['SalePrice'].index
print(cols1)
cm = np.corrcoef(data[cols1].values.T)
f , ax = plt.subplots(figsize = (14,12))
sns.heatmap(cm, vmax=.8, linewidths=0.01,square=True,annot=True,cmap='viridis',
            linecolor="white",xticklabels = cols1.values ,annot_kws = {'size':12},yticklabels = cols1.values)
Index(['SalePrice', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
       '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd'],
      dtype='object')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1b1e2508>

'TotalBsmtSF' and '1stFlrSF', and likewise 'GarageCars' and 'GarageArea', are strongly correlated with each other, so it suffices to keep only TotalBsmtSF and GarageCars when selecting features

So the numeric features selected are:

  • 'GrLivArea'
  • 'GarageCars'
  • 'TotalBsmtSF'
  • 'FullBath'
  • 'TotRmsAbvGrd'
  • 'YearBuilt'
  • 'YearRemodAdd'
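The pairwise-redundancy step above can also be automated: compute the absolute correlation matrix and drop one attribute from each pair above a threshold. A minimal sketch on toy data (the 0.8 threshold and the drop_collinear helper are illustrative assumptions, not part of this notebook):

```python
import numpy as np
import pandas as pd

# Drop one column from each pair whose absolute correlation exceeds the threshold.
def drop_collinear(df, threshold=0.8):
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

toy = pd.DataFrame({'TotalBsmtSF': [800, 1000, 1200, 900],
                    '1stFlrSF':    [850, 1050, 1250, 950],   # TotalBsmtSF + 50
                    'YearBuilt':   [1960, 2005, 1962, 2003]})
kept = drop_collinear(toy)  # '1stFlrSF' is dropped, 'YearBuilt' survives
```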

Save the numeric feature vectors to a csv file

In [19]:
numeric_feature = data[['GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd']]
In [20]:
numeric_feature.to_csv('numeric_feature.csv', index=False)

3.2 Convert the nominal attributes to discrete numeric codes

In [21]:
from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()
data2=data.copy()
for i in NomFeature.columns:
    if data2[i].dtype=='O':
        data2[i]=number.fit_transform(data[i].astype('str'))
In [22]:
data2
Out[22]:
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 3 65.0 8450 1 3 3 0 4 0 ... 0 0 0 0 0 2 2008 8 4 208500
1 20 3 80.0 9600 1 3 3 0 2 0 ... 0 0 0 0 0 5 2007 8 4 181500
2 60 3 68.0 11250 1 0 3 0 4 0 ... 0 0 0 0 0 9 2008 8 4 223500
3 70 3 60.0 9550 1 0 3 0 0 0 ... 272 0 0 0 0 2 2006 8 0 140000
4 60 3 84.0 14260 1 0 3 0 2 0 ... 0 0 0 0 0 12 2008 8 4 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 60 3 62.0 7917 1 3 3 0 4 0 ... 0 0 0 0 0 8 2007 8 4 175000
1456 20 3 85.0 13175 1 3 3 0 4 0 ... 0 0 0 0 0 2 2010 8 4 210000
1457 70 3 66.0 9042 1 3 3 0 4 0 ... 0 0 0 0 2500 5 2010 8 4 266500
1458 20 3 68.0 9717 1 3 3 0 4 0 ... 112 0 0 0 0 4 2010 8 4 142125
1459 20 3 75.0 9937 1 3 3 0 4 0 ... 0 0 0 0 0 6 2008 8 4 147500

1460 rows × 76 columns

Plot the correlation of the nominal attributes with SalePrice

In [23]:
Corr='MSSubClass/MSZoning/Street/LotShape/LandContour/Utilities/LotConfig/LandSlope/Neighborhood/Condition1/Condition2/BldgType/HouseStyle/OverallQual/OverallCond/RoofStyle/RoofMatl/Exterior1st/Exterior2nd/MasVnrType/ExterQual/ExterCond/Foundation/BsmtQual/BsmtCond/BsmtExposure/BsmtFinType1/BsmtFinType2/Heating/HeatingQC/CentralAir/Electrical/KitchenQual/Functional/FireplaceQu/GarageType/GarageFinish/GarageQual/GarageCond/PavedDrive/SaleType/SaleCondition/SalePrice'
NomFeatures=data2[Corr.split('/')]
correlation2 = NomFeatures.corr()
In [24]:
f, ax = plt.subplots(figsize=(14, 12))
plt.title('Correlation of Nominal Features with Sale Price', y=1, size=16)
sns.heatmap(correlation2, square=True, vmax=0.8)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1c67ae88>

Among the nominal attributes, OverallQual correlates most strongly with SalePrice

In [25]:
k= 10
cols = correlation2.nlargest(k,'SalePrice')['SalePrice'].index
print(cols)
cm = np.corrcoef(data2[cols].values.T)
f , ax = plt.subplots(figsize = (14,12))
sns.heatmap(cm, vmax=.8, linewidths=0.01,square=True,annot=True,cmap='viridis',
            linecolor="white",xticklabels = cols.values ,annot_kws = {'size':12},yticklabels = cols.values)
Index(['SalePrice', 'OverallQual', 'Foundation', 'CentralAir', 'Electrical',
       'PavedDrive', 'RoofStyle', 'SaleCondition', 'Neighborhood',
       'HouseStyle'],
      dtype='object')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1c7f1748>

The nominal features selected are:

  • 'OverallQual'
  • 'Foundation'
  • 'CentralAir'
  • 'Electrical'
  • 'PavedDrive'
  • 'RoofStyle'
  • 'SaleCondition'
  • 'Neighborhood'
  • 'HouseStyle'

3.3 One-hot encode the nominal attributes with get_dummies

In [26]:
catData = pd.get_dummies(NomFeature.astype(str))

catData.shape
Out[26]:
(1460, 274)
In [27]:
catData
Out[27]:
MSSubClass_120 MSSubClass_160 MSSubClass_180 MSSubClass_190 MSSubClass_20 MSSubClass_30 MSSubClass_40 MSSubClass_45 MSSubClass_50 MSSubClass_60 ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 1 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
1456 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1457 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1458 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1459 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0

1460 rows × 274 columns

Save the nominal features

In [28]:
catData.to_csv('normal_feature_corr.csv', index=False)
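One caveat with get_dummies: when the Kaggle test set is encoded separately, a category missing from one frame produces mismatched columns. Reindexing onto the training columns keeps the feature vectors aligned (a sketch, not a step this notebook performs):

```python
import pandas as pd

# A category present only in training data ('CBlock', 'BrkTil' here) would be
# missing from the test frame's dummies; reindex fills those columns with 0.
train = pd.get_dummies(pd.DataFrame({'Foundation': ['PConc', 'CBlock', 'BrkTil']}))
test = pd.get_dummies(pd.DataFrame({'Foundation': ['PConc', 'PConc']}))
test_aligned = test.reindex(columns=train.columns, fill_value=0)
```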

4. Selecting Nominal Attributes by Information Gain Ratio and Building the Feature Vectors

The correlation between a numeric attribute and SalePrice can be measured with a method for continuous variables: the correlation coefficient.
For a nominal attribute, SalePrice is first discretized, and the correlation is then measured with a method for discrete variables: the information gain ratio.

4.1 Distribution of SalePrice

In [29]:
data.SalePrice.hist()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1db9ce88>

4.2 Discretizing SalePrice

In [30]:
# map the continuous SalePrice into five discrete buckets
def price_map(x):
    if x <= 108000:
        return 0
    elif x <= 170000:
        return 1
    elif x <= 250000:
        return 2
    elif x <= 310000:
        return 3
    else:
        return 4
    
sale_price = data.SalePrice.map(price_map)
data['sale_price'] = sale_price
sale_price.hist()
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad19d363c8>
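The same bucketing can be written more compactly with pd.cut, using the cut points from price_map above (a sketch):

```python
import numpy as np
import pandas as pd

# pd.cut with right-closed intervals reproduces price_map's "<=" thresholds
bins = [-np.inf, 108000, 170000, 250000, 310000, np.inf]
prices = pd.Series([95000, 150000, 208500, 300000, 755000])
buckets = pd.cut(prices, bins=bins, labels=False)  # integer bucket codes 0..4
```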

4.3 Computing the Information Gain Ratio

In [31]:
def Entropy(x):
    # Shannon entropy of a vector of counts
    x = np.array(x)
    x = x / x.sum()
    return -(x * np.log2(x)).sum()

def gain_rate(data, colA, colB):
    weight = data.groupby([colA]).size() / data.shape[0]
    entropy_A = Entropy(data[colA].value_counts())
    entropy_B = Entropy(data[colB].value_counts())
    gain = entropy_A - (weight * data.groupby([colA]).apply(lambda x: Entropy(x[colB].value_counts()))).sum()
    # Note: the textbook gain ratio is (H(colB) - weighted conditional entropy) / H(colA);
    # this variant starts from H(colA) and normalizes by H(colA)*H(colB),
    # which is why some values below come out negative.
    return gain / (entropy_A * entropy_B)
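For intuition, here is the textbook gain ratio, (H(B) − H(B|A)) / H(A), on a toy table where the attribute predicts the price bucket perfectly; the notebook's gain_rate above uses a different normalization, so its values are not directly comparable:

```python
import numpy as np
import pandas as pd

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    return -(p * np.log2(p)).sum()

# Standard gain ratio: (H(B) - H(B|A)) / H(A)
def gain_ratio(df, colA, colB):
    h_b = entropy(df[colB].value_counts())                 # H(B)
    weight = df.groupby(colA).size() / len(df)             # P(A = a)
    cond = (weight * df.groupby(colA).apply(
        lambda g: entropy(g[colB].value_counts()))).sum()  # H(B|A)
    return (h_b - cond) / entropy(df[colA].value_counts())

toy = pd.DataFrame({'quality': ['lo', 'lo', 'hi', 'hi'],
                    'price_bucket': [0, 0, 1, 1]})
ratio = gain_ratio(toy, 'quality', 'price_bucket')  # perfect predictor -> 1.0
```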

4.4 Gain ratios of the attributes with fewer than 20 distinct values

In [32]:
value_counts = NomFeature.apply(lambda x: x.value_counts().size, axis=0)
idx = value_counts[value_counts < 20].index

gains = []
for col in idx:
    gains.append([col, gain_rate(data, col, 'sale_price')])
    
gains = pd.DataFrame(gains, columns=['attribute', 'gainRate'])
attributes = gains.sort_values(by=['gainRate'], ascending=False).values[1:10, 0]  # rows 1-9; note this skips the top-ranked attribute
gains.sort_values(by=['gainRate'], ascending=False)[:20]
Out[32]:
attribute gainRate
12 OverallQual 0.258362
0 MSSubClass 0.230104
17 Exterior2nd 0.193903
16 Exterior1st 0.183762
25 BsmtFinType1 0.123090
13 OverallCond 0.034916
11 HouseStyle 0.004557
22 BsmtQual -0.027344
31 KitchenQual -0.039115
28 HeatingQC -0.045437
21 Foundation -0.052132
35 GarageFinish -0.055727
19 ExterQual -0.141866
33 FireplaceQu -0.144424
34 GarageType -0.147875
24 BsmtExposure -0.167345
18 MasVnrType -0.179390
6 LotConfig -0.295960
3 LotShape -0.332756
1 MSZoning -0.407347

4.5 Nominal features selected by the information gain ratio:

  • MSSubClass
  • Exterior2nd
  • Exterior1st
  • BsmtFinType1
  • OverallCond
  • HouseStyle
  • BsmtQual
  • KitchenQual
  • HeatingQC
In [33]:
nomfeature = pd.get_dummies(data[attributes].astype(str))

nomfeature.shape
Out[33]:
(1460, 82)
In [34]:
nomfeature
Out[34]:
MSSubClass_120 MSSubClass_160 MSSubClass_180 MSSubClass_190 MSSubClass_20 MSSubClass_30 MSSubClass_40 MSSubClass_45 MSSubClass_50 MSSubClass_60 ... BsmtQual_TA KitchenQual_Ex KitchenQual_Fa KitchenQual_Gd KitchenQual_TA HeatingQC_Ex HeatingQC_Fa HeatingQC_Gd HeatingQC_Po HeatingQC_TA
0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 1 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 1 1 0 0 0 0
1456 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
1457 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 1 0 0 0 0
1458 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 1 0 0 0 1 0 0
1459 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 1 0 0 1 0 0

1460 rows × 82 columns

Save the nominal feature vectors obtained with the information-gain method

In [35]:
nomfeature.to_csv('normal_feature_gainRate.csv', index=False)

Save SalePrice to a csv file

In [49]:
data.SalePrice.to_csv('SalePrice.csv',index=False)

5. Predicting House Prices with a Random Forest

5.1 Import packages

In [41]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import math
import os

5.2 Load the data

In [37]:
# min-max normalization (note: the last column is left unscaled)
def normalize(df):
    for col in df.columns[:-1]:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df
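The hand-rolled normalize above performs the same min-max scaling as sklearn's MinMaxScaler, which section 5.1 already imports; a sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min)
toy = pd.DataFrame({'GrLivArea': [1000.0, 1500.0, 2000.0],
                    'GarageCars': [0.0, 2.0, 4.0]})
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
```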
In [46]:
def load_data():
    # load the training features
    feature1Path = os.path.join("data", "numeric_feature.csv")
    feature2Path = os.path.join("data", "normal_feature_gainRate.csv")
    pricePath = os.path.join("data", "SalePrice.csv")
    train_data = pd.concat([pd.read_csv(feature1Path), pd.read_csv(feature2Path)], axis=1)

    df2 = pd.read_csv(pricePath, header=None)
    # min-max normalize the features
    train_data = normalize(train_data)

    x = train_data.iloc[:, :-1]  # feature columns (note: iloc[:, :-1] also drops the last one-hot column)
    y = df2                      # SalePrice labels

    # split the data: 75% train, 25% test
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=None)

    return train_data, x_train, x_test, y_train, y_test

5.3 Build the model

In [44]:
# build and fit a random forest regressor
def model(train_x, train_y):
    my_model = RandomForestRegressor()
    my_model.fit(train_x, train_y.values.flatten())
    return my_model
In [50]:
train_data, train_x, test_x, train_y, test_y = load_data()
# train the random forest and predict on the held-out split
my_model = model(train_x, train_y)
predicted_prices = my_model.predict(test_x)

# report log(MAE) and log(RMSE) as evaluation metrics
print("\nlog(MAE) of the predictions:")
print(math.log(mean_absolute_error(test_y, predicted_prices, multioutput='uniform_average')))
print("log(RMSE) of the predictions:")
print(math.log(np.sqrt(mean_squared_error(test_y, predicted_prices))))

# save the predictions and the corresponding ground truth
my_results = pd.DataFrame({'SalePrice': predicted_prices})
my_results.to_csv('PredictionResults.csv', index=False)
test_y.to_csv('PredictionSample.csv', index=False)
log(MAE) of the predictions:
9.87717608864612
log(RMSE) of the predictions:
10.299988913858138
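Note that Kaggle scores this competition by the RMSE between the logarithm of the predicted price and the logarithm of the observed price, which is not the same quantity as the log-of-error values printed above. A sketch of that metric (rmse_log is an illustrative helper, not a competition API):

```python
import numpy as np

# RMSE between log(prediction) and log(actual) -- the competition's metric
def rmse_log(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

score = rmse_log([200000, 100000], [220000, 90000])  # small for modest relative errors
```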

6. Predicting House Prices with Bayesian Regression

6.1 Import packages

In [51]:
from sklearn import linear_model, svm, gaussian_process
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

6.2 Load the data

In [55]:
df = pd.concat([pd.read_csv("data/numeric_feature.csv"), pd.read_csv("data/normal_feature_corr.csv")], axis=1)
In [56]:
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

df2 = pd.read_csv("data/SalePrice.csv", header=None)
In [57]:
train_X, test_X, train_y, test_y = train_test_split(df, df2, test_size=0.25, random_state=42)

6.3 Train and predict

In [58]:
clf = linear_model.BayesianRidge()
clf.fit(train_X, train_y.values.ravel())
y_pred = clf.predict(test_X)
In [59]:
print("\nlog(MAE) of the predictions:")
print(np.log(mean_absolute_error(y_pred, test_y)))
print("log(RMSE) of the predictions:")
print(np.log(np.sqrt(mean_squared_error(test_y, y_pred))))
log(MAE) of the predictions:
9.92621388263256
log(RMSE) of the predictions:
10.315740169210176

7. Predicting House Prices with XGBoost

7.1 Import packages

In [60]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math
import numpy as np

7.2 Load the data

In [61]:
feature1Path = os.path.join("data", "numeric_feature.csv")
feature2Path = os.path.join("data", "normal_feature_corr.csv")
pricePath = os.path.join("data", "SalePrice.csv")
df = pd.concat([pd.read_csv(feature1Path), pd.read_csv(feature2Path)], axis=1)
In [62]:
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

df2 = pd.read_csv(pricePath, header=None)
In [63]:
# split the data: 75% train, 25% test
train_X, test_X, train_y, test_y = train_test_split(df, df2, test_size=0.25, random_state=None)

7.3 Train and predict

In [64]:
# train the XGBoost regressor
xg = XGBRegressor(n_estimators=500, learning_rate=0.1, min_child_weight=5, max_depth=4)
xg.fit(train_X, train_y)
xg.score(test_X, test_y)
Out[64]:
0.8965519510830123
In [68]:
pre_y = xg.predict(test_X)
In [69]:
print("\nlog(MAE) of the predictions:")
print(math.log(mean_absolute_error(test_y, pre_y, multioutput='uniform_average')))
print("log(RMSE) of the predictions:")
print(math.log(np.sqrt(mean_squared_error(test_y, pre_y))))
log(MAE) of the predictions:
9.683592228283098
log(RMSE) of the predictions:
10.184159206973346