Kaggle House Price Prediction with Multiple Regression Methods

Team roles:

  • 郝家辉, 韩亚辉: data analysis and visualization, data preprocessing, feature engineering, report writing
  • 许达: house-price prediction with random forests
  • 李生椰: house-price prediction with XGBoost
  • 李思远: house-price prediction with Bayesian regression

Project repository:

https://github.com/HousePricesPredictionTeam/BIT_HousePricesPredict

1. Problem Background and Analysis

Kaggle house price prediction: House Prices: Advanced Regression Techniques

House prices touch nearly everyone's life, and their fluctuations weigh on countless prospective buyers. Being able to forecast price movements helps buyers make sound decisions. In this project we use the Ames, Iowa housing dataset from the Kaggle competition, whose 79 explanatory variables describe almost every aspect of residential homes in Ames. We model the dataset and predict the sale prices.

Many factors influence house prices. We visualize the data to analyze how each factor affects price, use feature engineering to select the most relevant ones, build regression models with several machine learning algorithms (such as random forest regression and XGBoost), and compare their predictions.

Import the required packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Load the dataset

In [2]:
data=pd.read_csv('house-prices-advanced-regression-techniques/train.csv')
data
Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

In [3]:
data.shape
Out[3]:
(1460, 81)

Missing-data handling

In [4]:
for i in data.columns[1:]:
    if data[i].isnull().sum()>0:
        print(i,data[i].isnull().sum())
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406

Alley, PoolQC, Fence, and MiscFeature have large numbers of missing entries and can simply be dropped.

Drop the Id column, along with any column that has fewer than 30% of its values present (i.e., more than 70% missing)

In [5]:
Deldata = data[[column for column in data if data[column].count()/len(data) >= 0.3]]
del Deldata['Id']
print("Dropped columns:", end=" ")
for c in data.columns:
    if c not in Deldata.columns:
        print(c, end=", ")
print('\n')
data = Deldata
Dropped columns: Id, Alley, PoolQC, Fence, MiscFeature, 

In [6]:
data.shape
Out[6]:
(1460, 76)

Check the remaining columns with missing values (as a fraction of rows)

In [7]:
isnull = data.isnull().sum(axis=0)
isnull[isnull > 0]/data.shape[0]
Out[7]:
LotFrontage     0.177397
MasVnrType      0.005479
MasVnrArea      0.005479
BsmtQual        0.025342
BsmtCond        0.025342
BsmtExposure    0.026027
BsmtFinType1    0.025342
BsmtFinType2    0.026027
Electrical      0.000685
FireplaceQu     0.472603
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
dtype: float64

Impute the remaining missing values

In [8]:
data.fillna(method='pad', inplace=True)  # forward-fill: propagate the previous row's value
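Forward fill borrows values from whatever row happens to precede a gap, which can mix unrelated houses. A type-aware alternative (a sketch, not what this notebook uses) imputes numeric columns with the median and categorical columns with the mode:

```python
import pandas as pd

# Type-aware imputation sketch: median for numeric columns, mode for the rest.
def impute(df):
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

toy = pd.DataFrame({'LotFrontage': [60.0, None, 80.0],
                    'Electrical': ['SBrkr', None, 'SBrkr']})
filled = impute(toy)  # LotFrontage gap -> median 70.0, Electrical gap -> 'SBrkr'
```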

2. Exploratory Data Visualization

Dataset attributes

  • SalePrice: the sale price of the property; this is the target variable to predict
  • MSSubClass: the building class
  • MSZoning: the general zoning classification
  • LotFrontage: linear feet of street connected to the property
  • LotArea: lot size in square feet
  • Street: type of road access
  • Alley: type of alley access
  • LotShape: general shape of the property
  • LandContour: flatness of the property
  • Utilities: type of utilities available
  • LotConfig: lot configuration
  • LandSlope: slope of the property
  • Neighborhood: physical location within Ames city limits
  • Condition1: proximity to a main road or railroad
  • Condition2: proximity to a second main road or railroad (if present)
  • BldgType: type of dwelling
  • HouseStyle: style of dwelling
  • OverallQual: overall material and finish quality
  • OverallCond: overall condition rating
  • YearBuilt: original construction date
  • YearRemodAdd: remodel date
  • RoofStyle: type of roof
  • RoofMatl: roof material
  • Exterior1st: exterior covering on the house
  • Exterior2nd: exterior covering on the house (if more than one material)
  • MasVnrType: masonry veneer type
  • MasVnrArea: masonry veneer area in square feet
  • ExterQual: exterior material quality
  • ExterCond: present condition of the exterior material
  • Foundation: type of foundation
  • BsmtQual: height of the basement
  • BsmtCond: general condition of the basement
  • BsmtExposure: walkout or garden-level basement walls
  • BsmtFinType1: quality of the basement finished area
  • BsmtFinSF1: type 1 finished square feet
  • BsmtFinType2: quality of the second finished area (if present)
  • BsmtFinSF2: type 2 finished square feet
  • BsmtUnfSF: unfinished basement area in square feet
  • TotalBsmtSF: total basement area in square feet
  • Heating: type of heating
  • HeatingQC: heating quality and condition
  • CentralAir: central air conditioning
  • Electrical: electrical system
  • 1stFlrSF: first-floor square feet
  • 2ndFlrSF: second-floor square feet
  • LowQualFinSF: low-quality finished square feet (all floors)
  • GrLivArea: above-grade (ground) living area in square feet
  • BsmtFullBath: basement full bathrooms
  • BsmtHalfBath: basement half bathrooms
  • FullBath: full bathrooms above grade
  • HalfBath: half bathrooms above grade
  • BedroomAbvGr: number of bedrooms above grade (excluding basement bedrooms)
  • KitchenAbvGr: number of kitchens above grade
  • KitchenQual: kitchen quality
  • TotRmsAbvGrd: total rooms above grade (excluding bathrooms)
  • Functional: home functionality rating
  • Fireplaces: number of fireplaces
  • FireplaceQu: fireplace quality
  • GarageType: garage location
  • GarageYrBlt: year the garage was built
  • GarageFinish: interior finish of the garage
  • GarageCars: garage size in car capacity
  • GarageArea: garage size in square feet
  • GarageQual: garage quality
  • GarageCond: garage condition
  • PavedDrive: paved driveway
  • WoodDeckSF: wood deck area in square feet
  • OpenPorchSF: open porch area in square feet
  • EnclosedPorch: enclosed porch area in square feet
  • 3SsnPorch: three-season porch area in square feet
  • ScreenPorch: screen porch area in square feet
  • PoolArea: pool area in square feet
  • PoolQC: pool quality
  • Fence: fence quality
  • MiscFeature: miscellaneous feature not covered in other categories
  • MiscVal: value of the miscellaneous feature
  • MoSold: month sold
  • YrSold: year sold
  • SaleType: type of sale
  • SaleCondition: condition of sale
In [9]:
data
Out[9]:
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 RL 65.0 8450 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 2 2008 WD Normal 208500
1 20 RL 80.0 9600 Pave Reg Lvl AllPub FR2 Gtl ... 0 0 0 0 0 5 2007 WD Normal 181500
2 60 RL 68.0 11250 Pave IR1 Lvl AllPub Inside Gtl ... 0 0 0 0 0 9 2008 WD Normal 223500
3 70 RL 60.0 9550 Pave IR1 Lvl AllPub Corner Gtl ... 272 0 0 0 0 2 2006 WD Abnorml 140000
4 60 RL 84.0 14260 Pave IR1 Lvl AllPub FR2 Gtl ... 0 0 0 0 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 60 RL 62.0 7917 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 8 2007 WD Normal 175000
1456 20 RL 85.0 13175 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 2 2010 WD Normal 210000
1457 70 RL 66.0 9042 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 2500 5 2010 WD Normal 266500
1458 20 RL 68.0 9717 Pave Reg Lvl AllPub Inside Gtl ... 112 0 0 0 0 4 2010 WD Normal 142125
1459 20 RL 75.0 9937 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 6 2008 WD Normal 147500

1460 rows × 76 columns

2.1 Numeric Data Analysis

Five-number summaries

In [10]:
numCol='LotFrontage/LotArea/MasVnrArea/BsmtFinSF1/BsmtFinSF2/BsmtUnfSF/TotalBsmtSF/1stFlrSF/2ndFlrSF/LowQualFinSF/GrLivArea/BsmtFullBath/BsmtHalfBath/FullBath/HalfBath/BedroomAbvGr/KitchenAbvGr/TotRmsAbvGrd/Fireplaces/GarageCars/GarageArea/WoodDeckSF/OpenPorchSF/EnclosedPorch/3SsnPorch/ScreenPorch/PoolArea/MiscVal/MoSold/YrSold/YearBuilt/YearRemodAdd/GarageYrBlt/SalePrice'
num=numCol.split('/')
numFeature=data[num]
numFeature.shape
Out[10]:
(1460, 34)
In [11]:
data[numFeature.columns].describe().loc['min':'max'].T
Out[11]:
min 25% 50% 75% max
LotFrontage 21.0 59.00 70.0 80.00 313.0
LotArea 1300.0 7553.50 9478.5 11601.50 215245.0
MasVnrArea 0.0 0.00 0.0 165.25 1600.0
BsmtFinSF1 0.0 0.00 383.5 712.25 5644.0
BsmtFinSF2 0.0 0.00 0.0 0.00 1474.0
BsmtUnfSF 0.0 223.00 477.5 808.00 2336.0
TotalBsmtSF 0.0 795.75 991.5 1298.25 6110.0
1stFlrSF 334.0 882.00 1087.0 1391.25 4692.0
2ndFlrSF 0.0 0.00 0.0 728.00 2065.0
LowQualFinSF 0.0 0.00 0.0 0.00 572.0
GrLivArea 334.0 1129.50 1464.0 1776.75 5642.0
BsmtFullBath 0.0 0.00 0.0 1.00 3.0
BsmtHalfBath 0.0 0.00 0.0 0.00 2.0
FullBath 0.0 1.00 2.0 2.00 3.0
HalfBath 0.0 0.00 0.0 1.00 2.0
BedroomAbvGr 0.0 2.00 3.0 3.00 8.0
KitchenAbvGr 0.0 1.00 1.0 1.00 3.0
TotRmsAbvGrd 2.0 5.00 6.0 7.00 14.0
Fireplaces 0.0 0.00 1.0 1.00 3.0
GarageCars 0.0 1.00 2.0 2.00 4.0
GarageArea 0.0 334.50 480.0 576.00 1418.0
WoodDeckSF 0.0 0.00 0.0 168.00 857.0
OpenPorchSF 0.0 0.00 25.0 68.00 547.0
EnclosedPorch 0.0 0.00 0.0 0.00 552.0
3SsnPorch 0.0 0.00 0.0 0.00 508.0
ScreenPorch 0.0 0.00 0.0 0.00 480.0
PoolArea 0.0 0.00 0.0 0.00 738.0
MiscVal 0.0 0.00 0.0 0.00 15500.0
MoSold 1.0 5.00 6.0 8.00 12.0
YrSold 2006.0 2007.00 2008.0 2009.00 2010.0
YearBuilt 1872.0 1954.00 1973.0 2000.00 2010.0
YearRemodAdd 1950.0 1967.00 1994.0 2004.00 2010.0
GarageYrBlt 1900.0 1961.00 1980.0 2002.00 2010.0
SalePrice 34900.0 129975.00 163000.0 214000.00 755000.0

Box plots of the numeric attributes

In [12]:
fig, axes = plt.subplots(7, 5, figsize=(10, 30))
#, figsize=(15, 20)
for i,col in enumerate(numFeature.columns):
    axes[i//5, i%5].boxplot(data[col].values)
    axes[i//5, i%5].set_title(col)
fig.tight_layout()

Plot histograms of the numeric attributes

In [13]:
numFeature.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)
Out[13]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x...>, ...]],
      dtype=object)

2.2 Nominal Data

Plot frequency bar charts of the nominal attributes

In [14]:
Nominal='MSSubClass/MSZoning/Street/LotShape/LandContour/Utilities/LotConfig/LandSlope/Neighborhood/Condition1/Condition2/BldgType/HouseStyle/OverallQual/OverallCond/RoofStyle/RoofMatl/Exterior1st/Exterior2nd/MasVnrType/ExterQual/ExterCond/Foundation/BsmtQual/BsmtCond/BsmtExposure/BsmtFinType1/BsmtFinType2/Heating/HeatingQC/CentralAir/Electrical/KitchenQual/Functional/FireplaceQu/GarageType/GarageFinish/GarageQual/GarageCond/PavedDrive/SaleType/SaleCondition'
NomFeature=data[Nominal.split('/')]
NomFeature.shape
Out[14]:
(1460, 42)
In [15]:
fig, axes = plt.subplots(14, 3, figsize=(15, 20))
#, figsize=(15, 20)
for i,col in enumerate(NomFeature.columns):
    c = data[col].value_counts()[:10]
    axes[i//3, i%3].bar(c.index,c.values)
    axes[i//3, i%3].set_title(col)
fig.tight_layout()

3. Feature Selection

3.1 Heatmap of the correlation between the numeric attributes and SalePrice

In [16]:
correlation1 = numFeature.corr()
In [17]:
f , ax = plt.subplots(figsize = (14,12))
plt.title('Correlation of Numeric Features with Sale Price',y=1,size=16)
sns.heatmap(correlation1,square = True,  vmax=0.8)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad19c82808>

'TotalBsmtSF', '1stFlrSF', 'GarageCars', 'GarageArea', and 'GrLivArea' correlate strongly with SalePrice

Dig further into the ten numeric attributes most strongly correlated with SalePrice

In [18]:
k = 10
cols1 = correlation1.nlargest(k, 'SalePrice')['SalePrice'].index
print(cols1)
cm = np.corrcoef(data[cols1].values.T)
f , ax = plt.subplots(figsize = (14,12))
sns.heatmap(cm, vmax=.8, linewidths=0.01,square=True,annot=True,cmap='viridis',
            linecolor="white",xticklabels = cols1.values ,annot_kws = {'size':12},yticklabels = cols1.values)
Index(['SalePrice', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
       '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd'],
      dtype='object')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1b1e2508>

'TotalBsmtSF' and '1stFlrSF', and likewise 'GarageCars' and 'GarageArea', are strongly correlated with each other, so it suffices to keep only TotalBsmtSF and GarageCars when selecting features

So the numeric features selected are:

  • 'GrLivArea'
  • 'GarageCars'
  • 'TotalBsmtSF'
  • 'FullBath'
  • 'TotRmsAbvGrd'
  • 'YearBuilt'
  • 'YearRemodAdd'
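The pairwise-redundancy step above can also be automated: compute the absolute correlation matrix and drop one attribute from each pair above a threshold. A minimal sketch on toy data (the 0.8 threshold and the drop_collinear helper are illustrative assumptions, not part of this notebook):

```python
import numpy as np
import pandas as pd

# Drop one column from each pair whose absolute correlation exceeds the threshold.
def drop_collinear(df, threshold=0.8):
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

toy = pd.DataFrame({'TotalBsmtSF': [800, 1000, 1200, 900],
                    '1stFlrSF':    [850, 1050, 1250, 950],   # TotalBsmtSF + 50
                    'YearBuilt':   [1960, 2005, 1962, 2003]})
kept = drop_collinear(toy)  # '1stFlrSF' is dropped, 'YearBuilt' survives
```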

Save the numeric feature vectors to a csv file

In [19]:
numeric_feature = data[['GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd']]
In [20]:
numeric_feature.to_csv('numeric_feature.csv', index=False)

3.2 Convert the nominal attributes to discrete numeric codes

In [21]:
from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()
data2=data.copy()
for i in NomFeature.columns:
    if data2[i].dtype=='O':
        data2[i]=number.fit_transform(data[i].astype('str'))
In [22]:
data2
Out[22]:
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 3 65.0 8450 1 3 3 0 4 0 ... 0 0 0 0 0 2 2008 8 4 208500
1 20 3 80.0 9600 1 3 3 0 2 0 ... 0 0 0 0 0 5 2007 8 4 181500
2 60 3 68.0 11250 1 0 3 0 4 0 ... 0 0 0 0 0 9 2008 8 4 223500
3 70 3 60.0 9550 1 0 3 0 0 0 ... 272 0 0 0 0 2 2006 8 0 140000
4 60 3 84.0 14260 1 0 3 0 2 0 ... 0 0 0 0 0 12 2008 8 4 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 60 3 62.0 7917 1 3 3 0 4 0 ... 0 0 0 0 0 8 2007 8 4 175000
1456 20 3 85.0 13175 1 3 3 0 4 0 ... 0 0 0 0 0 2 2010 8 4 210000
1457 70 3 66.0 9042 1 3 3 0 4 0 ... 0 0 0 0 2500 5 2010 8 4 266500
1458 20 3 68.0 9717 1 3 3 0 4 0 ... 112 0 0 0 0 4 2010 8 4 142125
1459 20 3 75.0 9937 1 3 3 0 4 0 ... 0 0 0 0 0 6 2008 8 4 147500

1460 rows × 76 columns

Plot the correlation of the nominal attributes with SalePrice

In [23]:
Corr='MSSubClass/MSZoning/Street/LotShape/LandContour/Utilities/LotConfig/LandSlope/Neighborhood/Condition1/Condition2/BldgType/HouseStyle/OverallQual/OverallCond/RoofStyle/RoofMatl/Exterior1st/Exterior2nd/MasVnrType/ExterQual/ExterCond/Foundation/BsmtQual/BsmtCond/BsmtExposure/BsmtFinType1/BsmtFinType2/Heating/HeatingQC/CentralAir/Electrical/KitchenQual/Functional/FireplaceQu/GarageType/GarageFinish/GarageQual/GarageCond/PavedDrive/SaleType/SaleCondition/SalePrice'
NomFeatures=data2[Corr.split('/')]
correlation2 = NomFeatures.corr()
In [24]:
f, ax = plt.subplots(figsize=(14, 12))
plt.title('Correlation of Nominal Features with Sale Price', y=1, size=16)
sns.heatmap(correlation2, square=True, vmax=0.8)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1c67ae88>

Among the nominal attributes, OverallQual correlates most strongly with SalePrice

In [25]:
k= 10
cols = correlation2.nlargest(k,'SalePrice')['SalePrice'].index
print(cols)
cm = np.corrcoef(data2[cols].values.T)
f , ax = plt.subplots(figsize = (14,12))
sns.heatmap(cm, vmax=.8, linewidths=0.01,square=True,annot=True,cmap='viridis',
            linecolor="white",xticklabels = cols.values ,annot_kws = {'size':12},yticklabels = cols.values)
Index(['SalePrice', 'OverallQual', 'Foundation', 'CentralAir', 'Electrical',
       'PavedDrive', 'RoofStyle', 'SaleCondition', 'Neighborhood',
       'HouseStyle'],
      dtype='object')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1c7f1748>

The nominal features selected are:

  • 'OverallQual'
  • 'Foundation'
  • 'CentralAir'
  • 'Electrical'
  • 'PavedDrive'
  • 'RoofStyle'
  • 'SaleCondition'
  • 'Neighborhood'
  • 'HouseStyle'

3.3 One-hot encode the nominal attributes with get_dummies

In [26]:
catData = pd.get_dummies(NomFeature.astype(str))

catData.shape
Out[26]:
(1460, 274)
In [27]:
catData
Out[27]:
MSSubClass_120 MSSubClass_160 MSSubClass_180 MSSubClass_190 MSSubClass_20 MSSubClass_30 MSSubClass_40 MSSubClass_45 MSSubClass_50 MSSubClass_60 ... SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 1 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 1 0
1456 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1457 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1458 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0
1459 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 1 0

1460 rows × 274 columns

Save the nominal features

In [28]:
catData.to_csv('normal_feature_corr.csv', index=False)
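One caveat with get_dummies: when the Kaggle test set is encoded separately, a category missing from one frame produces mismatched columns. Reindexing onto the training columns keeps the feature vectors aligned (a sketch, not a step this notebook performs):

```python
import pandas as pd

# A category present only in training data ('CBlock', 'BrkTil' here) would be
# missing from the test frame's dummies; reindex fills those columns with 0.
train = pd.get_dummies(pd.DataFrame({'Foundation': ['PConc', 'CBlock', 'BrkTil']}))
test = pd.get_dummies(pd.DataFrame({'Foundation': ['PConc', 'PConc']}))
test_aligned = test.reindex(columns=train.columns, fill_value=0)
```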

4. Selecting Nominal Attributes by Information Gain Ratio and Building the Feature Vectors

The correlation between a numeric attribute and SalePrice can be measured with a method for continuous variables: the correlation coefficient.
For a nominal attribute, SalePrice is first discretized, and the correlation is then measured with a method for discrete variables: the information gain ratio.

4.1 Distribution of SalePrice

In [29]:
data.SalePrice.hist()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad1db9ce88>

4.2 Discretizing SalePrice

In [30]:
# map the continuous SalePrice into five discrete buckets
def price_map(x):
    if x <= 108000:
        return 0
    elif x <= 170000:
        return 1
    elif x <= 250000:
        return 2
    elif x <= 310000:
        return 3
    else:
        return 4
    
sale_price = data.SalePrice.map(price_map)
data['sale_price'] = sale_price
sale_price.hist()
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ad19d363c8>
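The same bucketing can be written more compactly with pd.cut, using the cut points from price_map above (a sketch):

```python
import numpy as np
import pandas as pd

# pd.cut with right-closed intervals reproduces price_map's "<=" thresholds
bins = [-np.inf, 108000, 170000, 250000, 310000, np.inf]
prices = pd.Series([95000, 150000, 208500, 300000, 755000])
buckets = pd.cut(prices, bins=bins, labels=False)  # integer bucket codes 0..4
```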

4.3 Computing the Information Gain Ratio

In [31]:
def Entropy(x):
    # Shannon entropy of a vector of counts
    x = np.array(x)
    x = x / x.sum()
    return -(x * np.log2(x)).sum()

def gain_rate(data, colA, colB):
    weight = data.groupby([colA]).size() / data.shape[0]
    entropy_A = Entropy(data[colA].value_counts())
    entropy_B = Entropy(data[colB].value_counts())
    gain = entropy_A - (weight * data.groupby([colA]).apply(lambda x: Entropy(x[colB].value_counts()))).sum()
    # Note: the textbook gain ratio is (H(colB) - weighted conditional entropy) / H(colA);
    # this variant starts from H(colA) and normalizes by H(colA)*H(colB),
    # which is why some values below come out negative.
    return gain / (entropy_A * entropy_B)
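For intuition, here is the textbook gain ratio, (H(B) − H(B|A)) / H(A), on a toy table where the attribute predicts the price bucket perfectly; the notebook's gain_rate above uses a different normalization, so its values are not directly comparable:

```python
import numpy as np
import pandas as pd

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    return -(p * np.log2(p)).sum()

# Standard gain ratio: (H(B) - H(B|A)) / H(A)
def gain_ratio(df, colA, colB):
    h_b = entropy(df[colB].value_counts())                 # H(B)
    weight = df.groupby(colA).size() / len(df)             # P(A = a)
    cond = (weight * df.groupby(colA).apply(
        lambda g: entropy(g[colB].value_counts()))).sum()  # H(B|A)
    return (h_b - cond) / entropy(df[colA].value_counts())

toy = pd.DataFrame({'quality': ['lo', 'lo', 'hi', 'hi'],
                    'price_bucket': [0, 0, 1, 1]})
ratio = gain_ratio(toy, 'quality', 'price_bucket')  # perfect predictor -> 1.0
```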

4.4 Gain ratios of the attributes with fewer than 20 distinct values

In [32]:
value_counts = NomFeature.apply(lambda x: x.value_counts().size, axis=0)
idx = value_counts[value_counts < 20].index

gains = []
for col in idx:
    gains.append([col, gain_rate(data, col, 'sale_price')])
    
gains = pd.DataFrame(gains, columns=['attribute', 'gainRate'])
attributes = gains.sort_values(by=['gainRate'], ascending=False).values[1:10, 0]  # rows 1-9; note this skips the top-ranked attribute
gains.sort_values(by=['gainRate'], ascending=False)[:20]
Out[32]:
attribute gainRate
12 OverallQual 0.258362
0 MSSubClass 0.230104
17 Exterior2nd 0.193903
16 Exterior1st 0.183762
25 BsmtFinType1 0.123090
13 OverallCond 0.034916
11 HouseStyle 0.004557
22 BsmtQual -0.027344
31 KitchenQual -0.039115
28 HeatingQC -0.045437
21 Foundation -0.052132
35 GarageFinish -0.055727
19 ExterQual -0.141866
33 FireplaceQu -0.144424
34 GarageType -0.147875
24 BsmtExposure -0.167345
18 MasVnrType -0.179390
6 LotConfig -0.295960
3 LotShape -0.332756
1 MSZoning -0.407347

4.5 Nominal features selected by the information gain ratio:

  • MSSubClass
  • Exterior2nd
  • Exterior1st
  • BsmtFinType1
  • OverallCond
  • HouseStyle
  • BsmtQual
  • KitchenQual
  • HeatingQC
In [33]:
nomfeature = pd.get_dummies(data[attributes].astype(str))

nomfeature.shape
Out[33]:
(1460, 82)
In [34]:
nomfeature
Out[34]:
MSSubClass_120 MSSubClass_160 MSSubClass_180 MSSubClass_190 MSSubClass_20 MSSubClass_30 MSSubClass_40 MSSubClass_45 MSSubClass_50 MSSubClass_60 ... BsmtQual_TA KitchenQual_Ex KitchenQual_Fa KitchenQual_Gd KitchenQual_TA HeatingQC_Ex HeatingQC_Fa HeatingQC_Gd HeatingQC_Po HeatingQC_TA
0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 1 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 1 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 1 1 0 0 0 0
1456 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
1457 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 1 0 0 0 0
1458 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 1 0 0 0 1 0 0
1459 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 1 0 0 1 0 0

1460 rows × 82 columns

Save the nominal feature vectors obtained with the information-gain method

In [35]:
nomfeature.to_csv('normal_feature_gainRate.csv', index=False)

Save SalePrice to a csv file

In [49]:
data.SalePrice.to_csv('SalePrice.csv',index=False)

5. Predicting House Prices with a Random Forest

5.1 Import packages

In [41]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import math
import os

5.2 Load the data

In [37]:
# min-max normalization (note: the last column is left unscaled)
def normalize(df):
    for col in df.columns[:-1]:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df
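The hand-rolled normalize above performs the same min-max scaling as sklearn's MinMaxScaler, which section 5.1 already imports; a sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min)
toy = pd.DataFrame({'GrLivArea': [1000.0, 1500.0, 2000.0],
                    'GarageCars': [0.0, 2.0, 4.0]})
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
```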
In [46]:
def load_data():
    # load the training features
    feature1Path = os.path.join("data", "numeric_feature.csv")
    feature2Path = os.path.join("data", "normal_feature_gainRate.csv")
    pricePath = os.path.join("data", "SalePrice.csv")
    train_data = pd.concat([pd.read_csv(feature1Path), pd.read_csv(feature2Path)], axis=1)

    df2 = pd.read_csv(pricePath, header=None)
    # min-max normalize the features
    train_data = normalize(train_data)

    x = train_data.iloc[:, :-1]  # feature columns (note: iloc[:, :-1] also drops the last one-hot column)
    y = df2                      # SalePrice labels

    # split the data: 75% train, 25% test
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=None)

    return train_data, x_train, x_test, y_train, y_test

5.3 Build the model

In [44]:
# build and fit a random forest regressor
def model(train_x, train_y):
    my_model = RandomForestRegressor()
    my_model.fit(train_x, train_y.values.flatten())
    return my_model
In [50]:
train_data, train_x, test_x, train_y, test_y = load_data()
# train the random forest and predict on the held-out split
my_model = model(train_x, train_y)
predicted_prices = my_model.predict(test_x)

# report log(MAE) and log(RMSE) as evaluation metrics
print("\nlog(MAE) of the predictions:")
print(math.log(mean_absolute_error(test_y, predicted_prices, multioutput='uniform_average')))
print("log(RMSE) of the predictions:")
print(math.log(np.sqrt(mean_squared_error(test_y, predicted_prices))))

# save the predictions and the corresponding ground truth
my_results = pd.DataFrame({'SalePrice': predicted_prices})
my_results.to_csv('PredictionResults.csv', index=False)
test_y.to_csv('PredictionSample.csv', index=False)
log(MAE) of the predictions:
9.87717608864612
log(RMSE) of the predictions:
10.299988913858138
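Note that Kaggle scores this competition by the RMSE between the logarithm of the predicted price and the logarithm of the observed price, which is not the same quantity as the log-of-error values printed above. A sketch of that metric (rmse_log is an illustrative helper, not a competition API):

```python
import numpy as np

# RMSE between log(prediction) and log(actual) -- the competition's metric
def rmse_log(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

score = rmse_log([200000, 100000], [220000, 90000])  # small for modest relative errors
```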

6. Predicting House Prices with Bayesian Regression

6.1 Import packages

In [51]:
from sklearn import linear_model, svm, gaussian_process
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

6.2 Load the data

In [55]:
df = pd.concat([pd.read_csv("data/numeric_feature.csv"), pd.read_csv("data/normal_feature_corr.csv")], axis=1)
In [56]:
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

df2 = pd.read_csv("data/SalePrice.csv", header=None)
In [57]:
train_X, test_X, train_y, test_y = train_test_split(df, df2, test_size=0.25, random_state=42)

6.3 Train and predict

In [58]:
clf = linear_model.BayesianRidge()
clf.fit(train_X, train_y.values.ravel())
y_pred = clf.predict(test_X)
In [59]:
print("\nlog(MAE) of the predictions:")
print(np.log(mean_absolute_error(y_pred, test_y)))
print("log(RMSE) of the predictions:")
print(np.log(np.sqrt(mean_squared_error(test_y, y_pred))))
log(MAE) of the predictions:
9.92621388263256
log(RMSE) of the predictions:
10.315740169210176

7. Predicting House Prices with XGBoost

7.1 Import packages

In [60]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math
import numpy as np

7.2 Load the data

In [61]:
feature1Path = os.path.join("data", "numeric_feature.csv")
feature2Path = os.path.join("data", "normal_feature_corr.csv")
pricePath = os.path.join("data", "SalePrice.csv")
df = pd.concat([pd.read_csv(feature1Path), pd.read_csv(feature2Path)], axis=1)
In [62]:
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

df2 = pd.read_csv(pricePath, header=None)
In [63]:
# split the data: 75% train, 25% test
train_X, test_X, train_y, test_y = train_test_split(df, df2, test_size=0.25, random_state=None)

7.3 Train and predict

In [64]:
# train the XGBoost regressor
xg = XGBRegressor(n_estimators=500, learning_rate=0.1, min_child_weight=5, max_depth=4)
xg.fit(train_X, train_y)
xg.score(test_X, test_y)
Out[64]:
0.8965519510830123
In [68]:
pre_y = xg.predict(test_X)
In [69]:
print("\nlog(MAE) of the predictions:")
print(math.log(mean_absolute_error(test_y, pre_y, multioutput='uniform_average')))
print("log(RMSE) of the predictions:")
print(math.log(np.sqrt(mean_squared_error(test_y, pre_y))))
log(MAE) of the predictions:
9.683592228283098
log(RMSE) of the predictions:
10.184159206973346