Airbnb is short for AirBed and Breakfast ("Air-b-n-b"), rendered in Chinese as 空中食宿. It is a service website that connects travelers with hosts who have spare rooms to rent, and it offers users a wide variety of lodging listings.
This article explores a Kaggle dataset of Airbnb listings in Singapore. The original notebook is at: https://www.kaggle.com/bavalpreet26/singapore-airbnb/notebook
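Rental data from cities around the world is collected and published for reference at: http://insideairbnb.com/get-the-data.html
The site covers many cities, including Beijing and Shanghai in China, and all of the data can be downloaded for free, so interested readers can experiment with it.
This article uses Singapore, the Garden City (also known as the Lion City), which is a great destination for a trip abroad.
The dataset and code used in this article can be downloaded directly from Kaggle.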
Importing libraries
Import the libraries needed for the analysis:
import pandas as pd
import numpy as np

# 2D plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
plt.style.use('fivethirtyeight')
%matplotlib inline

# Interactive charts
import plotly as plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)

# Maps
import folium
import folium.plugins

# NLP: word clouds
import wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Machine-learning modeling
import sklearn
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
Basic information about the data
We start by loading the downloaded data into a DataFrame named df.
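The loading code was only shown as a screenshot in the original, so the following is a minimal sketch rather than the author's exact code. The file name `listings.csv` is an assumption, and a tiny in-memory CSV stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file. The real file name
# (for example "listings.csv") is an assumption, not stated in the article.
sample_csv = io.StringIO(
    "id,name,neighbourhood_group,room_type,price\n"
    "1,Cozy room near MRT,Central Region,Private room,80\n"
    "2,2BR apartment,East Region,Entire home/apt,200\n"
)

# With the real file this would be: df = pd.read_csv("listings.csv")
df = pd.read_csv(sample_csv)

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```

`pd.read_csv` accepts either a path or a file-like object, which is why the same call works for the sample above and for a file on disk.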
Inspect the basic information of the data: shape, columns, missing values, and so on.
# Shape of the data
df.shape
(7907, 16)

# Column names
columns = df.columns
columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')
The meaning of each field is as follows:
id: record ID
name: listing name
host_id: host ID
host_name: host name
neighbourhood_group: region (broad area)
neighbourhood: neighbourhood
latitude: latitude
longitude: longitude
room_type: room type
price: price
minimum_nights: minimum number of nights per booking
number_of_reviews: number of reviews
last_review: date of the most recent review
reviews_per_month: reviews per month
calculated_host_listings_count: number of listings the host has available to rent
availability_365: number of days in a year the listing is available
The DataFrame's info() method summarizes several properties of the data at once, including the non-null count and dtype of each column.
We then look at the missing values in each column.
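The outputs of these two checks were screenshots in the original. As a hedged illustration, the sketch below runs `info()` and a per-column missing-value count on a small made-up DataFrame; the toy values are assumptions and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with the same kind of gaps as the real dataset (values are invented).
df = pd.DataFrame({
    "name": ["Cozy room", None, "2BR apartment"],
    "price": [80, 120, 200],
    "last_review": ["2019-08-01", None, None],
    "reviews_per_month": [1.2, np.nan, np.nan],
})

# Prints dtypes and non-null counts for every column.
df.info()

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)
```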
Handling missing values
1. First look at how the missing values are distributed across the fields. A heatmap of df.isnull() confirms that the last_review and reviews_per_month fields contain missing values.
sns.set(rc={'figure.figsize': (19.7, 8.27)})
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
2. Drop the two fields that contain missing values (last_review and reviews_per_month), and drop the two rows whose name is missing.
The resulting data has 7905 rows and 14 fields; the original data had 7907 rows and 16 fields.
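The cleaning code itself was a screenshot in the original. One way to express the step described above is sketched here on toy data; the exact calls the author used are not known, so treat `drop` plus `dropna(subset=...)` as an assumption.

```python
import numpy as np
import pandas as pd

# Toy data (invented values): two mostly-empty columns and one missing name.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Cozy room", None, "2BR apartment", "Studio"],
    "price": [80, 120, 200, 95],
    "last_review": ["2019-08-01", None, None, "2019-07-15"],
    "reviews_per_month": [1.2, np.nan, np.nan, 0.4],
})

cleaned = (
    df.drop(columns=["last_review", "reviews_per_month"])  # drop the two sparse fields
      .dropna(subset=["name"])                              # drop rows without a name
)

print(cleaned.shape)
print(cleaned.isnull().sum())
```

Dropping the columns first keeps every row that only lacked review information, which matches the small row loss reported in the article (7907 to 7905).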
Exploratory data analysis (EDA)
EDA stands for Exploratory Data Analysis; its main purpose is to explore how the data is distributed.
Price
Overall, most prices are below 1000.
sns.histplot(df["price"], kde=True)  # histogram with a density curve; replaces the deprecated sns.distplot
plt.show()
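The statement that most prices are below 1000 can also be checked numerically instead of read off a histogram. The sketch below uses a handful of invented prices; on the real data the same two lines would be applied to `df["price"]`.

```python
import pandas as pd

# Invented prices, used only to illustrate the calculation.
prices = pd.Series([60, 85, 120, 150, 300, 999, 1500, 10000], name="price")

# Summary statistics: count, mean, std, min, quartiles, max.
print(prices.describe())

# Fraction of listings priced below 1000.
share_below_1000 = (prices < 1000).mean()
print(f"Share of prices below 1000: {share_below_1000:.0%}")
```

Taking the mean of a boolean Series gives the fraction of `True` values, so no explicit counting is needed.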
Next we look at the relationship between price and the minimum number of nights:
sns.scatterplot(
    x="price",
    y="minimum_nights",  # minimum number of nights per booking
    data=df)
plt.show()
The scatter plot shows that most listings, across the price range, require a minimum stay of fewer than 200 nights.
Region
Looking at where the listings are located: most of them are in the Central Region.
sns.countplot(x="neighbourhood_group", data=df)
plt.show()
The count plot compares the regions by number of listings; next we compare prices across regions.
df1 = df[df.price < 250]  # most listings are priced below 250
plt.figure(figsize=(10,6))
sns.boxplot(x='neighbourhood_group', y='price', data=df1)
plt.title("neighbourhood_group < 250")
plt.show()
The box plot shows that, for listings in the Central Region:
prices are spread over a wider range
prices are also higher on average than in the other regions
the distribution shows few extreme values and looks reasonable
So far we compared listings by region; we can also plot them by their exact latitude and longitude:
plt.figure(figsize=(12,8))
sns.scatterplot(x=df.longitude, y=df.latitude, hue=df.neighbourhood_group)
plt.show()
Heat map of listing locations
To draw a geographic heat map, the folium library is worth learning:
import folium
from folium.plugins import HeatMap

m = folium.Map([1.44255, 103.79580], zoom_start=11)
HeatMap(
    df[['latitude', 'longitude']].dropna(),
    radius=10,
    gradient={0.2: 'blue', 0.4: 'purple', 0.6: 'orange', 1.0: 'red'}
).add_to(m)
display(m)
Room type (room_type)
Share of each room type
Count the listings of each of the three room types and compute the corresponding percentages; the counts are stored in room_df.
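The code that builds `room_df` was a screenshot in the original. A plausible reconstruction, given that the later pie chart reads `room_df.index` and `room_df.values`, is a `value_counts()` Series; this is an assumption, shown here on invented rows.

```python
import pandas as pd

# Invented rows; only the column name and category labels mirror the dataset.
df = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Private room", "Entire home/apt",
        "Shared room", "Entire home/apt", "Private room",
    ]
})

# Number of listings per room type, most frequent first.
room_df = df["room_type"].value_counts()

# The same counts expressed as percentages of all listings.
room_pct = (room_df / room_df.sum() * 100).round(1)

print(room_df)
print(room_pct)
```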
Visualize the shares of the three room types:
labels = room_df.index
values = room_df.values
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0.5)])
fig.show()
Conclusion: entire homes and apartments make up the largest share of listings and may be the most popular option.
Room types by region
plt.figure(figsize=(12,6))
sns.countplot(data=df, x="room_type", hue="neighbourhood_group")
plt.title("room types occupied by the neighbourhood_group")
plt.show()
Comparing room types across regions leads to the same conclusion: for every room_type, the Central Region has the most listings.
Personal addition: how can the grouped bar chart above be drawn with Plotly? The counts per room type and region are first collected in a DataFrame type_group with the columns room_type, neighbourhood_group and number, which is then passed to px.bar.
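The construction of `type_group` was only visible as a screenshot. The column names `room_type`, `neighbourhood_group` and `number` come from the `px.bar` call in the article; the `groupby(...).size()` approach below is one assumed way to produce such a table, shown on invented rows.

```python
import pandas as pd

# Invented rows, used only to show the shape of the result.
df = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Private room",
        "Private room", "Private room", "Shared room",
    ],
    "neighbourhood_group": [
        "Central Region", "East Region", "Central Region",
        "Central Region", "West Region", "Central Region",
    ],
})

# One row per (room_type, neighbourhood_group) pair with its listing count.
type_group = (
    df.groupby(["room_type", "neighbourhood_group"])
      .size()
      .reset_index(name="number")
)

print(type_group)
```

The result is in long format, one row per bar, which is the layout `px.bar` expects when `color` and `barmode="group"` are used.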
px.bar(type_group,x="room_type",y="number",color="neighbourhood_group",barmode="group")
Room type and price
# catplot creates its own figure, so the size is set through height and aspect
sns.catplot(data=df, x="room_type", y="price", height=6, aspect=2)
plt.show()
Personal addition: the same chart can also be drawn with Plotly.
Listing names
Overall word cloud
Draw a word cloud based on the listing name field:
from wordcloud import WordCloud, ImageColorGenerator

text = " ".join(str(each) for each in df.name)
wordcloud = WordCloud(max_words=200, background_color="white").generate(text)
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Two abbreviations stand out in the word cloud:
2BR: 2 Bedroom Apartments, a two-bedroom unit
MRT: Mass Rapid Transit, Singapore's metro system; this suggests that many listings advertise being close to a station
Keywords in the names
Split the names into words and find the most frequent keywords:
# Collect all listing names in the list names
names = []
for name in df.name:
    names.append(name)

def split_name(name):
    """Split a single name into its words."""
    spl = str(name).split()
    return spl

names_count = []
for each in names:  # loop over the list of names
    for word in split_name(each):  # split each name into words
        word = word.lower()  # normalize to lowercase
        names_count.append(word)  # collect every word in one list

# Counting utility
from collections import Counter
result = Counter(names_count).most_common()
result[:5]

top_20 = result[0:20]  # the 20 most frequent words
top_20_words = pd.DataFrame(top_20, columns=["words", "count"])
top_20_words

plt.figure(figsize=(10,6))
fig = sns.barplot(data=top_20_words, x="words", y="count")
fig.set_title("Counts of the top 20 used words for listing names")
fig.set_ylabel("Count of words")
fig.set_xlabel("Words")
fig.set_xticklabels(fig.get_xticklabels(), rotation=80)
Review counts
Find the listings with the most reviews:
df1 = df.sort_values(by="number_of_reviews", ascending=False).head(1000)
df1.head()

import folium
from folium.plugins import MarkerCluster
from folium import plugins

print("Rooms with the most number of reviews")
Long = 103.91492
Lat = 1.32122
mapdf1 = folium.Map([Lat, Long], zoom_start=10)
mapdf1_rooms_map = plugins.MarkerCluster().add_to(mapdf1)
for lat, lon, label in zip(df1.latitude, df1.longitude, df1.name):
    folium.Marker(location=[lat, lon], icon=folium.Icon(icon="home"), popup=label).add_to(mapdf1_rooms_map)
mapdf1.add_child(mapdf1_rooms_map)
Available days
Compare the number of days per year a listing is available, by latitude and longitude:
plt.figure(figsize=(10,6))
plt.scatter(df.longitude, df.latitude, c=df.availability_365, cmap="spring", edgecolors="black", linewidths=1, alpha=1)
cbar = plt.colorbar()
cbar.set_label("availability_365")
Personal addition: how can this be drawn with Plotly?
# Plotly version
px.scatter(df, x="longitude", y="latitude", color="availability_365")
Distribution of listings priced below 500:
# Listings with a price below 500
low_500 = df[df.price < 500]
# DataFrame.plot creates its own figure, so the size is passed via figsize
viz1 = low_500.plot(kind="scatter", x='longitude', y='latitude', label='availability_365', c='price', cmap=plt.get_cmap('jet'), colorbar=True, alpha=0.4, figsize=(10,6))
viz1.legend()
plt.show()
Personal addition: a more concise Plotly version
# Plotly version
px.scatter(low_500, x='longitude', y='latitude', color='price')
Linear regression modeling
Preprocessing
For the linear-regression approach, first drop the fields that are of no use to the model:
df.drop(["name", "id", "host_name"], inplace=True, axis=1)
Encode the categorical fields as integers:
cols = ["neighbourhood_group", "neighbourhood", "room_type"]
for col in cols:
    le = preprocessing.LabelEncoder()
    le.fit(df[col])
    df[col] = le.transform(df[col])
df.head()
Modeling
# Instantiate the model
lm = LinearRegression()

# Features and target
X = df.drop("price", axis=1)
y = df["price"]

# Training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm.fit(X_train, y_train)
Validation on the test set
The test-set predictions and the actual prices are collected in a DataFrame error_airbnb with the columns Predict and Actual.
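The code that builds `error_airbnb` was a screenshot in the original, so the following is a sketch under stated assumptions: the feature values and the price formula are synthetic, and only the column names `Actual` and `Predict`, the split parameters, and the metric functions match what the article imports and uses.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed listings (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, 200),
    "availability_365": rng.integers(0, 366, 200),
})
y = 50 + 2 * X["minimum_nights"] + 0.1 * X["availability_365"] + rng.normal(0, 5, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)
lm = LinearRegression().fit(X_train, y_train)

# Predict on the held-out rows and put predictions next to the true values.
predictions = lm.predict(X_test)
error_airbnb = pd.DataFrame({
    "Actual": y_test.to_numpy(),
    "Predict": predictions,
})

r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
print(error_airbnb.head())
print(f"R2 = {r2:.3f}, MAE = {mae:.2f}")
```

Resetting to plain arrays with `to_numpy()` gives `error_airbnb` a simple 0..n-1 index, which is convenient as the x axis of a bar chart.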
title = ['Pred vs Actual']
fig = go.Figure(data=[
    go.Bar(name='Predicted', x=error_airbnb.index, y=error_airbnb['Predict']),
    go.Bar(name='Actual', x=error_airbnb.index, y=error_airbnb['Actual'])
])
fig.update_layout(barmode='group')
fig.show()
Personal addition: to compare predicted and actual values, we add a new field diff holding their difference.
error_airbnb["diff"] = error_airbnb["Predict"] - error_airbnb["Actual"]
px.box(error_airbnb, y="diff")
The box plot of diff shows that the actual and predicted values differ greatly for some listings.
The output of describe() on diff tells the same story: the largest gap is 6820 in absolute value, which is clearly an outlier, while the first-quartile value is -19 (an absolute difference of 19), so overall the two are fairly close.
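The `describe()` output was a screenshot in the original. The sketch below shows how such a summary is obtained and how the largest absolute gap can be read from it; the numbers are invented and do not reproduce the article's results.

```python
import pandas as pd

# Invented predictions and actual prices, including one extreme listing.
error_airbnb = pd.DataFrame({
    "Actual": [100, 150, 80, 2000, 120],
    "Predict": [110, 130, 95, 180, 125],
})
error_airbnb["diff"] = error_airbnb["Predict"] - error_airbnb["Actual"]

# count, mean, std, min, 25%, 50%, 75%, max of the differences.
summary = error_airbnb["diff"].describe()
print(summary)

# Largest gap regardless of sign.
largest_gap = error_airbnb["diff"].abs().max()
print(largest_gap)
```

A single extreme listing dominates the minimum and the mean, while the quartiles stay small, which is the same pattern the article describes.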
That is all.