Notes on my Kaggle journey, starting from knowing nothing at all.
Link to the movie recommendation rating data
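- Related resources, blog posts, and source code I looked up
- Building a Hadoop project with Maven
- Building a movie recommender system with Hadoop
- Processing MovieLens with Python's pandas
- Python For Data Analysis's documentation
- Data processing by python: MovieLens 1M data set (main reference)
- 10 Minutes to pandas
- Python and Pandas: Part 2. Movie Ratings (main reference)
- Recommender systems: user-based nearest-neighbor collaborative filtering (MovieLens dataset)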
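Approaches to processing the data:
- Follow the matrix factorization idea and convert the data into a matrix
- Process the data with Python's pandas library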
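The pandas approach might look something like the sketch below. It assumes pandas is installed and that the columns are called user, movie, and rating; those names and the sample rows are made up for illustration, so they would need adjusting to the real file's header.

```python
import io
import pandas as pd

# Made-up sample standing in for the real ratings file.
sample = io.StringIO(
    "user,movie,rating\n"
    "1,10,5\n"
    "1,20,3\n"
    "2,10,4\n"
)

df = pd.read_csv(sample)

# One row per user, one column per movie; missing ratings become NaN.
matrix = df.pivot(index="user", columns="movie", values="rating")

# Average rating per movie, a typical first summary of rating data.
movie_means = df.groupby("movie")["rating"].mean()

print(matrix)
print(movie_means)
```

The pivoted table is dense, so for a large dataset a sparse representation is likely the better fit.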
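First, use the first approach to convert training_ratings_for_kaggle_comp.csv into a matrix.
Python: sparse matrices
Ways to read a single column or row of a CSV file in Python
A detail that is easy to get wrong: put r in front of the file path, so that the backslashes in a Windows path are not interpreted as escape characters.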
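A small sketch of why the r prefix matters, using a made-up Windows-style path (the folder and file names here are hypothetical). In a normal string literal, \t and \n are read as a tab and a newline, while a raw string keeps the backslashes exactly as typed. The real file name in this post starts with "\t" after the last folder, so it would be affected in the same way.

```python
# Hypothetical path, for illustration only.
plain = 'E:\temp\new_ratings.csv'   # \t becomes a tab, \n becomes a newline
raw = r'E:\temp\new_ratings.csv'    # backslashes are kept literally

# repr() makes the difference visible.
print(repr(plain))
print(repr(raw))

# Each collapsed escape sequence makes the plain string one character shorter.
print(len(plain), len(raw))
```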
# Convert the CSV data into a matrix
import csv  # for reading the file

path = r'E:\LP\Kaggle\Movie Recommendation\training_ratings_for_kaggle_comp.csv'

# Each column is read in its own pass; the header row is still included here
with open(path) as f:
    f_csv = csv.reader(f)
    userid = [row[0] for row in f_csv]
with open(path) as f:
    f_csv = csv.reader(f)
    movieid = [row[1] for row in f_csv]
with open(path) as f:
    f_csv = csv.reader(f)
    rating = [row[2] for row in f_csv]
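The three passes over the file could probably be collapsed into one. A minimal sketch, assuming a header row followed by user id, movie id, and rating in the first three columns (the same layout the code above relies on). The in-memory sample and its header names are made up so that the snippet runs without the Kaggle file.

```python
import csv
import io

# Made-up sample standing in for training_ratings_for_kaggle_comp.csv.
sample = io.StringIO(
    "user,movie,rating\n"
    "1,10,5\n"
    "1,20,3\n"
    "2,10,4\n"
)

reader = csv.reader(sample)
header = next(reader)                # consume the header row once
rows = [row[:3] for row in reader]   # keep only the first three columns

# Transpose the rows into three parallel columns and convert to int.
users, movies, ratings = ([int(v) for v in col] for col in zip(*rows))

print(header)
print(users, movies, ratings)
```

With a real file, `sample` would be replaced by the object returned from `open(...)` on the raw-string path.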
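Converting the data in training_ratings_for_kaggle_comp.csv into sparse matrix form
- Official sparse matrix documentation
- Python scientific computing, part 6: SciPy matrix operations (adding dtype fixed a bug)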
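from scipy import sparse

# Drop the header and convert the strings read by csv into integers
row_userid = [int(u) for u in userid[1:]]
col_movieid = [int(m) for m in movieid[1:]]
data_rating = [int(r) for r in rating[1:]]
# print(row_userid[:20])
# print(col_movieid[:20])
# print(data_rating[:20])
R = sparse.coo_matrix((data_rating, (row_userid, col_movieid)), dtype=int)
print(R.todense())  # todense() builds the full matrix in memory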
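To check what coo_matrix builds, it can help to try it on a handful of made-up ratings first. A small sketch, assuming SciPy and NumPy are installed; the ids and ratings below are invented, not taken from the competition file.

```python
import numpy as np
from scipy import sparse

# Invented (user id, movie id, rating) triples.
users = np.array([1, 1, 2, 3])
movies = np.array([2, 4, 2, 1])
ratings = np.array([5, 3, 4, 1])

# Entry (u, m) of the matrix holds the rating user u gave movie m.
# The shape is inferred as (max user id + 1, max movie id + 1),
# so row 0 and column 0 stay empty when ids start at 1.
R = sparse.coo_matrix((ratings, (users, movies)), dtype=int)

print(R.shape)
print(R.nnz)        # number of stored ratings
print(R.toarray())  # dense view; only practical for small matrices
```

Only the stored ratings take up memory in R; the zeros appear only once the matrix is converted to a dense array.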