data cleaning (data analysis 第一步)
1. detect and delete wrong data
1.find the wrong data and make sure the data indeed wrong, if so use del statement to delete it
For instance, remove the row with the index 149 from a data set data stored as a list of list, you can use the code del data[149]
make sure you run the delstatement only once, otherwise you’ll delete more then more row. You may try the length of data set to figure out weather you have delete it.
2.remove duplicate entries
1.make sure there do exits duplicate,take one as an example
2. find all the duplicates:
设置两个数组,一个用来保存有重复的,另一个保存unique ones,遍历所有数据,若在保存唯一数组中已有就加入到重复数组中, 否则append到唯一数组中
3. remove the duplicate selectively
例如?可以选择保留最大值的数据项
solution:
· 新建一个字典,使得每一个unique中的name都是一个key,对应的value是该name的app最高的下载量
· 用该字典新建一个新的数据集,即最后需要的新数据集
代码如下:用reviews_max保存每一个app name和最大的review值,建立两个新数据集,遍历原始数据。若app name和review和reviews_max中的对应的上则添加入新的clean数据集中,利用already_added防止二次添加
检验成功:
3.Removing non-English apps
1.In Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.
2.用上述方法删除不符合条件的