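Problem description
I want to find all words that occur at least 30 times (frequency >= 30), excluding the words "the", "and", "to", and "a".
I tried the following code: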
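import json
from pprint import pprint

with open('clienti_daune100.json') as f:
    data = json.load(f)

# Collect every lowercased word from the 'Dauna' field of each record
word_list = []
for rec in data:
    word_list = word_list + rec['Dauna'].lower().split()
print(word_list[:100], '...', len(word_list), 'Total words.')

# Count occurrences of each word (named word_counts to avoid shadowing the built-in dict)
word_counts = {}
for word in word_list:
    if word not in word_counts:
        word_counts[word] = 1
    else:
        word_counts[word] += 1

# Build (count, word) pairs and sort them from most to least frequent
w_freq = []
for key, value in word_counts.items():
    w_freq.append((value, key))
w_freq.sort(reverse=True)
pprint(w_freq[:100])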
I know I need to add a condition when building the dictionary, but I can't figure out which one.
Answer 1
Filter your data first, then you can use collections.Counter
from collections import Counter
# I think your data is just a list. So imagine you have
data = ['the', 'box', 'and', 'the','cat', 'are', 'in', 'that', 'other', 'box']
# List the words we don't like
bad_words = ['the','to', 'a', 'and']
# Filter these words out
words = [word for word in data if word not in bad_words]
# Get the counts
counter = Counter(words)
Result (you can convert it to a regular dict if you need to)
Counter({'box': 2, 'cat': 1, 'are': 1, 'in': 1, 'that': 1, 'other': 1})
Finally, filter on the word counts (the result is empty in this example, since no word occurs 30 times)
{word: count for word,count in counter.items() if count>=30}
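Putting the steps above together for data shaped like the question's, a minimal sketch might look like the following. It assumes each record is a dict with a 'Dauna' text field, as in the question's code. The names frequent_words, STOP_WORDS and sample are made up for illustration, and the sample records are a stand-in for the contents of the JSON file.

```python
from collections import Counter

# Words the question wants excluded (a set gives fast membership tests)
STOP_WORDS = {"the", "and", "to", "a"}


def frequent_words(records, min_count=30, stop_words=STOP_WORDS, field="Dauna"):
    """Return {word: count} for words that occur at least min_count times."""
    # Lowercase and split each record's text, skipping the excluded words
    counter = Counter(
        word
        for rec in records
        for word in rec[field].lower().split()
        if word not in stop_words
    )
    # Keep only the words that reach the threshold
    return {word: count for word, count in counter.items() if count >= min_count}


# Hypothetical sample records standing in for json.load(f)
sample = [
    {"Dauna": "The box and the cat"},
    {"Dauna": "A cat went to the box"},
]
print(frequent_words(sample, min_count=2))  # {'box': 2, 'cat': 2}
```

With the real file, you would pass the list returned by json.load(f) and leave min_count at 30. Note that split() only splits on whitespace, so punctuation stays attached to words ("box," and "box" are counted separately) unless you strip it first.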