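Problem description
I have a (61000L, 2L) numpy.ndarray of strings; that is, every item in the numpy.ndarray is a string.
I split the strings so that each one becomes a list of its words, still inside a numpy.ndarray, using the following code: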
words_data = np.char.split(string_data)
I then tried to write a double for loop to count the unique words found across all of the lists.
from collections import Counter
counts = Counter()
for i in range(words_data.shape[0]):
    for j in range(words_data[1]):
        counts.update(words_data[i])
counts
The code above fails with the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-680a0105eebd> in <module>()
1 counts = Counter()
2 for i in range(words_data.shape[0]):
----> 3 for j in range(words_data[1]):
4 counts.update(words_data[i])
5
TypeError: only size-1 arrays can be converted to Python scalar
Here are the first 8 rows of my data:
x = np.array([["hello my name is nick", "hello my name is Nick", "hello my name is Carly", "hello my name is Ashley", "hello my name is Java", "hello my name is C++", "hello my name is Ruby", "hello my name is Python"], ["hello my name is Java", "hello my name is C++", "hello my name is Ruby", "hello my name is Python", "hello my name is nick", "hello my name is Nick", "hello my name is Carly", "hello my name is Ashley"]])
x = x.transpose()
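Answer 1
No loop is needed here. One solution is to flatten the array, join every string with a space, split the result into words, and count them: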
from collections import Counter
from itertools import chain
import numpy as np
string_data = np.array([["hello my name is nick", "hello my name is Nick", "hello my name is Carly",
"hello my name is Ashley", "hello my name is Java", "hello my name is C++",
"hello my name is Ruby", "hello my name is Python"],
["hello my name is Java", "hello my name is C++", "hello my name is Ruby",
"hello my name is Python", "hello my name is nick", "hello my name is Nick",
"hello my name is Carly", "hello my name is Ashley"]])
word_count = Counter(' '.join(chain.from_iterable(string_data)).split())
# Counter({'Ashley': 2,
# 'C++': 2,
# 'Carly': 2,
# 'Java': 2,
# 'Nick': 2,
# 'Python': 2,
# 'Ruby': 2,
# 'hello': 16,
# 'is': 16,
# 'my': 16,
# 'name': 16,
# 'nick': 2})
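If you would rather keep the explicit loop from the question, the following is a minimal sketch of how it could be repaired. It assumes the TypeError comes from passing the array `words_data[1]` to `range()` where the column count `words_data.shape[1]` was intended, and that each cell should be counted separately via `words_data[i, j]`, since `words_data[i]` is a row of lists and lists are not hashable. The small two-by-two `string_data` here is illustrative sample data, not the real dataset.

```python
from collections import Counter

import numpy as np

# Illustrative sample data (an assumption, not the asker's full array).
string_data = np.array([
    ["hello my name is nick", "hello my name is Java"],
    ["hello my name is Nick", "hello my name is C++"],
])

# np.char.split returns an object array of the same shape;
# each element is a list holding the words of one string.
words_data = np.char.split(string_data)

counts = Counter()
for i in range(words_data.shape[0]):
    # Use shape[1] (an int) rather than words_data[1] (an array).
    for j in range(words_data.shape[1]):
        # words_data[i, j] is a single list of words, so update() counts each word.
        counts.update(words_data[i, j])

print(counts)
```

The two nested loops can also be collapsed into a single `for words in words_data.ravel(): counts.update(words)`, which visits every cell without indexing by row and column.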