tf-idf with scikit-learn: treating "word" and word as separate tokens (Python)

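I am working on a text classification problem in which a word found in the format "word" (in double quotes) should carry a different importance than the same word found without quotes, so I tried this code: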

    from sklearn.feature_extraction.text import CountVectorizer

    sent1 = "The cat sat on my \"face\" face"
    sent2 = "The dog sat on my bed"
    content = [sent1, sent2]
    # Tokens: words of two or more characters, or a single !, ?, " or '
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
    vectorizer.fit(content)
    # get_feature_names() in older scikit-learn releases
    print(vectorizer.get_feature_names_out().tolist())

The result is

    ['"', 'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the']

whereas I would like it to be

    ['bed', 'cat', 'dog', 'face', '"face"', 'my', 'on', 'sat', 'the']

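Your token pattern is

    token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'"

It looks for either a word (\b\w\w+\b), an exclamation mark, a question mark, or a quote character, so each quote becomes a token of its own. Try something like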

    token_pattern=r"(?u)\b\w\w+\b|\"\b\w\w+\b\"|!|\?|\'"

Note the part

    \"\b\w\w+\b\"

which looks for a word surrounded by double quotes.
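To see what that pattern captures, here is a minimal sketch that uses only the standard re module. It assumes the vectorizer's default behaviour of lowercasing the text and then collecting every match of token_pattern; the helper names tokenize and vocabulary are made up for this illustration and are not part of scikit-learn.

```python
import re

# Pattern from the answer above: plain words, double-quoted words, and !, ?, '
TOKEN_PATTERN = r"(?u)\b\w\w+\b|\"\b\w\w+\b\"|!|\?|\'"


def tokenize(text):
    """Hypothetical helper: lowercase the text, then return all pattern matches."""
    return re.findall(TOKEN_PATTERN, text.lower())


def vocabulary(docs):
    """Hypothetical helper: sorted distinct tokens across all documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


content = ['The cat sat on my "face" face', "The dog sat on my bed"]

print(tokenize(content[0]))
# ['the', 'cat', 'sat', 'on', 'my', '"face"', 'face']
print(vocabulary(content))
# ['"face"', 'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the']
```

The quoted and unquoted forms end up as two different tokens. Because the double quote sorts before letters, '"face"' appears first in the sorted vocabulary rather than next to 'face'.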

You need to adjust the token_pattern parameter to your needs. The following should work for the example provided:

    pattern = r"\S+[^!?.\s]"
    vectorizer = CountVectorizer(token_pattern=pattern)

However, you may need to refine the pattern further. A regular expression tester may help you get the pattern just right.
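As a rough illustration of why further refinement may be needed, the sketch below applies the \S+[^!?.\s] pattern with the standard re module, again assuming the text is lowercased before matching. The helper name tokenize and the second test sentence are made up for this example.

```python
import re

# Pattern from the answer above: a run of non-whitespace characters whose
# last character is not one of ! ? . or whitespace.
PATTERN = r"\S+[^!?.\s]"


def tokenize(text):
    """Hypothetical helper: lowercase the text, then return all pattern matches."""
    return re.findall(PATTERN, text.lower())


# The sentence from the question: quoted and unquoted forms stay distinct.
print(tokenize('The cat sat on my "face" face'))
# ['the', 'cat', 'sat', 'on', 'my', '"face"', 'face']

# An invented sentence showing two side effects: the comma stays attached
# to the quoted word, and the single-character word "a" is dropped.
print(tokenize('My "face", a bed.'))
# ['my', '"face",', 'bed']
```

Only !, ? and . are stripped from the end of a token, so other punctuation such as commas would need to be added to the character class if it should be excluded as well.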
