What is the best way to search for a given word in text? (Java)

I am working on text processing, and I need to find the number of tweets that mention any given word or phrase, for example:

tweet 1: I had an egg for breakfast this morning
tweet 2: This is the book that I'll give to you tomorrow morning
tweet 3: I went there yesterday morning but you were not home. Did you go to her house this morning?
given word: this morning

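For the example above the frequency should be 2, because only two tweets (tweets 1 and 3) mention the given words in exactly the given form. I am worried that my current implementation is inefficient in some respect, and that there may be a better way to do this. What I do so far is first try to get all tweets that contain the given words.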

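public int getDF(String term) throws FileNotFoundException, IOException{
    int frequency = 0;
    String lowerTerm = term.toLowerCase(); // compare in lower case on both sides
    String[] termTokens = lowerTerm.split(" ");
    File[] paths = f.listFiles(); // f is assumed to be a field pointing at the directory of tweet files
    for(File file:paths){
        // try-with-resources closes each reader once the file has been read
        try(BufferedReader br = new BufferedReader(new FileReader(file))){
            String line;
            while((line=br.readLine())!=null){
                String lowerLine = line.toLowerCase();
                if(lowerLine.contains(lowerTerm)){
                    if(termTokens.length > 1){ //just for multi-word
                        if(getDFUtil(lowerLine, lowerTerm))
                            frequency++;
                    }else
                        frequency++;
                }
            }
        }
    }
    return frequency;
}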

For a multi-word term, I call the function getDFUtil to check whether the tweet really contains the words in the given order.

public boolean getDFUtil(String tweet, String term){
    String[] tweetTokens = tweet.split(" ");
    String[] termTokens = term.split(" ");
    int chosenIndex = 0; // position of the first term token in the tweet
    int nextIndex = 0;   // position of the most recently matched term token
    if(termTokens.length > 1){
        for(int j=0;j<termTokens.length;j++){
            for(int i=0;i<tweetTokens.length;i++){
                // the inner loop never breaks, so the last occurrence of each token wins
                if(termTokens[j].equals(tweetTokens[i]) && j==0){
                    chosenIndex = i;
                    nextIndex = i;
                }else if(termTokens[j].equals(tweetTokens[i])){
                    nextIndex = i;
                }
            }
        }
        // the phrase counts as found when the last token sits exactly length-1 positions after the first
        if(nextIndex - chosenIndex == termTokens.length - 1)
            return true;
    }else if(tweet.contains(term))
        return true;

    return false;
}

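However, as I mentioned before, I would like to know whether there is (and I suspect there should be) a better or simpler, yet still powerful, way to do this.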

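I think you can use a regular expression (regex) for this task (if you don't know what that is, it is really worth learning). Instead of checking every word in a line against the searched words, you can use a regex to match the whole line against the given word or phrase in one go. Try this small application: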

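import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.regex.Pattern;

public class Test{
    public static void main(String[] args){
        int frequency = 0;
        String term = "this morning";
        File tweets = new File("Tweets.txt"); // adjust to the path of your Tweets.txt
        // Pattern.quote keeps regex metacharacters in the term from being interpreted
        String regex = "(?i).*"+Pattern.quote(term)+".*";
        // try-with-resources closes the reader even if reading fails
        try(BufferedReader br = new BufferedReader(new FileReader(tweets))){
            String line;

            while((line=br.readLine())!=null){
                if(line.matches(regex)){
                    frequency++;
                }
            }
        }catch (Exception ex){
            ex.printStackTrace();
        }
        System.out.println(frequency);
    }
}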

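Tweets.txt contains the tweets from the example above. In short, the application counts how many lines in the file match the regex. I think you can easily implement something similar in your app. The String.matches() method returns true only if the entire string matches the given regex, so in this case the regex is constructed as follows: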

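(?i) - case-insensitive mode; I see you used toLowerCase(), so case evidently should not matter when matching,
.* - matches anything in the line
term - the exact word or phrase you are looking for
.* - matches anything in the line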
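
To make the role of the two `.*` parts concrete, here is a minimal sketch contrasting a whole-string match with a "contains" style search. The class and method names (MatchVsFind, wholeLineMatch, containsMatch) are made up for illustration and are not part of the code above; Pattern.quote is used on the assumption that the term should be treated as literal text.

```java
import java.util.regex.Pattern;

public class MatchVsFind {
    // String.matches() succeeds only if the ENTIRE line equals the term
    // (ignoring case), because there is no leading or trailing ".*" here.
    static boolean wholeLineMatch(String line, String term) {
        return line.matches("(?i)" + Pattern.quote(term));
    }

    // Matcher.find() looks for the term anywhere in the line, which is what
    // the surrounding ".*" achieve when String.matches() is used instead.
    static boolean containsMatch(String line, String term) {
        return Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE)
                      .matcher(line)
                      .find();
    }

    public static void main(String[] args) {
        String line = "I had an egg for breakfast this morning";
        System.out.println(wholeLineMatch(line, "this morning")); // false
        System.out.println(containsMatch(line, "this morning"));  // true
    }
}
```

Either form should work for counting; find() simply avoids having to wrap the term in `.*` by hand.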

You can test how this particular regex behaves against your tweets.
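
As a possible refinement, the sketch below compiles the pattern once and adds word boundaries, so that "this morning" does not also match inside something like "this mornings". It is only a sketch under assumptions: PhraseCounter and countTweets are hypothetical names, the tweets are passed in as a list instead of being read from a file, and the `\b` boundaries assume the term starts and ends with a letter or digit.

```java
import java.util.List;
import java.util.regex.Pattern;

public class PhraseCounter {
    // Counts how many tweets mention the term at least once (document frequency).
    static int countTweets(List<String> tweets, String term) {
        // Compile once and reuse for every tweet. Pattern.quote treats the term
        // as literal text; \b requires a word boundary on both sides, which
        // assumes the term begins and ends with a word character.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
        int frequency = 0;
        for (String tweet : tweets) {
            if (pattern.matcher(tweet).find()) {
                frequency++; // each tweet is counted at most once
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<String> tweets = List.of(
                "I had an egg for breakfast this morning",
                "This is the book that I'll give to you tomorrow morning",
                "I went there yesterday morning but you were not home. "
                        + "Did you go to her house this morning?");
        System.out.println(countTweets(tweets, "this morning")); // 2
    }
}
```

Because the match is done on the raw text, punctuation such as the trailing "?" in tweet 3 does not get in the way, which a split on spaces followed by equals() would have to handle separately.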
