Don't match a word boundary between parentheses with a Python regex

Popularity: 31   Published: 2023-07-16 10:29:03

What I currently have:

 regex = r'\bon the\b'

But I need my regex to match only when this keyword (here, "on the") is not between parentheses in the text:

Should match:

john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)

Should not match:

(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)
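
To make the starting point concrete, here is a small check showing that the plain pattern cannot tell the two groups apart. It is only a sketch, and the list names `should_match` and `should_not_match` are made up for illustration:

```python
import re

regex = r'\bon the\b'

should_match = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
]
should_not_match = [
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

# Lines from the "should not match" group that the plain pattern still matches
false_positives = [line for line in should_not_match if re.search(regex, line)]
print(false_positives)
```

The two lines it reports are exactly the ones where "on the" sits inside, or directly after, a parenthesized part.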

On UNIX, the grep utility with the following regular expressions is enough:

grep " on the " input_file_name | grep -v "(.* on the .*)"
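
For readers without grep at hand, roughly the same two-step filter can be sketched in Python. The function name `grep_like` is hypothetical, and the second step assumes the parentheses in the grep pattern are meant literally:

```python
import re

def grep_like(lines):
    """Keep lines containing ' on the ', then drop those where it sits inside (...)."""
    kept = [line for line in lines if " on the " in line]
    inside_parens = re.compile(r"\(.* on the .*\)")
    return [line for line in kept if not inside_parens.search(line)]

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

for line in grep_like(lines):
    print(line)
```

Like the grep pipeline, this works on whole lines: a line is kept or dropped, and the individual occurrences are not located.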

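How about something like this: ^(.*)(?:\(.*\))(.*)$

As you asked, it "matches only the words that are not between parentheses in the text".

So, from:

some text (more text in parentheses) and some not in parentheses

Matches: some text + and some not in parentheses

More examples are available at the link above.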
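
A quick way to see what that pattern captures is to run it with Python's `re` module. This is only a sketch, and the variable names are illustrative:

```python
import re

pattern = r"^(.*)(?:\(.*\))(.*)$"
text = "some text (more text in parentheses) and some not in parentheses"

m = re.match(pattern, text)
before, after = m.groups()

# The two groups are the text before and after the parenthesized part
print(repr(before))
print(repr(after))
```

Note that the pattern requires a parenthesized part, so a line without any parentheses does not match at all.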


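Edit: the answer was changed because the question changed.

To capture all mentions that are not inside parentheses, I would use a bit of code rather than one huge regex.

Something like this gets you close: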

import re

pattern = r"(on the)"

test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''

# matches one (...) group at a time, so two groups on a line are handled separately
bracket_pattern = r"\([^()]*\)"

for line in test_text.split('\n'):
    # remove everything between () before searching
    stripped = re.sub(bracket_pattern, "", line)
    matches = re.findall(pattern, stripped)
    print(line, "->", " ".join(matches))

Output:

john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach ->
bob is at the pool (berkeley) ->
the spon (is on the table) ->
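
One possible way to also reject the `)on the` case is to leave a placeholder where the parenthesized text was, and to require whitespace (or a line end) on both sides of the phrase. This is a sketch under that assumption about the asker's intent, which mirrors the spaces in the grep command. The names `SENTINEL`, `PARENS`, `PHRASE` and `has_phrase_outside_parens` are hypothetical:

```python
import re

SENTINEL = "\x00"  # non-whitespace placeholder, assumed not to occur in the text
PARENS = re.compile(r"\([^()]*\)")           # one innermost (...) group
PHRASE = re.compile(r"(?<!\S)on the(?!\S)")  # phrase delimited by whitespace or line ends

def has_phrase_outside_parens(line):
    # Replace (...) groups with the sentinel until none are left (handles nesting)
    previous = None
    while previous != line:
        previous = line
        line = PARENS.sub(lambda m: SENTINEL, line)
    return PHRASE.search(line) is not None

print(has_phrase_outside_parens("he (my son) is on the beach"))
print(has_phrase_outside_parens("(my son is )on the beach"))
```

Because the sentinel is not whitespace, a phrase glued to a closing parenthesis no longer counts as a match.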

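I don't think a regex will help you in the general case. For your examples, this regex comes close to what you want (it only inspects the four characters on each side of the phrase, so a line such as "(my son is )on the beach" still matches):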

(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])

Description:

(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below 
                 can be matched
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below 
                can be matched
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
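
To see how the lookaround pattern behaves on the sample lines from the question, it can be run as follows. This is a sketch, and the names `lines` and `matched` are illustrative:

```python
import re

pattern = re.compile(r"(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])")

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

matched = [line for line in lines if pattern.search(line)]
for line in matched:
    print(line)
```

The pattern only checks the single character four positions before the phrase and the one four positions after it. That explains why "the spon (is on the table)" is rejected while "(my son is )on the beach" is still accepted.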

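If you want to generalize the problem to any string between the parenthesis and the string you are searching for, this regex will not work. The problem is the length of the text between the parenthesis and your string: Python's re module requires a lookbehind to have a fixed width, so it cannot contain a quantifier of indeterminate length such as * or +.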
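
The restriction is easy to observe with the standard `re` module. A minimal sketch (the flag name `lookbehind_ok` is only for illustration):

```python
import re

try:
    # A lookbehind containing .* has no fixed width, so compilation fails
    re.compile(r"(?<=\(.*)on the")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("rejected:", exc)

# A fixed-width lookbehind compiles without complaint
fixed = re.compile(r"(?<=\(.{3})on the")
```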

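In my regex I use a positive lookahead and a positive lookbehind; the same result could be achieved with the negative versions, but the problem remains.

Suggestion: write a small piece of Python code that checks the whole line to see whether the text sits between parentheses, because a regex alone is not up to the job.

Example: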

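import re

mystr = 'on the'
# The unwanted forms are easy to define with a regex: mystr somewhere inside
# a pair of parentheses, or mystr directly after a closing parenthesis.
unwanted = re.compile(r'\([^()]*' + re.escape(mystr) + r'[^()]*\)|\)' + re.escape(mystr))
# drop the lines that contain an unwanted form
# (build a new list: removing items from mylist while looping over it skips lines)
wanted_lines = [line for line in mylist if not unwanted.search(line)]
# look for what you want
for line in wanted_lines:
    if mystr in line:
        print(line)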

Where:

mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.
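
If individual occurrences are needed rather than whole lines, another option is to find every candidate with the original pattern and keep only those at parenthesis depth zero. This is a sketch that assumes the parentheses are balanced. The function name `matches_outside_parens` is hypothetical, and text directly after a closing parenthesis counts as outside here:

```python
import re

def matches_outside_parens(line, pattern=r"\bon the\b"):
    """Return the start index of every match that is not inside (...)."""
    hits = []
    for m in re.finditer(pattern, line):
        prefix = line[:m.start()]
        # More "(" than ")" before the match means we are inside parentheses
        depth = prefix.count("(") - prefix.count(")")
        if depth <= 0:
            hits.append(m.start())
    return hits

print(matches_outside_parens("he (my son) is on the beach"))
print(matches_outside_parens("the spon (is on the table)"))
```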

Hope this helps.