BeautifulSoup: How do I skip child nodes in find_all?

I have the following code to scrape this page:

soup = BeautifulSoup(html)
result = u''
# Find Starting point
start = soup.find('div', class_='main-content-column')
if start:
    news.image_url_list = []
    for item in start.find_all('p'):

The problem I'm facing is that it also picks up the <p> tags inside <div class="type-gallery">, which I want to avoid, but I can't find a way to do that. Any ideas?

You want only the direct children, not every descendant, which is what element.find_all() returns by default. The best option is to use a CSS selector instead:

for item in soup.select('div.main-content-column > div > p'):

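The > operator restricts the match to <p> tags that are direct children of a <div>, which is itself a direct child of the <div> with the given class. You can make the selector as specific as you need, for example by adding the itemprop attribute: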

for item in soup.select('div.main-content-column > div[itemprop="articleBody"] > p'):
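To see the difference between the two approaches, here is a minimal sketch. The HTML is hypothetical and only assumed to resemble the page in the question (the paragraph text and the exact nesting are made up for illustration); the parser choice "html.parser" is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the structure described in the question.
html = """
<div class="main-content-column">
  <div itemprop="articleBody">
    <p>First paragraph</p>
    <div class="type-gallery"><p>Gallery caption</p></div>
    <p>Second paragraph</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
start = soup.find("div", class_="main-content-column")

# find_all() searches all descendants, so the gallery <p> is included.
all_paragraphs = [p.get_text() for p in start.find_all("p")]

# The child combinator (>) only matches <p> tags directly under the article <div>.
direct_paragraphs = [
    p.get_text()
    for p in soup.select('div.main-content-column > div[itemprop="articleBody"] > p')
]

print(all_paragraphs)
print(direct_paragraphs)
```

With this sample markup, the find_all() list contains the gallery caption while the select() list does not.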

Another approach is to loop over the children yourself:

start = soup.find('div', class_='main-content-column')
if start:
    news.image_url_list = []
    for item in start.children:
        if item.name != 'div':
            # skip children that are not <div> tags
            continue
        for para in item.children:
            if para.name != 'p':
                # skip children that are not <p> tags
                continue
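A shorter variant of the same loop, offered as a sketch rather than as part of the original answer, is to pass recursive=False to find_all(), which limits the search to direct children. The sample HTML and the paragraphs list are hypothetical and only stand in for the real page and for whatever the question's code does with each paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the structure described in the question.
html = """
<div class="main-content-column">
  <div itemprop="articleBody">
    <p>First paragraph</p>
    <div class="type-gallery"><p>Gallery caption</p></div>
    <p>Second paragraph</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = []  # hypothetical collector for the matched text

start = soup.find("div", class_="main-content-column")
if start:
    # recursive=False restricts each search to direct children only.
    for section in start.find_all("div", recursive=False):
        for para in section.find_all("p", recursive=False):
            paragraphs.append(para.get_text())

print(paragraphs)
```

Because the gallery <p> sits one level deeper, inside its own <div>, it is never visited by the inner loop.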
