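Problem description
I am building a food hygiene scraper. I have got to the point where I can successfully fetch the names and addresses of all restaurants for a postcode entered by the user. I am now trying to display the food hygiene rating value for each result.
This value is stored on the web page in the following way: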
<div class="rating-image" style="clear: right;">
    <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
        <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
    </a>
</div>
I am trying to extract the alt text of the img element.
My code is as follows:
import requests
import time
from bs4 import BeautifulSoup


class RestaurantScraper(object):

    def __init__(self, pc):
        self.pc = pc  # the input postcode
        self.max_page = self.find_max_page()  # the number of pages available
        self.restaurants = list()  # the final list of restaurants, filled in as pages are scraped

    def run(self):
        for url in self.generate_pages_to_scrape():
            restaurants_from_url = self.scrape_page(url)
            self.restaurants += restaurants_from_url  # add this page's restaurants to the global list

    def create_url(self):
        """
        Create a core url to scrape
        :return: A url without pagination (= page 1)
        """
        return "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + self.pc + \
               "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"

    def create_paginated_url(self, page_number):
        """
        Create a paginated url
        :param page_number: pagination (integer)
        :return: A paginated url
        """
        return self.create_url() + "&page={}".format(str(page_number))

    def find_max_page(self):
        """
        Function to find the number of pages for a specific search.
        :return: The number of pages (integer)
        """
        time.sleep(5)
        r = requests.get(self.create_url())
        soup = BeautifulSoup(r.content, "lxml")
        pagination_soup = soup.findAll("div", {"id": "paginator"})
        pagination = pagination_soup[0]
        page_text = pagination("p")[0].text
        return int(page_text.replace('Page 1 of ', ''))

    def generate_pages_to_scrape(self):
        """
        Generate all the paginated urls using the max_page attribute previously scraped.
        :return: List of urls
        """
        return [self.create_paginated_url(page_number) for page_number in range(1, self.max_page + 1)]

    def scrape_page(self, url):
        """
        This is coming from your original code snippet. This probably needs a bit of work, but you get the idea.
        :param url: Url to scrape and get data from.
        :return:
        """
        time.sleep(5)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.findAll("div", {"class": "search-result"})
        ratings = soup.select('div.rating-image img[alt]')
        restaurants = list()
        for item in g_data:
            name = print(item.find_all("a", {"class": "name"})[0].text)
            restaurants.append(name)
            try:
                print(item.find_all("span", {"class": "address"})[0].text)
            except:
                pass
            for rating in ratings:
                bleh = rating['alt']
                print(bleh)
        return restaurants


if __name__ == '__main__':
    pc = input('Give your post code')
    scraper = RestaurantScraper(pc)
    scraper.run()
    print("{} restaurants scraped".format(str(len(scraper.restaurants))))
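The way I tried to collect the hygiene rating for each restaurant is with a for loop, as shown here:
for rating in ratings:
    bleh = rating['alt']
    print(bleh)
The problem is that when the script runs, it prints every food hygiene rating for every restaurant on the page beneath each restaurant's name and address, whereas I need only that restaurant's own rating to appear beneath it.
I think the for loop might be in the wrong place?
Many thanks to anyone who takes a look at this and offers guidance.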
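Reply 1
Got this working. It seems I had forgotten to put the for loop over the ratings inside a try/except block. Once the loop was inside that block, a single rating is displayed for each restaurant.
Below is the complete working code: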
import requests
import time
from bs4 import BeautifulSoup


class RestaurantScraper(object):

    def __init__(self, pc):
        self.pc = pc  # the input postcode
        self.max_page = self.find_max_page()  # the number of pages available
        self.restaurants = list()  # the final list of restaurants, filled in as pages are scraped

    def run(self):
        for url in self.generate_pages_to_scrape():
            restaurants_from_url = self.scrape_page(url)
            self.restaurants += restaurants_from_url  # add this page's restaurants to the global list

    def create_url(self):
        """
        Create a core url to scrape
        :return: A url without pagination (= page 1)
        """
        return "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + self.pc + \
               "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"

    def create_paginated_url(self, page_number):
        """
        Create a paginated url
        :param page_number: pagination (integer)
        :return: A paginated url
        """
        return self.create_url() + "&page={}".format(str(page_number))

    def find_max_page(self):
        """
        Function to find the number of pages for a specific search.
        :return: The number of pages (integer)
        """
        time.sleep(5)
        r = requests.get(self.create_url())
        soup = BeautifulSoup(r.content, "lxml")
        pagination_soup = soup.findAll("div", {"id": "paginator"})
        pagination = pagination_soup[0]
        page_text = pagination("p")[0].text
        return int(page_text.replace('Page 1 of ', ''))

    def generate_pages_to_scrape(self):
        """
        Generate all the paginated urls using the max_page attribute previously scraped.
        :return: List of urls
        """
        return [self.create_paginated_url(page_number) for page_number in range(1, self.max_page + 1)]

    def scrape_page(self, url):
        """
        This is coming from your original code snippet. This probably needs a bit of work, but you get the idea.
        :param url: Url to scrape and get data from.
        :return: List of restaurant names found on the page
        """
        time.sleep(5)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.findAll("div", {"class": "search-result"})
        ratings = soup.select('div.rating-image img[alt]')  # every rating image on the page
        restaurants = list()
        for item in g_data:
            # Keep the name itself; print() returns None, so its result must not be stored.
            name = item.find_all("a", {"class": "name"})[0].text
            print(name)
            restaurants.append(name)
            try:
                print(item.find_all("span", {"class": "address"})[0].text)
            except IndexError:
                pass  # this result has no address element
            try:
                for rating in ratings:
                    bleh = rating['alt']
                    # print() returns None, so indexing its result raises TypeError right
                    # after the first rating is printed; the bare except below swallows
                    # it, which ends the loop after one rating.
                    print(bleh)[0].text
            except:
                pass
        return restaurants


if __name__ == '__main__':
    pc = input('Give your post code')
    scraper = RestaurantScraper(pc)
    scraper.run()
    print("{} restaurants scraped".format(str(len(scraper.restaurants))))
The part that solved the problem is:
    try:
        for rating in ratings:
            bleh = rating['alt']
            print(bleh)[0].text
    except:
        pass
return restaurants
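Since `ratings` is selected from the whole page, another option might be to look up the rating inside each `search-result` element, so that every restaurant is paired with its own image. The following is only a minimal sketch: it assumes each result's `rating-image` div is nested inside its `search-result` div, which the HTML fragment in the question does not confirm. The sample markup, the cafe names and the `parse_results` helper are made up for illustration, and `html.parser` is used here instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes each "search-result" div wraps its own
# "rating-image" div. The real page layout may differ.
SAMPLE_HTML = """
<div class="search-result">
  <a class="name" href="#">Cafe One</a>
  <span class="address">1 High Street</span>
  <div class="rating-image"><a href="#"><img src="a.png" alt="5 (Very Good)"></a></div>
</div>
<div class="search-result">
  <a class="name" href="#">Cafe Two</a>
  <div class="rating-image"><a href="#"><img src="b.png" alt="3 (Generally Satisfactory)"></a></div>
</div>
"""


def parse_results(html):
    """Return one dict per search result, reading the rating from inside that result."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select("div.search-result"):
        name_tag = item.select_one("a.name")
        address_tag = item.select_one("span.address")
        # Searching from `item` (not `soup`) only sees this result's own image.
        rating_tag = item.select_one("div.rating-image img[alt]")
        results.append({
            "name": name_tag.get_text(strip=True) if name_tag else None,
            "address": address_tag.get_text(strip=True) if address_tag else None,
            "rating": rating_tag["alt"] if rating_tag else None,
        })
    return results


if __name__ == "__main__":
    for row in parse_results(SAMPLE_HTML):
        print(row["name"], "|", row["address"], "|", row["rating"])
```

Missing elements come back as `None` instead of raising, so no try/except is needed around the lookups. If the real page turns out to keep the rating outside the result element, this sketch would not apply as written.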