Problem description
I am trying to record the crawl path in the meta attribute:
import scrapy
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']
    request_path_css = dict(
        main_menu = r'#home-panel-domains > h2',
        domain_names = r'#main_right > p',
    )

    def links(self, response, restrict_css=None):
        lex = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_css=restrict_css)
        return lex.extract_links(response)

    def requests(self, response, css, cb, append=True):
        links = [link for link in self.links(response, css)]
        for link in links:
            request = scrapy.Request(
                url=link.url,
                callback=cb)
            if append:
                request.meta['req_path'] = response.meta['req_path']
                request.meta['req_path'].append(dict(txt=link.text, url=link.url))
            else:
                request.meta['req_path'] = [dict(txt=link.text, url=link.url)]
            yield request

    def parse(self, response):
        #self.logger.warn('## Request path: %s', response.meta['req_path'])
        css = self.request_path_css['main_menu']
        return self.requests(response, css, self.domain_names, False)

    def domain_names(self, response):
        #self.logger.warn('## Request path: %s', response.meta['req_path'])
        css = self.request_path_css['domain_names']
        return self.requests(response, css, self.domain_names_parser)

    def domain_names_parser(self, response):
        self.logger.warn('## Request path: %s', response.meta['req_path'])
Output:
$ scrapy crawl -L WARN example
2017-02-13 11:06:37 [example] WARNING: ## Request path: [{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
2017-02-13 11:06:37 [example] WARNING: ## Request path: [{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
2017-02-13 11:06:37 [example] WARNING: ## Request path: [{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
2017-02-13 11:06:37 [example] WARNING: ## Request path: [{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
2017-02-13 11:06:37 [example] WARNING: ## Request path: [{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
2017-02-13 11:06:38 [example] WARNING: ## Request path: [{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
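This is not what I expected. I would like response.meta['req_path'][1] to hold only the last URL, but somehow every URL from the last page found its way into the list.
In other words, the expected output is: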
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/root', 'txt': 'The DNS Root Zone'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/int', 'txt': '.INT'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/arpa', 'txt': '.ARPA'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/idn-tables', 'txt': 'IDN Practices Repository'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/dnssec', 'txt': 'Root Key Signing Key'}]
[{'url': 'http://www.iana.org/domains', 'txt': 'Domain Names'}, {'url': 'http://www.iana.org/domains/special', 'txt': 'Special Purpose Domains'}]
Answer 1
After your second request, when you parse the response and call self.requests() with append=True (the default), this line:
request.meta['req_path'] = response.meta['req_path']
does not copy the list. Instead, it gets a reference to the original list. The next line then appends to that original list:
request.meta['req_path'].append(dict(txt=link.text, url=link.url))
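On the next loop iteration you get a reference to the very same original list (which now has two entries) and append to it again, and so on. Every request therefore ends up sharing one list that contains all of the links.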
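The aliasing can be reproduced without Scrapy. The sketch below is only an illustration: plain dicts stand in for request.meta, and the names parent_path, child_links and request_metas are made up for the example.

```python
# Plain-Python stand-in for response.meta['req_path']; no Scrapy involved.
parent_path = [{"txt": "Domain Names", "url": "http://www.iana.org/domains"}]

# Hypothetical links found on the second page.
child_links = [
    ("The DNS Root Zone", "http://www.iana.org/domains/root"),
    (".INT", "http://www.iana.org/domains/int"),
]

request_metas = []
for txt, url in child_links:
    meta = {}
    meta["req_path"] = parent_path  # binds the same list object, no copy is made
    meta["req_path"].append({"txt": txt, "url": url})  # mutates parent_path too
    request_metas.append(meta)

# Every "request" points at the one shared list, which now has all entries.
shared = all(m["req_path"] is parent_path for m in request_metas)
lengths = [len(m["req_path"]) for m in request_metas]
print(shared, lengths)  # True [3, 3]
```

Because all of the requests are generated before any of their responses is handled, each callback sees the fully grown list, which matches the repeated output in the question.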
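What you want is a new list for each request.
For example, you can add .copy() to the first line (list.copy() requires Python 3.3 or later; list(...) does the same on older versions):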
request.meta['req_path'] = response.meta['req_path'].copy()
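Or you can save a line by building the new list in a single expression, which replaces both the assignment and the append: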
request.meta['req_path'] = response.meta['req_path'] + [dict(txt=link.text, url=link.url)]
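To check that concatenation really gives each request its own list, here is a minimal sketch in plain Python. extend_path is a hypothetical helper introduced only for this illustration, not part of Scrapy or of the spider above.

```python
def extend_path(parent_path, txt, url):
    """Return a NEW list: the parent's entries plus one more (hypothetical helper)."""
    # list + list builds a fresh list, so parent_path itself is never mutated.
    return parent_path + [{"txt": txt, "url": url}]


parent_path = [{"txt": "Domain Names", "url": "http://www.iana.org/domains"}]
child_links = [
    ("The DNS Root Zone", "http://www.iana.org/domains/root"),
    (".INT", "http://www.iana.org/domains/int"),
]

paths = [extend_path(parent_path, txt, url) for txt, url in child_links]

for path in paths:
    print([entry["txt"] for entry in path])
# ['Domain Names', 'The DNS Root Zone']
# ['Domain Names', '.INT']
```

Both .copy() and + make a shallow copy: the lists are independent, but the dict entries inside them are still shared. That is fine here because the entries are never modified after they are created.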