Scrapy: how to select the head and body tags together

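I have a crawler that needs to extract some data from meta tags in the head and from certain element tags in the body.

When I try this

for courses in response.xpath("//html"):

and this

for courses in response.xpath("//head"):

it only gets data from the meta tags inside <head>...</head>.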

When I try this

for courses in response.xpath("//body"):

it only gets data from the tags inside the HTML <body>...</body>.

How can I combine these two selectors? I tried

for courses in response.xpath("//head | //body"):

but it only returned the 'meta' tags from <head>...</head> and extracted nothing from the body.

I also tried

for courses in response.xpath("//*"):

It works, but it is very inefficient and takes a long time to extract. I believe there is a more efficient way to do this.

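Here is the Scrapy code, in case it helps.

The first two elements under yield (pagetype, pagefeatured) are in the <head>...</head> tag. The last two elements (coursetloc, coursetfees) are in the <body>...</body> tag.

Yes, it may look odd, but the site I am scraping has "meta" tags inside <body>...</body>.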

import scrapy


class MySpider(scrapy.Spider):  # BaseSpider is the old name for scrapy.Spider
    name = "dkcourses"
    start_urls = ['http://www.example.com/scrapy/all-courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        # Only <body> is selected here, so the two <head> fields come back empty.
        for courses in response.xpath("//body"):
            yield {
                'pagetype': ''.join(courses.xpath('.//meta[@name="dkpagetype"]/@content').extract()),
                'pagefeatured': ''.join(courses.xpath('.//meta[@name="dkpagefeatured"]/@content').extract()),
                'coursetloc': ''.join(courses.xpath('.//meta[@name="dkcoursetloc"]/@content').extract()),
                'coursetfees': ''.join(courses.xpath('.//meta[@name="dkcoursetfees"]/@content').extract()),
            }
        # Follow every link in the course listing and parse it the same way.
        for url in response.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)

Any help is much appreciated. Thanks.

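Use extract_first() to get the first value that extract() would return; don't use join().
Use [starts-with(@name, "dkn")] to find the meta tags; //meta searches the whole document, head and body alike.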

In [5]: for meta in response.xpath('//meta[starts-with(@name, "dkn")]'):
   ...:     name = meta.xpath('@name').extract_first()
   ...:     content = meta.xpath('@content').extract_first()
   ...:     print({name:content})

Out:

{'dknpagetype': 'Course'}
{'dknpagefeatured': ''}
{'dknpagedate': '2016-01-01'}
{'dknpagebanner': 'http://www.deakin.edu.au/__data/assets/image/0006/757986/Banner_Cyber-Alt2.jpg'}
{'dknpagethumbsquare': 'http://www.deakin.edu.au/__data/assets/image/0009/757989/SQ_Cyber1-2.jpg'}
{'dknpagethumblandscape': 'http://www.deakin.edu.au/__data/assets/image/0007/757987/LS_Cyber1-1.jpg'}
{'dknpagethumbportrait': 'http://www.deakin.edu.au/__data/assets/image/0008/757988/PT_Cyber1-3.jpg'}
{'dknpagetitle': 'Graduate Diploma of Cyber Security'}
{'dknpageurl': 'http://www.deakin.edu.au/course/graduate-diploma-cyber-security'}
{'dknpagedescription': "Take your understanding of cyber security to the next level with Deakin's Graduate Diploma of Cyber Security and build your capacity to investigate and combat cyber-crime."}
{'dknpageid': '723503'}
