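Problem description
I want to extract content such as side effects, warnings, and dosage from the site listed in the start URLs.
The CSV file is created, but it contains nothing.
The output is: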
before for
[] # it is displaying empty list
after for
Here is my code:
from scrapy.selector import Selector
from medicinelist_sample.items import MedicinelistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MedSpider(CrawlSpider):
    name = "med"
    allowed_domains = ["medindia.net"]
    start_urls = ["http://www.medindia.net/doctors/drug_information/home.asp?alpha=z"]
    rules = [Rule(SgmlLinkExtractor(allow=('Zafirlukast.htm',)), callback="parse", follow = True),]
    global Selector

    def parse(self, response):
        hxs = Selector(response)
        fullDesc = hxs.xpath('//div[@class="report-content"]//b/text()')
        final = fullDesc.extract()
        print "before for" # this is just to see if it was printing
        print final
        print "after for" # this is just to see if it was printing
Answer 1
The parse method of your Scrapy spider class should return the item(s). With your current code, I don't see any items being returned.
An example would be:
def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    sel = Selector(response)
    item = Item()
    item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
    item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
    item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
    return item
For more information, see the Scrapy documentation.
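This also explains the empty CSV file: only what a callback returns can end up as a row, so a callback that merely prints leaves the file without content. The following is a minimal sketch of that idea using only the standard library. It is a toy model, not Scrapy's feed exporter, and the names `export_to_csv`, `printing_callback`, and `returning_callback` are hypothetical, introduced purely for illustration.

```python
import csv
import io


def printing_callback(page):
    # Mirrors the spider in the question: it prints, but returns nothing.
    print(page["name"])


def returning_callback(page):
    # Mirrors a callback that returns the scraped data as an item.
    return {"name": page["name"], "dosage": page["dosage"]}


def export_to_csv(callback, pages):
    """Toy stand-in for a feed export: one CSV row per returned item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "dosage"], lineterminator="\n"
    )
    rows = 0
    for page in pages:
        item = callback(page)
        if item is None:
            continue  # nothing was returned, so there is nothing to write
        if rows == 0:
            writer.writeheader()
        writer.writerow(item)
        rows += 1
    return buffer.getvalue()


# Sample page data used only for this illustration.
pages = [{"name": "Zafirlukast", "dosage": "20 mg twice daily"}]

empty_output = export_to_csv(printing_callback, pages)
filled_output = export_to_csv(returning_callback, pages)
```

Here `empty_output` stays an empty string, while `filled_output` contains a header line and one data row.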
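Answer 2
Another problem in your code is that you override the CrawlSpider's parse method to implement your callback logic. You must not do this with a CrawlSpider, because it uses the parse method for its own logic.
Ashish Nitin Patil already pointed this out implicitly by naming his example function parse_item.
The default implementation of a CrawlSpider's parse method essentially calls the callbacks you specified in the rule definitions. So if you override it, I think your callbacks will not be called at all.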
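The effect of overriding parse can be shown with a simplified model. This is only a sketch under the assumption that the base class dispatches to rule callbacks from inside parse, as described above; `ToyCrawlSpider`, `GoodSpider`, and `BrokenSpider` are hypothetical classes and do not reproduce Scrapy's actual implementation.

```python
class ToyCrawlSpider:
    """Simplified stand-in for a crawl spider; not Scrapy's real class."""

    # Each rule maps a URL fragment to the name of a callback method.
    rules = [("Zafirlukast.htm", "parse_item")]

    def parse(self, url):
        # The base parse() is the place where rule callbacks get dispatched.
        results = []
        for fragment, callback_name in self.rules:
            if fragment in url:
                callback = getattr(self, callback_name)
                results.append(callback(url))
        return results


class GoodSpider(ToyCrawlSpider):
    def parse_item(self, url):
        # Separate callback; the inherited parse() stays intact.
        return {"url": url}


class BrokenSpider(ToyCrawlSpider):
    def parse(self, url):
        # Overriding parse() replaces the dispatch logic, so the rule
        # callback below is never reached.
        return []

    def parse_item(self, url):
        return {"url": url}


url = "http://www.medindia.net/doctors/drug_information/Zafirlukast.htm"
good_items = GoodSpider().parse(url)
broken_items = BrokenSpider().parse(url)
```

In this model `good_items` holds one item, while `broken_items` is empty because the overridden parse never calls parse_item.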
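Answer 3
I just experimented a bit with the site you are crawling. Since you want to extract some data about the drug (such as name, indications, contraindications, and so on) from different pages of this domain, wouldn't the following or similar XPath expressions fit your needs? I think your current query only gives you the "headings", but the actual information on this site sits in the text nodes that follow those bold headings.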
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Test.items import TestItem
from scrapy.item import Item, Field

class Medicine(Item):
    name = Field()
    dosage = Field()
    indications = Field()
    contraindications = Field()
    warnings = Field()

class TestmedSpider(CrawlSpider):
    name = 'testmed'
    # allowed_domains takes domain names, not URLs
    allowed_domains = ['medindia.net']
    start_urls = ['http://www.medindia.net/doctors/drug_information/']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'Zafirlukast.htm'), callback='parse_item', follow=True),
    )

    # Named parse_item so that CrawlSpider's own parse method is not overridden
    def parse_item(self, response):
        drug_info = Medicine()
        selector = Selector(response)
        # Each query picks the first text node after the matching bold heading
        name = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Generic Name')]//..//following-sibling::text()[1])''')
        dosage = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Dosage')]//..//following-sibling::text()[1])''')
        indications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Why it is prescribed (Indications)')]//..//following-sibling::text()[1])''')
        contraindications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Contraindications')]//..//following-sibling::text()[1])''')
        warnings = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Warnings and Precautions')]//..//following-sibling::text()[1])''')
        drug_info['name'] = name.extract()
        drug_info['dosage'] = dosage.extract()
        drug_info['indications'] = indications.extract()
        drug_info['contraindications'] = contraindications.extract()
        drug_info['warnings'] = warnings.extract()
        return drug_info
This gives you the following information:
>scrapy parse --spider=testmed --verbose -d 2 -c parse_item --logfile C:\Python27\Scripts\Test\Test\spiders\test.log http://www.medindia.net/doctors/drug_information/Zafirlukast.htm
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'contraindications': [u'Hypersensitivity.'],
'dosage': [u'Adult- The recommended dose is 20 mg twice daily.'],
'indications': [u'This medication is an oral leukotriene receptor antagonist (LTRA), prescribed for asthma. \xa0It blocks the action of certain natural substances that cause swelling and tightening of the airways.'],
'name': [u'\xa0Zafirlukast'],
'warnings': [u'Caution should be exercised in patients with history of liver disease, mental problems, suicidal thoughts, any allergy, elderly, during pregnancy and breastfeeding.']}]
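The difference between selecting the bold headings and selecting the text that follows them can be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, where an element's `tail` attribute is the text node right after its closing tag. The markup in `SAMPLE` is a hypothetical, well-formed snippet that only imitates the layout described in this answer; the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet imitating bold headings followed by plain text.
SAMPLE = """
<html><body>
<div class="report-content">
<b>Generic Name</b> Zafirlukast<br/>
<b>Dosage</b> Adult- The recommended dose is 20 mg twice daily.<br/>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
content = root.find(".//div[@class='report-content']")

headings = []
details = {}
for bold in content.iter("b"):
    heading = bold.text.strip()
    # Selecting only the <b> text yields the headings ...
    headings.append(heading)
    # ... while bold.tail is the text node after the closing </b> tag.
    details[heading] = (bold.tail or "").strip()
```

With this snippet, `headings` contains only the two labels, whereas `details` maps each label to the text that follows it, which is the part the XPath expressions in this answer are after.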