Scraping HTML to CSV

Popularity: 54   Published: 2023-06-19 09:20:03.0

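I want to extract content such as side effects, warnings, and dosage from the site listed in the start URLs. Below is my code. The CSV file is created, but it contains nothing. The output is: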

before for
[] # it is displaying empty list
after for
Here is my code:
from scrapy.selector import Selector
from medicinelist_sample.items import MedicinelistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MedSpider(CrawlSpider):
    name = "med"
    allowed_domains = ["medindia.net"]
    start_urls = ["http://www.medindia.net/doctors/drug_information/home.asp?alpha=z"]
    rules = [Rule(SgmlLinkExtractor(allow=('Zafirlukast.htm',)), callback="parse", follow = True),]

    global Selector

    def parse(self, response):
        hxs = Selector(response)
        fullDesc = hxs.xpath('//div[@class="report-content"]//b/text()')
        final = fullDesc.extract()
        print "before for" # this is just to see if it was printing
        print final
        print "after for" # this is just to see if it was printing

The parse method of your Scrapy spider class should return item(s). With the current code, I don't see any items being returned. An example would be:

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)

    sel = Selector(response)
    item = Item()
    item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
    item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
    item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
    return item

For more information, see the CrawlSpider example in the official Scrapy documentation.

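Another problem in your code is that you are overriding the CrawlSpider's parse method to implement your callback logic. This must not be done with a CrawlSpider, because the class uses the parse method in its own logic.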

Ashish Nitin Patil already pointed this out implicitly by naming his example function parse_item.

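The default implementation of the crawl spider's parse method essentially calls the callbacks you specified in the rule definitions. So if you override it, I think your callbacks will not be called at all. See the warning about this in the Scrapy documentation on crawling rules.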
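The effect described here can be illustrated with a toy model in plain Python 3. This is only a sketch and not Scrapy's actual implementation: `ToyCrawlSpider`, its `(substring, callback_name)` rule format, and the two subclasses are hypothetical stand-ins that mimic the dispatch idea.

```python
# Toy model of the dispatch described above. This is NOT Scrapy's real code;
# ToyCrawlSpider and its rule format are hypothetical stand-ins.

class ToyCrawlSpider:
    # Each rule is a (url_substring, callback_name) pair.
    rules = ()

    def parse(self, url):
        """Default parse: pass the URL to the callback of every matching rule."""
        results = []
        for pattern, callback_name in self.rules:
            if pattern in url:
                callback = getattr(self, callback_name)
                results.extend(callback(url))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = (("Zafirlukast.htm", "parse_item"),)

    def parse_item(self, url):
        # A distinct callback name leaves the inherited parse() intact.
        return [{"url": url}]


class BrokenSpider(ToyCrawlSpider):
    rules = (("Zafirlukast.htm", "parse"),)

    def parse(self, url):
        # Overriding parse() replaces the dispatch logic,
        # and this version only prints, so nothing is returned.
        print("before for")
        print("after for")


URL = "http://www.medindia.net/doctors/drug_information/Zafirlukast.htm"
good_items = GoodSpider().parse(URL)
broken_items = BrokenSpider().parse(URL)
print(good_items)
print(broken_items)
```

In this model `GoodSpider` yields one item, while `BrokenSpider` yields `None` because its `parse` both hides the rule dispatch and returns nothing, which mirrors the empty CSV file in the question.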

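I just experimented a little with the site you are crawling. Since you want to extract some data about the drug (such as name, indications, contraindications, etc.) from different pages of this domain, wouldn't the following or similar XPath expressions fit your needs? I think your current query only gives you the "headings", whereas the actual information on this site is in the text nodes that follow these bold headings.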

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Test.items import TestItem

from scrapy.item import Item, Field

class Medicine(Item):
    name = Field()
    dosage = Field()
    indications = Field()
    contraindications = Field()
    warnings = Field()

class TestmedSpider(CrawlSpider):
    name = 'testmed'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['medindia.net']
    start_urls = ['http://www.medindia.net/doctors/drug_information/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Zafirlukast.htm'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        drug_info = Medicine()

        selector = Selector(response)
        name = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Generic Name')]//..//following-sibling::text()[1])''')
        dosage = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Dosage')]//..//following-sibling::text()[1])''')
        indications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Why it is prescribed (Indications)')]//..//following-sibling::text()[1])''')
        contraindications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Contraindications')]//..//following-sibling::text()[1])''')
        warnings = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Warnings and Precautions')]//..//following-sibling::text()[1])''')

        drug_info['name'] = name.extract()
        drug_info['dosage'] = dosage.extract()
        drug_info['indications'] = indications.extract()
        drug_info['contraindications'] = contraindications.extract()
        drug_info['warnings'] = warnings.extract()

        return drug_info

This gives you the following information:

>scrapy parse --spider=testmed --verbose -d 2 -c parse_item --logfile C:\Python27\Scripts\Test\Test\spiders\test.log http://www.medindia.net/doctors/drug_information/Zafirlukast.htm
>>> DEPTH LEVEL: 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'contraindications': [u'Hypersensitivity.'],
  'dosage': [u'Adult- The recommended dose is 20 mg twice daily.'],
  'indications': [u'This medication is an oral leukotriene receptor antagonist (LTRA), prescribed for asthma. \xa0It blocks the action of certain natural substances that cause swelling and tightening of the airways.'],
  'name': [u'\xa0Zafirlukast'],
  'warnings': [u'Caution should be exercised in patients with history of liver disease, mental problems, suicidal thoughts, any allergy, elderly, during pregnancy and breastfeeding.']}]
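To connect this back to the CSV goal of the question, here is a minimal Python 3 sketch that uses only the standard library. It reproduces the "text node after the bold heading" idea with `xml.etree.ElementTree` (standing in for Scrapy selectors, and requiring well-formed markup), strips the non-breaking spaces visible in the output above, and writes one CSV row. The `FRAGMENT` markup, the `FIELDS` mapping, and the helper names `extract_fields` and `to_csv` are assumptions made for illustration; the real page structure may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the real page markup may differ.
FRAGMENT = """
<div class="report-content">
  <b>Generic Name :</b>\u00a0Zafirlukast<br/>
  <b>Contraindications :</b> Hypersensitivity.<br/>
</div>
"""

# Hypothetical mapping from heading text to CSV column name.
FIELDS = {"Generic Name": "name", "Contraindications": "contraindications"}


def extract_fields(fragment):
    """Return {column: text following the matching bold heading}."""
    root = ET.fromstring(fragment.strip())
    row = {}
    for bold in root.iter("b"):
        heading = bold.text or ""
        for label, field in FIELDS.items():
            if label in heading:
                # .tail is the text node that directly follows </b>.
                value = (bold.tail or "").replace("\u00a0", " ")
                # Collapse runs of whitespace, like XPath normalize-space().
                row[field] = " ".join(value.split())
    return row


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = extract_fields(FRAGMENT)
csv_text = to_csv([row], ["name", "contraindications"])
print(csv_text)
```

Within Scrapy itself, items returned from the callback can usually be written to CSV by the built-in feed exports, for example `scrapy crawl testmed -o items.csv`; depending on the Scrapy version, the output format may need to be given explicitly.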