
Python: using PyInstaller to package a multiprocessing + tkinter Scrapy crawler that scrapes info on Douban's TOP 250 movies


Reference articles:

https://blog.csdn.net/La_vie_est_belle/article/details/79017358

https://blog.csdn.net/weixin_42052836/article/details/82315118

https://blog.csdn.net/zm147451753/article/details/85850526

The usual ways of packaging Scrapy with PyInstaller come down to two approaches:

1. In begin.py, the script that launches Scrapy, import all the modules Scrapy needs, run `pyinstaller begin.py`, and then copy a scrapy folder containing VERSION and mime.types into the generated dist/begin folder (the copy step is sketched right after this list).

2. Generate a begin.spec file. There you can list the required modules in hiddenimports, and use datas to put the scrapy files mentioned above, plus your own douban Scrapy project files, straight into the generated dist/begin folder.

  (PS: the spec file must not contain any Chinese characters, otherwise packaging fails with a gbk encoding error.)
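For approach 1, the copy step can also be scripted instead of done by hand. Below is a minimal sketch, assuming a default one-folder build whose output lands in dist/begin; the paths and the use of shutil are my own assumptions, not part of the original post:

# Rough sketch of the manual copy step for approach 1 (assumed layout):
# after `pyinstaller begin.py`, Scrapy still expects a scrapy/ folder with
# its VERSION and mime.types data files next to the executable.
import os
import shutil

import scrapy

src = os.path.dirname(scrapy.__file__)           # installed scrapy package
dst = os.path.join('dist', 'begin', 'scrapy')    # output folder from PyInstaller
os.makedirs(dst, exist_ok=True)
for name in ('VERSION', 'mime.types'):
    shutil.copy(os.path.join(src, name), os.path.join(dst, name))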

 

This time my packaging method uses the spec file.

The original Scrapy project is here:

https://blog.csdn.net/qq_38282706/article/details/80058548

Of course I made some changes; it is a year-old project and I have improved since then!

1. My own Scrapy files for crawling Douban:

begin.py, the script that launches everything:

# -*- coding: utf-8 -*-
# imports needed for packaging (I packaged with a spec file, so the lines below can all be deleted)
# import urllib.robotparser
# import scrapy.spiderloader
# import scrapy.statscollectors
# import scrapy.logformatter
# import scrapy.dupefilters
# import scrapy.squeues
# import scrapy.extensions.spiderstate
# import scrapy.extensions.corestats
# import scrapy.extensions.telnet
# import scrapy.extensions.logstats
# import scrapy.extensions.memusage
# import scrapy.extensions.memdebug
# import scrapy.extensions.feedexport
# import scrapy.extensions.closespider
# import scrapy.extensions.debug
# import scrapy.extensions.httpcache
# import scrapy.extensions.statsmailer
# import scrapy.extensions.throttle
# import scrapy.core.scheduler
# import scrapy.core.engine
# import scrapy.core.scraper
# import scrapy.core.spidermw
# import scrapy.core.downloader
# import scrapy.downloadermiddlewares.stats
# import scrapy.downloadermiddlewares.httpcache
# import scrapy.downloadermiddlewares.cookies
# import scrapy.downloadermiddlewares.useragent
# import scrapy.downloadermiddlewares.httpproxy
# import scrapy.downloadermiddlewares.ajaxcrawl
# #import scrapy.downloadermiddlewares.chunked
# import scrapy.downloadermiddlewares.decompression
# import scrapy.downloadermiddlewares.defaultheaders
# import scrapy.downloadermiddlewares.downloadtimeout
# import scrapy.downloadermiddlewares.httpauth
# import scrapy.downloadermiddlewares.httpcompression
# import scrapy.downloadermiddlewares.redirect
# import scrapy.downloadermiddlewares.retry
# import scrapy.downloadermiddlewares.robotstxt
# import scrapy.spidermiddlewares.depth
# import scrapy.spidermiddlewares.httperror
# import scrapy.spidermiddlewares.offsite
# import scrapy.spidermiddlewares.referer
# import scrapy.spidermiddlewares.urllength
# import scrapy.pipelines
# import scrapy.core.downloader.handlers.http
# import scrapy.core.downloader.contextfactory

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
# import os,json,csv,re,os,scrapy,pymysql,My_Tool
import subprocess
import multiprocessing
from tkinter import *


class test():
    def __init__(self, root):
        self.root = root                 # create the window
        self.root.title("hello world")   # window title
        self.root.geometry("320x220")    # window size
        Button(text='启动方式:crawl_spider', command=self.crawl_spider).pack()
        Button(text='启动方式:sub_call', command=self.sub_call).pack()
        Button(text='关闭方法:kill_terminate', command=self.kill_terminate).pack()
        Button(text='关闭方法:kill_Popen', command=self.kill_Popen).pack()

    # launch with Scrapy's own CrawlerProcess
    # PS: the original scrapy project folder and scrapy.cfg must sit next to the program for this to run!
    def crawl_spider(self):
        self.the_scrapy = multiprocessing.Process(target=crawl_spider)
        self.the_scrapy.start()

    # launch Scrapy with subprocess.call (Popen behaves the same)
    def sub_call(self):
        self.the_scrapy = multiprocessing.Process(target=sub_call)
        self.the_scrapy.start()

    # kill the process started above
    def kill_terminate(self):
        self.the_scrapy.terminate()

    # kill the process through cmd, by pid
    def kill_Popen(self):
        kill_command = "taskkill /pid %s /f" % self.the_scrapy.pid
        subprocess.Popen(kill_command, shell=True)


def crawl_spider():
    # Scrapy's default launch method; either kill method works for it
    from douban.spiders.spider import DoubanSpider
    process = CrawlerProcess(get_project_settings())
    process.crawl(DoubanSpider)
    process.start()


def sub_call():
    # launch Scrapy through cmd; this actually starts two programs, python.exe and scrapy.exe,
    # so both kill methods only kill python and scrapy keeps running
    child = subprocess.call("scrapy crawl douban", shell=True)


def work():
    root = Tk()
    test(root)
    root.mainloop()
if __name__ == '__main__':
    # packaging multiprocessing code with PyInstaller requires the line below
    multiprocessing.freeze_support()
    work()

A special note here: to test whether the spider could be launched in several different ways, I used tkinter, multiprocessing and subprocess.

Also, if you start Scrapy directly from a tkinter callback, the whole window freezes until the crawl finishes; run it in a separate process with multiprocessing and the tkinter window stays responsive (see the small sketch below).

PS: once packaged, the build that uses Scrapy's own CrawlerProcess launch method can run on other machines; but if you use the cmd command instead, the other machine must have Python and Scrapy installed before it can start!
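The following is a small self-contained sketch of that pattern; it is not part of the original project, and fake_crawl is just a stand-in for the real Scrapy target. The crawl runs in a child process while the Tk window polls it with after(), so the UI never blocks:

import multiprocessing
import time
import tkinter as tk


def fake_crawl():
    """Stand-in for the real Scrapy target (e.g. crawl_spider() above)."""
    time.sleep(5)


def watch(root, proc, button):
    """Poll the child process every 0.5 s so the Tk mainloop never blocks."""
    if proc.is_alive():
        root.after(500, watch, root, proc, button)
    else:
        button.config(state=tk.NORMAL, text='done')


def start(root, button):
    button.config(state=tk.DISABLED, text='crawling...')
    proc = multiprocessing.Process(target=fake_crawl)
    proc.start()
    watch(root, proc, button)


if __name__ == '__main__':
    multiprocessing.freeze_support()   # still required for a PyInstaller build
    root = tk.Tk()
    btn = tk.Button(root, text='start')
    btn.config(command=lambda: start(root, btn))
    btn.pack()
    root.mainloop()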

 

items.py

import scrapy


class DoubanmovieItem(scrapy.Item):
    # movie title
    name = scrapy.Field()
    # movie info
    info = scrapy.Field()
    # rating
    rating = scrapy.Field()
    # number of reviewers
    num = scrapy.Field()
    # classic quote
    quote = scrapy.Field()
    # poster image URL
    img_url = scrapy.Field()

 

middlewares.py

from scrapy import signals
from My_Tool import My_UA  # my own package


class RandomUserAgent(object):
    def __init__(self):
        self.agent = My_UA()  # module that supplies the User-Agent headers

    # pick a random UA for every request
    def process_request(self, request, spider):
        request.headers.setdefault('User-agent', self.agent.random_ua)  # random UA

    @classmethod
    # this part seems to come with the default middleware template, so it has to stay
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

 

settings.py

import os
BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

RETRY_ENABLED = False     # whether to retry failed requests
DOWNLOAD_DELAY = 2        # delay between requests
COOKIES_ENABLED = False   # whether to send cookies

# use my own middleware so every request gets a different UA
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgent': 20,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

ITEM_PIPELINES = {
    'douban.pipelines.JsonPipeline': 1,
    # !!! the image download must run last, otherwise the item it returns is wrong
    'douban.pipelines.ImagePipeline': 4,
    'douban.pipelines.DBPipeline': 3,
    'douban.pipelines.CSVPipeline': 2,
}

# image download settings
IMAGES_STORE = os.getcwd() + '\\'   # save path
IMAGES_EXPIRES = 90                 # images fetched within 90 days are not re-downloaded

ROBOTSTXT_OBEY = False

 

pipelines.py

Four kinds of saving: the movie info is written to CSV and JSON, inserted into a MySQL database, and the poster images are downloaded.

import json,csv,re,os
import scrapy
from scrapy.pipelines.images import ImagesPipeline  # image pipeline
from subprocess import call  # used to call cmd to start/stop MySQL
from My_Tool import Use_Mysql  # my own package, used to run MySQL inserts/statements


class JsonPipeline(object):
    '''Save items to a JSON file.'''
    def process_item(self, item, spider):
        file_path = os.getcwd() + '\\' + 'data.json'
        with open(file_path, 'a+', encoding='utf-8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            f.write(line)
        return item  # this pipeline runs first, so it must return the item or the next pipeline gets nothing


class ImagePipeline(ImagesPipeline):
    '''Download the poster and rename the image file to the movie title.
    The other pipelines return normal items, but the items this one returns are
    duplicates, so it has to run last.'''
    def get_media_requests(self, item, info):
        # works like a normal spider request; this module alone could download the images
        yield scrapy.Request(item['img_url'], meta={'name': item['name']})

    def file_path(self, request, response=None, info=None):
        # needs the meta passed down above; the return value is the save path
        image_guid = request.url.split('/')[-1]
        newname = re.search(r'(\S+)', request.meta['name']).group(1)
        path = newname + '.jpg'
        return 'full/%s' % (path)


class CSVPipeline(object):
    '''Save items to a CSV file (writerow() takes a tuple or a list).'''
    def process_item(self, item, spider):
        # !!! newline='' is required; dialect="excel" seems optional
        file_path = 'data.csv'
        with open(file_path, 'a+', encoding='utf-8', newline='') as f:
            if os.path.getsize(file_path) == 0:
                # empty file: write the column names first
                csv.writer(f, dialect="excel").writerow(
                    ('name', 'info', 'rating', 'num', 'quote', 'img_url'))
            # write one row per movie
            csv.writer(f, dialect="excel").writerow(
                (item['name'], item['info'], item['rating'],
                 item['num'], item['quote'], item['img_url']))
        return item


class DBPipeline(object):
    '''Save items to MySQL: open_spider starts MySQL and creates the database and table,
    process_item inserts each crawled item, and close_spider shuts MySQL down.'''
    def __init__(self):
        self.sql_conf = {'host': 'localhost',
                         'user': 'winner',
                         'password': 'luochuan358',
                         'db': ''}

    # open_spider is called before the crawl starts, close_spider when it ends
    def open_spider(self, spider):
        '''Start MySQL, create the database and the table.'''
        # start MySQL
        call('net start MySQL')
        db = 'DOUBAN'
        self.table = '豆瓣电影TOP250'
        # connect without a database first so it can be (re)created
        store = Use_Mysql(self.sql_conf)
        sql1 = "CREATE DATABASE IF NOT EXISTS %s CHARACTER SET 'utf8'" % db
        store.query(sql1)
        # reconnect to the new database and create the table
        self.sql_conf['db'] = db
        self.store = Use_Mysql(self.sql_conf)
        sql = ("CREATE TABLE IF NOT EXISTS %s "
               "(name char(128) PRIMARY KEY, info char(128), rating char(30), "
               "num char(30), quote char(128), img_url char(128)) "
               "ENGINE=InnoDB DEFAULT CHARSET='utf8'" % self.table)
        self.store.query(sql)

    def close_spider(self, spider):
        # stop MySQL
        call('net stop MySQL')

    def process_item(self, item, spider):
        '''Insert one movie's info.'''
        state = self.store.insert_one_data(self.table, dict(item))
        return item

 

 

The main spider.py (it is actually very simple):

import scrapy
from douban.items import DoubanmovieItem
from scrapy.selector import Selector


class DoubanSpider(scrapy.Spider):
    name = "douban"          # this seems to be the name begin.py's cmd launch uses
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        sel = Selector(response)
        movies = response.xpath('//div[@class="item"]')
        item = DoubanmovieItem()
        for movie in movies:
            title = movie.xpath('.//div[@class="hd"]/a').xpath('string(.)').extract()
            name = "".join(title).strip()
            item['name'] = name.replace('\r\n', '').replace(' ', '').replace('\n', '')
            infos = movie.xpath('.//div[@class="bd"]/p').xpath('string(.)').extract()
            info = "".join(infos).strip()
            item['info'] = info.replace('\r\n', '').replace(' ', '').replace('\n', '')
            item['rating'] = movie.xpath('.//span[@class="rating_num"]/text()').extract()[0].strip()
            item['num'] = movie.xpath('.//div[@class="star"]/span[last()]/text()').extract()[0].strip()[:-3]
            quotes = movie.xpath('.//span[@class="inq"]/text()').extract()
            quote = quotes[0].strip() if quotes else '木有！'
            item['quote'] = quote
            item['img_url'] = movie.xpath('.//img/@src').extract()[0]
            yield item
        # the last page has no next_page link, which would raise an error, hence the try
        try:
            next_page = sel.xpath('//span[@class="next"]/a/@href').extract()[0]
        except:
            print('最后一页了！！')
        else:
            url = 'https://movie.douban.com/top250' + next_page
            yield scrapy.Request(url, callback=self.parse)

 

My own package, which provides the random UA selection and the MySQL insert helpers, is here:

See: https://blog.csdn.net/qq_38282706/article/details/88928540
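The package itself is not reproduced in this post, so below is only a rough sketch of the interface the code above assumes: My_UA exposing a random_ua property, and Use_Mysql exposing query() and insert_one_data(). The bodies are illustrative guesses (a short UA list, pymysql underneath), not the real implementation from that post:

# Illustrative sketch only -- the real My_Tool lives in the linked post.
import random

import pymysql


class My_UA(object):
    """Supplies a random User-Agent string via the random_ua property."""
    _agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14)',
    ]

    @property
    def random_ua(self):
        return random.choice(self._agents)


class Use_Mysql(object):
    """Thin pymysql wrapper with the two methods the pipelines call."""
    def __init__(self, conf):
        # drop empty values (e.g. 'db': '') so connect() only gets real arguments
        self.conn = pymysql.connect(charset='utf8',
                                    **{k: v for k, v in conf.items() if v})

    def query(self, sql):
        with self.conn.cursor() as cur:
            cur.execute(sql)
        self.conn.commit()

    def insert_one_data(self, table, data):
        keys = ', '.join(data.keys())
        marks = ', '.join(['%s'] * len(data))
        sql = 'INSERT IGNORE INTO %s (%s) VALUES (%s)' % (table, keys, marks)
        with self.conn.cursor() as cur:
            cur.execute(sql, tuple(data.values()))
        self.conn.commit()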

 

 

Now for the key part:

The spec file used for packaging:

# -*- mode: python -*-
import sys

sys.setrecursionlimit(5000)
# raise the recursion limit to a value large enough for the build to finish

block_cipher = None

a = Analysis(
    # the py file to package; pathex is the build's root directory (absolute path),
    # files under this directory can be listed by name only
    ['begin.py'],
    pathex=['C:\\Users\\Administrator\\Desktop\\douban'],
    # binaries adds binary files to the bundle; missing DLLs can be pulled in this way (also tuples)
    binaries=[],
    # data files the project uses, such as icons or text files, as tuples of
    # (path in the source project, path inside the bundle)
    datas=[('.\\scrapy', 'scrapy'),
           (".\\scrapy.cfg", "."),
           ('.\\douban', 'douban')],
    # Scrapy insists on having the scrapy folder, so I let the spec copy it instead of doing it by hand;
    # scrapy.cfg is needed too, and douban is the original Scrapy project
    # modules to import (PS: your own packages can also be dropped into datas)
    hiddenimports=[
        "scrapy.spiderloader",
        "scrapy.logformatter",
        "scrapy.dupefilters",
        "scrapy.squeues",
        "scrapy.extensions.spiderstate",
        "scrapy.extensions.corestats",
        "scrapy.extensions.telnet",
        "scrapy.extensions.logstats",
        "scrapy.extensions.memusage",
        "scrapy.extensions.memdebug",
        "scrapy.extensions.feedexport",
        "scrapy.extensions.closespider",
        "scrapy.extensions.debug",
        "scrapy.extensions.httpcache",
        "scrapy.extensions.statsmailer",
        "scrapy.extensions.throttle",
        "scrapy.core.scheduler",
        "scrapy.core.engine",
        "scrapy.core.scraper",
        "scrapy.core.spidermw",
        "scrapy.core.downloader",
        "scrapy.downloadermiddlewares.stats",
        "scrapy.downloadermiddlewares.httpcache",
        "scrapy.downloadermiddlewares.cookies",
        "scrapy.downloadermiddlewares.useragent",
        "scrapy.downloadermiddlewares.httpproxy",
        "scrapy.downloadermiddlewares.ajaxcrawl",
        "scrapy.downloadermiddlewares.chunked",
        "scrapy.downloadermiddlewares.decompression",
        "scrapy.downloadermiddlewares.defaultheaders",
        "scrapy.downloadermiddlewares.downloadtimeout",
        "scrapy.downloadermiddlewares.httpauth",
        "scrapy.downloadermiddlewares.httpcompression",
        "scrapy.downloadermiddlewares.redirect",
        "scrapy.downloadermiddlewares.retry",
        "scrapy.downloadermiddlewares.robotstxt",
        "scrapy.spidermiddlewares.depth",
        "scrapy.spidermiddlewares.httperror",
        "scrapy.spidermiddlewares.offsite",
        "scrapy.spidermiddlewares.referer",
        "scrapy.spidermiddlewares.urllength",
        "scrapy.pipelines",
        "scrapy.core.downloader.handlers.http",
        "scrapy.core.downloader.contextfactory",
        "os", "json", "csv", "re", 'scrapy', "pymysql",
        # 'multiprocessing',
        'subprocess', 'tkinter',
        "My_Tool",   # my own package; when it will run on other machines, best to also put it into datas
    ],
    # location of your own hook files, a str (tested, not needed here)
    hookspath=[],
    runtime_hooks=[],
    excludes=[],
    win_no_prefer_redirects=False,
    win_private_assemblies=False,
    cipher=block_cipher)

pyz = PYZ(a.pure, a.zipped_data,
          cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='begin',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True)    # whether to show the cmd console window
coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               name='begin')

Note again: a spec file fed to PyInstaller must not contain any Chinese characters, so if you reuse this one, remember to delete any Chinese comments first (the comments above are shown translated).
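One optional extra, not used in the original post: instead of typing pyinstaller begin.spec at the cmd prompt, the build can also be started from Python through PyInstaller's own entry point:

# Build the bundle from the spec file without touching the command line.
# Equivalent to running `pyinstaller --noconfirm begin.spec` in cmd.
import PyInstaller.__main__

PyInstaller.__main__.run(['--noconfirm', 'begin.spec'])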

 

OK, time to build. (The original post ends here with screenshots: the PyInstaller run, the finished dist/begin folder, the packaged program starting up, and the crawl completing successfully.)