
Scrapy downloader middleware

Date: 2019-11-10 13:51:20


The downloader middlewares are listed below:

 

['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']

 

 

The four methods of a downloader middleware

 

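from_crawler(cls, crawler): configuration hook, used to build the middleware from the crawler and its settings

process_request: handles each request

process_response: handles each response

process_exception: triggered when an exception occurs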
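
As a rough illustration of how these four methods fit together, here is a minimal skeleton. The class name `SketchDownloaderMiddleware` and the `SKETCH_ENABLED` setting are made up for this example; only the four method names and their arguments come from Scrapy.

```python
class SketchDownloaderMiddleware:
    """Hypothetical skeleton showing the four downloader middleware hooks."""

    def __init__(self, enabled=True):
        self.enabled = enabled

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware. crawler.settings exposes
        # the project settings; SKETCH_ENABLED is an invented setting name.
        return cls(enabled=crawler.settings.getbool("SKETCH_ENABLED", True))

    def process_request(self, request, spider):
        # Called for each request on its way to the downloader.
        # Returning None lets the request continue through the chain.
        return None

    def process_response(self, request, response, spider):
        # Called for each response on its way back to the engine.
        # Returning the response passes it on unchanged.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download, or an earlier process_request, raises.
        # Returning None leaves the exception to the other middlewares.
        return None
```

All four methods are optional; a middleware only needs to define the ones it actually uses.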

 

Randomly switching the User-Agent

# middlewares.py
from faker import Faker


class MySpiderMiddleware(object):
    def __init__(self):
        self.fake = Faker()

    def process_request(self, request, spider):
        # setdefault only fills in User-Agent if the request does not already have one
        request.headers.setdefault('User-Agent', self.fake.user_agent())


# settings.py
DOWNLOADER_MIDDLEWARES = {
    # 'middle.middlewares.MyCustomDownloaderMiddleware': 543,
    'middle.middlewares.MySpiderMiddleware': 100,
    # Disable the built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}


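Method 1: configure it in settings. I have not tested whether this keeps one randomly chosen value or picks a new one for every request. Since the settings module is evaluated once at startup, random.choice should run only once, so the same User-Agent would be used for the whole run.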

import random

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
USER_AGENT = random.choice(USER_AGENT_LIST)
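
To see why a module-level random.choice differs from a per-request choice, here is a small standalone sketch. The list values and the helper name `pick_user_agent` are placeholders for illustration and do not come from Scrapy.

```python
import random

USER_AGENT_LIST = ["agent-a", "agent-b", "agent-c"]  # placeholder values

# Module-level code runs once when the module is loaded,
# so this value stays fixed for the rest of the process.
USER_AGENT = random.choice(USER_AGENT_LIST)


def pick_user_agent():
    # A function body runs on every call, so the result can differ each time.
    return random.choice(USER_AGENT_LIST)
```

A settings module behaves like the first case, which is why per-request rotation is usually done inside a middleware's process_request instead.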

  

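Method 2: write your own random User-Agent middleware and enable it in settings. Give it an early position in the order, for example 100, or disable the default UserAgentMiddleware by setting it to None.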
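
One possible shape for such a middleware is sketched below. It assumes the custom USER_AGENT_LIST setting from method 1 is defined in settings; the class name `RandomUserAgentMiddleware` is chosen for this example only. Unlike setdefault, the assignment here overwrites any User-Agent already set on the request.

```python
import random


class RandomUserAgentMiddleware:
    """Sketch: pick a User-Agent from USER_AGENT_LIST for every request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a custom setting, not one built into Scrapy.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            # Chosen anew on each call, so consecutive requests can differ.
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

It would be enabled through DOWNLOADER_MIDDLEWARES in the same way as the faker-based class, for instance under the path 'middle.middlewares.RandomUserAgentMiddleware' if it lives in the same module.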

 

Method 3: subclass the default UserAgentMiddleware and override its method.

The middleware can use faker, or you can supply your own list.

def process_request(self, request, spider):
    request.headers.setdefault('User-Agent', self.fake.user_agent())


Original article: https://www.cnblogs.com/php-linux/p/11829432.html
