Scraping some renovation showcase images from a home-renovation website with Scrapy

Date: 2019-07-26 10:43:45


Scraping image resources

The spider file
import re
import time

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import ZhuangxiuItem


class ZhuangxiuspiderSpider(CrawlSpider):
    name = 'zhuangxiuSpider'
    allowed_domains = ['www.zhuangyi.com']
    start_urls = ['http://www.zhuangyi.com/xiaoguotu/keting/p1/']

    rules = (
        # callback names the method that receives the response of each matched page
        # Step 2: next pages of the category listing (disabled)
        # Rule(LinkExtractor(allow=r'/p\d+'), follow=True),
        # Step 3: detail pages
        Rule(LinkExtractor(allow=r'\d+\.html'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        # Collect every gallery image URL embedded in the page source
        img_url_list = re.findall(r'http://pic\.zhuangyi\.com/Member/\d/\d+/./\d+\.jpg', response.text)
        item = ZhuangxiuItem()
        item['image_urls'] = img_url_list
        # The current timestamp stands in for a title
        item['title'] = time.time()
        yield item
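
The two patterns the spider relies on can be checked offline before running a crawl. The sketch below writes them with the literal dots escaped and runs them against a made-up page fragment. The sample HTML, its URLs, and the helper name extract_image_urls are assumptions for illustration only; the real site's markup may differ.

```python
import re

# The spider's two patterns, with literal dots escaped.
DETAIL_LINK = re.compile(r'\d+\.html')
IMAGE_URL = re.compile(r'http://pic\.zhuangyi\.com/Member/\d/\d+/./\d+\.jpg')

# Hypothetical page fragment; the real site's markup may differ.
sample_html = '''
<a href="http://www.zhuangyi.com/xiaoguotu/12345.html">detail</a>
<img src="http://pic.zhuangyi.com/Member/1/2019/a/100.jpg">
<img src="http://pic.zhuangyi.com/Member/2/2018/b/200.jpg">
<img src="http://pic.zhuangyi.com/logo.png">
'''


def extract_image_urls(html):
    """Return every URL in html that matches IMAGE_URL, in document order."""
    return IMAGE_URL.findall(html)


image_urls = extract_image_urls(sample_html)
print(image_urls)
```

Only the two Member/ images match; the logo is skipped. A listing URL such as .../keting/p2/ does not match DETAIL_LINK, so with the step 2 rule commented out the crawl only follows detail links it can reach from the start page.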

In items.py


import scrapy


class ZhuangxiuItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    image_urls = scrapy.Field()

In settings.py

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Referer': 'http://www.zhuangyi.com/'
}


IMAGES_STORE = 'img'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 300,
}
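
With these settings the images end up under the img directory. As far as I know, the stock ImagesPipeline does not use the item's title for file names; by default it stores each image as full/<SHA-1 of the URL>.jpg relative to IMAGES_STORE. The sketch below only approximates that naming with the standard library. The function name default_image_path and the sample URL are hypothetical, and the exact behavior may vary between Scrapy versions.

```python
import hashlib


def default_image_path(url):
    """Approximate ImagesPipeline's default relative path: full/<sha1 of URL>.jpg."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return f'full/{digest}.jpg'


# Hypothetical image URL in the shape the spider's regex matches.
url = 'http://pic.zhuangyi.com/Member/1/2019/a/100.jpg'
path = default_image_path(url)
print(path)
```

To name files after the title instead, one would have to subclass the pipeline and override its file_path method. Note also that ImagesPipeline needs Pillow installed.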


Original article: https://www.cnblogs.com/wangyue0925/p/11248709.html
