Scrapy framework

Date: 2019-01-15 01:00:46

1. Installing the scrapy module

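1.
pip install wheel
This module is installed in preparation for Twisted, because installing Twisted requires wheel.
2.
Download Twisted, because Scrapy is an asynchronous, non-blocking framework built on Twisted.
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
Pick the file that matches your Python version and whether your operating system is 32-bit or 64-bit.
3.
Install Twisted: change into the download directory, then run pip against the file you downloaded
pip install <name of the downloaded Twisted .whl file>
4.
pip install pywin32
5.
pip install scrapy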
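
As a quick sanity check after the steps above, the following sketch reports which of the installed packages the current interpreter cannot find. The helper name missing_modules is hypothetical (it is not part of Scrapy); it relies only on importlib.util.find_spec from the standard library.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the packages installed above. pywin32 is left out
    # because it is imported under other module names (such as win32api).
    print(missing_modules(["wheel", "twisted", "scrapy"]))
```

An empty list means every listed package was found; any name that is printed still needs to be installed.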

2. Creating a Scrapy project

Run the following command in a terminal to create a Scrapy project:

D:\OneDrive - hk sar baomin inc\Python\project\Reptile\day06>scrapy startproject testDemo
New Scrapy project 'testDemo', using template directory 'c:\\python36\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\OneDrive - hk sar baomin inc\Python\project\Reptile\day06\testDemo

You can start your first spider with:
    cd testDemo
    scrapy genspider example example.com

3. Creating a Scrapy spider

cd testDemo
D:\OneDrive - hk sar baomin inc\Python\project\Reptile\day06\testDemo>scrapy genspider test www.baidu.com
Created spider 'test' using template 'basic' in module:
  testDemo.spiders.test

4. Directory structure of a Scrapy project

  scrapy.cfg
│
└─testDemo
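    │  items.py    # Item definitions, used to pass data along for persistence
    │  middlewares.py    # middleware file
    │  pipelines.py    # pipeline file
    │  settings.py    # configuration file
    │  __init__.py
    │
    ├─spiders    # directory that holds the spider files
    │  │  test.py    # the spider file that was just created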
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

5. The spider file

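# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'  # spider name, the one passed to "scrapy crawl"
    allowed_domains = ['www.baidu.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.baidu.com/']  # list of start URLs

    def parse(self, response):
        """
        Parse the data returned for a request.
        :param response: the response returned by the server
        :return: item objects, used for persistent storage
        """
        # response.text     # the response body as a string
        # response.url  # the URL of the request
        # response.xpath()  # parse the data with XPath
        # response.meta  # used to pass data along with a request; download_timeout, proxy and similar parameters also live here
        pass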

6. Data persistence in Scrapy
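
One way to persist items is an item pipeline in pipelines.py. The following is a minimal sketch, not the original author's code: the class name JsonLinesPipeline and the output file name items.jl are hypothetical. It relies on the methods Scrapy calls on a pipeline (open_spider, process_item, close_spider) and writes each item as one JSON line.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes every item to a JSON Lines file."""

    def __init__(self, path="items.jl"):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        # so that later pipelines can still process it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

A pipeline only runs once it is registered in settings.py, for example ITEM_PIPELINES = {'testDemo.pipelines.JsonLinesPipeline': 300}, where the module path assumes the project created above.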

7. Starting the spider

8. Sending requests manually in Scrapy

9. How Scrapy runs (core components)

10. Uses of request parameter passing in Scrapy

11. How to improve Scrapy's crawling efficiency
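
Crawling speed is usually tuned through settings.py. The fragment below is a sketch of commonly adjusted settings; the setting names are Scrapy's own, but the values are illustrative assumptions rather than figures from the original, and they should be tuned for the target site.

```python
# settings.py (fragment): illustrative values, not from the original article

CONCURRENT_REQUESTS = 100  # more requests in flight (Scrapy's default is 16)
LOG_LEVEL = "ERROR"        # less logging overhead than the default DEBUG
COOKIES_ENABLED = False    # skip cookie handling when the site does not need it
RETRY_ENABLED = False      # do not re-download failed requests
DOWNLOAD_TIMEOUT = 10      # give up on slow responses after 10 seconds (default is 180)
```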

12. Crawling with a User-Agent pool and an IP proxy pool
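
A User-Agent pool and a proxy pool can be applied in a downloader middleware in middlewares.py. The following is a minimal sketch under stated assumptions: the class name RandomUserAgentProxyMiddleware is hypothetical, and the User-Agent strings and proxy addresses are placeholders to be replaced with real values. It relies on the process_request hook that Scrapy calls for every outgoing request, and on the proxy key of request.meta.

```python
import random

# Placeholder pools: replace with real User-Agent strings and working proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0",
]
PROXIES = [
    "http://127.0.0.1:8888",
    "http://127.0.0.1:8889",
]


class RandomUserAgentProxyMiddleware:
    """Hypothetical downloader middleware: random User-Agent and proxy per request."""

    def process_request(self, request, spider):
        # Pick a different identity for each outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # let the request continue through the remaining middlewares
```

The middleware takes effect once it is registered in settings.py, for example DOWNLOADER_MIDDLEWARES = {'testDemo.middlewares.RandomUserAgentProxyMiddleware': 543}, where the module path assumes the project created above.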


Original article: https://www.cnblogs.com/594504110python/p/10269397.html
