
A Scrapy Example


1. Goal:

  Scrapy is a web-crawling framework. This post uses a simple example to walk through the basic steps of using Scrapy.

 

2. Create a Scrapy project:

  Create a project named firstSpider with the following command:

[jianglexing@cstudio ~]$ scrapy startproject firstSpider 
New Scrapy project firstSpider, using template directory /usr/local/python-3.6.2/lib/python3.6/site-packages/scrapy/templates/project, created in:
    /home/jianglexing/firstSpider

You can start your first spider with:
    cd firstSpider
    scrapy genspider example example.com

  

3. What the scrapy command does when creating a project:

  When you create a project, Scrapy creates a directory and populates it with several files (see the notes on each file after the listing below):

[jianglexing@cstudio ~]$ tree firstSpider/
firstSpider/
├── firstSpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files
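
  Of these files, items.py defines containers for scraped data, pipelines.py post-processes scraped items, middlewares.py hooks into the request/response cycle, and settings.py holds the project configuration; spider modules live under spiders/. Below is a minimal sketch of an item for this example (the PageTitleItem name is my own illustration, not something the command generates):

# firstSpider/items.py -- illustrative sketch only
import scrapy


class PageTitleItem(scrapy.Item):
    # declare one Field per piece of data the spider will collect
    title = scrapy.Field()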

 

4. Enter the project directory and create a spider:

[jianglexing@cstudio ~]$ cd firstSpider/
[jianglexing@cstudio firstSpider]$ scrapy genspider financeSpider www.financedatas.com
Created spider financeSpider using template basic in module:
  firstSpider.spiders.financeSpider
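
  The basic template produces a minimal scaffold; in Scrapy 1.4 the generated file looks roughly like this, with name, allowed_domains, and start_urls pre-filled from the genspider arguments:

# -*- coding: utf-8 -*-
import scrapy


class FinancespiderSpider(scrapy.Spider):
    name = 'financeSpider'
    allowed_domains = ['www.financedatas.com']
    start_urls = ['http://www.financedatas.com/']

    def parse(self, response):
        pass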

 

5. Each spider corresponds to one file in a Scrapy project:

[jianglexing@cstudio firstSpider]$ tree ./
./
├── firstSpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── settings.py
│   └── spiders
│       ├── financeSpider.py    # the spider file we just created
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-36.pyc
└── scrapy.cfg

 

6. Write the spider's processing logic:

  As an example, let's extract the title of the homepage of http://www.financedatas.com.

# -*- coding: utf-8 -*-
import scrapy


class FinancespiderSpider(scrapy.Spider):
    name = 'financeSpider'
    allowed_domains = ['www.financedatas.com']
    start_urls = ['http://www.financedatas.com/']

    def parse(self, response):
        """The processing logic goes in the parse method."""
        print('*' * 64)
        title = response.xpath('//title/text()').extract()  # extract the data with XPath syntax
        print(title)
        print('*' * 64)
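
  Printing is fine for a quick demo, but the more idiomatic pattern is to yield the scraped data so Scrapy's item pipelines and feed exports can process it. A minimal variant of parse along those lines (the 'title' key is an arbitrary choice):

    def parse(self, response):
        # yielded dicts are treated as items and routed through item
        # pipelines and feed exports, e.g. scrapy crawl financeSpider -o titles.json
        yield {'title': response.xpath('//title/text()').extract_first()}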

 

7. Run the spider and check the result:

[jianglexing@cstudio spiders]$ scrapy crawl financeSpider
2017-08-10 16:11:38 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: firstSpider)
2017-08-10 16:11:38 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'firstSpider', 'NEWSPIDER_MODULE': 'firstSpider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['firstSpider.spiders']}
.... ....
2017-08-10 16:11:39 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.financedatas.com/robots.txt> (referer: None)
2017-08-10 16:11:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.financedatas.com/> (referer: None)
****************************************************************
[欢迎来到 www.financedatas.com]   # the extracted data ("Welcome to www.financedatas.com")
****************************************************************
2017-08-10 16:11:39 [scrapy.core.engine] INFO: Spider closed (finished)
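
  The spider can also be run from a plain Python script instead of the scrapy CLI; below is a minimal sketch using Scrapy's CrawlerProcess (the run_spider.py file name is arbitrary, and the script assumes it is executed from the project root):

# run_spider.py -- run the spider programmatically
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from firstSpider.spiders.financeSpider import FinancespiderSpider

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl(FinancespiderSpider)
process.start()  # blocks until the crawl finishes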

----

Original article: http://www.cnblogs.com/JiangLe/p/7339902.html
