
Python Crawler Framework Scrapy Study Notes 6 ------- Basic Commands

Date: 2015-01-07 19:07:36

Tags: scrapy

1. Some Scrapy commands are available only from the root directory of a Scrapy project, for example the crawl command.
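For instance (an illustrative session; the exact error text can vary between Scrapy versions), running crawl outside a project directory fails:

scrapy crawl taobao
Unknown command: crawl

Use "scrapy" to see available commands

From inside the project directory, the same command runs the spider.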


2. scrapy genspider taobao detail.tmall.com

genspider takes a spider name and a domain (the product page http://detail.tmall.com/item.htm?id=12577759834 is served under detail.tmall.com). It automatically generates taobao.py in the spiders directory:

# -*- coding: utf-8 -*-
import scrapy


class TaobaoSpider(scrapy.Spider):
    name = "taobao"
    allowed_domains = ["detail.tmall.com"]
    start_urls = (
        'http://detail.tmall.com/',
    )

    def parse(self, response):
        pass
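The generated spider can then be run from the project root with:

scrapy crawl taobao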

Other templates are also available.
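scrapy genspider -l lists the templates that ship with Scrapy; on this version the output looks roughly like:

scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

For example, generating a spider from the crawl template: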

scrapy genspider taobao2 detail.tmall.com --template=crawl

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from project004.items import Project004Item


class Taobao2Spider(CrawlSpider):
    name = 'taobao2'
    allowed_domains = ['detail.tmall.com']
    start_urls = ['http://detail.tmall.com/']

    rules = (
        # Follow links whose URL matches "Items/" and parse each page with parse_item
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = Project004Item()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
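The crawl template imports Project004Item from the project's items.py. A minimal sketch of that item class, assuming only the three fields referenced in the commented-out lines above, could look like this:

# project004/items.py -- illustrative; field names mirror the commented-out assignments
import scrapy


class Project004Item(scrapy.Item):
    domain_id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()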


3. List all spiders in the current project: scrapy list
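With the two spiders generated above, the output is simply one spider name per line:

taobao
taobao2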

4. Usage of the fetch command

A. scrapy fetch --nolog http://www.example.com/some/page.html

   (downloads the page with the Scrapy downloader and prints the response body)

B. scrapy fetch --nolog --headers http://www.example.com/

   (prints the response headers instead of the body)

5. The view command opens a page in the browser, showing it as Scrapy "sees" it:

   scrapy view http://www.example.com/some/page.html

6. Inspect settings

scrapy settings --get BOT_NAME
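Run from inside this project, the command above prints the BOT_NAME set by startproject in settings.py:

project004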

7. Run a self-contained spider, without creating a project:

scrapy runspider <spider_file.py>
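A self-contained spider is just a single .py file containing a Spider subclass. A minimal sketch (the file name and URL are purely illustrative):

# standalone_spider.py -- minimal self-contained spider, runnable without a project
import scrapy


class StandaloneSpider(scrapy.Spider):
    name = 'standalone'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Log the page title extracted from the response
        self.log(response.xpath('//title/text()').extract())

It is run directly with:

scrapy runspider standalone_spider.py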

8. Deploying a Scrapy project: scrapy deploy

Deploying spiders requires a server environment to run them on; scrapyd is the usual choice.

Install scrapyd: pip install scrapyd

Documentation: http://scrapyd.readthedocs.org/en/latest/install.html
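As a rough sketch (assuming a scrapyd instance running locally on its default port 6800), the project's scrapy.cfg gets a [deploy] target that scrapy deploy uploads to:

# scrapy.cfg -- illustrative deploy target pointing at a local scrapyd
[settings]
default = project004.settings

[deploy]
url = http://localhost:6800/
project = project004

Then start the server (in a separate shell) and push the project from the project root:

scrapyd
scrapy deploy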

9. All available commands

C:\Users\IBM_ADMIN\PycharmProjects\pycrawl\project004>scrapy

Scrapy 0.24.4 - project: project004


Usage:

  scrapy <command> [options] [args]


Available commands:

  bench         Run quick benchmark test

  check         Check spider contracts

  crawl         Run a spider

  deploy        Deploy project in Scrapyd target

  edit          Edit spider

  fetch         Fetch a URL using the Scrapy downloader

  genspider     Generate new spider using pre-defined templates

  list          List available spiders

  parse         Parse URL (using its spider) and print the results

  runspider     Run a self-contained spider (without creating a project)

  settings      Get settings values

  shell         Interactive scraping console

  startproject  Create new project

  version       Print Scrapy version

  view          Open URL in browser, as seen by Scrapy








Original source: http://dingbo.blog.51cto.com/8808323/1600296
