
Getting Started with Scrapy


Scrapy is made up of the following components:

    spiders: the crawler module; defines which data to crawl and the crawling rules, and parses responses into structured data

    items: define the structured data we want; an item is used much like a dict

    pipelines: the pipeline module; processes the structured data produced by the spiders, e.g. saving it to a database (see the minimal sketch after this list)

    middlewares: hooks around the crawl that let you pre- and post-process requests and responses, e.g. modifying request headers or filtering URLs
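As a rough illustration of how items and pipelines fit together (the class, field, and file names below are hypothetical and not part of this post's examples), a minimal sketch might look like this:

# items.py -- a hypothetical item with the fields used in Example 1 below
import scrapy

class CourseItem(scrapy.Item):
    classname = scrapy.Field()
    classdate = scrapy.Field()
    imageaddr = scrapy.Field()

# pipelines.py -- a minimal pipeline that appends each item to a JSON-lines file
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

The pipeline only takes effect once it is registered under ITEM_PIPELINES in settings.py.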

References: http://python.gotrained.com/scrapy-tutorial-web-scraping-craigslist/

            https://doc.scrapy.org/en/latest/

This post only covers common spider examples; the remaining parts (items, pipelines, settings, etc.) will be covered in later posts.

Example 1: scraping content from a single page (scrape the course name, course details, and start date of julyedu.com's featured courses)

import scrapy

class julyClassSpider(scrapy.Spider):
    name = 'julyclass'
    start_urls = ['https://www.julyedu.com/category/index']

    def parse(self, response):
        for classinfo in response.xpath('//div[@class="item"]/div/div'):
            classname = classinfo.xpath('a[1]/h4/text()').extract_first()
            classdate = classinfo.xpath('a[1]/p[2]/text()').extract_first()
            # urljoin resolves the (possibly relative) image src against the page URL
            imageaddr = response.urljoin(classinfo.xpath('a[1]/img[1]/@src').extract_first())
            yield {"classname": classname, "classdate": classdate, "imageaddr": imageaddr}
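To try the spider above as a standalone script (assuming it is saved as, say, julyclass.py; the filename is an assumption, not from the original post), run: scrapy runspider julyclass.py -o classes.json. The -o flag writes each yielded dict to the output file.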

Example 2: scraping content across consecutive pages (scrape the first 10 pages of featured posts on cnblogs.com)

import scrapy
import re

class cnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    # pages 1-10 of the "pick" (featured) listing
    start_urls = ['https://www.cnblogs.com/pick/' + str(n) + '/' for n in range(1, 11)]

    def parse(self, response):
        for post in response.xpath('//div[@class="post_item_body"]'):
            title = post.xpath('h3/a/text()').extract_first()
            href = post.xpath('h3/a/@href').extract_first()
            pubdate = post.xpath('div[@class="post_item_foot"]/text()')[1].extract().strip()
            pubdate = re.split(' ', pubdate)[1] + ' ' + re.split(' ', pubdate)[2]
            comments = post.xpath('div[@class="post_item_foot"]/span[1]/a/text()').extract_first()
            comments = re.split(r'\(|\)', comments)[1]
            reads = post.xpath('div[@class="post_item_foot"]/span[2]/a/text()').extract_first()
            reads = re.split(r'\(|\)', reads)[1]
            yield {'title': title, 'url': href, 'pubdate': pubdate, 'comments': comments, 'reads': reads}

Run: scrapy runspider scrapy2.py

The start_urls list is built with a list comprehension (string concatenation inside a for expression).
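An equivalent and arguably more idiomatic approach is to override start_requests() and generate the page requests lazily instead of pre-building the list; a minimal sketch (only the URL generation differs, the parse method stays the same as above):

import scrapy

class cnblogsSpider(scrapy.Spider):
    name = "cnblogs"

    def start_requests(self):
        # yield one Request per listing page instead of building start_urls up front
        for n in range(1, 11):
            yield scrapy.Request('https://www.cnblogs.com/pick/%d/' % n, callback=self.parse)

    def parse(self, response):
        ...  # same parsing logic as in the example above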


Example 3: scraping multiple consecutive pages by following a specific button (Next):

import scrapy
import re

class cnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    start_urls = ['https://www.cnblogs.com/pick/']

    def parse(self, response):
        for post in response.xpath('//div[@class="post_item_body"]'):
            title = post.xpath('h3/a/text()').extract_first()
            href = post.xpath('h3/a/@href').extract_first()
            pubdate = post.xpath('div[@class="post_item_foot"]/text()')[1].extract().strip()
            pubdate = re.split(' ', pubdate)[1] + ' ' + re.split(' ', pubdate)[2]
            comments = post.xpath('div[@class="post_item_foot"]/span[1]/a/text()').extract_first()
            comments = re.split(r'\(|\)', comments)[1]
            reads = post.xpath('div[@class="post_item_foot"]/span[2]/a/text()').extract_first()
            reads = re.split(r'\(|\)', reads)[1]
            yield {'title': title, 'url': href, 'pubdate': pubdate, 'comments': comments, 'reads': reads}
        # the last link in the pager is the "Next >" button; follow it and parse that page too
        url = response.xpath('//div[@class="pager"]/a[last()]/@href').extract()[0]
        nexturl = response.urljoin(url)
        yield scrapy.Request(nexturl, callback=self.parse)

Get the URL of the next page from the "Next" button, then parse it. Another example of the same pattern:

import scrapy
import re

class humorSpider(scrapy.Spider):
    name = 'humor'
    start_urls = ['http://quotes.toscrape.com/tag/humor/page/1/']

    def parse(self, response):
        for humor in response.xpath('//div[@class="quote"]'):
            sentence = humor.xpath('span[1]/text()').extract_first()
            author = humor.xpath('span[2]/small/text()').extract_first()
            yield {'sentence': sentence, 'author': author}
        next_url = response.xpath('//ul[@class="pager"]/li/a/@href').extract_first()
        pattern = re.compile(r'/')
        # only follow the link if it points to a higher page number than the current one,
        # so the "Previous" link never leads us backwards (compare as ints, not strings)
        if next_url is not None and int(pattern.split(next_url)[-2]) > int(pattern.split(response.url)[-2]):
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
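In Scrapy 1.4 and later there is a simpler option: selecting the "Next" button directly (on quotes.toscrape.com it sits in an li with class "next") removes the need to compare page numbers, and response.follow resolves relative URLs for you. A sketch of the same pagination step under those assumptions:

        # follow the "Next" button only; the element is absent on the last page
        next_href = response.xpath('//ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()
        if next_href is not None:
            yield response.follow(next_href, callback=self.parse)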

Example 4: parsing different pages with multiple callback functions

scrapy startproject qqnews

tree
.
|____qqnews
| |______init__.py
| |______pycache__
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py
| | |______pycache__
| | |____qqnews.py
|____scrapy.cfg

cd qqnews/spiders/

cat qqnews.py

import scrapy

class qqNewsSpider(scrapy.Spider):
    name = 'qqnews'
    start_urls = ['http://news.qq.com/']

    def parse(self, response):
        # the front page only yields article links; each article is handled by parse_news
        for url in response.xpath('//div[@class="text"]/em/a/@href').extract():
            yield scrapy.Request(url, callback=self.parse_news)

    def parse_news(self, response):
        try:
            title = response.xpath('//div[@class="hd"]/h1/text()').extract()[0]
            newstype = response.xpath('//div[@class="a_Info"]/span[1]/a/text()').extract()[0]
            source = response.xpath('//div[@class="a_Info"]/span[2]/a/text()').extract()[0]
            time = response.xpath('//span[@class="a_time"]/text()').extract()[0]
            # yield (rather than print) so the items reach the feed export below
            yield {'title': title, 'type': newstype, 'source': source, 'time': time}
        except IndexError:
            # some article pages do not match the expected layout; skip them
            self.logger.warning("unexpected page layout: %s", response.url)

Run: scrapy crawl qqnews -o news.csv
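The -o flag uses Scrapy's feed exports, and the output format is inferred from the file extension, so the same crawl could instead be written out as JSON or JSON lines, e.g. scrapy crawl qqnews -o news.json or scrapy crawl qqnews -o news.jl. Only items that the spider yields end up in the feed, which is why parse_news yields a dict instead of just printing.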

Source: the "WorkNote" blog, http://caiyuanji.blog.51cto.com/11462293/1982130
