Python Web Crawler

Date: 2017-05-27 19:14:23

Tags: web page, button, font, urllib2, user, ext, www, into, obj

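Goal: fetch all the jokes from http://www.qiushibaike.com/textnew/ and save them locally, one file per page number. There are 35 pages in total.
Without further ado, here is the code. The regular expression still needs more study.
A fragment of the site's HTML source: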

<a href="/users/32215536/" target="_blank" title="吃了两碗又盛">
<h2>吃了两碗又盛</h2>
</a>
<div class="articleGender manIcon">38</div>
</div>



<a href="/article/119080581" target="_blank" class='contentHerf' >
<div class="content">



<span>一声长叹！<br/>二十多年前，简陋的农村小学的某一个班级里，有两个男生坐在同一个长条凳上，这是我和我的同桌。<br/>同桌是冤家，偶尔他回座位，我悄悄地用力，使同桌那一端微微翘起，然后不动声色地向后旋转，同桌一下坐空，摔在地上，我偷着乐！<br/>同一招不能总用，但是小时候哪懂这个。那一次，我故技重施，怎料到同桌早有防备，腿向后踢，凳子翻起，我顺利地坐到地上。<br/>到现在，就怕阴天！</span>


</div>
</a>





<div class="stats">
<span class="stats-vote"><i class="number">62</i> 好笑</span>
<span class="stats-comments">




</span>
</div>
<div id="qiushi_counts_119080581" class="stats-buttons bar clearfix">
<ul class="clearfix">
<li id="vote-up-119080581" class="up">
<a href="javascript:voting(119080581,1)" class="voting" data-article="119080581" id="up-119080581" rel="nofollow">
<i></i>
<span class="number hidden">68</span>
</a>
</li>
<li id="vote-dn-119080581" class="down">
<a href="javascript:voting(119080581,-1)" class="voting" data-article="119080581" id="dn-119080581" rel="nofollow">
<i></i>
<span class="number hidden">-6</span>
</a>
</li>

<li class="comments">
<a href="/article/119080581" id="c-119080581" class="qiushi_comments" target="_blank">
<i></i>
</a>
</li>

</ul>
</div>
<div class="single-share">
<a class="share-wechat" data-type="wechat" title="分享到微信" rel="nofollow">微信</a>
<a class="share-qq" data-type="qq" title="分享到QQ" rel="nofollow">QQ</a>
<a class="share-qzone" data-type="qzone" title="分享到QQ空间" rel="nofollow">QQ空间</a>
<a class="share-weibo" data-type="weibo" title="分享到微博" rel="nofollow">微博</a>
</div>
<div class="single-clear"></div>




</div>









<div class="article block untagged mb15" id='qiushi_tag_119080574'>

<div class="author clearfix">
<a href="/users/31546279/" target="_blank" rel="nofollow">
<img src="//pic.qiushibaike.com/system/avtnew/3154/31546279/medium/20160530180038.jpg" alt="？最远的距离"/>
</a>
<a href="/users/31546279/" target="_blank" title="？最远的距离">
<h2>？最远的距离</h2>
</a>
<div class="articleGender manIcon">21</div>
</div>



<a href="/article/119080574" target="_blank" class='contentHerf' >
<div class="content">



<span>万能的糗友啊 苦舞一生情字谱<br/>这句话 谁能借下一句。</span>


</div>
</a>
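
Since the regular expression is the part that still needs study, here is a minimal sketch of one way to pull only the joke text out of a fragment like the one above. It is written for Python 3 and is not the author's method: the names SAMPLE_HTML, CONTENT_RE, BR_RE and extract_jokes are made up for illustration, and the pattern assumes every joke sits in a single <span> directly inside <div class="content">, as in the fragment shown here.

```python
import re

# Hypothetical sample, shortened from the fragment above.
SAMPLE_HTML = """
<a href="/article/119080574" target="_blank" class='contentHerf' >
<div class="content">


<span>万能的糗友啊 苦舞一生情字谱<br/>这句话 谁能借下一句。</span>


</div>
</a>
"""

# re.S lets '.' match newlines; '.*?' is non-greedy, so the capture
# group takes as little text as possible before '</span>'.
CONTENT_RE = re.compile(
    r'<div class="content">\s*<span>(.*?)</span>\s*</div>', re.S)
# Matches <br>, <br/> and <br />.
BR_RE = re.compile(r'<br\s*/?>')


def extract_jokes(html):
    """Return the text of each joke, with <br/> tags turned into newlines."""
    return [BR_RE.sub('\n', body).strip() for body in CONTENT_RE.findall(html)]


jokes = extract_jokes(SAMPLE_HTML)
for joke in jokes:
    print(joke)
```

Capturing the inside of the <span> avoids the chain of replace() calls needed to strip the wrapper tags afterwards. If the site's markup differs from this fragment (for example, extra elements between </span> and </div>), the pattern would need adjusting.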



Python code:



# coding=utf-8
# Python 2 script: urllib2 exists only in Python 2.
import urllib2
import re
import os
import sys


class Spider(object):

    def __init__(self):
        # %s is replaced with the page number.
        self.url = 'http://www.qiushibaike.com/textnew/page/%s?s=4832452'
        self.user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0'

    def get_page(self, page_index):
        """Download one listing page and return its raw HTML."""
        headers = {'User-Agent': self.user_agent}
        url = self.url % str(page_index)
        try:
            request = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(request)
            return response.read()
        except urllib2.HTTPError as e:
            print(e)
            sys.exit(1)
        except urllib2.URLError as e:
            print('URLError: ' + url)
            sys.exit(1)

    def analysis(self, content):
        """Parse the page source and return every <div class="content"> block."""
        # Earlier attempts, kept for reference:
        # pattern = re.compile('<div class="content">(.*?)<!--(.*?)-->.*?</div>', re.S)
        # res_value = r'<span .*?>(.*?)</span>'
        # re.S lets '.' match newlines; '.*?' is non-greedy, so each match
        # stops at the first </div>.
        pattern = re.compile('<div class="content">.*?</div>', re.S)
        return re.findall(pattern, content)

    def save(self, items, i, path):
        """Save the jokes scraped from page i to <path>/<i>.txt."""
        if not os.path.exists(path):
            os.makedirs(path)
        file_path = path + '/' + str(i) + '.txt'
        with open(file_path, 'w') as f:
            for item in items:
                # Drop the source newlines, turn <br/> and the <span> tags
                # into line breaks, and remove the wrapping <div>.
                item_new = item.replace('\n', '').replace('<br/>', '\n').replace('<div class="content">', '').replace('</div>', '').replace('</span>', '\n').replace('<span>', '\n')
                f.write(item_new)

    def run(self):
        print('Start crawling pages')
        # Pages are numbered 1 to 35; range() excludes its upper bound.
        for i in range(1, 36):
            content = self.get_page(i)
            items = self.analysis(content)
            self.save(items, i, '/Users/huangxuelian/Downloads/41527218/pythontest')
        print('Finished crawling')


if __name__ == '__main__':
    spider = Spider()
    spider.run()
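
The script above targets Python 2. In Python 3, urllib2 was split into urllib.request and urllib.error, so the fetch and save steps might look roughly like the sketch below. This is an illustrative port, not the author's code: the helper names build_request, fetch_page and save_page are made up, and decoding the response as UTF-8 is an assumption about the site. The demonstration at the bottom runs offline and writes to a temporary directory.

```python
import os
import tempfile
import urllib.error
import urllib.request

URL_TEMPLATE = 'http://www.qiushibaike.com/textnew/page/%s?s=4832452'
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) '
              'Gecko/20100101 Firefox/53.0')


def build_request(page_index):
    """Build the request for one listing page (no network access yet)."""
    return urllib.request.Request(URL_TEMPLATE % page_index,
                                  headers={'User-Agent': USER_AGENT})


def fetch_page(page_index, timeout=10):
    """Return the decoded HTML of a page, or None if urlopen raises URLError."""
    try:
        with urllib.request.urlopen(build_request(page_index),
                                    timeout=timeout) as response:
            # Assumption: the site serves UTF-8; bad bytes are replaced.
            return response.read().decode('utf-8', errors='replace')
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print('Failed to fetch page %s: %s' % (page_index, e))
        return None


def save_page(text, page_index, directory):
    """Write text to <directory>/<page_index>.txt and return the file path."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, '%s.txt' % page_index)
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(text)
    return file_path


# Offline demonstration: nothing below touches the network.
request = build_request(3)
print(request.full_url)
demo_dir = tempfile.mkdtemp()
demo_path = save_page('first line\nsecond line\n', 3, demo_dir)
print(os.path.basename(demo_path))
```

A full run would loop over range(1, 36), call fetch_page for each page number, skip pages that come back as None, extract the jokes, and pass the joined text to save_page. Returning None instead of exiting lets the crawl continue when a single page fails.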


Original article: http://www.cnblogs.com/huangxuelian/p/6913875.html
