
Web Scraping Practice: Scraping Top-Commented Articles on Jianshu

Posted: 2017-12-11 14:05:47


jianshuwangarticle.py:

import requests
from lxml import etree
import pymongo
from multiprocessing import Pool

# Connect to the database
client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
jianshu_shouye = mydb['jianshu_shouye']


def get_jianshu_info(url):
    """Scrape one listing page and store each article's metadata in MongoDB."""
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Each <li> under the note list is one article entry
    infos = selector.xpath('//ul[@class="note-list"]/li')
    for info in infos:
        try:
            author = info.xpath('div/div[1]/div/a/text()')[0]
            time = info.xpath('div/div[1]/div/span/@data-shared-at')[0]
            title = info.xpath('div/a/text()')[0]
            content = info.xpath('div/p/text()')[0].strip()
            view = info.xpath('div/div[2]/a[1]/text()')[1].strip()
            comment = info.xpath('div/div[2]/a[2]/text()')[1].strip()
            like = info.xpath('div/div[2]/span[1]/text()')[0].strip()
            rewards = info.xpath('div/div[2]/span[2]/text()')
            # Not every article has rewards; '无' ("none") marks their absence
            if len(rewards) == 0:
                reward = '无'
            else:
                reward = rewards[0].strip()
            data = {
                'author': author,
                'time': time,
                'title': title,
                'content': content,
                'view': view,
                'comment': comment,
                'like': like,
                'reward': reward
            }
            jianshu_shouye.insert_one(data)
        except IndexError:
            # Skip entries that are missing any of the expected fields
            pass


if __name__ == '__main__':
    # Listing pages 1 through 1999
    urls = ['http://www.jianshu.com/c/bDHhpK?order_by=commented_at&page={}'.format(str(i))
            for i in range(1, 2000)]
    # Create the process pool
    pool = Pool(processes=4)
    # Run the scraper across the pool
    pool.map(get_jianshu_info, urls)
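
The URL list and the pool.map call at the end of the script can be tried out without hitting the network. The following is only a sketch under stated assumptions: build_urls, fake_fetch, and crawl are hypothetical helper names that do not appear in the original script, fake_fetch stands in for get_jianshu_info, and multiprocessing.dummy.Pool (a thread-backed pool with the same interface as multiprocessing.Pool) is swapped in so the snippet runs anywhere without the __main__ guard.

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool

# URL template taken from the script above
BASE_URL = 'http://www.jianshu.com/c/bDHhpK?order_by=commented_at&page={}'


def build_urls(first_page, last_page):
    """Return listing URLs for first_page..last_page, inclusive (hypothetical helper)."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


def fake_fetch(url):
    """Hypothetical stand-in for get_jianshu_info: returns the page number, no network."""
    return int(url.rsplit('=', 1)[1])


def crawl(urls, worker, processes=4):
    """Apply worker to every URL through a pool; map returns results in input order."""
    with Pool(processes) as pool:
        return pool.map(worker, urls)


if __name__ == '__main__':
    print(crawl(build_urls(1, 3), fake_fetch, processes=2))
```

Note that range(1, 2000) in the original script yields pages 1 through 1999, which corresponds to build_urls(1, 1999) here. To use real processes, as the original does, import Pool from multiprocessing instead and keep the call under the __main__ guard.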


Original article: http://www.cnblogs.com/silverbulletcy/p/8021925.html
