
Multiprocess Crawler

Date: 2018-12-24 10:29:16


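Introduction to multiprocessing

A process is a running program. With multiprocessing, running one script file starts several such processes that work at the same time.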
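As a minimal sketch of "one script, several processes", the snippet below starts three worker processes from a single script and prints each one's process ID. The names `describe` and `report` are made up for this illustration and are not part of the original post.

```python
import os
from multiprocessing import Process


def describe(label):
    # Build a line that identifies the process this code is running in.
    return f"{label}: pid={os.getpid()}, parent pid={os.getppid()}"


def report(label):
    print(describe(label))


if __name__ == "__main__":
    report("main script")
    # Start three worker processes from this one script.
    workers = [Process(target=report, args=(f"worker {i}",)) for i in range(3)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Each worker prints a different pid, and its parent pid matches the pid of the main script.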


Why learn multiprocessing

To make the crawler faster.


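Difference between multiprocessing and multithreading

Factory ==> workshop ==> worker

Read the analogy this way: the factory is the computer, each workshop is a process, and the workers inside a workshop are the threads of that process.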
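One practical consequence of the difference is that threads share the memory of their process, while a separate process works on its own copy. The sketch below illustrates this under that assumption. The names `counter`, `increment` and `change_seen_by_parent` are invented for the example.

```python
import threading
from multiprocessing import Process

counter = {"value": 0}


def increment():
    counter["value"] += 1


def change_seen_by_parent(worker_cls):
    # Run increment() in a thread or a process and report how much
    # the parent's own counter changed as a result.
    before = counter["value"]
    worker = worker_cls(target=increment)
    worker.start()
    worker.join()
    return counter["value"] - before


if __name__ == "__main__":
    # A thread updates the parent's dictionary directly.
    print("change after thread:", change_seen_by_parent(threading.Thread))  # 1
    # A child process increments its own copy, so the parent sees no change.
    print("change after process:", change_seen_by_parent(Process))  # 0
```

This is why results from worker processes have to be sent back explicitly, for example as the return value of `pool.map`.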


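How to use multiprocessing

from multiprocessing import Pool
pool = Pool(processes=4)    # create a pool of 4 worker processes
pool.map(func, iterable)    # call func on every item of iterable, spread across the workers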
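Here `func` and `iterable` are placeholders. A small runnable sketch, with a made-up `square` function standing in for `func` and a range of numbers standing in for `iterable`, might look like this:

```python
from multiprocessing import Pool


def square(n):
    return n * n


def squares_in_parallel(numbers, processes=4):
    # pool.map blocks until every item is processed and returns
    # the results in the same order as the input.
    with Pool(processes=processes) as pool:
        return pool.map(square, numbers)


if __name__ == "__main__":
    print(squares_in_parallel(range(10)))
    # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The function passed to `pool.map` has to be defined at module level so the worker processes can find it, and the `if __name__ == "__main__":` guard keeps the workers from re-running the pool setup on platforms that start them by importing the script (Windows, for example).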
 

Performance comparison

URL to crawl: https://www.qiushibaike.com/8hr/page/1/

import re
import time
from multiprocessing import Pool

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
}

def re_scraper(url):
    # Download one page and pull out its fields with regular expressions.
    res = requests.get(url, headers=headers)
    names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # As given, the laughs and comments patterns are identical,
    # so both lists contain the same matches.
    laughs = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            "name": name,
            "content": content,
            "laugh": laugh,
            "comment": comment
        }
        infos.append(info)
    return infos

if __name__ == "__main__":
    urls = ["https://www.qiushibaike.com/8hr/page/{}/".format(str(i)) for i in range(1, 36)]

    # Serial run: fetch the 35 pages one after another.
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print("Serial crawler time:", end_1 - start_1)

    # Same pages with a pool of 2 processes.
    start_2 = time.time()
    pool = Pool(processes=2)
    pool.map(re_scraper, urls)
    end_2 = time.time()
    print("2-process crawler time:", end_2 - start_2)
    pool.close()
    pool.join()

    # Same pages with a pool of 4 processes.
    start_3 = time.time()
    pool = Pool(processes=4)
    pool.map(re_scraper, urls)
    end_3 = time.time()
    print("4-process crawler time:", end_3 - start_3)
    pool.close()
    pool.join()

Run output:

[Running] python "f:\WWW\test_py\compare_test.py"
Serial crawler time: 14.95523715019226
2-process crawler time: 11.39123272895813
4-process crawler time: 4.0303635597229

[Done] exited with code=0 in 32.827 seconds
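To put the three timings side by side, the speedup over the serial run can be computed as serial time divided by parallel time. The sketch below does this arithmetic on the numbers reported above. The `timings` dictionary and the `speedup` helper are introduced here only for illustration.

```python
# Timings in seconds, copied from the run output above.
timings = {
    "serial": 14.95523715019226,
    "2 processes": 11.39123272895813,
    "4 processes": 4.0303635597229,
}


def speedup(baseline, elapsed):
    # How many times faster a run is than the baseline.
    return baseline / elapsed


for label, elapsed in timings.items():
    ratio = speedup(timings["serial"], elapsed)
    print(f"{label}: {elapsed:.2f} s, speedup {ratio:.2f}x")
# serial: 14.96 s, speedup 1.00x
# 2 processes: 11.39 s, speedup 1.31x
# 4 processes: 4.03 s, speedup 3.71x
```

These figures come from a single run against a live website, so network conditions affect them. Repeating the measurement several times would give a more reliable comparison.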

 


Original article: https://www.cnblogs.com/xuxaut-558/p/10166642.html
