
Python web crawler: scraping the Baidu Hot Search rankings

Posted: 2020-03-18 15:38:38


from bs4 import BeautifulSoup
from selenium import webdriver
import time
import xlwt
 
# Open the page
url = "http://top.baidu.com/buzz?b=1&fr=topindex"
driver = webdriver.Chrome()
driver.get(url)
# time.sleep(5)  # optionally wait for the page to finish rendering
 
# Grab the rendered page source
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
 
# Collect every <tr> tag with BeautifulSoup
rows = soup.find_all("tr")
result = []
 
# Extract the contents of every <tr> that matches the expected layout
for each in rows:
    rank = each.find("span")
    key = each.find("a", {"class": "list-title"})
    point = each.find("td", {"class": "last"})
    if point is not None:
        point = point.find("span")
    if rank is not None and key is not None and point is not None:
        result.append([rank.string, key.string, point.string])
 
# Create the xls workbook and write a header row
workbook = xlwt.Workbook(encoding="utf-8")
worksheet = workbook.add_sheet("Baidu Rank Data")
worksheet.write(0, 0, label="rank")
worksheet.write(0, 1, label="key")
worksheet.write(0, 2, label="point")
 
# Widen the keyword column
col = worksheet.col(1)
col.width = 5000
 
# Write the scraped rows, one per line below the header
i = 1
for each in result:
    rank = str(each[0])
    key = str(each[1])
    point = str(each[2])
    worksheet.write(i, 0, rank)
    worksheet.write(i, 1, key)
    worksheet.write(i, 2, point)
    i += 1
 
# Save the workbook and release the browser
workbook.save(r"C:\Users\me\Desktop\Data.xls")
driver.quit()
 
print(result)
# print(len(result))
# print(len(rows))
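The tr-parsing loop can be sanity-checked without launching a browser by feeding BeautifulSoup a hand-written HTML fragment. The fragment below is invented for illustration; it only mimics the shape the selectors expect (a rank `span`, an `a.list-title` link, and a score inside `td.last`):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one row of the ranking table
html = """
<table>
  <tr>
    <td><span>1</span></td>
    <td><a class="list-title">example keyword</a></td>
    <td class="last"><span>4567</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed
result = []
for each in soup.find_all("tr"):
    rank = each.find("span")
    key = each.find("a", {"class": "list-title"})
    point = each.find("td", {"class": "last"})
    if point is not None:
        point = point.find("span")
    if rank is not None and key is not None and point is not None:
        result.append([rank.string, key.string, point.string])

print(result)  # [['1', 'example keyword', '4567']]
```

Rows missing any of the three pieces are skipped, which is why the real page's decorative `<tr>` elements never reach `result`.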

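xlwt only writes the legacy .xls format. If that dependency is unwanted, rows shaped like `result` could instead be saved with the standard-library csv module; the file name and sample values below are illustrative:

```python
import csv

# Rows shaped like the `result` list built by the scraper (made-up values)
result = [["1", "example keyword", "4567"],
          ["2", "another keyword", "1234"]]

# utf-8-sig adds a BOM so Excel opens the Chinese keywords correctly
with open("baidu_rank.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "key", "point"])  # header row
    writer.writerows(result)
```

`newline=""` is the csv-module convention that prevents blank lines on Windows.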
 


Original article: https://www.cnblogs.com/abcdefgh9/p/12517257.html
