
Python web crawler: scraping the Baidu Hot Search rankings

Posted: 2020-03-18 15:38:38


from bs4 import BeautifulSoup
from selenium import webdriver
import time
import xlwt
 
# Open the page
url = "http://top.baidu.com/buzz?b=1&fr=topindex"
driver = webdriver.Chrome()
driver.get(url)
# time.sleep(5)  # optionally wait for the page to finish rendering
 
# Grab the rendered page source
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
 
# Collect every <tr> tag with BeautifulSoup
rows = soup.find_all("tr")
result = []
 
# Extract the contents of every <tr> that matches the expected layout
for each in rows:
    rank = each.find("span")
    key = each.find("a", {"class": "list-title"})
    point = each.find("td", {"class": "last"})
    if point is not None:
        point = point.find("span")
    if rank is not None and key is not None and point is not None:
        result.append([rank.string, key.string, point.string])
 
# Create the xls workbook and write a header row
workbook = xlwt.Workbook(encoding="utf-8")
worksheet = workbook.add_sheet("Baidu Rank Data")
worksheet.write(0, 0, label="rank")
worksheet.write(0, 1, label="key")
worksheet.write(0, 2, label="point")
 
# Widen the keyword column
col = worksheet.col(1)
col.width = 5000
 
# Write the scraped rows, one per line below the header
i = 1
for each in result:
    rank = str(each[0])
    key = str(each[1])
    point = str(each[2])
    worksheet.write(i, 0, rank)
    worksheet.write(i, 1, key)
    worksheet.write(i, 2, point)
    i += 1
 
# Save the workbook and release the browser
workbook.save(r"C:\Users\me\Desktop\Data.xls")
driver.quit()
 
print(result)
# print(len(result))
# print(len(rows))
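The tr-parsing loop can be sanity-checked without launching a browser by feeding BeautifulSoup a hand-written HTML fragment. The fragment below is invented for illustration; it only mimics the shape the selectors expect (a rank `span`, an `a.list-title` link, and a score inside `td.last`):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one row of the ranking table
html = """
<table>
  <tr>
    <td><span>1</span></td>
    <td><a class="list-title">example keyword</a></td>
    <td class="last"><span>4567</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed
result = []
for each in soup.find_all("tr"):
    rank = each.find("span")
    key = each.find("a", {"class": "list-title"})
    point = each.find("td", {"class": "last"})
    if point is not None:
        point = point.find("span")
    if rank is not None and key is not None and point is not None:
        result.append([rank.string, key.string, point.string])

print(result)  # [['1', 'example keyword', '4567']]
```

Rows missing any of the three pieces are skipped, which is why the real page's decorative `<tr>` elements never reach `result`.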

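xlwt only writes the legacy .xls format. If that dependency is unwanted, rows shaped like `result` could instead be saved with the standard-library csv module; the file name and sample values below are illustrative:

```python
import csv

# Rows shaped like the `result` list built by the scraper (made-up values)
result = [["1", "example keyword", "4567"],
          ["2", "another keyword", "1234"]]

# utf-8-sig adds a BOM so Excel opens the Chinese keywords correctly
with open("baidu_rank.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "key", "point"])  # header row
    writer.writerows(result)
```

`newline=""` is the csv-module convention that prevents blank lines on Windows.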
 


Original article: https://www.cnblogs.com/abcdefgh9/p/12517257.html
