Hot-Word Classification and Analysis for the Information Field 01

Time: 2021-07-22 17:37:04

1. Project name: Classification, Analysis, and Explanation of Hot Words in the Informatization Field

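2. Functional design:

1) Data collection: periodically and automatically crawl hot words related to the information field from the web;

2) Data cleaning: clean the hot-word data and use automatic classification to generate a catalog of information-field hot words;

3) Hot-word explanation: automatically add a Chinese explanation to each hot word (with reference to Baidu Baike or Wikipedia);

4) Hot-word citations: mark recent articles or news items that cite the hot words and generate a catalog of hyperlinks that users can click to visit;

5) Data visualization:

① Show the hot words as a word cloud or hot-word chart;

② Use a relationship graph to indicate how closely the hot words are related.

6) Data report: export the full hot-word catalog and the term explanations as a Word report.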
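The word-cloud and relationship-graph requirements above both come down to counting: how often each word appears, and how often two words appear in the same title. Below is a minimal sketch of that counting step, not the project's actual implementation. The function name `hot_word_stats` and its parameters are hypothetical, and it assumes the titles have already been split into words by some segmenter (for Chinese text that would be a separate step).

```python
from collections import Counter
from itertools import combinations


def hot_word_stats(tokenized_titles, stop_words=frozenset()):
    """Count word frequencies and pairwise co-occurrences across titles.

    tokenized_titles: iterable of word lists, one list per title
                      (assumed to be segmented already).
    stop_words: lowercase words to ignore.
    Returns (word_counts, pair_counts), both collections.Counter objects.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for words in tokenized_titles:
        # Normalize, drop empty strings and stop words, and keep each
        # word once per title so repeats inside one title are not over-counted.
        unique = sorted({w.strip().lower() for w in words} - {""} - set(stop_words))
        word_counts.update(unique)
        # Every unordered pair of words in the same title is one co-occurrence.
        pair_counts.update(combinations(unique, 2))
    return word_counts, pair_counts


if __name__ == "__main__":
    sample = [
        ["AI", "chip", "release"],
        ["AI", "cloud"],
        ["cloud", "AI", "the"],
    ]
    words, pairs = hot_word_stats(sample, stop_words={"the"})
    print(words.most_common(2))
    print(pairs.most_common(1))
```

The word counts could feed a word cloud, and the pair counts could serve as edge weights in a relationship graph. Using raw co-occurrence as the measure of closeness is an assumption here; a normalized score may work better on real data.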

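I have recently been working on the hot-word classification analysis for the information field.

So far the data collection step is done: I crawled the latest news headlines from cnblogs (博客园), which will serve as the raw material for the hot-word analysis.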

import requests
from bs4 import BeautifulSoup
import xlwt


def getTitle(url):
    """Fetch one listing page and return its news-title <a> elements."""
    response = requests.get(url, headers=headers, timeout=10)  # send the HTTP request
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    content = response.content.decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    links = soup.select('div:nth-child(2) > h2:nth-child(1) > a:nth-child(1)')
    for link in links:  # iterate over what was found; a page may hold fewer than 18 items
        print(link.text)
    return links


url = "https://news.cnblogs.com/n/recommend?page={}"
# Request headers: identify as a desktop browser
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
f = xlwt.Workbook(encoding='utf-8')
sheet01 = f.add_sheet('sheet1', cell_overwrite_ok=True)
sheet01.write(0, 0, '博客最热新闻')  # row 0, column 0: header cell ("Hottest blog news")
temp = 0  # number of title rows written so far
for i in range(1, 100):
    newurl = url.format(i)
    title = getTitle(newurl)
    for j in range(len(title)):
        sheet01.write(temp + j + 1, 0, title[j].text)
    temp += len(title)
    print("Page " + str(i) + " done!")
print("All pages done!")
f.save('Hotword02.xls')
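The loop above requests 99 pages back to back and writes every title it finds. A possible refinement, sketched below under my own assumptions, is to pause between requests, stop when a page comes back empty, and drop duplicate titles before saving. The helper `crawl_pages` and its parameters are hypothetical. It takes the page-fetching function as an argument, so it can be tried without touching the network.

```python
import time


def crawl_pages(fetch_titles, first_page=1, last_page=99, delay_seconds=1.0,
                sleep=time.sleep):
    """Collect unique, whitespace-normalized titles from consecutive pages.

    fetch_titles: callable taking a page number and returning a list of
                  title strings for that page.
    sleep: injectable so the delay can be skipped or recorded in tests.
    """
    seen = set()
    titles = []
    for page in range(first_page, last_page + 1):
        page_titles = fetch_titles(page)
        if not page_titles:
            break  # assumption: an empty page means the listing has ended
        for raw in page_titles:
            title = " ".join(raw.split())  # collapse runs of spaces and newlines
            if title and title not in seen:
                seen.add(title)
                titles.append(title)
        if page < last_page:
            sleep(delay_seconds)  # pause between requests
    return titles
```

With the script above, it could be called as `crawl_pages(lambda page: [a.text for a in getTitle(url.format(page))])`, and the returned list written to the sheet in one pass.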

Tomorrow I will continue with the remaining requirements.

Original article: https://www.cnblogs.com/haobox/p/15041851.html
