Chinese Wikipedia Data Processing - 1. Download and Cleaning

Posted: 2017-11-25 13:07:02

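1. Data download

Some important links:

Latest dump
The file needed is zhwiki-latest-pages-articles.xml.bz2
Page statistics for Chinese Wikipedia
At the time of writing there were about 978K content pages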
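The download step can be scripted. The following is a minimal sketch, not the author's method: it assumes the standard Wikimedia dump layout under https://dumps.wikimedia.org (check this against the dump index before relying on it), and the helper names `build_dump_url` and `download` are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: the usual Wikimedia dump layout; verify against the dump index.
DUMP_ROOT = "https://dumps.wikimedia.org"


def build_dump_url(wiki="zhwiki", date="latest"):
    """Return the URL of the pages-articles dump for a wiki and dump date."""
    filename = f"{wiki}-{date}-pages-articles.xml.bz2"
    return f"{DUMP_ROOT}/{wiki}/{date}/{filename}"


def download(url, dest, chunk_size=1 << 20):
    """Stream url to dest in chunks so the dump never sits fully in memory."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest


# Example call (not executed here, because the real file is large):
# download(build_dump_url(), "zhwiki-latest-pages-articles.xml.bz2")
```

Streaming with `shutil.copyfileobj` keeps memory use flat regardless of the dump size. For an interrupted transfer, a dedicated downloader with resume support is probably the better choice.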

2. Data processing

I chose Gensim, a topic-modeling toolkit, to preprocess the data.

2.1 XML to JSON

scripts.segment_wiki

python -m gensim.scripts.segment_wiki -f zhwiki-latest-pages-articles.xml.bz2 | gzip > zhwiki-latest.json.gz

This converts the dump into a JSON document that Python can read directly, with one article per line.
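To make the output format concrete, here is a small sketch that parses a single line and checks its shape. The expected keys are the ones used by the reading code in section 2.2. The sample record is made up for illustration, and `parse_article` is a hypothetical helper, not part of Gensim.

```python
import json

# Keys each record is expected to carry, taken from the reading code in 2.2.
EXPECTED_KEYS = ("title", "section_titles", "section_texts")


def parse_article(line):
    """Parse one line of the output and check its basic shape."""
    article = json.loads(line)
    missing = [key for key in EXPECTED_KEYS if key not in article]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if len(article["section_titles"]) != len(article["section_texts"]):
        raise ValueError("section titles and texts differ in length")
    return article


# Hypothetical record, invented for illustration only.
sample_line = json.dumps(
    {
        "title": "数学",
        "section_titles": ["概述", "词源"],
        "section_texts": ["示例正文一。", "示例正文二。"],
    },
    ensure_ascii=False,
)

article = parse_article(sample_line)
print(article["title"], len(article["section_titles"]))
```

Checking that the two section lists have equal length matters because the reading code pairs them with `zip`, which would otherwise silently drop the unmatched tail.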

2.2 Testing the data

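import json

# Recent smart_open releases expose open(); older ones used smart_open.smart_open.
from smart_open import open

x = 0  # number of articles printed so far

# Each line of the gzip file is one JSON-encoded article.
for line in open('zhwiki-latest.json.gz', encoding='utf-8'):
    article = json.loads(line)

    print("Article title: %s" % article['title'])
    for section_title, section_text in zip(article['section_titles'], article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)

    x += 1
    if x == 5:  # stop after the first 5 articles
        break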

Running the code above prints the first 5 articles of Chinese Wikipedia.
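The same check can be wrapped in reusable helpers. This is a sketch that swaps in the standard-library `gzip` module for smart_open, assuming the file is local. The function names `iter_articles`, `article_to_text`, and `head` are hypothetical.

```python
import gzip
import json
from itertools import islice


def iter_articles(path):
    """Yield one article dict per line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            yield json.loads(line)


def article_to_text(article):
    """Join the title, section titles and section bodies into one string."""
    parts = [article["title"]]
    for section_title, section_text in zip(
        article["section_titles"], article["section_texts"]
    ):
        parts.append(section_title)
        parts.append(section_text)
    return "\n".join(parts)


def head(path, n=5):
    """Return the first n articles without reading the rest of the file."""
    return list(islice(iter_articles(path), n))


# Example call (needs the file produced in section 2.1):
# for article in head("zhwiki-latest.json.gz"):
#     print(article_to_text(article)[:200])
```

Because `iter_articles` is a generator, `islice` stops decompressing as soon as it has `n` articles, so sampling the start of the file stays cheap.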

2.3 Word segmentation / named entity recognition / relation extraction

Not written yet.

Original post: http://www.cnblogs.com/nlp-in-shell/p/7894719.html
