Understanding Topic Models

Date: 2019-06-24 00:14:29

Code first:

import nltk
from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer and stopword data, e.g.
# nltk.download("punkt") and nltk.download("stopwords").

# Read the corpus: one document per line
with open("corpus.txt", encoding="utf-8") as openfile:
    texts = openfile.readlines()
texts_tokenized = [[word.lower() for word in word_tokenize(text)] for text in texts]

print(texts_tokenized)


# Remove stopwords and punctuation
from nltk.corpus import stopwords
english_stopwords = stopwords.words("english")
texts_filtered_stopwords = [[word for word in text_tokenized if word not in english_stopwords] for text_tokenized in texts_tokenized]
english_punctuations = [",", ".", ":", ";", "?", "(", ")", "[", "]", "&", "!", "*", "@", "#", "$", "%"]

texts_filtered_punctuations = [[word for word in text_filtered_stopwords if word not in english_punctuations] for text_filtered_stopwords in texts_filtered_stopwords]
print(texts_filtered_punctuations)

# Stemming
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
texts_stemmed = [[st.stem(word) for word in text_filtered_punctuations] for text_filtered_punctuations in texts_filtered_punctuations]

# Logging
from gensim import corpora, models, similarities
import logging
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)


# 1. Build the bag-of-words dictionary (token -> integer id)
dictionary = corpora.Dictionary(texts_stemmed)
print(dictionary.token2id)

# 2. Vectorize the texts as sparse (token id, count) lists
corpus = [dictionary.doc2bow(text_stemmed) for text_stemmed in texts_stemmed]
print(corpus)

# 3. Train the LDA model
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]  # transformations are applied with [], not ()
for text in corpus_tfidf:
    print(text)

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
# Materialize as a list so the result can be sliced and indexed below
corpus_lda = list(lda[corpus_tfidf])

for text in corpus_lda[0:6]:
    print(text)

# 4. Compute similarities between the first document and every document
lda_index = similarities.MatrixSimilarity(corpus_lda, num_features=lda.num_topics)
sims = lda_index[corpus_lda[0]]
sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sort_sims)
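
Steps 1 and 2 of the listing turn each tokenized document into a sparse vector of (token id, count) pairs. As a rough illustration of that idea, here is a plain-Python sketch that does not use gensim. The helper names `build_token2id` and `doc_to_bow` and the toy documents are assumptions made up for this example, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_token2id(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Toy, already-tokenized documents (hypothetical data, not from corpus.txt)
docs = [["topic", "model", "topic"], ["model", "text"]]
token2id = build_token2id(docs)
bows = [doc_to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bows)
```

Only tokens that actually occur in a document appear in its vector, which is why the representation stays small even when the vocabulary is large.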

I will add an explanation later. The code comes from the book 机器学习基础 (Machine Learning Fundamentals) by 吕云翔 (Lü Yunxiang).
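
Step 4 of the listing ranks all documents by how close their topic vectors are to that of the first document; gensim's `MatrixSimilarity` uses cosine similarity for this. The sketch below shows the same ranking in plain Python under stated assumptions: the `cosine` helper and the three two-topic vectors are invented for illustration and are not output from the model above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical per-document topic distributions for num_topics = 2
topic_vectors = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]

# Compare document 0 against every document, then sort by descending similarity
query = topic_vectors[0]
sims = [cosine(query, vec) for vec in topic_vectors]
ranking = sorted(enumerate(sims), key=lambda item: -item[1])

print(ranking)
```

The query document ranks first against itself, followed by the document whose topic mix is closest to it.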

 

Original article: https://www.cnblogs.com/www-caiyin-com/p/11074956.html
