
Bigrams and trigrams in NLTK



When processing English text, we often need to extract multiword phrases for topic extraction, especially from domain-specific material such as financial annual reports. The core of the task is extracting two-word and three-word collocations, and how to do that with NLTK is exactly what this article covers.

1. nltk.bigrams(tokens) and nltk.trigrams(tokens)

If all you need is to enumerate the bigrams or trigrams of a token list, you can call NLTK's bigrams() or trigrams() directly, as shown below:

>>> import nltk
>>> sent = "you are my sunshine, and all of things are so beautiful just for you."
>>> tokens = nltk.wordpunct_tokenize(sent)
>>> bigram = nltk.bigrams(tokens)
>>> bigram
<generator object bigrams at 0x025C1C10>
>>> list(bigram)
[('you', 'are'), ('are', 'my'), ('my', 'sunshine'), ('sunshine', ','), (',', 'and'), ('and', 'all'), ('all', 'of'), ('of', 'things'), ('things', 'are'), ('are', 'so'), ('so', 'beautiful'), ('beautiful', 'just'), ('just', 'for'), ('for', 'you'), ('you', '.')]
>>> trigram = nltk.trigrams(tokens)
>>> list(trigram)
[('you', 'are', 'my'), ('are', 'my', 'sunshine'), ('my', 'sunshine', ','), ('sunshine', ',', 'and'), (',', 'and', 'all'), ('and', 'all', 'of'), ('all', 'of', 'things'), ('of', 'things', 'are'), ('things', 'are', 'so'), ('are', 'so', 'beautiful'), ('so', 'beautiful', 'just'), ('beautiful', 'just', 'for'), ('just', 'for', 'you'), ('for', 'you', '.')]

2. nltk.ngrams(tokens, n)

If you need 4-grams or longer n-grams, use the general-purpose function ngrams(tokens, n), where n is the n-gram length. Its uniform interface covers every case, as the following code shows:

>>> nltk.ngrams(tokens, 2)
<generator object ngrams at 0x027AAF30>
>>> list(nltk.ngrams(tokens, 2))
[('you', 'are'), ('are', 'my'), ('my', 'sunshine'), ('sunshine', ','), (',', 'and'), ('and', 'all'), ('all', 'of'), ('of', 'things'), ('things', 'are'), ('are', 'so'), ('so', 'beautiful'), ('beautiful', 'just'), ('just', 'for'), ('for', 'you'), ('you', '.')]
>>> list(nltk.ngrams(tokens, 3))
[('you', 'are', 'my'), ('are', 'my', 'sunshine'), ('my', 'sunshine', ','), ('sunshine', ',', 'and'), (',', 'and', 'all'), ('and', 'all', 'of'), ('all', 'of', 'things'), ('of', 'things', 'are'), ('things', 'are', 'so'), ('are', 'so', 'beautiful'), ('so', 'beautiful', 'just'), ('beautiful', 'just', 'for'), ('just', 'for', 'you'), ('for', 'you', '.')]
>>> list(nltk.ngrams(tokens, 4))
[('you', 'are', 'my', 'sunshine'), ('are', 'my', 'sunshine', ','), ('my', 'sunshine', ',', 'and'), ('sunshine', ',', 'and', 'all'), (',', 'and', 'all', 'of'), ('and', 'all', 'of', 'things'), ('all', 'of', 'things', 'are'), ('of', 'things', 'are', 'so'), ('things', 'are', 'so', 'beautiful'), ('are', 'so', 'beautiful', 'just'), ('so', 'beautiful', 'just', 'for'), ('beautiful', 'just', 'for', 'you'), ('just', 'for', 'you', '.')]

3. Classes under nltk.collocations

nltk.collocations provides three finder classes: BigramCollocationFinder, TrigramCollocationFinder, and QuadgramCollocationFinder.

1) BigramCollocationFinder

This class finds bigrams and ranks them. Rather than instantiating it directly, you normally build a finder with the class method from_words(). A finder's main methods are the following (nbest and the two filters are demonstrated in the example below; the remaining methods are covered in the sketch after it):

above_score(self, score_fn, min_score): returns the n-grams whose score exceeds min_score, sorted from highest to lowest score. For this class the n-grams are of course bigrams; the score can be defined in several ways, detailed in section 5 below.

apply_freq_filter(self, min_freq): removes n-grams that occur fewer than min_freq times.

apply_ngram_filter(self, fn): removes n-grams for which the condition fn holds. fn is evaluated on the n-gram as a whole; if it returns True, the n-gram is filtered out.

apply_word_filter(self, fn): removes n-grams for which the condition fn holds for any word. fn is evaluated on each word of the n-gram individually; if any single word satisfies it, the whole n-gram is filtered out.

nbest(self, score_fn, n): returns the n highest-scoring n-grams.

score_ngrams(self, score_fn): returns a sequence of (ngram, score) pairs, sorted from highest to lowest score.

>>> finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder.nbest(bigram_measures.pmi, 10)
[(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my')]
>>> finder.nbest(bigram_measures.pmi, 100)
[(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', '.'), ('you', 'are')]
>>> finder.apply_ngram_filter(lambda w1, w2: w1 in [',', '.'] and w2 in [',', '.'])
>>> finder.nbest(bigram_measures.pmi, 100)
[(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', '.'), ('you', 'are')]
>>> finder.apply_word_filter(lambda x: x in [',', '.'])
>>> finder.nbest(bigram_measures.pmi, 100)
[('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', 'are')]
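The remaining methods work the same way. Here is a minimal sketch of apply_freq_filter, above_score, and score_ngrams on a longer text; it assumes NLTK's Gutenberg corpus has been downloaded, and the corpus file and the PMI threshold of 8 are illustrative choices, not part of the original example:

import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Assumes nltk.download('gutenberg') has been run; any longer word list works.
words = nltk.corpus.gutenberg.words('austen-emma.txt')
finder = BigramCollocationFinder.from_words(words)
measures = BigramAssocMeasures()

# Drop bigrams occurring fewer than 5 times.
finder.apply_freq_filter(5)

# Bigrams whose PMI exceeds 8, highest first (above_score returns a generator).
print(list(finder.above_score(measures.pmi, 8))[:5])

# (bigram, score) pairs sorted from highest to lowest score.
print(finder.score_ngrams(measures.pmi)[:5])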

2) TrigramCollocationFinder and QuadgramCollocationFinder

These work exactly like BigramCollocationFinder, except that the former builds a trigram finder and the latter a quadgram finder; their methods are the same as above. A short example follows.
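A minimal sketch, reusing the sentence from section 1 (TrigramAssocMeasures is the trigram counterpart of BigramAssocMeasures, and QuadgramCollocationFinder pairs with QuadgramAssocMeasures in the same way):

import nltk

sent = "you are my sunshine, and all of things are so beautiful just for you."
tokens = nltk.wordpunct_tokenize(sent)

# Build a trigram finder and rank its trigrams by PMI.
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)
print(finder.nbest(trigram_measures.pmi, 5))  # the 5 highest-scoring trigrams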

4. Counting n-gram frequencies

>>> sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10]
[(('all', 'of'), 1), (('and', 'all'), 1), (('are', 'my'), 1), (('are', 'so'), 1), (('beautiful', 'just'), 1), (('for', 'you'), 1), (('just', 'for'), 1), (('my', 'sunshine'), 1), (('of', 'things'), 1), (('so', 'beautiful'), 1)]

### The key argument defines the sort order: first by t[1] (the frequency), where the minus sign makes it descending; then by the n-gram itself (t[0]), ascending (a-z) by default.
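Since ngram_fd is an ordinary NLTK FreqDist (a Counter subclass), most_common() gives much the same result more directly; note it breaks frequency ties by insertion order rather than alphabetically:

# ngram_fd is a FreqDist, so most_common(n) returns the n most frequent
# (ngram, count) pairs in descending-count order.
print(finder.ngram_fd.most_common(10))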

5. Scoring functions

The NgramAssocMeasures classes under nltk.collocations offer several scoring functions:

chi_sq(cls, n_ii, n_ix_xi_tuple, n_xx): scores each n-gram with Pearson's chi-square test.

pmi(cls, *marginals): scores each n-gram by pointwise mutual information (PMI).

likelihood_ratio(cls, *marginals): scores each n-gram by its likelihood ratio.

student_t(cls, *marginals): scores each n-gram using Student's t test, with an independence hypothesis for unigrams.

These are the most commonly used scores; many others are available as well, such as poisson_stirling, jaccard, fisher, and phi_sq.

>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> bigram_measures.student_t(8, (15828, 4675), 14307668)
0.9999319894802036
>>> bigram_measures.student_t(8, (42, 20), 14307668)
2.828406367705413
>>> bigram_measures.chi_sq(8, (15828, 4675), 14307668)
1.5488692067282201
>>> bigram_measures.chi_sq(59, (67, 65), 571007)
456399.76190356724
>>> bigram_measures.likelihood_ratio(110, (2552, 221), 31777)
270.721876936225
>>> bigram_measures.pmi(110, (2552, 221), 31777)
2.6317398492166078
>>> bigram_measures.pmi
<bound method type.pmi of <class 'nltk.metrics.association.BigramAssocMeasures'>>
>>> bigram_measures.likelihood_ratio
<bound method type.likelihood_ratio of <class 'nltk.metrics.association.BigramAssocMeasures'>>
>>> bigram_measures.chi_sq
<bound method type.chi_sq of <class 'nltk.metrics.association.BigramAssocMeasures'>>
>>> bigram_measures.student_t
<bound method type.student_t of <class 'nltk.metrics.association.BigramAssocMeasures'>>

6. Ranking and correlation

It is useful to consider the results of finding collocations as a ranking, and the rankings output using different association measures can be compared using the Spearman correlation coefficient.

Ranks can be assigned to a sorted list of results trivially by assigning strictly increasing ranks to each result:

>>> from nltk.metrics.spearman import *
>>> results_list = ['item1', 'item2', 'item3', 'item4', 'item5']
>>> print(list(ranks_from_sequence(results_list)))
[('item1', 0), ('item2', 1), ('item3', 2), ('item4', 3), ('item5', 4)]

If scores are available for each result, we may allow sufficiently similar results (differing by no more than rank_gap) to be assigned the same rank:

>>> results_scored = [('item1', 50.0), ('item2', 40.0), ('item3', 38.0),
...                   ('item4', 35.0), ('item5', 14.0)]
>>> print(list(ranks_from_scores(results_scored, rank_gap=5)))
[('item1', 0), ('item2', 1), ('item3', 1), ('item4', 1), ('item5', 4)]

The Spearman correlation coefficient gives a number from -1.0 to 1.0 comparing two rankings. A coefficient of 1.0 indicates identical rankings; -1.0 indicates exact opposite rankings.

>>> print('%0.1f' % spearman_correlation(
...         ranks_from_sequence(results_list),
...         ranks_from_sequence(results_list)))
1.0
>>> print('%0.1f' % spearman_correlation(
...         ranks_from_sequence(reversed(results_list)),
...         ranks_from_sequence(results_list)))
-1.0
>>> results_list2 = ['item2', 'item3', 'item1', 'item5', 'item4']
>>> print('%0.1f' % spearman_correlation(
...        ranks_from_sequence(results_list),
...        ranks_from_sequence(results_list2)))
0.6
>>> print('%0.1f' % spearman_correlation(
...        ranks_from_sequence(reversed(results_list)),
...        ranks_from_sequence(results_list2)))
-0.6
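These helpers combine naturally with the collocation finders above. As a sketch (assuming the finder and bigram_measures from section 3 are still in scope), one can measure how similarly two association measures rank the same bigrams:

from nltk.metrics.spearman import ranks_from_scores, spearman_correlation

# score_ngrams returns (ngram, score) pairs sorted by descending score,
# which is exactly the input format ranks_from_scores expects.
pmi_ranking = ranks_from_scores(finder.score_ngrams(bigram_measures.pmi))
lr_ranking = ranks_from_scores(finder.score_ngrams(bigram_measures.likelihood_ratio))
print(spearman_correlation(pmi_ranking, lr_ranking))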

 


Original source: http://www.cnblogs.com/no-tears-girl/p/7096519.html
