
Paper: "Learning to Link with Wikipedia"

Date: 2019-06-24 18:13:23


Learning to Link with Wikipedia

Goal: The paper explains how the topics mentioned in unstructured text can be automatically recognized and linked to the appropriate Wikipedia articles that explain them.

 

Overall approach (illustrated by example): the paper shows a somewhat dated news story about Iranian prisoners of war left in Iraq after the first Gulf War, which has been automatically augmented with links to pertinent topics such as the International Committee of the Red Cross and Baghdad. This process is known as wikification. The approach differs from previous attempts in that it uses Wikipedia not only as a source of information to point to, but also as training data for how best to create links. This gives large improvements in both recall and precision.

 

Two main steps: link disambiguation and link detection

 

I. Link disambiguation:

         Commonness and Relatedness

  1. Commonness: the commonness of a sense is defined by the number of times it is used as a destination in Wikipedia.

 

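  2. Relatedness: the most common sense is not always the right one. The algorithm identifies these cases by comparing each possible sense with its surrounding context. This is a cyclic problem because the context terms may also be ambiguous; the paper sidesteps it by using the unambiguous terms in the document as context. Relatedness between two articles is computed as: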

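         $$\mathrm{relatedness}(a,b) = \frac{\log(\max(|A|,|B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|,|B|))}$$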

 

         where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.
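The two measures above can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the function names and the toy counts are made up, and turning the raw count into a fraction for commonness is an assumption. Note that the link-based formula, evaluated as written, is 0 for articles with identical in-link sets and grows as they share fewer in-links, so the sketch also includes a `relatedness` helper that flips it with `1 - distance` (again an assumption about the convention, not something stated in these notes).

```python
import math

def commonness(sense_counts, sense):
    """Share of an anchor's links that point to `sense` (normalisation is an assumption)."""
    total = sum(sense_counts.values())
    return sense_counts.get(sense, 0) / total if total else 0.0

def link_distance(inlinks_a, inlinks_b, total_articles):
    """Evaluate the formula as written; smaller values mean more closely related."""
    a, b = set(inlinks_a), set(inlinks_b)
    shared = len(a & b)
    if shared == 0:
        return math.inf  # no common in-links: log(0) is undefined
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator <= 0:
        return 0.0  # both articles are linked from every article
    return (math.log(larger) - math.log(shared)) / denominator

def relatedness(inlinks_a, inlinks_b, total_articles):
    """Assumed convention: 1 - distance, clipped to [0, 1]."""
    return max(0.0, 1.0 - link_distance(inlinks_a, inlinks_b, total_articles))

# Made-up counts of how often the anchor "tree" links to each sense.
tree_senses = {"Tree": 3, "Tree (data structure)": 1}
tree_commonness = commonness(tree_senses, "Tree")

# Made-up in-link sets (article ids) in a toy Wikipedia of 100 articles.
distance = link_distance({1, 2, 3, 4}, {3, 4, 5, 6}, 100)
```

With real data, `sense_counts` would come from counting anchor destinations across Wikipedia, and the in-link sets from its link graph.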

        

         Some context terms are better than others

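         Two things determine how useful a context term is. First, its link probability: a term that is rarely used as a link in Wikipedia (such as "the") is of little help, even if it is unambiguous. Second, how closely it relates to the central thread of the document, which can be determined by calculating its average semantic relatedness to all other context terms, using the measure described previously.

         These two variables, link probability and relatedness, are averaged to provide a weight for each context term. This weight is then used when calculating the weighted average relatedness of a candidate sense to the context articles.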
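A minimal sketch of this weighting scheme follows. The relatedness function is passed in as a parameter, and the lookup table, article names, and link probabilities are all invented for illustration; only the averaging logic follows the description above.

```python
def context_weights(context, link_prob, relatedness):
    """Weight of each context term = mean of its link probability and its
    average relatedness to all other context terms."""
    weights = {}
    for term in context:
        others = [t for t in context if t != term]
        avg_rel = (sum(relatedness(term, o) for o in others) / len(others)
                   if others else 0.0)
        weights[term] = (link_prob[term] + avg_rel) / 2
    return weights

def weighted_relatedness(candidate, weights, relatedness):
    """Weighted average relatedness of a candidate sense to the context articles."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * relatedness(candidate, t) for t, w in weights.items()) / total

# Hypothetical pairwise relatedness values; unknown pairs default to 0.
TOY_REL = {
    frozenset(("Algorithm", "Binary search")): 0.8,
    frozenset(("Algorithm", "Forest")): 0.1,
    frozenset(("Binary search", "Forest")): 0.1,
    frozenset(("Tree (data structure)", "Algorithm")): 0.7,
    frozenset(("Tree (data structure)", "Binary search")): 0.9,
    frozenset(("Tree (plant)", "Forest")): 0.9,
}

def toy_relatedness(a, b):
    return 1.0 if a == b else TOY_REL.get(frozenset((a, b)), 0.0)

context = ["Algorithm", "Binary search", "Forest"]
link_prob = {"Algorithm": 0.4, "Binary search": 0.6, "Forest": 0.2}

weights = context_weights(context, link_prob, toy_relatedness)
score_ds = weighted_relatedness("Tree (data structure)", weights, toy_relatedness)
score_plant = weighted_relatedness("Tree (plant)", weights, toy_relatedness)
```

In this toy document the outlier "Forest" receives a low weight, so the data-structure sense ends up more related to the context than the plant sense.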

        

         Combining the features

         To balance commonness and relatedness, we take into account how good the context is. If it is plentiful and homogeneous, then relatedness becomes very telling. In Figure 2 of the paper, for example, the most common sense of tree is entirely irrelevant because the document is clearly about computer science. However, if tree is found in a general document with ambiguous or confused context, then the most common sense should be chosen; by definition, this will be correct in most cases. Thus the final feature, context quality, is given by the sum of the weights that were previously assigned to each context term. This takes into account the number of terms involved, the extent to which they relate to each other, and how often they are used as Wikipedia links.

         The three features (commonness, relatedness, and context quality) are used to train a classifier. It considers each sense independently, and produces a probability that it is valid.
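The sketch below shows how the three features might be assembled for each candidate sense and handed to a classifier. The classifier here is a hand-written stand-in (`toy_probability`, entirely hypothetical) that simply trusts relatedness more as context quality grows; in the paper the classifier is learned from Wikipedia's existing links. Picking the highest-probability sense is just one way to use the output.

```python
def context_quality(weights):
    """Sum of the weights previously assigned to each context term."""
    return sum(weights.values())

def sense_features(commonness, relatedness_to_context, weights):
    """Feature vector for one candidate sense."""
    return [commonness, relatedness_to_context, context_quality(weights)]

def score_senses(candidates, weights, classifier_probability):
    """Score every sense independently; return the best sense and all scores.

    candidates: dict mapping sense -> (commonness, relatedness_to_context)
    classifier_probability: callable taking a feature vector, returning a float
    """
    scores = {
        sense: classifier_probability(sense_features(c, r, weights))
        for sense, (c, r) in candidates.items()
    }
    return max(scores, key=scores.get), scores

def toy_probability(features):
    """Hypothetical stand-in for a trained classifier (NOT the paper's model)."""
    commonness, relatedness, quality = features
    trust = min(1.0, quality)  # more and better context -> trust relatedness more
    return trust * relatedness + (1.0 - trust) * commonness

# Made-up feature values for the two senses of "tree".
candidates = {"Tree (plant)": (0.93, 0.15), "Tree (data structure)": (0.03, 0.71)}
good_context = {"Algorithm": 0.425, "Binary search": 0.525, "Forest": 0.15}

best_with_context, _ = score_senses(candidates, good_context, toy_probability)
best_without_context, _ = score_senses(candidates, {}, toy_probability)
```

With rich context the data-structure sense wins; with no usable context the stand-in falls back to the most common sense, which mirrors the behaviour the notes describe.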

 

 

 

II. Link detection:

         The paper obtains much better results by treating link probability as just one feature among many, rather than relying on it alone.

         Features of the candidate topics (articles), and of the places where they are mentioned, are used to inform the classifier about which topics should and should not be linked:

  Link Probability:  

    Mihalcea and Csomai's link probability: the link probability of a phrase is defined as the number of Wikipedia articles that use it as an anchor, divided by the number of articles that mention it at all.

    Several phrases in a document may refer to the same topic, so each topic can have multiple link probabilities. These are combined into two separate features: the average and the maximum.
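As a small illustration of this feature, the sketch below computes link probability from two counts and then reduces several values to their average and maximum. The function names and the counts are made up for the example.

```python
def link_probability(articles_using_as_anchor, articles_mentioning):
    """Articles that use the phrase as an anchor / articles that mention it at all."""
    if articles_mentioning == 0:
        return 0.0
    return articles_using_as_anchor / articles_mentioning

def average_and_maximum(values):
    """Combine several per-phrase values for one topic into two features."""
    values = list(values)
    if not values:
        return 0.0, 0.0
    return sum(values) / len(values), max(values)

# Hypothetical counts for two phrases that both refer to the same topic.
probabilities = [
    link_probability(25, 100),  # phrase used as a link in 25 of 100 articles
    link_probability(75, 100),  # phrase used as a link in 75 of 100 articles
]
avg_lp, max_lp = average_and_maximum(probabilities)
```

The same average/maximum reduction applies to any per-mention value that has to be summarised per topic.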

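  Relatedness:

    One would expect that topics which relate to the central thread of the document are more likely to be linked. The first feature is the relatedness of each topic to the context, which was already computed during disambiguation.

    A second feature is the average relatedness between each topic and all of the other candidates.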

        Disambiguation Confidence:

    The disambiguation classifier described earlier does not just produce a yes/no judgment as to whether a topic is a valid sense of a term; it also gives a probability or confidence in this answer. We use this as a feature to give those topics that we are most sure of a greater chance of being linked.

    There can be multiple confidence values for each topic, because several different terms may be disambiguated to the same topic. These are combined into average and maximum values.

         Generality:

    We define the generality of a topic as the minimum depth at which it is located in Wikipedia’s category tree.
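One way to compute a minimum depth is a breadth-first search from the root category, since a category can be reached along several paths. The sketch below assumes the category structure is given as a parent-to-children mapping and that a topic's generality is the smallest depth among the categories it belongs to; the data layout, the names, and the toy categories are all assumptions for illustration.

```python
from collections import deque

def category_depths(children, root):
    """Breadth-first search: minimum depth of every category reachable from root."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in depths:  # first visit in BFS is the shallowest
                depths[child] = depths[node] + 1
                queue.append(child)
    return depths

def generality(topic_categories, depths):
    """Smallest depth among a topic's categories; None if none are reachable."""
    found = [depths[c] for c in topic_categories if c in depths]
    return min(found) if found else None

# Hypothetical category structure; "Computing" is reachable along two paths.
children = {
    "Root": ["Science", "Technology"],
    "Science": ["Computing", "Biology"],
    "Technology": ["Computing"],
    "Computing": ["Data structures"],
    "Biology": ["Plants"],
}
depths = category_depths(children, "Root")
```

Smaller values indicate more general topics, which lets the classifier learn whether readers are better served by links to general or to specific topics.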

         Location and Spread:

                   Three further features are the frequency of a topic and the positions of its first and last occurrences in the document.

         The distance between the first and last occurrences, or spread, is used to indicate how consistently the document discusses the topic.
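These features are straightforward to compute once the positions of a topic's mentions are known. In the sketch below, positions are word offsets and are divided by the document length so that documents of different sizes are comparable; that normalisation, like the function and key names, is an assumption made for the example.

```python
def location_features(mention_positions, doc_length):
    """Frequency, first/last occurrence, and spread for one topic.

    mention_positions: word offsets at which the topic is mentioned
    doc_length: document length in words (used for normalisation, an assumption)
    """
    if not mention_positions or doc_length <= 0:
        return None
    first = min(mention_positions) / doc_length
    last = max(mention_positions) / doc_length
    return {
        "frequency": len(mention_positions),
        "first_occurrence": first,
        "last_occurrence": last,
        "spread": last - first,  # distance between first and last occurrences
    }

# A topic mentioned three times in a hypothetical 100-word document.
features = location_features([10, 50, 90], 100)
```

A topic mentioned once has a spread of 0, while one discussed from the introduction through to the conclusion has a spread close to 1.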

 


Original article: https://www.cnblogs.com/dhName/p/11078596.html
