Two crucial tasks in natural language understanding concern named entities, which are the persons, organizations, locations, etc. that are explicitly mentioned in text using proper nouns: Named Entity Recognition (NER), which is finding mentions of entities in the text; and Named Entity Disambiguation (NED), also known as Entity Linking, which is the task of disambiguating those mentions by linking them to the actual entities in a Knowledge Graph (KG), whenever possible.

A NERD tool performs both tasks in a pipeline. Good NERD tools are essential to Natural Language Processing in general, and to Information Extraction from text in particular. This post describes a novel and effective algorithm for the NED task, developed by PhD student Zhaochen Guo.

The following example motivates the discussion. The input to the NED problem is text where the mentions of the entities (shown underlined in the figure) have been identified. The problem is to find an assignment of nodes in the KG to the named entities.

[Figure: the disambiguation graph, connecting mentions in the text to candidate entities in the KG]

The ambiguity of natural language has always made NED challenging, even for humans: most real-world entities can be referred to in many different ways (e.g., people have nicknames), while the same textual mention may refer to multiple real-world entities (e.g., different people share the same name). In the graph above, the candidates are nodes in the KG whose known names are sufficiently similar to the mentions in the text. Some ambiguous mentions have hundreds of plausible candidates, especially when abbreviations and acronyms are used.
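As a rough illustration of candidate generation (the trigram similarity, the alias table, and the threshold below are hypothetical choices, not the method from the papers), candidates can be found with a string-similarity lookup over the known names of KG entities:

```python
def jaccard(a, b):
    """Jaccard similarity between the character-trigram sets of two strings."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)} or {s}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)

def candidates(mention, kg_names, threshold=0.3):
    """Return KG entities whose known names are sufficiently similar to the mention."""
    return sorted({entity for name, entity in kg_names
                   if jaccard(mention, name) >= threshold})

# Hypothetical alias table: (known name, KG entity) pairs.
kg_names = [
    ("Michael Jordan", "Michael_Jordan_(basketball)"),
    ("Michael I. Jordan", "Michael_I._Jordan_(scientist)"),
    ("Air Jordan", "Michael_Jordan_(basketball)"),
]
# Both entities pass the threshold: the mention "Michael Jordan" is ambiguous.
print(candidates("Michael Jordan", kg_names))
```

Real systems also exploit alias dictionaries, acronym expansion, and anchor-text statistics; the point here is only that lexical matching alone leaves the mention ambiguous.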

Good solutions to the NED problem find assignments that balance two disparate sources of similarity: lexical similarity, defined as the pairwise (string) similarity between mentions in the text and names of entities in the KG; and semantic similarity, defined through some graph-theoretic property of a subgraph of the KG induced by the choice of entities for each mention. In a series of papers (see the references below), we introduce a novel notion of semantic similarity for the NED task, rooted in Information Theory and defined as the mutual information between random walks on the disambiguation graph induced by the choice of entities for each mention.
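A minimal sketch of that balancing act (the weight `lam`, the toy similarity functions, and the brute-force search below are illustrative, not the optimization used in the papers):

```python
from itertools import product

def best_assignment(mentions, cands, lexical, semantic, lam=0.5):
    """Pick one candidate entity per mention, maximizing a weighted sum of
    lexical similarity (mention vs. entity) and pairwise semantic similarity
    among the chosen entities."""
    best, best_score = None, float("-inf")
    for choice in product(*(cands[m] for m in mentions)):
        lex = sum(lexical(m, e) for m, e in zip(mentions, choice))
        sem = sum(semantic(a, b)
                  for i, a in enumerate(choice) for b in choice[i + 1:])
        score = lam * lex + (1 - lam) * sem
        if score > best_score:
            best, best_score = dict(zip(mentions, choice)), score
    return best

# Toy demo: semantic coherence breaks the tie that lexical similarity cannot.
related = {frozenset({"Michael_Jordan", "Chicago_Bulls"}): 1.0}
sem = lambda a, b: related.get(frozenset({a, b}), 0.0)
lex = lambda m, e: 1.0  # neutral lexical score for the demo
choice = best_assignment(
    ["Jordan", "Chicago"],
    {"Jordan": ["Michael_Jordan", "Jordan_(country)"],
     "Chicago": ["Chicago_Bulls", "Chicago_(city)"]},
    lex, sem)
```

Brute force is exponential in the number of mentions, which is why practical systems (including ours) resort to greedy or iterative strategies instead.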

In a CIKM 2014 paper, we described an approach for computing semantic similarity based on random walks. The idea is to find all candidates for all mentions in the input document and extract a subgraph of the KG containing all candidates and their neighbors. We then perform random walks with restarts from each candidate, inducing a probability distribution of reaching the other nodes in the graph. This distribution can be interpreted as similarity. Thus, to solve NED, we seek the subset of candidates whose pairwise similarity is maximized.
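A random walk with restarts can be sketched with simple power iteration (the graph, restart probability, and iteration count below are illustrative; the papers use a mutual-information measure between walk distributions, for which the dot product here is a simplified stand-in):

```python
def rwr(adj, start, restart=0.15, iters=100):
    """Random walk with restarts from `start`: at each step, jump back to
    `start` with probability `restart`, otherwise follow a random outgoing
    edge. Returns the probability distribution over nodes."""
    p = {n: 0.0 for n in adj}
    p[start] = 1.0
    for _ in range(iters):
        new = {n: 0.0 for n in adj}
        new[start] += restart
        for n, mass in p.items():
            out = adj[n] or [start]  # dangling nodes restart
            for m in out:
                new[m] += (1 - restart) * mass / len(out)
        p = new
    return p

def walk_similarity(p, q):
    """One simple similarity between two walk distributions: their dot product."""
    return sum(p[n] * q[n] for n in p)
```

Running `rwr` once per candidate gives each candidate a "signature" distribution over the subgraph; candidates whose signatures agree are semantically coherent choices.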

That idea was further extended in an article in the Semantic Web Journal to use state-of-the-art learning-to-rank methods, which allow tuning the parameters in the standard way. The article also shows that the existing benchmarks for the NED task are not as challenging or methodical as one would hope, which is to be expected of the first benchmarks in any discipline. In fact, we show that a rudimentary baseline based on prior probabilities is quite competitive with many systems on these benchmarks (see the line about PRIOR in the tables below):

[Table: accuracy of NED systems, including the PRIOR baseline, on standard benchmarks]
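The PRIOR baseline mentioned above simply picks the most common meaning of each mention; a sketch, with hypothetical counts standing in for anchor-text statistics of the kind typically mined from Wikipedia:

```python
from collections import Counter

# Hypothetical counts of how often each mention links to each entity
# (e.g., mined from Wikipedia anchor texts).
link_counts = {
    "Jordan": Counter({"Michael_Jordan": 6200,
                       "Jordan_(country)": 4100,
                       "Michael_I._Jordan": 300}),
}

def prior(mention):
    """P(entity | mention), estimated from the counts."""
    counts = link_counts[mention]
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def prior_baseline(mention):
    """PRIOR baseline: always choose the most frequent entity for the mention."""
    return link_counts[mention].most_common(1)[0][0]
```

Because popular entities dominate real text, this context-free rule is surprisingly hard to beat on easy benchmarks, which is exactly the point made above.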

The line WNED corresponds to the manually tuned NED system based on random walks, while the lines labeled L2R.WNED show the results of the method with learned weights, differing in the dataset used for learning. With L2R.WNED-CoNLL, the parameters were learned on the AIDA-CoNLL dataset annotated by the researchers at the Max Planck Institute for Informatics, authors of the AIDA system. The other line corresponds to learning the weights on the respective evaluation dataset, following standard protocols. One observation is that learning the weights on the AIDA-CoNLL dataset yields good performance on the other benchmarks, which is unsurprising as these are all news articles.

The following table shows a similar evaluation, now using two larger benchmarks developed for this work, from Wikipedia articles and ClueWeb documents. Details about how these benchmarks were created can be found in the Semantic Web Journal article.

[Table: accuracy on the new Wikipedia and ClueWeb benchmarks]

Below are references to the articles and links to the code and the data.

References

  • Z. Guo and D. Barbosa. Robust named entity disambiguation with random walks. Semantic Web, 9(4):459–479, 2018. doi:10.3233/SW-170273. [Bibtex]  [PDF]
  • Z. Guo and D. Barbosa. Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014, 499–508. ACM, 2014. doi:10.1145/2661829.2661887. [Bibtex]
  • Z. Guo and D. Barbosa. Entity linking with a unified semantic representation. In 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7-11, 2014, Companion Volume, 1305–1310. ACM, 2014. URL: http://dl.acm.org/citation.cfm?id=2579705. [Bibtex]

Resources