Modern knowledge bases (KBs) are used by millions of people daily in applications such as search, question answering, and recommender systems. Moreover, open knowledge bases like DBpedia and Wikidata are the fabric that keeps the Linked Open Data ecosystem alive, enabling scholarly research across disciplines. Nevertheless, high-quality KBs are still built almost exclusively from human-curated structured or semi-structured data.

Precisely because of their importance, knowledge bases require constant updating to reflect changes in the real world. KB population (KBP) is the task of automatically augmenting a KB with new facts. Traditionally, KBP has been tackled by arranging individually trained components into a pipeline, typically: (1) entity discovery and linking and (2) relation extraction. Entity discovery and linking recognizes and disambiguates proper names in text that refer to entities (e.g., people, organizations, and locations) by linking them to a reference KB. Relation extraction detects facts involving two entities (or an entity and a literal, such as a number or date). Recently, end-to-end neural methods that overcome the limitations of pipeline systems have started to appear.
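The two-stage pipeline above can be sketched in a few lines. Everything here is a hypothetical illustration (the toy lexicon, the hard-coded rule), not KnowledgeNet code; only the Wikidata identifiers (Q42, Q350, P19) are real.

```python
# Toy two-stage KBP pipeline: (1) entity discovery and linking,
# (2) relation extraction over the linked mentions.
# TOY_KB and the single "was born in" rule are illustrative stand-ins.
TOY_KB = {"Douglas Adams": "Q42", "Cambridge": "Q350"}

def link_entities(sentence):
    """Stage 1: recognize mentions and disambiguate them to KB ids."""
    return [(m, qid) for m, qid in TOY_KB.items() if m in sentence]

def extract_relations(sentence, entities):
    """Stage 2: detect facts between pairs of linked entities."""
    facts = []
    if "was born in" in sentence and len(entities) == 2:
        (_, subj), (_, obj) = entities
        facts.append((subj, "P19", obj))  # P19 = "place of birth" in Wikidata
    return facts

def kbp_pipeline(sentence):
    return extract_relations(sentence, link_entities(sentence))

kbp_pipeline("Douglas Adams was born in Cambridge.")
# → [("Q42", "P19", "Q350")]
```

Errors made in stage 1 (a missed or mislinked mention) propagate to stage 2, which is exactly the limitation of pipeline systems that end-to-end neural methods aim to avoid.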


KnowledgeNet is a benchmark dataset for populating a KB (Wikidata) with facts expressed in natural language on the web. KnowledgeNet facts are of the form (subject; property; object), where subject and object are linked to Wikidata. KnowledgeNet’s main goal is to evaluate the KBP task end to end, instead of evaluating each subcomponent in isolation. The dataset supports this kind of evaluation by exhaustively annotating all facts in each sentence. As a benchmark, KnowledgeNet is agnostic to the architecture of the KBP system and can thus be used to evaluate both pipelines and end-to-end neural methods.

The Text Analysis Conference (TAC) is a series of evaluation workshops organized as several tracks by NIST (Getman et al., 2018). The Cold Start track, which provides an end-to-end evaluation of KBP systems while other tracks focus on subtasks (e.g., entity discovery and linking), is probably the most comprehensive KBP benchmark available. However, TAC’s evaluation protocol is too onerous for most to reproduce, as TAC works with very large corpora for which no ground truth exists. Instead, evaluation is done by pooling facts from all competing systems. While effective for running a contest, this methodology has been shown to be biased against systems that did not participate in TAC. Moreover, TAC manually evaluates each system’s “justification”, a span of text provided as evidence for a fact; a correct fact with an incorrect justification is considered invalid. Reproducing TAC’s evaluation for new systems is therefore challenging.

KnowledgeNet offers an automated and reproducible alternative to TAC’s evaluation by providing a much smaller but fully annotated corpus, along with a fixed held-out dataset (for which no ground truths are available).


We developed five baseline methods, starting from a recent TAC-winning system and incrementally adding modules based on prevailing ideas in the literature. The results are summarized below.

| System | Link F1 | Text F1 |
|---|---|---|
| Human | 0.822 | 0.878 |
| KnowledgeNet Baseline 5 (Baseline 4 + BERT) | 0.504 | 0.688 |
| KnowledgeNet Baseline 4 (Baseline 3 + noisy candidate facts) | 0.491 | 0.621 |
| KnowledgeNet Baseline 3 (Baseline 2 + KB information) | 0.362 | 0.545 |
| KnowledgeNet Baseline 2 (Baseline 1 + coreference resolution) | 0.342 | 0.554 |
| KnowledgeNet Baseline 1 (based on Stanford's TAC KBP winning system) | 0.281 | 0.518 |
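Both scores are F1 over predicted facts; the sketch below shows the basic computation, assuming a simplified set-based exact match on (subject; property; object) triples. This is a stand-in for the official KnowledgeNet scorer, which additionally distinguishes text-level (span) matches from link-level (Wikidata id) matches.

```python
def fact_f1(predicted, gold):
    """Micro F1 over exact-match fact triples.

    `predicted` and `gold` are sets of (subject, property, object)
    tuples. A simplified illustration, not the official scorer.
    """
    tp = len(predicted & gold)  # true positives: facts found in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

pred = {("Q42", "P19", "Q350"), ("Q42", "P69", "Q2")}
gold = {("Q42", "P19", "Q350")}
fact_f1(pred, gold)  # precision 0.5, recall 1.0 → F1 ≈ 0.667
```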

As one can see, even the best baseline leaves substantial room for improvement before it can match human annotators on KnowledgeNet. This is by design: we want a benchmark that can serve as a challenge for the community.


  • F. Mesquita, M. Cannaviccio, J. Schmidek, P. Mirza, and D. Barbosa. KnowledgeNet: a benchmark dataset for knowledge base population. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 749–758, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/D19-1069.
  • GitHub repository with code