Knowledge Graph Augmentation (KGA) is the task of adding facts to an incomplete knowledge graph, primarily by extracting them from textual sources on the Web, to improve the graph's effectiveness in applications such as search and question answering. State-of-the-art KGA methods rely on information extraction from running text, leaving behind rich sources of facts such as tables. We help close this gap with a neural method that works on Wikipedia articles and uses the contextual information surrounding a table to extract relations involving entities mentioned in the table and/or the main entity of the article. We trained and tested our method on a dataset much larger than those of previous work, which we have made public, and our experiments show that the method is very promising for the task.

HRERE architecture

Our method uses an LSTM (long short-term memory) network whose input consists of two entities and suitable encodings of the following contextual information: the headers and caption of the table, the title of the section in which the table appears, and the first paragraph of that section. We use BERT embeddings to encode the textual features. Our method works both for two entities appearing in the same row of a table (e.g., Frederic VII of Denmark and his spouse Princess Louise of Sweden) and for the article entity (e.g., Louise of Hesse-Kassel) paired with an entity in a cell of a table.
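The way the contextual inputs feed the network can be sketched as follows. This is a minimal illustration, not our actual implementation: the `encode` stub stands in for a real BERT encoder, and the dimensions are toy values rather than BERT's 768.

```python
import hashlib
import numpy as np

DIM = 8  # toy stand-in for the 768 dimensions of BERT embeddings

def encode(text: str) -> np.ndarray:
    """Stub standing in for a BERT sentence encoding (deterministic toy)."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(DIM)

def build_input(entity_a, entity_b, headers, caption, section_title, paragraph):
    """Stack one encoding per contextual field as the LSTM's input sequence."""
    steps = [
        encode(entity_a),
        encode(entity_b),
        encode(" ".join(headers)),
        encode(caption),
        encode(section_title),
        encode(paragraph),
    ]
    return np.stack(steps)  # shape: (6, DIM)
```

Each contextual field contributes one step of the input sequence, so the LSTM can weigh headers, captions, titles, and paragraph text separately.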

To test our method we created a new benchmark, much larger than those of previous work, from a March 2019 dump of Wikipedia. We annotated the tables with 28 relations from Freebase, amounting to a superset of the relations used in various previous works. We first attempted to annotate the tables using distant supervision (à la Mintz et al.), but that resulted in too much noise, e.g., when multiple relations hold between the same pair of entities. We trained a Naive Bayes classifier to filter out the noisy annotations, and we also labeled data using hand-crafted SPARQL queries. We make our data available for others to use.
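Distant supervision of this kind boils down to asking the knowledge graph which relations it already records between a pair of entities and treating those as candidate labels. A hedged sketch of such a query builder, with illustrative IRIs and a generic query shape rather than our exact annotation queries:

```python
def relation_candidates_query(subj_iri: str, obj_iri: str) -> str:
    """Build a SPARQL query listing every relation the KG records between
    two entities; distant supervision then treats those relations as
    candidate labels for the table row mentioning the pair.
    (Illustrative only: real Freebase identifiers are MIDs.)"""
    return (
        "SELECT ?rel WHERE { "
        f"<{subj_iri}> ?rel <{obj_iri}> . "
        "}"
    )
```

When such a query returns more than one relation for a pair, the annotation is ambiguous, which is exactly the noise our Naive Bayes filter targets.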

Results

We compare our neural method against a reasonable baseline that issues SPARQL queries over the knowledge graph and picks the most frequent relation among the pairs of entities in the same row. We ran that baseline on our test dataset.
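The baseline logic can be sketched in a few lines. Here `kg_lookup` is a hypothetical stand-in for a SPARQL endpoint (a dict mapping entity pairs to the relations the KG records between them); the real baseline issues queries instead.

```python
from collections import Counter

def most_frequent_relation(entity_pairs, kg_lookup):
    """Baseline sketch: gather every relation the KG holds between each
    same-row entity pair and predict the single most frequent one.
    `kg_lookup`: dict mapping (subject, object) pairs to relation lists
    (hypothetical helper standing in for SPARQL queries)."""
    counts = Counter()
    for subj, obj in entity_pairs:
        counts.update(kg_lookup.get((subj, obj), []))
    return counts.most_common(1)[0][0] if counts else None
```

Because it can only vote among relations already in the KG, this baseline fails on pairs the KG does not yet cover, which is precisely the KGA setting.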

                        Accuracy   F1
SPARQL-based baseline   0.15       0.27
Our method              0.92       0.95

For the sake of comparison, the numbers reported in the two most closely related previous works on this task are as follows: Muñoz et al. (WDM 2014) report an F1 of 0.78, and Cannaviccio et al. (WWW 2018) report an F1 of 0.74. Although those papers evaluate on different datasets of Wikipedia tables and relations, we think the comparison is meaningful.

We also performed an ablation study that shows which table features contribute the most:

                            Accuracy
Full method                 0.92
Full - table captions       0.91
Full - table headers        0.76
Full - section paragraphs   0.76
Full - section titles       0.72

As one can see, table captions were not very helpful, which upon inspection we attribute to their frequent absence (only 7% of the tables had captions), whereas section titles made the largest individual contribution (and, incidentally, were available for 95% of the tables).

Conclusion

Our paper described and evaluated a neural method for predicting relations between entities mentioned in Wikipedia tables. Our method far outperforms a baseline that queries the KG, and we report significantly higher accuracy than previous works, which used much smaller datasets. There are many avenues for future work. To begin with, we experimented with only one network architecture to investigate the efficacy of neural networks; future research could explore other architectures, or additional features such as entity types or cell values. Finally, we tested the method with 28 relations on Wikipedia tables, but it could easily be extended to more relations or even to properties.

References

  • E. Macdonald and D. Barbosa. Neural relation extraction on Wikipedia tables for augmenting knowledge graphs. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM), 2133–2136. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3340531.3412164.
  • F. Mesquita, M. Cannaviccio, J. Schmidek, P. Mirza, and D. Barbosa. KnowledgeNet: a benchmark dataset for knowledge base population. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 749–758. Hong Kong, China, November 2019. Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/D19-1069, doi:10.18653/v1/D19-1069.
  • M. Cannaviccio, D. Barbosa, and P. Merialdo. Towards annotating relational data on the web with language models. In Proceedings of The Web Conference 2018, 1307–1316. 2018. doi:10.1145/3178876.3186029.