The Internet Archive is a massive digital library of information (both born-digital and digitized) that can be of immense help to cultural and historical research. It is supported by a non-profit organization and many university libraries worldwide. The Archive has many parts. The largest is the Web collection, with close to 300B web pages. The Web collection can be searched with the Wayback Machine, which lets the user view different versions of a page over time. The next largest is the Texts collection, with over 12M freely downloadable books and texts, organized into thematic sub-collections by their contributors. The IA also holds 3.4M videos and 3.5M audio recordings, among other media.

Our goal with the PRIMA project is to develop and maintain a suite of Python tools that help field researchers, primarily digital humanists, download specific collections from the Archive and apply standard information retrieval and text analysis algorithms to the documents in those collections. PRIMA builds on the following Python libraries: internetarchive (used for retrieving collections from the Archive), gensim (for topic modelling), and NLTK (the Natural Language Toolkit).

Following the design of the internetarchive tool, PRIMA provides both command-line tools that let users download and manipulate the collections of their choice and source code packaged into libraries that can be used in user programs. The tools currently supported by PRIMA include topic modelling, clustering, similarity estimation via min-hashing, and retrieval with BM25.
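To give a feel for what similarity estimation via min-hashing involves, here is a minimal, standard-library-only sketch. It is not PRIMA's implementation: the function names (`shingles`, `minhash_signature`, `estimate_jaccard`), the use of word shingles, and the choice of 128 seeded BLAKE2b hashes are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the token set."""
    if not tokens:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox leaps over the lazy dog near the river bank"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(f"estimated similarity: {estimate_jaccard(sig_a, sig_b):.2f}")
```

The signatures are short, fixed-length lists, so documents can be compared without keeping their full shingle sets in memory. More hash functions give a tighter estimate at the cost of more computation.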

Why PRIMA? There are many open-source and commercial platforms for large-scale text analysis, such as GATE, Apache UIMA, and Apache OpenNLP. While extremely useful, these are complex software systems that are hard to deploy and use, especially for non-experts. They typically offer the whole gamut of tools and Application Programming Interfaces (APIs) for virtually every programming language, and thus carry long lists of dependencies that make them hard to install, update, and use. This complexity creates significant barriers for the many researchers who need only a handful of the tools these systems offer. Rapid prototyping for algorithm evaluation is also difficult in these systems.

PRIMA is meant for single-user environments, with as few dependencies and required packages as possible. PRIMA is intended to be lightweight and transparent, storing all data in folders that are visible to the user. PRIMA is also meant to integrate with and build on existing Python code for information retrieval and text mining, avoiding reinventing the wheel to the best of the authors' knowledge.

PRIMA is meant for researchers. To help others reproduce the analysis, PRIMA documents all algorithms applied to a collection in a log, inside a SQLite database, thus helping researchers keep track of their work.
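As a rough illustration of how such a log might work, the sketch below records each algorithm run in a SQLite table using Python's built-in `sqlite3` module. The table name `analysis_log`, its columns, and the helper functions are hypothetical and do not reflect PRIMA's actual schema or API.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_log(path):
    """Open (or create) a log database; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS analysis_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            collection TEXT NOT NULL,
            algorithm TEXT NOT NULL,
            parameters TEXT NOT NULL,
            run_at TEXT NOT NULL
        )
        """
    )
    conn.commit()
    return conn


def record_run(conn, collection, algorithm, parameters):
    """Append one entry describing an algorithm applied to a collection."""
    cur = conn.execute(
        "INSERT INTO analysis_log (collection, algorithm, parameters, run_at) "
        "VALUES (?, ?, ?, ?)",
        (
            collection,
            algorithm,
            json.dumps(parameters, sort_keys=True),  # store parameters as JSON text
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return cur.lastrowid


def history(conn, collection):
    """Return the runs recorded for a collection, oldest first."""
    rows = conn.execute(
        "SELECT algorithm, parameters, run_at FROM analysis_log "
        "WHERE collection = ? ORDER BY id",
        (collection,),
    ).fetchall()
    return [(alg, json.loads(params), run_at) for alg, params, run_at in rows]


if __name__ == "__main__":
    conn = open_log(":memory:")  # in-memory database, for demonstration only
    record_run(conn, "example_collection", "topic_modelling", {"num_topics": 20})
    record_run(conn, "example_collection", "bm25", {"k1": 1.5, "b": 0.75})
    for entry in history(conn, "example_collection"):
        print(entry)
```

Storing the parameters alongside the algorithm name means a later reader can see both what was run on a collection and how it was configured.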


PRIMA is developed by Erin Macdonald and Denilson Barbosa.