This page documents a sample of resources that may have publicly available text and data mining (TDM) capabilities. This list is not meant to be exhaustive.
Please note that even "publicly available" content may be copyrighted or licensed; copyright, terms and conditions, and licensing terms should always be consulted prior to starting any TDM project.
The Digital Public Library of America (DPLA) "is an all-digital library that aggregates metadata from libraries, museums and institutions around the country" (DPLA FAQ). DPLA contains both public domain materials and materials under rights restrictions.
However, the fact that an item is publicly viewable does not mean that text and data mining can be conducted on it. Be sure to check the rights statement for each item.
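As a minimal sketch of what checking rights statements can look like across many records at once, the Python below sorts metadata records into "likely cleared" and "needs review" piles. It assumes each record is a dictionary with a `rights` key holding a standardized rights URI (such as those published by RightsStatements.org or Creative Commons); that record shape, the sample records, and the allow-list are illustrative assumptions, not a description of DPLA's actual export format, and the result is not a legal determination.

```python
# Rights URIs that generally signal few reuse restrictions.
# ASSUMPTION: this allow-list is illustrative only; adjust it for your project.
PERMISSIVE_PREFIXES = (
    "http://rightsstatements.org/vocab/NoC-US/",  # No Copyright - United States
    "http://creativecommons.org/publicdomain/",   # CC0 and Public Domain Mark
)


def needs_rights_review(record):
    """Return True unless the record carries a rights URI on the allow-list.

    ASSUMPTION: the URI is stored as a string under the key "rights".
    Missing, empty, or non-string values are flagged for manual review.
    """
    uri = record.get("rights")
    if not isinstance(uri, str) or not uri.strip():
        return True
    normalized = uri.strip().replace("https://", "http://", 1)
    return not normalized.startswith(PERMISSIVE_PREFIXES)


def split_by_rights(records):
    """Split records into (likely_cleared, needs_review) lists."""
    cleared, review = [], []
    for record in records:
        (review if needs_rights_review(record) else cleared).append(record)
    return cleared, review


# Hypothetical records, for illustration only.
sample = [
    {"id": "a", "rights": "http://rightsstatements.org/vocab/NoC-US/1.0/"},
    {"id": "b", "rights": "http://rightsstatements.org/vocab/InC/1.0/"},
    {"id": "c"},  # no rights statement recorded
]
cleared, review = split_by_rights(sample)
print([r["id"] for r in cleared], [r["id"] for r in review])
```

The sketch is deliberately conservative: anything without a recognized statement lands in the review pile, where a person can read the rights statement and any applicable terms before the item is mined.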
HathiTrust is a collaborative of academic and research libraries and preserves 18+ million digitized items. HathiTrust offers "reading access to the fullest extent allowable by U.S. and international copyright law, text and data mining tools for the entire corpus, and other emerging services based on the combined collection" (Welcome to HathiTrust: Our Story).
For text and data mining policies, procedures, and tools, refer to HathiTrust Research Center Analytics.
Datasets are available from PubMed Central (PMC), but restrictions apply. Not all articles in PMC are available for text mining or other reuse, and the articles that are available must be retrieved through specific services. These tools are supported by the National Library of Medicine (NLM) and the National Center for Biotechnology Information (NCBI).
Refer to the For Developers and PMC Article Datasets pages for additional information.
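To illustrate why retrieval goes through dedicated services, the sketch below parses the kind of XML response such a service might return and keeps only the records that report a license. The element and attribute names (`record`, `license`, `link`, `href`, `error`) and the sample response are assumptions made for illustration; confirm the real response format on the For Developers page before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical response, for illustration only. The structure is an
# ASSUMPTION and should be checked against the PMC developer documentation.
SAMPLE_RESPONSE = """<OA>
  <records returned-count="1" total-count="1">
    <record id="PMC0000000" citation="Hypothetical citation" license="CC BY">
      <link format="tgz" href="ftp://example.org/hypothetical.tar.gz"/>
    </record>
  </records>
</OA>"""


def list_licensed_records(xml_text):
    """Return id, license, and download links for records that report a license.

    If the response contains an <error> element (for example, because the
    requested article is not available for reuse), return an empty list.
    """
    root = ET.fromstring(xml_text)
    if root.find("error") is not None:
        return []
    results = []
    for record in root.iter("record"):
        license_name = record.get("license")
        if not license_name:
            continue  # no license reported: leave it out of the mining set
        results.append({
            "id": record.get("id"),
            "license": license_name,
            "links": [link.get("href") for link in record.findall("link")],
        })
    return results


for item in list_licensed_records(SAMPLE_RESPONSE):
    print(item["id"], item["license"], item["links"])
```

Recording the license alongside each identifier makes it easier to document later which terms applied to each article in the corpus.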
Semantic Scholar "provides free, AI-driven search and discovery tools, and open resources for the global research community" and indexes 200+ million academic papers (About Semantic Scholar).
The Semantic Scholar Open Research Corpus (S2ORC - link resolves to GitHub site) provides a general-purpose corpus of open access papers that can be used for natural language processing (NLP) and text mining.
Please see below for other compiled lists of potential TDM resources that you can use to explore more open-access text mining datasets:
This guide is intended for informational purposes only and does not constitute legal advice.
For more in-depth legal information, contact the Office of University Counsel.