PubMed Central Article Datasets Available in the Cloud via ODP

The National Library of Medicine (NLM) recently announced that two PubMed Central article datasets are openly available through the Open Data Sponsorship Program (ODP). This benefits researchers who use text mining or other types of secondary analysis.


PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature from the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM). For nearly two decades, NLM has supported the retrieval and download of machine-readable open access journal articles through the PMC Open Archives Initiative (PMC-OAI) and FTP (file transfer protocol). To enhance access, these datasets are now also available on the Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). Benefits of working with the datasets in the cloud include access to uncompressed individual full-text article files in XML and plain text, as well as faster download and transfer speeds.


The PMC article datasets hosted on AWS include:

  • The PMC Open Access (OA) Subset includes all articles and preprints in PMC with a machine-readable Creative Commons license that allows reuse.
  • The Author Manuscript Dataset includes accepted author manuscripts collected in PMC under a funder policy and made available in machine-readable formats for text mining.

Full details of the datasets are available on the PMC Article Datasets page. Getting-started documentation for using the datasets is available via AWS. Direct questions or concerns regarding the datasets to pubmedcentral@ncbi.nlm.nih.gov.